High memory usage in F40 on RPi 4, unable to find which process is using it

Problem

After upgrading to Fedora 40 on my Raspberry Pi 4, memory is almost fully consumed by a ghost process, possibly the kernel.

uptime:

 15:10:25 up 1 day, 17:23,  6 users,  load average: 0.08, 0.19, 0.43

uname -a output:

Linux potato 6.8.7-300.fc40.aarch64 #1 SMP PREEMPT_DYNAMIC Wed Apr 17 19:53:21 UTC 2024 aarch64 GNU/Linux

free -mh output:

               total        used        free      shared  buff/cache   available
Mem:           7.5Gi       6.3Gi       599Mi       7.2Mi       947Mi       1.3Gi
Swap:          4.0Gi       145Mi       3.9Gi

htop screenshot:


top -o RES output:

top - 15:07:30 up 1 day, 17:20,  6 users,  load average: 0.28, 0.27, 0.50
Tasks: 245 total,   1 running, 244 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.5 us,  1.0 sy,  0.0 ni, 97.2 id,  0.0 wa,  0.2 hi,  0.1 si,  0.0 st 
MiB Mem :   7726.9 total,    590.8 free,   6450.6 used,    948.9 buff/cache     
MiB Swap:   4096.0 total,   3950.2 free,    145.8 used.   1276.3 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                
   6204 wolf      20   0 4977832  47028  27120 S   0.3   0.6  14:43.83 podman                                                                                                                                 
    944 root      20   0  597712  42088  14308 S   0.0   0.5   0:03.93 firewalld                                                                                                                              
   1003 root      20   0 1315724  34440  16320 S   1.0   0.4  38:52.97 tailscaled                                                                                                                             
    626 root      20   0  103032  31324  30608 S   0.0   0.4   1:40.01 systemd-journal                                                                                                                        
   1723 root      20   0 2039764  29352  11856 S   2.0   0.4   1:16.75 dockerd                                                                                                                                
 252182 wolf      20   0  867524  27132   5380 S   0.0   0.3   0:15.59 cockpit-bridge                                                                                                                         
   1758 root      20   0 1259936  23904  10588 S   1.3   0.3  17:06.29 cloudflared
... truncated

This is very weird because I didn’t face the same problem before I upgraded to F40. F39 worked just fine and was able to run continuously for a month without rebooting.

Cause

Unknown; a workaround is available.

Related Issues

Bugzilla report: #2275290

Workarounds

See the solution below: disabling HDMI hotplug in config.txt.


You have observations, but I am missing the problem statement.
Being different is not necessarily a problem.
Is there something that will not run or has broken?

After roughly two days the whole system just fills up with memory garbage and then freezes.

Thanks for explaining.
I would start by running top in batch mode, sampling every 60 s into a file.
Let the system run for a while so that the leak should be easy to see in the top output. What looks like the cause?
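
Something like this would do; the log path and the 24-hour sample count are just examples:

# one snapshot per minute, sorted by resident memory, for 24 hours
top -b -o RES -d 60 -n 1440 >> ~/top-history.log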


I did run top when diagnosing this issue, but the problem is that I cannot find which process is using that much memory! I attached a screenshot and the top output above; you can check them out.

You need a history of process and system memory size information, which is why I suggested collecting top output every 60 seconds.

With that information it should be possible to figure out what is changing over time.

Unfortunately, I don’t have a memory usage history, but memory wasn’t fully used before. I use that RPi as a server with some Podman containers running on it; the screenshot above was taken after stopping all of them.
Since I containerize everything, stopping all containers means stopping everything running on that server; all I am left with is system processes.

Even without collecting top output every 60 seconds, the output after system memory is fully used still doesn’t give any clue. By the way, the top output is sorted from high to low by memory usage.
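
Stopping everything boils down to something like this:

podman stop --all    # stop every running container on the host
free -h              # memory still shows up as "used" with only system processes left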

After digging through some threads on the internet, I checked /proc/meminfo, and its output is strange: Slab and SUnreclaim are very high (around 4 GiB).
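
The relevant fields can be pulled straight out of /proc/meminfo, for example:

grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo    # Slab and SUnreclaim were around 4 GiB here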

Unless you plan to turn off the system, you can create such a history from now until the system fails.

To track this I would collect samples of /proc/meminfo every 60 seconds, as well as the top output.
It should be easy to watch slab usage climb and confirm that you are on to the problem.
Next will be the issue of finding out why slab is so heavily used.
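
Something along these lines, run as root so slabtop can read /proc/slabinfo (the interval and file names are just examples):

while true; do
  date >> meminfo.log
  grep -E 'MemFree|Slab|SUnreclaim' /proc/meminfo >> meminfo.log
  slabtop -o -s c | head -n 20 >> slab.log    # largest slab caches first
  sleep 60
done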

After digging through the whole internet for a solution, I came across a thread suggesting the display might be the issue: when display output is active but no display is plugged in, the memory leak just appears out of nowhere.

I fixed my leak by disabling HDMI hotplug in config.txt; everything seems to be working as intended right now.
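
If anyone wants to try the same thing, the relevant config.txt option should be the firmware’s hdmi_force_hotplug setting (the file usually lives at /boot/efi/config.txt on Fedora); treat this as a rough sketch rather than the exact change:

# /boot/efi/config.txt (path may differ); assumes hdmi_force_hotplug is the
# "HDMI hotplug" setting referred to above
hdmi_force_hotplug=0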


I’m not convinced the “solution” is actually The Solution™. I have a Raspberry Pi 3B+ that I use basically for testing stuff. It has an F40 minimal install. I usually run it headless, run updates on it daily, and do practically nothing else.

A week or so before I upgraded from F39 to F40, I started finding it unresponsive when it came time for morning updates. It being the least important of my fleet, I didn’t attempt to track down the issue until the F40 upgrade died after having installed all the F40 rpms but before removing the F39 rpms. I eventually re-installed, in batches, all the F40 rpms, which had the nice side effect of cleaning up the F39 ones. So things should be good, right?

Wrong. It still runs about 4 to 6 hours before it runs out of RAM and the OOM killer kicks in. I’ve been testing with various kernels and logging RAM with timestamps (see below), hoping to find a clue. In fact, as I type (on another box), I’m currently running kernel 6.8.6-200.fc39 in single-user mode. That’s the oldest kernel I still have, but it makes no difference. I’ve stopped every process I can and still have a functioning system. In user space, nothing is left but systemd, systemd-udevd, systemd-sulogin-shell rescue, sulogin, and bash, and under bash I’m running this shell script:

#!/bin/bash
# Append a timestamped "Mem:" line from free every 10 seconds.
# The log file defaults to free.log; pass a path as the first argument to override.
while true ; do
  (
    echo -n "$(date) "
    free -v -w | grep ^Mem
  ) | tee -a "${1:-free.log}"
  sleep 10
done

That’s it! There are no other user-space processes. And in about 3½ hours it’s going to die an ugly death due to all RAM being exhausted.

I’ve got to think it’s either a kernel bug, a systemd bug, or a bash bug, b/c there’s nothing else running!

I’ve got a monitor attached, but I get the same results regardless. Maybe I can play with disabling HDMI hotplug in config.txt to see if that makes a difference. If it does, that makes it a kernel bug, right?

Running an RPi 4 headless here, and no leaks are evident.
It is running the Server edition and is configured to be a router.

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           3.7Gi       362Mi       3.2Gi       1.2Mi       297Mi       3.4Gi
Swap:          3.7Gi          0B       3.7Gi

$ uptime
 22:01:40 up 4 days,  9:48,  1 user,  load average: 0.00, 0.00, 0.00

Which kernel, @barryascott?

$ uname -r
6.8.7-300.fc40.aarch64

Is it possible that only devices that upgraded from F39 to F40 have this issue?

It does indeed look like a kernel bug. Can you try unplugging your display and disabling HDMI hotplug? It’s not a real solution for this issue, just a workaround that works for me.

This is an upgraded system; I did the F39 to F40 upgrade using the dnf system-upgrade method.

What I meant before is that the leak happens when there is an output active but no display plugged in, which isn’t your case here. You can try booting the Pi with the display plugged in and unplugging it after it boots.

I’ll run that experiment. I’m grabbing /proc/meminfo every 10 mins to watch for a leak.

Also, I had to configure nomodeset on the kernel command line.
Not sure if that is still needed.
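
If it is, one way to set it on Fedora is via grubby, e.g.:

sudo grubby --update-kernel=ALL --args="nomodeset"
# and to take it back out later:
sudo grubby --update-kernel=ALL --remove-args="nomodeset"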

I started seeing this behavior a week or two before upgrading F39->F40.