F32 grinds to a halt, slowly, usually when away, every 3-4 days

Since just before the upgrade to F32, my system has been behaving oddly. It doesn’t just hang or crash, but after I’ve been using it for several days, and usually while I’m away, which is less than 50% of the time these days, parts of the system start to disappear.

The first thing I usually notice is that BOINC isn’t running, although it isn’t always the first thing to go. The next thing to fail is the lock screen. Sometimes the monitor will not wake; when it does, I can interact with the screen, but unlocking fails. Then I try to log in via SSH. Sometimes I get no reply at all; other times I can connect, or it authenticates but hangs starting the shell. Many subsystems seem to keep functioning, though: no disk errors are reported after a reset, and sometimes I can initiate a shutdown by pressing the power button, although that has never finished successfully. SysRq sometimes shows a message on the console saying SysRq is disabled.
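A "disabled" message from SysRq usually means the key combination reached the kernel but the requested function is masked off by the `kernel.sysrq` setting. As a rough sketch, the bitmask in `/proc/sys/kernel/sysrq` can be decoded like this. The bit meanings are taken from the kernel's sysrq documentation; the function name `decode_sysrq` is just an illustrative choice.

```python
# Bit meanings as listed in the kernel's sysrq documentation.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def decode_sysrq(value: int) -> list:
    """Return the SysRq function groups enabled by a kernel.sysrq value."""
    if value == 0:   # 0 disables SysRq completely
        return []
    if value == 1:   # 1 enables every function
        return list(SYSRQ_BITS.values())
    # Any other value is a bitmask of allowed function groups.
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


# Example: a mask of 16 only allows the sync command.
print(decode_sysrq(16))
```

To check a live system, pass in `int(open("/proc/sys/kernel/sysrq").read())`. If the debugging dumps (bit 8) are not enabled, that would explain why the dump requests are refused during a hang.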

So, it’s just some things that fail, not the whole system. And what is really weird is that there are no errors recorded. That would make sense if it was a crash or reset, but it’s not. Some other messages, like CRON, are recorded, at least for a while, but the journal seems to be one of the first casualties, too.

I’ve tried replacing all the memory and the GPU. I have also been upgrading kernels, trying testing kernels and even a copr with newer kernels from upstream. The problem persists. It seems a little like hardware, but also not. For one thing, my hardware is pretty new. Also, the hardware is no more stressed when the problem occurs than at any other time; it runs fine for several days. That’s unusual for a hardware fault, which is not to say that I haven’t seen things like that before.

I ran memtester for several loops, but I haven’t tried memtest86. Someone mentioned the memory controller, and I have all new memory modules, so it is not a particular spot of memory that is bad. Do you think memtester was a good test or should I still try memtest86?

Here is my boot log: https://paste.centos.org/view/afee842c

One thing we can eliminate is the system upgrade, because the problem started before it. My first thought is that something is either running you out of RAM or eating up all of your disk space. As a start, please run these two commands in a terminal and post the results here:

free -h
df

This will give me something to work on and maybe find a solution for you.

Thanks for your help!

Agreed, F32 is probably not a significant factor. I upgraded hoping it would help, but it didn’t.

Well, let’s take the last one first. It is not disk space, because there is a lot of it free. Also, I don’t have many processes that produce unlimited logs. It’s not like the journal will ever fill /var; it doesn’t work like that. I’m not sure how storage would fit into these symptoms, other than the lack of evidence in the logs. That’s an interesting idea, but I don’t see any evidence for it.

$ df -h
df: /run/user/13013/doc: Operation not permitted
Filesystem                Size  Used Avail Use% Mounted on
devtmpfs                   16G     0   16G   0% /dev
tmpfs                      16G  316M   16G   2% /dev/shm
tmpfs                      16G  2.4M   16G   1% /run
/dev/mapper/vgNew-LVroot   50G   30G   18G  62% /
tmpfs                      16G   11M   16G   1% /tmp
/dev/sde2                 526M  263M  235M  53% /boot
/dev/sde3                 549M  8.8M  541M   2% /boot/efi
/dev/mapper/vgNew-LVhome  812G  646G  166G  80% /home
/dev/mapper/vgNew-LVvar    40G   12G   27G  32% /var
tmpfs                     3.2G   22M  3.2G   1% /run/user/xxxx

Yes, I agree, memory exhaustion sounds like a good candidate, but there are several reasons why it’s not. First of all, I doubled my memory and that didn’t increase the MTBF. Secondly, I’m not using all of virtual memory when this happens. Before I upgraded memory, about 15% of swap was in use not long before I noticed the problem. And when I have managed to get onto the system, swap is not full. More importantly, we have OOM now and it’s not killing any programs, so, unless that mechanism is totally broken, we can rule out any classic type of memory shortage.

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          32119       12445        4910         510       14763       18653
Swap:         14985          15       14969
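
For reference, the figures above can be turned into percentages with a small sketch like the following. It assumes numeric `free -m` style output (not `free -h`, whose unit suffixes would need extra parsing); `parse_free` and `percent` are illustrative helper names.

```python
def parse_free(output: str) -> dict:
    """Parse numeric `free` output into {"Mem": {...}, "Swap": {...}}."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    header = lines[0].split()
    table = {}
    for line in lines[1:]:
        label, *values = line.split()
        # The Swap row only has total/used/free; zip stops at the shorter list.
        table[label.rstrip(":")] = dict(zip(header, map(int, values)))
    return table


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole, or 0.0 if whole is zero."""
    return 100.0 * part / whole if whole else 0.0


# The output posted above, copied verbatim.
SAMPLE = """\
              total        used        free      shared  buff/cache   available
Mem:          32119       12445        4910         510       14763       18653
Swap:         14985          15       14969
"""

stats = parse_free(SAMPLE)
swap_used = percent(stats["Swap"]["used"], stats["Swap"]["total"])
mem_available = percent(stats["Mem"]["available"], stats["Mem"]["total"])
print(f"swap used: {swap_used:.1f}%, memory available: {mem_available:.1f}%")
```

On this snapshot swap is essentially untouched and well over half of RAM is still counted as available, which matches the point that a classic memory shortage does not fit. It is only a single sample taken while the system was healthy, though.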

On the other hand, programs are very bloated these days, and I’ve read a lot recently about some long-standing problems with the VMM. The impression I got was that further improvements to the VMM would be difficult and there isn’t much interest in making them. For example, because of bloat, the kernel must deal with massively over-committed memory on every system, so options are limited. There are tuning parameters I can tweak to clamp down on over-commits, but then what? I cannot run FF. How do I live with that?

But, maybe it’s not total free memory that is running out. I wonder if it is some other, internal kernel resource that isn’t so easy to interrogate with free or vmstat. I saw some tuning recommendations for vm.min_free_kbytes, but the default in 5.7, today, already seems to be right sized. I could increase it, but I don’t really have a good reason to think that will help.

I’ve also been watching the slabinfo lately. It appears that the total number of slab objects, as well as the amount of unreclaimed slab memory, grows over time and without bound. Both grew quickly immediately after boot, and now they are growing more slowly. But what else would I expect for a running system, right?
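
Since the journal seems to stop recording before the hang, one option is to log the slab counters to a separate file and look at the trend after the next reset. The following is only a sketch under that assumption: it reads the `Slab`, `SReclaimable`, `SUnreclaim` and `MemAvailable` fields from `/proc/meminfo`, and the function names, log path and interval are all placeholders.

```python
import os
import time
from pathlib import Path

FIELDS = ("MemAvailable", "Slab", "SReclaimable", "SUnreclaim")


def parse_meminfo(text: str) -> dict:
    """Return {field: kB} for the fields of interest in /proc/meminfo text."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in FIELDS:
            values[name] = int(rest.split()[0])
    return values


def log_samples(logfile: str, interval_s: float, count: int,
                source: str = "/proc/meminfo") -> None:
    """Append `count` timestamped samples to `logfile`, `interval_s` apart."""
    with open(logfile, "a") as out:
        for i in range(count):
            sample = parse_meminfo(Path(source).read_text())
            row = " ".join(f"{k}={sample.get(k, -1)}" for k in FIELDS)
            out.write(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {row}\n")
            # Push each sample to disk so it survives a hard reset.
            out.flush()
            os.fsync(out.fileno())
            if i + 1 < count:
                time.sleep(interval_s)
```

For example, `log_samples("/var/tmp/slab.log", 300, 2000)` would take a sample every five minutes for roughly a week. If `SUnreclaim` keeps climbing right up to the last entry, that points toward a kernel-side leak; if it has flattened out well before the hang, slab growth is probably a red herring. If writes to disk themselves stall, the log will simply stop, which is also informative.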

Ultimately, my troubleshooting is hampered by the lack of relevant errors. Whether a process crashes or is killed, there should be some evidence of it. This suggests that the processes that stop running have not crashed or been killed. So they are just not being scheduled to run. Why? If they cannot get enough new memory allocated, that should cause an error. If they are swapped out and cannot get into RAM to be scheduled, I don’t know what I would see; that’s what makes me suspicious of the VMM and too much used swap. But, since doubling memory didn’t prevent it, and it occurs with little or no swap in use, that doesn’t fit either.
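
One way to test the "not scheduled" theory, if a shell can still be reached during a partial hang, is to look for processes stuck in uninterruptible sleep (state `D` in `/proc/<pid>/stat`), which typically means they are waiting on I/O or a kernel lock. This is a minimal sketch; `parse_stat` and `blocked_processes` are illustrative names, and whether it can run at all in the failed state is an open question.

```python
from pathlib import Path


def parse_stat(line: str) -> tuple:
    """Return (pid, comm, state) from one /proc/<pid>/stat line.

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the LAST ')' rather than on whitespace.
    """
    head, _, tail = line.rpartition(")")
    pid, _, comm = head.partition("(")
    return int(pid), comm, tail.split()[0]


def blocked_processes(proc_root: str = "/proc") -> list:
    """List (pid, comm) for processes in state 'D' (uninterruptible sleep)."""
    blocked = []
    for stat in Path(proc_root).glob("[0-9]*/stat"):
        try:
            pid, comm, state = parse_stat(stat.read_text())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)
```

A handful of `D` entries that come and go is normal. A growing list that never clears, especially one that includes the journal daemon or BOINC, would suggest the processes are blocked inside the kernel instead of crashed, which would also explain why nothing gets logged.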

I agree with you that memory and disk space can be eliminated, but that’s why I asked about them. They’re both easy to check, and if we’re lucky, we have an easy fix. I’m long retired and can’t advise you about slab issues, as I’ve not run across them before. If it were me, I’d be researching slab and finding out why there are so many unused objects that can’t be deallocated. I understand that, in part, this is what slab is for. But if the memory could never be deallocated, sooner or later every system would run out of memory, and a design like that would never have been accepted. So there’s probably something wrong on your system. Good luck, keep us posted, and eventually we’ll all learn something new.
