Random deadlock freezes. SSD?

Good time of the day, Community.

I wanted to ask for some advice: for several years I have been facing an issue with laptops going into a full deadlock, with not even the cursor moving.

I would rule out a specific distro (on laptops it has so far occurred with Ubuntu and, recently, Fedora).

The same goes for a specific laptop model. I think this is my 3rd laptop to have such issues.

One thing I would point out, though (more on WHY I single it out further down), is that the system has only one storage drive, and it is an SSD.

My Linux knowledge isn’t yet at the point where I know how to read the journal for errors. The last few times I looked (with Google’s help), I didn’t find any record that would lead me to a cause.
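For reading the journal after a freeze, a minimal sketch is below. It assumes the journal is stored persistently, so that the boot that ended in the freeze is still available after the forced power-off. The function names are made up for illustration; the journalctl options themselves (--boot, --priority, --lines, --no-pager) are standard ones.

```python
import subprocess


def journal_errors_cmd(boot_offset=-1, priority="err"):
    """Build a journalctl command that lists messages of `priority` or worse
    for one boot (0 = current boot, -1 = the boot before it)."""
    return [
        "journalctl",
        "--no-pager",
        f"--boot={boot_offset}",
        f"--priority={priority}",
    ]


def tail_of_boot(boot_offset=-1, lines=50):
    """Return the last `lines` journal lines of a boot, i.e. whatever was
    logged right before the freeze and the hard power-off."""
    cmd = ["journalctl", "--no-pager", f"--boot={boot_offset}", f"--lines={lines}"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```

After a freeze and reboot, printing tail_of_boot() shows the last thing the system managed to log. If the log simply stops with nothing unusual, that in itself hints that the kernel never got the chance to write anything.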

On Wednesday, 3 days ago, I decided to move to Fedora 37 Workstation. I was moving from Ubuntu 22 LTS (an update from 20 LTS). That one originally had this problem too, but it seemed to have “gone away” at some point after I added a file-based swapfile (I remember it happening once after that, but not at the previous rate of “at least once a week”).

I can’t really tie it to a specific action. Today it happened while I had a few apps open (IntelliJ, VPN, Postman, Sublime, DBeaver, Firefox… and a Docker service running). I opened a new app, and at that moment a Firefox notification from Google Calendar appeared.

I have a GNOME top bar extension that shows CPU, memory and network load. The CPU was at 0.6, memory at 74% and the network idle.

Before that, on Wednesday, I think I had a handful of similar freezes. Come to think of it, they may be related to opening applications. But on Ubuntu I do remember it sometimes dying on me when I was trying to unlock my desktop from the login screen.

I can tell that all systems (even Fedora) were installed on ext4 with separate /boot and /boot/efi partitions.

My current laptop is Asus Vivobook X512J

Now for the SSD part.
In my history, I think every SSD I have owned (Kingston, WD) had a filesystem problem.

This started with my first “good hardware” build, back in the days of the i7-7700K. I bought my first SSD then. I didn’t cheap out and bought a Kingston, thinking that it had decent quality.

That build had one problem: I overlooked that my mobo arrived with bent pins (and since I didn’t notice it right away, I couldn’t return the mobo or buy a new one. First job and yada yada).

But the problem was 100% the same. I would turn on the PC when I came home from work and launch a browser, music and a Korean MMORPG I was playing at the time (which, as it turned out later, had a rockfall of temp files constantly being written to the drive). The drive hosted both the OS and the games.

Eventually I bought a new mobo, but the problem remained. Then I thought that the bent pins had damaged the CPU, so later on I bought a new CPU… but the problem remained.

… several #$@#!# later …

After replacing absolutely everything besides the SSD (SSDs don’t fail, don’t be silly!!!) and the case, I thought I was cursed and literally had zero ideas, until a friend of mine said “hm, have you tried the SSD?”.

And, of course, it was the SSD. I even managed to reproduce the issue by removing 1 random stick of RAM, loading memory up to 100% (that game + Chrome did well with that), and then opening a few more tabs so that things would go to the pagefile. A few clicks and “BINGO”.

I consulted with Kingston and they said it was a filesystem problem. A zero format of the whole thing did the trick. I believe I did it once more a few years later, and to this day my OS is on that drive.

Then, around 2018, I bought a 1TB WD SSD (the system also has 2 M.2 drives, and they work like a charm). It didn’t even take me a day to say “AHA!” after installing that drive and getting a 100% familiar freeze. And a zero format solved the issue once again.

(I believe I have even recommended this solution to people with similar problems and every time got a “Solved it!”.)

So today, after once more witnessing this problem on my laptop, I recalled “that thing with SSDs”.


Thank you in advance for any advice.

P.S. I will consider doing a zero format for the whole drive (maybe even going btrfs), but this is my daily working system. Plus it’s fully configured and polished and it pains me to do a reinstall…

Memory at 74% seems to imply the laptop may not have adequate RAM. What is the output of free when the system is running normally?

Also, the fact that having a swapfile helped earlier also indicates a potential shortage of RAM.

Fedora uses up to 50% of your RAM as virtual swap (default 8GB), which reduces the total available for system usage as well.
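For comparing memory and swap use between a normal session and the moments before a freeze, here is a rough sketch that reads the same figures free reports, taken from /proc/meminfo. The function names are hypothetical, and "used" is approximated here as MemTotal minus MemAvailable, which is an assumption about what matters for this question.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(info):
    """Return (percent of RAM in use, swap used in kB)."""
    used_pct = 100.0 * (info["MemTotal"] - info["MemAvailable"]) / info["MemTotal"]
    swap_used_kb = info["SwapTotal"] - info["SwapFree"]
    return round(used_pct, 1), swap_used_kb


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live values from the running system."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Calling memory_summary(read_meminfo()) once under normal load and once with everything open gives two numbers that are easy to post here.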

One thing that can happen specifically with ext4 is lock-ups under a lot of write IO. To speed things up, writes first go to a memory buffer, which is faster than the storage, and are then written out to storage, so that disk IO doesn’t otherwise slow down your machine. If that buffer gets maxed out, the system suddenly has to stop and write a whole bunch of stuff to the disk at once, which to the end user feels like a lock-up, a skip, or some other performance degradation. Your memory usage also seems to support that this is what is going on.

btrfs (the default for Fedora) doesn’t currently have this issue since support for asynchronous buffered writes is very, very new. To be clear, under most normal load, this memory buffer for ext4 generally improves system performance and responsiveness, but “there’s no free lunch.”

You have some options to prevent this:

  • Get a faster SSD
  • Add more memory
  • Use btrfs instead of ext4 for your root/data file systems.
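To check whether the write buffer described above is really filling up, one option is to watch the Dirty and Writeback counters in /proc/meminfo next to the kernel's writeback tunables. This is only a sketch: the helper names are invented, and comparing the counters against dirty_ratio by eye is a rough heuristic, not an exact model of when the kernel starts blocking writers.

```python
import re


def dirty_and_writeback_kb(meminfo_text):
    """Return (Dirty, Writeback) in kB from /proc/meminfo-style text."""
    found = dict(
        re.findall(r"^(Dirty|Writeback):\s+(\d+)\s+kB", meminfo_text, re.M)
    )
    return int(found.get("Dirty", 0)), int(found.get("Writeback", 0))


def read_vm_setting(name, base="/proc/sys/vm"):
    """Read one integer tunable, e.g. dirty_ratio or dirty_background_ratio."""
    with open(f"{base}/{name}") as handle:
        return int(handle.read().strip())


def snapshot(path="/proc/meminfo"):
    """Collect the current counters and the two ratio tunables."""
    with open(path) as handle:
        dirty, writeback = dirty_and_writeback_kb(handle.read())
    return {
        "dirty_kb": dirty,
        "writeback_kb": writeback,
        "dirty_ratio": read_vm_setting("dirty_ratio"),
        "dirty_background_ratio": read_vm_setting("dirty_background_ratio"),
    }
```

If dirty_kb climbs into the gigabytes shortly before things stall, that would support the buffer theory. If it stays small, the cause is probably elsewhere.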

Isn’t swap intended for this kind of problem? I have 16 GB of RAM and 32 GB of swap.

It usually sits at about 7.8 GB of physical memory allocated. This is with all my main stuff running.

It could be. But that also brings up the SSD issue I was facing. The “laboratory way” of reproducing the freeze was to have some running process first absorb a portion of the actual RAM and then, after a while, when it went looking for new resources (say, a new Chrome tab after days of Chrome running), spill into the swap file. That use of the SSD as swap storage in parallel with normal usage triggered the issue (or at least that theory made it possible to reproduce the issue at a 99% rate).

(I need to log in to Ask Fedora from my laptop so that I don’t have to switch between devices, sorry.) The average usage is about 50-60%. The laptop has 16 GB of DDR4. Also, this issue has been stalking me across about 3 laptops over the years… I think all of them had only a few things in common: Intel CPU, SSD and DDR4.

It is a possibility. But I think I waited for as long as 5 minutes (I tried googling the keyboard shortcut to force a reboot without pulling out the laptop itself… it never worked, but I had never done that before). Also, the Num Lock LED did not respond, and Ctrl+Alt+F1/2/3/4 for console switching did nothing.

But I can recall (maybe it isn’t related) a few times with the first laptop, running Ubuntu, when I had drastic, unexplainable UI performance drops for which only a terminal reboot would be an answer. I tried going to a console (waiting an eternity for the login prompt) to run htop and got the impression that nothing was actually to blame.

Darn. And there I was afraid of using btrfs, thinking that it may have new issues, while ext4 seemed.

Sadly, this is a “customer’s laptop”, so the biggest thing I can do is install a distro of my choice. But I will think about going btrfs the moment the issues really get under my skin (you know the feeling when you have configured everything down to the smallest thing so that it literally shines… and now you need to start over).

One additional thing to ask (I am new to the subject, so, apologies): on a different forum I was advised to enable reboot on kernel panic, just to rule out a few possibilities. I did not find a proper Fedora page for doing so (again, I have a long road of Linux learning ahead). Can someone advise how to do it properly?

Swap uses disk IO, so swapping in this case actually contributes to the problem, since you now have disk-bound memory in addition to the write IO.

That is a massive swap. By default, Fedora uses zram so the swap is dynamic on demand. Even still, the old rule was somewhere between 25-50% of physical RAM supplemented by a swap partition, where you have 200%.

In general Linux speak, you should read that as 8GB of your physical RAM is a missed opportunity for being used for potential caching operations. If you’re not using zram, then the swappiness parameter in sysctl has an impact on how much swap is utilized. I think the default is 60. Again, as with btrfs, you’d probably have seen better performance here with the Fedora defaults using zram instead.
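To see which of these setups a machine is actually running, a small sketch like the one below can list the active swap areas from /proc/swaps and read the current swappiness. The function names are hypothetical, and treating any device whose name starts with /dev/zram as zram swap is an assumption of this sketch.

```python
def parse_swaps(text):
    """Parse /proc/swaps-style text into one dict per active swap area."""
    entries = []
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) != 5:
            continue
        name, kind, size_kb, used_kb, priority = fields
        entries.append(
            {
                "name": name,
                "type": kind,
                "size_kb": int(size_kb),
                "used_kb": int(used_kb),
                "priority": int(priority),
                "is_zram": name.startswith("/dev/zram"),
            }
        )
    return entries


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness value."""
    with open(path) as handle:
        return int(handle.read().strip())


def read_swaps(path="/proc/swaps"):
    """Parse the live swap table of the running system."""
    with open(path) as handle:
        return parse_swaps(handle.read())
```

A large swap file on the SSD with no zram entry would match the custom setup discussed above. A /dev/zram0 entry means the Fedora default is in place.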

SSDs aren’t great at being swap devices because, while generally faster than HDDs, this can cause them to wear prematurely. This isn’t a good real-world test for any use-case I can imagine either, but it seems like an effective way to cause disk-bound IO while wearing out an SSD.

Fun fact: btrfs and ext4 came out within about a year of each other and had many of the same devs working on both. Btrfs isn’t new or scary. It’s the successor to the old reiserfs and was intended by ext4 devs to also be the ext4 successor, where ext4 itself was a stop-gap from ext3. Nothing wrong with using ext4, but btrfs has been the default FS in Fedora for a good while now for good reason.

All the more reason to be careful about not bricking their SSD card with these swap tests.

This is probably either not possible or not a good idea. A full-on kernel panic will take down the OS, so there’s no way for the dead kernel to reboot itself. “It’s dead, Jim.” There are also a number of driver issues, such as with Realtek wifi drivers, that can cause kernel dumps from which the system recovers. The end user will notice those as wifi dropping, and rebooting over that would be far more disruptive.

Yeah, but at the time of partitioning, I sincerely hoped that I would open a page of “finally something that works like a charm” (and not a “eeeghhh, here we go again”).

Was it 25-50%? I always remembered it (at least from the 2000s… and with Windows) as 200%: 100% in order for hibernate to function, and the other 100%… I don’t even remember.

I also remember it as such.

Well. You live, you learn :)
I am still getting that “WOW!” experience after Ubuntu/openSUSE/other non-Arch distros I’ve tried, in terms of speed (and stability, and clearly the “polished finished product”).

If I continue to have those freezes, going btrfs will be at the top of the TODOs during the next install.

I know. SSDs have a limited number of write cycles. But at the same time… I’m looking at that Kingston SSD I mentioned. I’ve had it as my OS drive (with all the temp files and temp-related stuff constantly being downloaded there… Oh yeah, and the pagefile. Another “Oh yeah” is the fact that I stored heavy-IO games there, which wrote a bucketload of small files). Its health is currently 87%. If my memory doesn’t lie, I bought it in 2018 (or was it even 2016) or so, making it… old. So I would say that the problem is more “on paper” in terms of regular use.

Honestly, I wish I could move to Linux as my “home PC OS”. That would really make the learning deeper and more detailed. But, alas, Linux in terms of drivers, hardware-related software and even games still “isn’t there” (I recently gave dual boot a shot, just to realize that PipeWire fully trashed the onboard profiles of my Creative AE-5, which took me a lot of hassle to revert).

In addition to this, even being a contractor with customers always providing their hardware, in most cases I ended up with “No! We don’t know this OS! Ubuntu only!!!”, making things even worse.

So thanks for this info about btrfs :)

You would be surprised about the “new generation of developers” and how poorly they treat the hardware they are given. I remember when I was leaving my previous company, I brought the admins the laptop they had given me 2 years earlier. They were literally surprised that the laptop came back without even a scratch. Then they showed me a “regular returned laptop”, which was close to being a scrap pile, and said “this is what we usually get from other developers…”. I remember they even gifted me the good Logitech headset I had been given “for a time” :)

No, I found sysctl to be of use for this. For the time being, I set it for the current session only (without persisting it) just to see whether it will actually trigger or not. This would at least show that the OS still has some control over the system during such problems.
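For reference, the value of kernel.panic can be read back from /proc/sys/kernel/panic, and its meaning depends on the sign. The sketch below just spells that out; the function names are made up for illustration. Keep in mind that this setting only acts on a real kernel panic, so a hang that never becomes a panic will not trigger the reboot.

```python
def describe_panic_timeout(value):
    """Explain what a given kernel.panic value makes the kernel do."""
    if value > 0:
        return f"reboot {value} seconds after a panic"
    if value < 0:
        return "reboot immediately after a panic"
    return "no automatic reboot; the kernel waits indefinitely"


def read_panic_timeout(path="/proc/sys/kernel/panic"):
    """Return the kernel.panic value currently in effect."""
    with open(path) as handle:
        return int(handle.read().strip())
```

Running describe_panic_timeout(read_panic_timeout()) after a reboot is a quick way to confirm whether a session-only setting survived (it should not have).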

Could be, but currently I can’t see a way to test this theory. I want to see what fruit kernel.panic will bring to the table.

To be fair, you’re not using the Fedora defaults for partitioning. It is auto-magic for most users. You’ve chosen to use a custom configuration, which is perfectly fine, but that also means that configuration might not get as much testing or perform as well.

Windows NT kernel works very differently from the Linux kernel.

It’s definitely there for me. But we definitely welcome contributors who can help us test things on more hardware.

It’s one thing to do it on your own hardware and quite another to do it on someone else’s, especially a “customer”.

Sure there is: it gets dumped to the journal, which you can read with journalctl.

You mean “automatic”? I had a partition with recovery and project data, so it was out of the question :)

It’s not even the hardware. It’s the “manufacturer doesn’t create utilities for Linux” situation: Razer, Corsair, Roccat. For Corsair I did find a tool to configure bindings, but not for the mouse. Razer is also a 3-4 part puzzle, in which you need to find which tool will kill the RGB and which one gets you the custom binds for additional keys. Same for NZXT. I have an NZXT SmartDevice, left over from one of my PC cases. When I was going custom water loop I hooked up all the fans to it, and it works like a charm. In order to do the same on Linux… I need to google, try, fail, try again. Sadly, this isn’t even half of the things for which my home PC is currently locked to Windows…

On the one hand, I would be happy to help. On the other, the last time I tried my soundcard with PipeWire, I came pretty close to burning out both the speakers and the headset, neither of which is cheap.

My soundcard has a separate configuration switch for 30, 300 and 600 ohm headphones. I have a pair of $600-ish headphones, which work at the 30 ohm option. When I was trying dual boot, the profile was overwritten. Returning to Windows, I got a 100% 600 ohm setting with some broken codec, which literally produced a metallic screech from the headphones (in panic I switched to the speakers, which were originally used at 10% software volume…). I thought I had burned out both of them (or at least damaged them) at that moment.

More of a “when I was growing up, breaking something meant that I would be punished by my parents and would not get a replacement”, which clearly isn’t the case for this generation…

Thanks. Will keep an eye on that.

And I am back.
I have been working without a problem for… 2 weeks since my last post. Then I rebooted (installed the latest updates and set swappiness from 60 to 100).

I’m starting to get the impression that this is somehow related to applications installed in a “non-native” way: snap, flatpak and so on (or to some specific ones). After the reboot I started IntelliJ (non-flatpak), Firefox (flatpak), Slack and pritunl (VPN).

The moment I started Postman (flatpak), about a minute after the previous bunch, I got a freeze. I did have kernel.panic set to 30 (seconds, I believe). Verified that it didn’t budge.

Two strange things to note. First, Postman was the first thing to go into swap (I have an extension for GNOME that shows it). The “frozen screen” was showing 256 KB of swap utilized (before that I was monitoring that value and it was 0). Before (and after) the reboot, swap was used heavily and that didn’t cause any issues.

Second was the fact that after a minute or two of the freeze, I started hearing my laptop’s cooling kick in. My guess is that either the OS lost control and the fan went back to BIOS control, OR there was a loop in which “computation” was still going on.

In the end, I still had to resort to the 7-second power button press.

So the hunt continues…

Seems like a hardware problem. Start by checking the memory with memtest86 and memtest86+.
If that is OK, then check the drive. If it is an NVMe drive:
nvme smart-log /dev/nvme0
or whatever nvme it is.
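If you want to track those SMART values over time instead of reading them by eye, a sketch like this could pull out a few fields. It assumes nvme-cli is installed and that its JSON output (--output-format=json) uses the key names listed in WATCHED_KEYS, which may differ between nvme-cli versions; the function names are hypothetical.

```python
import json
import subprocess

# Assumed key names in the JSON output; adjust to what your nvme-cli prints.
WATCHED_KEYS = (
    "critical_warning",
    "media_errors",
    "num_err_log_entries",
    "percent_used",
)


def smart_log_cmd(device="/dev/nvme0"):
    """Build the smart-log command with JSON output for one device."""
    return ["nvme", "smart-log", device, "--output-format=json"]


def pick_health_fields(json_text, keys=WATCHED_KEYS):
    """Pick the watched fields out of the JSON text (None if a key is absent)."""
    data = json.loads(json_text)
    return {key: data.get(key) for key in keys}


def read_health(device="/dev/nvme0"):
    """Run nvme smart-log (usually needs root) and return the watched fields."""
    result = subprocess.run(
        smart_log_cmd(device), capture_output=True, text=True, check=True
    )
    return pick_health_fields(result.stdout)
```

Non-zero media errors or a growing error log count between two freezes would point at the drive rather than at the filesystem or memory.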

This is what I started from (I know that the first post is a no-brainer for a TL;DR): the SSD (this would most likely be the 3rd SSD where a zero format cured the PC). I have a strong feeling that the memory is OK, since this is not the first PC/laptop with an SSD showing 1:1 the same symptoms.

I’m currently hoping to find a different solution, since this is a working laptop on which I took pride in setting up my first Fedora. Setting everything up again is a headache.

Also, one thing that concerns me is that this seems to be strongly related to packages installed using “wine type” layers (I’m trying to remember the right wording for this… and failing): snap, flatpak, non-DNF/APT. I honestly can’t recall a case where this happened when a non-isolated app was starting. It always seems to be Slack or Postman. I know Firefox is also a flatpak now… but for now I’m not adding it to the list.