How to debug the cause of a system that silently died or froze

Fedora 41
Beelink desktop computer (not a laptop)

My system just silently died/froze; I’m trying to debug what happened.

I sat down at my desk (the system was not in sleep mode nor suspended nor hibernated). I moved the mouse (and I might have hit some key or keys). The screen suddenly went blank with a “no input detected” message. The power light was on.

I pressed keys on the keyboard as I would to wake the system from sleep, and I tried short presses of the power button. No response; the system seemed either frozen or crashed (not running?).

I had to do a long press of the power button (about 10 seconds) to force it off and reboot. Now I’m up again, but I’m trying to figure out what happened.

This has happened before. A few possibilities I can think of are:

  • Overheating of some chip or component
  • Hibernation error
  • Other fatal system error

I never set up hibernation, and I don’t know how to tell whether it’s even set up “out of the box” on Fedora 41. When I click on the Application Launcher I do see that “Hibernate” is an option and it is not grayed out.

If hibernation is indeed enabled, I suppose it’s possible that I accidentally hit some keyboard combination while moving the mouse that triggered hibernation. But then the question is how to resume from hibernation. I imagine that pressing the power button on the computer (not the long 10-second press that forces power off) would do it. I tried that and nothing happened; the system seemed to remain frozen.
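(From some searching, I believe these are reasonable places to check whether hibernation is even available and where the system would resume from, though I’m not certain this is the definitive test:)

$ cat /sys/power/state                    # sleep states the kernel offers; "disk" means hibernation is possible
$ cat /sys/power/disk                     # how a hibernate would be performed; the bracketed entry is the active mode
$ grep -o 'resume=[^ ]*' /proc/cmdline    # whether a resume device is configured on the kernel command line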

As I said, this has happened before: the system just dies silently and I have to reboot with a long 10-second press of the power button.

I’m trying to debug; suggestions welcome…!

I see a bunch of lines in the dmesg output that I think are relevant:

[    0.021611] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.021612] PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[    0.021612] PM: hibernation: Registered nosave memory: [mem 0x09a7f000-0x09ffffff]
[    0.021613] PM: hibernation: Registered nosave memory: [mem 0x0a200000-0x0a23bfff]
[    0.021614] PM: hibernation: Registered nosave memory: [mem 0x831cb000-0x831cbfff]
[    0.021614] PM: hibernation: Registered nosave memory: [mem 0x831ce000-0x831cefff]
[    0.021615] PM: hibernation: Registered nosave memory: [mem 0x8b184000-0x8b772fff]
[    0.021616] PM: hibernation: Registered nosave memory: [mem 0x8dd8f000-0x8dd8ffff]
[    0.021616] PM: hibernation: Registered nosave memory: [mem 0x8f0e4000-0x91c41fff]
[    0.021617] PM: hibernation: Registered nosave memory: [mem 0x91c42000-0x91cb2fff]
[    0.021617] PM: hibernation: Registered nosave memory: [mem 0x91cb3000-0x96d2cfff]
[    0.021617] PM: hibernation: Registered nosave memory: [mem 0x96d2d000-0x9affefff]
[    0.021618] PM: hibernation: Registered nosave memory: [mem 0x9bff9000-0x9bffcfff]
[    0.021618] PM: hibernation: Registered nosave memory: [mem 0x9bfff000-0x9cffffff]
[    0.021619] PM: hibernation: Registered nosave memory: [mem 0x9d000000-0x9d78ffff]
[    0.021619] PM: hibernation: Registered nosave memory: [mem 0x9d790000-0x9d7effff]
[    0.021619] PM: hibernation: Registered nosave memory: [mem 0x9d7f0000-0x9d7f4fff]
[    0.021619] PM: hibernation: Registered nosave memory: [mem 0x9d7f5000-0x9fffffff]
[    0.021620] PM: hibernation: Registered nosave memory: [mem 0xa0000000-0xfedfffff]
[    0.021620] PM: hibernation: Registered nosave memory: [mem 0xfee00000-0xfee00fff]
[    0.021620] PM: hibernation: Registered nosave memory: [mem 0xfee01000-0xffffffff]

I searched /var/log/messages for the same text string “Registered nosave” and see entries similar to the dmesg output above.
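(The search was essentially:)

$ sudo grep 'Registered nosave' /var/log/messages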

Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000fffff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x09a7f000-0x09ffffff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x0a200000-0x0a23bfff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x831cb000-0x831cbfff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x831ce000-0x831cefff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x8b184000-0x8b772fff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x8dd8f000-0x8dd8ffff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x8f0e4000-0x91c41fff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x91c42000-0x91cb2fff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x91cb3000-0x96d2cfff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x96d2d000-0x9affefff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x9bff9000-0x9bffcfff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x9bfff000-0x9cffffff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x9d000000-0x9d78ffff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x9d790000-0x9d7effff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x9d7f0000-0x9d7f4fff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0x9d7f5000-0x9fffffff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0xa0000000-0xfedfffff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0xfee00000-0xfee00fff]
Mar 14 18:27:12 mandolin kernel: PM: hibernation: Registered nosave memory: [mem 0xfee01000-0xffffffff]

I’m wondering if this indicates that the system failed to hibernate. In /var/log/messages there are 21 such lines, matching the dmesg output above, and they appear among the ordinary kernel boot messages, for example:

Mar 14 18:27:12 mandolin kernel: [mem 0xa0000000-0xfedfffff] available for PCI devices
Mar 14 18:27:12 mandolin kernel: Booting paravirtualized kernel on bare hardware
Mar 14 18:27:12 mandolin kernel: clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
Mar 14 18:27:12 mandolin kernel: setup_percpu: NR_CPUS:8192 nr_cpumask_bits:16 nr_cpu_ids:16 nr_node_ids:1
Mar 14 18:27:12 mandolin kernel: percpu: Embedded 88 pages/cpu s237568 r8192 d114688 u524288
Mar 14 18:27:12 mandolin kernel: Kernel command line: BOOT_IMAGE=(hd2,gpt2)/vmlinuz-6.13.5-200.fc41.x86_64 root=UUID=4a972d32-b4a0-418e-8f6d-f1aabcbe6b5f ro rootflags=subvol=root rhgb quiet
Mar 14 18:27:12 mandolin kernel: Unknown kernel command line parameters "rhgb BOOT_IMAGE=(hd2,gpt2)/vmlinuz-6.13.5-200.fc41.x86_64", will be passed to user space.
Mar 14 18:27:12 mandolin kernel: printk: log buffer data + meta data: 262144 + 917504 = 1179648 bytes
Mar 14 18:27:12 mandolin kernel: Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes, linear)
Mar 14 18:27:12 mandolin kernel: Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes, linear)

I doubt the problem is available swap space. I have 32GB of RAM. And here is my swap configuration:

$ swapon --show
NAME       TYPE        SIZE USED PRIO
/dev/zram0 partition     8G   0B  100
/dev/sdb4  partition 128.5G 7.7M  200
/dev/sda4  partition 128.5G 8.1M  200
$

As far as I can tell, I have had no such system failures in the past two weeks or so, since I configured swap as shown above. However, the system has silently died like this 3 or 4 times in the past 2 or 3 months. I’m wondering if I just have a failing hardware component or chip.

Can anyone suggest what to examine (commands, files, etc.)? I’ve captured all the dmesg output in a text file, and I’ve copied the /var/log/messages file so I have it all as well. I’m not sure what to look for, however.
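(In case it helps anyone advise me: I believe the journal from the boot that froze can also be pulled after the fact, assuming persistent journaling, which I think is the Fedora default. These are roughly the commands I have been using:)

$ journalctl -b -1 -e                            # journal from the previous boot, jumping to the end
$ journalctl -b -1 -p err                        # only error-priority messages and worse from that boot
$ journalctl -b -1 -k > prev-boot-kernel.txt     # kernel messages from that boot, saved off (file name is just mine)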

Thanks in advance,

Running Memtest86+ would be a reasonable thing to do.
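One way to run it (an assumption on my part about your setup): grab the bootable USB image of Memtest86+ v6 or later, which supports UEFI, and write it to a spare stick. The image file name and /dev/sdX below are placeholders; double-check the device with lsblk before writing.

$ lsblk                                                                # identify the USB stick first
$ sudo dd if=memtest86plus-usb.img of=/dev/sdX bs=4M status=progress   # overwrite the stick with the Memtest image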

Using smartctl to run a long test on sda and sdb would also be something to check, especially since you are using them for swap. I’d probably try to do an offline test from a Live image.
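Something along these lines should work; the device names come from your swapon output, and smartctl comes from the smartmontools package:

$ sudo dnf install smartmontools
$ sudo smartctl -t long /dev/sda        # start the extended self-test in the background (repeat for /dev/sdb)
$ sudo smartctl -a /dev/sda             # check back later for the self-test log and SMART attributes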

I think the PM … hib … nosave messages are just saying that if you hibernate this system, those addresses won’t be saved. You might try updating your BIOS/firmware. (Or downgrading your kernel, but since you say you’ve been seeing this for months, you might have to go back to a version that is no longer installed.)

FWIW: This thread appears to be discussing similar problems and mentioning various workarounds that have worked in some cases: https://bugzilla.kernel.org/show_bug.cgi?id=193011


Thanks, @glb

I think you’re right about the PM… messages. I did some digging before posting my question, and there were some discussions on various sites that surmised (nothing definitive) that those messages identify regions of memory that would not be saved as part of a hibernation image.

I think I’ll start with a check on whether there is a BIOS upgrade available for my hardware. And I will post my system’s hardware specs here today.
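(I’ll check Beelink’s site, and I’ll also try fwupd in case the firmware is published to LVFS, though I don’t know whether Beelink participates:)

$ fwupdmgr refresh        # pull the latest metadata from LVFS
$ fwupdmgr get-devices    # list the devices fwupd can see
$ fwupdmgr get-updates    # report any available firmware updates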

My two external hard drives are new (1 month old), and they seem to be working fine. However, I am suspicious of how Linux manages and uses swap.

I typically reboot my system only about once a month, after running a ‘dnf upgrade’. At the end of the day I just put the system in sleep mode. It’s been bothering me that, after a few days of running with a bunch of applications opened and closed, the system shows 1.1GB of USED swap on both external drives. I don’t understand why this space is not released (no longer shown as USED) after I’ve terminated the applications and have only something minimal, such as one web browser (Firefox), still running.

Typically, I could have these applications running at once:
Firefox
Libre Office (maybe 2 documents open at once)
Thunderbird
Dolphin (2 windows)
Konsole (3 windows)
Proton VPN
Spectacle (just take a screen grab and then exit)
Okular

After closing most of the above applications except typically Proton VPN and Thunderbird and Firefox, my system can sometimes still show 1.1GB of swap in use on one of the external drives, and anywhere from 400MB to 1.0GB of swap used on the other drive.

Strange. Should the ‘swapon --show’ output reflect that the swap space is no longer in use when the applications that had swapped pages have been closed?
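(For what it’s worth, this is roughly how I’ve been trying to see where the ‘used’ swap actually lives; I’m not sure it’s the best way to measure it:)

$ grep -i swap /proc/meminfo       # includes SwapTotal / SwapFree / SwapCached
$ for f in /proc/[0-9]*/status; do awk '/^Name/{n=$2} /^VmSwap/{print $2, $3, n}' "$f" 2>/dev/null; done | sort -n | tail    # biggest per-process swap users

I’ve also seen the smem utility recommended for a nicer per-process breakdown, but I haven’t installed it.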

Anyway, I’ll update with my hardware info, and I’ll look for a BIOS upgrade as you suggested. I’m going to start by poring over the bugzilla bug report you linked; thanks…!

Not necessarily. The algorithms might be tuned to conclude that freeing the memory from the cache/buffer/swap is more expensive than leaving it there until the memory is actually needed. There is a (high) probability that memory that has been read in the recent past will be read again in the near future and, in some cases/configurations, it will be faster to load the data/files from swap than from the original copies in the filesystem. (In the distant past, people would do things like locating their swap partition in the middle of the HDD, where read speeds were slightly faster, or using a separate drive just for swap.)

FWIW, there is some more info here:

My system’s processor:
https://www.techpowerup.com/cpu-specs/ryzen-7-7840hs.c3033

I’ll post hardware details if anyone is interested. But I have some work to do to search for available firmware updates, so I’ll start there.

@glb A few questions and comments… Somewhere (I can’t find the thread now) someone responded to me about a month ago saying that Linux does not do swapping at all but only paging. Is this true?

In my previous comment I tried to say that the OS should not report space as ‘in use’ if it is not. The OS might not ‘zero out’ space on disk used for swap (or paging), but I can’t imagine why it would report said space as occupied or ‘in use’ if it is not.

When a process terminates, the OS reclaims all pages of RAM; they are not seen as ‘in use’ even though most OSes do not zero out the RAM page (which is why programs are supposed to initialize memory, both stack allocated and heap allocated).

I have never heard of a system that lists swap (or paged) space as ‘in use’ if it is not, regardless of whether the physical bits have been zeroed out or not. But it’s possible GNU/Linux did it a different way? I’d be very interested to learn about it if so. It might help me understand what I’m seeing on my system.

I just did some searching on this topic but didn’t find anything clear.

Well, I’m continuing with the debugging suggestions… I’ll report back when I have more information.

I think it depends on whether you have configured swap memory of some sort (backed by ZRAM, by a swap partition on your disk, or both).

I think all that is being reported is that the data is “there” and available. For example, say the firefox binary (and related libraries) was loaded into main memory (RAM) from secondary memory (disk) sectors 5-7, 20-40, 42 (wherever those files/file-fragments happen to be stored on disk) because you clicked on the icon to run/“use” the program. Then you minimize Firefox or otherwise cease using it for some time such that it (its memory pages) eventually get pushed out to swap memory. Then you terminate the app (Firefox). There are now two copies of the same firefox binary available in your system – the original copy from the disk sectors and the “cached” copy in your swap memory. As long as no other program needs the swap space, it is perfectly fine for the system to leave the duplicate copy in swap memory for (potentially) faster reload, even though the program is no longer “running”.

You can see this idea expressed in the description of the zone_reclaim_mode setting.

BTW, loading data from swap memory likely will be faster on your system just because you have configured two swap partitions on separate disks. Linux is clever enough to recognize that separate disks are being used for swap memory and it will “stripe” the reads and writes across the different backing devices (probably in 4K blocks). So assuming your two disks have equal bandwidth, the Firefox program would load about twice as fast when you next start the program after it has been swapped to disk.
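A sketch of what that looks like in /etc/fstab; the UUIDs here are placeholders, and the equal pri= values are what make the kernel interleave pages across the two devices:

UUID=aaaaaaaa-0000-0000-0000-000000000001  none  swap  defaults,pri=200  0 0
UUID=aaaaaaaa-0000-0000-0000-000000000002  none  swap  defaults,pri=200  0 0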


Disclaimer: I’m no Linux kernel development expert (I took a one-semester class on kernel development about a quarter century ago). :slightly_smiling_face:

I found this page:
https://www.kernel.org/doc/Documentation/sysctl/vm.txt

I am confounded by this excerpt under “zone_reclaim_mode”:

zone_reclaim_mode:

Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.

This is value ORed together of

1 = Zone reclaim on
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages

zone_reclaim_mode is disabled by default. For file servers or workloads
that benefit from having their data cached, zone_reclaim_mode should be
left disabled as the caching effect is likely to be more important than
data locality.

zone_reclaim may be enabled if it’s known that the workload is partitioned
such that each partition fits within a NUMA node and that accessing remote
memory would cause a measurable performance reduction. The page allocator
will then reclaim easily reusable pages (those page cache pages that are
currently not used) before allocating off node pages.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserve the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.

I believe the 3 zone types for Linux are:

  1. DMA (direct memory access)
  2. Non-paged RAM (such as for certain parts of the OS itself)
  3. Paged RAM

But the above excerpt confuses me. Does it imply that the system will not reclaim virtual memory pages if zone reclaim is ‘off’? In other words, is it possible to turn off paging (or ‘swapping’ if Linux even does that – still no clear answer on this)?
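(I do know that the swap devices themselves can be turned off at runtime, which I suppose is the closest thing to ‘turning swapping off’; whether that counts as turning off paging is part of what I’m asking:)

$ sudo swapoff -a                  # deactivate all swap devices; anonymous pages must then stay in RAM
$ cat /proc/sys/vm/swappiness      # tunable for how eagerly the kernel swaps anonymous pages versus dropping file cache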

I think it’s talking about NUMA nodes.

The statement in that article that “the caching effect is likely to be more important than data locality” is the only part of it that I was referencing earlier. I expect that statement is generally applicable.

It’s confusing and not clearly written. There’s no reference to NUMA in the section I quoted on zone_reclaim_mode. But the page does reference zone_reclaim_mode under the paragraph “min_unmapped_ratio”.

Well, I haven’t discovered anything definite with respect to the cause of my system freeze. I did read over the bug report you referenced.

I might try to add a comment on it, but it’s an old bug. If I cannot, perhaps I’ll file a new bug and provide the data I captured in dmesg, /var/log/messages, dmidecode, hwinfo, lscpu, and lshw right after I rebooted my system following the freeze.
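(For reference, this is roughly how I captured everything right after the reboot; the output file names are just what I picked:)

$ sudo dmesg > dmesg-after-freeze.txt
$ sudo cp /var/log/messages messages-after-freeze.txt
$ sudo dmidecode > dmidecode.txt
$ sudo hwinfo --short > hwinfo.txt
$ lscpu > lscpu.txt
$ sudo lshw > lshw.txt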

A new bug report with the details of your hardware would probably be best. But you could certainly reference the old bug and ask if it is a regression. Although, it was never really closed, so maybe regression wouldn’t be the right term. :person_shrugging:

It is hard indeed to troubleshoot an issue that rarely occurs. You might try adding “debug” on your kernel command line and leaving dmesg -w running in a terminal on a secondary monitor. Glance at it once in a while and try to spot any messages that you think might be related to the problem you are hitting.
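On Fedora, something like this should do it; grubby edits the kernel arguments for the installed kernels, and you can remove the argument the same way with --remove-args later:

$ sudo grubby --update-kernel=ALL --args="debug"    # add "debug" to the kernel command line for all installed kernels
$ sudo dmesg --follow                               # or: journalctl -kf   (leave this running on the second monitor)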

@glb Good suggestion; thanks… :+1: