Out-of-memory and the system freezes, systemd-oomd working correctly or not?

I sometimes run out of memory (e.g. because of a programming mistake).

The problem is that in cases like these the entire system freezes and doesn’t respond (or only very slowly and sporadically), so in practice I have to force a reboot.

Ideally I would expect the OS to kill the process using the most RAM. My understanding is that this is the job of systemd-oomd on Fedora. However, when I look at journalctl -u systemd-oomd.service I only see start and stop messages:

-- Boot c9679079a3c943df94fa6b9083bf09a3 --
Dec 04 02:42:18 tsoj-desktop systemd[1]: Starting systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer...
Dec 04 02:42:18 tsoj-desktop systemd[1]: Started systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer.
Dec 06 11:40:28 tsoj-desktop systemd[1]: Stopping systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer...
Dec 06 11:40:28 tsoj-desktop systemd[1]: systemd-oomd.service: Deactivated successfully.
Dec 06 11:40:28 tsoj-desktop systemd[1]: Stopped systemd-oomd.service - Userspace Out-Of-Memory (OOM) Killer.
Dec 06 11:40:28 tsoj-desktop systemd[1]: systemd-oomd.service: Consumed 1min 15.133s CPU time.
-- Boot f7940714ed364fbf84323016b1d9d34a --

How can I check if systemd-oomd is configured and running correctly? Or am I misunderstanding something and is it even expected that systemd-oomd kills some processes such that the OS remains usable in an OOM situation?
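One way to get a feel for what systemd-oomd reacts to is the kernel's pressure stall information (PSI), since its memory-pressure policy is based on that rather than on raw RAM usage. Below is a minimal Python sketch, assuming a kernel with PSI enabled so that /proc/pressure/memory exists; the function name parse_psi is made up for illustration. The oomctl command that ships with systemd-oomd prints the cgroups it is monitoring and their limits, which is probably the more direct check of the configuration.

```python
from pathlib import Path

PSI_MEMORY = Path("/proc/pressure/memory")  # only present when the kernel has PSI enabled


def parse_psi(text):
    """Parse PSI text into {"some": {"avg10": 0.0, ...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


# Read the live file only where it is available, so the snippet is harmless elsewhere.
try:
    psi = parse_psi(PSI_MEMORY.read_text())
    print("memory pressure, 'full' avg10:", psi.get("full", {}).get("avg10"))
except OSError:
    print("PSI is not available on this system")
```

If the avg10 value stays near zero while the machine is already crawling, that would suggest the pressure-based policy is not what is going to rescue the session.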

System

System:
Host: tsoj-desktop Kernel: 6.6.6-200.fc39.x86_64 arch: x86_64 bits: 64
Desktop: GNOME v: 45.2 Distro: Fedora release 39 (Thirty Nine)
Machine:
Type: Desktop System: ASUS product: N/A v: N/A serial:
Mobo: ASUSTeK model: PRIME Z690-P WIFI v: Rev 1.xx
serial: UEFI: American Megatrends v: 2014
date: 10/14/2022
CPU:
Info: 24-core (8-mt/16-st) model: 13th Gen Intel Core i9-13900K bits: 64
type: MST AMCP cache: L2: 32 MiB
Speed (MHz): avg: 3445 min/max: 800/5500:5800:4300 cores: 1: 3800 2: 3800
3: 3800 4: 3779 5: 3779 6: 3779 7: 3779 8: 3779 9: 3779 10: 3779 11: 3800
12: 3800 13: 3800 14: 3800 15: 3800 16: 3800 17: 3100 18: 3100 19: 3100
20: 3100 21: 3100 22: 3100 23: 3100 24: 3100 25: 3100 26: 3100 27: 3100
28: 3100 29: 3100 30: 3100 31: 3100 32: 3100
Graphics:
Device-1: AMD Navi 31 [Radeon RX 7900 XT/7900 XTX] driver: amdgpu v: kernel
Display: wayland server: X.Org v: 23.2.3 with: Xwayland v: 23.2.3
compositor: gnome-shell driver: X: loaded: modesetting unloaded: fbdev,vesa
dri: radeonsi gpu: amdgpu resolution: 2560x1440~144Hz
API: OpenGL v: 4.6 vendor: amd mesa v: 23.3.1 renderer: AMD Radeon RX
7900 XT (radeonsi navi31 LLVM 17.0.6 DRM 3.54 6.6.6-200.fc39.x86_64)
API: EGL Message: EGL data requires eglinfo. Check --recommends.
Audio:
Device-1: Intel Alder Lake-S HD Audio driver: snd_hda_intel
Device-2: AMD Navi 31 HDMI/DP Audio driver: snd_hda_intel
API: ALSA v: k6.6.6-200.fc39.x86_64 status: kernel-api
Server-1: PipeWire v: 1.0.0 status: active
Network:
Device-1: Intel Alder Lake-S PCH CNVi WiFi driver: iwlwifi
IF: wlo1 state: up mac: d4:54:8b:29:e8:b1
Device-2: Realtek RTL8125 2.5GbE driver: r8169
IF: enp8s0 state: down mac: 50:eb:f6:b0:a2:b9
Bluetooth:
Device-1: Intel AX201 Bluetooth driver: btusb type: USB
Report: btmgmt ID: hci0 rfk-id: 0 state: down bt-service: enabled,running
rfk-block: hardware: no software: yes address: D4:54:8B:29:E8:B5 bt-v: 5.2
RAID:
Hardware-1: Intel Volume Management Device NVMe RAID Controller Intel
driver: vmd
Drives:
Local Storage: total: 1.36 TiB used: 776.75 GiB (55.6%)
ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 980 PRO 1TB size: 931.51 GiB
ID-2: /dev/sda vendor: Samsung model: SSD 850 EVO 500GB size: 465.76 GiB
Partition:
ID-1: / size: 929.93 GiB used: 776.38 GiB (83.5%) fs: btrfs
dev: /dev/nvme0n1p3
ID-2: /boot size: 973.4 MiB used: 336.7 MiB (34.6%) fs: ext4
dev: /dev/nvme0n1p2
ID-3: /boot/efi size: 598.8 MiB used: 43.9 MiB (7.3%) fs: vfat
dev: /dev/nvme0n1p1
ID-4: /home size: 929.93 GiB used: 776.38 GiB (83.5%) fs: btrfs
dev: /dev/nvme0n1p3
Swap:
ID-1: swap-1 type: zram size: 8 GiB used: 0 KiB (0.0%) dev: /dev/zram0
Sensors:
System Temperatures: cpu: 60.0 C mobo: N/A gpu: amdgpu temp: 46.0 C
Fan Speeds (rpm): N/A gpu: amdgpu fan: 21
Info:
Processes: 653 Uptime: 1h 55m Memory: total: 64 GiB available: 62.53 GiB
used: 7.39 GiB (11.8%) Shell: fish inxi: 3.3.31

From what I see there, I seriously doubt this is memory related. I also doubt it is overloading the CPU, since you have 32 logical cores and, judging by the reported frequencies, they appear to be only lightly loaded.

More info would be needed. The output of htop, iotop, or top while the “freeze” occurs may help; at the very least it may show which process is using the most CPU time.

What I would be most concerned about is this nearly full btrfs file system, which may dramatically affect system performance in some circumstances.

ID-1: / size: 929.93 GiB used: 776.38 GiB (83.5%) fs: btrfs
dev: /dev/nvme0n1p3
ID-4: /home size: 929.93 GiB used: 776.38 GiB (83.5%) fs: btrfs
dev: /dev/nvme0n1p3
I am indeed not 100% sure that in each case it’s memory related, since I have no way of checking after the system freezes (tried ctrl+alt+F3, but even if it works, I get timed-out trying to log in).

However, sometimes I am 90% sure it’s memory related, because I am working with large amounts of data and actively watching memory consumption.

What I would be most concerned about is this nearly full btrfs file system, which may dramatically affect system performance in some circumstances.

Thanks for the info, I didn’t know about this.

That may also be related since there may be some temporary files used to manipulate the data which could use space in the file system.

Having one or more of the ‘top’ displays running continuously before the system encounters the slow down may provide some clues.
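If keeping top open is inconvenient, a small script can take the same kind of snapshot. This is only a sketch, assuming a Linux /proc layout where /proc/<pid>/status carries Name and VmRSS lines; rss_kib and top_memory_processes are hypothetical helper names, not part of any tool mentioned in this thread.

```python
import os


def rss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or 0 if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0  # kernel threads have no VmRSS line


def top_memory_processes(n=5, proc="/proc"):
    """Return up to n (rss_kib, pid, name) tuples, largest resident set first."""
    rows = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc, entry, "status")) as f:
                text = f.read()
        except OSError:
            continue  # process exited in the meantime or is not readable
        # The first line of the status file has the form "Name:\t<command>".
        name = text.split("\n", 1)[0].partition(":")[2].strip()
        rows.append((rss_kib(text), int(entry), name))
    return sorted(rows, reverse=True)[:n]
```

Calling top_memory_processes() every few seconds and printing the result gives a rough record of which processes were growing before the slowdown.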

there may be some temporary files used to manipulate the data which could use space in the file system.

If you’re referring to the full btrfs, I don’t think so. The programs that I use work only in RAM; they make no filesystem writes and only occasional reads. Maybe I’ll try to write a synthetic test when I am done with the stuff that’s currently running on my PC.
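A synthetic test along those lines could look roughly like the following. It is a sketch, not a calibrated benchmark: the function name allocate and the 256 MiB default chunk size are arbitrary choices for illustration, and running it with a value close to the installed RAM can of course freeze the machine, which is the point of the experiment.

```python
def allocate(total_mib, chunk_mib=256):
    """Allocate total_mib MiB in chunks and touch every page so it becomes resident."""
    chunks = []
    allocated = 0
    while allocated < total_mib:
        size = min(chunk_mib, total_mib - allocated)
        block = bytearray(size * 1024 * 1024)
        # A fresh bytearray is zero-filled; writing one byte per 4 KiB page
        # makes the kernel back the allocation with real memory.
        for offset in range(0, len(block), 4096):
            block[offset] = 1
        chunks.append(block)
        allocated += size
    return chunks  # keep the returned list alive to hold on to the memory
```

Starting with something small such as allocate(1024) and raising the amount step by step, with monitoring already running, keeps the experiment somewhat controlled.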

Do you have another machine you can ssh into your dev machine from?

If so try running some commands to see what happens leading up to the freeze.
In one SSH session run sudo journalctl -f, and in another run top.

Then when the freeze happens you should be able to see what is going on.
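Without a second machine, a similar record can be written to disk and read after the forced reboot. The sketch below assumes the usual /proc/meminfo fields MemAvailable and SwapFree; parse_meminfo and log_memory are illustrative names, and the output path is whatever file you choose.

```python
import os
import time


def parse_meminfo(text):
    """Return /proc/meminfo values as {field: number}; most fields are in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key] = int(rest.split()[0])
    return values


def log_memory(path, interval=1.0, samples=None, source="/proc/meminfo"):
    """Append timestamped MemAvailable/SwapFree lines to path, syncing each one."""
    count = 0
    with open(path, "a") as out:
        while samples is None or count < samples:
            with open(source) as f:
                info = parse_meminfo(f.read())
            out.write("%s MemAvailable=%s kB SwapFree=%s kB\n" % (
                time.strftime("%H:%M:%S"),
                info.get("MemAvailable"),
                info.get("SwapFree"),
            ))
            out.flush()
            os.fsync(out.fileno())  # push the line to disk before a possible freeze
            count += 1
            time.sleep(interval)
```

Started in a terminal as log_memory("memlog.txt") before the heavy work begins, the last few lines of the file should show whether available memory and swap were actually exhausted at the time of the freeze.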

I did a few experiments. I managed to freeze the OS for some time by opening multiple VS Code windows and, in the integrated terminal of each window, setting the hash table size of the chess engine Nalwald to 10-20 GB until the RAM was full. The freeze lasted longer if I also opened Firefox with many tabs.

However, while this freeze took a minute or so to clear up, I didn’t manage to reproduce a freeze that took significantly longer than that.

Regardless, I think that the problem is indeed an out-of-memory situation with possibly multiple programs using large parts of the RAM.

What system is primarily responsible for killing out-of-memory processes on Fedora, and is there a way to take a look into the logfiles or something? journalctl -u systemd-oomd.service still isn’t showing anything, even though I witnessed multiple killings of processes because of memory usage.
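As far as I understand, two things can kill processes here: systemd-oomd, which logs under its own unit, and the kernel's OOM killer, which logs to the kernel ring buffer (journalctl -k). If the systemd-oomd unit log is empty, the kills were possibly done by the kernel. The following sketch filters kernel log lines for kill messages; the message wording in the regular expression is an assumption based on recent kernels and may differ between versions, and kernel_oom_kills is a made-up name.

```python
import re

# Assumed shape of the kernel message; the wording can differ between kernel versions.
KERNEL_OOM = re.compile(r"out of memory: Killed process (\d+) \((.+?)\)", re.IGNORECASE)


def kernel_oom_kills(lines):
    """Return (pid, name) for each kernel OOM-killer line found in lines."""
    kills = []
    for line in lines:
        match = KERNEL_OOM.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

Feeding it the lines of journalctl -k output (for example by passing sys.stdin) should list which processes the kernel killed, if any.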

Hello @tsojtsoj, not sure if you have looked around here regarding your issue with oomd, but I thought you might like this read: How to switch off systemd's out-of memory killer daemon - #16 by abdiel