Is there a way to be notified of oomd getting ready to kill things before it does it?

I had a weird issue this morning where my byobu/tmux sessions suddenly died. On checking my logs, I saw that oomd had killed the tmux server, killing my 5 tmux sessions and ~10 open vim instances with it.

I found this quite surprising: usually it’s my browser or something else that gets killed, very rarely the terminal.

I wanted to see what caused this (this machine has 64 GB of RAM and I wasn’t doing anything memory-intensive), but all I can get from the journals is:

systemd-oomd[851]: Killed /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-e9adf34e-a33a-4cbf-aff2-89006ef5abd3.scope due to memory used (66506457088) / total (67143114752) and swap used (7932215296) / total (8589930496) being more than 90.00%

vte-spawn-e9adf34e-a33a-4cbf-aff2-89006ef5abd3.scope: systemd-oomd killed 273 process(es) in this unit.

So, I have two questions:

  • is there a way to know what caused my memory and swap usage to go this high? (I wasn’t doing anything new here, so it’s probably a memory leak somewhere)
  • is there a way to get a notification before oomd kills stuff, so I can manage it manually and maybe not lose all my work in the process?

I went looking and found psi-notify, which seems to be exactly what I’m looking for.

It’s available in the Fedora repos.
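
In case it helps anyone else: psi-notify is configured through a small file at ~/.config/psi-notify. A minimal sketch based on my reading of the project README, with the threshold values picked more or less at random:

update 5
threshold memory some avg10 10.00
threshold memory full avg10 5.00
threshold io full avg10 15.00

It then pops up a desktop notification whenever one of those pressure thresholds is crossed.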

So that answers part 2, still looking for an answer for question 1.

I also saw that one can modify oomd settings to prevent certain cgroups/slices from being killed:

https://fedoraproject.org/wiki/Changes/EnableSystemdOomd#How_to_test

systemd-cgls shows me that gnome-terminal has its own slice, so I’ll try tweaking its settings so that oomd “avoids” killing it.
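
For reference, the setting that controls this is ManagedOOMPreference= from systemd.resource-control(5). A sketch of the drop-in I have in mind; the exact path is my guess for a per-user slice:

# ~/.config/systemd/user/app-org.gnome.Terminal.slice.d/override.conf
[Slice]
ManagedOOMPreference=avoid

followed by systemctl --user daemon-reload (or logging out and back in) so the user manager picks it up.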

Well, I’ve never had to manage this issue myself, but a rather “primitive” approach that came to mind when I read your issue is to have top running in parallel and watch how things develop. It shows you the memory and CPU use of each process. The output looks horrible when redirected to a file (top isn’t really meant for that), but you can still run top > file, and even if top gets killed at some point, the bottom of the file will show the last recorded conditions. So you can let this run in the background and, when the issue happens again, check the end of the file; it at least helps to identify whether a specific process is responsible. But I suspect the file will eat storage quickly (maybe man top has a solution for that).
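
For what it’s worth, top does have a batch mode (-b) meant for non-interactive output. A rough sketch of what I’d run; the file path, interval, and snapshot size are arbitrary choices of mine:

# append a timestamped, memory-sorted snapshot every 30 seconds
while true; do
    { date; top -b -n 1 -o %MEM | head -n 20; echo; } >> /tmp/top-mem.log
    sleep 30
done

That keeps only the first 20 lines of each snapshot (the summary plus the biggest memory consumers), so the file grows much more slowly than a full top > file dump.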

I don’t know a way to do this locally, but at my day job we have Centreon/Nagios checks for low-available-memory events via SNMP. You could set up an SNMP trap, but it wouldn’t tell you which specific process is at risk.

I haven’t attempted this, but if I wanted to do that, my guess would be to create a /etc/systemd/system/systemd-oomd.service.d/override.conf file containing the following, then run systemctl daemon-reload followed by systemctl restart systemd-oomd.service, and leave tail -f /var/log/oomd.log running in a terminal somewhere while trying to trigger the problem.

[Service]
# Empty ExecStart= clears the packaged command; a drop-in can otherwise only add a second one, which systemd rejects here.
ExecStart=
ExecStart=/usr/lib/systemd/systemd-oomd --dry-run
StandardOutput=file:/var/log/oomd.log
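
Applying it would look roughly like this (just the steps from above, spelled out):

sudo systemctl daemon-reload
sudo systemctl restart systemd-oomd.service
tail -f /var/log/oomd.log

With --dry-run, systemd-oomd should log the cgroup it would have killed instead of actually killing it, which ought to help narrow down what is eating the memory.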

Edit: Maybe also try to increase swap so you might have more time to troubleshoot/identify the problem before the system comes to a screeching halt due to insufficient memory.
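
If those 8 GB of swap are the default zram device (an assumption on my part, but that’s the usual setup on current Fedora), it can be enlarged through zram-generator, something like:

# /etc/systemd/zram-generator.conf
[zram0]
zram-size = 16384

The size is in MB, so that would double it to 16 GB; a reboot (or swapoff /dev/zram0 followed by restarting systemd-zram-setup@zram0.service) should apply it. A plain swap file or partition would of course work just as well.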

I’m giving psi-notify a try to see what it says for a start, and then I’ll look at configuring oomd to not kill gnome-terminal, or leave that till the end.

The machine has 64 gigs of RAM and 8 of swap. Even when I built heavy Fedora packages (like chromium/qt5-qtwebengine), this was quite enough, so I don’t want to tweak configs unless it’s absolutely necessary.