I’m pretty sure that switching off the OOM killer is a terrible idea.
I’m even more sure that the important thing you need to try is to fix the configuration of your swap space.
By default, Fedora uses zram swap, which compresses stale pages within ram rather than paging them out to disk. I can see why: for a typical user who has more ram than they really need but leaves open a few too many applications that they aren’t actually using, zram can be a modest improvement to performance. But I think that is a narrow range of benefit, and one that is hard to really identify.
In general, but especially for “scientific computing”, zram is a terrible idea. You need real swap space.
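To see what kind of swap you currently have, here is a minimal sketch that parses the text of `/proc/swaps` and flags zram devices. The function names are my own, and the check that a name starting with `/dev/zram` means zram is an assumption about device naming, not something the kernel guarantees.

```python
def parse_swaps(text):
    """Parse the contents of /proc/swaps into a list of dicts (sizes in KiB)."""
    entries = []
    lines = text.strip().splitlines()
    for line in lines[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue  # skip anything that does not look like a swap entry
        name, kind, size_kib, used_kib, priority = fields[:5]
        entries.append({
            "name": name,
            "type": kind,
            "size_kib": int(size_kib),
            "used_kib": int(used_kib),
            "priority": int(priority),
            # Assumption: zram swap devices are named /dev/zram0, /dev/zram1, ...
            "is_zram": name.startswith("/dev/zram"),
        })
    return entries


def has_disk_backed_swap(entries):
    """True if at least one active swap area is not a zram device."""
    return any(not e["is_zram"] for e in entries)
```

On a running system you would feed it `open("/proc/swaps").read()`. If `has_disk_backed_swap` comes back false, all of your swap is compressed ram.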
After you figure out how much real swap space you need and configure it, the OOM killer will no longer kill anything, and your application will continue running.
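As a rough way to think about "how much", here is a sketch of the arithmetic, assuming you already have some estimate of the peak memory your application needs. The `app_peak_kib` and `reserve_kib` parameters are hypothetical inputs you would supply yourself, and the default reserve of 2 GiB for everything else on the system is a guess, not a recommendation.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of {field name: integer value}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])  # most fields are in kB
    return info


def extra_swap_needed_kib(meminfo, app_peak_kib, reserve_kib=2 * 1024 * 1024):
    """Estimate how much more swap (KiB) is needed to fit the application.

    Caveat: if SwapTotal is zram, it lives inside ram, so counting it
    here overstates the real capacity.
    """
    capacity = meminfo["MemTotal"] + meminfo.get("SwapTotal", 0)
    return max(0, app_peak_kib + reserve_kib - capacity)
```

A result of zero only means the totals fit. It says nothing about whether the run will finish in reasonable time once it starts paging.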
Depending on the memory access pattern within that application, maybe the whole problem is then solved. Or maybe you just don’t have enough ram for that application to complete in reasonable time. Or maybe you have a memory leak in the application, and adding swap space just delays the failure.
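One crude way to tell those cases apart is to sample the application's total footprint (resident plus swapped) from `/proc/<pid>/status` while it runs and see whether it levels off. The sketch below is only a heuristic of my own: the function names, the `window`, and the `min_growth_kib` threshold are all assumptions, and steady growth is consistent with a leak but does not prove one.

```python
def footprint_kib(status_text):
    """Sum VmRSS and VmSwap (both reported in kB) from /proc/<pid>/status text."""
    total = 0
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmRSS", "VmSwap"):
            total += int(rest.split()[0])
    return total


def still_growing(samples_kib, window=5, min_growth_kib=1024):
    """True if each of the last `window` steps grew by at least min_growth_kib.

    Returns False when there are too few samples to judge.
    """
    recent = samples_kib[-(window + 1):]
    if len(recent) < window + 1:
        return False
    return all(b - a >= min_growth_kib for a, b in zip(recent, recent[1:]))
```

You would call `footprint_kib` on the status file every minute or so, append the result to a list, and pass that list to `still_growing`. A footprint that plateaus suggests the application just needs that much memory.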
Another warning is that you may need to be patient when the system seems to be hung (when running a simulation with properly configured swap). You may be paging out key parts of the GUI when you stop using them for a few minutes while waiting for the simulation. Then when you try to use them again, everything seems frozen. No OS designer has put in the research needed to get the GUI back from swap in reasonable time. The process that brings it back when you start using it again eventually works, but it takes far longer than seems to make sense unless you understand the complications behind the scenes.
The OOM killer is not actually counterproductive for your use case, and even less so for ordinary users. If you have a memory leak in an application, you want that to bring down the application rather than lock up the whole system. If your application is simply using more memory than your ram+swap permits, you would still prefer that the application fail rather than lock up the whole system (including that application).
You have described, and maybe even really have, a boundary case: The randomness you claim would imply you have just barely enough ram plus swap for what you’re trying to do. It is certainly possible in a narrow range of such boundary cases for the OOM killer to kill something when no such action was really needed. But even in that boundary case, the right answer is to increase the amount of swap so you aren’t in that boundary case.