In the twelve years I've used Linux, the OOM killer has never once triggered before my system ground to a halt and never recovered. This problem has not been solved.
Pro-Tip: You can use Alt + SysRq + F to trigger the OOM killer immediately. It has saved me from pulling the plug on my desktops on multiple occasions over the years when I accidentally started RAM-eating programs. Just make sure SysRq is enabled in sysctl.
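To make the "enabled in sysctl" part concrete, here is a minimal sketch that decodes a `kernel.sysrq` value. It assumes the usual bitmask meaning (0 disables everything, 1 enables everything, and bit 64 covers process signalling, which includes the manual OOM kill). The function names are made up for illustration.

```python
from pathlib import Path

# Bit in the kernel.sysrq bitmask that permits process signalling
# (terminate, kill, and the manual OOM kill bound to the F key).
SYSRQ_SIGNAL_BIT = 64


def sysrq_allows_oom_kill(value: int) -> bool:
    """Return True if this kernel.sysrq value permits Alt + SysRq + F."""
    if value == 1:  # 1 means every SysRq function is enabled
        return True
    return bool(value & SYSRQ_SIGNAL_BIT)


def read_sysrq(path: str = "/proc/sys/kernel/sysrq") -> int:
    """Read the current setting (Linux only)."""
    return int(Path(path).read_text().strip())
```

On a Linux box, `sysrq_allows_oom_kill(read_sysrq())` tells you whether the key combination will do anything. Some distributions ship a default that leaves the signalling bit off, so it is worth checking before you need it.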
In ops, I've seen this happen many times: a Linux server running happy and free because the kernel OOM killer murdered the reason that server existed, leaving alone some side process with a memory leak (usually some external service agent or maintenance service run amok). I learned well how to fix that over the years.
(spoiler alert: it was much more often about becoming ornery when devs insisted that their JVM app could have a 12 GB heap on a machine with 12 GB of memory)
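The arithmetic behind that argument, as a rough sketch. The reserve for the OS and the non-heap overhead fraction are placeholder assumptions, not measured numbers; the only point is that the heap is not the JVM's whole footprint, so heap size has to come in under physical RAM.

```python
def max_safe_heap_gib(total_ram_gib: float,
                      os_reserve_gib: float = 1.0,
                      jvm_non_heap_fraction: float = 0.25) -> float:
    """Rough upper bound for the heap, under the assumed overheads.

    os_reserve_gib: memory left for the kernel and other processes (assumed).
    jvm_non_heap_fraction: non-heap JVM memory (thread stacks, metaspace,
    code cache, direct buffers) as a fraction of the heap (assumed).
    """
    usable = total_ram_gib - os_reserve_gib
    if usable <= 0:
        return 0.0
    # heap + heap * jvm_non_heap_fraction has to fit into the usable memory
    return usable / (1.0 + jvm_non_heap_fraction)
```

With these placeholder values a 12 GB machine leaves room for a heap of under 9 GB, nowhere near 12.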
I am in the camp of having systemd, or your service manager of choice, restart any service that is killed or failing its health check, and emit a log message at ERROR or higher.
I want my systems to heal themselves. More often than not these memory problems turn out to be slow leaks that periodic restarts effectively resolve for good, and the engineering time to fix them properly is, appropriately, not prioritized.
I want to know that there has been a problem but I would rather not be forced to do anything about it unless absolutely necessary.
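The restart-and-log policy described above, as a small Python stand-in. This is not how systemd does it (there you would reach for the unit's `Restart=` setting); `supervise` and its parameters are hypothetical names for illustration only.

```python
import logging
import subprocess
import sys

log = logging.getLogger("supervisor")


def supervise(cmd: list[str], max_restarts: int = 3) -> list[int]:
    """Run cmd, restarting it after abnormal exits; return every exit status seen."""
    statuses = []
    for attempt in range(max_restarts + 1):
        status = subprocess.run(cmd).returncode
        statuses.append(status)
        if status == 0:
            break  # clean exit, nothing to heal
        # On POSIX a negative status means the child died from a signal;
        # -9 is SIGKILL, which is what the OOM killer sends.
        log.error("command %s exited with status %d (attempt %d)",
                  cmd, status, attempt + 1)
    return statuses
```

The ERROR log line is the "I want to know" part; the loop is the "I would rather not be forced to do anything" part. A real service manager would also add a delay between restarts and a rate limit, both left out here.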
That is probably being caused by your use of swap space, not an OOM issue. I've had multiple cases of the OOM killer kicking in on my system, all without it slowing way down.
In my experience, lacking swap space causes more severe symptoms in an OOM situation, not less. I think this is because everything that can be evicted from RAM is evicted before the OOM killer gets invoked, and without swap that means file-backed pages such as program code and cached files, so every disk access slows to a crawl.
Yep, when Chromium ate up all the memory, it just hung the whole OS. I waited 10+ minutes for the OOM killer, then the cursor could be moved again, then it froze again...
If you want hilarious fun: make a GL shader that takes ~30 seconds to run. GPUs have only very recently become preemptible (if they are at all yet? I lose track of what is “planned” vs released).
Make it run that shader in a loop.
See how well your system appears to respond.
IIRC macOS has a 60s or so watchdog that hard-resets the GPU. While the GPU is hung the screen is not updated: everything is running fine, the CPU isn’t pinned or anything, but the GPU is blocked, so no compositing and no screen updates.
I’m not sure what Linux does in that case, and I think Windows may be able to paint because the DirectX driver interfaces let it do ... something? I’ve always assumed some way to DMA straight to the framebuffer, but no real idea :)
I'm on 5.0.0 and I legit haven't noticed a difference. If I run out of RAM without swap the system freezes and if I have swap then the system freezes when both are full. The only reliable solution is having a RAM+swap usage graph on my screen at all times and then closing stuff manually.
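For anyone who wants the number behind such a graph, a minimal sketch that computes combined RAM+swap usage from the text of `/proc/meminfo`. It treats `MemAvailable` as "free" RAM, which is an assumption about what you want to count; the function names are made up.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Map each /proc/meminfo field name to its numeric value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def ram_and_swap_used_percent(text: str) -> float:
    """Percentage of RAM+swap in use, counting MemAvailable and SwapFree as free."""
    info = parse_meminfo(text)
    total = info["MemTotal"] + info["SwapTotal"]
    free = info["MemAvailable"] + info["SwapFree"]
    return 100.0 * (total - free) / total
```

Feed it the contents of `/proc/meminfo` every second or so and you have the data for the graph, or for an alert that fires well before the freeze.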
The system also freezes without any swap enabled, but much more suddenly (there's no slowdown before dying). It's really just that the OOM killer triggers way, way, way too late.
The OOM killer is a “solution” to a very real and sensible design choice: not committing physical memory and swap whenever address space is mapped. There are very good (and noticeable) reasons for not committing eagerly, but fundamentally, once you have made that choice, you have to decide what to do when you end up needing more physical space than is available.
Linux went down the “if a process is trying to do this, it must be important, so I’ll prioritise it and kill something else” route; the alternative is to kill that process when the commit fails.
Either is a valid option. The OOM killer ran against a regular desktop user’s idea of the correct course of action, but for a server it might not have.
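Linux exposes this choice as the `vm.overcommit_memory` sysctl. A small sketch that decodes its three documented values; the one-line descriptions are my paraphrase, and the helper name is made up.

```python
# Paraphrased meanings of the vm.overcommit_memory values.
OVERCOMMIT_MODES = {
    0: "heuristic: obvious overcommits are refused, everything else is allowed",
    1: "always: the kernel never refuses an allocation for lack of memory",
    2: "never: allocations beyond the commit limit fail at request time",
}


def describe_overcommit(raw: str) -> str:
    """Describe the contents of /proc/sys/vm/overcommit_memory."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")
```

Mode 2 is the closest thing to the alternative described above, except that the requesting process gets a failed allocation rather than being killed outright. What it does with that failure is up to the program.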
It triggered on one of my systems yesterday, and killed the runaway process.
When we were running tests of a new distributed system on our development (slightly underspecced) cluster, it would kill the distributed system processes when they took too much RAM.
As others write, slow or "too much" swap can keep the OOM killer from running in reasonable time.