It was just a little while past the Sunset Strip They found the girl's body in an open pit Her mouth was sewn shut, but her eyes were still wide Gazing through the fog to the other side
"Black River Killer" by Blitzen Trapper
Introduction
This one's personal! Over 15 years of working on DBMS systems, the OOM killer has led me down more than its fair share of debugging rabbit holes. Anyone who's been around the block in Linux systems programming has probably crossed paths with the Linux OOM killer. This is the part of the kernel that attempts to maintain forward progress when faced with the impossible situation of applications wanting to use more memory than the system has. The OOM killer balances a lot of competing interests, but in the end, it just picks a victim process and kills it.
You're a busy professional, so the TL;DR is: we built an OOM monitoring system called OOMProf in eBPF that profiles Go programs at the point they are OOM kill'd, capturing allocations up to the bitter end to give developers a better idea of exactly what went wrong. If you're lazy, or unprofessional, or just want to know more about how the sausage was made, read on! And as an extra bonus, we've littered this blog post with stanzas from a great modern folk murder ballad for you to enjoy!
They booked me on a whim and threw me deep in jail With no bail, sitting silent on a rusty pail Just gazing at the marks on the opposite wall Remembering the music of my lover's call
The Problem With OOM Kills
The problem with OOM-killed programs is that developers faced with explaining what happened have very little to go on. First of all, the application that causes the OOM killer to fire may not be the process that gets killed. And in the more common case, where the application getting killed actually did something to deserve it, the straw that breaks the camel's back may be some innocuous common allocation, or even a memory-allocator-induced page fault, entirely unrelated to the actual pile-up.
These are often mysterious crashes that linger in the bug database and lead to many developer shoulder shrugs. Long, slow, leak-style bugs can usually be drawn out and spotted with sampling memory profiling solutions (happily available in production environments in most ecosystems these days), but when things go off the rails quickly it can be surprisingly difficult to get clues about what happened. Imagine some service dies, which causes retry loops and work queues to pile up quickly. If you have a distributed system processing millions of things at a time, it's amazing how fast things can unravel when certain failures aren't planned for or handled properly. Of course, it doesn't have to be complicated: an OOM can occur simply because a program tries to allocate some huge amount of memory after failing to validate user input. So even a generous continuous memory profiling solution won't save you when your program dies quickly.
So you make no mistake, I know just what it takes To pull a man's soul back from heaven's gates I've been wandering in the dark about as long as sin But they say it's never too late to start again
Failure is not an option
For those who aren't familiar with the OOM killer, the problem is that by default in Linux allocations never fail (overcommit). This is nice in practice because you can avoid checking every malloc for null returns, but the price for that luxury is that if Linux can't manage to satisfy your memory request, it'll start nuking processes. You won't get a crash; you'll just get a vaporized process and a trail of tears in your dmesg log. Stack trace? Fuhgeddaboudit! You see, Linux will return large allocations as pointers to reserved virtual memory that isn't yet mapped to real memory, and only later, when you read or write to that memory, will a page fault occur, and the kernel back trace will be something super unhelpful like:
[753999.733881] Call Trace:
[753999.733884] <TASK>
[753999.733889] dump_stack_lvl+0x76/0xa0
[753999.733898] dump_stack+0x10/0x20
[753999.733901] dump_header+0x46/0x1f0
[753999.733909] oom_kill_process+0x118/0x280
[753999.733914] ? oom_kill_process+0x5/0x280
[753999.733919] out_of_memory+0x102/0x340
[753999.733922] mem_cgroup_out_of_memory+0x13b/0x170
[753999.733929] try_charge_memcg+0x40a/0x5a0
[753999.733934] __mem_cgroup_charge+0x45/0xc0
[753999.733938] alloc_anon_folio+0x206/0x430
[753999.733944] do_anonymous_page+0x170/0x420
[753999.733949] handle_pte_fault+0x1bd/0x1d0
[753999.733954] __handle_mm_fault+0x62f/0x770
[753999.733960] handle_mm_fault+0x109/0x2d0
[753999.733964] do_user_addr_fault+0x1d2/
[753999.733971] exc_page_fault+0x85/0x1c0
[753999.733978] asm_exc_page_fault+0x27/0x30
What did your application do to get here? You might as well ask how many liters are in a gallon. Nobody knows. But seriously, what can you do? We already talked about how continuous profiling is a quasi-solution, but another one is to turn off overcommit and try to get your application to crash at the allocation site where memory ran out. However, that's not a panacea: some applications rely on overcommit, and the kernel is still allowed to return uncommitted memory that will be paged in on demand, so your crash might occur on an innocuous write to memory after crossing a page boundary. Another classic technique is to put tracing on heap growth (i.e. dump the stack every time the memory allocator calls mmap to ask the OS for more memory); if quick failures are the problem, there's a chance the call stack of the last heap-growth event can give important clues about what went wrong. And since allocators tend to grow the heap in meaty chunks, logging all of them is feasible. But it's a bit of a hail mary.
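If you want to try the heap-growth-tracing idea in a Go program without going anywhere near mmap, here's a minimal sketch we put together for this post (not part of OOMProf): it polls the runtime's own heap statistics and dumps goroutine stacks whenever the heap jumps by a configurable step. The function name, threshold, and polling interval are all made up for the example, and runtime.ReadMemStats briefly stops the world, so treat it as a debugging aid rather than something to leave on in production.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// watchHeapGrowth polls the runtime's heap statistics and dumps all goroutine
// stacks whenever HeapSys has grown by more than `step` bytes since the last
// dump. It's a userspace stand-in for hooking the allocator's mmap calls.
func watchHeapGrowth(step uint64, interval time.Duration) {
	var last uint64
	var ms runtime.MemStats
	for {
		runtime.ReadMemStats(&ms) // note: briefly stops the world
		if ms.HeapSys >= last+step {
			last = ms.HeapSys
			log.Printf("heap grew to %d MiB, dumping goroutines", ms.HeapSys>>20)
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 1)
		}
		time.Sleep(interval)
	}
}

func main() {
	go watchHeapGrowth(256<<20, 100*time.Millisecond) // dump on every ~256 MiB of growth

	// ... the rest of your program ...
	select {}
}
```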
So I took a bus down to the Rio Grande And I shot a man down on the edge of town Then I stole me a horse and I rode it around 'Til the sheriff pulled me in and sat me down
More Problems
Another complication is that even if you magically knew to take a heap profile right before OOM death, it might not help you, because at least for garbage-collected languages the latest heap profile can be out of date. In Go, for instance, the heap profile is only published at each sweep, so any allocations made since then won't be represented (Go Sweep Accounting Diagram). What you really want is the ability to record a heap profile right at the OOM moment, one that includes information on recent allocations.
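You can see this lag for yourself with a tiny demo (ours, not OOMProf's): allocate something, then count the live heap-profile records before and after forcing a GC. On most runs the new allocation only shows up after runtime.GC(), because the published profile reflects the most recently completed GC cycle.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps the demo allocation reachable so the GC can't free it.
var sink []byte

// liveRecords reports how many heap-profile records currently show live memory.
func liveRecords() int {
	n, _ := runtime.MemProfile(nil, false)
	return n
}

func main() {
	runtime.MemProfileRate = 1 // sample everything; demo only, far too expensive for real programs

	fmt.Println("before alloc:", liveRecords())
	sink = make([]byte, 64<<20)                 // 64 MiB the profile hasn't "seen" yet
	fmt.Println("after alloc: ", liveRecords()) // usually unchanged: not published yet
	runtime.GC()                                // a completed GC cycle publishes pending profile data
	fmt.Println("after GC:    ", liveRecords()) // now the new allocation shows up
}
```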
When the OOM killer machinery runs, you also don't know what the context process will be: if you attach kprobes to it and look at which process is associated with the kernel calls, it could be any process on the system. One of eBPF's limitations is that if you want to start poking around in userspace memory, you can only do it in the process that fired the probe. So program 'A' can trigger a page fault which leads to an OOM, and process 'B' will be selected for killing. How do we get an eBPF program to run in 'B's address space?
He said, you make no mistake, I know just what it takes To pull a man's soul back from heaven's gates I've been wandering in the dark about as long as sin But they say it's never too late to start again
The Solution
After much soul searching and spelunking in the dark (origins) we discovered that if you just listen to the signal_deliver tracepoint, your OOM-selected program will get a kill signal and you can use that as a hook to analyze the program's memory. But programs are receiving signals left and right, so we need some way to discern which signals are worthy of a memory profile. Originally, when we were just profiling OOMs, we used the kill signal number to help out here, but then we realized programs can be killed outside the OOM context and it would be nice to profile programs in more situations, so we went with a two-step process. Step 1 is an eBPF program attached to the oom/mark_victim tracepoint where we scribble the pid away in a map, and step 2 is to attach to the signal/signal_deliver tracepoint and check if the pid receiving the signal is the pid in the map. During an OOM the two tracepoints fire automatically (Linux kills your program for you after picking a victim). But now we can also profile a Go program at any point outside the OOM context by scribbling its pid in the map and sending that program a harmless signal (something like SIGUSR2). The OOMProf library has a convenience function for doing just that, ProfilePid. This means that the Parca Agent will soon be able to continuously monitor heaps without having to set up external "scraping".
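To make the on-demand path a little more concrete, here's a minimal sketch of the trigger side only. In practice you'd call oomprof's ProfilePid and let it handle both steps; this standalone toy skips step 1 (registering the pid, which oomprof does via its eBPF map) and just shows step 2, poking the target with SIGUSR2 so that signal_deliver fires in its context. The program and its argument handling are ours, not OOMProf's API.

```go
package main

import (
	"log"
	"os"
	"strconv"
	"syscall"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatal("usage: trigger <pid>")
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		log.Fatalf("bad pid %q: %v", os.Args[1], err)
	}

	// Step 1 (not shown): oomprof records the target pid in the eBPF map that
	// its signal/signal_deliver program checks.
	// Step 2: deliver a harmless signal so the tracepoint fires in the target's
	// address space and the profile can be read out.
	if err := syscall.Kill(pid, syscall.SIGUSR2); err != nil {
		log.Fatalf("sending SIGUSR2 to %d: %v", pid, err)
	}
}
```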
Well the sheriff let me go with a knife and a song So I took the first train up to Oregon And I killed the first man that I came upon 'Cause the devil works quick, you know, it don't take long
What could go wrong?
At this point you might be asking yourself:
- Can an eBPF program even reliably read the memory of an OOM victim?
- Does eBPF have enough instruction headroom to do a reasonable job of reading a profile?
- How many yards to a mile?
And the answers, of course, would be: yes, most of the time; yes, we think so; and nobody knows.
Reading the memory profile of a Go program requires traversing a linked list of 'mbuckets'. Each bucket read is a separate bpf_probe_read_user call, and if any of them fail we'll get an incomplete profile, which could still be helpful if the failure is near the end, or completely useless if we miss all the important buckets. Go programs can have an unbounded number of buckets, but with the default sampling rate of 512kb even a pretty large, busy program won't have millions of them. We've tested programs with up to 100k buckets, and by default the eBPF program today can handle up to 60k. If you have a program that needs more than that, please reach out! If the oompa CLI tool fails to generate a profile that looks like the one you get from Go's built-in pprof, you might be hitting this limit. The parca-agent will also log a warning if this happens.
Because the program is dying, we have to completely read out any information we want from it. Right now this means copying all the memory profiling buckets into a pre-allocated eBPF map, so the 60k limit is really a cap on memory usage: 60k buckets translates into ~40mb of eBPF map space. Currently we can get to the 60k mark with only ~20 iterations of our program (actually tail calls), so in theory, if the map were made bigger, we could support up to ~100k buckets with the full 30 tail calls the kernel allows.
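As a back-of-the-envelope sanity check on those numbers (our arithmetic based on the figures above, not measurements from OOMProf):

```go
package main

import "fmt"

func main() {
	const (
		buckets   = 60_000   // current bucket cap
		mapBytes  = 40 << 20 // ~40mb of pre-allocated eBPF map space
		tailCalls = 20       // roughly what it takes to copy 60k buckets today
		maxCalls  = 30       // tail-call budget mentioned above
	)
	fmt.Printf("≈ %d bytes of map space per bucket\n", mapBytes/buckets)    // ≈ 699
	fmt.Printf("≈ %d buckets copied per tail call\n", buckets/tailCalls)    // ≈ 3000
	fmt.Printf("≈ %d buckets with the full budget\n", maxCalls*buckets/tailCalls) // ≈ 90000, i.e. ~100k
}
```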
In order to read the profiling buckets we need the address of the runtime.mbuckets global variable, so unfortunately we currently can't profile Go programs that are stripped (which includes go test/go run temp binaries). Another caveat is that Go memory profiling isn't on by default: your program must cause the memory profiling infrastructure to be linked in somehow, usually by exposing a debug/pprof endpoint. Another option is to just put a call to runtime.MemProfile(nil, false) in your program, which causes the linker to bring in the profiling infrastructure.
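For programs that don't already expose a debug/pprof endpoint, that call can live anywhere that runs at startup; an init function is one convenient spot (the placement is our suggestion, the call itself is what matters):

```go
package main

import "runtime"

func init() {
	// A throwaway read of the memory profile. Its only purpose is to reference
	// the profiling machinery so the linker keeps it in the binary.
	runtime.MemProfile(nil, false)
}

func main() {
	// ... your actual program ...
}
```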
Another problem that can foil our efforts is that the Linux kernel is rather impatient (for good reason) about getting memory freed so it can plow forward. By tying up the OOM killer with our profile-walking eBPF program, another memory request can fail and trigger the OOM killer again, at which point another kernel thread will run another OOM killer instance, skip the process that is currently being killed, and find another victim. One use case in particular can be affected by this: reporting "self" OOMs. The eBPF parca-agent itself can exhaust its resources (it's often run in a memory-limited container) and OOM, so we want to be able to report that. But how does the reporter report its own failure? Like a wayward teenager, this requires help from a "parent" process. Luckily the parca-agent already uses this mechanism for reporting panics and other expected crashes, so we just have to wire the parent up to oomprof and let it report the child's errant ways. But what if the situation above happens and the secondary OOM victim is the parent process? We've seen this happen with synthetic tests but are unsure if it will manifest in the real world. It may be that we need a sacrificial "canary" process (with a high oom_score_adj) that carries some memory "ballast" to make OOM reporting more predictable, but our experiments with that path got messy, so we're avoiding it for now and will wait to see how this thing does in the field before looking at it again.
Using OOMProf
So how do you use this thing? OOMProf is a library, so it doesn't do much for you beyond turning a Go program's profiling data into pprof profiles. We added a simple CLI tool to kick the tires, but the turnkey solution is available today as part of the Parca Agent: if you are running the Parca Agent, you just need to enable the feature with the --enable-oom-prof flag to get OOM memory profiles automatically uploaded to Polar Signals cloud or your on-prem Parca instance. If you are interested in using the library directly, check out the README.md for an example.
Oh, when, oh, when Will the spirit come a-callin' for my soul to send? Oh, when, oh, when Will the keys to the kingdom be mine again?
Future Plans
Rome wasn't built in a day! We'd like to add support for these features:
- Add support for jemalloc, tcmalloc, and mimalloc profiling for Rust/native programs
- Add support for stack/goroutine dumps; if we fail or can't gather a memory profile it would be nice to have something!
If there's anything else you'd like to see, stop by OOMProf and create an issue, or better yet, drop a PR!
Then I went to the river for to take a swim You know that black river water is as black as sin And I washed myself clean as a newborn babe And then I picked up a rock for to sharpen my blade