Python Zebra Stacks | Polar Signals

The Problem: Truncated PyTorch Stacks

PyTorch Lightning training workloads can get complicated! Like any powerful and flexible framework, there's a lot of indirection, interfaces and modularity to accommodate a very diverse set of use cases and hardware. Dynamic languages can be, well, dynamic. One of our customers found that the stacks being reported were truncated, which means we hit an error while unwinding the stack from the leaf to the root. Two problems needed to be addressed to fix it, and we'll discuss both since each is interesting from a BPF development perspective. If you don't care about BPF development you might be in the wrong room!

Both problems are related to the fact that these stacks are very deep and training workloads tend to be long running.

Read errors

The first problem we encountered was an error in the Python unwinder when calling bpf_probe_read. This can happen when an address is bogus, but it can also happen when an address is perfectly fine and the memory in question has been paged out. Since BPF programs execute in a privileged kernel context, page fault handling is disabled. In this case the address was good but the memory was paged out. Normally when this happens you just give up and bail, but this was happening to a lot of samples, like 1/3 of them! So this wasn't some blip we could ignore.

In a typical program, stacks grow and shrink, and as the program runs pages get touched and stay resident. In a long-running program, though, the frames near the root of the stack can stay in place for hours or even days, as they do during long training workloads. During that time memory demands may cause things to get paged out. And since eBPF reads don't trigger page faults, every time we walk the stack we'll bail and error out on any read that fails.

The solution in this case is to just keep going! It turns out the pointers we were faulting on aren't necessary to keep walking the stack, so we can just record the pointer and walk to the next frame. The pointer in question will be read again in userspace, where the OS will handle the page fault, and the next time we walk the stack the read error will go away. It's a tradeoff: do we care more about getting complete stacks, or do we not want the profiler to add memory pressure by paging in things the program doesn't need? We think a little extra memory pressure is worth getting complete stacks, and in the end the OS will find something else to evict if it has to.
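
Here's a minimal userspace sketch of the keep-going idea, not the code from the PR. Everything in it is hypothetical: struct frame and struct code_obj are a toy stand-in for the interpreter's frame layout, and try_read plays the role of bpf_probe_read by failing whenever the target is marked as paged out. The point is only that a failed read marks the frame as unresolved and records the pointer for userspace, while the walk continues through the link to the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FRAMES 16

/* Toy model (assumption): `code` points at per-function metadata that may be
 * paged out, `prev` links to the caller's frame. */
struct code_obj { uint32_t first_line; bool paged_out; };
struct frame { const struct code_obj *code; const struct frame *prev; };

struct sample {
    const struct code_obj *code_ptr[MAX_FRAMES]; /* always recorded */
    uint32_t first_line[MAX_FRAMES];             /* 0 when unresolved */
    bool unresolved[MAX_FRAMES];                 /* userspace must re-read */
    size_t n;
};

/* Hypothetical stand-in for bpf_probe_read: fails instead of faulting in. */
static int try_read(uint32_t *dst, const struct code_obj *src)
{
    if (src->paged_out)
        return -1;
    *dst = src->first_line;
    return 0;
}

/* Walks from the leaf towards the root. A failed metadata read does not
 * abort the walk: the pointer is kept and the frame is flagged instead. */
static size_t walk(const struct frame *leaf, struct sample *out)
{
    out->n = 0;
    for (const struct frame *f = leaf; f && out->n < MAX_FRAMES; f = f->prev) {
        size_t i = out->n++;
        out->code_ptr[i] = f->code;
        out->first_line[i] = 0;
        out->unresolved[i] = try_read(&out->first_line[i], f->code) != 0;
    }
    return out->n;
}
```

With three frames where the middle one's code object is paged out, walk still returns 3 and leaves unresolved set on the middle frame, so userspace knows which pointer it has to read (and fault in) itself.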

All the gory details can be found in the PR.

Tail Call Limit

The second problem we faced was hitting the BPF tail call limit. To understand why we need to look at a typical PyTorch training stack:

Trainer.fit
  _PyFunction_Vectorcall          ← CPython C runtime
  _PyEval_EvalFrameDefault        ← CPython interpreter loop
  do_call_core                    ← CPython dispatch
Trainer._fit_impl
  _PyFunction_Vectorcall
  _PyEval_EvalFrameDefault
  do_call_core
Trainer._run_stage
  _PyFunction_Vectorcall
  _PyEval_EvalFrameDefault
  do_call_core
_FitLoop.run
  ...25+ more Python frames, each separated by a few native frames...

Every Python function call passes through CPython's C eval loop — _PyFunction_Vectorcall → _PyEval_Vector → _PyEval_EvalFrame → _PyEval_EvalFrameDefault → do_call_core — creating roughly 5 native frames between each Python frame. A typical ML training stack has 25+ Python frames, which means 25+ transitions between Python and native code.

The stacks were ending with "unwinding aborted due to error max_tail_calls" errors. To understand why, we need to understand how eBPF profilers unwind mixed-language stacks.

How eBPF Stack Unwinding Works

The eBPF profiler uses separate unwinder programs for each language runtime. There's a native unwinder that handles standard C calling convention frames using DWARF stack deltas, and interpreter-specific unwinders for Python, Ruby, Java HotSpot, V8, PHP, etc.

When a profiling sample fires, the native unwinder starts walking the stack. When it hits a frame belonging to an interpreter (detected via memory mappings), it does a BPF tail call to hand off to the appropriate interpreter unwinder. The interpreter unwinder processes its frames, then tail calls back to the native unwinder when it hits native code again.

native_unwinder → tail_call → python_unwinder → tail_call → native_unwinder → ...

Why Tail Calls?

BPF programs have a hard limit of 1 million verified instructions. The verifier statically analyzes every possible execution path through the program, and the sum of all instruction visits must stay under 1M. This is how the Linux kernel knows the program will terminate and is safe to run in a privileged, interrupt-free state.

A single unwinder program that handled every language would blow through this limit or would be extremely constrained in the number of frames it could walk. Currently our unwinders handle ~5-12 frames per program, and it's common for programs to have dozens or even hundreds of frames in the call stack.

Tail calls solve this by splitting the work across multiple programs. Each bpf_tail_call() transfers control to a new program with its own fresh instruction budget. The native unwinder processes a few frames, tail calls to Python, Python processes a few frames, tail calls back to native — each invocation staying well under the 1M limit.

The Tail Call Limit

But tail calls have their own limit: 33 per chain (the profiler conservatively caps at 29 to leave room for cleanup). Each Python ↔ native transition costs one tail call in each direction. With 25+ transitions in a PyTorch training stack, the profiler was hitting the tail call limit and truncating the stack mid-unwind.
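
As a back-of-the-envelope check, here's that arithmetic as a small C sketch. The names are made up for illustration, and the model is a lower bound: it counts only the handoffs between the two unwinders (one into Python, one back to native per transition), not the extra tail calls an unwinder needs when it runs out of frames per program.

```c
#include <stdbool.h>

enum {
    KERNEL_TAIL_CALL_LIMIT = 33, /* per chain, as stated above */
    PROFILER_TAIL_CALL_CAP = 29  /* the profiler's conservative cap */
};

/* Each Python <-> native transition costs one tail call in each direction. */
static unsigned handoff_tail_calls(unsigned transitions)
{
    return 2 * transitions;
}

/* True if the handoffs alone still fit under the profiler's cap. */
static bool fits_under_cap(unsigned transitions)
{
    return handoff_tail_calls(transitions) <= PROFILER_TAIL_CALL_CAP;
}
```

Under this model 14 transitions (28 tail calls) still fit, 15 do not, and the 25 transitions of a typical training stack need 50, well past both the cap and the kernel limit.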

The Fix: Combining the Unwinders

The solution was to combine the Python and native unwinders into a single BPF program that handles both in one loop. Instead of:

for (i = 0; i < FRAMES_PER_PROGRAM; i++) {
    unwind_one_python_frame();
    if (done) break;
}
tail_call(PROG_UNWIND_NATIVE);  // hand off to native unwinder

The combined approach uses a single loop that switches between unwinding modes without tail calls:

for (u32 t = 0; t < python_native_loop_iters; t++) {
    switch (unwinder) {
    case PROG_UNWIND_PYTHON:
        error = step_python(record, pyinfo, &py_frame, &unwinder);
        break;
    case PROG_UNWIND_NATIVE:
        error = step_native(record, &unwinder);
        break;
    default:
        goto done;
    }
}

Each iteration handles one frame — either Python or native — and the unwinder variable tracks which mode to use next. Transitions between Python and native code happen within the loop with zero tail calls.

The deep PyTorch stack that previously needed ~50 tail calls now unwinds fully within roughly a dozen program invocations.
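
To see why the combined loop changes the math, here's a userspace model of it. This is a sketch under assumptions, not the profiler's code: the stack is just an array of frame kinds, every frame costs exactly one loop iteration, and a fresh program invocation (one tail call) is only needed when the loop budget runs out. The names combined_walk and walk_stats are invented for the example.

```c
#include <stddef.h>

enum frame_kind { FRAME_NATIVE, FRAME_PYTHON };

struct walk_stats {
    unsigned invocations;   /* program runs; each one after the first is a tail call */
    unsigned mode_switches; /* Python <-> native transitions handled inside the loop */
};

/* Walks `n` frames, `iters` frames per program invocation, switching
 * between unwinding modes in the loop instead of via tail calls. */
static struct walk_stats combined_walk(const enum frame_kind *stack, size_t n,
                                       unsigned iters)
{
    struct walk_stats st = {0, 0};
    size_t pos = 0;

    if (iters == 0)
        return st; /* no budget, nothing can be unwound */

    while (pos < n) {
        st.invocations++;
        for (unsigned t = 0; t < iters && pos < n; t++, pos++) {
            if (pos > 0 && stack[pos] != stack[pos - 1])
                st.mode_switches++; /* free: no tail call needed */
        }
    }
    return st;
}
```

For a synthetic zebra stack of 25 Python frames, each followed by 4 native frames (125 frames total), a budget of 12 iterations gives 11 invocations while absorbing 49 mode switches in the loop. In this model the tail call count depends only on stack depth and loop budget, not on how often the stack alternates.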

The Instruction Budget Challenge

Combining two unwinders into one program means the BPF verifier has to analyze both unwinders' code paths in a single pass. The native unwinder alone consumes ~220K verified instructions. Inlining it into the Python unwinder, multiplied across loop iterations and branch paths, pushes the combined program past the 1M limit. We can reduce the number of frames to make the verifier happy but this leads to the inability to walk deep stacks. Can we support mixed stacks like this but not sacrifice any stack depth?

The DEBUG_PRINT Problem

A significant portion of the instruction budget goes to debug logging. The profiler uses a DEBUG_PRINT macro that compiles to bpf_trace_printk:

DEBUG_PRINT("==== unwind_native %d ====", trace->num_frames);

This expands to a runtime check of a global variable followed by the format string construction:

; if (with_debug_output) {
ldxw     r1, [r6+0x0]           ; load with_debug_output
jeq      r1, 0x0, +0x19         ; skip if debug off

; Push format string onto BPF stack 
mov64    r1, 0xa3d3d            ; "==\n\0"
stxw     [r10-0x48], r1         ; store to stack
lddw     r1, 0x3d3d206425206576 ; "ve %d =="
stxdw    [r10-0x50], r1
lddw     r1, 0x6974616e5f646e69 ; "ind_nati"
stxdw    [r10-0x58], r1
lddw     r1, 0x776e75203d3d3d3d ; "==== unw"
stxdw    [r10-0x60], r1

; Set up call arguments
ldxh     r3, [r9+0x2be]         ; trace->num_frames
mov64    r1, r10
add64    r1, -0x60              ; r1 = pointer to format string
mov64    r2, 0x1c               ; r2 = string length (28)
call     0x6                    ; bpf_trace_printk()

Because of how the DEBUG_PRINT macro works, every format string must be constructed on the stack at runtime, one 8-byte immediate at a time. A single DEBUG_PRINT with a 28-byte format string compiles to 13 instructions. With dozens of debug prints in the unwinder code, each guarded by an if (with_debug_output) check, the verifier must analyze both the taken and not-taken paths.
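
If you want to convince yourself that those immediates really are the format string, the short sketch below replays the four stores from the listing in userspace. The helper names are invented for the example; the constants and offsets are copied from the disassembly, and bytes are written least significant first, which is how a little-endian BPF target lays them out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes the low `nbytes` bytes of `imm` to `dst`, least significant first. */
static void store_le(char *dst, uint64_t imm, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        dst[i] = (char)((imm >> (8 * i)) & 0xff);
}

/* Rebuilds the 28-byte buffer that starts at r10-0x60. */
static void build_format(char out[28])
{
    store_le(out + 0,  0x776e75203d3d3d3dULL, 8); /* stxdw [r10-0x60] */
    store_le(out + 8,  0x6974616e5f646e69ULL, 8); /* stxdw [r10-0x58] */
    store_le(out + 16, 0x3d3d206425206576ULL, 8); /* stxdw [r10-0x50] */
    store_le(out + 24, 0x0a3d3dULL, 4);           /* stxw  [r10-0x48] */
}

/* Compares against the source-level string plus the macro's "\n". */
static int matches_source_string(const char buf[28])
{
    static const char expected[] = "==== unwind_native %d ====" "\n";
    return sizeof(expected) == 28 && memcmp(buf, expected, 28) == 0;
}
```

The rebuilt buffer matches the source string with its trailing newline and NUL terminator, which is also where the 0x1c (28) passed in r2 comes from.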

To keep things efficient, with_debug_output is a read-only variable (BPF .rodata), so the verifier knows its value at load time. When debug output is disabled (with_debug_output = 0), the verifier prunes the entire debug branch: it knows the condition is always false and skips analyzing the string construction and the bpf_trace_printk call. When debug output is enabled, the verifier must analyze all of those instructions.

This means the verified instruction count depends on whether debug output is enabled:

Mode	Verified Instructions	% of 1M Limit
Debug OFF	476,052	47.6%
Debug ON	668,515	66.9%

(both measured at the original compile-time default of 9 loop iterations)

With debug output enabled, the verifier explores roughly 40% more instructions due to the debug branches.

Dynamic Loop Tuning via RODATA Variables

The combined unwinder's loop iteration count directly controls how many frames it can process per invocation. More iterations = more frames unwound, but more verified instructions. We needed to find the sweet spot for both debug-on and debug-off modes.

The solution was to make the loop count a RODATA variable instead of a compile-time constant:

// Old: fixed at compile time
#define PYTHON_NATIVE_LOOP_ITERS 9

// New: tunable at load time
BPF_RODATA_VAR(u32, python_native_loop_iters, 6)

At program load time, the host agent sets the loop count based on whether debug output is enabled:

if cfg.VerboseMode {
    // Debug output enabled — verifier explores debug branches,
    // leaving less room for loop iterations
    // Default of 6 is already correct
} else {
    // Debug output disabled — verifier prunes debug branches,
    // freeing up instruction budget for more iterations
    coll.Variables["python_native_loop_iters"].Set(uint32(12))
}

This gives us the best of both worlds:

Mode	Loop Iterations	Verified Instructions	Headroom
Debug ON (verbose)	6	452,849	55%
Debug OFF (production)	12	913,603	9%

In production (debug off), the profiler unwinds up to 12 frames per program invocation — a 2x improvement over the 6-frame debug mode — while staying safely under the 1M instruction limit.
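
The percentages in the two tables are easy to re-derive. The helpers below are purely illustrative (the names are not from the profiler) and use integer math rounded to the nearest percent, with the instruction counts taken from the tables.

```c
#include <stdint.h>

#define BPF_VERIFIER_INSN_LIMIT 1000000u

/* Headroom left under the 1M verifier limit, rounded to the nearest percent. */
static unsigned headroom_percent(uint32_t verified_insns)
{
    if (verified_insns >= BPF_VERIFIER_INSN_LIMIT)
        return 0;
    uint64_t left = BPF_VERIFIER_INSN_LIMIT - verified_insns;
    return (unsigned)((left * 100 + BPF_VERIFIER_INSN_LIMIT / 2) /
                      BPF_VERIFIER_INSN_LIMIT);
}

/* How much larger `value` is than `base`, rounded to the nearest percent.
 * Assumes base > 0 and value >= base. */
static unsigned percent_more(uint32_t base, uint32_t value)
{
    uint64_t diff = (uint64_t)value - base;
    return (unsigned)((diff * 100 + base / 2) / base);
}
```

Plugging in the table values gives 55% headroom for 452,849 instructions, 9% for 913,603, and about 40% more verified instructions for debug on (668,515) versus debug off (476,052) at 9 iterations.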

Results

We validated the fix with a coredump test using a 20-level deep Python call stack (each level going through CPython's C eval loop). This creates the same interleaving pattern seen in a PyTorch workload:

Branch	Frames Unwound	Result
Main (tail call design)	Truncated	`unwinding aborted due to error max_tail_calls`
Combined loop	138 frames	Full stack through `_start`

The combined unwinder eliminates the tail call bottleneck entirely for Python workloads, while the RODATA-based loop tuning ensures the verifier instruction budget is used optimally in both debug and production modes.

But wait there's more!

No engineering blog would be complete without a section on paths not taken! At one point we considered revamping all the DEBUG_PRINT statements on the critical path with some aggressive abbreviations and error codes so that each DEBUG_PRINT statement was smaller. What a terrible idea! Luckily the dynamic looping counts killed that idea but there was still this nagging feeling about how DEBUG_PRINT abused the stack. I'd initially dismissed it as just some weird quirk of eBPF, but eventually decided to dig in. The reason our DEBUG_PRINT used the stack is that it's defined like this:

  #define printt(fmt, ...)                                                                         \
    ({                                                                                             \
      const char ____fmt[] = fmt "\n";                                                             \
      bpf_trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__);                                   \
    })

  #define DEBUG_PRINT(fmt, ...)                                                                    \
    ({                                                                                             \
      if (__builtin_expect(with_debug_output, 0)) {                                                \
        printt(fmt, ##__VA_ARGS__);                                                                \
      }                                                                                            \
    })

Because the macro introduces a new local variable, ____fmt, holding the format string with a newline appended, the string gets built on the stack. If you just add the static keyword, all that data moves to rodata. The results are kind of staggering: most unwinders got ~40% smaller. Full results here.
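
For reference, here's roughly what the one-keyword change looks like, wrapped in a userspace harness so it can be compiled and run outside the kernel. The macros mirror the ones above with static added; fake_trace_printk, last_line and last_fmt_size are stand-ins invented for this sketch, not profiler or kernel APIs. Like the original macros, it relies on GCC/Clang extensions (statement expressions, ##__VA_ARGS__, __builtin_expect).

```c
#include <stdarg.h>
#include <stdio.h>

static char last_line[128];        /* last formatted message */
static unsigned int last_fmt_size; /* fmt_size the macro passed in */

/* Hypothetical userspace stand-in for bpf_trace_printk(). */
__attribute__((format(printf, 1, 3)))
static int fake_trace_printk(const char *fmt, unsigned int fmt_size, ...)
{
    va_list ap;
    int n;

    last_fmt_size = fmt_size;
    va_start(ap, fmt_size);
    n = vsnprintf(last_line, sizeof(last_line), fmt, ap);
    va_end(ap);
    return n;
}

static int with_debug_output = 1; /* rodata in the real program */

/* `static` gives the string static storage, so it is no longer
 * assembled on the stack every time the macro runs. */
#define printt(fmt, ...)                                          \
  ({                                                              \
    static const char ____fmt[] = fmt "\n";                       \
    fake_trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__);   \
  })

#define DEBUG_PRINT(fmt, ...)                                     \
  ({                                                              \
    if (__builtin_expect(with_debug_output, 0)) {                 \
      printt(fmt, ##__VA_ARGS__);                                 \
    }                                                             \
  })
```

Callers don't change at all: DEBUG_PRINT("==== unwind_native %d ====", n) still passes a 28-byte format (newline and NUL included). Where the string ends up in a real BPF object depends on the compiler and target; the rodata placement described above is what we observed.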

But does shrinking the eBPF programs really matter? Actually it does, because it means we can handle much bigger stacks and/or have plenty of room to enhance (or combine!) our programs and stay within the 1M verifier limit. Early testing shows we can 2x our native unwinder from 5 frames per program to 10. Python goes from 12 to 90! These numbers were found by cranking up the loop count until we hit verifier errors. Clearly the relationship between instruction size and verifier complexity is non-linear.

Do all languages need a hybrid unwinder? Probably not. JIT'd languages (Node, .NET, Java) tend not to bounce back and forth like this, but it's possible Ruby, PHP and Perl programs could benefit from such a technique.

Hopefully you enjoyed this little journey into the world of eBPF profiling. What started out as trying to support complicated PyTorch stacks led to really moving the needle on supporting large stacks for all languages!