Call Stacks and Unwinding 101

February 19, 2026

Almost every programmer is intuitively familiar with the call stack, even if they don't know that's what it's called.

When a function f calls g which in turn calls h, we know that when h returns, the computer will continue executing code from g, and that when g returns, it will continue in f. Thus, somewhere in its memory, the computer must keep track of a stack of code locations, so it knows where to return to in each case.

Furthermore, programmers rely on various tools being able to accurately find and report this call stack; for example, for debugging or profiling purposes. You might know that executing bt at a gdb command prompt prints out the current stack trace of the target process, or that sampling profilers like Parca or Polar Signals Cloud use stack traces to build a hierarchical model of code cost. But how does that actually work, in practice?

We focus here on native-compiled code on 64-bit ARM on Linux; for example, programs produced by C, C++, or Rust. The representation of call stacks for virtual-machine environments like the JVM or V8 can be considerably different. Call stacks on x86, or on platforms other than Linux, may differ in details but are conceptually similar.

Conceptual Overview

Supposing the calls described above (f calling g, which calls h), happened at lines 11 and 22 in main.c respectively, and we're now about to execute line 33, the conceptual call stack looks something like this:

function	location
h	main.c:33
g	main.c:23
f	main.c:12

When h is finished executing, the program pops its entry from the top of the stack, inspects the next-highest entry, sees that it needs to continue executing g at main.c:23, and executes an instruction telling the processor to jump to there.

Each of the rows in these table corresponds to what is called a stack frame, which is the set of information relevant to one level of nested function calling.

Various tools need to be able to recover the information in this call stack while the program is running. For example, profilers like Parca and Polar Signals Cloud need to find the chain of calls that led to the current position, in order to determine which functions are most expensive and build visualizations like flame graphs. Debuggers like gdb or lldb need to be able to print backtraces in response to user queries. And in some languages, the language runtime itself needs to be able to interpret a program's call stack, for example to support language features with stack-unwinding semantics like C++ exceptions or Rust panics.

Physical Representation

If Parca (or other software that takes stack traces) could simply read some data structure from memory that looks like the conceptual call stack depicted above, its job would be trivially easy. In reality, things are not so simple, and the physical representation differs from the conceptual one for at least two main reasons:

The physical call stack does not actually contain human-intelligible code location information like function names and source files and line numbers. These are higher-level concepts that are not relevant to the processor, which only knows about addresses. Consequently, the physical call stack only contains the 64-bit memory address of the next instruction to be executed when returning from a given frame. As long as function name symbols haven't been stripped from the binary, tools can recover human-readable function names from these addresses by examining the binary's symbol table; if it was compiled with additional debug symbols, they will also be able to recover source file and line data. This process, called symbolization (or, in some discussions, symbolication), is outside the scope of this post.
The stack does not just contain return addresses; it contains all the local data required by a function. In particular, stack frames are not all the same size.

For a concrete example, consider this function, which recursively computes the Nth element in the Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, and so on, where each element is computed by summing the two previous ones.

uint64_t fib(uint64_t n) {
    if (n <= 1)
        return n;
    uint64_t a = fib(n - 1);
    uint64_t b = fib(n - 2);
    return a + b;
}

When compiled for 64-bit ARM, a typical compiler might produce something like the following:

fib:
        ;; Subtract 32 from sp, and store the result back in sp.
        ;; Then store the pair of registers x29 and x30 at the memory
        ;; location now pointed to by sp.
        stp     x29, x30, [sp, -32]! ;; offset 0
        ;; Set x29 = sp
        mov     x29, sp              ;; offset 0x4
        ;; Store the pair of registers x19 and x20 at the memory location
        ;; pointed to by (sp + 16), without modifying sp.
        stp     x19, x20, [sp, 16]   ;; offset 0x8
        ;; Set x19 = x0
        mov     x19, x0              ;; offset 0xC
        ;; if (x0 <= 1) ...
        cmp     x0, 1                ;; offset 0x10
        ;; goto .L1
        bls     .L1                  ;; offset 0x14
        ;; set x0 -= 1
        sub     x0, x0, #1           ;; offset 0x18
        ;; call fib: that is, set x30 to the address of the following instruction,
        ;; and jump to the beginning of `fib`.
        bl      fib                  ;; offset 0x1C
        ;; set x20 = x0
        mov     x20, x0              ;; offset 0x20
        ;; set x0 = x19 - 2
        sub     x0, x19, #2          ;; offset 0x24
        ;; call fib: that is, set x30 to the address of the following instruction,
        ;; and jump to the beginning of `fib`.
        bl      fib                  ;; offset 0x28
        ;; set x0 += x20
        add     x0, x20, x0          ;; offset 0x2C
.L1:
        ;; Load the pair of registers x19 and x20 from the memory location
        ;; pointed to by (sp + 16), without modifying sp.
        ldp     x19, x20, [sp, 16]   ;; offset 0x30
        ;; Load the pair of registers x29 and x30 from the memory location
        ;; pointed to by sp. Then, set sp += 32.
        ldp     x29, x30, [sp], 32   ;; offset 0x34
        ;; return: that is, jump to the memory address in executable code pointed to by
        ;; x30.
        ret                          ;; offset 0x38

A few things to note while examining this snippet:

sp is a register that, by convention, points to the top of the call stack. Call stacks grow downward in memory, so the stack frame for the currently executing function (which we call the "top" of the stack) is actually at the lowest memory address.
Since bl (the ARM call instruction) sets x30 to the next address and jumps to the callee, and since ret simply jumps to x30, it's clear that x30 is intended to be used to represent call return addresses. The x30 register is sometimes referred to by the alternate name lr, for "link register".
By convention, x0 is used as both the first argument of a function, as well as its return value. When a function is entered, it will expect the caller to have set x0 to its first argument. At the end of function execution, after a program executes ret to jump to the return address, the caller code will expect it to have set x0 to its result.
By convention, x0 and x30 are examples of caller-saved or volatile or call-clobbered registers (all these terms mean the same thing). This means that any function is allowed to modify them however it likes; if a function f calls a function g, then f cannot rely on x0 or x30 having the same value after calling g as it had before calling g. If f wants to access that value after a call, it needs to save it somewhere else.
By convention, x19 and x20 are callee-saved or non-volatile registers. This means, unsurprisingly, the opposite of the property of x0 discussed above. If f calls g, then f is allowed to assume that the values of x19 and x20 are the same before the call as after it. If g wants to use x19 and x20 for its own purpose, then it must save the previous values somewhere, and restore them before returning control to its caller.
The function does not use x29 for any useful purpose. It just sets it to a particular value, and dutifully stores and restores it, but never actually reads its value. What is the point of this register? That will become clear in due time.

Putting this all together, we can describe in prose what the function does.

Decrements sp by 32, thus allocating a new stack frame of 32 bytes. Uses half the frame to store its return address (in x30), as well as the mysterious seemingly-unused register x29. Sets x29 to the new value of sp. This entire sequence is called the function prologue.
Saves x19 and x20 in the remaining half of the stack frame, because it plans to use them, and they're callee-saved registers so it's not allowed to modify them without storing them later.
Saves its first argument (n) in x19, because later calls will clobber its current home of x0.
If its first argument is 0 or 1, jump to the step 6 below.
Otherwise, call fib with argument n - 1, temporarily store the result in x20, then call it again with argument n - 2, and finally store the sum of the two return values in x0.
By now, the intended return value is in x0, regardless of whether we got here from the base-case handling in step 4, or have actually executed step 5. Restore the saved values of x19, x20, x29, and x30 from the stack. The first two are necessary because they are callee-saved registers by convention, and so we need to return with them in a clean state. The final one, x30, is necessary because we need its old value, which has been clobbered by intervening calls. It's not immediately clear what we need the old value of x29 for. After restoring these, deallocate the stack frame by adding 32 to sp, thus restoring its previous value. Finally, execute ret, which returns control to the address in x30. This entire sequence is called the function epilogue.

Assume that within main, we call fib(3). Assume that the top of the stack was at 0x10000 when we made our first call, that the return address in main is 0x10000000 and that the code for fib is loaded at 0x10001000.

Once we get to fib(1) (via the second call in fib(3): uint64_t b = fib(n - 2);) and are at the start of the epilogue, the stack will look like this:

address	value	notes
0xFFC0	0xFFE0	old value of `x29` in `fib(3)`, saved in `fib(1)`
0xFFC8	0x1000102C	return address in `fib(3)`, x30 saved in `fib(1)`
0xFFD0	3	old value of `x19` in `fib(3)` (stashed copy of argument), saved in `fib(1)`
0xFFD8	1	old value of `x20` in `fib(3)` (return value of `fib(2)`), saved in `fib(1)`

0xFFE0	0x10000	value of `x29` in `main`, saved in `fib(3)`
0xFFE8	0x10000000	return address in `main`, `x30` saved in `fib(3)`
0xFFF0	???	old value of `x19` in `main`, saved in `fib(3)`
0xFFF8	???	old value of `x20` in `main`, saved in `fib(3)`

0x10000..	...	More values follow, for the stack frame of `main`, its caller, and so on.

And the values of the registers are:

register	value	notes
`x0`	1	This will be interpreted by the caller as the return value: `fib(1)` equals 1
`x19`	1	Used as scratch; will be overwritten in the epilogue
`x29`	0xFFC0	Will be overwritten in the epilogue
`sp`	0xFFC0	Top of stack; happens to equal `x29` in this case.
`pc`	0x10001030	Program counter; the next instruction to be executed

(x30 and x20 were not used in this frame, but will be restored in the epilogue anyway).

Given the information we know so far, you could have filled in all the values in these tables by carefully tracing through the assembly code, with the exception of the value 0x10000 at 0xFFE0. To determine that, you'd have to know the purpose of the register x29, which we will discuss now.

The Simplest Unwinding Technique: Frame Pointers

It's now time to discuss the purpose of the mysterious and apparently unused x29 register. If the program is interrupted by a profiler or debugger at the point mentioned previously; that is, in fib(1) called by fib(3) immediately before executing the epilogue, it can read x29 from the target process's context to see the value 0xFFC0. This points to the location where the old value of x29 was saved on the stack in the prologue. It can then read the memory at that location to see the value 0xFFE0, which points to where the previous frame stored the value of x29 on its entry. This location holds 0x10000, which points to where main stored x29 in its frame, and so on.

And now the meaning of x29 becomes clear: it is the initial pointer to a linked list of stack frames. For this reason, x29 is often called the frame pointer register. The profiler or debugger need only walk this linked list to discover all the frames, from the currently executing code all the way back to the initial pre-main startup function. It can use the fact that by convention, the return address (x30) is always stored in the memory immediately following x29 on the stack to discover the location of the code that invoked each frame.

In our example, sp is always equal to x29 in the function body, but this doesn't necessarily have to be true, for example because different code paths might dynamically allocate different amounts of stack space. But, assuming the compiler follows the frame pointer convention, x29 will always point to the location of the old x29 on the stack, allowing this simple stack unwinding technique to work.

A Slightly Less Simple Technique: Unwinding via `.eh_frame`

Now that we know about frame pointers, it seems that writing a working stack unwinder isn't too difficult. Are we done now? Unfortunately, not quite.

Although frame pointers are useful for profilers, debuggers, or any other tools that need to unwind stacks, they are under most circumstances not useful to the program itself. While programs do sometimes need to know how to unwind their own stacks (for handling C++ exceptions or Rust panics, or for printing a backtrace in an error message), it's not something that's usually necessary in normal, happy-path code.

For that reason, engineers often choose not to follow the frame pointer convention, by building code in such a way that frame pointers are not used (for example, by compiling with GCC or Clang's -fomit-frame-pointer flag). This can be a slight space and time optimization: the program doesn't need to reserve stack space for the frame pointers or execute instructions to save and restore them, and it can use x29 as a general register for any purpose. Without any other way of unwinding stacks, this micro-optimization would come at a severe cost: it would prevent profilers and debuggers from getting useful stack traces, and prevent C++ exceptions from working at all.

To avoid these significant downsides, another way of unwinding stacks had to be developed. On typical Linux systems, this is done by including a section in the binary called .eh_frame -- the "eh" stands for "Exception Handling". This section contains instructions for each function in the program that describe how to unwind a stack from any point in the program.

I won't go into the details of the structure of .eh_frame here. Suffice it to say that it contains information like "when stopped at location 0x1C0 in the binary, the previous value of sp can be found by adding 0x2C to the current value of sp. The previous value of x30 can be found at the address pointed to by sp + 0x14." This specific example is made up, but should demonstrate how the section gives enough information to find old values for registers and old stack frames without a frame pointer being present. Because the compiler, while compiling, knows where everything is on the stack, it is able to generate this data precisely and correctly.

Stack unwinding then becomes a matter of parsing the .eh_frame data for the target program, and using the instructions found there to figure out how to unwind stacks. Despite being more complex, this technique works well enough that it's the primary way native-code stacks are unwound in Parca, which doesn't even attempt to follow frame pointers.

Conclusion

Hopefully, it should now be clear what a call stack is, how programs use it for call-and-return flow, and how tools like profilers and debuggers can "walk" or "unwind" the stack in order to accurately report stack traces in terms of code addresses.

A few questions might remain: how are profilers actually invoked? That is, what actually causes the profiler to "wake up" several times a second and start walking the stack? And how do profilers or debuggers do "symbolization"; that is, translating code addresses to human-readable function names?

Perhaps most interestingly, how do profilers report call stacks for virtual-machine environments like V8, Python, or the JVM, where the code might not correspond in any straightforward way to physical machine code, and which certainly don't have to follow the platform's native-code ABI conventions?

Touching on those topics would have made this post too long, but we hope to cover them in a future episode of "How Profilers Work". Stay tuned!

AI Usage Disclaimer: This post was reviewed for accuracy by Claude Code, which made a few technical and stylistic corrections, but no substantial changes to wording or content.

Discuss: