Almost every programmer is intuitively familiar with the call stack, even if they don't know that's what it's called.
When a function f calls g which in turn calls h, we know that when h returns, the computer will continue executing code from g, and that when g returns, it will continue in f. Thus, somewhere in its memory, the computer must keep track of a stack of code locations, so it knows where to return to in each case.
Furthermore, programmers rely on various tools being able to accurately find and report this call stack; for example, for debugging or profiling purposes. You might know that executing bt at a gdb command prompt prints out the current stack trace of the target process, or that sampling profilers like Parca or Polar Signals Cloud use stack traces to build a hierarchical model of code cost. But how does that actually work, in practice?
We focus here on native-compiled code on 64-bit ARM on Linux; for example, programs produced by C, C++, or Rust. The representation of call stacks for virtual-machine environments like the JVM or V8 can be considerably different. Call stacks on x86, or on platforms other than Linux, may differ in details but are conceptually similar.
Conceptual Overview
Supposing the calls described above (f calling g, which calls h), happened at lines 11 and 22 in main.c respectively, and we're now about to execute line 33, the conceptual call stack looks something like this:
| function | location |
|---|---|
| h | main.c:33 |
| g | main.c:23 |
| f | main.c:12 |
When h is finished executing, the program pops its entry from the top of the stack, inspects the next-highest entry, sees that it needs to continue executing g at main.c:23, and executes an instruction telling the processor to jump to there.
Each of the rows in these table corresponds to what is called a stack frame, which is the set of information relevant to one level of nested function calling.
Various tools need to be able to recover the information in this call stack while the program is running. For example, profilers like Parca and Polar Signals Cloud need to find the chain of calls that led to the current position, in order to determine which functions are most expensive and build visualizations like flame graphs. Debuggers like gdb or lldb need to be able to print backtraces in response to user queries. And in some languages, the language runtime itself needs to be able to interpret a program's call stack, for example to support language features with stack-unwinding semantics like C++ exceptions or Rust panics.
Physical Representation
If Parca (or other software that takes stack traces) could simply read some data structure from memory that looks like the conceptual call stack depicted above, its job would be trivially easy. In reality, things are not so simple, and the physical representation differs from the conceptual one for at least two main reasons:
- The physical call stack does not actually contain human-intelligible code location information like function names and source files and line numbers. These are higher-level concepts that are not relevant to the processor, which only knows about addresses. Consequently, the physical call stack only contains the 64-bit memory address of the next instruction to be executed when returning from a given frame. As long as function name symbols haven't been stripped from the binary, tools can recover human-readable function names from these addresses by examining the binary's symbol table; if it was compiled with additional debug symbols, they will also be able to recover source file and line data. This process, called symbolization (or, in some discussions, symbolication), is outside the scope of this post.
- The stack does not just contain return addresses; it contains all the local data required by a function. In particular, stack frames are not all the same size.
For a concrete example, consider this function, which recursively computes the Nth element in the Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, and so on, where each element is computed by summing the two previous ones.
uint64_t fib(uint64_t n) {
if (n <= 1)
return n;
uint64_t a = fib(n - 1);
uint64_t b = fib(n - 2);
return a + b;
}
When compiled for 64-bit ARM, a typical compiler might produce something like the following:
fib: ;; Subtract 32 from sp, and store the result back in sp. ;; Then store the pair of registers x29 and x30 at the memory ;; location now pointed to by sp. stp x29, x30, [sp, -32]! ;; offset 0 ;; Set x29 = sp mov x29, sp ;; offset 0x4 ;; Store the pair of registers x19 and x20 at the memory location ;; pointed to by (sp + 16), without modifying sp. stp x19, x20, [sp, 16] ;; offset 0x8 ;; Set x19 = x0 mov x19, x0 ;; offset 0xC ;; if (x0 <= 1) ... cmp x0, 1 ;; offset 0x10 ;; goto .L1 bls .L1 ;; offset 0x14 ;; set x0 -= 1 sub x0, x0, #1 ;; offset 0x18 ;; call fib: that is, set x30 to the address of the following instruction, ;; and jump to the beginning of `fib`. bl fib ;; offset 0x1C ;; set x20 = x0 mov x20, x0 ;; offset 0x20 ;; set x0 = x19 - 2 sub x0, x19, #2 ;; offset 0x24 ;; call fib: that is, set x30 to the address of the following instruction, ;; and jump to the beginning of `fib`. bl fib ;; offset 0x28 ;; set x0 += x20 add x0, x20, x0 ;; offset 0x2C .L1: ;; Load the pair of registers x19 and x20 from the memory location ;; pointed to by (sp + 16), without modifying sp. ldp x19, x20, [sp, 16] ;; offset 0x30 ;; Load the pair of registers x29 and x30 from the memory location ;; pointed to by sp. Then, set sp += 32. ldp x29, x30, [sp], 32 ;; offset 0x34 ;; return: that is, jump to the memory address in executable code pointed to by ;; x30. ret ;; offset 0x38
A few things to note while examining this snippet:
spis a register that, by convention, points to the top of the call stack. Call stacks grow downward in memory, so the stack frame for the currently executing function (which we call the "top" of the stack) is actually at the lowest memory address.- Since
bl(the ARM call instruction) sets x30 to the next address and jumps to the callee, and sinceretsimply jumps to x30, it's clear thatx30is intended to be used to represent call return addresses. Thex30register is sometimes referred to by the alternate namelr, for "link register". - By convention,
x0is used as both the first argument of a function, as well as its return value. When a function is entered, it will expect the caller to have setx0to its first argument. At the end of function execution, after a program executesretto jump to the return address, the caller code will expect it to have setx0to its result. - By convention,
x0andx30are examples of caller-saved or volatile or call-clobbered registers (all these terms mean the same thing). This means that any function is allowed to modify them however it likes; if a functionfcalls a functiong, thenfcannot rely onx0orx30having the same value after callinggas it had before callingg. Iffwants to access that value after a call, it needs to save it somewhere else. - By convention,
x19andx20are callee-saved or non-volatile registers. This means, unsurprisingly, the opposite of the property ofx0discussed above. Iffcallsg, thenfis allowed to assume that the values ofx19andx20are the same before the call as after it. Ifgwants to usex19andx20for its own purpose, then it must save the previous values somewhere, and restore them before returning control to its caller. - The function does not use
x29for any useful purpose. It just sets it to a particular value, and dutifully stores and restores it, but never actually reads its value. What is the point of this register? That will become clear in due time.
Putting this all together, we can describe in prose what the function does.
- Decrements
spby 32, thus allocating a new stack frame of 32 bytes. Uses half the frame to store its return address (inx30), as well as the mysterious seemingly-unused registerx29. Setsx29to the new value ofsp. This entire sequence is called the function prologue. - Saves
x19andx20in the remaining half of the stack frame, because it plans to use them, and they're callee-saved registers so it's not allowed to modify them without storing them later. - Saves its first argument (
n) inx19, because later calls will clobber its current home ofx0. - If its first argument is 0 or 1, jump to the step 6 below.
- Otherwise, call
fibwith argumentn - 1, temporarily store the result inx20, then call it again with argumentn - 2, and finally store the sum of the two return values inx0. - By now, the intended return value is in
x0, regardless of whether we got here from the base-case handling in step 4, or have actually executed step 5. Restore the saved values ofx19,x20,x29, andx30from the stack. The first two are necessary because they are callee-saved registers by convention, and so we need to return with them in a clean state. The final one,x30, is necessary because we need its old value, which has been clobbered by intervening calls. It's not immediately clear what we need the old value ofx29for. After restoring these, deallocate the stack frame by adding 32 tosp, thus restoring its previous value. Finally, executeret, which returns control to the address inx30. This entire sequence is called the function epilogue.
Assume that within main, we call fib(3). Assume that the top of the stack was at 0x10000 when we made our first call, that the return address in main is 0x10000000 and that the code for fib is loaded at 0x10001000.
Once we get to fib(1) (via the second call in fib(3): uint64_t b = fib(n - 2);) and are at the start of the epilogue, the stack will look like this:
| address | value | notes |
|---|---|---|
| 0xFFC0 | 0xFFE0 | old value of x29 in fib(3), saved in fib(1) |
| 0xFFC8 | 0x1000102C | return address in fib(3), x30 saved in fib(1) |
| 0xFFD0 | 3 | old value of x19 in fib(3) (stashed copy of argument), saved in fib(1) |
| 0xFFD8 | 1 | old value of x20 in fib(3) (return value of fib(2)), saved in fib(1) |
| 0xFFE0 | 0x10000 | value of x29 in main, saved in fib(3) |
| 0xFFE8 | 0x10000000 | return address in main, x30 saved in fib(3) |
| 0xFFF0 | ??? | old value of x19 in main, saved in fib(3) |
| 0xFFF8 | ??? | old value of x20 in main, saved in fib(3) |
| 0x10000.. | ... | More values follow, for the stack frame of main, its caller, and so on. |
And the values of the registers are:
| register | value | notes |
|---|---|---|
x0 | 1 | This will be interpreted by the caller as the return value: fib(1) equals 1 |
x19 | 1 | Used as scratch; will be overwritten in the epilogue |
x29 | 0xFFC0 | Will be overwritten in the epilogue |
sp | 0xFFC0 | Top of stack; happens to equal x29 in this case. |
pc | 0x10001030 | Program counter; the next instruction to be executed |
(x30 and x20 were not used in this frame, but will be restored in the epilogue anyway).
Given the information we know so far, you could have filled in all the values in these tables by carefully tracing through the assembly code, with the exception of the value 0x10000 at 0xFFE0. To determine that, you'd have to know the purpose of the register x29, which we will discuss now.
The Simplest Unwinding Technique: Frame Pointers
It's now time to discuss the purpose of the mysterious and apparently unused x29 register. If the program is interrupted by a profiler or debugger at the point mentioned previously; that is, in fib(1) called by fib(3) immediately before executing the epilogue, it can read x29 from the target process's context to see the value 0xFFC0. This points to the location where the old value of x29 was saved on the stack in the prologue. It can then read the memory at that location to see the value 0xFFE0, which points to where the previous frame stored the value of x29 on its entry. This location holds 0x10000, which points to where main stored x29 in its frame, and so on.
And now the meaning of x29 becomes clear: it is the initial pointer to a linked list of stack frames. For this reason, x29 is often called the frame pointer register. The profiler or debugger need only walk this linked list to discover all the frames, from the currently executing code all the way back to the initial pre-main startup function. It can use the fact that by convention, the return address (x30) is always stored in the memory immediately following x29 on the stack to discover the location of the code that invoked each frame.
In our example, sp is always equal to x29 in the function body, but this doesn't necessarily have to be true, for example because different code paths might dynamically allocate different amounts of stack space. But, assuming the compiler follows the frame pointer convention, x29 will always point to the location of the old x29 on the stack, allowing this simple stack unwinding technique to work.
A Slightly Less Simple Technique: Unwinding via .eh_frame
Now that we know about frame pointers, it seems that writing a working stack unwinder isn't too difficult. Are we done now? Unfortunately, not quite.
Although frame pointers are useful for profilers, debuggers, or any other tools that need to unwind stacks, they are under most circumstances not useful to the program itself. While programs do sometimes need to know how to unwind their own stacks (for handling C++ exceptions or Rust panics, or for printing a backtrace in an error message), it's not something that's usually necessary in normal, happy-path code.
For that reason, engineers often choose not to follow the frame pointer convention, by building code in such a way that frame pointers are not used (for example, by compiling with GCC or Clang's -fomit-frame-pointer flag). This can be a slight space and time optimization: the program doesn't need to reserve stack space for the frame pointers or execute instructions to save and restore them, and it can use x29 as a general register for any purpose. Without any other way of unwinding stacks, this micro-optimization would come at a severe cost: it would prevent profilers and debuggers from getting useful stack traces, and prevent C++ exceptions from working at all.
To avoid these significant downsides, another way of unwinding stacks had to be developed. On typical Linux systems, this is done by including a section in the binary called .eh_frame -- the "eh" stands for "Exception Handling". This section contains instructions for each function in the program that describe how to unwind a stack from any point in the program.
I won't go into the details of the structure of .eh_frame here. Suffice it to say that it contains information like "when stopped at location 0x1C0 in the binary, the previous value of sp can be found by adding 0x2C to the current value of sp. The previous value of x30 can be found at the address pointed to by sp + 0x14." This specific example is made up, but should demonstrate how the section gives enough information to find old values for registers and old stack frames without a frame pointer being present. Because the compiler, while compiling, knows where everything is on the stack, it is able to generate this data precisely and correctly.
Stack unwinding then becomes a matter of parsing the .eh_frame data for the target program, and using the instructions found there to figure out how to unwind stacks. Despite being more complex, this technique works well enough that it's the primary way native-code stacks are unwound in Parca, which doesn't even attempt to follow frame pointers.
Conclusion
Hopefully, it should now be clear what a call stack is, how programs use it for call-and-return flow, and how tools like profilers and debuggers can "walk" or "unwind" the stack in order to accurately report stack traces in terms of code addresses.
A few questions might remain: how are profilers actually invoked? That is, what actually causes the profiler to "wake up" several times a second and start walking the stack? And how do profilers or debuggers do "symbolization"; that is, translating code addresses to human-readable function names?
Perhaps most interestingly, how do profilers report call stacks for virtual-machine environments like V8, Python, or the JVM, where the code might not correspond in any straightforward way to physical machine code, and which certainly don't have to follow the platform's native-code ABI conventions?
Touching on those topics would have made this post too long, but we hope to cover them in a future episode of "How Profilers Work". Stay tuned!
AI Usage Disclaimer: This post was reviewed for accuracy by Claude Code, which made a few technical and stylistic corrections, but no substantial changes to wording or content.