Profiling Internals: Hardware Timers and eBPF
Introduction
In a past episode of this series we discussed how stack unwinding works: how a CPU profiler, when invoked, gets the current call stack in order to render flame graphs and other visualizations. But we glossed over the question of how the profiler is invoked: what causes the stack unwinder to run several times a second, and, more importantly, how can this be done with minimal overhead? Both questions are addressed in this article.
The Main Profiling Mechanism: Hardware Timers
All mainstream CPUs (that is, anything that might be used in a server or laptop, not random microcontrollers) provide a mechanism for counting elapsed time at a high degree of accuracy. Modern operating systems like Linux have long relied on this ability: first of all for the obvious purpose of implementing APIs like nanosleep, but more relevantly for our purposes, to implement non-cooperative multitasking.
A Digression on Multitasking
A typical processor can only execute a strictly constrained number of threads at a time; for example, on the laptop I'm using to write this, that number is 16. But it's possible to have thousands (or more) tasks that appear to be running simultaneously. How does this work?
On some historic consumer OSes, like Windows 3.1, this worked by cooperative multitasking: when a task declares itself to be idle, for example by waiting for user input or other I/O, the OS kernel removes it from the CPU until the I/O result is ready and it's time for the task to run again. Only "runnable" tasks (that is, those with work to be done on the CPU) are actually scheduled. If the user had, for example, a web browser and a text editor open simultaneously, even on a single-core machine, then typing a character into the editor or clicking a link in the browser would cause the corresponding process to be scheduled to handle that input; otherwise, both would sit idle.
As the name suggests, cooperative multitasking suffers from a major limitation: it relies on user-mode code to cooperate with the OS by periodically yielding control of the CPU (e.g., by making an I/O call). A poorly-written (or malicious) program that simply runs a hot loop of CPU-bound code and never yields can monopolize the CPU, preventing anything else from running.
To address this limitation, modern OSes (in addition to traditional cooperative techniques) rely on hardware timers. The OS instructs the CPU to wait until a particular amount of time has elapsed (for example, something like 10 milliseconds) and then, regardless of what thread is running on the CPU, halt it and jump to a particular code address in the kernel. At this address lives the scheduler, which then has the control to run something else if it so chooses. Thus the illusion of many simultaneously executing tasks can be maintained even if some of them are uncooperative.
Hardware Timers and Profiling
Digression over, now back to the issue at hand...
Since virtually all of the processors Linux runs on already provide high-resolution hardware timers (for scheduling purposes), those timers were a natural choice for measuring elapsed time when perf events were implemented on Linux. Perf events is a Linux kernel feature that has existed since 2009 and which essentially provides the following interface: "when some event happens some number of times, run some code". The set of available events varies by processor; on modern high-end x86 and Arm devices it is very extensive, allowing profiling of a rich set of low-level events like cache misses or instruction pipeline stalls.
For our purposes, though, we don't need such advanced performance-monitoring capabilities; we only need a hardware timer, which, as discussed previously, is available on every modern mainstream architecture.
Putting it all together, the triggering mechanism for time profiling on Linux works like this:
- The profiler registers a perf event for the counter PERF_COUNT_SW_CPU_CLOCK (this is the counter based on the CPU timer described earlier) and configures it to fire and run some code every time some fixed amount of time has elapsed, for example 19 times a second.
- The profiler installs this event on every logical CPU core visible to the system.
- In response to the event firing, the profiler runs some code; for example, to unwind the stack of the currently-executing process and aggregate the data.
This example program demonstrates the mechanism. It does some CPU-intensive work (repeatedly calculating the 40th Fibonacci number with a naive algorithm) and "profiles" itself. It doesn't actually collect any data, it just prints a line 19 times a second, but it illustrates the principle:
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Runs every time the timer-based perf event fires. Only
 * async-signal-safe functions (like write) may be called here. */
static void handler(int sig)
{
    (void)sig;
    const char msg[] = "Interrupt fired!\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
}

/* Deliberately slow, naive recursive Fibonacci: our CPU-bound workload. */
static long fib(int n)
{
    if (n <= 1)
        return n;
    return fib(n - 1) + fib(n - 2);
}

int main(void)
{
    struct perf_event_attr attr = {
        .type = PERF_TYPE_SOFTWARE,
        .config = PERF_COUNT_SW_CPU_CLOCK, /* the timer-based counter */
        .size = sizeof(attr),
        .sample_freq = 19,                 /* fire 19 times a second... */
        .freq = 1,                         /* ...treating the value above as a frequency */
        .disabled = 1,                     /* start disabled; enabled via ioctl below */
    };

    /* Measure this process (pid 0) on any CPU (-1). glibc provides no
     * wrapper for perf_event_open, hence the raw syscall. */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    /* Ask the kernel to deliver SIGIO to us whenever the event fires. */
    signal(SIGIO, handler);
    fcntl(fd, F_SETOWN, getpid());
    fcntl(fd, F_SETFL, O_ASYNC);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    for (;;)
        printf("%ld\n", fib(40));
}
Profiling in eBPF
The Problem
The previous example omits something rather fundamental: what does the profiler actually do when it's invoked? In the example it just prints a line to stderr, which isn't particularly useful. The previous article covers most of what a real profiler does (collect stack traces), but it leaves out a crucial piece of the puzzle: how do we make it fast (and reliable)?
In the toy example above, "profiling" happens in the context of the process being profiled. handler isn't running in some other process's address space, which makes things fast: context switches are slow. But in system-wide profiling applications, we have one separate profiler program; we're neither willing nor able to inject profiling code into every target program on the system.
When using frame-pointer-based unwinding as described in the previous article, the downside is limited: walking the frame chain is simple enough that the kernel can do it by itself, and deliver the already-walked stack trace to our separate profiler program. This does incur the unfortunate context switch, but at least the size of the data being transferred between contexts is minimal.
For DWARF-based unwinding, which is what we want to use (because DWARF unwinding data is much more likely to be available in even heavily optimized builds than frame pointers), the situation is even more dire. Parsing DWARF tables and unwinding using them is far too complicated for the kernel maintainers to be willing to include in the kernel. The best solution, then, is for the kernel to copy user-mode stacks and deliver them wholesale to the profiler when the perf events fire -- the profiler can then save the stacks and later extract a stack trace from them at its leisure. This not only has a performance cost (it's more data being copied into a different process context), but it also fails to record the frames near the root of sufficiently deep stacks, because the amount of data that can be copied via this mechanism is strictly limited.
Nevertheless, this is how traditional profilers like perf work. Can we do better? Thanks to something called eBPF, it turns out we can.
What is eBPF?
eBPF[^1] began its life as classic BPF, a mechanism that let network analysis and filtering tools run small, simple programs in the kernel, taking various actions in response to the arrival of network packets.
Eventually, it was recognized that this idea of adding custom logic to the kernel from user mode has much more general potential, and beginning around 2014 classic BPF evolved into what is now called eBPF (the "e" standing for "extended"). This is, in my opinion, the most important change to the Linux kernel in the 21st century so far[^2]. The key benefit of eBPF is that it allows a user-mode program (usually with root privileges) to attach nearly arbitrary programs to respond to any of a very broad class of events in the kernel. These programs are expressed in a machine-independent "eBPF bytecode", which is JIT-compiled by the kernel into machine code for the system's native architecture. In practice, they are usually written in C (though other languages like Rust are also possible) and compiled to eBPF bytecode using Clang.
I said the programs can be nearly arbitrary, and that language was no mistake. Arbitrary programs are not possible; in fact, eBPF is not even a Turing-complete system. A verifier steps through the bytecode when it's loaded, evaluating every possible branch. If it can't prove that the program terminates after a reasonable number of instructions, and without causing any undefined behavior (for example, reading uninitialized memory or writing to memory outside what it is allowed to touch), the verifier rejects the program.
This verifier is one of the keys to eBPF's advantage over kernel modules, the traditional way of adding custom code to the kernel. Loading a kernel module is a dangerous proposition: now you're really running arbitrary code, which can cause your kernel to malfunction in arbitrary ways. As a Parca team member, I really don't want any mistake I make to be able to crash or hang your kernel. And neither do you!
Parca and eBPF
Modern eBPF-based profilers, including Parca, unwind stacks in eBPF programs. This fulfills all the requirements discussed previously:
- eBPF is capable enough on modern kernels that it is possible to implement a working DWARF-based unwinder. We don't need to rely on frame pointers, which are often disabled in practice.
- One of the broad set of scenarios in which eBPF programs are allowed to run is in response to perf events.
- Since eBPF programs run in the kernel, not in a separate process, the program actually executes in the currently running process's address space. There is no need for a context switch, and no need to copy anything anywhere; we just read the bits of memory we need directly, and pass the accumulated stacks into a buffer that our user-mode profiling process can read at its leisure.
- The constraints imposed by the verifier mean that if we write bugs, the only impact on the user is possibly missing or incorrect profiling data. Barring the presence of kernel bugs, it is theoretically impossible to take down a system by running Parca.
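As a sketch of what the in-kernel half looks like, here is a minimal eBPF program attached to a perf event. This is not Parca's actual DWARF unwinder: for brevity it uses the kernel's built-in frame-pointer stack walker via the bpf_get_stackid helper, and it needs a BPF toolchain (Clang targeting bpf, with libbpf) plus a privileged user-space loader to run. But it shows the shape of the approach: code that executes on every timer fire, in the interrupted task's context, and records its results into maps that user space reads at its leisure.

```c
#include <linux/bpf.h>
#include <linux/bpf_perf_event.h>
#include <bpf/bpf_helpers.h>

/* Storage for deduplicated stack traces, keyed by stack id. */
struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(max_entries, 1024);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, 127 * sizeof(__u64));
} stacks SEC(".maps");

/* How many samples landed on each stack. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u32);
    __type(value, __u64);
} counts SEC(".maps");

SEC("perf_event")
int profile(struct bpf_perf_event_data *ctx)
{
    /* We run in the interrupted task's context: no context switch,
     * no wholesale stack copy to another process. */
    long id = bpf_get_stackid(ctx, &stacks, BPF_F_USER_STACK);
    if (id < 0)
        return 0;

    __u32 key = (__u32)id;
    __u64 one = 1;
    __u64 *val = bpf_map_lookup_elem(&counts, &key);
    if (val)
        __sync_fetch_and_add(val, 1);
    else
        bpf_map_update_elem(&counts, &key, &one, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

A DWARF-based unwinder like Parca's replaces the bpf_get_stackid call with its own table-driven walk of the user stack, but the surrounding structure, a perf-event-attached program feeding maps read by user space, is the same.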
The constrained nature of eBPF of course means that we can't write our entire profiler in eBPF, but merely the component that has to run in response to events and directly access the memory of the target program; that is, the stack unwinder. Our collector (parca-agent) is split into two pieces: an eBPF program written in C that unwinds the stack and stuffs the result into a buffer, and a user-space Go program that reads the stacks from the buffer and collates them, sends them over the network to a backend, parses the DWARF unwind tables for new processes and communicates them in a packed format to the C layer, and generally orchestrates and manages everything.
And that's it! I hope after reading this article you understand a bit more about how modern CPU profilers like Parca work, and how both one ancient technology (hardware timers) and one modern one (eBPF) make it all possible.
AI Usage Disclaimer:
I used Claude, with the Opus 4.6 model, during the preparation of this article in the following ways:
- Researching certain technical details in the Linux kernel codebase
- Checking the article for technical accuracy, and
- Generating the C example program to demonstrate hardware-timer-based perf events.
No AI was used to generate, edit, or influence the prose of the document in any way, except in pointing out two minor technical mistakes of only peripheral importance, and one spelling mistake.
Footnotes
[^1]: A pedantic note on terminology: what is popularly called "eBPF" is now officially just called "BPF", having completely replaced "classic BPF" (now officially called "cBPF") in the kernel. Nevertheless, for clarity we continue calling it "eBPF" in this article due to the continued widespread colloquial popularity of that term. Officially, "BPF" stands for nothing, despite historically standing for "Berkeley Packet Filter".
[^2]: Others will undoubtedly disagree and give that title to the features that enable containerization. Also a reasonable choice.