The Problem
eBPF, and uprobes in particular, gives you an essentially limitless ability to dig into your software and trace what is going on. Think about it: you can attach a probe to basically any instruction in any function in any executable. That's a lot of probe points! In practice, though, it can be challenging: you may not be familiar with the code you want to trace, and if the binary is stripped it can be hard to discover what can be traced at all. It turns out reading binaries is difficult! But let's say you know what you want to trace (i.e. you have a function name in hand) and the binary isn't stripped, so translating the function name to an address is straightforward. The next task is to extract some value from the program: an integer that represents the size of something, a string that contains a user or application name, etc. How do you know where that thing lives? Is it on the stack? Is it in a register? Is it a couple of pointer dereferences away in some object that would require type layout information to drill into?
Things get tricky pretty fast, and if you do manage to get your probe working, a new release may come out and invalidate all the assumptions you had to make. Even the seemingly mundane task of knowing which process to attach probes to can be difficult: is the function you want in the binary or in a supporting shared library? What's the name of the binary, what's the name of the shared library, and what version numbers may or may not be attached to those names? And don't get me started on the tricks compilers play: maybe the function you want to probe is sometimes inlined and sometimes not, or maybe it's been compiled with PGO information and split into hot/cold parts.
TL;DR:
At Polar Signals we think user-level statically defined tracepoints (USDTs) represent a solution to all of these problems. To put our money where our mouths are, we developed a USDT-based solution for GPU profiling of CUDA applications, previously discussed here. But that use case is just the tip of the iceberg. To learn more about the ins and outs of USDT probes and what we plan to do with them, read on!
Introduction
USDTs allow a binary (executable or shared library) to declare to the world where some interesting information can be found and how to get it, with zero overhead. A tracing application can detect these points of interest, taking all the guesswork out of employing uprobes to get at that information. To start with, they are discoverable: just run readelf -n on a binary and you'll see the .note.stapsdt section listing all the USDTs the binary provides. Try this on your Linux system:
$ readelf -n $(which python3)

Displaying notes found in: .note.stapsdt
  Owner                Data size        Description
  stapsdt              0x00000040       NT_STAPSDT (SystemTap probe descriptors)
    Provider: python
    Name: import__find__load__start
    Location: 0x00000000004a773f, Base: 0x00000000009b11a3, Semaphore: 0x0000000000ba6bfa
    Arguments: 8@%rax
  stapsdt              0x00000047       NT_STAPSDT (SystemTap probe descriptors)
    Provider: python
    Name: import__find__load__done
    Location: 0x00000000004a7813, Base: 0x00000000009b11a3, Semaphore: 0x0000000000ba6bfc
    Arguments: 8@%rax -4@%edx
  stapsdt              0x00000033       NT_STAPSDT (SystemTap probe descriptors)
    Provider: python
    Name: audit
    Location: 0x00000000004b95dc, Base: 0x00000000009b11a3, Semaphore: 0x0000000000ba6bfe
    Arguments: 8@%r13 8@%r12
  stapsdt              0x00000030       NT_STAPSDT (SystemTap probe descriptors)
    Provider: python
    Name: gc__done
    Location: 0x00000000004c1935, Base: 0x00000000009b11a3, Semaphore: 0x0000000000ba6bf8
    Arguments: -8@%r15
  stapsdt              0x00000037       NT_STAPSDT (SystemTap probe descriptors)
    Provider: python
    Name: gc__start
    Location: 0x00000000004c19c9, Base: 0x00000000009b11a3, Semaphore: 0x0000000000ba6bf6
    Arguments: -4@-228(%rbp)
You can see that Python provides a number of USDTs you can attach to. You can also see where they are located in the binary (the Location field) and what arguments they provide (the Arguments field). The Arguments field is particularly interesting because it tells you how to extract values from the program when the USDT is hit. The gc__start USDT provides a single argument that is a signed 4-byte value (i.e. an int) located on the stack at an offset of -228 bytes from the base pointer (rbp). Now, to usefully trace something, all we need to know is the USDT name and what the arguments are. We don't have to know the address/location, we don't have to worry about where the arguments live, and we don't even have to care what the binary is called, where it lives, or whether it's an executable or a shared library. All that information can be pulled out of the stapsdt notes section in the binary; all we need is the provider ("python") and the USDT name. If you have bpftrace (be sure it's a recent version!) and some Python running on your system, try this little one-liner:
$ sudo bpftrace -e 'usdt:/usr/bin/python3:python:gc__start { printf("GC start: generation=%d\n", arg0); } usdt:/usr/bin/python3:python:gc__done { printf("GC done: collected=%ld\n", arg0); }'
Attaching 2 probes...
GC start: generation=0
GC done: collected=0
GC start: generation=0
GC done: collected=24
...
Here's a small list of the things bpftrace did under the covers to accomplish this:
- Find the python binary and parse the .note.stapsdt section
- Use LLVM to build eBPF programs that print the GC generation and number of objects collected from the GC probes (yep, that's right: bpftrace is linked against libLLVM.so so it can do this)
- Parse the probe locations and arguments to figure out where to attach the programs
- Issue syscalls to the eBPF subsystem to load those programs, and to create and initialize all the supporting eBPF maps/ringbuffers
Sounds simple, doesn't it? The reality is there's a fair bit of complexity underlying the simple facade of USDTs. For parca-agent in particular, we are coming at this from the starting point of a Go-based profiling agent built on Cilium's eBPF library. That library is great because it lets us avoid using CGO to talk to C code (a previous incarnation of our profiler used libbpfgo). Unfortunately, unlike libbpf, Cilium's Go library supports uprobes but not USDTs, so some elbow grease must be applied.
Probes
But before we get into that, it's worth covering how USDT probes get into a program in the first place. This is a fairly big topic, but the simplest solution for C-family programs is to include sys/sdt.h and use the STAP_PROBE macros (DTRACE_PROBE is also available for code that wants to support both stap and dtrace). Our CGO-based unit tests can be seen here. These macros expand to assembly that puts the stapsdt notes into the binary. And what code is placed at the address of the probe in your program? A single NOP instruction. That's it: basically zero overhead. When the kernel attaches a probe, that NOP is replaced with an interrupt that traps into the kernel, which then executes your already loaded/verified/JITed eBPF program. Note that attached uprobes are not zero overhead: a hit is essentially a context switch from your program to the kernel, so you can think of the cost as similar to a system call or other context switch. That's on the order of thousands of instructions and typically takes a few microseconds (how long will it be before that statement is no longer valid, I wonder? This is why we put dates on our blogs!).
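To make that concrete, here's a minimal sketch of declaring a probe in C. The myapp provider, probe name, and program are invented for illustration; you'll need the SystemTap SDT headers installed (e.g. the systemtap-sdt-dev package on Debian-ish systems):

#include <stdio.h>
#include <sys/sdt.h>

/* "myapp" and "request__done" are hypothetical names for this sketch. */
static void handle_request(int request_id, long bytes_sent) {
    /* Expands to a single NOP here, plus a .note.stapsdt ELF note
       describing where a tracer can find the two arguments. */
    STAP_PROBE2(myapp, request__done, request_id, bytes_sent);
}

int main(void) {
    handle_request(42, 4096);
    printf("now check the notes: readelf -n ./a.out\n");
    return 0;
}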
USDT Attach Location
Okay, so some funky assembly generated our stapsdt notes and put a NOP instruction in just the right spot. The first thing we have to figure out is what address to attach to. We can see that the note above provides a "Location", but some massaging is required to turn it into an actual offset that can be passed to the kernel for attachment. The details are here and our implementation can be found here. It's not very exciting, but we promised a deep dive!
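The gist, as a sketch, assuming the stapsdt note and the ELF program headers have already been parsed (the names below are ours for illustration, not from any library):

#include <stdint.h>

/* The PT_LOAD segment (from the ELF program headers) containing loc. */
typedef struct {
    uint64_t vaddr;  /* p_vaddr  */
    uint64_t offset; /* p_offset */
} load_segment;

/* loc and note_base come from the stapsdt note; actual_base is the
   runtime address of the .stapsdt.base section. */
uint64_t usdt_attach_offset(uint64_t loc, uint64_t note_base,
                            uint64_t actual_base, load_segment seg) {
    /* If the binary was prelinked, the base section moved; shift the
       probe location by the same delta. */
    loc += actual_base - note_base;
    /* Convert the virtual address into the file offset the kernel's
       uprobe API expects. */
    return loc - seg.vaddr + seg.offset;
}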
USDT Arg Handling
Now that we know what address to attach our probe to, we have to figure out how to deal with arguments. An eBPF program basically gets a context that includes all the registers, but we need some extra information: a USDT arg can be a constant, a register, or a memory reference relative to a register. We could hardcode the eBPF code to look these up, but that's a fragile solution that breaks if the argument locations change. Different compilers lay things out differently, and layouts can change with different optimization levels, so a better solution is to build a "spec" describing where each argument lives and store that information in a map the eBPF program can consult at runtime. This is the solution libbpf uses, and we liked it so much that we pretty much copied it wholesale. For comparison, the hardcoded approach looks like this:
#if defined(__aarch64__)
  // ARM64: args are read relative to the stack pointer, e.g. 4@[sp, 36]
  u64 addr = ctx->sp;
  err = bpf_probe_read_user(&correlation_id, sizeof(correlation_id), (void *)(addr + 60));
  if (err)
    return err;
  err = bpf_probe_read_user(&cbid, sizeof(cbid), (void *)(addr + 32));
  if (err)
    return err;
#else
  // AMD64: args are read relative to the base pointer, e.g. 4@-36(%rbp)
  u64 rbp = ctx->bp;
  err = bpf_probe_read_user(&correlation_id, sizeof(correlation_id), (void *)(rbp - 44));
  if (err)
    return err;
  err = bpf_probe_read_user(&cbid, sizeof(cbid), (void *)(rbp - 64));
  if (err)
    return err;
#endif
And the generic "spec" based solution looks like this:
SEC("usdt/parcagpu/cuda_correlation")
int BPF_USDT(cuda_correlation, u32 correlation_id, u32 cbid) {
...
}
Very nice! That BPF_USDT macro is doing a lot of work. Basically, it expands into calls to bpf_usdt_argN, which reads each argument value by consulting the argument spec map. The map key is a "spec" id, just a counter generated when we attach the USDT probe, and it's delivered to the eBPF program via the bpf_get_attach_cookie helper. You can see it in all its glory here. So exciting, I know! In truth it's worse than the hardcoded snippet above suggests: we recently added CUDA 13 support to our GPU tracer and some of the USDT arguments changed, so we'd have had to add even more conditionals to support all 4 variations (amd64/arm64 × CUDA 12/13).
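For a flavor of what the spec machinery looks like, here's a heavily simplified sketch (the real version lives in libbpf's usdt.bpf.h and our port of it; the names and layouts below are illustrative, not the actual API):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Illustrative, cut-down version of a libbpf-style USDT arg spec.
   In the real thing, the spec is fetched from a BPF map keyed by
   bpf_get_attach_cookie(ctx). */
enum usdt_arg_type {
    USDT_ARG_CONST,     /* value baked into the note, e.g. 4@$5     */
    USDT_ARG_REG,       /* value lives in a register, e.g. 8@%rax   */
    USDT_ARG_REG_DEREF, /* memory at reg+offset, e.g. -4@-228(%rbp) */
};

struct usdt_arg_spec {
    enum usdt_arg_type type;
    short reg_off; /* offset of the register within struct pt_regs */
    long val_off;  /* the constant, or the offset from the register */
};

static int read_usdt_arg(struct pt_regs *ctx, struct usdt_arg_spec *spec, long *out) {
    long reg;
    switch (spec->type) {
    case USDT_ARG_CONST:
        *out = spec->val_off;
        return 0;
    case USDT_ARG_REG:
        return bpf_probe_read_kernel(out, sizeof(*out), (void *)ctx + spec->reg_off);
    case USDT_ARG_REG_DEREF:
        if (bpf_probe_read_kernel(&reg, sizeof(reg), (void *)ctx + spec->reg_off))
            return -1;
        return bpf_probe_read_user(out, sizeof(*out), (void *)(reg + spec->val_off));
    }
    return -1;
}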
Some astute readers may be wondering if this dynamic eBPF code to resolve USDT args is just unnecessary overhead: couldn't we patch the eBPF bytecode at runtime, before loading the program, to "hardcode" the correct lookup sequence? Sounds good in theory, and we thought about it for a bit, but we couldn't convince ourselves that we could make it foolproof or that the tiny improvement in efficiency was worth it, so we decided not to pursue that approach. In a future world where there are good libraries for converting eBPF code into an IR designed for this kind of manipulation and back out to eBPF, we may revisit this decision, but probably not anytime soon. Remember kids, premature optimization is the root of all evil.
Attach to PID?
When attaching a uprobe you can attach to all processes matching the binary, or to a specific PID. For a continuous system-wide profiler like parca-agent we generally want to attach system-wide, but on older kernels system-wide attachment to shared libraries doesn't work, so in those cases we fall back to attaching to each PID separately. That can be expensive, so we prefer system-wide attachment whenever it's available.
MultiProbes FTW!
Speaking of efficiency, the whole load/verify/JIT process with eBPF is expensive (even if it's a one-time cost). So if we want to attach to thousands of different probe points, wouldn't it be nice to run them all through one "multi" eBPF program? Well, it turns out you can! Multi-probe support was added in the 6.6 Linux kernel, and our implementation automatically uses it when available. The catch is that you have to author your eBPF programs twice: once as a standalone eBPF program, and again in the context of the multi-probe program, which is basically just a big switch statement. Luckily we have more macro magic to make this relatively straightforward:
// single probe impl
SEC("usdt/parcagpu/cuda_correlation")
int BPF_USDT(cuda_correlation, u32 correlation_id, u32 cbid) {
  ...
}

// multiprobe impl
SEC("usdt/cuda_probe")
int cuda_probe(struct pt_regs *ctx) {
  u64 cookie = bpf_get_attach_cookie(ctx);
  switch (cookie) {
  case 'c': return BPF_USDT_CALL(cuda_correlation, correlation_id, cbid);
  ...
Basically, each probe attach point gets a different "cookie" which the kernel passes to the program, so it knows which attach point is being invoked. The BPF_USDT_CALL macro expands to a call to the inlined body of the cuda_correlation probe defined earlier by the BPF_USDT macro. It's a little tricksy, but in practice it looks pretty clean!
So where are all the USDTs?
I know what you're thinking: USDTs are the coolest thing since sliced cheese (sorry bread, keto rules!), so why aren't they everywhere? Good question! Some software has good support for them: postgres has dozens, python has the handful we've seen, glibc has some. But in general there hasn't been widespread adoption. Maybe some people see the word DTRACE and think "oh, that's something Oracle got from Sun and is now a techno-fossil" or "oh, that's the thing macOS kinda supported but handicapped behind SIP". Who knows! But modern Linux kernels have killer support for these things and everyone should be using them. Once you experience the power of uprobes it's hard not to think of interesting ways to use them, and USDT probes are basically just uprobes that grew up, put on a suit, and got a real job.
Another thing that hurts USDT usage is that some distros don't ship them; the Python exercise above may have failed for you if your distro didn't build Python with --with-dtrace. I guess that one NOP instruction was a bridge too far for some folks. Same story with glibc on Debian: over 80 USDTs related to runtime library loading, threading, locking, and memory allocation that Debian just disables (Ubuntu and Fedora have them). Shame!
The shim library use case we built for CUDA profiling was nice, but another tantalizing idea is to use USDTs to create a formal, declared ABI boundary between tracers/profilers like parca-agent and the programs people want to trace. Today we exert a lot of effort to extract the offsets of internal types so we can walk the stacks of V8/Java/.NET/Python etc., and every time those runtimes change, these offsets can change and break things. We've even resorted to disassembly to find this information. Yuck. Wouldn't it be nice if programs could just include a set of STAP_PROBEs whose arguments passed us this information? They wouldn't even need to be attached to!
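For example, a runtime could ship something like this hypothetical probe (all names invented for illustration). Since offsetof() is a compile-time constant, the note's Arguments field typically records the values as literals, so a profiler can read them straight out of the ELF without ever attaching:

#include <stddef.h>
#include <sys/sdt.h>

/* Hypothetical internal frame layout of some language runtime. */
struct vm_frame {
    void *pc;
    void *sp;
    struct vm_frame *caller;
};

void advertise_abi(void) {
    /* The offsets land in the stapsdt note as constant argument specs. */
    STAP_PROBE3(myruntime, abi__frame__layout,
                offsetof(struct vm_frame, pc),
                offsetof(struct vm_frame, sp),
                offsetof(struct vm_frame, caller));
}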
But my personal pipe dreams aside, we also see great potential for USDT probes in fighting the "uprobes are too expensive" problem. Once you have this nice, rich conduit from your program to your profiler/tracer, the program can be smart about how frequently it's fired. You wouldn't want to build a memory profiler by sticking a uprobe on malloc/free; it would be too expensive. But what if malloc/free (or an LD_PRELOADed shim library wrapper) watched the allocation volume and fired a USDT probe only after every 1MB of allocations (i.e. sampled the allocation stream)? That could be used to implement something like jemalloc's memory profiler or Go's built-in profiler. In fact, you could probably have one shim library that supported many memory allocators, and better yet, the allocators could just embed these USDTs themselves, doing away with the shim library.
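Here's a minimal sketch of that sampling idea, using a hypothetical wrapper function (a real shim would interpose malloc itself via LD_PRELOAD and dlsym(RTLD_NEXT, ...), and handle free too):

#include <stdlib.h>
#include <sys/sdt.h>

#define SAMPLE_INTERVAL (1 << 20) /* fire the probe once per MB allocated */

static __thread size_t bytes_since_sample;

/* Hypothetical wrapper; "mymalloc"/"alloc__sample" are invented names. */
void *traced_malloc(size_t size) {
    void *ptr = malloc(size);
    bytes_since_sample += size;
    if (bytes_since_sample >= SAMPLE_INTERVAL) {
        /* An attached profiler records a stack trace and the volume;
           unattached, the probe itself costs a single NOP. */
        STAP_PROBE2(mymalloc, alloc__sample, bytes_since_sample, size);
        bytes_since_sample = 0;
    }
    return ptr;
}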
Another cool possibility is for tracing programs like parca-agent to automatically attach to USDT probes of interest and record a timestamp and a stack trace when they fire; we could even automatically stick a value from one of the USDT args into a custom label attached to the sample.
Conclusion
USDT probes solve fundamental observability problems that have plagued systems engineers for years: how to make programs traceable without fragile hacks, how to maintain that traceability across updates and architectures, and how to do it with minimal overhead. They transform uprobes from a powerful but unwieldy debugging tool into a stable, discoverable interface that programs can expose intentionally.
The technical hurdles we overcame to add USDT support to parca-agent - parsing STAP notes, handling cross-architecture argument specifications, implementing multiprobe support - are all one-time costs that pay dividends every time we (or you!) need to trace something new. Our CUDA GPU profiling use case demonstrates the power of this approach: a shim library with strategically placed USDTs gave us deep visibility into GPU operations without any kernel modifications or brittle binary parsing.
But we're just scratching the surface. USDTs deserve to be a first-class citizen in the observability toolkit. They should be in your application framework, your runtime, your libraries. They should replace the brittle offset extraction and disassembly we resort to today. And with modern Linux kernel support (multiprobes, attach cookies, and all the eBPF goodness), there's never been a better time to add them.
So here's the challenge: next time you're building software that others might want to observe - whether it's a web framework processing requests, a database executing queries, or a runtime managing memory - consider adding a few USDTs. Make your software observable by design, not by accident. Your users (and their debugging sessions at 3am) will thank you.
Where do you see USDTs being potentially useful? What would you trace if it were this easy? Let us know! Reach out on Discord or email me directly at tr@polarsignals.com. And if you want to dig into our implementation, check out the opentelemetry-ebpf-profiler repository - all the USDT handling code we discussed is there waiting for you to explore, fork, and improve.