October 17, 2023

Nearly a year ago, we added support for DWARF Unwinding in Parca Agent to supercharge our pre-existing frame-pointer-based Unwinding. This was done for the x86 architecture.

To our delight, the community flourished and soon there was a request from end-users for adding support for Arm64 workloads as well. So, of course, we delivered!

Starting in Parca Agent v0.25.0, it is possible to profile Arm64 workloads without adding extra flags to the Parca Agent.

This blog post is an announcement of Arm64 support in our Profiler and a bit about the Why and How of the efforts that went into implementing the Arm64 DWARF Unwinder in the Polar Signals Profiler!

Why

In the 1980s, a group of fewer than 10 engineers prototyped a new chip architecture in interpreted BASIC on a very tight R&D budget and schedule to keep up with market competitors. The chip was called the Acorn RISC Machine, or ARM.

They reduced the number of basic instructions to simplify calculations at the silicon level. Although programs then needed more instructions, each instruction consumed far fewer CPU clock cycles. This meant remarkably lower power consumption, which impressed engineers at Apple.

Apple invested in the chip and, fast-forward a few decades, Arm64 powers some of the most performant personal computers through Apple's M1 and M2 chips. ARM chips lead the smartphone processor industry and are widely used in IoT devices.

It is, therefore, no surprise that they also power the major cloud providers in the industry: Google Cloud Platform (GCP) and Amazon Web Services (AWS).

Our main goal at Polar Signals is making Continuous Profiling a one-step no-brainer for our users: deploying without worrying about instrumentation. As a company in the cloud native space, we want to support all architectures that are used by major Cloud providers, so full Arm64 support has always been on our roadmap. To that end, we already have frame-pointer-based profiling on Arm64.

So when some of our clients and end-users wished for Arm64 support for stripped binaries, we jumped right into it. Here's what we did to add full-fledged support in our Agent for DWARF Unwinding in Arm64.

A short Primer on Registers, Unwinding, DWARFs and ELFs

A stacktrace is largely a collection of function memory addresses associated with a process. Each frame in the stack is an address representing a function calling the next function, which is again represented by a frame holding the address for that function.

Unwinding a stack mostly involves reading these memory addresses. To do this, we largely care about four values, most of them held in CPU registers, that help us navigate stacks for each process. They essentially act as markers to keep track of where a stack frame begins and where it ends, and of when we reach the `main` function so we know when to stop unwinding. They are:

1. Program Counter (`pc`) / Instruction Pointer (`ip`/`rip`): Stores the memory address of the instruction currently being executed.

2. Stack Pointer (`sp`/`rsp`): Points to the top of the stack, which is the beginning of the current stack frame (its lower bound, since the stack grows towards lower addresses).

3. Frame Pointer (`fp`/`rbp`): Points to the base of the frame that is currently executing. It moves as frames are pushed and popped, and marks the bottom of the stack frame (its upper bound).

4. Return Address (`ra`) / Link Register (`lr`): The address in the caller at which execution resumes once the current function returns, i.e. the saved `ip` of the previous function.

Stack Layout showing the Frame Pointer(fp) and the Stack Pointer (sp)

Often, compilers omit frame pointers in production binaries to save space or to gain optimisations by freeing up the `fp` register. This means we need to fall back on the debug information (debuginfo) we can obtain from these binaries to unwind the stack.

Binaries today are largely compiled to the standardised ELF format and have debugging information encoded according to the DWARF spec. Unwinding the stack with that information is what we call DWARF Unwinding. If you are curious to know more about this, my coworker Kemal already has you covered in this wonderful article.

In the absence of frame pointers, we use Return Addresses (RAs) to build the stacktraces. We calculate the return addresses using `SP`/`FP` register information in the `.eh_frame` sections of binaries.

At a very fundamental level, a stacktrace is a collection of frames, each frame representing the address of an instruction that calls the next frame. Or, TL;DR: a stacktrace is an array of saved return addresses.
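For contrast, when frame pointers are present, collecting that array of return addresses amounts to walking a linked list. The following is a minimal sketch over a simulated stack, not Parca Agent code: `fake_stack`, `STACK_BASE`, `walk_frame_pointers` and all addresses are made up for illustration, and it assumes the common layout in which each frame stores the caller's frame pointer immediately followed by the return address.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simulated stack memory. Slot i lives at STACK_BASE + 8 * i.
 * Assumed frame layout: [saved caller fp][return address]. */
#define STACK_BASE 0x7000u

static const uint64_t fake_stack[] = {
    0x7010,   /* slot 0: saved fp, points at slot 2          */
    0x401234, /* slot 1: return address into the caller      */
    0x7020,   /* slot 2: saved fp, points at slot 4          */
    0x401100, /* slot 3: return address into the grandparent */
    0,        /* slot 4: saved fp == 0 marks the last frame  */
    0x401000, /* slot 5: return address of the last frame    */
};

#define STACK_SLOTS (sizeof(fake_stack) / sizeof(fake_stack[0]))

/* Follows the chain of saved frame pointers starting at fp and stores each
 * return address in out. Returns the number of frames collected. */
size_t walk_frame_pointers(uint64_t fp, uint64_t *out, size_t max_frames) {
    size_t n = 0;
    while (fp != 0 && n < max_frames) {
        if (fp < STACK_BASE || fp % 8 != 0) {
            break; /* not a plausible address in our simulated stack */
        }
        size_t slot = (size_t)((fp - STACK_BASE) / 8);
        if (slot + 1 >= STACK_SLOTS) {
            break; /* the frame record would fall outside the stack */
        }
        out[n++] = fake_stack[slot + 1]; /* saved return address */
        fp = fake_stack[slot];           /* move to the caller's frame */
    }
    return n;
}
```

Calling `walk_frame_pointers(0x7000, frames, 8)` on this simulated stack collects the three made-up return addresses in order. Without frame pointers there is no such chain to follow, which is why the unwind information has to come from somewhere else.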

We do this by taking the binaries we want to profile and putting all the necessary information about the CPU registers into `Unwind Tables`.

How?

Here's what the unwind table for the `x86` architecture looks like:

=> Function start: 293c0, Function end: 29427
	pc: 293c0 cfa_type: 2  rbp_type: 0  cfa_offset: 8    rbp_offset: 0   
	pc: 293c5 cfa_type: 2  rbp_type: 0  cfa_offset: 16   rbp_offset: 0   
	pc: 293cb cfa_type: 2  rbp_type: 0  cfa_offset: 32   rbp_offset: 0   
	pc: 29408 cfa_type: 2  rbp_type: 0  cfa_offset: 16   rbp_offset: 0   
	pc: 29409 cfa_type: 2  rbp_type: 0  cfa_offset: 8    rbp_offset: 0   
	pc: 2940e cfa_type: 2  rbp_type: 0  cfa_offset: 32   rbp_offset: 0

compact unwind table snippet for `libc` on x86
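To use such a table, the unwinder has to find the row that applies to a given program counter. One plausible way to do that, sketched here under the assumption that rows are sorted by `pc`, is a binary search for the last row whose `pc` is less than or equal to the target. The `unwind_row_t` type and `find_row` function are hypothetical names for illustration; the row values are copied from the snippet above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row type mirroring the columns printed in the snippet. */
typedef struct {
    uint64_t pc;
    uint8_t cfa_type;
    uint8_t rbp_type;
    int16_t cfa_offset;
    int16_t rbp_offset;
} unwind_row_t;

/* Rows copied from the x86 libc snippet, assumed to be sorted by pc. */
static const unwind_row_t libc_rows[] = {
    {0x293c0, 2, 0, 8, 0},
    {0x293c5, 2, 0, 16, 0},
    {0x293cb, 2, 0, 32, 0},
    {0x29408, 2, 0, 16, 0},
    {0x29409, 2, 0, 8, 0},
    {0x2940e, 2, 0, 32, 0},
};

/* Returns the last row whose pc is <= target, or NULL if every row starts
 * after target. This sketch does not check the function end address. */
const unwind_row_t *find_row(const unwind_row_t *rows, size_t len,
                             uint64_t target) {
    size_t lo = 0;
    size_t hi = len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rows[mid].pc <= target) {
            lo = mid + 1; /* candidate row, look further right */
        } else {
            hi = mid;
        }
    }
    return lo == 0 ? NULL : &rows[lo - 1];
}
```

For example, a program counter of `0x293c7` falls between the rows at `293c5` and `293cb`, so the lookup returns the row with `cfa_offset: 16`.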

We know from the x86 spec that the return address is always stored 8 bytes ahead of the previous stack pointer, that is, at `previous_rsp - 8`. This makes it easy to calculate the saved return address once we have the value for the previous stack pointer.

However, this principle is not specified in the Arm64 ABI spec.

Now, let us take a look at what an Arm64 Unwind Table looks like:

=> Function start: 26c00, Function end: 26c80
	pc: 26c00 cfa_type: 2  rbp_type: 0  cfa_offset: 0    rbp_offset: 0   lr_offset: 0   
	pc: 26c04 cfa_type: 2  rbp_type: 1  cfa_offset: 48   rbp_offset: -48 lr_offset: -40 
	pc: 26c14 cfa_type: 2  rbp_type: 1  cfa_offset: 48   rbp_offset: -48 lr_offset: -40 
	pc: 26c6c cfa_type: 2  rbp_type: 0  cfa_offset: 0    rbp_offset: 0   lr_offset: 0   
	pc: 26c70 cfa_type: 2  rbp_type: 1  cfa_offset: 48   rbp_offset: -48 lr_offset: -40 
=> Function start: 26a00, Function end: 26a0c
	pc: 26a00 cfa_type: 2  rbp_type: 0  cfa_offset: 0    rbp_offset: 0   lr_offset: 0   
	pc: 26a04 cfa_type: 2  rbp_type: 1  cfa_offset: 16   rbp_offset: -16 lr_offset: -8

compact unwind table snippet for `libc` on Arm64

Do you see the extra column with the `lr_offset` field? It keeps track of where the link register (`lr`), which holds the return address (`ra`) on Arm64, has been saved on the stack, expressed as an offset that we need to calculate the return address. The 8-byte principle we apply above for x86 is not something we can rely on here.

However, we already have the information we need about the `ra` register, saved in the `lr_offset` field!

Okay, but how do we use it? Let us look at another code snippet, this time from our BPF Unwinder for x86:

// HACK(javierhonduco): This is an architectural shortcut we can take. As we
// only support x86_64 at the minute, we can assume that the return address
// is *always* 8 bytes ahead of the previous stack pointer.
#if __TARGET_ARCH_x86
    u64 previous_rip_addr = previous_rsp - 8;
    int err = bpf_probe_read_user(&previous_rip, 8, (void *)(previous_rip_addr));
    if (err < 0) {
      LOG("\n[error] Failed to read previous rip with error: %d", err);
    }
    LOG("\tprevious ip: %llx (@ %llx)", previous_rip, previous_rip_addr);
#endif

As mentioned above, to calculate the current `ra` (or the `previous_rip`), we only need to read the value stored 8 bytes ahead (at `previous_rip_addr`) of our previous stack pointer (`previous_rsp`).

We then use the `previous_rip` value to update our unwind state and add the new stackframe.
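As a worked example of that arithmetic, the sketch below takes the x86 table row at `pc: 293c5` (`cfa_offset: 16`) and a made-up stack pointer value. It assumes that `cfa_type: 2` means the previous stack pointer is computed as the current `rsp` plus `cfa_offset`; that interpretation, and the helper names `x86_previous_rsp` and `x86_return_address_slot`, are assumptions for illustration rather than something stated in the snippet above.

```c
#include <stdint.h>

/* Assumption: cfa_type 2 means "previous rsp = current rsp + cfa_offset". */
uint64_t x86_previous_rsp(uint64_t rsp, int64_t cfa_offset) {
    return rsp + (uint64_t)cfa_offset;
}

/* The x86 shortcut from the snippet: the saved return address sits in the
 * 8 bytes just before the previous stack pointer. */
uint64_t x86_return_address_slot(uint64_t previous_rsp) {
    return previous_rsp - 8;
}
```

With a hypothetical `rsp` of `0x7ffd0000`, the previous stack pointer is `0x7ffd0010` and the return address is read from `0x7ffd0008`. At function entry (`cfa_offset: 8`) the same arithmetic yields `rsp` itself, which matches the fact that a `call` has just pushed the return address onto the top of the stack.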

That's just for x86 though. What do we do for Arm64?

#if __TARGET_ARCH_arm64
    // For the leaf frame, the saved pc/ip is always stored in the link register itself
    if (found_lr_offset == 0) {
      previous_rip = PT_REGS_RET(&ctx->regs);
    } else {
      u64 previous_rip_addr = previous_rsp + found_lr_offset;
      int err = bpf_probe_read_user(&previous_rip, 8, (void *)(previous_rip_addr));
      if (err < 0) {
        LOG("\n[error] Failed to read previous rip with error: %d", err);
      }
      LOG("\tprevious ip: %llx (@ %llx)", previous_rip, previous_rip_addr);
    }
#endif

On Arm64, the unwind table gives us an `lr_offset` that tells us where the `ra` register was saved relative to `previous_rsp`. For the leaf frame, the link register itself still holds the return address. For the remaining frames, we add `found_lr_offset` to `previous_rsp` to obtain the address (`previous_rip_addr`) that stores the return address value. Then we read that value into `previous_rip` and can now update our unwind state to hold the new frame!
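The same two branches can be exercised in user space. The sketch below is a simplified stand-in, not the agent's BPF code: `arm64_row_t`, `arm64_previous_pc`, `fake_memory` and `fake_read_u64` are hypothetical, the reader callback plays the role that `bpf_probe_read_user` has in the snippet above, and it assumes that `cfa_type: 2` means the previous stack pointer is the current `sp` plus `cfa_offset`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant columns of one Arm64 row. */
typedef struct {
    int64_t cfa_offset;
    int64_t lr_offset;
} arm64_row_t;

/* Hypothetical reader for 8 bytes of the profiled process's memory. */
typedef bool (*read_u64_fn)(uint64_t addr, uint64_t *out);

/* Simulated stack starting at FAKE_SP, with made-up contents. */
#define FAKE_SP 0x8000u
#define FAKE_SLOTS 6
static const uint64_t fake_memory[FAKE_SLOTS] = {
    0x9000,  /* sp + 0: saved fp, i.e. previous sp - 48 */
    0x26d10, /* sp + 8: saved lr, i.e. previous sp - 40 */
    0, 0, 0, 0,
};

bool fake_read_u64(uint64_t addr, uint64_t *out) {
    if (addr < FAKE_SP || addr % 8 != 0) {
        return false;
    }
    uint64_t slot = (addr - FAKE_SP) / 8;
    if (slot >= FAKE_SLOTS) {
        return false;
    }
    *out = fake_memory[slot];
    return true;
}

/* Computes the caller's pc for one unwind step. */
bool arm64_previous_pc(const arm64_row_t *row, uint64_t sp, uint64_t lr_reg,
                       read_u64_fn read_u64, uint64_t *previous_pc) {
    if (row->lr_offset == 0) {
        /* Leaf frame: the return address is still in the link register. */
        *previous_pc = lr_reg;
        return true;
    }
    /* Assumption: previous sp = sp + cfa_offset (cfa_type 2). */
    uint64_t previous_sp = sp + (uint64_t)row->cfa_offset;
    uint64_t addr = previous_sp + (uint64_t)row->lr_offset;
    return read_u64(addr, previous_pc);
}
```

Using the table row at `pc: 26c04` (`cfa_offset: 48`, `lr_offset: -40`) with `sp = 0x8000`, the previous stack pointer is `0x8030` and the saved link register is read from `0x8008`. For a row with `lr_offset: 0`, the function simply returns the register value it was given.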

From here on, the rest of the Unwind Process is the same as for x86. And just like that, we have our stacktraces!

What's Next?

It has been nearly a month since we added Arm64 support for our DWARF Unwinder in v0.25.0 of Parca Agent and so far it has been showing great results! If you are curious about more nitty-gritty details, here is the PR for Adding Unwind Tables and here is the PR for implementing the Arm64 BPF DWARF Unwinder.

Do try it out with our latest Parca Agent release and let us know your thoughts! We are looking forward to your feedback and happy to answer questions on either GitHub or our Discord!

We just launched General Availability of our Polar Signals Cloud product and are offering a 14-day free trial, so all you need to do to give Arm64 Profiling a spin is Sign Up and deploy the Agent to a Linux machine. You’ll be profiling Arm64 workloads in no time to ship faster software and reduce your cloud bill while you're at it!
