Introducing Off-CPU Profiling

How Off-CPU profiling works and how to get the most out of it

July 30, 2025

Up until now, the Parca Agent, the eBPF-based profiler Polar Signals maintains, supported only On-CPU profiling: when profiling workloads, users would only see the work their applications perform on the CPU. When optimizing for cost or latency, On-CPU profiles are arguably among the most important data to look at. However, especially when optimizing latency, it is also important to understand why a process is not performing work on the CPU, for example when it is waiting for I/O, such as network requests or disk reads, to complete.

Today we are launching Off-CPU Profiling to close this gap.

Collecting the Raw Data

Since the profiler already had the ability to unwind stacks for arbitrary languages, the important question for Off-CPU profiling was: when should samples be collected? The original proposal discussed a few options, ultimately landing on attaching to the Linux kernel's tracepoint:sched:sched_switch tracepoint to track when a task is taken off the CPU, and to the kprobe:finish_task_switch.isra.0 kprobe to know when the task is put back onto the CPU. Together, these answer two crucial questions:

  1. What is the code that caused the task to be taken off the CPU?
  2. How long was the task not on the CPU?

One difficulty with Off-CPU profiling is that it is not unusual for the kernel to switch tasks many thousands of times per second. Tracing every switch can cause significant overhead on workloads, so Off-CPU profiling is off by default and is only enabled using the --off-cpu-threshold flag, which specifies how many out of every 1000 Off-CPU events should be sampled.

All credit for implementing the collection of Off-CPU profiling data goes to Florian Lehner.

Making use of Off-CPU data

After successfully producing Off-CPU profiling data, we excitedly turned it on for our Go and Rust workloads, only to see this:

By far the largest share from the Go code comes from runtime.usleep (the last of the user-space frames, in blue). We see this in the Off-CPU profiling data because the Go runtime's sysmon goroutine intentionally sleeps periodically, both to yield to the rest of the program and to periodically check whether it needs to perform any housekeeping tasks. So this is both expected and something we can't do anything about.

To address this, we introduced the "not contains" filter for stacks, which lets us filter out any stacks we are not interested in seeing. Applying a "not contains" filter to the above profiling data produces the following:

A little better. However, one of the major remaining stacks still involves runtime.futex, so the next feature we needed was the ability to provide multiple filters. We didn't want to stop there, though: we have also introduced what we call "Filter Presets". Now all a user needs to do is select the "Go Runtime Expected Off-CPU" preset.

Now we can see that the majority of the time this Prometheus server is taken off the CPU is simply because it spends a lot of time in garbage collection, a timer fires, and the kernel takes it off the CPU. This is an interesting insight in its own right: when optimizing allocations, we usually only think about the associated On-CPU cost, but it turns out that if a process performs enough allocations, the kernel will also take it off the CPU, potentially causing additional latency for end users.

Let's go one step further, and ignore garbage collection stacks.

Aha! Now we can see some non-timer-related reasons: there are EpollWait and syscall.Write, ultimately leading to entry_SYSCALL_64_after_hwframe (on ARM64 the equivalent would be el0_svc). These wait times are actually caused by I/O; more specifically, the stacks show that it is all related to network traffic. Neat!

Presets

To start, we've added filter presets for the Go runtime and for Tokio, the asynchronous runtime from the Rust ecosystem. If you discover further stacks that should be ignored, feel free to let us know and we can add them to the presets. Or is there a language or asynchronous runtime that you think we should add next?

Conclusion

With Off-CPU profiling we're giving our users another tool to deeply understand their systems.

Let us know if you try Off-CPU profiling and give us your feedback on Discord!
