System-wide profiling in Parca Agent

TL;DR: In this post, we describe the changes we made to Parca Agent to profile every process running on a machine, why we made them, and some of the challenges we discovered along the way. For those of you who haven't heard of Parca Agent before, it's a BPF-based, always-on profiler that needs no code changes or restarts.

Host-wide or system-wide profiling typically refers to profilers that can gather samples from every process running on a system as opposed to profiling a single process or some subgroup.

This new architecture offers a complete view of how every process is using resources by profiling the whole system, beyond the applications you are running.

The previous architecture

Before we re-architected Parca Agent, it had a push-based design built around what we called service discovery mechanisms. Profiling was set up in two steps. First, we would discover the cgroups to profile, either via the Kubernetes API or from cgroup names passed in the Agent's arguments. (Cgroups, or control groups, are a Linux kernel feature used to implement containers.) Then, we would attach a CPU profiler (perf event) to each cgroup.

Each of the profilers had a couple of housekeeping and operational data structures, and most importantly, they attached a new perf event for each cgroup.

A key property of this design is that each profiler had metadata about its group of processes. For example, workloads running on Kubernetes would have the pod name, among many other pieces of information, added to them. Thanks to this, we can slice and dice by any of these labels in the UI.

Limitations of the previous design

While this served us well for a while, it was starting to be a limiting factor for several reasons:

We wanted visibility across all the processes on the machine, even the ones that weren't orchestrated by Kubernetes or other container management systems. This gives us a higher degree of performance observability: sometimes the performance change is not in our application code, but in a sidecar container, an application that's not containerised, our runtime or scheduler, or even in the kernel.
The architecture we had at the beginning wasn't flexible enough to support adding new profilers. Adding one was awkward, and code reuse was low. On top of that, each profiler needed its own set of bookkeeping data structures in memory, which increased the Agent's overhead. This is a limitation we wanted to eliminate, as we are planning to add new profilers soon.
As we were creating a CPU profiler (perf event) per cgroup, many workloads would need more hardware performance counters than the CPU's Performance Monitoring Unit (PMU) provides, forcing the kernel to multiplex them. Modern CPUs have special registers that we leverage to profile the CPU. They are limited (typically 4-10 are available [0]), and if we request more events than there are counters, the kernel swaps them periodically so that every requested event gets counted, which reduces accuracy and increases the work the kernel has to do.
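
To make the accuracy cost of multiplexing concrete, here is a minimal sketch of the extrapolation applied to a multiplexed counter. It relies on the `time_enabled` and `time_running` values described in perf_event_open(2); the function name and the numbers are made up for illustration and are not taken from Parca Agent.

```python
def scale_multiplexed_count(raw_count: int, time_enabled_ns: int, time_running_ns: int) -> float:
    """Estimate the full-period count for an event that was multiplexed."""
    if time_running_ns == 0:
        # The event never held a hardware counter, so there is nothing to extrapolate from.
        return 0.0
    # Extrapolate the observed count to the whole period the event was enabled.
    return raw_count * time_enabled_ns / time_running_ns


# Hypothetical numbers: the event was enabled for 10 ms but only held a
# hardware counter for 2.5 ms, so the raw count is extrapolated by a factor of 4.
estimate = scale_multiplexed_count(
    raw_count=1_000, time_enabled_ns=10_000_000, time_running_ns=2_500_000
)
print(estimate)
```

The result is an estimate rather than a measurement, which is one reason why multiplexed events are less accurate.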

Benefits of the new design

Rather than creating a perf event for each cgroup, now we are creating one that is not attached to any cgroup, effectively flipping the service discovery mechanism inside out.

We still subscribe to Kubernetes events, and we create a mapping of PIDs to the metadata that we use later on to enrich the profiles.

Just after we finish a profiling round, in which we collect all the stacks, we generate the pprof profiles, find the metadata associated with the PIDs, and enrich the profiles with it. Once this is done, the profiles are sent to the server, where they are stored, symbolised, and indexed.
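
The enrichment step can be sketched roughly as follows. This is an illustration only, not Parca Agent's actual code (the Agent is not written in Python): the `Profile` type, the `enrich` function, and the PID-to-labels mapping are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Labels = Dict[str, str]
# A metadata provider maps a PID to labels; it returns {} when it knows nothing.
Provider = Callable[[int], Labels]


@dataclass
class Profile:
    pid: int
    samples: List[str]
    labels: Labels = field(default_factory=dict)


def enrich(profiles: List[Profile], providers: List[Provider]) -> List[Profile]:
    """Attach the labels from every provider to each profile, keyed by PID."""
    for profile in profiles:
        for provider in providers:
            profile.labels.update(provider(profile.pid))
    return profiles


# Hypothetical PID -> metadata mapping, as it might be built from Kubernetes events.
kubernetes_pods = {4242: {"pod": "api-7d9f", "namespace": "prod"}}


def kubernetes_provider(pid: int) -> Labels:
    return kubernetes_pods.get(pid, {})


profiles = enrich(
    [Profile(pid=4242, samples=["main;handle"]), Profile(pid=1, samples=["init"])],
    [kubernetes_provider],
)
for p in profiles:
    print(p.pid, p.labels)
```

In this sketch PID 1 has no Kubernetes metadata, so its profile keeps its samples and simply carries no extra labels.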

Adding metadata providers

Thanks to this new, more flexible architecture, we have added metadata not just for Kubernetes, but also for any systemd unit running on your host. This covers, for example, most cron-like binaries and other services, as well as any process spawned from your terminal on a box that uses systemd.
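
As an illustration of what a systemd metadata provider could look like, the sketch below derives a unit name from the contents of a process's `/proc/<pid>/cgroup` file. This is an assumption about one possible approach, not a description of how Parca Agent does it; the function name and the set of unit suffixes are hypothetical.

```python
from typing import Optional

# Unit types that directly contain processes (an assumption made for this sketch).
UNIT_SUFFIXES = (".service", ".scope")


def systemd_unit_from_cgroup(cgroup_file_contents: str) -> Optional[str]:
    """Return the innermost systemd unit named in a /proc/<pid>/cgroup file, if any."""
    for line in cgroup_file_contents.splitlines():
        # Each line has the form "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        _, controllers, path = parts
        # cgroup v2 lines have an empty controller list; on cgroup v1,
        # systemd tracks units in its own named hierarchy.
        if controllers not in ("", "name=systemd"):
            continue
        for component in reversed(path.split("/")):
            if component.endswith(UNIT_SUFFIXES):
                return component
    return None


print(systemd_unit_from_cgroup("0::/system.slice/cron.service\n"))
```

A process that does not belong to any unit yields `None`, in which case no systemd labels would be attached to its profile.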

Not relying on the cgroup file system

Before, we had to find the path of the cgroup we wanted to profile, as we had to open a file descriptor for it and pass it to the perf_event_open(2) system call. This caused some issues: the cgroup hierarchy is not standardised across Linux distributions, so we needed heuristics to find the right path, which was quite tricky in practice.

Simplicity & reliability

As a side-effect, the new design follows a more natural model where profilers are spawned first, metadata is collected, and data is stitched together, increasing the decoupling between these two components. If there's no metadata for a given process, nothing will be added to its profile, but at least, we will have performance data.

The mental load when setting up the Agent for the first time is reduced as well, as less effort has to be put into the initial configuration. Just decide if you prefer to profile Kubernetes or not and everything will be done for you!

We believe that the new architecture is easier to understand for newcomers to the project, and allows more flexibility when adding new profilers or metadata providers, such as the compiler data metadata source.

Improved testability

Finally, it also simplifies testing in several ways:

As developers, we no longer have to test as extensively across Linux distributions, since we don't depend on the shape of the cgroup filesystem or on the differences between cgroupsv1 (mostly used by Kubernetes [1] and older Docker containers) and cgroupsv2 (which systemd and Podman have been using for some years). These differences caused issues for some of our users running less popular distributions.
We've made the code easier to test by adding interfaces, for example.
Before these changes, a process had to run in a cgroup and we had to find its path, which is something some folks struggled with.
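
To illustrate how an interface helps with testing, here is a small sketch in which the code under test depends only on an interface, so a test can pass in a fake instead of talking to Kubernetes or systemd. All names (`MetadataProvider`, `FakeProvider`, `labels_for`) are hypothetical and chosen for this example; they are not Parca Agent's types.

```python
from typing import Dict, Protocol


class MetadataProvider(Protocol):
    """The interface the profiling code depends on."""

    def labels(self, pid: int) -> Dict[str, str]: ...


class FakeProvider:
    """Test double: returns canned labels without touching Kubernetes or systemd."""

    def __init__(self, canned: Dict[int, Dict[str, str]]) -> None:
        self._canned = canned

    def labels(self, pid: int) -> Dict[str, str]:
        # Return a copy so callers cannot mutate the canned data.
        return dict(self._canned.get(pid, {}))


def labels_for(pid: int, provider: MetadataProvider) -> Dict[str, str]:
    """Code under test: it only knows about the interface."""
    labels = {"pid": str(pid)}
    labels.update(provider.labels(pid))
    return labels


fake = FakeProvider({7: {"systemd_unit": "cron.service"}})
print(labels_for(7, fake))
print(labels_for(8, fake))
```

Because the fake needs neither a cluster nor a cgroup, a test like this can run on any developer machine.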

Challenges

Before we embarked on this project, we anticipated that we would add more stress to different parts of the system, both in the Agent and in Parca itself.

As system-wide profiling has the potential to send more data, the window for race conditions that were either latent or rarely exercised is dramatically widened. We debugged some issues related to this in various components:

A debuginfo extraction race in the Agent: https://github.com/parca-dev/parca-agent/pull/444
Debuginfo races in the server, which kickstarted the debuginfo metadata effort: https://github.com/parca-dev/parca/pull/1136
Race condition in the profile's buffer: https://github.com/parca-dev/parca-agent/pull/641

All the issues mentioned above resulted in data being corrupt, and in some cases, such as in the last one, profiles were lost.

Not only did we address these, but we also adopted Go's phenomenal race detector, which we now run while developing locally and are soon going to enable in our integration tests.

Last, but not least, a previous design of this project used the `bpf_get_current_cgroup_id()` BPF helper. While it worked without any problems under cgroupsv2, we quickly learnt that it doesn't work at all in cgroupsv1, which prompted us to reevaluate the design.
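
Since behaviour differs between the two cgroup versions, it can be useful to know which one a host is running. The sketch below uses one common check: the root of a cgroup v2 hierarchy exposes a `cgroup.controllers` file. This is not necessarily what Parca Agent does, the function name is hypothetical, and hosts in hybrid mode (where the v2 hierarchy is mounted elsewhere) are deliberately not handled.

```python
import os


def cgroup_version(cgroup_root: str = "/sys/fs/cgroup") -> int:
    """Return 2 if cgroup_root looks like a cgroup v2 unified hierarchy, else 1."""
    # The root of a cgroup v2 hierarchy contains a cgroup.controllers file.
    if os.path.isfile(os.path.join(cgroup_root, "cgroup.controllers")):
        return 2
    return 1


print(cgroup_version())
```

Taking the root path as a parameter keeps the check easy to exercise against a temporary directory in tests.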

Takeaways

Rearchitecting Parca Agent to gather system-wide profiles required a lot of groundwork before we were able to reach the finish line. Tackling this work incrementally increased our chances of success: for example, it let us notice which parts required more work, so we could focus on them and continue improving the overall reliability and scalability of Parca and Parca Agent.

Collaborating with several team members on this project allowed us to find better designs and improve the overall architecture, and made sure we wouldn't paint ourselves into a corner with a new system that wasn't extensible or scalable.

Finally, having a full-blown test environment running Kubernetes in a VM was invaluable for finding problems that would otherwise have taken much longer to uncover and would have delayed the project.

What's next?

We plan to stabilise system-wide profiling in the next couple of weeks, but for that, your help will be invaluable. Give it a try and let us know about any bugs you bump into or any feedback you have. It's currently available in the `main` branch. Feel free to leave any feedback on this GitHub discussion.

As we are sending more data and processing more profiles, there might be increased resource usage. In this regard, we are actively trying to reduce Parca Agent's footprint as well. We, of course, profile Parca with Parca. Stay tuned for a forthcoming blog post on the topic.

References

[0]: The PAPI project allows us to easily show information on CPU's performance counters:

# on Fedora and similar distributions
$ dnf install papi
$ papi_avail | grep Counters
Number Hardware Counters : 10
Max Multiplex Counters   : 384

[1]: As released yesterday, Kubernetes v1.25.0 stabilised cgroupsv2!