Offline Mode for the Parca Agent
Introduction
CPU profiling with Parca involves two main components: the Parca agent, which runs as a process called parca-agent on the host being profiled, and the Parca backend, which runs as a process called parca on any host, possibly not the same one. The agent continuously collects stack traces of the code scheduled on the CPU. It then periodically sends the collected stack traces to the backend, where they are stored in a database for future retrieval.
The design of Polar Signals Cloud is exactly analogous: it features a more performant and scalable backend, but it communicates with the Parca agent in the same way. Thus, in this post, "the backend" should be taken to mean either the Parca backend or the Polar Signals Cloud backend.
The Motivation for Offline Mode
Until now, the agent has typically sent data to the backend over the network as it is collected, and a lost network connection usually means that data collected while the network was down is never reported. In a typical modern server workload, this is acceptable: a host losing network connectivity is a rare scenario that means the host is basically useless anyway.
But the world of computing is broader than just servers, and we on the Parca team would like our software to be useful in other kinds of deployments as well. In the modern world, many computerized devices are either never connected to the internet or only unreliably connected: this includes everything from smartphones to autonomous vehicles.
Thus, we decided to develop Offline Mode: a new feature for the Parca agent allowing it to save data locally and upload it for further processing later.
How It Works
Recording the Data
In traditional operation ("online" mode), the agent communicates with the backend via the following stateful protocol: first, it uploads a list of stack IDs (computed by hashing the stacks themselves) along with a count of how many times each stack ID occurred. The backend then responds with the list of IDs for which it needs the full stack trace, and finally, the agent sends the full traces for just those IDs. This allows the backend to cache stacks it has already seen, decreasing network traffic.
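To make the shape of that exchange concrete, here is a small Go sketch of the three steps; the types and values are purely illustrative assumptions and do not reflect Parca's actual gRPC API.

```go
// Illustrative sketch of the online protocol with made-up types and an
// in-memory "backend"; the real agent speaks Parca's gRPC protocol.
package main

import "fmt"

type StackID string // in reality, a hash of the stack trace

func main() {
	// Step 1: the agent reports each stack ID together with how many
	// times it was observed during the reporting interval.
	counts := map[StackID]uint64{"stack-a": 12, "stack-b": 3}

	// Step 2: the backend replies with the IDs it has not cached yet,
	// i.e. the ones it still needs the full trace for.
	known := map[StackID]bool{"stack-a": true}
	var missing []StackID
	for id := range counts {
		if !known[id] {
			missing = append(missing, id)
		}
	}

	// Step 3: the agent sends the full stack traces for just those IDs.
	fullStacks := map[StackID][]string{
		"stack-b": {"main", "handleRequest", "parseBody"},
	}
	for _, id := range missing {
		fmt.Printf("uploading full trace for %s: %v\n", id, fullStacks[id])
	}
}
```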
In offline mode, every five seconds, rather than sending anything to the backend, we write two records to a file (each prefixed with its size in bytes): first, the stack IDs and their counts; second, the full stacks for any IDs that have not yet been recorded in the same file. We then call fsync to ensure data persistence, and finally, update the count of batches in the header of the file.
This format is self-describing and resistant to crashes: since the batch count is not updated until after the batch is synced to disk, an attempt to read a partially written file will only see fully written batches (though it might miss an entire final batch if it was in the process of being written when the agent process terminated).
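As a rough illustration, here is what appending one batch could look like in Go; the header layout (a single uint32 batch count at offset 0) and the uint32 length prefixes are simplified assumptions, not the agent's exact on-disk format.

```go
// Sketch of appending one batch: two length-prefixed records, an fsync,
// and only then a header update. Layout details are assumed.
package offline

import (
	"encoding/binary"
	"io"
	"os"
)

func appendBatch(f *os.File, counts, newStacks []byte) error {
	// Write both records at the end of the file, each prefixed with its
	// size in bytes.
	if _, err := f.Seek(0, io.SeekEnd); err != nil {
		return err
	}
	for _, rec := range [][]byte{counts, newStacks} {
		var size [4]byte
		binary.LittleEndian.PutUint32(size[:], uint32(len(rec)))
		if _, err := f.Write(size[:]); err != nil {
			return err
		}
		if _, err := f.Write(rec); err != nil {
			return err
		}
	}
	// Make the batch durable before advertising it in the header.
	if err := f.Sync(); err != nil {
		return err
	}
	// Only now bump the batch count; a reader of a partially written file
	// therefore never sees a half-written batch.
	var hdr [4]byte
	if _, err := f.ReadAt(hdr[:], 0); err != nil {
		return err
	}
	binary.LittleEndian.PutUint32(hdr[:], binary.LittleEndian.Uint32(hdr[:])+1)
	_, err := f.WriteAt(hdr[:], 0)
	return err
}
```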
Every ten minutes, the storage file is rotated: it is compressed using ZSTD to reduce storage cost, and a new file is started. The files are saved with the scheme {timestamp}-{pid}.padata so that later they can be read in timestamp order.
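A rotation step might look roughly like the following Go sketch, which uses the klauspost zstd package; everything beyond the {timestamp}-{pid}.padata naming scheme (how the finished file is handed over, where compression happens) is an assumption made for illustration.

```go
// Sketch of rotating the storage file: compress the finished file with
// zstd into a new {timestamp}-{pid}.padata file. Details are assumed.
package offline

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"time"

	"github.com/klauspost/compress/zstd"
)

func rotate(storageDir string, finished *os.File) error {
	// Name the file so that a later reader can sort by timestamp.
	name := fmt.Sprintf("%d-%d.padata", time.Now().Unix(), os.Getpid())
	out, err := os.Create(filepath.Join(storageDir, name))
	if err != nil {
		return err
	}
	defer out.Close()

	enc, err := zstd.NewWriter(out) // ZSTD keeps the storage cost down
	if err != nil {
		return err
	}
	if _, err := finished.Seek(0, io.SeekStart); err != nil {
		enc.Close()
		return err
	}
	if _, err := io.Copy(enc, finished); err != nil {
		enc.Close()
		return err
	}
	if err := enc.Close(); err != nil { // flush the final zstd frame
		return err
	}
	return finished.Close()
}
```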
Uploading the Data
Later, the data can be uploaded whenever, and from wherever, the user chooses. The upload does not have to happen on the same machine where the data was recorded, as long as the uploading machine has access to the storage directory where the files were written.
The uploader reads files from the storage directory in the order they were written (sorting by the timestamp in the filename). It uploads samples to the backend using the same protocol the agent uses during normal operation, using the full stacks (the second record in each batch) to answer the backend's requests for traces it hasn't seen. After each file is successfully uploaded, the uploader removes it from the storage directory, so it can pick up where it left off if it's interrupted.
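The core of such an upload loop could look something like this Go sketch, where uploadFile is a hypothetical stand-in for replaying one file against the backend using the protocol described above.

```go
// Sketch of the upload loop: process files in timestamp order and delete
// each one only after a successful upload, so interrupted runs can resume.
package offline

import (
	"os"
	"path/filepath"
	"sort"
)

func uploadAll(storageDir string, uploadFile func(path string) error) error {
	paths, err := filepath.Glob(filepath.Join(storageDir, "*.padata"))
	if err != nil {
		return err
	}
	// With the {timestamp}-{pid}.padata scheme, sorting the names puts
	// the files in roughly chronological order; a real implementation
	// would parse the timestamp out of the filename.
	sort.Strings(paths)

	for _, p := range paths {
		if err := uploadFile(p); err != nil {
			return err // leave the file in place and retry later
		}
		if err := os.Remove(p); err != nil {
			return err
		}
	}
	return nil
}
```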
Try It Out
If you want to profile an x86-64 or aarch64 Linux installation that has reliable access to storage but not to the network, the Parca agent's offline mode might be just what you're looking for. To try it out, run parca-agent with --offline-mode-storage-path=/path/to/storage to begin collecting profiling data locally. The agent will create .padata files in the specified directory, rotating and compressing them every 10 minutes by default.
When you're ready to upload the collected data, run parca-agent with both --offline-mode-storage-path=/path/to/storage and --offline-mode-upload, along with your usual backend configuration flags (like --remote-store-address). The uploader will process all files in timestamp order and remove them after successful upload. This doesn't have to be done on the same machine as collection: nothing stops you from copying the /path/to/storage directory to anywhere that is capable of maintaining a network connection and running parca-agent.
We hope this is useful. Happy profiling!