Offline Mode for the Parca Agent
Introduction
CPU profiling with Parca involves two main components: the Parca agent, which runs
as a process called parca-agent on the host being profiled, and the
Parca backend, which runs as a process called parca on any host, not
necessarily the same one. The agent continuously collects stack
traces of the code scheduled on the CPU. It then periodically sends
the collected stack traces to the backend, where they are stored in a
database for future retrieval.
The design of Polar Signals Cloud is exactly analogous: it features a
more performant and scalable backend, but it communicates with the
Parca agent in the same way. Thus, in this post, "the backend" should
be taken to mean either the Parca backend or the Polar Signals Cloud backend.
The Motivation for Offline Mode
Until now, the agent has communicated with the backend over the
network, and if the connection is lost, data collected during the
outage is generally not reported. For a typical modern server
workload, this is acceptable: a host losing network connectivity is a
rare event that renders the host basically useless anyway.
But the world of computing is broader than just servers, and the Parca
team would like our software to be useful in other kinds of
deployments as well. In the modern world, many computerized devices
are either never connected to the internet or only unreliably
connected: this includes everything from smartphones to autonomous
vehicles.
Thus, we decided to develop Offline Mode: a new feature for the
Parca agent allowing it to save data locally and upload it for further
processing later.
How It Works
Recording the Data
In traditional operation ("online" mode), the agent communicates with
the backend via the following stateful protocol: first, it uploads a list of
stack IDs (computed by hashing the stacks themselves) along with a
count of how many times each stack ID occurred. The backend then
responds with the list of IDs for which it needs the full stack traces,
and finally, the agent sends those traces. This lets the backend
cache stacks it has already seen, reducing network traffic.
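To make the shape of that exchange concrete, here is a minimal sketch in Go. The type and method names (StackID, Report, Backend, and so on) are illustrative stand-ins rather than the agent's real gRPC API.

    // Illustrative types for the two-round-trip exchange described above;
    // none of these names come from the actual Parca API.
    package protocol

    // StackID is a hash of a stack trace (16 bytes here, purely for illustration).
    type StackID [16]byte

    // Report is the first message: how many times each stack ID was observed.
    type Report struct {
        Counts map[StackID]uint64
    }

    // UnknownStacks is the backend's reply: the IDs it has no cached trace for.
    type UnknownStacks struct {
        IDs []StackID
    }

    // StackTraces is the final message: the full traces for exactly those IDs.
    type StackTraces struct {
        Stacks map[StackID][]uint64 // stack ID -> program counters
    }

    // Backend stands in for the real transport (gRPC in practice).
    type Backend interface {
        ReportCounts(Report) (UnknownStacks, error)
        SendStacks(StackTraces) error
    }

    // sendBatch performs one round of the protocol for a batch of samples.
    func sendBatch(b Backend, counts map[StackID]uint64, all map[StackID][]uint64) error {
        unknown, err := b.ReportCounts(Report{Counts: counts})
        if err != nil {
            return err
        }
        resp := StackTraces{Stacks: make(map[StackID][]uint64, len(unknown.IDs))}
        for _, id := range unknown.IDs {
            resp.Stacks[id] = all[id]
        }
        return b.SendStacks(resp)
    }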
In offline mode, every five seconds, rather than sending anything to
the backend, we write two records to a file (each prefixed with
its size in bytes): first, the stack IDs and
their counts; second, the full stacks for any IDs that have not yet
been recorded in the same file. We then call fsync to ensure data
persistence, and finally, update the count of batches in the header of
the file.
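As a rough sketch of what that append step could look like in Go, assuming (purely for illustration) a fixed-size header whose last four bytes hold the batch count; the real .padata layout may differ:

    package padata

    import (
        "encoding/binary"
        "os"
    )

    // headerSize is illustrative: say, a 4-byte magic number plus a 4-byte batch count.
    const headerSize = 8

    // appendBatch writes one batch (the counts record, then the new-stacks record),
    // each prefixed with its length in bytes, fsyncs, and only then bumps the batch
    // count in the header. The file is assumed to be positioned at its end and not
    // opened with O_APPEND (WriteAt would fail in that case).
    func appendBatch(f *os.File, batches *uint32, counts, newStacks []byte) error {
        for _, rec := range [][]byte{counts, newStacks} {
            var size [4]byte
            binary.LittleEndian.PutUint32(size[:], uint32(len(rec)))
            if _, err := f.Write(size[:]); err != nil {
                return err
            }
            if _, err := f.Write(rec); err != nil {
                return err
            }
        }
        // Make the batch itself durable before advertising it in the header.
        if err := f.Sync(); err != nil {
            return err
        }
        *batches++
        var count [4]byte
        binary.LittleEndian.PutUint32(count[:], *batches)
        if _, err := f.WriteAt(count[:], headerSize-4); err != nil {
            return err
        }
        return f.Sync()
    }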
This format is self-describing and resistant to crashes: since the
batch count is not updated until after the batch is synced to disk, an
attempt to read a partially-written file will only see atomically
written batches (though it might miss an entire final batch if it was
in the process of being written when the agent process terminated).
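Continuing that sketch (same hypothetical header layout), the read path shows why this is safe: a reader trusts only the batch count in the header, so it never looks at bytes beyond the last fully synced batch.

    // readBatches returns, for each committed batch, its two records
    // (the counts record first, then the newly seen full stacks).
    func readBatches(r io.Reader) ([][2][]byte, error) {
        var header [headerSize]byte
        if _, err := io.ReadFull(r, header[:]); err != nil {
            return nil, err
        }
        n := binary.LittleEndian.Uint32(header[headerSize-4:])
        batches := make([][2][]byte, 0, n)
        for i := uint32(0); i < n; i++ {
            var batch [2][]byte
            for j := range batch {
                var size [4]byte
                if _, err := io.ReadFull(r, size[:]); err != nil {
                    return nil, err
                }
                batch[j] = make([]byte, binary.LittleEndian.Uint32(size[:]))
                if _, err := io.ReadFull(r, batch[j]); err != nil {
                    return nil, err
                }
            }
            batches = append(batches, batch)
        }
        return batches, nil
    }

(This function additionally needs the standard library's io package imported alongside the imports in the previous sketch.)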
Every ten minutes, the storage file is rotated: it is compressed using
ZSTD to reduce storage cost, and a new file is started. The files are
saved with the scheme {timestamp}-{pid}.padata so that later they
can be read in timestamp order.
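Here is a sketch of that rotation step, assuming the github.com/klauspost/compress/zstd package for compression and that the rotated file takes its timestamp at rotation time; the agent's actual bookkeeping may differ.

    package padata

    import (
        "fmt"
        "io"
        "os"
        "path/filepath"
        "time"

        "github.com/klauspost/compress/zstd"
    )

    // rotate compresses the finished working file into a new
    // {timestamp}-{pid}.padata file in dir and removes the original.
    func rotate(dir, workingPath string) (string, error) {
        src, err := os.Open(workingPath)
        if err != nil {
            return "", err
        }
        defer src.Close()

        name := fmt.Sprintf("%d-%d.padata", time.Now().Unix(), os.Getpid())
        dst, err := os.Create(filepath.Join(dir, name))
        if err != nil {
            return "", err
        }
        defer dst.Close()

        enc, err := zstd.NewWriter(dst)
        if err != nil {
            return "", err
        }
        if _, err := io.Copy(enc, src); err != nil {
            enc.Close()
            return "", err
        }
        if err := enc.Close(); err != nil {
            return "", err
        }
        return name, os.Remove(workingPath)
    }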
Uploading the Data
Later, data may be uploaded whenever and from wherever the user
chooses to do so; this does not necessarily have to be done on the
same machine where the data was recorded, as long as the uploading
machine has access to the storage directory where the files were
written.
The uploader reads files from the storage directory in the order they were
written (sorting by the timestamp in the filename). It uploads samples
to the backend using the same protocol the agent uses during normal
operation, using the full stacks (the second record in each batch) to
answer the backend's requests for traces it hasn't seen. After each file
is successfully uploaded, the uploader removes it from the storage
directory, so it can pick up where it left off if it's interrupted.
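A condensed sketch of that loop, with the per-file decompression and backend exchange hidden behind a callback (their details depend on the pieces sketched earlier):

    package padata

    import (
        "os"
        "path/filepath"
        "sort"
        "strconv"
        "strings"
    )

    // uploadAll replays every .padata file in dir in timestamp order, calling
    // upload for each one, and removes a file only after it has been fully
    // uploaded, so an interrupted run can pick up where it left off.
    func uploadAll(dir string, upload func(path string) error) error {
        paths, err := filepath.Glob(filepath.Join(dir, "*.padata"))
        if err != nil {
            return err
        }
        sort.Slice(paths, func(i, j int) bool {
            return timestampOf(paths[i]) < timestampOf(paths[j])
        })
        for _, path := range paths {
            if err := upload(path); err != nil {
                // Leave this file and the remaining ones on disk for a later retry.
                return err
            }
            if err := os.Remove(path); err != nil {
                return err
            }
        }
        return nil
    }

    // timestampOf extracts the leading {timestamp} component of a
    // {timestamp}-{pid}.padata filename.
    func timestampOf(path string) int64 {
        ts, _ := strconv.ParseInt(strings.SplitN(filepath.Base(path), "-", 2)[0], 10, 64)
        return ts
    }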
Try It Out
If you want to profile an x86-64 or aarch64 Linux installation that
has reliable access to storage but not to the network, the Parca
agent's offline mode might be just what you're looking for. To try it
out, run parca-agent with --offline-mode-storage-path=/path/to/storage to begin collecting
profiling data locally. The agent will create .padata files in the
specified directory, rotating and compressing them every 10 minutes by
default.
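For example (the path is a placeholder; include whatever other flags you normally run the agent with):

    parca-agent --offline-mode-storage-path=/path/to/storage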
When you're ready to upload the collected data, run parca-agent with
both --offline-mode-storage-path=/path/to/storage and --offline-mode-upload, along with your usual backend configuration
flags (like --remote-store-address). The uploader will process all
files in timestamp order and remove them after successful upload. This
doesn't have to be done on the same machine as collection: nothing
stops you from copying the /path/to/storage directory to any machine
that can maintain a network connection and run parca-agent.
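On the uploading machine, that might look like this (again with placeholder values for the path and backend address):

    parca-agent --offline-mode-storage-path=/path/to/storage \
        --offline-mode-upload \
        --remote-store-address=<backend-address>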
We hope this is useful. Happy profiling!