Sampling PGO

Exploring how well the various forms of PGO work

May 20, 2025

Introduction

As purveyors of profiling software, we at Polar Signals have always been interested in what value we can extract from profiling data beyond the typical workflow of identifying hotspots and making changes to optimize them. One of the more interesting ideas is to use profiling data to drive the decisions compilers make when optimizing code.

It sounds great in theory, but in reality PGO (profile-guided optimization, sometimes FDO, feedback-directed optimization) has some significant difficulties. The first one is creating your training data: you typically need a synthetic "benchmark" that accurately represents your target workloads but is easy to run with some regularity and doesn't take too long. Matching the user's workload can be hard! It's been our customers' experience that when they turn on continuous profiling in production on real user workloads, there are always a couple of surprise CPU consumers that are ripe, low-hanging fruit for optimization.

PGO is a bit of a young art, so it's a roll of the dice how well sampling works in practice. But in a perfect world one could use profiling data from actual production systems, profiled with something like Polar Signals' continuous profiler, and skip the laborious task of conjuring up a representative synthetic workload. So let's find out how perfect our world is!

New horizons!

In a previous blog we talked about how to apply profiles to Go's compiler. In typical Go fashion it's very straightforward and "just works": the input to the compiler is just a pprof profile, which you can get from Go's builtin profiler or a Polar Signals query. This time we want to take the next step and see if the same can be done for native applications. Since we seem to have more Rust than C++ customers these days we decided to start with Rust, although the choice is somewhat arbitrary: Rust sits on top of LLVM, so the same basic machinery applies to every native language LLVM supports.

Background

Traditional compilers (as opposed to, say, a JIT compiler) are often flying blind when it comes to applying optimizations. They don't know anything about how the code will run, so they can't organize code to favor the common paths, and they don't know which functions are really hot and might make good candidates for inlining. And if you inline too much, the code can get really big and can actually get slower. So it's a tradeoff and a guessing game. By giving the compiler an idea of which parts of a binary are frequently executed, it can do a much better job at this task. It also makes life easier for the developer, who no longer has to think as deeply about this problem; littering code with inline suggestions or branch prediction hints (likely(), unlikely() macros etc.) becomes less important. Inlining is just the tip of the iceberg, by the way: here's a shortlist of common techniques, but a modern compiler will have many dozens, if not hundreds, of different optimizations that can be done, or done better, with profiling data.

The standard approach for PGO with LLVM is to make a build of your software with "instrumentation" turned on, run the instrumented binary on various training workloads, process the instrumentation data into a single profdata file, and then rebuild with that. But we want to use real world profiling data! Luckily Google 1 and Facebook 2 have done a lot of work to enable a simpler workflow for exactly that. This is typically called "sampling" PGO to differentiate it from "instrumentation" PGO: here we get rid of the instrumented training build and generate the profdata directly by sampling the application with traditional profiling tools. These sampling PGO tools aren't as mature as the instrumentation approach, so we're gonna explore how well they work in practice.
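To make the instrumentation workflow concrete, here's roughly what it looks like for a plain C program built directly with clang (a minimal sketch; the source file and training workload are placeholders):

# 1. build with instrumentation; instrumented runs write .profraw files into ./pgo-data
clang -O2 -fprofile-generate=./pgo-data main.c -o myprog

# 2. run the training workload(s)
./myprog --some-training-workload

# 3. merge the raw profiles into a single profdata file
llvm-profdata merge -o merged.profdata ./pgo-data/*.profraw

# 4. rebuild using the merged profile
clang -O2 -fprofile-use=merged.profdata main.c -o myprog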

Selecting the right guinea pig

Our goal is to find some software where we can demonstrate meaningful speedups with instrumentation-based PGO and then see if sampling-based profiles can match those results. In the literature, with good sample data, anywhere from -2% 1 to 90% 3 of the gains from instrumentation PGO can be had with sampling PGO. Yeah, that's right: sampling PGO can actually make things worse!

One difficulty is finding a program that can benefit from PGO, doesn't take long to build or run, and has ready-to-go real world workloads for generating training profile data. Our plan of attack was to find the simplest microbenchmark, apply PGO and see if it helps, then repeat with sampling data and see the discrepancy. This yielded no fruit, because simple benchmarks are rarely improved by PGO: when the entire program is in one .rs file the compiler pretty much nails it from an optimization perspective. It seems PGO doesn't start to shine until you throw it at larger, more meaningful programs with lots of compilation units.

At this point we stumbled upon the awesome PGO resource called awesome PGO, which pointed out that Rust Analyzer had success with PGO. So much so that recent release builds are built with PGO. So here we have a real world project that builds in minutes, that thousands of people depend on, and that is already using PGO for a 15%-20% performance boost. From inception to actually shipping PGO binaries took many years, but hopefully PGO will be easier to adopt as more folks use it. I think there's a notion amongst developers that PGO is just trying to get a free lunch and that the only real way to make software faster is with smarter algorithms and better data structures, but as the number of PGO-powered optimizations grows and the delta between PGO and non-PGO performance widens, PGO will probably come to be seen as table stakes, just like JITs have become for bytecode languages. Probably more of a "when" question than an "if".

So how does it work?

Rust analyzer builds PGO support into its build process using Rust build tasks. Basically, you add the --pgo argument to the dist build task, i.e.: cargo dist --pgo clap-rs/clap. This will make an instrumentation build and run rust-analyzer's analysis-stats command on a GitHub repo of your choosing (by default it uses clap-rs/clap).

$ target/x86_64-unknown-linux-gnu/release/rust-analyzer analysis-stats -q --run-all-ide-things /home/tpr/src/rust-analyzer/rust-analyzer-pgo/clap-rs-clap

Once that's done we have to run the llvm-profdata tool to merge the raw profile files into one the compiler can use.

$ /home/tpr/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/bin/llvm-profdata merge /home/tpr/src/rust-analyzer/rust-analyzer-pgo/default_11812833856764408396_0.profraw -o /home/tpr/src/rust-analyzer/rust-analyzer-pgo/merged.profdata

Then all you have to do is rebuild with that artifact passed as profile-use in RUSTFLAGS.
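Doing that step by hand (rust-analyzer's dist task handles it for you) looks something like this, using the merged.profdata path from the merge step above:

RUSTFLAGS="-Cprofile-use=/home/tpr/src/rust-analyzer/rust-analyzer-pgo/merged.profdata" cargo build --release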

Pretty straightforward!

Some difficulties

Today the sampling-based PGO workflows don't work in stable Rust. The profile-sample-use argument one uses to pass sampling data to rustc is only available in nightly (instrumentation uses a different argument, profile-use, which is available in stable). That hints at some immaturity here; we're not sure whether it lies in the generation of the profdata from the sampling data, in LLVM's ability to effectively consume this data, or both.
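For reference, the sampling-data build looks roughly like this on a nightly toolchain (a sketch; the profdata path is a placeholder):

RUSTFLAGS="-Zprofile-sample-use=/path/to/samples.profdata" cargo +nightly build --release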

Another issue is that there are two types of sampling data that can be used to create the profdata we pass to profile-sample-use: LBR sampling data and regular old stack trace sampling data like what Go uses. LBR stands for Last Branch Record, and it refers to some CPUs' ability to keep a small buffer (8-32 entries) of the last branches taken. Each record is a "from" and "to" instruction pointer telling you where the most recent jumps were. This is useful because it's information you can't really get from a stack sample. A stack sample tells you where you were in a particular function and what the calling functions were. A branch record tells you the last N branches taken, which may be basic blocks within a function or may be tail calls (raw jmps) that wiped any record of being on the stack. Maybe we should explain some of this in more detail.

Basically, all programs are comprised of basic blocks, which are just sequences of instructions, and they terminate at edges, which connect jumps and function calls to the starting points of other basic blocks. For simple if/else style conditionals you can either jump to a later point in the function or fall through to the next basic block. If you recorded every single edge and built a weighted graph of all the basic blocks in a program, you'd have what "instrumentation" builds generate: they literally inject your program with code to bump a counter for every branch. The data is actually richer still; LLVM also supports value profiling, type profiling, and virtual table information.

With sampling profiling we don't have counters for every edge in our program; we have to take the knowledge that the program was at a particular basic block and infer what the preceding and succeeding basic blocks could be by looking at the actual instructions of the program. Some jumps are direct and it's easy to see where they go, but some are indirect (e.g. a virtual function call), where the program calculates an address and jumps to it. With stack sampling data we just have to hope that we have enough samples to paint a fuzzy picture of what generally happens in the program, but with LBR sampling we get a crisper, more accurate picture.
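To make the distinction concrete, the two flavors differ mainly in what you ask perf to record; something like the following (the target program is a placeholder, and the branch event name is the one we use later on):

# plain stack sampling: periodic instruction pointer + call stack samples
perf record -Fmax --call-graph fp -- ./my-program

# LBR sampling: also capture the CPU's last-branch-record buffer with each sample
perf record -Fmax -b -e BR_INST_RETIRED.NEAR_TAKEN:uppp -- ./my-program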

Benchmarking

So before we get into the numbers we should talk about benchmarking. Getting consistent numbers from benchmarks, so that one can make believable claims about performance, is one of the trickier parts of computer science. Just writing benchmarks that are predictable and deterministic can be hard, but assuming we have that, how do we isolate our software from everything else going on? A Linux kernel that's busy running a browser, an IDE, and hardware accelerated graphics duties will occasionally take our precious benchmark program off the CPU and stick it in a corner. Or it might decide the battery is low and move it to a slower CPU or a CPU running at a lower clock speed. My ThinkPad X1 has 3 classes of processor:

$ sudo cpupower frequency-info -o
          minimum CPU frequency  -  maximum CPU frequency  -  governor
CPU  0       400000 kHz (  8 %)  -    5000000 kHz (100 %)  -  powersave
CPU  1       400000 kHz (  8 %)  -    5000000 kHz (100 %)  -  powersave
CPU  2       400000 kHz (  8 %)  -    5000000 kHz (100 %)  -  powersave
CPU  3       400000 kHz (  8 %)  -    5000000 kHz (100 %)  -  powersave
CPU  4       400000 kHz (  7 %)  -    5200000 kHz (100 %)  -  powersave
CPU  5       400000 kHz (  7 %)  -    5200000 kHz (100 %)  -  powersave
CPU  6       400000 kHz (  7 %)  -    5200000 kHz (100 %)  -  powersave
CPU  7       400000 kHz (  7 %)  -    5200000 kHz (100 %)  -  powersave
CPU  8       400000 kHz (  8 %)  -    5000000 kHz (100 %)  -  powersave
CPU  9       400000 kHz (  8 %)  -    5000000 kHz (100 %)  -  powersave
CPU 10       400000 kHz (  8 %)  -    5000000 kHz (100 %)  -  powersave
CPU 11       400000 kHz (  8 %)  -    5000000 kHz (100 %)  -  powersave
CPU 12       400000 kHz ( 10 %)  -    3900000 kHz (100 %)  -  powersave
CPU 13       400000 kHz ( 10 %)  -    3900000 kHz (100 %)  -  powersave
CPU 14       400000 kHz ( 10 %)  -    3900000 kHz (100 %)  -  powersave
CPU 15       400000 kHz ( 10 %)  -    3900000 kHz (100 %)  -  powersave
CPU 16       400000 kHz ( 10 %)  -    3900000 kHz (100 %)  -  powersave
CPU 17       400000 kHz ( 10 %)  -    3900000 kHz (100 %)  -  powersave
CPU 18       400000 kHz ( 10 %)  -    3900000 kHz (100 %)  -  powersave
CPU 19       400000 kHz ( 10 %)  -    3900000 kHz (100 %)  -  powersave

So depending on how the scheduler works we can get drastically different outcomes!

In order to get consistent numbers from rust-analyzer benchmarks I found that I had to do the following to my development machine (a ThinkPad X1 running Ubuntu 24):

  1. Boot into run level 3 (no GUI)
  2. Enable performance mode on cpupower program (sudo cpupower frequency-set -g performance)
  3. Disable bluetooth, network, containerd, cron, cups etc.
  4. Renice anything that wasn't easy to stop or couldn't be stopped (i.e. anti-virus and endpoint security software).

Full script here. Your friendly neighborhood LLM can help further with this, or you might opt to run benchmarks on dedicated cloud/hosted servers that are a little easier to lock down. But be careful: it's a pretty deep rabbit hole, and if you find yourself trying to run your software in ring 0, fiddling with BIOS settings, and bricking your computer, don't say I didn't warn you! For our purposes here, if hyperfine isn't complaining about statistical outliers we're good to go. I didn't explore using taskset to isolate a set of CPUs or nice to bump the priority, but those are additional tools for getting even tighter variances.
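For flavor, a stripped-down sketch of the kind of commands involved (service names and the PID are placeholders; the full script linked above is the real reference):

# boot into a non-graphical target ("run level 3") on the next reboot
sudo systemctl set-default multi-user.target

# pin the CPU frequency governor to performance
sudo cpupower frequency-set -g performance

# stop background services that like to wake up mid-benchmark
sudo systemctl stop bluetooth cron cups containerd

# deprioritize anything that can't be stopped
sudo renice -n 19 -p 1234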

Training data

As we already saw, rust-analyzer's existing PGO support trains itself on the clap-rs/clap repository, so we attempt to use the same workload for the LBR and stack sampling data. First, the commands we run:

perf record -Fmax -b -e BR_INST_RETIRED.NEAR_TAKEN:uppp -- ./target/dev-rel/rust-analyzer analysis-stats -q --run-all-ide-things ./rust-analyzer-pgo/clap-rs-clap
llvm-profgen-19 --binary=./target/dev-rel/rust-analyzer --perfdata=perf.data --output=$OUT/lbrperf.profdata

We use the dev-rel profile because release doesn't have symbols. We're unsure whether a debug level of 1 or 2 is better for PGO, but we err on the side of more information being better. When we run it once and feed the result through llvm-profgen we get:

[ perf record: Captured and wrote 54.610 MB perf.data (69951 samples) ]
warning: Sample PGO is estimated to optimize better with 28.9x more samples. Please consider increasing sampling rate or profiling for longer duration to get more samples.

Turns out we have to run rust-analyzer at least 10 times to get enough samples for llvm-profgen to stop complaining.
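One way to do that without juggling a pile of perf.data files is to wrap several runs in a single perf session, roughly like this (the loop count and bash wrapper are just a sketch):

perf record -Fmax -b -e BR_INST_RETIRED.NEAR_TAKEN:uppp -o perf.data -- \
  bash -c 'for i in $(seq 10); do
    ./target/dev-rel/rust-analyzer analysis-stats -q --run-all-ide-things ./rust-analyzer-pgo/clap-rs-clap
  done'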

Tricks for getting more samples!

After running some very long training runs I discovered I can make perf generate events faster using these commands:

echo 99 | sudo tee /proc/sys/kernel/perf_cpu_time_max_percent
echo 50000 | sudo tee /proc/sys/kernel/perf_event_max_sample_rate

Perf will modulate the sample rate down to stay within the CPU max, so in practice I see this:

info: Using a maximum frequency rate of 20,000 Hz

But that's way better than the default of 1000!

create_llvm_prof is AutoFDO's tool for processing perf sample data into LLVM's profile data format. It wants a quiet system too:

[ERROR:/home/tpr/src/autofdo/third_party/perf_data_converter/src/quipper/perf_parser.cc:301] Only 94% of samples had all locations mapped to a module, expected at least 95%

I couldn't figure out how to filter out non-rust-analyzer events. My training scripts do multiple rust-analyzer runs for each perf invocation, so using a pid filter isn't straightforward, but that's probably the way to go.
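For reference, the create_llvm_prof invocation we're talking about looks roughly like this (flag names as we understand them from AutoFDO's documentation; treat them as an assumption and check --help):

create_llvm_prof \
  --binary=./target/dev-rel/rust-analyzer \
  --profile=perf.data \
  --out=samples.profdata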

Profile data sizes

One interesting thing about this is how big the training artifacts are. Our training script resulted in these files:

Artifact           Size
Instrumentation    29 MB
LBR samples        15 MB
Raw samples        9.7 MB

The LBR data was distilled down (by llvm-profdata merge) from eight 12 MB files, and the raw sample data from twelve 35 MB raw profile files (each of which came from a ~200 MB perf.data file).

Results Please!

So now that we've done our training, built fresh binaries with the training data, and quiesced our machine, we can run our benchmarks. We use hyperfine to run each one 3 times.
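The hyperfine invocation is along these lines (binary paths and the benchmark repo are illustrative; each variant is just a separately built rust-analyzer):

hyperfine --runs 3 --export-json results.json \
  './stock/rust-analyzer analysis-stats -q --run-all-ide-things <benchmark-repo>' \
  './ipgo/rust-analyzer analysis-stats -q --run-all-ide-things <benchmark-repo>'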

[benchmark result charts]

First, on the nomenclature:

  • ipgo means instrumentation PGO
  • lbrpgo is a build trained on LBR sample data
  • spgo is a build trained on raw instruction pointer sample data
  • stock is built with no PGO
  • fat vs thin is the argument to CARGO_PROFILE_RELEASE_LTO, see here

Raw results and scripts for this experiment can be found here.

Other notables

There are ~1600 #[inline] declarations in the rust-analyzer local crates, so whatever magic PGO is doing is on top of some significant manual inlining (although a good chunk of these are in generated code).

Fat LTO builds take roughly 2x the time of thin builds and use way more memory; for projects larger than rust-analyzer, fat is probably a non-starter. However, if binary size is a concern, fat LTO binaries are roughly 25% smaller than thin LTO binaries. So fat builds are small and thin builds are large. Your mileage may vary!

Conclusions

Clearly instrumentation PGO + thin LTO is the winner. For all the others (LBR, raw sampling, and stock), using fat LTO helps. It's almost as if fat linking undoes some of the instrumentation PGO magic, but clearly rust-analyzer is already shipping the right combination. Why is PGO with sampling data no better than stock? That is the million dollar question. When I did this experiment with much smaller training runs the PGO builds were actually worse, so having an adequate amount of sample data seems important. But why bother if you can't beat stock? And most sadly, spgothin actually makes performance worse!

Great questions, but this is already too long, so we're gonna leave them for another blog where we will dive into:

  1. What's in these profdata files exactly?
  2. What parts of the instrumentation data contribute to the gains? Is it purely the weighted edge graph or is it the other stuff?
  3. Does sample profiling have any hope of catching instrumentation PGO, or do things like value profiling and virtual table profiling (which instrumentation profiling has and sample profiling doesn't) drive most of the gains?
  4. Can sample profiling be improved to include or make up some of this extra data? What's the delta between the instrumentation based edge weights and the sampling based edge weights?
  5. Exactly which compiler optimizations account for ipgo's gains?
  6. The CSSPGO paper 2 talks about using perf record -g --call-graph fp and disabling frame pointer omission, does doing that improve things? Also need to explore CSSPGO's combined LBR and stack approach.
  7. Why were the LBR and sample profdata artifacts so much smaller?
  8. How will Batman escape this time?
  9. What is the weight of an unladen swallow?

Stay tuned! This is an open research problem we're working on, if you have any thoughts or comments please let us know!
