The Cross-Zone Traffic Problem That Was Costing Us a Fortune
At Polar Signals, we run our infrastructure on Google Cloud Platform with a multi-zone Thanos deployment for high availability. We continuously run kubezonnet to monitor cross-zone traffic, since GCP charges 1 cent per gigabyte for it.
After we successfully rolled out our new database, Thanos accounted for by far the largest share of our cross-zone traffic, and it was all coming from one place: SLO monitoring with Pyrra.
Why Thanos was causing so much cross-zone traffic
Every time a Thanos Ruler evaluated a Service Level Objective (SLO) recording rule, it ran queries through the Thanos querier, which needed to fetch millions of raw samples from each Prometheus instance via remote-read and historical data from the Thanos Store. With high-cardinality gRPC metrics tracking thousands of service endpoints, we were essentially streaming gigabytes of data across zones every few minutes (every 2.5 minutes, to be precise) for each SLO evaluation.
We needed a solution that would keep data processing close to where the data lived - on the Prometheus instances themselves - rather than constantly shuffling raw samples between zones. This is the story of how we reduced cross-zone traffic by 90% while also dramatically improving query performance.
The Challenge: When SLOs Meet High Cardinality
SLOs are critical for modern reliability engineering. They help teams quantify and track the reliability of their services. However, calculating SLOs over long time windows with high-cardinality metrics presents a significant computational challenge.
Let's look at one SLO from our production environment as an example of what we were dealing with:
- 340 time series from grpc_server_handled_total{job="api", grpc_service="parca.profilestore.v1alpha1.ProfileStoreService"}
- 4-week observation windows for comprehensive reliability tracking
- Up to 3x network bandwidth consumption as raw samples flowed from each Prometheus and Thanos Store to each querier
- Significant cross-zone data transfer costs over time on GCP
- Increased CPU and memory usage on the Thanos querier, Thanos Store, and Prometheus nodes as they processed all those raw samples
The recording rules looked deceptively simple:
# Calculate availability over 4 weeks
1 -
  sum(
    increase(grpc_server_handled_total{grpc_code=~"Aborted|Unavailable|Internal|Unknown|Unimplemented|DataLoss",grpc_method="WriteRaw",grpc_service="parca.profilestore.v1alpha1.ProfileStoreService",job="api",namespace="api"}[4w])
  )
/
  sum(
    increase(grpc_server_handled_total{grpc_service="parca.profilestore.v1alpha1.ProfileStoreService",job="api"}[4w])
  )
But under the hood, these queries were processing millions of raw samples across the entire 4-week window for 340 time series.
On a personal note, I really wish Thanos handled this case a lot better. I imagine Thanos could do quite a lot more to push aggregations down to its components and to Prometheus, sparing us from implementing everything we're about to discuss.
While there have been extensive efforts with a new Thanos PromQL query engine, as far as I am aware, it still doesn't fully solve the underlying problem.
Understanding the Root Cause: The Data Transfer Problem
The fundamental issue lies in how Prometheus and Thanos evaluate range queries over extended periods, and more importantly, where that processing happens.
When you write increase(metric[4w]), here's what was happening in our architecture:
- Prometheus scrapes metrics and stores them locally (15-second intervals)
- Thanos Sidecar uploads raw samples from each Prometheus to object storage
- Thanos Querier needs to evaluate the 4-week range query, so it:
  - Fetches all raw samples from object storage via the Thanos Store
  - Fetches all recent raw samples from Prometheus HA pairs via remote-read (cross-zone traffic)
  - Transfers them to the querier node (cross-zone traffic)
  - Processes millions of samples to calculate the increase
  - Aggregates the results
With a 15-second scrape interval over 4 weeks, that's at least 161,280 samples per series (4 × 7 × 24 × 3,600 seconds ÷ 15 seconds per sample) that need to flow from storage to the querier. It can be even higher when samples overlap and are deduplicated by the Thanos querier. With 340 time series, we were processing approximately 55 million samples and transferring gigabytes of data across zones for each evaluation cycle (at a cost of $0.01 per gigabyte).
Stepping back to look at the architecture as a whole, we were doing all the computation far from where the data lived, which necessitated massive data transfers.
The Solution: Subquery Optimization
Earlier this year, in March 2025, Wikimedia opened an issue about Pyrra's SLO performance at their scale.
We developed a solution that fundamentally changes how these long-range queries are evaluated. Instead of processing raw samples across the entire window, we use Prometheus subqueries to pre-aggregate data into manageable chunks.
The Magic Formula
The optimization transforms queries from:
increase(metric[4w])
Into:
sum_over_time(increase(metric[5m])[4w:5m])
This small change has profound implications:
- Pre-computation: The increase(metric[5m]) is evaluated as a recording rule every 30 seconds on each Prometheus, examining only the last 5 minutes of data, so we get perfect data locality (see the rewritten query below).
- Efficient aggregation: The sum_over_time(...[4w:5m]) then aggregates these pre-computed 5-minute buckets.
- Dramatic reduction: Instead of 161,280 raw samples, we process just 8,064 pre-aggregated values (4 weeks × 7 days × 24 hours × 12 five-minute buckets per hour).
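Applied to the availability rule from earlier, the rewritten expression looks roughly like this. I'm writing it inline for illustration; in practice the inner increase(...[5m]) lives in a recording rule on each Prometheus, and the exact query Pyrra generates may differ:
1 -
  sum(
    sum_over_time(increase(grpc_server_handled_total{grpc_code=~"Aborted|Unavailable|Internal|Unknown|Unimplemented|DataLoss",grpc_method="WriteRaw",grpc_service="parca.profilestore.v1alpha1.ProfileStoreService",job="api",namespace="api"}[5m])[4w:5m])
  )
/
  sum(
    sum_over_time(increase(grpc_server_handled_total{grpc_service="parca.profilestore.v1alpha1.ProfileStoreService",job="api"}[5m])[4w:5m])
  )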
Experimental implementation in Pyrra
We added a new configuration option to make this optimization opt-in:
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: profilestore-errors
spec:
  target: "99.9"
  window: 4w
  performanceOverAccuracy: true # Enable the optimization
  indicator:
    ratio:
      errors:
        metric: grpc_server_handled_total{job="api",grpc_service="parca.profilestore.v1alpha1.ProfileStoreService",grpc_code!="OK"}
      total:
        metric: grpc_server_handled_total{job="api",grpc_service="parca.profilestore.v1alpha1.ProfileStoreService"}
When performanceOverAccuracy is enabled, Pyrra automatically rewrites the queries to use the subquery pattern, creating two sets of recording rules, one for Thanos and one for Prometheus.
Leveraging Prometheus and Thanos
One of the key insights was to split the work between Prometheus and Thanos based on their strengths:
Prometheus
- Handles the increase([5m]) calculations on fresh data
- These queries are local and lightning-fast
- Continuously creates pre-aggregated 5-minute buckets
Thanos
- Processes the [4w:5m] subqueries over pre-computed buckets
- Instead of fetching raw samples from object storage, it works with compact, pre-aggregated data
- Reduces network traffic and storage I/O dramatically
This architectural split is crucial for performance. It's the difference between:
- Before: Thanos fetching millions of raw samples from object storage
- After: Thanos fetching thousands of pre-computed values
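To make this split concrete, here is a minimal sketch of what the two rule groups could look like. The rule names (grpc_server_handled:increase5m, profilestore:availability:4w) and the grouping are illustrative assumptions; the rules Pyrra actually generates may differ:
# Loaded by each Prometheus: cheap and local, only looks at the last 5 minutes.
groups:
  - name: profilestore-errors-increase
    interval: 30s
    rules:
      - record: grpc_server_handled:increase5m # hypothetical rule name
        expr: |
          sum by (grpc_code) (
            increase(grpc_server_handled_total{job="api",grpc_service="parca.profilestore.v1alpha1.ProfileStoreService"}[5m])
          )
---
# Loaded by the Thanos Ruler: aggregates the pre-computed 5-minute buckets.
groups:
  - name: profilestore-errors-availability
    rules:
      - record: profilestore:availability:4w # hypothetical rule name
        expr: |
          1 -
            sum(sum_over_time(grpc_server_handled:increase5m{grpc_code!="OK"}[4w:5m]))
          /
            sum(sum_over_time(grpc_server_handled:increase5m[4w:5m]))
The first group runs on each Prometheus and the second on the Thanos Ruler, so raw samples never have to leave the zone they were scraped in.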
The Math Behind the Magic
Let's break down the sample reduction with concrete numbers:
With a 15-second scrape interval:
- 5 minutes of raw data: 300 seconds ÷ 15 seconds = 20 samples
- 5-minute pre-aggregated bucket: 1 value
This represents a 20:1 reduction in the data that needs to be processed for historical queries.
Performance Results
At Polar Signals, we saw immediate benefits in reduced cross-zone traffic and CPU usage.
Infrastructure Impact
After rolling out the change, we saw an almost 20x reduction in traffic!
Simultaneously, fewer samples being sent across the wire resulted in a nice drop in CPU and memory usage!
I'm hopeful we can get the feature merged into Pyrra sooner rather than later. It will be exciting to have the Wikimedia Foundation validate this approach with their high-cardinality Istio metrics.
The Trade-off: Accuracy vs. Performance
Engineering is about trade-offs, and this optimization is no exception. Our testing revealed approximately a 1% difference in accuracy compared to the ground-truth calculations. This small discrepancy occurs because:
- The 5-minute buckets might not perfectly align with counter resets
- Interpolation at bucket boundaries can introduce minor variations
- The subquery evaluation uses fixed steps that might not capture all nuances
Given that SLOs are often in the 99% range and above, we don't feel comfortable enabling this feature by default. For most use cases, paying for the extra traffic and compute in exchange for full accuracy is the better choice.
However, we made this an opt-in feature because we believe in giving users control over their trade-offs, and in some cases, as discussed above, it's worth paying the 1% accuracy penalty for availability reporting.
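If you want to quantify the accuracy trade-off on your own data before opting in, one rough way to do it (an ad-hoc spot check, not something Pyrra does for you) is to evaluate both forms side by side and look at the relative difference, for example:
(
  sum(increase(grpc_server_handled_total{job="api"}[4w]))
  -
  sum(sum_over_time(increase(grpc_server_handled_total{job="api"}[5m])[4w:5m]))
)
/
sum(increase(grpc_server_handled_total{job="api"}[4w]))
Keep in mind that this also evaluates the expensive exact form, so it's only suitable as an occasional check rather than as a recording rule.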
Alerting is unchanged
It's important to point out that the multi-burn-rate alerts are NOT affected by any of these changes! The entire alerting pipeline is left untouched; everything discussed in this post applies to availability reporting only.
Looking Forward
This optimization is currently being tested in our production environment at Polar Signals with excellent results. The implementation is available in pull request #1607 and is currently on a feature branch awaiting final tweaks before being merged into Pyrra. Once released, this feature will be available to the entire Pyrra community.
At Polar Signals, we believe that performance shouldn't be a barrier to comprehensive observability. Whether you're monitoring a handful of services or thousands, your SLO calculations should be fast, reliable, and resource-efficient.
Coming Soon
Once this feature is released, you'll be able to enable it in your Pyrra SLO definitions with the performanceOverAccuracy flag.
spec:
  performanceOverAccuracy: true # Coming soon to Pyrra
In the meantime, you can follow the progress on PR #1607 or try the feature branch's docker image if you're comfortable testing unreleased code.