The Cross-Zone Traffic Problem That Was Costing Us a Fortune
At Polar Signals, we run our infrastructure on Google Cloud Platform with a multi-zone Thanos deployment for high availability. We continuously run kubezonnet to monitor cross-zone traffic, since GCP charges 1 cent per gigabyte for it.
After we successfully rolled out our new database, Thanos accounted for by far the largest share of our cross-zone traffic, and it was all coming from one place: SLO monitoring with Pyrra.
Why Thanos was causing so much cross-zone traffic
Every time a Thanos Ruler evaluated a Service Level Objective (SLO) recording rule, it ran queries through the Thanos querier, which needed to fetch millions of raw samples from each Prometheus instance via remote-read and historical data from the Thanos Store. With high-cardinality gRPC metrics tracking thousands of service endpoints, we were essentially streaming gigabytes of data across zones every few minutes (every 2.5 minutes, to be precise) for each SLO evaluation.
We needed a solution that would keep data processing close to where the data lived - on the Prometheus instances themselves - rather than constantly shuffling raw samples between zones. This is the story of how we reduced cross-zone traffic by 90% while also dramatically improving query performance.
The Challenge: When SLOs Meet High Cardinality
SLOs are critical for modern reliability engineering. They help teams quantify and track the reliability of their services. However, calculating SLOs over long time windows with high-cardinality metrics presents a significant computational challenge.
Let's look at one SLO from our production environment as an example of what we were dealing with:
- 340 time series from grpc_server_handled_total{job="api", grpc_service="parca.profilestore.v1alpha1.ProfileStoreService"}
- 4-week observation windows for comprehensive reliability tracking
- Up to 3x network bandwidth consumption as raw samples flowed from each Prometheus and Thanos Store to each querier
- Significant cross-zone data transfer costs over time on GCP
- Increased CPU and memory usage on the Thanos querier, Thanos Store, and Prometheus nodes as they processed all those raw samples
The recording rules looked deceptively simple:
# Calculate availability over 4 weeks
1 -
  sum(
    increase(grpc_server_handled_total{grpc_code=~"Aborted|Unavailable|Internal|Unknown|Unimplemented|DataLoss",grpc_method="WriteRaw",grpc_service="parca.profilestore.v1alpha1.ProfileStoreService",job="api",namespace="api"}[4w])
  )
/
  sum(
    increase(grpc_server_handled_total{grpc_service="parca.profilestore.v1alpha1.ProfileStoreService",job="api"}[4w])
  )
But under the hood, these queries were processing millions of raw samples across the entire 4-week window for 340 time series.
On a personal note, I really wish Thanos handled this case a lot better. I imagine Thanos could do quite a lot more to push aggregations down to its components and to Prometheus, sparing us from implementing everything we're about to discuss.
While there have been extensive efforts with a new Thanos PromQL query engine, as far as I am aware, it still doesn't fully solve the underlying problem.
Understanding the Root Cause: The Data Transfer Problem
The fundamental issue lies in how Prometheus and Thanos evaluate range queries over extended periods, and more importantly, where that processing happens.
When you write increase(metric[4w]), here's what was happening in our architecture:
- Prometheus scrapes metrics and stores them locally (15-second intervals)
- Thanos Sidecar uploads raw samples from each Prometheus to object storage
- Thanos Querier needs to evaluate the 4-week range query, so it:
  - Fetches all raw samples from object storage via the Thanos Store
  - Fetches all recent raw samples from Prometheus HA pairs via remote-read (cross-zone traffic)
  - Transfers them to the querier node (cross-zone traffic)
  - Processes millions of samples to calculate the increase
  - Aggregates the results
With a 15-second scrape interval over 4 weeks, that's at least 161,280 samples per series (4 × 7 × 24 × 3,600 seconds ÷ 15 seconds per sample) that need to flow from storage to the querier. It can be even higher when samples overlap and are deduplicated by the Thanos querier. With 340 time series, we were processing approximately 55 million samples and transferring gigabytes of data across zones for each evaluation cycle (at a cost of $0.01 per gigabyte).
Stepping back to look at the architecture as a whole, we were doing all the computation far from where the data lived, which necessitated massive data transfers.
The Solution: Subquery Optimization
Earlier this year, in March 2025, Wikimedia opened an issue about Pyrra's SLO performance at their scale.
We developed a solution that fundamentally changes how these long-range queries are evaluated. Instead of processing raw samples across the entire window, we use Prometheus subqueries to pre-aggregate data into manageable chunks.
The Magic Formula
The optimization transforms queries from:
increase(metric[4w])
Into:
sum_over_time(increase(metric[5m])[4w:5m])
This small change has profound implications:
- Pre-computation: The increase(metric[5m]) is evaluated as a recording rule every 30 seconds on each Prometheus, examining only the last 5 minutes of data, so we get perfect data locality (see the rewritten query below).
- Efficient aggregation: The sum_over_time(...[4w:5m]) then aggregates these pre-computed 5-minute buckets.
- Dramatic reduction: Instead of 161,280 raw samples, we process just 8,064 pre-aggregated values (4 weeks × 7 days × 24 hours × 12 five-minute buckets per hour).
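Applied to the availability rule from earlier, the rewritten expression looks roughly like this. I'm writing it inline for illustration; in practice the inner increase(...[5m]) lives in a recording rule on each Prometheus, and the exact query Pyrra generates may differ:
1 -
  sum(
    sum_over_time(increase(grpc_server_handled_total{grpc_code=~"Aborted|Unavailable|Internal|Unknown|Unimplemented|DataLoss",grpc_method="WriteRaw",grpc_service="parca.profilestore.v1alpha1.ProfileStoreService",job="api",namespace="api"}[5m])[4w:5m])
  )
/
  sum(
    sum_over_time(increase(grpc_server_handled_total{grpc_service="parca.profilestore.v1alpha1.ProfileStoreService",job="api"}[5m])[4w:5m])
  )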
Experimental implementation in Pyrra
We added a new configuration option to make this optimization opt-in:
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: profilestore-errors
spec:
  target: "99.9"
  window: 4w
  performanceOverAccuracy: true # Enable the optimization
  indicator:
    ratio:
      errors:
        metric: grpc_server_handled_total{job="api",grpc_service="parca.profilestore.v1alpha1.ProfileStoreService",grpc_code!="OK"}
      total:
        metric: grpc_server_handled_total{job="api",grpc_service="parca.profilestore.v1alpha1.ProfileStoreService"}
When performanceOverAccuracy is enabled, Pyrra automatically rewrites the queries to use the subquery pattern, creating two sets of recording rules, one for Thanos and one for Prometheus.
Leveraging Prometheus and Thanos
One of the key insights was to split the work between Prometheus and Thanos based on their strengths:
Prometheus
- Handles the increase([5m]) calculations on fresh data
- These queries are local and lightning-fast
- Continuously creates pre-aggregated 5-minute buckets
Thanos
- Processes the [4w:5m] subqueries over pre-computed buckets
- Instead of fetching raw samples from object storage, it works with compact, pre-aggregated data
- Reduces network traffic and storage I/O dramatically
This architectural split is crucial for performance. It's the difference between:
- Before: Thanos fetching millions of raw samples from object storage
- After: Thanos fetching thousands of pre-computed values
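To make this split concrete, here is a minimal sketch of what the two rule groups could look like. The rule names (grpc_server_handled:increase5m, profilestore:availability:4w) and the grouping are illustrative assumptions; the rules Pyrra actually generates may differ:
# Loaded by each Prometheus: cheap and local, only looks at the last 5 minutes.
groups:
  - name: profilestore-errors-increase
    interval: 30s
    rules:
      - record: grpc_server_handled:increase5m # hypothetical rule name
        expr: |
          sum by (grpc_code) (
            increase(grpc_server_handled_total{job="api",grpc_service="parca.profilestore.v1alpha1.ProfileStoreService"}[5m])
          )
---
# Loaded by the Thanos Ruler: aggregates the pre-computed 5-minute buckets.
groups:
  - name: profilestore-errors-availability
    rules:
      - record: profilestore:availability:4w # hypothetical rule name
        expr: |
          1 -
            sum(sum_over_time(grpc_server_handled:increase5m{grpc_code!="OK"}[4w:5m]))
          /
            sum(sum_over_time(grpc_server_handled:increase5m[4w:5m]))
The first group runs on each Prometheus and the second on the Thanos Ruler, so raw samples never have to leave the zone they were scraped in.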
The Math Behind the Magic
Let's break down the sample reduction with concrete numbers:
With a 15-second scrape interval:
- 5 minutes of raw data: 300 seconds ÷ 15 seconds = 20 samples
- 5-minute pre-aggregated bucket: 1 value
This represents a 20:1 reduction in the data that needs to be processed for historical queries.
Performance Results
At Polar Signals, we saw immediate benefits in reduced cross-zone traffic and CPU usage.
Infrastructure Impact
After rolling out the change, we saw an almost 20x reduction in traffic!
Simultaneously, fewer samples being sent across the wire resulted in a nice drop in CPU and memory usage!
I'm hopeful we can get the feature merged into Pyrra sooner rather than later. It will be exciting to have the Wikimedia Foundation validate this approach with their high-cardinality Istio metrics.
The Trade-off: Accuracy vs. Performance
Engineering is about trade-offs, and this optimization is no exception. Our testing revealed approximately a 1% difference in accuracy compared to the ground-truth calculations. This small discrepancy occurs because:
- The 5-minute buckets might not perfectly align with counter resets
- Interpolation at bucket boundaries can introduce minor variations
- The subquery evaluation uses fixed steps that might not capture all nuances
Given that SLOs are often in the 99% range and above, we don't feel comfortable enabling this feature by default. For most use cases, paying for the extra traffic and compute in exchange for full accuracy is the better choice.
However, we made this an opt-in feature because we believe in giving users control over their trade-offs, and in some cases, as discussed above, it's worth paying the 1% accuracy penalty for availability reporting.
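If you want to quantify the accuracy trade-off on your own data before opting in, one rough way to do it (an ad-hoc spot check, not something Pyrra does for you) is to evaluate both forms side by side and look at the relative difference, for example:
(
  sum(increase(grpc_server_handled_total{job="api"}[4w]))
  -
  sum(sum_over_time(increase(grpc_server_handled_total{job="api"}[5m])[4w:5m]))
)
/
sum(increase(grpc_server_handled_total{job="api"}[4w]))
Keep in mind that this also evaluates the expensive exact form, so it's only suitable as an occasional check rather than as a recording rule.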
Alerting is unchanged
It's important to point out that the multi-burn-rate alerts are NOT affected by any of these changes! The entire alerting pipeline is left untouched; everything discussed in this post applies to availability reporting only.
Looking Forward
This optimization is currently being tested in our production environment at Polar Signals with excellent results. The implementation is available in pull request #1607 and is currently on a feature branch awaiting final tweaks before being merged into Pyrra. Once released, this feature will be available to the entire Pyrra community.
At Polar Signals, we believe that performance shouldn't be a barrier to comprehensive observability. Whether you're monitoring a handful of services or thousands, your SLO calculations should be fast, reliable, and resource-efficient.
Coming Soon
Once this feature is released, you'll be able to enable it in your Pyrra SLO definitions with the performanceOverAccuracy flag.
spec:
  performanceOverAccuracy: true # Coming soon to Pyrra
In the meantime, you can follow the progress on PR #1607 or try the feature branch's docker image if you're comfortable testing unreleased code.