How Continuous Profiling reduces Engineering Debt and increases Application Performance

Application performance engineering debt can live a long time without being noticed, adding significant overhead to new feature development, application performance, and the cost of operations.

June 30, 2025

When it comes to application performance, engineering debt is a real issue affecting the efficiency of every enterprise. Agile processes don’t help; if anything, they make it worse.

Engineering debt degrades application performance and drives significantly higher CPU, GPU, and memory consumption without ever becoming a priority, because we are mostly focused on downtime and new features. And Agile processes, which you might expect to be the solution, surprisingly aggravate the issue.

I like how Warren Buffett hires: intelligence, integrity, and energy. Your apps should be held to a similar bar: Does my application run as it is supposed to? Does it do the right things? Does it run as fast as needed? In software terms: make it run, make it right, make it fast.

life of a product manager

As a salesperson with a technical background, I have worked with over a thousand companies throughout my career. I have rarely seen a company say, “We are pausing development of new features and focusing on making our platform more efficient.” To be clear, I am not talking about fixing bugs or a complete refactor/rewrite; a complete revamp usually does happen, because it is treated as a fully planned project. But addressing performance shortcomings and fixing the fundamentals of an ongoing project is rarely seen, and doing it consistently requires the right tools and mindset. Turbopuffer is one example we have seen at Polar Signals of an organization that makes a continuous focus on performance a priority.

Think about the last time you heard, “In the coming epic/sprint, we will make our application 10% faster”.

Why doesn’t this happen? There are several reasons, spanning from financial to cultural. My highlights:

  • Incentives and lack of focus: Consider how long it took DevSecOps to gain traction. Are we writing 100% secure code today? No. Have we improved over time? Definitely yes, thanks to the cybersecurity industry and the billions of dollars poured into it. On the DevOps side, are we improving the performance of our code? Are we even measuring whether our apps get faster, slower, or less reliable at every release, and comparing releases the right way? Does your CI/CD pipeline have any steps for this? (A sketch of what such a step could look like follows this list.)
  • Culture: The majority of engineering teams don’t question application performance unless there is a significant jump. Most of the time, bumping up the cloud bill solves (!) the issue. Performance engineering is also perceived as an overly complex, highly sophisticated skill that isn’t for the majority of developers. In reality, it is similar to the OWASP Top 10: following the most common performance best practices yields the highest returns. Like at S2, where a single-line change doubled throughput and eventually cut CPU usage by 50%.
  • Functional vs non-functional requirements: Defining what a system should do (a functional requirement) is easy. Now imagine a non-functional requirement like “a 20% better UI”. Luckily, some non-functional requirements, such as performance and reliability, are easier to measure.
  • Human bias toward speed over quality: We are outcome-driven creatures and need that constant dopamine boost. More output, more dopamine, success, and promotion, at least on paper. The mantra of Agile processes doesn’t help either, especially when an organization’s performance and agility are measured by the number of releases shipped.
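To make the CI/CD question above concrete, here is a minimal sketch, assuming a Go codebase (the post does not prescribe a language): a standard Go benchmark whose results can be compared between releases with a tool like benchstat. The renderReceipt function is a hypothetical stand-in for whatever hot path you care about.

```go
// release_bench_test.go: a hypothetical benchmark a CI step could run on
// every release and compare against the previous release with benchstat.
package checkout

import (
	"strings"
	"testing"
)

// renderReceipt is a made-up stand-in for a hot path worth tracking.
func renderReceipt(items []string) string {
	var b strings.Builder
	for _, it := range items {
		b.WriteString(it)
		b.WriteString("\n")
	}
	return b.String()
}

// BenchmarkRenderReceipt reports ns/op and allocs/op for comparison.
func BenchmarkRenderReceipt(b *testing.B) {
	items := []string{"coffee", "bagel", "juice"}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		renderReceipt(items)
	}
}
```

A CI step could run `go test -bench=. -count=10 | tee new.txt` and flag the release when `benchstat old.txt new.txt` shows a regression. The exact tooling differs per stack; the point is that the measurement happens on every release.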

➡️ If you've made it this far, you've read about my market thesis for joining Polar Signals.

Traditional observability has three pillars: logs, metrics, and traces. Since this is a personal opinion post, I can share that I have yet to see an organization gain the insight it is looking for from these observability resources alone. One of the most common problems is the disconnect between the code and operations.

This disconnect makes it very hard to answer questions like:

  • “Why are my pods extremely slow to start?”
  • “Does the problem occur at the platform level or the app level?”
  • “What is the line of code that causes the application issue?” or,
  • “What resource consumption does this specific transaction with this transaction ID have?”

Practitioners use Application Performance Monitoring (APM) tools to gain performance visibility into microservices-related issues; however, when granularity beyond microservice communication (essentially network performance) is required, APM tools lack the capabilities.

The missing link is the fourth pillar of observability: profiling. Profiling is nothing new; every major programming language or platform has its own profiler.
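For illustration, this is roughly what a traditional, one-off profiling session looks like in Go with the standard runtime/pprof package; other ecosystems have their own equivalents (perf, async-profiler, py-spy, and so on). The doWork function is a made-up placeholder.

```go
// A minimal, one-off CPU profiling session using Go's built-in profiler.
// The resulting file can be inspected with `go tool pprof cpu.out`.
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	doWork() // the code you actually want to profile
}

// doWork is a made-up placeholder for the interesting part of your program.
func doWork() {
	sum := 0
	for i := 0; i < 100_000_000; i++ {
		sum += i
	}
	_ = sum
}
```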

Using one of these language- or platform-specific profilers means:

  • Analysis is limited to the environment, language, or platform the profiler runs on, missing dependencies and interactions with other elements of the system.
  • The profiler only catches performance and reliability issues while it runs, and profilers are typically run only while code is actively being worked on (debugging, testing, etc.). This also means sporadic events (performance regressions, out-of-memory errors, memory leaks) are missed, and no live issue can be resolved by going back in time.
  • The heavy resource utilization of traditional profilers (around 10% overhead) makes them impractical to run in production.
  • The required instrumentation makes them tricky tools that only a small fraction of developers ever use.

The disruption to traditional profiling, thanks to eBPF, is zero-instrumentation Continuous Profiling that consumes less than 1% CPU. The difference between profiling and Continuous Profiling can be thought of as taking a thread dump for a specific component versus having access to a live thread dump that is saved historically across the entire system, including external components, libraries, and databases, regardless of the programming language or environment they run on.
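For intuition only, here is a toy, in-process approximation of the “always on, saved historically” idea, built from nothing but Go's standard net/http/pprof endpoint and a loop. This is emphatically not how an eBPF-based continuous profiler works (that samples the whole machine from the kernel, with no code changes and far lower overhead); it only illustrates the contrast with taking a single, ad-hoc profile.

```go
// Toy sketch: repeatedly collect CPU profiles from Go's standard pprof
// endpoint and keep them with timestamps, so there is always history to
// go back to. Real continuous profilers do this system-wide via eBPF,
// with no endpoint or code changes required.
package main

import (
	"fmt"
	"io"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"os"
	"time"
)

func main() {
	// The application being profiled exposes the standard pprof endpoint.
	go http.ListenAndServe("localhost:6060", nil)
	time.Sleep(200 * time.Millisecond) // give the listener a moment to start

	for {
		// Each request blocks ~10s while a CPU profile is collected, so the
		// loop effectively covers the whole lifetime of the process.
		resp, err := http.Get("http://localhost:6060/debug/pprof/profile?seconds=10")
		if err != nil {
			fmt.Fprintln(os.Stderr, "collect:", err)
			time.Sleep(10 * time.Second)
			continue
		}
		name := fmt.Sprintf("cpu-%d.pb.gz", time.Now().Unix())
		if f, err := os.Create(name); err == nil {
			io.Copy(f, resp.Body)
			f.Close()
		}
		resp.Body.Close()
	}
}
```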

A crucial point about Continuous Profiling is that it is supplementary to the other three pillars of observability. OpenTelemetry describes the connection between the three traditional pillars and profiling as follows:

Metrics to profiles: You will be able to go from a spike in CPU usage or memory usage to the specific pieces of the code that are consuming that resource.

Traces to profiles: You will be able to understand not just the location of latency across your services, but when that latency is caused by pieces of the code, it will be reflected in a profile attached to a trace or span.

Logs to profiles: Logs often give the context that something is wrong, but profiling will allow you to go from just tracking something (Out Of Memory errors, for example) to seeing exactly which parts of the code are using up memory resources.
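As a small illustration of the “traces to profiles” connection, here is a minimal Go sketch using the standard runtime/pprof label API: CPU samples collected while the labeled function runs carry the trace and span IDs, so a profile can later be filtered down to the code that executed inside a slow span. The IDs below are hypothetical placeholders (in a real service they would come from your tracing library's span context), and this is not a description of any particular vendor's implementation.

```go
// Tag CPU profile samples with trace/span IDs so profiles and traces can
// be correlated. Any active CPU profile of this process records these
// labels alongside the sampled stack traces.
package main

import (
	"context"
	"runtime/pprof"
)

func handleRequest(ctx context.Context, traceID, spanID string) {
	labels := pprof.Labels("trace_id", traceID, "span_id", spanID)
	pprof.Do(ctx, labels, func(ctx context.Context) {
		expensiveWork(ctx) // samples of this work carry the IDs above
	})
}

// expensiveWork is a made-up stand-in for the code behind a slow span.
func expensiveWork(ctx context.Context) {
	sum := 0
	for i := 0; i < 50_000_000; i++ {
		sum += i % 7
	}
	_ = sum
}

func main() {
	// Hypothetical IDs; real ones come from the active span context.
	handleRequest(context.Background(), "4bf92f3577b34da6", "00f067aa0ba902b7")
}
```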

Beyond profiling vs Continuous Profiling, many professionals believe metrics are enough to pinpoint application issues. Unfortunately, function/method-level problems are nowhere near easy to detect via metrics, because each metric visualization is simply an aggregate (typically a mean) of different data points collected at low frequency. All the charts below represent the same mean, yet very different conclusions could be drawn from each. Now try to find the exact line of code from a generic metric that is affected by multiple signals.

Different distributions with the same mean of 10
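To see why the same mean can hide very different behavior, here is a tiny Go sketch with made-up latency numbers: two series that both average 10ms, yet one of them has a tail that would be invisible in a mean-based chart.

```go
// Two latency series with the same mean (10ms) but very different tails.
// The numbers are invented purely for illustration.
package main

import (
	"fmt"
	"sort"
)

func mean(xs []float64) float64 {
	s := 0.0
	for _, x := range xs {
		s += x
	}
	return s / float64(len(xs))
}

func p99(xs []float64) float64 {
	c := append([]float64(nil), xs...)
	sort.Float64s(c)
	return c[int(float64(len(c)-1)*0.99)]
}

func main() {
	steady := make([]float64, 100)
	spiky := make([]float64, 100)
	for i := range steady {
		steady[i] = 10 // every request takes 10ms
		if i < 95 {
			spiky[i] = 5 // most requests take 5ms...
		} else {
			spiky[i] = 105 // ...but 5% take 105ms
		}
	}

	fmt.Printf("steady: mean=%.1fms p99=%.1fms\n", mean(steady), p99(steady))
	fmt.Printf("spiky:  mean=%.1fms p99=%.1fms\n", mean(spiky), p99(spiky))
}
```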

Here are some of the issues that can be detected by Continuous Profiling but not via metric monitoring:

  • Pinpointing CPU/GPU/memory issues down to the line of code, function, or method
    • Example: A single function in a microservice causes 30-40% of CPU usage due to an unoptimized loop, while metrics only show overall high CPU (a sketch of this pattern follows this list).
    • Why metrics miss it: Metrics aggregate data across the system, masking the root cause, and short-lived issues are diluted in aggregation intervals.
  • Thread contention/deadlocks
    • Example: A database connection pool causing threads to block for 500ms due to inadequate slots.
    • Why metrics miss it: Metrics like request latency may spike, but profiling shows thread wait states.
  • Memory Leaks
    • Example: A caching function inadvertently retains references to unused objects, leaking 2MB/hour.
    • Why metrics miss it: Metrics show rising memory usage but cannot link it to specific code.
  • Algorithmic inefficiencies
    • Example: A sorting algorithm degrading performance as dataset size grows, visible in profiling flame graphs.
    • Why metrics miss it: Latency metrics may increase, but the root code path remains unknown.
  • Garbage Collection related memory, performance issues
    • Example: A logging library creating temporary strings for every API call, increasing GC pauses by 30%.
    • Why metrics miss it: GC metrics show overall pause times but not the offending code.
  • Library overhead, monitoring each component’s performance impact separately
    • Example: A JSON serialization library using reflection, adding 50ms per API call.
    • Why metrics miss it: Metrics attribute latency to the service, not the underlying library.
  • Cold start bottlenecks - External Continuous Profiling (only available at Polar Signals)
    • Example: A slow database connection setup adds 2 seconds to serverless function startups.
    • Why metrics miss it: Metrics report invocation duration but not the initialization breakdown. Only an external Continuous Profiler can catch these, because an in-process profiler can only start working once the process is running, and by then it is already too late to measure.
  • Unused/fallback code paths
    • Example: A retry mechanism executing an unoptimized fallback API call during outages.
    • Why metrics miss it: Metrics lack context about code branching.
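Here is the sketch referenced in the first item of the list above: a hypothetical function with an unoptimized loop (quadratic string concatenation) that a CPU profile would attribute to one exact frame, while host-level metrics would only report “high CPU”. The functions and inputs are invented for illustration.

```go
// buildReport re-copies the growing string on every iteration (O(n²) work),
// which shows up as a single hot frame in a flame graph; buildReportFast is
// the kind of small, targeted fix a profile points you to.
package main

import "strings"

func buildReport(rows []string) string {
	out := ""
	for _, r := range rows {
		out += r + "\n" // each += copies the entire string built so far
	}
	return out
}

func buildReportFast(rows []string) string {
	var b strings.Builder
	for _, r := range rows {
		b.WriteString(r)
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	rows := make([]string, 20000)
	for i := range rows {
		rows[i] = "row data"
	}
	_ = buildReport(rows)     // the hotspot a profile would pinpoint
	_ = buildReportFast(rows) // the same output in O(n)
}
```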

In a nutshell, Continuous Profiling fills critical gaps left by metric monitoring by linking system behavior to exact code paths. This bridges the gap between "what" is wrong (metrics) and "why" it’s happening (profiling). It enables teams to resolve issues such as memory leaks, concurrency bottlenecks, and algorithmic inefficiencies that metrics can only detect superficially. By providing line-of-code visibility, it transforms observability from "something is wrong" to "this exact function is the problem," accelerating root-cause analysis and optimization, reducing operational costs, and improving application reliability and user experience.

➡️ If you've made it this far, you've read about my market potential thesis for joining Polar Signals.

The majority of Polar Signals customers focus on improving the performance and reliability of their applications. Cost saving comes as a non-negligible side benefit: increased performance often reduces the bill at the same rate. Seeing a 30% performance improvement within a few hours of adopting Polar Signals is not unheard of.

Some use cases:

  • Companies developing proprietary applications, at any level of maturity, that are modernizing or rewriting existing applications and want to get it right this time.
  • Latency-sensitive applications: High-frequency trading, Banking, E-commerce, gaming platforms.
  • Multi-tenant platforms: Database services, SaaS services, CDNs, and internal developer platforms. Ideal for pinpointing the Noisy Neighbour problem.
  • AI/ML workloads.
  • Edge Computing and IoT: Any environment with a resource constraint.
  • Cost Savings
  • Answering challenging questions, such as the root cause of OOMKills and pod start-time issues, and locating where the real performance issue lies across applications, environments, and platforms.

Thanks for reading till the end.
