Profiling NVIDIA CUDA in Kubernetes
The easiest way to get started profiling CUDA in Kubernetes
In our previous post, we introduced our open-source NVIDIA CUDA profiler for continuous production monitoring. It works by having the CUDA workload load a shared library (libparcagpucupti.so) into the application process using the CUDA_INJECTION64_PATH environment variable.
The question is: how do you get that library into your application's existing container without rebuilding it? After all, it would be nice to try the profiler for the first time without having to change the build process.
The answer on Kubernetes: init containers and an `emptyDir` volume.
Prerequisites
This assumes you already have the agent/profiler running on Kubernetes as described in the in-product documentation, and have added the --instrument-cuda-launch flag. This only takes a few minutes, so if you haven't already, now is the time for a free 14-day trial (no credit card required).
How It Works
An init container copies the profiler shared library to a shared volume. The main container mounts that volume and sets CUDA_INJECTION64_PATH to point at the library. Profit!
Three pieces:
- An init container that copies `libparcagpucupti.so` out of the profiler image into a shared volume
- An `emptyDir` volume mounted by both the init container and the main container
- The `CUDA_INJECTION64_PATH` environment variable on the main container, pointing at the library on that volume
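Those three pieces map onto three stanzas of a pod spec. Here is a minimal sketch; the image names and the library's location inside the profiler image are placeholders, not the official ones:

```yaml
# Piece 1: the init container copies the library into the shared volume.
initContainers:
  - name: copy-profiler-lib
    image: example.com/parca-gpu-profiler:latest   # placeholder image
    command: ["cp", "/libparcagpucupti.so", "/profiler-lib/"]
    volumeMounts:
      - name: profiler-lib
        mountPath: /profiler-lib

# Piece 2: an emptyDir volume shared by the init and main containers.
volumes:
  - name: profiler-lib
    emptyDir: {}

# Piece 3: the main container mounts the volume and points CUDA at the library.
containers:
  - name: app
    image: example.com/my-cuda-app:latest          # placeholder image
    env:
      - name: CUDA_INJECTION64_PATH
        value: /profiler-lib/libparcagpucupti.so
    volumeMounts:
      - name: profiler-lib
        mountPath: /profiler-lib
```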
A Complete Example
Let's look at an example: PyTorch MNIST training. This CronJob runs every 10 minutes, profiling included:
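As a sketch of what such a CronJob could look like (the container images and the training command are placeholders, not the exact manifest):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pytorch-mnist
spec:
  schedule: "*/10 * * * *"   # run every 10 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          # Init container stages the profiler library on the shared volume.
          initContainers:
            - name: copy-profiler-lib
              image: example.com/parca-gpu-profiler:latest   # placeholder image
              command: ["cp", "/libparcagpucupti.so", "/profiler-lib/"]
              volumeMounts:
                - name: profiler-lib
                  mountPath: /profiler-lib
          # Training container loads the library via CUDA_INJECTION64_PATH.
          containers:
            - name: train
              image: example.com/pytorch-mnist:latest        # placeholder image
              command: ["python", "main.py"]                 # placeholder command
              env:
                - name: CUDA_INJECTION64_PATH
                  value: /profiler-lib/libparcagpucupti.so
              volumeMounts:
                - name: profiler-lib
                  mountPath: /profiler-lib
              resources:
                limits:
                  nvidia.com/gpu: 1
          volumes:
            - name: profiler-lib
              emptyDir: {}
```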
When this job runs, the init container copies the library first. Then PyTorch starts. CUDA loads our instrumentation automatically. Every kernel launch, memory transfer, and synchronization event gets captured.
Done. Your next pod starts with GPU profiling enabled.
Works With Any CUDA Application
This pattern works with any CUDA application. TensorFlow, JAX, custom C++ code, even closed-source binaries. The CUDA runtime loads injected libraries transparently.
It can profile:
- ML training jobs (PyTorch, TensorFlow, JAX)
- Inference servers (TensorRT, ONNX Runtime, vLLM)
- Scientific computing workloads (molecular dynamics, climate modeling)
- Custom CUDA applications
If it uses CUDA, it works.
Verifying It Works
Check your application logs for CUPTI initialization messages. Or verify the environment directly:
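For example, something along these lines (the pod and container names are placeholders):

```
# Confirm the library landed on the shared volume
kubectl exec my-pod -c app -- ls -l /profiler-lib/libparcagpucupti.so

# Confirm the environment variable is set in the main container
kubectl exec my-pod -c app -- env | grep CUDA_INJECTION64_PATH
```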
You should see the library file and the environment variable.
Example Data
Once the library has been successfully loaded, you can see the actual time spent in CUDA functions on the GPU.
What are we seeing here? The widest call stack shows PyTorch's autograd system executing numerous small kernels: convolutions, activations, gradient computations, and memory operations.
This data can then inform decisions about batch sizing, operator fusion, or even changing the training approach altogether. We'll cover more detailed, production-relevant cases in future blog posts.
Wrapping Up
Try it out and let us know what you discover! Join our Discord if you have questions or run into issues.
Want to learn more about the GPU profiling architecture? Check out our announcement post for details on how we use NVIDIA CUPTI, Linux USDT probes, and eBPF to make it all happen.
Also keep an eye out, as we have lots of new features coming up for GPU Profiling!