In our previous post, we introduced our open-source NVIDIA CUDA profiler for continuous production monitoring. It works by having the CUDA workload load a shared library (libparcagpucupti.so) into the application process using the CUDA_INJECTION64_PATH environment variable.
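Outside of a container, that mechanism is just an environment variable pointing at the library before the process starts. A minimal sketch (the library path and script name here are purely illustrative):

# Illustrative only: the path and script will differ in your setup
CUDA_INJECTION64_PATH=/usr/local/lib/libparcagpucupti.so python3 train.py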
The question is: how do you get that library into your application's existing container without rebuilding it? After all, it would be nice not to have to change the build process just to try the profiler for the first time.
The answer on Kubernetes is: Init containers and an emptyDir volume.
Prerequisites
This assumes you already have the agent/profiler running on Kubernetes as described in the in-product documentation, and have added the --instrument-cuda-launch flag. This only takes a few minutes, so if you haven't already, now is the time for a free 14-day trial (no credit card required).
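As a rough sketch of what that prerequisite looks like, the flag goes into the agent's container arguments in whatever manifest you already deploy it with (the surrounding fields are whatever the in-product documentation gave you, so only the added line is shown):

# Sketch: add this to the agent container's args in your existing agent manifest
args:
  - --instrument-cuda-launch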
How It Works
An init container copies the profiler shared library to a shared volume. The main container mounts that volume and sets CUDA_INJECTION64_PATH to point at the library. Profit!
Three pieces:
# 1. Init container copies the library
initContainers:
  - name: lib-installer
    image: ghcr.io/parca-dev/parcagpu:20251020-0297094
    command: [sh, -c]
    args:
      - cp /usr/local/lib/libparcagpucupti.so /var/lib/parca/gpu/libparcagpucupti.so
    volumeMounts:
      - name: libparcagpu
        mountPath: /var/lib/parca/gpu
# 2. Main container mounts the volume and sets the environment variable
containers:
  - name: pytorch
    image: nvcr.io/nvidia/pytorch:25.03-py3
    env:
      - name: CUDA_INJECTION64_PATH
        value: /var/lib/parca/gpu/libparcagpucupti.so
    volumeMounts:
      - name: libparcagpu
        mountPath: /var/lib/parca/gpu
        readOnly: true
# 3. Shared volume
volumes:
  - name: libparcagpu
    emptyDir: {}
A Complete Example
Let's look at an example: PyTorch MNIST training. This CronJob runs every 10 minutes, profiling included:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pytorch
  namespace: polarsignals
spec:
  schedule: '*/10 * * * *'
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            name: pytorch-mnist-trainer
        spec:
          # Init container installs the profiler library
          initContainers:
            - name: lib-installer
              image: ghcr.io/parca-dev/parcagpu:20251020-0297094
              command: [sh, -c]
              args:
                - |
                  cp /usr/local/lib/libparcagpucupti.so /var/lib/parca/gpu/libparcagpucupti.so
              volumeMounts:
                - name: libparcagpu
                  mountPath: /var/lib/parca/gpu
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:25.03-py3
              command: [/bin/bash, -c]
              args:
                - |
                  apt-get update && apt-get install -y wget unzip
                  wget -O pytorch-examples.zip https://github.com/pytorch/examples/archive/refs/heads/main.zip
                  unzip pytorch-examples.zip
                  python3 examples-main/mnist/main.py
              # Configure CUDA to use the profiler
              env:
                - name: CUDA_INJECTION64_PATH
                  value: /var/lib/parca/gpu/libparcagpucupti.so
              # Mount the shared volume with the library
              volumeMounts:
                - name: libparcagpu
                  mountPath: /var/lib/parca/gpu
                  readOnly: true
              # Request GPU resources
              resources:
                limits:
                  nvidia.com/gpu: "1"
          # Shared volume for library transfer
          volumes:
            - name: libparcagpu
              emptyDir: {}
          restartPolicy: Never
          tolerations:
            - effect: NoSchedule
              key: nvidia.com/gpu
              operator: Equal
              value: present
When this job runs, the init container copies the library first. Then PyTorch starts. CUDA loads our instrumentation automatically. Every kernel launch, memory transfer, and synchronization event gets captured.
Done. Your next pod starts with GPU profiling enabled.
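If you don't want to wait for the next scheduled run, you can apply the manifest and trigger a one-off Job from the CronJob by hand (the file name and job name below are just placeholders):

kubectl apply -f pytorch-cronjob.yaml
# Create an immediate Job from the CronJob template
kubectl create job --from=cronjob/pytorch pytorch-manual-run -n polarsignals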
Works With Any CUDA Application
This pattern works with any CUDA application: TensorFlow, JAX, custom C++ code, even closed-source binaries. The CUDA runtime loads the injected library transparently.
It can profile:
- ML training jobs (PyTorch, TensorFlow, JAX)
- Inference servers (TensorRT, ONNX Runtime, vLLM)
- Scientific computing workloads (molecular dynamics, climate modeling)
- Custom CUDA applications
If it uses CUDA, it works.
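The same three pieces can also be bolted onto an already-running workload without touching its original manifest. Here's a sketch using kubectl's strategic merge patch against a hypothetical Deployment and container both named vllm (names and workload are purely illustrative):

# Strategic merge: the init container, volume, and env var are merged into
# the existing pod spec; the "vllm" container is matched by name.
kubectl patch deployment vllm --patch '
spec:
  template:
    spec:
      initContainers:
        - name: lib-installer
          image: ghcr.io/parca-dev/parcagpu:20251020-0297094
          command: [sh, -c]
          args:
            - cp /usr/local/lib/libparcagpucupti.so /var/lib/parca/gpu/libparcagpucupti.so
          volumeMounts:
            - name: libparcagpu
              mountPath: /var/lib/parca/gpu
      containers:
        - name: vllm
          env:
            - name: CUDA_INJECTION64_PATH
              value: /var/lib/parca/gpu/libparcagpucupti.so
          volumeMounts:
            - name: libparcagpu
              mountPath: /var/lib/parca/gpu
              readOnly: true
      volumes:
        - name: libparcagpu
          emptyDir: {}
'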
Verifying It Works
Check your application logs for CUPTI initialization messages. Or verify the environment directly:
kubectl exec -it <pod-name> -- bash -c 'echo $CUDA_INJECTION64_PATH && ls -la /var/lib/parca/gpu/'
You should see the library file and the environment variable.
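For an extra check that the library was actually mapped into the CUDA process, you can grep the process memory maps (this assumes the container image has a shell and grep available):

# Lists the maps files of any process that has the library loaded
kubectl exec -it <pod-name> -- bash -c 'grep -l libparcagpucupti /proc/[0-9]*/maps 2>/dev/null'

If nothing matches, the application may simply not have initialized CUDA yet.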
Example Data
Once the library has been loaded successfully, you can start seeing the actual time spent in CUDA functions on the GPU.
What are we seeing here? The widest call stack shows PyTorch's autograd system executing numerous small kernels: convolutions, activations, gradient computations, and memory operations.
This data can then be used to inform batch sizing, decide whether operator fusion is worthwhile, or perhaps rethink the training setup altogether. We'll write future blog posts on more detailed and more production-relevant cases.
Wrapping Up
Try it out and let us know what you discover! Join our Discord if you have questions or run into issues.
Want to learn more about the GPU profiling architecture? Check out our announcement post for details on how we use NVIDIA CUPTI, Linux USDT probes, and eBPF to make it all happen.
Also keep an eye out, as we have lots of new features coming up for GPU Profiling!