Profiling NVIDIA CUDA in Kubernetes
The easiest way to get started profiling CUDA in Kubernetes
In our previous post, we introduced our open-source NVIDIA CUDA profiler for continuous production monitoring. It works by having the CUDA workload load a shared library (libparcagpucupti.so) into the application process using the CUDA_INJECTION64_PATH environment variable.
The question is: how do you get that library into your application's existing container without rebuilding it? After all, it would be nice to try the profiler for the first time without having to change the build process.
The answer on Kubernetes: init containers and an `emptyDir` volume.
Prerequisites
This assumes you already have the agent/profiler running on Kubernetes as described in the in-product documentation, and have added the --instrument-cuda-launch flag. This only takes a few minutes, so if you haven't already, now is the time for a free 14-day trial (no credit card required).
How It Works
An init container copies the profiler shared library to a shared volume. The main container mounts that volume and sets CUDA_INJECTION64_PATH to point at the library. Profit!
Three pieces:
- An init container that copies `libparcagpucupti.so` out of the profiler image into a shared volume
- An `emptyDir` volume mounted by both the init container and the main container
- The `CUDA_INJECTION64_PATH` environment variable on the main container, pointing at the library on that volume
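Those three pieces map onto three stanzas of a pod spec. Here is a minimal sketch; the image names and the library's location inside the profiler image are placeholders, not the official ones:

```yaml
# Piece 1: the init container copies the library into the shared volume.
initContainers:
  - name: copy-profiler-lib
    image: example.com/parca-gpu-profiler:latest   # placeholder image
    command: ["cp", "/libparcagpucupti.so", "/profiler-lib/"]
    volumeMounts:
      - name: profiler-lib
        mountPath: /profiler-lib

# Piece 2: an emptyDir volume shared by the init and main containers.
volumes:
  - name: profiler-lib
    emptyDir: {}

# Piece 3: the main container mounts the volume and points CUDA at the library.
containers:
  - name: app
    image: example.com/my-cuda-app:latest          # placeholder image
    env:
      - name: CUDA_INJECTION64_PATH
        value: /profiler-lib/libparcagpucupti.so
    volumeMounts:
      - name: profiler-lib
        mountPath: /profiler-lib
```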
A Complete Example
Let's look at an example: PyTorch MNIST training. This CronJob runs every 10 minutes, profiling included:
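As a sketch of what such a CronJob could look like (the container images and the training command are placeholders, not the exact manifest):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pytorch-mnist
spec:
  schedule: "*/10 * * * *"   # run every 10 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          # Init container stages the profiler library on the shared volume.
          initContainers:
            - name: copy-profiler-lib
              image: example.com/parca-gpu-profiler:latest   # placeholder image
              command: ["cp", "/libparcagpucupti.so", "/profiler-lib/"]
              volumeMounts:
                - name: profiler-lib
                  mountPath: /profiler-lib
          # Training container loads the library via CUDA_INJECTION64_PATH.
          containers:
            - name: train
              image: example.com/pytorch-mnist:latest        # placeholder image
              command: ["python", "main.py"]                 # placeholder command
              env:
                - name: CUDA_INJECTION64_PATH
                  value: /profiler-lib/libparcagpucupti.so
              volumeMounts:
                - name: profiler-lib
                  mountPath: /profiler-lib
              resources:
                limits:
                  nvidia.com/gpu: 1
          volumes:
            - name: profiler-lib
              emptyDir: {}
```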
When this job runs, the init container copies the library first. Then PyTorch starts. CUDA loads our instrumentation automatically. Every kernel launch, memory transfer, and synchronization event gets captured.
Done. Your next pod starts with GPU profiling enabled.
Works With Any CUDA Application
This pattern works with any CUDA application. TensorFlow, JAX, custom C++ code, even closed-source binaries. The CUDA runtime loads injected libraries transparently.
It can profile:
- ML training jobs (PyTorch, TensorFlow, JAX)
- Inference servers (TensorRT, ONNX Runtime, vLLM)
- Scientific computing workloads (molecular dynamics, climate modeling)
- Custom CUDA applications
If it uses CUDA, it works.
Verifying It Works
Check your application logs for CUPTI initialization messages. Or verify the environment directly:
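For example, something along these lines (the pod and container names are placeholders):

```
# Confirm the library landed on the shared volume
kubectl exec my-pod -c app -- ls -l /profiler-lib/libparcagpucupti.so

# Confirm the environment variable is set in the main container
kubectl exec my-pod -c app -- env | grep CUDA_INJECTION64_PATH
```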
You should see the library file and the environment variable.
Example Data
Once the library has been successfully loaded, you can see the actual time spent in CUDA functions on the GPU.
What are we seeing here? The widest call stack shows PyTorch's autograd system executing numerous small kernels: convolutions, activations, gradient computations, and memory operations.
This data can then inform decisions about batch sizing, operator fusion, or even changing the training approach altogether. We'll cover more detailed, production-relevant cases in future blog posts.
Wrapping Up
Try it out and let us know what you discover! Join our Discord if you have questions or run into issues.
Want to learn more about the GPU profiling architecture? Check out our announcement post for details on how we use NVIDIA CUPTI, Linux USDT probes, and eBPF to make it all happen.
Also keep an eye out, as we have lots of new features coming up for GPU Profiling!