My Journey Optimizing a CUDA Kernel with Polar Signals

For Polar Signals’ latest hackathon, I decided to write a CUDA kernel and use our GPU profiler to make it as fast as I could. I mostly work on our storage layer and have (had?) zero GPU programming experience, so this was a great opportunity to learn more about CUDA programming and GPU performance in general. In this blog post, I’ll walk through how I built a string decompression CUDA kernel and how I thought about performance optimizations at each step.

The Plan

First up was deciding what to build. Following tutorials can be educational but I think the best way to learn is to implement something you actually need in production. Thankfully one of the goals of Vortex, the new file format we’ve been using at Polar Signals, is to ship compressed data to GPUs for high-throughput decompression and fused query execution. These GPU kernels are an active area of development, and there was a need for a kernel to decompress FSST-encoded strings. As a bonus, a lot of the kernel launching scaffolding was already in place, so I could focus specifically on just the kernel.

What is FSST

FSST is a string compression algorithm that learns up to 255 frequently occurring short symbol sequences (up to 8 bytes), and replaces each with a 1-byte code. The 256th code is a special “escape” code that denotes that the next byte in the compressed sequence should be interpreted literally. FSST compression is particularly effective for strings that differ slightly but share common patterns (e.g. URLs, log lines, JSON). Importantly, FSST allows random-access decompression and as an extension, enables strings to be decompressed in parallel since the context required to decompress any string is the offset, length, and the static shared symbol table.

Vortex already has an FSST encoding and implements compression and decompression on the CPU, so I set up a benchmark that decompresses a corpus of 10 million ClickBench-style URLs to use as a reference:

cuda/cuda/fsst/decompress/10M
                        time:   [25.076 ms 25.123 ms 25.150 ms]
                        thrpt:  [34.200 GiB/s 34.237 GiB/s 34.301 GiB/s]

All of these benchmarks were run on an NVIDIA DGX Spark. At 20 CPU cores, this number lines up with the decompression throughput of 1-3GB/s per core cited in the paper.

Writing a kernel is easy, optimizing it is hard

On to CUDA. Vortex uses the cudarc crate with a nice Rust API so it’s easy to copy over the buffers the CUDA kernel should work on. The initial implementation of the kernel shipped over each buffer from the pre-existing encoding from the CPU to the GPU and has each GPU thread decompress a string at a time. Omitting the boilerplate index computation to determine which strings a thread will decompress, the meat of the work is pretty straightforward:

const uint64_t in_end = compressed_offsets[tid + 1];
uint64_t in_pos = compressed_offsets[tid];
uint64_t out_pos = output_offsets[tid];

while (in_pos < in_end) {
    const uint8_t code = codes_bytes[in_pos];
    if (code == 255) {
        // Escape: the next input byte is emitted as a literal.
        output_bytes[out_pos] = compressed_bytes[in_pos + 1];
        in_pos += 2;
        out_pos += 1;
    } else {
        const uint64_t symbol = symbols[code];
        const uint8_t sym_len = symbol_lengths[code];
        for (uint8_t i = 0; i < sym_len; ++i) {
            output_bytes[out_pos + i] = (uint8_t)(symbol >> (8 * i));
        }
        in_pos += 1;
        out_pos += sym_len;
    }
}

Pretty simple stuff. Writing the kernel had been easier than I expected. I was feeling good until I ran the benchmark:

cuda/cuda/fsst/decompress/10M
                        time:   [59.395 ms 59.458 ms 59.514 ms]
                        thrpt:  [14.453 GiB/s 14.466 GiB/s 14.482 GiB/s]
                 change:
                        time:   [+136.49% +137.23% +138.04%] (p = 0.00 < 0.05)
                        thrpt:  [-57.990% -57.848% -57.714%]

I didn’t expect to hit the 270GB/s theoretical DGX Spark memory bandwidth throughput on the first try, but I also didn’t expect the kernel to perform worse than on-CPU decompression.

Profiling CUDA Kernels

This seemed like a good time to dogfood our GPU profiler with PC sampling enabled to see if I could find anything obvious:

Note that all profiles in this post are inverted in order to group by stall reason. The profiles produced by PC sampling are very similar to off-CPU profiles: every sample records why progress isn’t happening. The SASS instructions have an associated stall reason that describes why the warp (group of 32 threads) was stalled. The _not_issued suffix isolates scheduling cycles where the warp scheduler could not issue an instruction at all.

With the help of our MCP server (I highly recommend using it if you’re not familiar with the code you’re profiling), I managed to decipher the two main stall reasons: long_scoreboard where we’re basically waiting for a memory load before (overwhelmingly) an ISETP instruction (integer set predicate; predicate evaluation) on that data dependency and lg_throttle which implies we’re issuing way too many (local or global) memory load/store instructions and are stalled waiting for the instruction queue to drain.

This profile illustrates one of the most important lessons I took away about GPU performance engineering: it’s all about memory. I found this talk by Stephen Jones (NVIDIA System Architect) to be a wonderful explanation of why GPU performance is all about hiding memory latency, or as he puts it: “Where is my data?”

Where is my data?

The long_scoreboard stall reason is a large part of the profile, and reducing these seems like a question of reducing and/or amortizing memory loads, especially in our hot loop. Two candidates stand out: the byte-at-a-time code load from compressed data and the static symbol table accesses.

Packing the compressed data load into as many bytes as possible requires some pointer arithmetic since CUDA does not allow issuing a misaligned multi-byte load. The trick here is to use register scratch space and issue the largest aligned load to get to u64 alignment and load a u64 at a time from then on.

cuda/cuda/fsst/decompress/10M
                        time:   [59.290 ms 60.825 ms 64.445 ms]
                        thrpt:  [13.347 GiB/s 14.141 GiB/s 14.507 GiB/s]
                 change:
                        time:   [-8.7897% -2.8485% +3.2111%] (p = 0.45 > 0.05)
                        thrpt:  [-3.1112% +2.9320% +9.6367%]

Unfortunately, that didn’t seem to move the needle at all, likely because the GPU is fetching a full cache line’s worth of data (64 bytes) regardless of whether one or eight bytes were requested, so these loads seem to already be optimized. This change also seems to introduce some previously unseen instructions like BSSY and BSYNC which are used to handle warp divergence (threads in a warp doing different things). Given that threads in a warp must fundamentally execute in lockstep, these instructions are not a good sign:

The next candidate was putting the symbol table into shared memory, which is cheaper to access than global memory. However, there was no improvement with this change either. As the profile tells us the change seems to have reduced long_scoreboard stalls in exchange for mio_throttles which in GPU architecture is upstream of the lg_throttles and the long_scoreboard stalls, so it looks like we just shifted the bottleneck upstream:

A little bit of research produces a convincing theory: since the symbol reads are inherently random, multiple threads in a warp request different addresses from shared memory and this causes these accesses to be serialized. This particular issue is known as bank conflict latency.

After these changes I came to the conclusion that it’s tough to beat the GPU’s L1 cache-line amortization and random access reads are not a fit for shared memory. I’m very likely missing something, and there might be some optimizations to do around fetching compressed bytes before the hot loop, but at this point I decided to move on to the next bottleneck.

Where is my data? (try number two)

Loads were difficult to optimize (at least to my untrained eye) so the next logical step was moving to optimizing stores. Theoretically, loads are harder to optimize because if you request a specific byte, it is very likely that you will request the next, contiguous byte. GPUs (and many other things) use this information for implicit optimization. Stores are a different story. When you write a byte, the next byte to be written is unknown.

The main local to global memory stores the FSST kernel is doing is writing the decompressed bytes a byte at a time. The optimization for this path is very similar to the aligned loads described above: write the decompressed result byte by byte into registers and emit the widest possible aligned store when enough data has been buffered. The code is here if you’re feeling brave, but the animation below illustrates this idea well enough.

This optimization showed a dramatic improvement and (finally!) beats the CPU implementation.

cuda/cuda/fsst/decompress/10M
                        time:   [10.459 ms 10.928 ms 11.908 ms]
                        thrpt:  [72.234 GiB/s 78.713 GiB/s 82.241 GiB/s]
                 change:
                        time:   [-84.048% -81.531% -77.600%] (p = 0.00 < 0.05)
                        thrpt:  [+346.42% +441.46% +526.86%]

This profile comparison also shows that the lg_throttle bottleneck was greatly reduced:

Warp Divergence and GSST

GSST is a paper that modifies the FSST encoding for more efficient execution on the GPU. I didn’t use it as a reference for the optimizations we covered since a goal of the hackathon was to dogfood the Polar Signals GPU Profiler, but there is a lot of overlap between the optimizations mentioned in the paper and the ones we covered.

One thing profiling/PC sampling doesn’t show is the warp execution efficiency. As mentioned before, all threads in a warp must execute in lockstep so if one thread has finished execution or will execute a different instruction, it is masked as inactive during an execution cycle which reduces execution efficiency.

One further optimization I would not have discovered with profiling is the split kernel optimization from the GSST paper. Since threads are decompressing variable-length strings, some threads in a warp might be inactive after decompressing short strings while others are still decompressing the longer strings. The optimization is to use pre-computed splits in the compressed data and even out the workload across threads to increase execution efficiency.

Computing this split metadata on the CPU side and shipping it off to the GPU results in another nice performance boost:

cuda/cuda/fsst/decompress/10M
                        time:   [8.8424 ms 9.5543 ms 10.775 ms]
                        thrpt:  [79.830 GiB/s 90.026 GiB/s 97.275 GiB/s]
                 change:
                        time:   [-29.272% -16.710% -1.0435%] (p = 0.06 > 0.05)
                        thrpt:  [+1.0545% +20.062% +41.386%]

However, this improvement is more illustrative than anything else, since the CPU pre-computation is something to be avoided in production scenarios. The whole point of GSST is that this sort of metadata is pre-computed at write time. It’s good motivation to introduce a GPU-optimized encoding though.

Final thoughts

Writing and optimizing the FSST CUDA kernel in Vortex was both frustrating and fun. GPUs are really finicky but reward you with outstanding performance once you understand the architecture and how it responds to the kernel. I realized that a large part of performance engineering on GPUs is about throughput: figuring out how to turn on the memory taps so the data flows as freely as possible from one side of the GPU to the other while keeping the threads as busy as possible in order to hide the memory latency.

The result of the hackathon was good enough to PR and is now generally available in Vortex. Although introducing GSST is the long term goal, the FSST decompression we covered in this blog post achieves 40% of measured device bandwidth (via cuMemcpy, 30% of theoretical) and is 3x faster than nvcomp’s zstd on the same corpus of data (this is fundamentally due to FSST's decompression parallelizability):

cuda/cuda/zstd/decompress/10M
                        time:   [28.587 ms 33.475 ms 37.264 ms]
                        thrpt:  [23.082 GiB/s 25.695 GiB/s 30.089 GiB/s]

Thanks for reading and feel free to point out anything I did wrong or could have done better. We’re also still in the early days of GPU profiling, so we'd love to hear how we can make Polar Signals' GPU profiler fit your workflow better.