Questioning an Interface: From Parquet to Vortex

Breaking free from the shackles of interface-imposed performance limitations

November 25, 2025

“Interfaces are powerful, they are the boundaries where our components interact to exchange information [...], but as boundaries, interfaces are not only what’s exposed, but what’s imposed.”

Joran’s talk on “The Power of an Interface for Performance” at Systems Distributed 2025, and the specific idea that the design of an interface can impose limitations on the user, really stuck with me. This is true even of widespread interfaces, where the limitations have become so normalized that we treat them as inevitable. A couple of months later, we coincidentally got the opportunity to put theory into practice and break free from our own shackles of interface-imposed performance limitations.

Parquet had been our file format choice for our profiling database since the beginning. We dogfood our own product extensively, so we’d always known that our query CPU time was dominated by converting Parquet into Arrow, a more queryable data format. Although we had invested engineering effort into avoiding this cost with things like aggregation pushdowns, we had come to the conclusion that avoiding this cost in a general way would be impossible without changing the storage format. We’d been keeping an eye on file format alternatives and when Vortex was donated to the Linux Foundation by SpiralDB, we decided to give it a shot.

After switching, we ended up getting a 70% average performance improvement on all our queries with 10% better uncompressed storage size and only 3% larger compressed storage size compared to snappy-compressed Parquet.

Parquet as an interface

The Parquet file format is a very popular choice for anybody who wants to store column-oriented data as efficiently as possible. It’s been around for more than a decade, has wide-ranging support across the “big data” ecosystem, and has multiple reader/writer implementations in different languages.

Parquet has three design goals:

  • Efficient data storage
  • Efficient data retrieval
  • Interoperability

Parquet minimizes the bytes on disk used to represent data and reduces IO operations when retrieving it. It does this by giving the user knobs to tune column encodings and compression, and by laying out data so that a reader can fetch only the columns it needs and use column statistics to prune data that cannot match a query.
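
As a concrete illustration, here is a minimal sketch using pyarrow; the file name, column names, and filter value are made up for illustration:

```python
import pyarrow.parquet as pq

# Only the requested columns are read from disk, and row groups whose
# min/max statistics cannot match the filter can be skipped without
# being decoded.
table = pq.read_table(
    "profiles.parquet",
    columns=["timestamp", "stacktrace_id", "value"],
    filters=[("timestamp", ">=", 1700000000)],
)
print(table.num_rows)
```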

Additionally, one of the benefits of Parquet is that it is so widespread that it can be used as a data interchange format. You can theoretically write Parquet in one system using a writer library and ship it off somewhere else to be read by a completely different reader.

The Reality

Choosing Parquet as a storage format requires almost no thought. It’s so ubiquitous that it seems like an easy choice, and that was the case for us. However, we now realize that efficient data storage was the only one of those goals where our needs matched what Parquet delivered.

Efficient data retrieval sounds fantastic until you realize that “retrieval” does not mean “querying”. Retrieval is about reducing IO for the data you need; querying is about what happens when you actually run computations on that data. In practice, you need to convert Parquet into Arrow, whose design goal is O(1) data access, making it much better suited for general-purpose query execution.
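
To make the retrieval-versus-querying distinction concrete, here is a small sketch (again with pyarrow and made-up names): the first step decodes Parquet pages into Arrow, and only then does the computation itself run, on the Arrow buffers:

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Step 1, "retrieval": Parquet pages are fetched and decoded into Arrow.
# This decode step is the conversion that dominated our query CPU time.
table = pq.read_table("profiles.parquet", columns=["value"])

# Step 2, "querying": the computation runs on the decoded Arrow buffers,
# which offer O(1) access and vectorized kernels.
print(pc.sum(table["value"]).as_py())
```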

This conversion into Arrow came at a heavy cost and ended up dominating our query CPU time. Query engines like DuckDB and DataFusion push filters down as far as possible to avoid this conversion cost, which helps. But even with optimal filter pushdown, general computation pushdown is limited.
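
For example, a hedged sketch with DuckDB’s Python client (file and column names are illustrative): the WHERE clause can be pushed into the Parquet scan, but the aggregation still runs on data that had to be decoded first.

```python
import duckdb

# EXPLAIN shows the filter attached to the Parquet scan, so row groups whose
# statistics exclude it are never decoded; the GROUP BY and sum, however,
# still execute on decoded data.
duckdb.sql("""
    EXPLAIN
    SELECT stacktrace_id, sum(value)
    FROM read_parquet('profiles.parquet')
    WHERE timestamp >= 1700000000
    GROUP BY stacktrace_id
""").show()
```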

There are paths you can take to evaluate some computations on the Parquet file directly (e.g. we implemented a DISTINCT pushdown on dictionary-encoded columns) or to enhance the default set of stats for even better filtering (see this very interesting blog post about adding user-defined indexes). But these optimizations usually target a very specific set of query patterns, and each one requires custom work, including figuring out how to integrate it with existing file readers.
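
For a rough sketch of the dictionary idea (illustrative only, not our actual implementation), pyarrow can keep a column dictionary-encoded on read, so a DISTINCT only has to deduplicate dictionary entries rather than every row:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Keep the column dictionary-encoded instead of materializing every row.
table = pq.read_table(
    "profiles.parquet",
    columns=["function_name"],
    read_dictionary=["function_name"],
)

# Deduplicate the per-chunk dictionaries instead of all rows (this assumes
# every dictionary entry is actually referenced by the data).
dictionaries = [chunk.dictionary for chunk in table["function_name"].chunks]
distinct = pc.unique(pa.concat_arrays(dictionaries))
print(distinct)
```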

Interoperability creates its own problems. It sounds like a great add-on until you realize that it also comes at a cost. Parquet is constantly being updated with more efficient encodings, but DuckDB has a great blog post outlining how you cannot use them without sacrificing interoperability, because most mainstream query engines never added support for reading the newer encodings. In practice, you have to sacrifice data size if you want interoperability.
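
As a minimal sketch of the knob in question (the column is made up, and whether other readers accept the resulting file depends entirely on their encoding support), pyarrow lets you opt into a specific column encoding at write time:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"value": pa.array([1.0, 2.0, 3.0], type=pa.float64())})

# Opting into a newer encoding can shrink the column, but a reader that
# never implemented BYTE_STREAM_SPLIT can no longer read the file.
pq.write_table(
    table,
    "values.parquet",
    use_dictionary=False,  # explicit column encodings require dictionary off
    column_encoding={"value": "BYTE_STREAM_SPLIT"},
)
```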

Vortex

Like Parquet, Vortex minimizes bytes on disk. However, Vortex is also designed with a core use-case in mind: decoding and querying data directly from object storage on GPUs. This key idea translates very well to our use-case even though we don’t run our queries on GPUs (yet?). Specifically, the file format is designed to maximize throughput and parallelism, from the metadata layout to the SIMD/SIMT-friendly encodings it uses.

Crucially, it also acknowledges that making queries fast requires not only good filter pushdown but also general-purpose compute pushdown. For anything that cannot be pushed down, Vortex’s encodings can be tuned to offer zero-copy conversion to Arrow, so the rest of the query can be executed by any general-purpose query execution engine.

Vortex also learns from Parquet’s limitations around extensibility and aims to be as future-proof as possible. New encodings can ship with WASM decoders, so adoption isn’t gated on every reader library implementing support. The main Rust library is also designed to be fully extensible, so you can write your own layouts/encodings and plug them in as first-class citizens.

Given how well Vortex’s design matched our needs, we tried it out and got a 70% average performance improvement on all our queries. With the newer encodings that Vortex offers, we got 10% better uncompressed storage size and only 3% larger compressed storage size compared to snappy-compressed Parquet.

Conclusion

For us, finding Vortex was like finding a pair of nice-fitting shoes after walking around in a size too small for longer than we could remember. The payoff was finally being able to run 70% faster.

Many blog posts could be written comparing Vortex to Parquet with numbers and specific use-cases (see bench.vortex.dev if you’re curious), and many more about why benchmarks don’t tell the whole story. In the end, finding the right interface that satisfies your needs is much more important. When you get that right, everything else, including better performance, flows from that.
