German strings are everywhere I look. The impression I’ve gotten from working in the Rust Arrow/Datafusion ecosystem and related file formats for the last couple of months is that StringViews (the implementation of German strings in Arrow) are becoming, if they have not already, the canonical form of representing string columns at execution time.
This is generally a good idea. German strings are a fantastic innovation rooted in simplicity that greatly improves most string processing use-cases in database systems. However, “most” does not mean “all”. At Polar Signals, we are one of these exceptional use-cases.
In this blog post, I want to argue that we should be careful about treating German strings as a silver bullet at the expense of other encodings. Ideally, German string encoding should be “Just Another Encoding”™ to be chosen based on physical data characteristics and the type of workload, rather than an implicit choice database systems make for the user.
German Strings
You can read more about German strings here and the implementation of them in the Rust arrow library in this two part blog post (part 1, part 2). I will focus on the Rust Arrow StringViewArray implementation since that is what we use.
The general idea is that string views are split into two buffers: one views buffer where each element is a 128 bit/16 byte “view”, and at least one (possibly multiple) data buffers that are pointed to by the views:
This layout for strings offers advantages for many operations like comparisons/filters and sorting, since most of the operations can be performed directly on the views buffer.
Downsides
Part 2 of the Datafusion blog post on StringViews nicely summarizes the downsides of this encoding. A common theme is memory use. Because each element requires at least a 16 byte representation, both tiny and repeated short strings use more memory than they otherwise would. Even longer strings, although deduplicated in the buffer, require at least 16 bytes of memory per element.
Dictionary Encoding
Sometimes performance is not everything. At Polar Signals, we need to be very conscious about resource usage per query because we run a multitenant cloud service. Also, the better performance offered by StringViews needs to be put into context. If your workload is running many filters/sorts etc… on string columns you will see better performance using StringViews, but if your workload does not operate much on string columns, then this improvement is not as relevant.
In our case, we have many string columns in our data schema but our workload performance is not bottlenecked on these string operations. This makes query memory usage and therefore optimizing each column’s physical representation much more important. Additionally, these string columns have low cardinality (most are below 0.05%). This is why we made the choice to represent most of our string columns using dictionary encoding.
Dictionaries are views too. The equivalent of the views buffer for dictionaries are just indexes into an accompanying values buffer that stores the unique strings just as the StringView does. The difference is that the index buffer can be sized appropriately depending on the number of distinct values that need to be addressed. For example, if your string column only stores three-character airport codes you know off the bat that you will be able to address all possible combinations using just 2 bytes per element.
For our columns concretely, we rely on an index buffer that costs 4 bytes per element. This is a quarter of the 16 bytes that would be required for a StringView. The values buffer for both encodings are more or less equivalent since both deduplicate strings, and the memory used by this buffer is negligible in comparison to the views/indexes. On a real-world column for example, the values buffer represents around 0.2% of total dictionary memory usage.
An important thing to keep in mind is that queries are not executed on an unlimited number of rows at once. There is a batch size that defines how big a chunk of data is that passes through the database operators at a time so in theory this also limits the amount of memory used by queries that do not accumulate values. However, batch sizes are usually defined as a number of rows, and within a deeply nested schema such as ours, a row can hold a list of arbitrarily many values.
But what if the workload changes?
If our workload changes, and operations on string columns become more important we will of course have to think about whether we’re still making the right tradeoff. However, this brings me back to the idea I hinted at at the beginning of this blog post. Rather than the database user explicitly choosing the best encoding via a schema or the database making an implicit choice about the best physical encoding, physical representations of data should ideally be dynamically chosen based on certain characteristics both at the storage and execution layer.
At the storage layer for example, Vortex already represents data schemas using only logical types and chooses the best encoding for each column chunk dynamically. On the execution side, it looks like the datafusion community is starting to discuss this idea, opening the door for potentially one day choosing the best physical encoding dynamically at plan time based on the query and storage format.
Conclusion
German strings are generally a great encoding choice, but not always. As we've shown with the Polar Signals use case, context matters: our low-cardinality string columns achieve a 75% reduction in memory usage with dictionary encoding compared to StringViews, with minimal performance trade-offs since string operations aren't our bottleneck.
Because it is difficult to assume what the best encoding will be for any given workload, database systems should dynamically choose encodings based on storage and workload characteristics. There’s a lot of good work happening to promote the idea of separating logical types and physical encodings. Until that work is complete, database systems should not make assumptions about the optimal physical encoding on behalf of users.