New year new database

Bringing what we've learned to our next-generation database.

June 18, 2025

New Storage

Previously I wrote about what we learned building a database for profiling storage. If you missed that one, you can read it here. This entry in our database journey is about taking what we learned building our first database and using those lessons to build something new.

For those who didn't read the original blog post, our original design goals were:

  • Keep storage and compute decoupled
  • Cost effective at scale
  • Easy to operate

We built the database at the same time we were building the product, so we solved problems as they came to us, which resulted in a system that organically grew until it no longer satisfied these goals. We needed a database that lived up to the design goals. So, armed with the knowledge from the previous iteration, we set out on a mission to build a new one.

Now, first I'll address a question we get asked frequently.

Why don't we use an off-the-shelf solution instead of building our own?

Since we're no longer building the database as an embedded open-source storage engine for the Parca project, we could definitely leverage off-the-shelf technology. However, we've been watching the space as it has matured, and there continues to be risk in not owning your storage layer. With license changes and pricing increases, it was clear that building on top of an off-the-shelf solution had the potential to become a risk to the business. In addition, we felt that with the knowledge gained from building the first iteration, we could build a database better suited to our very specific needs.

Decoupling Storage And Compute

As mentioned in my previous post, having storage and compute coupled together created major pain points when managing the system during incidents. Those lessons made it clear we needed to get back to keeping compute and storage separate.

To accomplish this, we decided to use the open table format Delta Lake. Delta Lake offers ACID-compliant tables built entirely on object storage. Its architecture allows for very high-throughput appends, as committing a new write to the table requires only a single write to object storage. This is different from Iceberg, which can require multiple writes to object storage before a write is committed to the table. We felt this architecture was better suited to our use case, as profiling data is much more write-heavy and doesn't require some of the other features that make Iceberg attractive.

Writer to a Delta Lake table
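To make the single-write commit concrete: a Delta Lake commit boils down to writing an immutable data file and then adding exactly one new entry to the table's _delta_log directory. The sketch below is illustrative only, not our implementation: it uses the local filesystem and a simplified log entry, whereas a real writer targets object storage and relies on an atomic put-if-absent for the log file (and a full commit also carries protocol, metadata, and commit-info actions).

```rust
use std::fs;
use std::path::Path;

// Illustrative sketch of the Delta Lake commit shape. Real writers use the
// delta-rs/Delta protocol against object storage; this just shows why one
// logical write costs one data file plus one log entry.
fn commit_parquet_file(
    table: &Path,
    version: u64,
    data_file: &str,
    parquet_bytes: &[u8],
) -> std::io::Result<()> {
    // 1. Write the immutable data file. Its name is unique, so this write
    //    can never conflict with another writer.
    fs::write(table.join(data_file), parquet_bytes)?;

    // 2. Commit by creating exactly one new transaction-log entry. Whoever
    //    successfully creates _delta_log/<version>.json wins the commit.
    let log_dir = table.join("_delta_log");
    fs::create_dir_all(&log_dir)?;
    let entry = format!(
        r#"{{"add":{{"path":"{}","size":{},"dataChange":true}}}}"#,
        data_file,
        parquet_bytes.len()
    );
    fs::write(log_dir.join(format!("{:020}.json", version)), entry)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Hypothetical call: in practice the bytes come from an Arrow/Parquet writer.
    commit_parquet_file(Path::new("/tmp/profiles_table"), 1, "part-00001.parquet", b"...")?;
    Ok(())
}
```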

Having our database on object storage is great: it's easier to operate because there aren't multiple copies of the data to keep track of, and we don't have to reach out to multiple systems to query their data. In addition, our compute and storage are decoupled, meaning we can run long-running analytical queries or multiple high-memory queries in parallel without affecting the ingestion side of our database. And of course, the primary benefit of object storage is that scaling is handled for us.

Cost Effective At Scale

However, this still leaves us with the design goal of being cost-effective at scale. While storing data in object storage is cheap, the API costs can quickly add up. As the database scales, we'd be writing more and more data, and having each customer write trigger two or more object storage writes would quickly drive up the cost of scaling the database.

This isn't a new problem for us, or for other databases. We solved it in our original database by buffering customer writes on disk and, once we had enough data, writing it out to object storage (this was the compute-and-storage coupling problem mentioned in the original blog post). Many of the newest databases we've seen others talk about do this same kind of buffering using Kafka: writes are streamed into a Kafka cluster, and once there's enough data in the cluster for a database partition, it is written out to object storage.

This strategy works for reducing write amplification on object storage while maintaining high write throughput. However, it makes two concessions on our design goals. Compute and storage are no longer separated: queries now also need to pull data from the Kafka instances, which adds load to the system responsible for ingestion. It also means the database is no longer simple to manage: running a separate Kafka cluster just to support your database is no small task, and it can become quite expensive in its own right. We needed a way to keep the data in object storage while still reducing write amplification, so the database stays cost-effective at scale.

Buffered writer to a Delta Lake table and a compactor moving data to the final Delta Lake table

This is where designing a usage-specific database lets us make trade-offs a general-purpose system can't. Since profiling data doesn't need to be queryable at sub-second intervals, we can buffer writes in memory and hold off acknowledging a write to the client until it has been written to object storage. This allows us to combine writes to multiple databases into a single larger write and commit that buffer to a Delta Lake table that holds these larger buffers. Once a database has enough data in this buffer table, a compaction process moves that data into the final Delta Lake table for that database. Effectively, we're using a Delta Lake table for the role the Kafka clusters played: buffering writes until there's enough data to commit to the table. This keeps our writes to object storage low even as writes to the database increase, and it keeps the database easy to operate: there are no disks to manage, and all committed data lives in object storage.
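A minimal sketch of the buffering idea, with hypothetical names and thresholds rather than our actual implementation: writes from many tenants accumulate in memory, nothing is acknowledged until the combined buffer has been committed to the buffer table in object storage, and a flush is triggered by size.

```rust
use std::collections::HashMap;

/// One tenant's pending payload, awaiting acknowledgement.
struct PendingWrite {
    tenant: String,
    payload: Vec<u8>,
}

/// Hypothetical buffered writer: illustrative only.
struct BufferedWriter {
    pending: Vec<PendingWrite>,
    buffered_bytes: usize,
    flush_threshold: usize,
}

impl BufferedWriter {
    fn new(flush_threshold: usize) -> Self {
        Self { pending: Vec::new(), buffered_bytes: 0, flush_threshold }
    }

    /// Accept a write but do NOT acknowledge it yet; the ack only happens
    /// after the buffer has been committed to the buffer Delta Lake table.
    fn write(&mut self, tenant: &str, payload: Vec<u8>) {
        self.buffered_bytes += payload.len();
        self.pending.push(PendingWrite { tenant: tenant.to_string(), payload });
        if self.buffered_bytes >= self.flush_threshold {
            self.flush();
        }
    }

    /// Combine many tenants' writes into a single object-storage commit,
    /// then acknowledge all of them at once.
    fn flush(&mut self) {
        let mut by_tenant: HashMap<String, Vec<Vec<u8>>> = HashMap::new();
        for w in self.pending.drain(..) {
            by_tenant.entry(w.tenant).or_default().push(w.payload);
        }
        commit_to_buffer_table(&by_tenant); // one data file + one log entry
        acknowledge_clients(&by_tenant);    // only now do clients see success
        self.buffered_bytes = 0;
    }
}

// Placeholders standing in for the real object-storage commit and client acks.
fn commit_to_buffer_table(_batches: &HashMap<String, Vec<Vec<u8>>>) {}
fn acknowledge_clients(_batches: &HashMap<String, Vec<Vec<u8>>>) {}

fn main() {
    let mut writer = BufferedWriter::new(8 * 1024 * 1024); // flush at ~8 MiB (illustrative)
    writer.write("tenant-a", vec![0u8; 1024]);
    writer.write("tenant-b", vec![0u8; 2048]);
}
```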

Additional opportunities

Since we were embarking on a rewrite of our storage layer anyway, it was also the time to revisit some non-architectural decisions we made early on with our original database.

One of the biggest consumers of CPU in our database is Go's garbage collector. When new writes come in, we have to allocate memory for the new data. Eventually that data gets moved into object storage, leaving behind buffers that need to be cleaned up. Because our database is so write-heavy, we create a lot of garbage. Moving to a manually memory-managed language like Rust removes the CPU overhead of garbage collection and lets us reduce our CPU footprint for the same workload.

We were all Gophers, so while we were still figuring out what we needed to build, it made the most sense not to also take on the burden of a language we were unfamiliar with. Now that we knew what we were building, it made sense to revisit this decision. Since we knew we would be using the Apache Arrow and Parquet formats for our data, and we had learned how hard it is to build a query engine, we decided we wanted to use the DataFusion ecosystem in our new database so we wouldn't have to implement any of the query pipeline ourselves. DataFusion gives us a blazing-fast query experience without us having to do that hard work.
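As a rough illustration of what the DataFusion ecosystem provides (this is not our actual query path; the table name, file layout, and columns are made up for the example), querying a directory of Parquet files with SQL takes only a few lines:

```rust
use datafusion::prelude::*;

// Assumes the `datafusion` and `tokio` crates as dependencies.
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Register a directory of Parquet files as a table; the path and schema
    // here are illustrative, not our production layout.
    ctx.register_parquet("samples", "./data/samples/", ParquetReadOptions::default())
        .await?;

    // DataFusion plans and executes the whole SQL pipeline for us.
    let df = ctx
        .sql(
            "SELECT stacktrace_id, SUM(value) AS total \
             FROM samples GROUP BY stacktrace_id ORDER BY total DESC LIMIT 10",
        )
        .await?;
    df.show().await?;
    Ok(())
}
```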

Another decision we made early on was to store the profiling data in a flat structure in Parquet format. This was much easier to implement, and our assumption was that there wasn't much to be saved by storing it in a nested format. Boy, were we wrong: after some experiments we saw, in some cases, as much as a 90% reduction in Parquet file sizes when storing data in a nested structure. Rewriting our database gave us a great opportunity to redefine the schemas we use to store data and save on storage costs. We'll have a larger blog post on file formats and storage schemas later.
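In the meantime, to make the flat-versus-nested distinction concrete, here is an illustrative pair of Arrow schemas (the field names are invented for the example, not our real schema): the flat layout repeats one row per stack frame, while the nested layout stores each sample's frames as a list of structs, which Parquet can encode far more compactly.

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Fields, Schema};

fn main() {
    // Flat layout: one row per stack frame, so sample-level columns repeat.
    let flat = Schema::new(vec![
        Field::new("sample_id", DataType::UInt64, false),
        Field::new("timestamp", DataType::Int64, false),
        Field::new("frame_depth", DataType::UInt32, false),
        Field::new("function_name", DataType::Utf8, false),
        Field::new("value", DataType::Int64, false),
    ]);

    // Nested layout: one row per sample, with the stack as a list of structs.
    let frame = DataType::Struct(Fields::from(vec![
        Field::new("depth", DataType::UInt32, false),
        Field::new("function_name", DataType::Utf8, false),
    ]));
    let nested = Schema::new(vec![
        Field::new("timestamp", DataType::Int64, false),
        Field::new(
            "stacktrace",
            DataType::List(Arc::new(Field::new("item", frame, false))),
            false,
        ),
        Field::new("value", DataType::Int64, false),
    ]);

    println!("flat:\n{flat:#?}\nnested:\n{nested:#?}");
}
```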

The final opportunity we had when writing a new database from scratch was to implement it in such a way that it is deterministic. Deterministic databases allow you to perform deterministic simulation testing, which gives us a high degree of confidence whenever changes are made. This was such a foundational change for us that it deserves a blog post all of its own.
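Until then, the core idea is that every source of nondeterminism (time, randomness, I/O scheduling) is routed through interfaces the test harness controls, so a run can be replayed exactly from a seed. A toy sketch with hypothetical traits, not our actual harness:

```rust
/// Sources of nondeterminism are injected, never reached for directly.
trait Clock {
    fn now_nanos(&mut self) -> u64;
}
trait Rng {
    fn next_u64(&mut self) -> u64;
}

/// In simulation, virtual time advances deterministically.
struct SimClock { now: u64 }
impl Clock for SimClock {
    fn now_nanos(&mut self) -> u64 {
        self.now += 1_000; // each call ticks virtual time forward
        self.now
    }
}

/// A tiny seeded PRNG (xorshift64), illustrative only.
struct SeededRng { state: u64 }
impl Rng for SeededRng {
    fn next_u64(&mut self) -> u64 {
        self.state ^= self.state << 13;
        self.state ^= self.state >> 7;
        self.state ^= self.state << 17;
        self.state
    }
}

/// Hypothetical database code only ever sees the traits, so running it under
/// a simulation harness with a fixed seed reproduces the same behaviour.
fn jittered_flush_deadline(clock: &mut impl Clock, rng: &mut impl Rng) -> u64 {
    clock.now_nanos() + rng.next_u64() % 1_000_000
}

fn main() {
    let mut clock = SimClock { now: 0 };
    let mut rng = SeededRng { state: 42 }; // same seed => same schedule
    println!("{}", jittered_flush_deadline(&mut clock, &mut rng));
}
```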

The Future

We're excited to roll out the new database to customers in the coming months. It has already shown a lot more resilience as we've tested it, and we can't wait to talk about all the additional optimizations we're making. Watch the blog for future posts about the new database, and maybe in a couple of years I'll have another lessons-learned post about this one.
