Where it came from
We set out to build the best continuous profiling product we could, one that we would be proud to use every day ourselves. To determine whether we had product fit, we started with what we knew: Prometheus. Our initial iteration of the CONtinuous PROFiling product was unimaginatively called ConProf. It was built from a fork of Prometheus, where we replaced the frontend code so it could render profiles and modified the TSDB storage backend to store pprof-formatted binary blobs instead of floats. This was a great stepping stone, as it got us feedback from investors and customers quickly. But we knew we needed something that would handle our needs better.
At the time, we already knew what some of our goals for a continuous profiling product would be. We come from open source, so we always wanted to build it as open source. We wanted to replicate the Prometheus experience of a single binary that does everything you need for continuous profiling. That meant we needed a database that could handle arbitrarily wide, dynamic columns to support arbitrary user-defined labels, and one that was embeddable in Go. This is where FrostDB comes in. You can read more about the design decisions, and why we decided to build our own database, in our previous blog post.
What we built
FrostDB worked great as the database backend for the open source project Parca. However, it was always meant to be a single-binary solution that didn't scale beyond the largest box you could run it on. That's where we felt we could build our business: offering a hosted, scalable cloud version of the open source project. We needed to take FrostDB and build an enterprise-ready wrapper around it that would let us scale the database for a multi-tenant cloud environment.
Our original design goals when building a scalable FrostDB solution were:
- Storage and compute are decoupled
- Cost effective at scale
- Easy to operate
We wanted to avoid managing disks and state, and instead leverage the object storage that cloud providers offer. We also needed a system that was easy and inexpensive to operate. Polar Signals is a small company, and we couldn't afford certain architectures, such as putting a Kafka cluster in front of the database: managing one can quickly become a full-time job in and of itself, as well as quite expensive.
Armed with these design goals, we created our first cloud database around FrostDB. We ran a cluster of FrostDB instances, and each write was partitioned across them. Queries were likewise fanned out so that each FrostDB instance answered for the portion of the query space it was responsible for. Whenever an instance accumulated a configured amount of data in memory, it flushed those Arrow records to object storage as a Parquet file. This seemed to work great: it was simple to manage, it was cost effective, and compute was mostly decoupled from storage, apart from the small in-memory footprint waiting to be flushed.
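To make the write path concrete, here is a minimal sketch of the idea in Go. The `Distributor` and `Instance` names and the byte-count threshold are illustrative assumptions, and the flush is reduced to a log line where the real system serializes Arrow records to a Parquet file and uploads it to object storage.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Instance stands in for a single FrostDB node that buffers writes in
// memory and flushes once it has accumulated enough data.
type Instance struct {
	id        int
	buffered  int // bytes currently held in memory
	threshold int // flush once this many bytes are buffered
}

func (i *Instance) Write(p []byte) {
	i.buffered += len(p)
	if i.buffered >= i.threshold {
		i.flush()
	}
}

// flush is where the real system would convert the buffered Arrow
// records into a Parquet file and upload it to object storage.
func (i *Instance) flush() {
	fmt.Printf("instance %d: flushing %d bytes to object storage\n", i.id, i.buffered)
	i.buffered = 0
}

// Distributor hash-partitions each write across the set of instances.
type Distributor struct {
	instances []*Instance
}

func (d *Distributor) Write(partitionKey string, p []byte) {
	h := fnv.New32a()
	h.Write([]byte(partitionKey))
	target := d.instances[int(h.Sum32()%uint32(len(d.instances)))]
	target.Write(p)
}

func main() {
	d := &Distributor{instances: []*Instance{
		{id: 0, threshold: 1 << 20},
		{id: 1, threshold: 1 << 20},
	}}
	d.Write("tenant-a/profile-series-1", make([]byte, 512<<10))
	d.Write("tenant-a/profile-series-1", make([]byte, 512<<10)) // second write crosses the threshold and triggers a flush
}
```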
One of the decisions we made early on, in keeping with being cost effective at scale, was to run on preemptible instances. Preemptible instances deliver large cost savings because the cloud provider can terminate them at any time and move the capacity to wherever it is needed. And this was fine for a while: when our nodes received a termination signal, they would flush what they had in memory to object storage and pick up where they left off once they were rescheduled.
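That shutdown path can be sketched roughly as below, assuming the usual SIGTERM notice and a fixed grace period; `flushToObjectStorage` is a hypothetical stand-in, not our actual code.

```go
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// flushToObjectStorage stands in for serializing the in-memory data
// and uploading it before the node goes away.
func flushToObjectStorage(ctx context.Context) error {
	log.Println("flushing in-memory data to object storage")
	return nil
}

func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)

	<-sigCh // the cloud provider is about to reclaim this node

	// The grace period is whatever the provider allows between the
	// termination notice and the actual shutdown.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if err := flushToObjectStorage(ctx); err != nil {
		// If the flush doesn't finish in time, the buffered data is
		// lost -- exactly the failure mode described next.
		log.Printf("flush did not complete before termination: %v", err)
	}
}
```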
However, as we on-boarded more customers to the platform, it became apparent that this would not scale. More customers naturally meant more data on each node. Soon it became impossible to flush everything to object storage within the grace period between the termination notice and the node actually shutting down. We were losing data, because it wasn't reaching object storage before shutdown. Not much of a database if it can drop data at random.
So we reached for every database engineer's best friend: the write-ahead log (WAL). The WAL allowed us to recover in the cases where we couldn't flush to object storage before termination. But it meant giving up on one of our design goals, not having to manage disks and state. Seeing as we were losing data, this seemed like the right trade-off to make at the time, so we added disks to each node where we could persist the WAL.
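Here is a minimal sketch of what a WAL buys you, assuming a simple length-prefixed record format; FrostDB's actual WAL differs in its details, but the shape is the same: append and fsync a record before acknowledging the write, and replay every record on startup.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

type WAL struct{ f *os.File }

func OpenWAL(path string) (*WAL, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &WAL{f: f}, nil
}

// Append durably records a write before it is applied in memory, so a
// preempted node can rebuild its state after a restart.
func (w *WAL) Append(record []byte) error {
	var size [4]byte
	binary.LittleEndian.PutUint32(size[:], uint32(len(record)))
	if _, err := w.f.Write(size[:]); err != nil {
		return err
	}
	if _, err := w.f.Write(record); err != nil {
		return err
	}
	return w.f.Sync() // fsync so the record survives a sudden termination
}

// Replay feeds every record back to the caller on startup; this is the
// step whose duration grows with the amount of data each node holds.
func (w *WAL) Replay(apply func([]byte) error) error {
	if _, err := w.f.Seek(0, io.SeekStart); err != nil {
		return err
	}
	for {
		var size [4]byte
		if _, err := io.ReadFull(w.f, size[:]); err == io.EOF {
			return nil
		} else if err != nil {
			return err
		}
		record := make([]byte, binary.LittleEndian.Uint32(size[:]))
		if _, err := io.ReadFull(w.f, record); err != nil {
			return err
		}
		if err := apply(record); err != nil {
			return err
		}
	}
}

func main() {
	wal, err := OpenWAL("frostdb.wal")
	if err != nil {
		panic(err)
	}
	_ = wal.Append([]byte("profile sample"))
	_ = wal.Replay(func(record []byte) error {
		fmt.Printf("replaying %d bytes\n", len(record))
		return nil
	})
}
```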
The WAL worked great, and we were able to recover successfully. However, as we continued to add customers to the platform, we noticed that the time for a node to recover kept increasing. Each node only had so much compute, and each customer database had its own WAL that needed to be replayed on recovery. Growing recovery times meant that a partition of the database was unable to answer queries while it was replaying. We first increased the cost of the system by switching from hard disks to solid-state storage. Faster disks meant faster replay, but that only bought us time, and we knew we would have to solve the underlying problem. We decided we couldn't get away with simply partitioning the data across nodes; we needed to replicate it. With the data replicated, it didn't matter if a node was down: we could still answer queries and ingest new data. But replication meant storing more data, which increased the cost of running the system. Solving these problems chipped away at our second goal, a cost-effective solution.
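The replicated write path looked roughly like the sketch below. The ring placement, the replication factor of three, and the majority-acknowledgement rule are illustrative assumptions about how such a scheme typically works, not a description of our exact implementation.

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
)

const replicationFactor = 3

type Instance struct {
	id   int
	down bool
}

func (i *Instance) Write(p []byte) error {
	if i.down {
		return errors.New("instance unavailable")
	}
	return nil
}

// replicatedWrite sends each write to replicationFactor consecutive
// instances on the ring, so a single preempted or replaying node no
// longer blocks ingestion or leaves queries unanswerable.
func replicatedWrite(ring []*Instance, key string, p []byte) error {
	h := fnv.New32a()
	h.Write([]byte(key))
	start := int(h.Sum32() % uint32(len(ring)))

	acks := 0
	for r := 0; r < replicationFactor; r++ {
		if err := ring[(start+r)%len(ring)].Write(p); err == nil {
			acks++
		}
	}
	if acks < replicationFactor/2+1 {
		return fmt.Errorf("only %d of %d replicas acknowledged the write", acks, replicationFactor)
	}
	return nil
}

func main() {
	ring := []*Instance{{id: 0}, {id: 1, down: true}, {id: 2}, {id: 3}}
	// Succeeds even though one instance is down, at the cost of
	// storing every sample replicationFactor times.
	fmt.Println(replicatedWrite(ring, "tenant-a/profile-series-1", []byte("sample")))
}
```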
This worked great for a time, but eventually we found the limits of this solution as well. Replay times continued to grow as customers on-boarded more of their infrastructure. The longer wait for a node to become ready wouldn't have been so bad, except that the node kept falling behind the other nodes in the cluster. Normally the distributor would buffer recent writes, and when a node came back online it would replay those writes to it so that it could catch up with the rest of the cluster. But because the replay times had grown so long, we could no longer hold all the writes a node was missing.
To take the pressure off the distributors holding all the missing writes in memory, we implemented a coordination layer in our enterprise wrapper. This layer told instances to form clusters whose members coordinated to bring lagging nodes back up to speed. When a node came up, it would replay its WAL, then reach out to the other nodes in its cluster and ask them for their latest transaction ID. If that ID was ahead of its own, it would request the missing transactions. This meant we didn't have to buffer those writes in the distributor; a recovering node could go straight to the source for the missing information.
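The catch-up step looked roughly like the sketch below. The `Peer` interface and the transaction-ID scheme are illustrative assumptions about the shape of the protocol, not its actual types or wire format.

```go
package main

import "fmt"

// Peer is another node in the same cluster that can report its latest
// transaction ID and return the transactions after a given ID.
type Peer interface {
	LatestTxID() uint64
	TransactionsAfter(tx uint64) [][]byte
}

type Node struct {
	latestTx uint64
}

func (n *Node) apply(record []byte) { n.latestTx++ }

// catchUp runs after WAL replay: if a peer is ahead of this node, pull
// the transactions it missed while it was down, instead of relying on
// the distributor to buffer those writes in memory.
func (n *Node) catchUp(peers []Peer) {
	for _, p := range peers {
		if p.LatestTxID() > n.latestTx {
			for _, record := range p.TransactionsAfter(n.latestTx) {
				n.apply(record)
			}
		}
	}
	fmt.Printf("caught up to transaction %d\n", n.latestTx)
}

// fakePeer is only here to make the sketch runnable.
type fakePeer struct{ txs [][]byte }

func (f fakePeer) LatestTxID() uint64                   { return uint64(len(f.txs)) }
func (f fakePeer) TransactionsAfter(tx uint64) [][]byte { return f.txs[tx:] }

func main() {
	n := &Node{latestTx: 1} // replayed one transaction from its own WAL
	n.catchUp([]Peer{fakePeer{txs: [][]byte{[]byte("t1"), []byte("t2"), []byte("t3")}}})
}
```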
However, this cluster coordination meant giving up on our last remaining design goal: an easy-to-operate system. The new coordination layer required a large amount of knowledge to run and created some complicated failure scenarios to recover from. One of these happened when our cloud provider preempted multiple instances in the same cluster at the same time. We could end up in a death spiral: nodes tried to recover from each other, their memory footprints ballooned during recovery, the pods were OOMKilled, and the recovery cycle started over, all the while falling further behind. Not exactly what we had designed for in the beginning.
Where we’re headed
While the database we have today is in a very stable state, it's no longer the database we set out to build. As we earned battle scars and gave up ground on our design goals, we ended up with a database that was more expensive, more complex to manage, and whose compute was tightly coupled to storage. We had never built a database for profiling data before; it was a bit like trying to build a plane as it's trying to land. We landed, but not exactly where we wanted to end up. We're taking all the lessons we learned from building this database and bringing them back to our original design goals. We're excited to share the rewrite of our database that meets those original goals. Check out our blog in the near future for a write-up on building our next generation of profiling storage.