It's been just a year since we wrote about how we implemented deterministic simulation testing (DST) of our Go database (FrostDB). Since then, we have been hard at work writing our new database backed by object storage in Rust (read more here about why). In this blog post, I want to lay out the approach we took to build our new Rust database with DST principles front-and-center and how it compares to the approach we took with FrostDB.
As a quick recap, DST tests are randomized full system integration tests where any failure can be reproduced using the same initial random seed. DST helps you find complex bugs in your system before customers hit them in production. Since failures can be deterministically reproduced, it significantly reduces the time needed to replicate hard-to-find bugs and streamlines the debugging process so that more time can be spent on feature work. The confidence this provides in system correctness is transformative: developers can ship complex features at full speed, knowing that DST will catch any subtle regressions before they reach production. For more on DST principles and benefits, see our previous blog post.
The Four Ingredients of DST
In essence, DST boils down to controlling four main ingredients: concurrency, time, randomness, and failure injection. The first three are non-negotiable: you must have full control over task scheduling while injecting replayable sources of time and randomness. These three alone are enough to achieve deterministic execution. Failure injection is optional: with zero failure injection you still have a deterministic, reproducible integration test of a random execution schedule of your system, which is already valuable. The real power of DST, however, comes when you turn up the failure injection knob.
This amount of control requires you to think carefully about how your code is designed and run. When we implemented DST in FrostDB, we had to achieve this level of control with an existing codebase. In Go, the default concurrency model is that goroutines are multiplexed by the language runtime onto a set of OS threads, giving up your scheduling control to not one, but two external actors. Our choices were therefore to rewrite an existing complex codebase with its own scheduler, or figure out a way to control the existing scheduler by forcing single-threaded execution and making the runtime scheduler decisions deterministic.
We chose the latter approach and with some small changes, gained control of time and randomness. This approach was effective and found a good number of issues, including data loss and data duplication bugs. However, we were never really satisfied with the amount of failure injection we were able to achieve with this approach, since every type of failure required a specific interface implementation (e.g. VFS) and any scheduler-related failure injection (e.g. randomized execution schedules) would have required us to modify the Go runtime scheduler more than we wanted to.
We could have gone down the same path with our Rust database. The folks at RisingWave have helpfully written a deterministic futures executor called madsim, a drop-in replacement for tokio. Madsim offers deterministic execution of futures given a random seed, and also helpfully implements failure injection for some interfaces.
For an existing Rust codebase implementing DST, madsim is probably a good choice. However, writing a new codebase from scratch gave us a golden opportunity to claim the responsibility of full control over concurrency, time, randomness, and failure injection. So we decided to go all in.
Enter State Machines
Confident in our decision, we chose to architect our new database as a set of state machines, heavily inspired by the actor-based concurrency model outlined in the sled simulation guide. The idea is that all of your core components are written as single-threaded state machines. In production, these state machines are wrapped by a “driver” that routes messages to other state machines via the network and handles local state machine outputs. In testing, a single thread drives an event loop and routes messages between state machines using a message bus.
These state machines implement a simple interface that defines a Message type as the sole method of communication with other state machines. This is a way to implement a technique known as dimensionality reduction: by constraining all interactions to message passing, we collapse the complex space of possible component interactions into a single, well-defined channel. Every state change and side effect must flow through the same messaging interface, which makes the system far easier to reason about and test.
pub trait StateMachine {
    fn receive(
        &mut self,
        m: Message,
    ) -> Option<Vec<(Message, Destination)>>;

    fn tick(
        &mut self,
        curtime: Instant,
    ) -> Option<Vec<(Message, Destination)>>;
}
The tick method is called periodically for any timer-related events state machines implement (e.g. flushing buffers) and is the only source of messages without inputs in the system. The receive method implements the core state machine logic, producing any number of messages in response to an input message from another component in the system.
With this architecture, the state of the system can be described at any point in time as the combination of individual state machine states. Since state only changes through tick and receive calls, we can simulate our entire system exactly as it would run in production.
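To make this concrete, here is a toy implementation of the trait: a hypothetical buffering component that accumulates appended bytes and flushes them on tick. The Message and Destination variants here are simplified stand-ins for illustration, not our real types.

```rust
use std::time::Instant;

// Illustrative message and destination types; the real ones are richer.
#[derive(Debug, Clone, PartialEq)]
enum Message {
    Append(Vec<u8>),
    Flushed(usize),
}

#[derive(Debug, Clone, PartialEq)]
enum Destination {
    Wal,
    Client,
}

trait StateMachine {
    fn receive(&mut self, m: Message) -> Option<Vec<(Message, Destination)>>;
    fn tick(&mut self, curtime: Instant) -> Option<Vec<(Message, Destination)>>;
}

// A toy buffering component: accumulates appended bytes, flushes on tick.
struct Buffer {
    pending: Vec<u8>,
}

impl StateMachine for Buffer {
    fn receive(&mut self, m: Message) -> Option<Vec<(Message, Destination)>> {
        match m {
            Message::Append(bytes) => {
                self.pending.extend(bytes);
                None // nothing to emit until we flush
            }
            _ => None,
        }
    }

    fn tick(&mut self, _curtime: Instant) -> Option<Vec<(Message, Destination)>> {
        if self.pending.is_empty() {
            return None;
        }
        let n = self.pending.len();
        self.pending.clear();
        Some(vec![(Message::Flushed(n), Destination::Client)])
    }
}
```

Every state change in the component is observable purely through the messages it emits, which is exactly what makes it trivial to drive from a test harness.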
The Message Bus: The Director
The message bus that connects these state machines in deterministic simulation tests becomes a very powerful abstraction because it is the only component responsible for ticking and applying messages to state machines. This is where this state machine architecture really shines for DST because the control of all four ingredients is cleanly centralized in one place. Let's take a look at how this works in practice:
Scheduling
The event loop runs on a single thread and either ticks a random state machine (through tick) or applies a message to a destination state machine (calling receive). The state machine interface does not use the async keyword, which guarantees at compile time that no runaway futures are spawned inside state machines, so the main thread always runs tick or receive to completion. Each call to a state machine can be thought of as an asynchronous task executed to completion, described as a (Message, Destination) tuple in the message bus event queue. The message bus therefore fully controls task scheduling: the execution schedule is simply the order in which messages are applied to state machines.
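A minimal sketch of such an event loop follows. All names here (Bus, Echo) are illustrative rather than our real implementation, messages are simplified to strings, and a tiny xorshift64 PRNG stands in for a proper seedable generator. Each step either ticks a random machine or delivers a randomly chosen queued message.

```rust
use std::time::Instant;

// Illustrative stand-ins: the real Message is a rich enum and Destination
// names a component; here a message is a string and a destination an index.
type Message = String;
type Destination = usize;

trait StateMachine {
    fn receive(&mut self, m: Message) -> Option<Vec<(Message, Destination)>>;
    fn tick(&mut self, curtime: Instant) -> Option<Vec<(Message, Destination)>>;
}

// xorshift64: a tiny seeded PRNG to keep the sketch dependency-free.
struct Rng(u64);
impl Rng {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

// The director: owns every state machine, the event queue, and the PRNG.
struct Bus {
    rng: Rng,
    queue: Vec<(Message, Destination)>,
    machines: Vec<Box<dyn StateMachine>>,
}

impl Bus {
    // One scheduling step: either tick a random machine or deliver a
    // randomly chosen queued message. The execution schedule is exactly
    // the sequence of these seeded random choices.
    fn step(&mut self, now: Instant) {
        let outputs = if self.queue.is_empty() || self.rng.next() % 2 == 0 {
            let i = (self.rng.next() % self.machines.len() as u64) as usize;
            self.machines[i].tick(now)
        } else {
            let i = (self.rng.next() % self.queue.len() as u64) as usize;
            let (m, dest) = self.queue.swap_remove(i);
            self.machines[dest].receive(m)
        };
        if let Some(out) = outputs {
            self.queue.extend(out);
        }
    }
}

// A trivial machine that fans every message back out to itself twice.
struct Echo;
impl StateMachine for Echo {
    fn receive(&mut self, m: Message) -> Option<Vec<(Message, Destination)>> {
        Some(vec![(m.clone(), 0), (m, 0)])
    }
    fn tick(&mut self, _now: Instant) -> Option<Vec<(Message, Destination)>> {
        None
    }
}
```

Running two buses with the same seed for the same number of steps produces identical schedules, which is the reproducibility property DST relies on.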
Time
State machines have no direct access to system time. They only perceive time through the timestamp explicitly passed in to tick. In our case, we mostly use this to implement logic that flushes buffers after some amount of time has passed. Production drivers typically set up a task that calls this method periodically with the real current time, but in simulation the message bus can choose precisely what the “current time” is. This gives the message bus complete control over temporal behavior, allowing arbitrary time manipulation with CPU performance as the only limiting factor.
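As an illustration of why this matters, here is a hypothetical flush-on-interval component with a simplified analogue of the tick method (it returns a count instead of messages). Because the clock is passed in, a ten-minute flush interval can be exercised instantly by handing tick a timestamp from the future.

```rust
use std::time::{Duration, Instant};

// A hypothetical component that flushes its buffer once enough perceived
// time has passed. It never reads the system clock; time only ever
// arrives through the `now` argument of tick.
struct Flusher {
    last_flush: Instant,
    interval: Duration,
    pending: usize,
}

impl Flusher {
    // Returns the number of flushed items, if a flush happened.
    fn tick(&mut self, now: Instant) -> Option<usize> {
        if self.pending > 0 && now.duration_since(self.last_flush) >= self.interval {
            self.last_flush = now;
            let n = self.pending;
            self.pending = 0;
            Some(n)
        } else {
            None
        }
    }
}
```

A test can jump the simulated clock forward ten minutes in one call, with no real waiting.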
Randomness
Controlling randomness boils down to using a single pseudo-random number generator everywhere random decisions need to be made. This random number generator is initialized with a random seed printed out at the start of each test. Every random decision (e.g. event re-ordering, state machine selection for ticks, failure injection) is made based on this random number generator. Seeding this random number generator with the same seed guarantees the same sequence of random decisions.
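A sketch of this setup, using xorshift64 purely as a dependency-free stand-in for a real seedable PRNG (the seeded_rng helper is illustrative, not our real API):

```rust
// One PRNG instance, seeded once at test start, drives every random
// decision: event re-ordering, machine selection for ticks, failure
// injection. Note the seed must be nonzero for xorshift to produce output.
struct Rng(u64);

impl Rng {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

// Print the seed up front so any failure can be replayed from it.
fn seeded_rng(seed: u64) -> Rng {
    eprintln!("DST seed: {seed}");
    Rng(seed)
}
```

Reseeding with the value printed by a failing run replays the identical sequence of decisions, which is what makes the failure reproducible.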
Failure Injection
The conventional design for failure injection provides components with an abstract dependency that is swapped out during testing for a randomly fallible implementation. For example, our FrostDB DST implementation used an in-memory VFS interface that would randomly fail. As mentioned earlier, the downside of this approach is that failure injection needs to be implemented for every single fallible interface. Additionally, passing a random number generator to each fallible dependency implementation creates a lot of boilerplate and complexity.
With the state machine architecture, dependencies are not “injected” into components that need them. Instead, dependencies are modeled as state machines themselves and therefore dynamically resolved at runtime. For example, an IO write just becomes a message with some bytes sent to the IO state machine. In production, these messages are handled explicitly and result in what you would expect, but in testing the message bus is free to do anything with these messages.
Since all calls to fallible dependencies are just messages, the message bus implements failure injection by sending back an error for any given message to the original component. It can also choose whether to do this before or after applying the original message to the destination state machine, or even drop messages at any point. Crucially, this failure injection needs to be implemented only once in the message bus to apply to all state machines (including all dependencies). This results in very powerful and comprehensive failure injection while avoiding the complexity of reimplementing failure injection for every fallible component.
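A sketch of what this can look like at the delivery point. The Message variants and the fail_one_in knob are hypothetical names for illustration; actual delivery to the IO state machine is elided.

```rust
#[derive(Debug, PartialEq)]
enum Message {
    Write(Vec<u8>),
    WriteOk,
    WriteFailed,
}

// Tiny seeded PRNG (xorshift64) so the sketch stays dependency-free.
struct Rng(u64);
impl Rng {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

// Decide the fate of one in-flight write at the message bus. The
// fail_one_in knob tunes failure injection; the return value is the
// reply routed back to the sender.
fn deliver(rng: &mut Rng, m: Message, fail_one_in: u64) -> Message {
    match m {
        Message::Write(_) if rng.next() % fail_one_in == 0 => {
            // Inject a failure before the destination ever sees the write.
            Message::WriteFailed
        }
        Message::Write(_bytes) => Message::WriteOk,
        other => other,
    }
}
```

Because every fallible call flows through this one choke point, turning the knob up or down changes failure behavior for all components at once.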
Disadvantages
The state machine architecture allows for flexible and powerful simulation, but does come with some disadvantages that need to be navigated when applied in practice.
First, forcing the system to be written as a set of state machines imposes considerable cognitive overhead on developers. For DST to be effective, core system logic must live in state machines, which places real restrictions on how developers write code. From a theoretical point of view this is great, because it forces developers to think carefully about the exact state transitions their state machines make, but in practice the path of least resistance wins: code that should be in state machines leaks up into the production driver and goes untested in DST. This is mostly a problem right now because our database architecture is still evolving rapidly, and I’m hopeful it will improve as the architecture stabilizes.
Second, external dependencies can introduce nondeterminism, reducing DST coverage since you cannot control what happens inside the dependency. Dependency nondeterminism must be eliminated, otherwise DST reproducibility goes out the window. In practice, we categorize dependencies as either non-deterministic (varying output for the same input) or deterministic (consistent output for the same input). The first kind must be mocked out during testing (e.g. UUID generation), while the second kind can be wrapped in a state machine during testing for maximum coverage. Even if the dependency spawns multiple futures under the hood, we call block_on on any async functions to block the testing thread until the dependency produces its output. However, there is no easy answer to the reduced DST coverage that dependencies cause; our approach is to remove as many dependencies as possible over time.
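As a sketch of the block_on idea using only the standard library (in practice you would likely reach for an existing block_on, such as the one in the futures crate): the testing thread polls the future and parks until it resolves, so the dependency's work completes entirely within the current simulation step.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// A waker that unparks the (single) testing thread.
struct ThreadWaker(std::thread::Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

// Block the current thread until the future completes. No task outlives
// the call, so nothing runs concurrently with the rest of the simulation.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(std::thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => std::thread::park(),
        }
    }
}
```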
Conclusion
There are many ways to implement DST in your system. For existing applications considering implementing DST, I still think that relying on third parties is the correct tradeoff to make (WASM+runtime in Go, madsim in Rust) since a full codebase rewrite is usually too much to stomach. However, the state machine architecture outlined in this blog post offers both a simpler and more powerful way to build a system where determinism is paramount for correctness.
After our experience retrofitting DST into FrostDB, this new architecture was the right choice for us. We lost the ability to “just write code” and let the DST harness we built for FrostDB handle determinism, but gained tremendous control over the critical ingredients required for DST. This approach has already proven its value – we've discovered two critical bugs in our new Rust database: one causing data loss and another causing data duplication. Testing aside, this architecture is also a powerful mental model to reason about system behavior. Even though cognitive overhead during implementation is significant, it comes from having to think explicitly about your system’s state transitions. In the long term, this is a good thing.
I hope this blog post inspires you to implement deterministic simulation testing into whatever you’re building and gives you an idea of some ways to do this. If you have any questions or comments, I’m always happy to chat (reachable at alfonso@polarsignals.com). Let’s build more correct systems together!