(Mostly) Deterministic Simulation Testing in Go

May 28, 2024

Deterministic simulation testing (DST) is a testing approach that has gained popularity lately with Antithesis coming out of stealth mode. They do a much better job than I could ever do explaining what DST is and why it’s an extremely powerful tool in your testing arsenal, so I encourage you to read their blog posts and listen to Antithesis’ founder Will Wilson's talk at StrangeLoop 2014 to get a general understanding of DST and its benefits. In a nutshell, however, DST’s core differentiator from other testing methods is determinism. Similar to randomized testing, each test run starts with a random seed and explores a random program execution path. However, if this execution path fails, the developer can deterministically reproduce the failure using the random seed used to generate the test run. Simulation further enhances the execution path exploration by randomly injecting faults in software layers and seeing how the program behaves. This idea is not new (think of mocking), but paired with deterministic test failure reproductions, is a very powerful tool.

The key advantages of DST are that exhaustive exploration of execution paths uncover a lot of hairy and hard-to-reproduce bugs much earlier in the development cycle and deterministic reproduction of these bugs mean that the reproduce, debug, and fix cycles are much shorter. This makes both customers and developers happier!

If it’s so great, why isn’t everybody doing it?

DST is super powerful, but it is also extremely tricky to get right. Most projects that have DST as part of their testing toolkit were designed from the ground up to be deterministic (e.g. TigerBeetle). A core problem, for example, is that a deterministic program cannot run on more than one OS thread since the scheduling and execution of threads is outside the developer’s control. To address this issue, the projects running DST build their own concurrency model (i.e. scheduling) that runs on a single OS thread. FoundationDB built Flow to this end, and Resonate runs their core code in a single goroutine running a cooperative scheduler to execute concurrent tasks.

At Polar Signals, we were attracted to the benefits that DST could give us to improve our testing of FrostDB but writing our own scheduler seemed like a daunting task. We didn’t really want to give up on parallelism in production so we were looking at either rearchitecting our database to use separate schedulers for the parallel (e.g. data processing)/non-parallel components that communicate with message queues similar to what Resonate does, or figure out how to write a scheduler that offered parallelism only in production.

As you can see, when you start to actually want to adopt DST, things start to get tricky. This is why it was exciting to see Antithesis emerge from stealth. Their value proposition is to run your application as a set of containers with a test workload on a proprietary deterministic hypervisor that offers determinism at a much lower level of abstraction than a userspace scheduler in the software stack, with the result that application code can run unchanged and reap the benefits of DST. A lot of existing projects like CockroachDB used Antithesis while it was still in stealth successfully and we were excited to try it out. However, we were also pretty budget-constrained and the pricing just didn’t work for us. We know the Antithesis team is working on this and has also announced a giveaway program to offer free testing to open-source projects.

Pareto to the rescue: DST at Polar Signals in Go

We were now between a rock and a hard place if we wanted to implement DST in our database. Given that we couldn’t conjure money out of thin air, we went back to studying the approach of the custom scheduler. One thought we kept coming back to was why write our own scheduler when Go provides its own battle-tested runtime scheduler? Was there a way to work with rather than against the scheduler in order to achieve deterministic scheduling or something close to it?

The answer is yes! We don’t want to claim that the approach we ended up using achieves 100% deterministic execution of any Go program, but it was good enough to achieve the deterministic execution of a toy program and find three data loss and two data duplication bugs in a deterministic integration test of our database in the first couple of weeks of existence. This approach comes with some caveats/limitations which we will cover, but we thought it would be useful to share what we ended up doing with the Go community.

Achieving single-threaded execution

The first thing that comes to any Gopher’s mind when someone mentions single-threaded execution in Go is to use GOMAXPROCS to limit the number of OS threads executing Go code. As a refresher, the Go scheduler works by multiplexing goroutines (if you’re not familiar with Go, think of them as lightweight userspace threads) onto OS threads. Using GOMAXPROCS is a step in the right direction, but does not achieve the desired outcome (see this google group discussion). As the documentation states “there is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit”. This makes sense. If a goroutine and therefore its running thread is blocked on a read syscall, for example, useful work can be scheduled on other threads until the read returns.

So we need to go a level deeper than just what the Go runtime can offer us to limit executions to a single OS thread. At this point a bit of pattern-matching occurred and we recalled that it was possible to compile Go programs to WASM and run them on a WASM runtime that implements the WebAssembly syscall interface (wasip1). Crucial to our goals, WebAssembly programs are run on a single thread by design (although there is currently a proposal to offer multi-threaded execution).

Running Go programs on a WASM runtime forces execution onto a single OS thread and disables non-cooperative preemption (although this could likely be deterministic if it were supported). The caveat is that wasip1 as a syscall interface is relatively limited and your program might not easily compile to WASM. In our case, our core database could run on WASM just fine, but we had to switch out a dependency at compile time that couldn’t be compiled to WASM since it relied on some arch-specific features.

Randomness

Even if you have single-threaded execution there are many sources of randomness in Go. I stumbled across this interesting google group discussion on the topic that also mentions this blog post that does a nice job summarizing sources of randomness one could encounter in a Go program. Some of them are even purposeful, like map iteration, so that developers do not rely on any specific map iteration order. One comment in the Go runtime I found interesting was that even scheduling order of goroutines is randomized in race builds to shake out latent scheduling assumptions in the code.

Thankfully, all random choices in the runtime use a global random number generator seeded at startup, including the rand package exposed to the Go application. This seed is read using the OS-specific readRandom function. There is no way to provide this seed at startup, so we ended up modifying the Go runtime to read the seed via an environment variable. This allows us to use a predefined seed before running our simulation tests and provide the same seed if the test fails to reproduce the same execution.

The downside here is that this approach requires using a modified version of the Go runtime. This sounds terrible at first and a maintainability nightmare, but in practice, the change is less than 10 lines of code (could be less) in a very stable part of the codebase.

Time to party?

Not quite. Another source of execution randomness is time. time.Now will return different values between executions, and even a sleep is best effort and doesn’t cause the caller to sleep exactly the amount of time given as an argument.

We took a lazy approach here thanks to the Go runtime. The Go playground was built as a web service to run arbitrary Go programs in a sandbox for educational purposes. Although not fully deterministic, the playground uses the concept of fake time to make it easier to cache programs by giving them deterministic output and avoid limiting programs running in the playground that use time. This blog post, although a little outdated, delves a little deeper into the subject. The key idea is that time starts at a fixed timestamp and only advances when all goroutines are blocked. This advances time virtually only when necessary and returns stable timestamps across runs.

Faketime can be enabled by setting a Go build tag (-tags=faketime). This is great since no runtime modifications are required, but the downside is that the manipulation of time is very coarse-grained, as well as some edge cases where a valid Go program can enter into an infinite loop. For our purposes, we chose to modify the runtime the least we could and see how far we’d get (again, thinking about Pareto’s 80-20 rule). Long term, we’d probably like finer-grained control over how time advances, although the issue of maintainability rears its head again in this case.

Validating these changes

While playing around with these key changes, we wrote a simple Go program that spawned a bunch of workers that performed random things and recorded the execution order to validate that the changes described above would achieve deterministic executions across runs with the same initial seed. Eventually, we got to a place where this was true, while a Go program running on a vanilla runtime and no extra build arguments would produce a random execution every time (minus the occasional collisions). Feel free to take a look at the toy repo and run it with the custom runtime implementation (see entropy.sh).

To run DST using this limited but effective approach, the GOROOT environment variable should be set to wherever the custom Go repository lives. After installing the wasmtime runtime (any runtime can be used), tests can be invoked using the go test command as follows:

GORANDSEED=<random_seed> GOOS=wasip1 GOARCH=wasm $GOROOT/bin/go test -tags=faketime --exec="$GOROOT/misc/wasm/go_wasip1_wasm_exec -S inherit-env=y" ./…

We also tested a FrostDB integration test with these changes and as mentioned above, uncovered three data loss and two data duplication bugs:

#864 fixed a data loss bug and duplicate data bug.
#866 fixed a data loss bug.
#871 fixed a duplicate data bug.
#874 fixed a data loss bug and also found a regression in a recent change that cause a crash.

Although we focused mainly on determinism, the little fault simulation we did inject was randomizing goroutine scheduling as done for race builds (without actually having to run a slower race build). This helped uncover the mentioned bugs due to logical races, even though scheduling randomness is limited to local run queues and runnable goroutines. Eventually, we would like to introduce more failure simulation by mocking out more critical components of our stack (e.g. filesystem, maybe even at the wasip1 syscall layer).

Limitations and Future Work

A lot of the limitations of this approach were alluded to in the previous sections. Go programs need to be able to be compiled to WASM, a custom runtime needs to be used, and even then, simulation of goroutine scheduling order is limited. If this approach were to be evaluated in terms of how many possible theoretical execution paths can be explored given any random seed compared to Antithesis’ solution for example, this approach would likely fall pretty short.

The approach itself is also likely incomplete. The title of this blog post claims “(mostly) deterministic” simulation testing because we mostly achieve deterministic executions of our test with the same seed, but occasionally this is not true for reasons we have not yet investigated.

One option that we briefly looked at (although too late to consider) that would remove the requirement to compile to WASM (and possibly modify the Go runtime given it captures reads to /dev/urandom) was to use https://github.com/facebookexperimental/hermit. However, this was discarded partly because we had already validated the WASM approach when we came across the Hermit option, and partly because it is no longer under active development. It is definitely on our radar though and could be a good solution to solve some of the limitations outlined in this blog post.

We’re also interested in taking a look at how far we can customize WASM runtimes. The wasip1 syscall interface seems to be a great place to add fault simulation and possibly remove some customization we added to the Go runtime (e.g. by intercepting calls to get_random). We chose to use wasmtime as the WASM runtime to start with simply because it was the one mentioned in the blog post announcing wasip1 support in Go. Wazero looks like another interesting runtime given that it could remove the need to install the runtime separately and just be a Go dependency. It also seems to offer a certain level of customizability that could be interesting for our use case.

Conclusion

Despite its limitations, our DST approach in Go provided significant value quickly, even with failure simulations restricted to limited goroutine scheduling randomization. Over time, we aim to incorporate more randomized failure simulations to enhance our testing coverage and confirm that integrating DST into our testing toolkit improves software reliability, boosts developer confidence when modifying the codebase, and reduces debugging time.

We hope the ideas presented in this blog post have been helpful to other Go developers considering DST as a testing method. Our experience shows that you don’t always need a perfect solution; sometimes, an approach that gets you 80% of the way there is sufficient. While we eventually plan to use a tool like Antithesis, starting with our own implementation targeting a higher layer of the stack proved beneficial. It not only enhanced our software's reliability but also deepened our understanding of our DST needs.

Discuss: