The Cost of Go's Interfaces and How to Fix It

Why dynamic dispatch is necessary and how to reduce its cost with devirtualization and PGO.

November 24, 2023

We'll start with the fundamentals: what are dynamic dispatch and devirtualization in Go with regard to interfaces?

In Go, when a function accepts a parameter of an interface type and a method ends up being called on that parameter, Go first needs to figure out which concrete implementation to execute, because it doesn't know the concrete type at compile time; not having to know it is the whole point of interfaces. This runtime lookup is referred to as dynamic dispatch.

Let's have a look at a small example:

package main

type TestInterface interface {
	Something()
}

type ConcreteType struct{}

func (t ConcreteType) Something() {}

func main() {
	t := ConcreteType{}
	AcceptsInterface(t)
}

func AcceptsInterface(i TestInterface) {
	for j := 0; j < 1_000_000; j++ {
		i.Something()
	}
}

This piece of code has a main function that instantiates a ConcreteType, which implements TestInterface by defining the Something() method on it (itself a no-op). main passes the instance to AcceptsInterface, which takes a parameter i of type TestInterface and calls Something() one million times on it. Because AcceptsInterface doesn't know the concrete type, it has to figure out which concrete implementation of Something() to call on every single one of those one million executions.

What's the impact of dynamic dispatch? Let's benchmark it!

Here's a simple benchmark.

// In a _test.go file in the same package.
package main

import "testing"

func BenchmarkInterfaceCall(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		main()
	}
}

Let's run it.

$ go test -bench=. -count=5 -cpuprofile=dynamic-dispatch-cpu.prof -memprofile=dynamic-dispatch-mem.prof | tee dynamic-dispatch.txt
goos: darwin
goarch: arm64
pkg: github.com/polarsignals/go-interface-devirtualization-pgo
BenchmarkInterfaceCall-10 1254 954272 ns/op 13 B/op 0 allocs/op
BenchmarkInterfaceCall-10 1258 959966 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 1240 958960 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 1243 959028 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 1260 956171 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/polarsignals/go-interface-devirtualization-pgo 6.855s

Ok, interesting result: the first run shows 13 B/op? Let's have a look at the memory profile.

Alright, it looks like those few bytes came from the go test harness and the profiling itself; the remaining runs don't allocate at all, so our own code doesn't do any heap allocations - phew!

Now, the most dramatic way to demonstrate the cost of dynamic dispatch is to type-assert to the concrete type before calling the method.

- i.Something()
+ i.(ConcreteType).Something()
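
For clarity, here's the full AcceptsInterface with that change applied:

func AcceptsInterface(i TestInterface) {
	for j := 0; j < 1_000_000; j++ {
		// The assertion lets the compiler emit a direct call to
		// ConcreteType.Something instead of an indirect call through
		// the interface's method table.
		i.(ConcreteType).Something()
	}
}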

And rerun the benchmark.

$ go test -run=^$ -bench=BenchmarkInterfaceCall -count=5 -cpuprofile=type-assert-cpu.prof -memprofile=type-assert-mem.prof | tee type-assert.txt
goos: darwin
goarch: arm64
pkg: github.com/polarsignals/go-interface-devirtualization-pgo
BenchmarkInterfaceCall-10 3318 319502 ns/op 4 B/op 0 allocs/op
BenchmarkInterfaceCall-10 3813 319497 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 3820 320011 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 3750 320674 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 3780 320104 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/polarsignals/go-interface-devirtualization-pgo 6.458s

And compare.

$ benchstat dynamic-dispatch.txt type-assert.txt
name old time/op new time/op delta
InterfaceCall-10 958µs ± 0% 320µs ± 0% -66.59% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
InterfaceCall-10 0.00B 0.00B ~ (all equal)
name old allocs/op new allocs/op delta
InterfaceCall-10 0.00 0.00 ~ (all equal)

Wow! A ~66% improvement. Of course this is a synthetic example to demonstrate the cost of dynamic dispatch, but I think we've shown there is overhead.

When the compiler applies this kind of optimization by itself, that's called devirtualization.

Note: Don't do this at home. A single-value type assertion like this panics if the concrete type doesn't match. If you really want to do this manually, only ever use a type switch (or the comma-ok form) with a fallback to the plain interface call.
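
As a minimal sketch (using the same types as above), a safe manual version could look like this:

func AcceptsInterface(i TestInterface) {
	for j := 0; j < 1_000_000; j++ {
		switch v := i.(type) {
		case ConcreteType:
			// Fast path: direct, devirtualized call when the type matches.
			v.Something()
		default:
			// Fallback: regular dynamic dispatch for any other implementation.
			i.Something()
		}
	}
}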

Enter Profile-Guided Optimization (PGO)

Now, wouldn't it be nice if we didn't have to apply this optimization ourselves? It turns out that with Go 1.21, profile-guided optimization (PGO) is now generally available. PGO can be summarized as feeding profiling data to the compiler so that it can perform optimizations it couldn't otherwise justify, because the profiling data tells it which ones are possible and actually worthwhile.

Let's give it a spin. All we need to do is either place a CPU profile named default.pgo in the main package's directory, or pass a profile via the -pgo build flag. We'll undo the type assertion and use the profiling data we collected from our first run.

$ go test -run=^$ -bench=BenchmarkInterfaceCall -count=5 -cpuprofile=pgo-devirtualization-cpu.prof -memprofile=pgo-devirtualization-mem.prof -pgo dynamic-dispatch-cpu.prof | tee pgo-devirtualization.txt
goos: darwin
goarch: arm64
pkg: github.com/polarsignals/go-interface-devirtualization-pgo
BenchmarkInterfaceCall-10 2226 478978 ns/op 7 B/op 0 allocs/op
BenchmarkInterfaceCall-10 2520 478064 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 2482 477475 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 2505 478984 ns/op 0 B/op 0 allocs/op
BenchmarkInterfaceCall-10 2488 478541 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/polarsignals/go-interface-devirtualization-pgo 6.528s

And compare to the initial run.

$ benchstat dynamic-dispatch.txt pgo-devirtualization.txt
name old time/op new time/op delta
InterfaceCall-10 957µs ± 0% 478µs ± 0% -50.00% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
InterfaceCall-10 0.00B 0.00B ~ (all equal)
name old allocs/op new allocs/op delta
InterfaceCall-10 0.00 0.00 ~ (all equal)

Wow, nice: we didn't have to modify our code and still got a 50% improvement! Why "only" 50%? Unlike the manual type assertion, which would have panicked if the concrete type weren't the one we asserted, the devirtualization pass has to make sure our code still behaves correctly when some other concrete type shows up, so it keeps a fallback to the dynamic call.

The way this works is that, thanks to the provided profiling data, the Go compiler knows which concrete implementation is actually being called at this call site in practice, and therefore inserts the equivalent of the type switch above to devirtualize the call automatically.
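
Conceptually, the devirtualized call behaves roughly like the sketch below (the function name is just for illustration; the compiler does this in its intermediate representation rather than by rewriting your Go source):

func acceptsInterfaceDevirtualized(i TestInterface) {
	// Rough sketch of what PGO-based devirtualization behaves like,
	// not actual compiler output.
	if c, ok := i.(ConcreteType); ok {
		c.Something() // direct call to the hot implementation, eligible for inlining
	} else {
		i.Something() // fallback: dynamic dispatch for any other concrete type
	}
}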

What's next?

We've learned that dynamic dispatch can have a significant cost in Go, but remember: never optimize prematurely without measuring that the optimization is worth it. With PGO we can automate this and don't have to think about it or hunt for the cases where it's worth it. PGO is still very new in the Go toolchain, and while it's impressive already, it's evolving quickly; I was happy to see that, while I was writing this blog post, a new optimization was implemented that combines function inlining with devirtualization.

Lastly, there has always been a bit of a UX issue with PGO: how do you get representative profiling data from production? The answer: use a continuous profiler! As it so happens, the Parca open-source project and Polar Signals Cloud are currently the only documented solutions that produce profiling data suitable for Go's PGO.

You can start a free 14-day trial today and try it for yourself with our zero-instrumentation eBPF-based profiler; deployment only takes seconds!
