Why Compiler Function Inlining Matters

All about function inlining, how it helps us create performant software, how we can learn to work with it, and how it influences profiling.

December 15, 2021

Let's imagine a super simple program written in Go.
This program loops from 0 to 999 and, on each iteration, adds the return value of `add` to a running result.

package main

func main() {
	var result uint64
	for i := uint64(0); i < 1_000; i++ {
		result += add(result, i)
	}
}

//go:noinline
func add(a, b uint64) uint64 {
	return a + b
}

Note: This works similarly in other compiled languages; we just use Go as an example.

The comment `//go:noinline` tells the compiler not to inline this function. How does it perform? Let’s see with this simple benchmark.

package main

import "testing"

// BenchmarkAdd runs the whole program b.N times.
func BenchmarkAdd(b *testing.B) {
	for i := 0; i < b.N; i++ {
		main()
	}
}

We run the benchmark ten times (`-count=10`) and save the output to a file:
`go test -bench=BenchmarkAdd -count=10 | tee BenchmarkAddNoInline.txt`

goos: linux
goarch: amd64
pkg: github.com/polarsignals/inlining
cpu: AMD Ryzen 9 3900X 12-Core Processor
BenchmarkAdd-24 835048 1825 ns/op
BenchmarkAdd-24 859546 1606 ns/op
BenchmarkAdd-24 856646 1909 ns/op
BenchmarkAdd-24 855582 1715 ns/op
BenchmarkAdd-24 856621 1431 ns/op
BenchmarkAdd-24 845157 1545 ns/op
BenchmarkAdd-24 765014 1466 ns/op
BenchmarkAdd-24 812818 1441 ns/op
BenchmarkAdd-24 787130 1496 ns/op
BenchmarkAdd-24 867459 1456 ns/op
PASS
ok github.com/polarsignals/inlining 18.092s

On its own, this doesn't tell us much, so we want to compare it against a benchmark run in which the `add` function is inlined. Once we remove the `//go:noinline` comment, the compiler should inline this function. Let's run the benchmark again:
`go test -bench=BenchmarkAdd -count=10 | tee BenchmarkAddInline.txt`

goos: linux
goarch: amd64
pkg: github.com/polarsignals/inlining
cpu: AMD Ryzen 9 3900X 12-Core Processor
BenchmarkAdd-24 2404077 464.0 ns/op
BenchmarkAdd-24 2358136 485.2 ns/op
BenchmarkAdd-24 2420811 464.9 ns/op
BenchmarkAdd-24 2429461 477.0 ns/op
BenchmarkAdd-24 2376388 459.8 ns/op
BenchmarkAdd-24 2396380 483.8 ns/op
BenchmarkAdd-24 2533822 476.6 ns/op
BenchmarkAdd-24 2457052 460.3 ns/op
BenchmarkAdd-24 2430799 488.9 ns/op
BenchmarkAdd-24 2432954 467.3 ns/op
PASS
ok github.com/polarsignals/inlining 16.517s

Interesting!

Go has a little helper tool called benchstat that we can use to compare these results by passing it both files: `benchstat BenchmarkAddNoInline.txt BenchmarkAddInline.txt`

name    old time/op  new time/op  delta
Add-24  1.59µs ±20%  0.47µs ± 3%  -70.25%  (p=0.000 n=10+10)
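
To see where the delta comes from, the arithmetic can be reproduced by hand. The following is a minimal sketch and not benchstat's actual implementation (benchstat also handles outliers and runs a significance test, which is where the p-value comes from). The helper names `mean` and `delta` are made up for illustration; the numbers are the ns/op values from the two runs above.

```go
package main

import "fmt"

// ns/op values copied from the two benchmark runs above.
var (
	noInline = []float64{1825, 1606, 1909, 1715, 1431, 1545, 1466, 1441, 1496, 1456}
	inlined  = []float64{464.0, 485.2, 464.9, 477.0, 459.8, 483.8, 476.6, 460.3, 488.9, 467.3}
)

// mean returns the arithmetic mean of xs.
func mean(xs []float64) float64 {
	var sum float64
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// delta returns the relative change from before to after, in percent.
func delta(before, after float64) float64 {
	return (after/before - 1) * 100
}

func main() {
	before, after := mean(noInline), mean(inlined)
	fmt.Printf("old %.0f ns/op, new %.1f ns/op, delta %.2f%%\n",
		before, after, delta(before, after))
}
```

The means come out at roughly 1589 ns/op and 472.8 ns/op, which gives the same -70.25% that benchstat reports.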

It seems that for this example program inlining the add function makes a huge difference. Why is that?

Why does inlining exist?

When you call a function in your program, the compiler must emit a few extra instructions to make that call happen. Depending on the calling convention (ABI), it passes the function arguments either on the stack or in CPU registers. Next, the return address, that is, the address of the instruction in the caller that follows the call, must be saved (on x86-64 it is pushed onto the stack) so that we can continue where we left off.
Finally, a jump (or similar) instruction begins executing the called function. When the function returns, the process is reversed: the caller's stack frame is restored and the return values are read from the stack or from CPU registers. This overhead is relatively small per call, but in a tight loop it can really add up. Inlining removes it by copying the instructions of the called function directly into the function that calls it.

If this sounds like a lot of overhead for this small function, then you're right.
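
To get a feel for that overhead outside of `go test`, here is a rough sketch that runs the same loop twice, once through a function marked `//go:noinline` and once through an otherwise identical function that the compiler is free to inline. The names `addCall`, `addInlinable`, `sumWithCall` and `sumInlined` are made up for this example, and the wall-clock timing is deliberately crude, so treat the numbers as an indication only.

```go
package main

import (
	"fmt"
	"time"
)

// addCall is never inlined, so every use pays the full call overhead.
//
//go:noinline
func addCall(a, b uint64) uint64 { return a + b }

// addInlinable is small enough that the compiler will typically inline it.
func addInlinable(a, b uint64) uint64 { return a + b }

// sumWithCall adds 0..n-1 using a real function call per iteration.
func sumWithCall(n uint64) uint64 {
	var result uint64
	for i := uint64(0); i < n; i++ {
		result = addCall(result, i)
	}
	return result
}

// sumInlined adds 0..n-1; the call is expected to be inlined away.
func sumInlined(n uint64) uint64 {
	var result uint64
	for i := uint64(0); i < n; i++ {
		result = addInlinable(result, i)
	}
	return result
}

func main() {
	const n = 100_000_000

	start := time.Now()
	a := sumWithCall(n)
	callTime := time.Since(start)

	start = time.Now()
	b := sumInlined(n)
	inlineTime := time.Since(start)

	// Printing the sums keeps the compiler from discarding the loops.
	fmt.Printf("with call: %v (sum %d)\n", callTime, a)
	fmt.Printf("inlined:   %v (sum %d)\n", inlineTime, b)
}
```

Both loops return their result and the program prints it, so the compiler cannot drop the work as unused. The exact timings depend on the machine and the Go version.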

Inlining has other nice properties as well. Because the inlined instructions sit contiguously with the caller's code, the CPU does not have to jump elsewhere to fetch them, which tends to be friendlier to the instruction cache. Inlining also lets the compiler see across what used to be a call boundary, which can enable further optimizations.

Inlining

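Because `add` is small and simple enough to stay within the compiler's inlining budget, Go decides to inline it once the `//go:noinline` comment is removed. We can check this by compiling the program with the `-m` compiler flag, which prints optimization decisions: `go build -gcflags -m main.go`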

# command-line-arguments
./main.go:10:6: can inline add
./main.go:3:6: can inline main
./main.go:6:16: inlining call to add

We can compare the assembly with and without inlining by adding and removing that `//go:noinline` comment.

On the compiled binary we can run `go tool objdump main | grep main.go`

TEXT main.main(SB) /home/metalmatze/src/github.com/polarsignals/inlining/main.go
main.go:3 0x4553e0 493b6610 CMPQ 0x10(R14), SP
main.go:3 0x4553e4 7651 JBE 0x455437
main.go:3 0x4553e6 4883ec28 SUBQ $0x28, SP
main.go:3 0x4553ea 48896c2420 MOVQ BP, 0x20(SP)
main.go:3 0x4553ef 488d6c2420 LEAQ 0x20(SP), BP
main.go:3 0x4553f4 31c0 XORL AX, AX
main.go:3 0x4553f6 31c9 XORL CX, CX
main.go:5 0x4553f8 eb2b JMP 0x455425
main.go:5 0x4553fa 4889442418 MOVQ AX, 0x18(SP)
main.go:6 0x4553ff 48894c2410 MOVQ CX, 0x10(SP)
main.go:6 0x455404 4889c3 MOVQ AX, BX
main.go:6 0x455407 4889c8 MOVQ CX, AX
main.go:6 0x45540a e831000000 CALL main.add(SB)
main.go:5 0x45540f 488b4c2418 MOVQ 0x18(SP), CX
main.go:5 0x455414 48ffc1 INCQ CX
main.go:6 0x455417 488b542410 MOVQ 0x10(SP), DX
main.go:6 0x45541c 4801c2 ADDQ AX, DX
main.go:5 0x45541f 4889c8 MOVQ CX, AX
main.go:6 0x455422 4889d1 MOVQ DX, CX
main.go:5 0x455425 483de8030000 CMPQ $0x3e8, AX
main.go:5 0x45542b 72cd JB 0x4553fa
main.go:8 0x45542d 488b6c2420 MOVQ 0x20(SP), BP
main.go:8 0x455432 4883c428 ADDQ $0x28, SP
main.go:8 0x455436 c3 RET
main.go:3 0x455437 e824ceffff CALL runtime.morestack_noctxt.abi0(SB)
main.go:3 0x45543c eba2 JMP main.main(SB)
TEXT main.add(SB) /home/metalmatze/src/github.com/polarsignals/inlining/main.go
main.go:12 0x455440 4801d8 ADDQ BX, AX
main.go:12 0x455443 c3 RET

At the end of the listing we can see our `add` function, which consists of just two instructions. We can also see the `CALL main.add(SB)` instruction that invokes it, surrounded by `MOVQ` instructions that save values to the stack and set up the argument registers. Now, if we let the Go compiler inline the `add` function, we get the following assembly:

TEXT main.main(SB) /home/metalmatze/src/github.com/polarsignals/inlining/main.go
main.go:3 0x4553e0 31c0 XORL AX, AX
main.go:5 0x4553e2 eb03 JMP 0x4553e7
main.go:5 0x4553e4 48ffc0 INCQ AX
main.go:5 0x4553e7 483de8030000 CMPQ $0x3e8, AX
main.go:5 0x4553ed 72f5 JB 0x4553e4
main.go:8 0x4553ef c3 RET

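As you can see, there is no `CALL main.add(SB)` anymore; everything happens within `main.main`, so the overhead of calling `add` is gone. Note that the `ADDQ` has disappeared too: once `add` is inlined, the compiler can see that `result` is never used and drops the addition entirely, leaving only the loop counter. Part of the speedup measured above therefore comes from this follow-up optimization rather than from the saved call overhead alone.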

Function inlining and profiling

Inlined functions effectively disappear as separate function calls in the compiled binary. When reading source code we usually don't know which calls were inlined, so it's important that profiling tools can still attribute time to them and tell them apart.

In pprof, each Location (an address in the binary) holds one or more Lines, and each Line references a Function. An inlined function therefore shares its Location with the function it was inlined into, but it has its own Line (after all, it still lives on a different source code line) that points to its own Function. More on the pprof internals can be found in our previous “DIY pprof profiles using Go” blog post!
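
The relationship between Locations, Lines and Functions can be sketched with a few simplified types. These are stand-ins written for this example, not the real pprof types, and the function names and addresses in `exampleStack` are made up. The sketch assumes that the Lines of a Location are ordered with the innermost inlined function first and the function it was inlined into last.

```go
package main

import "fmt"

// Function, Line and Location are simplified stand-ins for the pprof
// concepts of the same names (assumption: reduced to the fields we need).
type Function struct {
	Name string
}

type Line struct {
	Function *Function
	Line     int64 // source line number
}

type Location struct {
	Address uint64
	Lines   []Line // innermost inlined function first, physical caller last
}

// expand flattens a leaf-first stack of Locations into function names,
// turning every Line of a Location into its own frame.
func expand(stack []Location) []string {
	var names []string
	for _, loc := range stack {
		for _, line := range loc.Lines {
			names = append(names, line.Function.Name)
		}
	}
	return names
}

// exampleStack builds a hypothetical stack in which waitRead was inlined
// into Read, so both share a single Location.
func exampleStack() []Location {
	waitRead := &Function{Name: "waitRead"}
	read := &Function{Name: "Read"}
	mainFn := &Function{Name: "main"}
	return []Location{
		{Address: 0x1000, Lines: []Line{
			{Function: waitRead, Line: 42},
			{Function: read, Line: 17},
		}},
		{Address: 0x2000, Lines: []Line{{Function: mainFn, Line: 5}}},
	}
}

func main() {
	fmt.Println(expand(exampleStack())) // prints [waitRead Read main]
}
```

Two Locations expand into three frames here, which is how an inlined function gets its own entry in a stack trace even though it has no address of its own.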

Rendering these inlined functions is done by showing them as part of the stack trace and essentially “squeezing” them in between the other functions.

Here you can see a part of a Prometheus goroutine stack trace. The `waitRead` function was inlined and is shown like any other function.
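
The Go runtime exposes the same idea through `runtime.Callers` and `runtime.CallersFrames`, which report inlined functions as separate frames. The sketch below uses made-up function names (`leaf`, `caller`, `stackNames`, `hasFrame`). It assumes the compiler inlines `leaf` into `caller`, which is typical for a function this small but not guaranteed.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// leaf is small, so the compiler will typically inline it into caller.
func leaf() []string {
	return stackNames()
}

//go:noinline
func caller() []string {
	return leaf()
}

// stackNames returns the function names of the current call stack,
// innermost first, starting at the function that called stackNames.
//
//go:noinline
func stackNames() []string {
	pcs := make([]uintptr, 16)
	n := runtime.Callers(2, pcs) // skip runtime.Callers and stackNames
	frames := runtime.CallersFrames(pcs[:n])
	var names []string
	for {
		frame, more := frames.Next()
		names = append(names, frame.Function)
		if !more {
			break
		}
	}
	return names
}

// hasFrame reports whether any recorded function name ends in "."+short.
func hasFrame(names []string, short string) bool {
	for _, name := range names {
		if strings.HasSuffix(name, "."+short) {
			return true
		}
	}
	return false
}

func main() {
	names := caller()
	for _, name := range names {
		fmt.Println(name)
	}
	fmt.Println("leaf visible:", hasFrame(names, "leaf"))
}
```

Even when `leaf` no longer exists as a separate call in the binary, it still shows up in the list of frames, just like `waitRead` in the stack trace above.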

Rendering a flame graph with inlined functions

Parca is a continuous profiling project for applications and infrastructure. Each profile in Parca needs to be rendered as an icicle graph. To do that, we walk all stack traces of a profile and build a tree from them: every stack trace starts at the shared root and is inserted as a path into the existing tree, merging with the nodes that are already there.

Rendering inlined functions properly in Parca’s icicle graphs turned out to be quite a challenge. While merging a new stack trace into the tree, each inlined function becomes its own subtree of stack traces, which in turn has to be merged correctly into the existing tree.
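
The general idea of merging stack traces into a tree can be sketched as follows. This is not Parca's implementation; it is a minimal illustration with hypothetical names (`Node`, `insert`, `buildExample`). It assumes each stack trace is already a root-first list of function names in which inlined functions have been expanded into frames of their own.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Node is one box in the icicle graph.
type Node struct {
	Name       string
	Cumulative int64 // total value of all stacks passing through this node
	Children   map[string]*Node
}

func newNode(name string) *Node {
	return &Node{Name: name, Children: map[string]*Node{}}
}

// insert merges one root-first stack trace into the tree below n,
// reusing existing nodes where the path already exists.
func (n *Node) insert(stack []string, value int64) {
	n.Cumulative += value
	cur := n
	for _, name := range stack {
		child, ok := cur.Children[name]
		if !ok {
			child = newNode(name)
			cur.Children[name] = child
		}
		child.Cumulative += value
		cur = child
	}
}

// dump prints the tree with indentation, children in sorted order.
func (n *Node) dump(depth int) {
	fmt.Printf("%s%s (%d)\n", strings.Repeat("  ", depth), n.Name, n.Cumulative)
	names := make([]string, 0, len(n.Children))
	for name := range n.Children {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		n.Children[name].dump(depth + 1)
	}
}

// buildExample merges three hypothetical stack traces; in the first one,
// waitRead is an inlined function expanded into its own frame.
func buildExample() *Node {
	root := newNode("root")
	root.insert([]string{"main", "Read", "waitRead"}, 1)
	root.insert([]string{"main", "Read"}, 2)
	root.insert([]string{"main", "Write"}, 3)
	return root
}

func main() {
	buildExample().dump(0)
}
```

In this simplified form an inlined frame is just another node on the path. The real difficulty described above comes from doing this merge on profile data where several frames share one location.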

Our implementation has handled these cases correctly since we merged our pull request: https://github.com/parca-dev/parca/pull/485

Roadmap for inlined functions

Currently, we don’t show the inlined functions in any specific way. What do you think, reader, would you want us to handle these more specifically in the icicle graphs? Is it fine for you to simply show them as "normal" functions?
