Almost every developer knows the phrase “premature optimization is the root of all evil”, coined by Donald Knuth in 1974. But how are we supposed to know what is worthy of being optimized?
Our computing power improved manifold since then. But the sentiment to focus on the real issues for optimization efforts holds still true. Understanding the different kinds of latency and how to find actual relevant bottlenecks, and not only perceived ones, is the key to good benchmarking.
The Pareto Principle
It’s an axiom of business management, named after the Italian economists Vilfredo Pareto, stating:
80% of sales come from 20% of clients.
This distribution applies to many different areas. In computer science, we can apply the principle to our optimization efforts. 80% of the actual work and time are done by 20% of code.
There might be low-hanging fruits for our optimization efforts in the 80%. But we should concentrate on the harder 20% if we want to make a real impact.
To identify the bottlenecks and optimization-worthy code, we need to benchmark the right things.
Different Kinds of Latency
Computers are highly complex systems. The further away we get from its core, the CPU, the slower it gets. And there are many different parts involved before our code hits the actual silicon.
The speed degradation isn’t linear either. As developers, we should actually know the factors between the different kinds of latencies, so we understand which parts worth optimizing.
The different numbers are not easy to comprehend, so imagine that a single L1 cache reference would take 1 second.
To access data in our main memory, it takes now about as long as we should brush our teeth.
Reading 1MB from the main memory would take 50 minutes, and reading the same amount from an SSD would take more than half a day.
A single packet round-trip between Amsterdam and San Francisco would take almost 5 years!
This is supposed to illustrate that we need to optimize a lot of CPU cycles before it really matters, compared to the other latencies. Saving some iterations over data in memory is great. But caching some data instead of getting it every time from the database might be a better optimization effort.
Compiler and Runtime Optimizations
One of the biggest nemesis of benchmarking are compilers and runtimes.
Compilers try to optimize our code to varying degrees. They change the actual source code before compiling it to machine code instructions. Runtimes and Virtual Machines are even worse. By working with an intermediate language, like Bytecode or CIL, they can optimize code just-in-time:
null-checks, control flow optimizations to prefer hot paths, unrolling loops, inlining methods and final variables, generating native code, are some of the most common optimizing techniques. Every language has its own specific optimization ruleset, e.g., Java is replacing String concatenation with a
StringBuilder to reduce
This means the actual performance might not be constant and prone to change, thanks to the runtime or virtual machine understanding your code better, and optimizing it further.
As a result, we can’t benchmark our code bu running it a few times in a loop and just measure the elapsed time with a stopwatch around method invocation.
Java Microbenchmark Harness
The simplest way to really benchmark your Java code is the Java Microbenchmark Harness (JMH). It helps to benchmark the actual performance by taking care of JVM warm-up and code-optimizations that might dilute the results.
Putting on the Harness
The JMH became the de-facto standard for benchmarking, and it got included in the JDK 12. Before this version, we need to add the dependencies manually:
We can run the benchmark with our favorite build system or IDEs:
Creating a Benchmark
It’s as easy as creating a unit test: Create a new file, add benchmark method annotated with
@Benchmark, and add a
main-wrapper to run it:
The final output is something like this:
Benchmark (N) Mode Cnt Score Error Units Benchmark.benchmark1 1000 avgt 3 0.004 ± 0.001 ms/op Benchmark.benchmark1 10000 avgt 3 0.043 ± 0.002 ms/op Benchmark.benchmark2 1000 avgt 3 0.004 ± 0.001 ms/op Benchmark.benchmark3 10000 avgt 3 0.040 ± 0.004 ms/op
There are for different benchmark types available:
Average time per operation.
Time for each operation, including min and max.
Time for a single operation.
Operations per unit of time.
All of the above.
We can set the desired modes with the annotation
The default mode is
Dealing with the JVM
To warm-up the JVM we can add the
@Warmup(iterations = <int>)annotation.
Our benchmarks will be run the specified times, and the results will be discarded.
After that, the JVM should be warm enough, and JMH runs the actual benchmarks and presents us with the results.
What Time is it?
We can specify the time unit our results should be printed in by adding the annotation
Providing state allows us to simplify the benchmarking code.
By creating a helper class with
@Scope(...) we can specify parameters that should be benchmarked:
Useful benchmarks have to work around JVM optimizations, or we just check how good the JVM is, and not our code.
The JVM can detect if you actually have dead code, and remove it:
result is never used, so its actually dead code and all three lines of the benchmark will be removed.
There are 2 options to force the JVM not to eliminate dead code:
Don’t use the return type
If you actually
return resultin the method, the JVM can’t be 100% sure its dead code, so it won’t be removed.
org.openjdk.jmh.infra.Blackholecan be passed as an argument, and it provides
consume(...)methods, so the result won’t be dead code.
Even if we return the result or use a Blackhole to prevent dead code removal, the JVM might optimize the constant values. This reduces our code to something like this:
Providing a state classs prevents the JVM from optimizing the constants away:
Benchmarking is a lot like unit testing. We shouldn’t test or benchmark large units of code. The smaller the code units are, the smaller are possible side-effects. We need to minimize anything that might pollute the benchmark results.
Every time you see a benchmark done on a developer machine, like a MacBook Pro, take it with a grain of salt. Developer machines behave differently, compared to production environments, depending on a multitude of parameters (e.g., VM options, CPU, memory, operating system, system settings, …).
My Java development setup, for example, consists of multiple Docker containers on a single machine (Eclipse, MySQL, MongoDB, RabbitMQ), plus some other containers (ELK-Stack, Postgres, Killbill, MariaDB). They all share the same 32 GB of RAM, and 12 CPU threads. Production is distributed among multiple hosts, with less containers, and double the RAM and CPU threads, plus RAID 1 SSD configuration.
Benchmark results won’t be very representative if we reach the limits of our hardware. We want our benchmarks to represent our code’s actual performance, not the “almost identical” development setup.
Running the benchmarks locally is a good starting point, but it won’t necessarily reflect the real-world, especially in edge-cases.
Good (Micro-)Benchmarks are hard.
Almost every in the pipeline between our source code and running on the silicon works against exact measuring. But with the help of JMH we gain back a lot of control needed for reliable results.
The quintessence of optimization is that we should stop worrying about the wrong issues. Are your benchmark results really applicable to real-world situations of our code? Look at the bigger picture and concentrate on the real issues, like optimizing data-access, algorithms, and data structures.
This article only scratched the surface of benchmarking with JMH. It’s a powerful addition to our toolbelt.