Almost every developer knows the phrase “premature optimization is the root of all evil”, coined by Donald Knuth in 1974. But how are we supposed to know what is worthy of being optimized?

Our computing power improved manifold since then. But the sentiment to focus on the real issues for optimization efforts holds still true. Understanding the different kinds of latency and how to find actual relevant bottlenecks, and not only perceived ones, is the key to good benchmarking.


The Pareto Principle

It’s an axiom of business management, named after the Italian economists Vilfredo Pareto, stating:

80% of sales come from 20% of clients.

This distribution applies to many different areas. In computer science, we can apply the principle to our optimization efforts. 80% of the actual work and time are done by 20% of code.

There might be low-hanging fruits for our optimization efforts in the 80%. But we should concentrate on the harder 20% if we want to make a real impact.

To identify the bottlenecks and optimization-worthy code, we need to benchmark the right things.


Different Kinds of Latency

Computers are highly complex systems. The further away we get from its core, the CPU, the slower it gets. And there are many different parts involved before our code hits the actual silicon.

The speed degradation isn’t linear either. As developers, we should actually know the factors between the different kinds of latencies, so we understand which parts worth optimizing.

Adapted from https://github.com/colin-scott/interactive_latencies

The different numbers are not easy to comprehend, so imagine that a single L1 cache reference would take 1 second.

To access data in our main memory, it takes now about as long as we should brush our teeth.

Reading 1MB from the main memory would take 50 minutes, and reading the same amount from an SSD would take more than half a day.

A single packet round-trip between Amsterdam and San Francisco would take almost 5 years!

This is supposed to illustrate that we need to optimize a lot of CPU cycles before it really matters, compared to the other latencies. Saving some iterations over data in memory is great. But caching some data instead of getting it every time from the database might be a better optimization effort.


Compiler and Runtime Optimizations

One of the biggest nemesis of benchmarking are compilers and runtimes.

Compilers try to optimize our code to varying degrees. They change the actual source code before compiling it to machine code instructions. Runtimes and Virtual Machines are even worse. By working with an intermediate language, like Bytecode or CIL, they can optimize code just-in-time:

Removing null-checks, control flow optimizations to prefer hot paths, unrolling loops, inlining methods and final variables, generating native code, are some of the most common optimizing techniques. Every language has its own specific optimization ruleset, e.g., Java is replacing String concatenation with a StringBuilder to reduce String creation.

This means the actual performance might not be constant and prone to change, thanks to the runtime or virtual machine understanding your code better, and optimizing it further.

As a result, we can’t benchmark our code bu running it a few times in a loop and just measure the elapsed time with a stopwatch around method invocation.


Java Microbenchmark Harness

The simplest way to really benchmark your Java code is the Java Microbenchmark Harness (JMH). It helps to benchmark the actual performance by taking care of JVM warm-up and code-optimizations that might dilute the results.

Putting on the Harness

The JMH became the de-facto standard for benchmarking, and it got included in the JDK 12. Before this version, we need to add the dependencies manually:

We can run the benchmark with our favorite build system, our IDEs, or even our build systems:

Creating a Benchmark

It’s as easy as creating a unit test: Create a new file, add benchmark method annotated with @Benchmark, and add a main-wrapper to run it:

1
2
3
4
5
public class Runner {
    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }
}

The final output is something like this:

Benchmark                (N)  Mode  Cnt  Score    Error  Units
Benchmark.benchmark1    1000  avgt    3  0.004 ±  0.001  ms/op
Benchmark.benchmark1   10000  avgt    3  0.043 ±  0.002  ms/op
Benchmark.benchmark2    1000  avgt    3  0.004 ±  0.001  ms/op
Benchmark.benchmark3   10000  avgt    3  0.040 ±  0.004  ms/op

Benchmark Types

There are for different benchmark types available:

  • Mode.AverageTime
    Average time per operation.
  • Mode.SampleTime
    Time for each operation, including min and max.
  • Mode.SingleShotTime
    Time for a single operation.
  • Mode.Throughput
    Operations per unit of time.
  • Mode.All
    All of the above.

We can set the desired modes with the annotation @BenchmarkMode(...). The default mode is Mode.Throughput.

Dealing with the JVM

To warm-up the JVM we can add the @Warmup(iterations = <int>)annotation. Our benchmarks will be run the specified times, and the results will be discarded. After that, the JVM should be warm enough, and JMH runs the actual benchmarks and presents us with the results.

What Time is it?

We can specify the time unit our results should be printed in by adding the annotation @OutputTimeUnit(<java.util.concurrent.TimeUnit>):

  • TimeUnit.NANOSECONDS
  • TimeUnit.MICROSECONDS
  • TimeUnit.MILLISECONDS
  • TimeUnit.SECONDS
  • TimeUnit.MINUTES
  • TimeUnit.HOURS
  • TimeUnit.DAYS

State Management

Providing state allows us to simplify the benchmarking code. By creating a helper class with @Scope(...) we can specify parameters that should be benchmarked:

1
2
3
4
5
6
@State(Scope.Benchmark)
public class MyBenchmarkState {
 
    @Param({ "1", "10", "100", "1000", "10000" })
    public int value;
}

Best Practices

Useful benchmarks have to work around JVM optimizations, or we just check how good the JVM is, and not our code.

Dead Code

The JVM can detect if you actually have dead code, and remove it:

1
2
3
4
5
6
@Benchmark
public void benchmark1() {
    long lhs = 123L;
    long rhs = 321L;
    long result = lhs + rhs;
}

The variable result is never used, so its actually dead code and all three lines of the benchmark will be removed.

There are 2 options to force the JVM not to eliminate dead code:

  • Don’t use the return type void.
    If you actually return result in the method, the JVM can’t be 100% sure its dead code, so it won’t be removed.

  • Using a Blackhole.
    The class org.openjdk.jmh.infra.Blackhole can be passed as an argument, and it provides consume(...) methods, so the result won’t be dead code.

Constant Optimization

Even if we return the result or use a Blackhole to prevent dead code removal, the JVM might optimize the constant values. This reduces our code to something like this:

1
2
3
4
5
@Benchmark
public long benchmark1() {
    long result = 444L;
    return result;
}

Providing a state classs prevents the JVM from optimizing the constants away:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
@State(Scope.Thread)
public static class MyState {
    public long lhs = 123L;
    public long rhs = 321L;
}

@Benchmark
public long benchmark1(MyState state) {
    long result = state.lhs + state.rhs;
    return result;
}

Smaller Units

Benchmarking is a lot like unit testing. We shouldn’t test or benchmark large units of code. The smaller the code units are, the smaller are possible side-effects. We need to minimize anything that might pollute the benchmark results.

(Almost) Production

Every time you see a benchmark done on a developer machine, like a MacBook Pro, take it with a grain of salt. Developer machines behave differently, compared to production environments, depending on a multitude of parameters (e.g., VM options, CPU, memory, operating system, system settings, …).

My Java development setup, for example, consists of multiple Docker containers on a single machine (Eclipse, MySQL, MongoDB, RabbitMQ), plus some other containers (ELK-Stack, Postgres, Killbill, MariaDB). They all share the same 32 GB of RAM, and 12 CPU threads. Production is distributed among multiple hosts, with less containers, and double the RAM and CPU threads, plus RAID 1 SSD configuration.

Benchmark results won’t be very representative if we reach the limits of our hardware. We want our benchmarks to represent our code’s actual performance, not the “almost identical” development setup.

Running the benchmarks locally is a good starting point, but it won’t necessarily reflect the real-world, especially in edge-cases.


Conclusion

Good (Micro-)Benchmarks are hard.

Almost every in the pipeline between our source code and running on the silicon works against exact measuring. But with the help of JMH we gain back a lot of control needed for reliable results.

The quintessence of optimization is that we should stop worrying about the wrong issues. Are your benchmark results really applicable to real-world situations of our code? Look at the bigger picture and concentrate on the real issues, like optimizing data-access, algorithms, and data structures.

This article only scratched the surface of benchmarking with JMH. It’s a powerful addition to our toolbelt.

Resources