Optimising for performance in Elm, Pt. 5

Pitfalls you may run into

Ilias Van Peer

--

When optimising for performance, there are a few things to watch out for.

Micro-benchmarks and JIT compilers

When you’re writing micro-benchmarks and getting funky results, you might not want to trust those results. Let’s start with an example:
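
A comparison along these lines, written against the elm-benchmark 1.x API (describe, Benchmark.compare, benchmark1, Benchmark.Runner.program) that the rest of this series uses, might look roughly like this sketch; the commented-out “optimisation buster” shows one plausible shape such a buster can take, not necessarily the one in the embedded example.

import Benchmark exposing (Benchmark, benchmark1, describe)
import Benchmark.Runner exposing (BenchmarkProgram, program)

suite : Benchmark
suite =
    describe "Tuple access"
        [ Benchmark.compare "first element"
            (benchmark1 "Tuple.first" Tuple.first ( 1, 2 ))
            (benchmark1 "pattern match" (\( x, _ ) -> x) ( 1, 2 ))

        -- "Optimisation buster" (one plausible shape): an extra, differently
        -- typed call site that keeps the JIT from specialising Tuple.first
        -- for the fixed input above.
        -- , benchmark1 "buster" Tuple.first ( "a", "b" )
        ]

main : BenchmarkProgram
main =
    program suite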

Wait, what?

If you’re on a recent version of Chrome, you’ll notice that Tuple.first outperforms a simple inline pattern match. That seems… odd. Well, it is!

Notice the commented-out “optimisation buster”? Let’s toggle it and see what happens.

Urhm.

Woah! Suddenly, our pattern match outperforms Tuple.first. Explaining what exactly is going on here would require inspecting the intermediate code the JIT uses. However, that’s pretty complicated, so the gist of it is this: the function using Tuple.first is optimised by the JIT, possibly to the point of just inlining the first result, most likely during the initial “figuring out how many times to call this function for each sample” step.

See, the properties that make Elm so nice to work with (i.e. pure functions, type safety, etc.) also make it easy for JavaScript compilers to optimise. While that is certainly a very good thing, it does make it exceedingly hard to write micro-benchmarks with any degree of certainty.

One thing we’re doing in our micro-benchmarks is calling the same function, with the same input, over and over again. Given the simplicity of the function, the JIT is perfectly capable of noticing that it will always return the exact same result, which means it doesn’t actually have to do any computation at all. The more advanced JIT compilers become, the harder it becomes to write (micro)benchmarks that represent reality.

The benchmarking library for Elm currently doesn’t keep any reference to the results of the function calls made as part of a benchmark. This makes things slightly less predictable, as the JIT may just as well decide to replace the entire function with a no-op: a side-effect-free function whose result is discarded is essentially a no-op.

For the time being, there’s only one thing to do: don’t trust wonky numbers. Perhaps, at some point in the future, we may be able to run benchmarks with varying or cycling inputs, and keep track of the results. The major difficulties with such an approach will likely be figuring out a workable API and keeping the overhead of doing these things to a bare minimum, ideally not tracking time during the “input generation” and “output storing” phases of the process.

An interesting resource that talks more about microbenchmarks and compilers can be found here: http://mrale.ph/blog/2012/12/15/microbenchmarks-fairy-tale.html

Testing the intended behaviour

In a very early version of my Dict optimisation adventure, my benchmark was very crude. Essentially, it boiled down to this:

-- Insert key 1000 with value `()` into a dictionary that has 1000 entries
benchmark3 "Size 1000" Dict.insert 1000 () (dictOfSize 1000)

There are a few things that make this a pretty bad benchmark for testing insertion behaviour:

  • Not all inserts are equal: some may require rebalancing, while others leave the tree balanced. Doing one single insertion, over and over, will not give us any information about the amortised cost, only about how this particular insert behaves.
  • It’s a no-op! The third argument, dictOfSize 1000, creates a dictionary with keys 1 through 1000, all initialised with value (), so inserting key 1000 with value () merely overwrites an existing entry with an identical one. (A plausible definition of dictOfSize is sketched below.)
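
For reference, a plausible definition of the dictOfSize helper (an assumption; its actual definition isn’t shown in this post) would look something like this:

import Dict exposing (Dict)

-- Hypothetical helper: a dictionary with keys 1 through `size`, each
-- initialised with the value `()`.
dictOfSize : Int -> Dict Int ()
dictOfSize size =
    List.range 1 size
        |> List.map (\key -> ( key, () ))
        |> Dict.fromList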

A better solution, for the time being, is to replace that with something like the following.

benchmark1 "Size 1000" Dict.fromList listOf1000KeyValuePairs

Now, this does come with a downside: Dict.fromList has some overhead of its own, namely folding over that large list. On the other hand, folding over a list is a fairly small cost, especially compared to the inserts themselves. This approach becomes especially valuable when comparing different Dict implementations: the underlying fromList implementation will most likely be exactly the same, while the insert implementation will most likely be entirely different. Since both branches of the compare have to deal with the same overhead, the result is a fair comparison, at the cost of benchmarking a bit more than just the function we’re trying to benchmark.
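
Concretely, such a comparison might look like the following sketch. The listOf1000KeyValuePairs helper and the AltDict module are illustrative assumptions: AltDict stands in for whatever alternative Dict implementation is being measured, exposing the same fromList and insert.

import Benchmark exposing (Benchmark, benchmark1)
import Dict
import AltDict -- hypothetical alternative Dict implementation with the same API

listOf1000KeyValuePairs : List ( Int, () )
listOf1000KeyValuePairs =
    List.range 1 1000
        |> List.map (\key -> ( key, () ))

-- Both branches pay the same list-folding overhead, so any measured
-- difference comes down to the insert logic each fromList relies on.
insertionSuite : Benchmark
insertionSuite =
    Benchmark.compare "fromList, size 1000"
        (benchmark1 "core Dict" Dict.fromList listOf1000KeyValuePairs)
        (benchmark1 "alternative Dict" AltDict.fromList listOf1000KeyValuePairs)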

Takeaways

  • Be careful when constructing micro-benchmarks
  • Don’t trust wonky results
  • Test the intended behaviour, rather than one specific case
