
Should benchmarking be done at all? What is the point?

An answer to this question on the Scientific Computing Stack Exchange.

Question

I am reading a paper which compares algorithm A versus algorithm B.

It shows that algorithm A is faster than algorithm B via benchmarking that shows the CPU time.

What is the point of this? Any benchmark depends on the particular implementation. The benchmarking says nothing about the intrinsic speed of algorithm A versus algorithm B -- all it tells us is that the implementation of algorithm A happened to be faster than the implementation of algorithm B.

So what is the point of that? I'd even go as far as saying such information is downright misleading, since it makes an assertive claim (A is faster than B) when in fact such a claim is wholly dependent on the coding and optimization abilities of the author of the code.

I see this mistake particularly in statistics, where many authors implement their code in R, and R code can be notoriously faster or slower depending on the smallest of changes (e.g. vectorization or using compiled base R functions). It is therefore very easy to write an implementation of an algorithm that happens to be slower than an implementation of another algorithm, even if the first algorithm would be faster were it implemented well.

Answer

Yes, benchmarking should be done. I make this claim as an Editor, Author, and Reviewer. Below, I represent these roles' stances slightly hyperbolically.

But let me steelman your argument first. In addition to the following claims:

The benchmarking says nothing about the intrinsic speed of algorithm A versus algorithm B -- all it tells us is that the implementation of algorithm A happened to be faster than the implementation of algorithm B.

[...] it makes an assertive claim (A is faster than B) when in fact such a claim is wholly dependent on the coding and optimization abilities of the author of the code.

I think you might agree with or even implicitly make an additional claim:

If Algorithm A has clear theoretical advantages over Algorithm B, then it is not necessary to compare the two in practice. For instance, if Algorithm A runs in $O(N)$ time and Algorithm B runs in $O(N^2)$ time, then it is pointless to compare them.

Would you rather:

  1. Live in a world in which authors implement their algorithms and make their code available for use by normal humans?
  2. Or live in a world in which authors make "brilliant" advances but never bother to code them up, leaving that as an "exercise for the reader"?

I prefer the first world. When I have a problem, the best outcome for me is to look in the literature, find a lucid write-up with a link to a GitHub repo, and then download well-commented source code to use as a library in my own work. It may take a few rounds of theory to get to this point, but I think it's an objectively better world state.

How do I bring about this world as an Editor? By requiring authors to submit code along with their papers.

How do I bring about this world as an Author? By making my code available and raking in the citations.

How do I bring about this world as a Reviewer? By recommending revision or rejection for papers without implementations.


Another question. Would you rather:

  1. Live in a world in which it's clear which code you should use?
  2. Or a world in which every author claims their code is the best and it's left as an "exercise for the reader" to verify those claims?

I prefer the first world.

As an Editor, every paper I get claims to have solved X better than any other paper ever. It is my job to protect your time by serving as the first line of defense against a wave of BS (and, conversely, by helping good work get seen). Sometimes this is easy. The authors reduce an $O(N^2)$ solution to $O(N)$. Other times everything is $O(N)$.

In this latter case, I have a few choices: accept every well-written paper that claims that for theoretical reasons their work is better, reject all new papers because the status quo is clearly good enough, or ask authors to benchmark.
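To make the "everything is $O(N)$" situation concrete, here is a minimal sketch (my own illustration in Python, not from the original post): two implementations of the same linear-time computation whose constant factors differ substantially. Asymptotic analysis cannot separate them; only a benchmark can.

```python
import timeit

# Two "algorithms" for the same task, both O(N): summing N numbers.
# Same asymptotic class, very different constant factors.

def sum_loop(xs):
    """Explicit accumulation, executed as interpreted bytecode."""
    total = 0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    """The same O(N) algorithm via the C-implemented built-in."""
    return sum(xs)

data = list(range(100_000))
assert sum_loop(data) == sum_builtin(data)  # identical results

t_loop = timeit.timeit(lambda: sum_loop(data), number=50)
t_builtin = timeit.timeit(lambda: sum_builtin(data), number=50)
print(f"loop:    {t_loop:.3f}s")
print(f"builtin: {t_builtin:.3f}s")  # typically several times faster
```

Both functions return the same answer and have the same complexity class; a reviewer looking only at the theory has no basis for preferring one over the other. The timing gap is exactly the kind of claim a benchmark exists to adjudicate.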

This last option puts the burden of proof on authors to support their claims and, importantly, it encourages them to write decently performant implementations because they know that I will be asking future authors to use their work as a baseline.

Thus, by consistently requesting benchmark comparisons, I encourage authors to (a) implement their code, (b) contextualize their implementation versus the status quo, and (c) constrain the claims they make about their work to reality.

As an Editor, I could require authors to rewrite other folks' code to make a "fair" comparison. But I can't trust them to be fair: they are not motivated to do this work and very well might miss something. Thus, I ask only for improvement in new work. If older works didn't get their implementation details right, that's just too bad for them.

As an Editor, I am a volunteer. I am overworked. I get dozens of papers a week. Wouldn't it be nice if there were a simple way to determine whether they meet the bar for publication?

As a Reviewer, I prefer the first world. I am a volunteer. I am overworked. I want to do a nominally good job, but I don't have the time or incentive to do a great job. Wait, there's a benchmark? Problem solved.

As an Author, I prefer the first world as well. Every paper in my background section claims it's the best paper ever. To differentiate my work from that sea of noise, I use benchmarks to demonstrate clearly and compellingly that my claims are true.


tl;dr Editors and Reviewers and Readers are all incentivized to want benchmarks. Authors should be as well, but they have a conflict of interest, since they want to maximize their number of publications by minimizing the work put into each. Editors and Reviewers provide the excuse Authors need to do things right.


You're right that R makes it easy to write slow, crappy code. But this is the very reason benchmarking is important: it's the only mechanism we have for ratcheting performance forward.
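As a hedged illustration of that sensitivity (using Python with NumPy as an analogue for R's vectorization; this code is mine, not from the original post): the same elementwise computation written as an interpreted loop versus a single vectorized expression. The "smallest of changes" the questioner mentions can shift the timing by an order of magnitude or more.

```python
import timeit
import numpy as np

# The same computation, z[i] = x[i]*x[i] + y[i], two ways.

def elementwise_loop(x, y):
    """One interpreted iteration per element."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = x[i] * x[i] + y[i]
    return out

def elementwise_vectorized(x, y):
    """One pass over the arrays in compiled code."""
    return x * x + y

n = 200_000
x_list = [float(i) for i in range(n)]
y_list = [float(i) for i in range(n)]
x_arr = np.arange(n, dtype=float)
y_arr = np.arange(n, dtype=float)

t_loop = timeit.timeit(lambda: elementwise_loop(x_list, y_list), number=20)
t_vec = timeit.timeit(lambda: elementwise_vectorized(x_arr, y_arr), number=20)
print(f"loop:       {t_loop:.3f}s")
print(f"vectorized: {t_vec:.3f}s")
```

Both versions compute identical results, which is the point: the implementation gap is invisible in the math and visible only when someone actually runs a benchmark.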