These workarounds might be good enough to detect ~1-5% changes from a baseline of a native, pre-compiled application. However, this won't be sufficient for many dynamic, JIT-compiled languages, which will usually still have large amounts of inter-run variance due to timing-sensitive compilation choices made by the runtime. A statistically significant ~10% change can be hard to detect in these circumstances from a single run.
In my experience multi-run benchmarking frameworks which use non-parametric statistics should be the default tool of choice unless you know the particular benchmark is exceptionally well behaved.
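To make that concrete, here's a minimal sketch of the kind of comparison I mean (pure Python, invented timings; a permutation test on the difference of medians standing in for whatever non-parametric test a real framework would use):

```python
import random
import statistics

def median_diff(a, b):
    """Difference of medians; robust to a few pathological runs."""
    return statistics.median(b) - statistics.median(a)

def permutation_test(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of medians.
    Makes no assumption about the shape of the timing distribution."""
    rng = random.Random(seed)
    observed = abs(median_diff(baseline, candidate))
    pooled = list(baseline) + list(candidate)
    n = len(baseline)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if abs(median_diff(pooled[:n], pooled[n:])) >= observed:
            hits += 1
    return hits / n_resamples  # approximate p-value

# Hypothetical wall-clock times (seconds) from 10 runs of each build.
baseline  = [1.02, 0.99, 1.01, 1.00, 1.05, 0.98, 1.03, 1.00, 1.21, 1.01]
candidate = [1.09, 1.12, 1.08, 1.11, 1.30, 1.07, 1.10, 1.09, 1.13, 1.08]

print(f"median delta: {median_diff(baseline, candidate):+.3f} s")
print(f"p ~ {permutation_test(baseline, candidate):.4f}")
```

A real framework adds warmup handling, interleaved runs, and so on, but the core idea is the same: many runs, a robust summary, and a test that doesn't assume normality.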
> In my experience multi-run benchmarking frameworks which use non-parametric statistics should be the default tool of choice unless you know the particular benchmark is exceptionally well behaved.
Agreed. Do you have any suggestions? :-)
I like taking the trimmed mean of 10-20 runs, or if a run is quick, the (half-sample) mode of more runs. See robust_statistics.h.
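In case it helps, this is roughly what those two estimators look like; a simplified Python sketch rather than the actual robust_statistics.h code, with made-up timings:

```python
def trimmed_mean(xs, trim=0.25):
    """Mean after dropping the lowest and highest `trim` fraction of runs."""
    xs = sorted(xs)
    k = int(len(xs) * trim)
    kept = xs[k:len(xs) - k] if len(xs) > 2 * k else xs
    return sum(kept) / len(kept)

def half_sample_mode(xs):
    """Half-sample mode: repeatedly keep the shortest (densest) half of
    the sorted sample, then average what little remains."""
    xs = sorted(xs)
    while len(xs) > 3:
        half = (len(xs) + 1) // 2
        start = min(range(len(xs) - half + 1),
                    key=lambda i: xs[i + half - 1] - xs[i])
        xs = xs[start:start + half]
    if len(xs) == 3:
        # Keep the tighter of the two adjacent pairs.
        xs = xs[:2] if xs[1] - xs[0] < xs[2] - xs[1] else xs[1:]
    return sum(xs) / len(xs)

# Hypothetical timings in ms; one run hit a GC pause / frequency dip.
runs = [102.1, 101.8, 102.0, 130.4, 101.9, 102.3, 103.1, 101.7, 102.2, 102.0]
print(trimmed_mean(runs))      # average with the 130 ms outlier trimmed away
print(half_sample_mode(runs))  # centre of the densest cluster, ~102 ms
```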
> These workarounds might be good enough to detect ~1-5% changes from a baseline of a native, pre-compiled application.
and
> However, this won't be sufficient for many dynamic, JIT-compiled languages, which will usually still have large amounts of inter-run variance due to timing-sensitive compilation choices made by the runtime.
are not mutually exclusive. Any sufficiently complex statically compiled application will suffer from the same variance issues.
> A statistically significant ~10% change can be hard to detect in these circumstances from a single run.
Multiple runs do not solve the problem. For example, if your 1st test-run reports 11%, the 2nd 8%, the 3rd 18%, the 4th 9% and the 5th 10%, how do you decide whether the 18% from the 3rd test-run is noise or signal?
> Multiple runs do not solve the problem. For example, if your 1st test-run reports 11%, the 2nd 8%, the 3rd 18%, the 4th 9% and the 5th 10%, how do you decide whether the 18% from the 3rd test-run is noise or signal?
In your 5 sample example, you can't determine if there are any outliers. You need more samples, each containing a multitude of observations. Then, using fairly standard nonparametric measures of dispersion and central tendency, a summary statistic should make sense, due to the CLT.
Outliers are only important if you have to throw away data; good measures of central tendency should be robust to them unless your data is largely noise.
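Concretely, something like this is what I mean by robust summaries (pure Python; the regression numbers just extend your 5-run example with made-up values): median as the central tendency, MAD as the dispersion, and a percentile bootstrap interval around the median so that more runs actually buys you a tighter estimate.

```python
import random
import statistics

def mad(xs):
    """Median absolute deviation: robust counterpart of the std deviation."""
    med = statistics.median(xs)
    return statistics.median(abs(x - med) for x in xs)

def bootstrap_median_ci(xs, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the median.
    No normality assumption on the raw values; the interval shrinks
    as the number of runs grows."""
    rng = random.Random(seed)
    meds = sorted(statistics.median(rng.choices(xs, k=len(xs)))
                  for _ in range(n_boot))
    return meds[int(n_boot * alpha / 2)], meds[int(n_boot * (1 - alpha / 2)) - 1]

# Made-up per-run regressions (%), extending the 11/8/18/9/10 example.
regressions = [11, 8, 18, 9, 10, 12, 9, 11, 10, 13, 9, 10]
print(statistics.median(regressions), mad(regressions))  # 10.0 1.0
print(bootstrap_median_ci(regressions))                  # 95% CI for the median
```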
> Any sufficiently complex statically compiled application will suffer from the same variance issues.
Sure, it's a rule of thumb.
> In your 5 sample example, you can't determine if there are any outliers. You need more samples
I think the same issue is present no matter how many samples we collect. The statistical apparatus of choice may indeed tell us that a given sample is an outlier in our experiment setup, but what I am wondering is: what if that sample was an actual signal that we measured, and not noise?
Concrete example: in 10 out of 100 test-runs you see a regression of 10%; the rest show 3%. You can 10x or 100x that example if you wish. Are those 10% regressions outliers because "the environment was noisy", or did our code really run slower, for whatever conditions/reasons, in those experiments?
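To put numbers on it (toy data, pure Python): a robust summary of that distribution happily reports ~3%, and the 10% runs only show up if you look at the tail explicitly. The summary alone can't tell you whether they are noise or a real slow path that fires one time in ten.

```python
import statistics

# Toy version of the 10/100 example: 90 runs regress ~3%, 10 runs regress ~10%.
runs = [3.0] * 90 + [10.0] * 10

median = statistics.median(runs)
p95 = sorted(runs)[int(0.95 * len(runs))]
slow_share = sum(r >= 8.0 for r in runs) / len(runs)

print(f"median regression: {median:.1f}%")     # 3.0% -- the slow runs vanish from the summary
print(f"95th percentile:   {p95:.1f}%")        # 10.0% -- but they dominate the tail
print(f"runs >= 8% slower: {slow_share:.0%}")  # 10% of runs
```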
> Then using fairly standard nonparametric measures of dispersion and central tendency, a summary statistic should make sense, due to CLT.
In theory, yes, and for a sufficiently large N (number of samples). Sometimes you cannot afford to reach this "sufficiently large N" condition.