> In my experience, multi-run benchmarking frameworks that use non-parametric statistics should be the default tool of choice unless you know the particular benchmark is exceptionally well behaved.
Agreed. Do you have any suggestions? :-)
I like taking the trimmed mean of 10-20 runs or, if a run is quick, the (half-sample) mode of a larger number of runs. See robust_statistics.h.
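
For anyone who wants the idea without reading the header, here is a minimal sketch of both estimators. It is not the code from robust_statistics.h: the names `TrimmedMean` and `HalfSampleMode`, the `trim_fraction` parameter, and the tie-breaking choices are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: mean of the samples left after dropping
// floor(n * trim_fraction) values from each end of the sorted data.
// trim_fraction is an assumed parameter in [0, 0.5).
double TrimmedMean(std::vector<double> samples, double trim_fraction) {
  assert(!samples.empty());
  assert(trim_fraction >= 0.0 && trim_fraction < 0.5);
  std::sort(samples.begin(), samples.end());
  const std::size_t n = samples.size();
  const std::size_t drop = static_cast<std::size_t>(n * trim_fraction);
  const double sum = std::accumulate(samples.begin() + drop,
                                     samples.end() - drop, 0.0);
  return sum / static_cast<double>(n - 2 * drop);
}

// Hypothetical helper: half-sample mode. Repeatedly keep the narrowest
// window that contains half (rounded up) of the remaining sorted samples,
// until at most two samples remain, then return their average.
double HalfSampleMode(std::vector<double> samples) {
  assert(!samples.empty());
  std::sort(samples.begin(), samples.end());
  std::size_t lo = 0;              // start of the current window
  std::size_t n = samples.size();  // number of samples in the window
  while (n > 2) {
    const std::size_t half = (n + 1) / 2;  // ceil(n / 2)
    std::size_t best = lo;
    double best_width = samples[lo + half - 1] - samples[lo];
    for (std::size_t i = lo + 1; i + half <= lo + n; ++i) {
      const double width = samples[i + half - 1] - samples[i];
      if (width < best_width) {  // ties keep the leftmost window
        best_width = width;
        best = i;
      }
    }
    lo = best;
    n = half;
  }
  return n == 1 ? samples[lo] : 0.5 * (samples[lo] + samples[lo + 1]);
}
```

Usage would look something like `TrimmedMean(durations, 0.25)` for a handful of slow runs, or `HalfSampleMode(durations)` when there are many quick runs. Both ignore the occasional outlier run (a context switch, a cold cache), which is what throws off the plain mean.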