The author disables SMT (hyperthreading) like this:
# take each listed logical CPU offline
disabled_cpus=(1 3 5 7 9 11 13 15)
for cpu_no in "${disabled_cpus[@]}" ; do
  echo 0 | sudo tee /sys/devices/system/cpu/cpu"${cpu_no}"/online
done
But there is an easier way on Linux that doesn't require parsing /sys/devices/system/cpu/cpu*/topology/thread_siblings_list:
sudo tee /sys/devices/system/cpu/smt/control <<< off
Windows has been driving me crackers the last few days trying to benchmark hard-to-measure optimisations; I tend to end up doing long runs and looking for the minimum time. Usually closing Chrome, even minimised, gives an immediate ~10% performance boost. I'd love an option for a developer to lock off some cores entirely, so nothing runs on them unless it's on an approved list. At least then I could profile on those cores and get a reasonable result.
You can! I needed to run some realtime networking stuff on an isolated core and followed this [1].
I used Windows 11, and the two cores I isolated show no CPU usage in Task Manager until you run something that's pinned to those cores.
[1]: https://learn.microsoft.com/en-us/windows/iot/iot-enterprise...
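Once the cores are reserved, the benchmark process still has to be pinned onto them. One portable option is to set its CPU affinity at startup; here's a minimal sketch assuming psutil is installed, and the core numbers 6 and 7 are made up, use whichever cores you actually isolated:

import psutil

ISOLATED_CORES = [6, 7]  # hypothetical: whichever cores you reserved

me = psutil.Process()            # the current (benchmark) process
me.cpu_affinity(ISOLATED_CORES)  # only schedule this process on those cores
print("running on cores:", me.cpu_affinity())

The same call works on Linux too, so the pinning step of a benchmark harness can be shared across both setups.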
These workarounds might be good enough to detect ~1-5% changes from a baseline of a native, pre-compiled application. However this won't be sufficient for many dynamic, JIT-compiled languages which will usually still have large amounts of inter-run variance due to timing-sensitive compilation choices of the runtime. A statistically significant ~10% change can be hard to detect in these circumstances from a single run.
In my experience multi-run benchmarking frameworks which use non-parametric statistics should be the default tool of choice unless you know the particular benchmark is exceptionally well behaved.
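For what it's worth, here's a minimal sketch of what that looks like in practice, assuming SciPy is available; the run times below are made up. Collect a handful of runs of the baseline and the candidate build, then compare them with a rank-based test instead of eyeballing a single run:

import statistics
from scipy.stats import mannwhitneyu  # rank-based, no normality assumption

# hypothetical wall times (seconds) from repeated runs of each build
baseline  = [1.02, 0.99, 1.01, 1.05, 1.00, 1.03, 0.98, 1.04]
candidate = [0.97, 0.95, 0.99, 0.96, 1.10, 0.94, 0.98, 0.96]

stat, p = mannwhitneyu(baseline, candidate, alternative="two-sided")
print(f"median baseline={statistics.median(baseline):.3f}s "
      f"candidate={statistics.median(candidate):.3f}s p={p:.3f}")
# treat the change as real only if p is small and the median shift is
# larger than the run-to-run spread you normally observe

The point is that the decision comes from the distribution of runs, not from any single timing.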
> In my experience multi-run benchmarking frameworks which use non-parametric statistics should be the default tool of choice unless you know the particular benchmark is exceptionally well behaved.
Agreed. Do you have any suggestions? :-)
I like taking the trimmed mean of 10-20 runs, or if a run is quick, the (half-sample) mode of more runs. See robust_statistics.h.
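Roughly, in Python, those two estimators look like the sketch below; the half-sample mode is my reconstruction of the usual keep-the-densest-half algorithm, not necessarily exactly what robust_statistics.h does:

from statistics import mean

def trimmed_mean(samples, trim=0.25):
    # drop the lowest and highest `trim` fraction of runs, average the rest
    s = sorted(samples)
    k = int(len(s) * trim)
    return mean(s[k:len(s) - k])

def half_sample_mode(samples):
    # repeatedly keep the densest half of the sorted samples
    s = sorted(samples)
    while len(s) > 3:
        half = (len(s) + 1) // 2
        # window of `half` consecutive samples with the smallest spread
        i = min(range(len(s) - half + 1), key=lambda j: s[j + half - 1] - s[j])
        s = s[i:i + half]
    return mean(s)

runs = [10.3, 10.1, 10.2, 10.2, 10.4, 12.9, 10.1, 10.3]  # one noisy run
print(trimmed_mean(runs), half_sample_mode(runs))        # ~10.25 and ~10.1

Both largely ignore the one slow run, which is what you want when the noise is one-sided.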
> These workarounds might be good enough to detect ~1-5% changes from a baseline of a native, pre-compiled application.
and
> However this won't be sufficient for many dynamic, JIT-compiled languages which will usually still have large amounts of inter-run variance due to timing-sensitive compilation choices of the runtime.
are not mutually exclusive. Any sufficiently complex statically compiled application will suffer from the same variance issues.
> A statistically significant ~10% change can be hard to detect in these circumstances from a single run.
Multiple runs do not solve the problem. For example, if your 1st test-run reports 11%, the 2nd 8%, the 3rd 18%, the 4th 9% and the 5th 10%, how do you decide whether the 18% from the 3rd test-run is noise or signal?
> Multiple runs do not solve the problem. For example, if your 1st test-run reports 11%, the 2nd 8%, the 3rd 18%, the 4th 9% and the 5th 10%, how do you decide whether the 18% from the 3rd test-run is noise or signal?
In your 5-sample example, you can't determine if there are any outliers. You need more samples, each containing a multitude of observations. Then, using fairly standard nonparametric measures of dispersion and central tendency, a summary statistic should make sense, due to the CLT.
Outliers are only important if you have to throw away data; good measures of central tendency should be robust to them unless your data is largely noise.
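To make that concrete with the five numbers from the example above, a median-plus-MAD summary (one common nonparametric pair) would look something like this in Python:

import statistics

def median_and_mad(xs):
    # median for central tendency, median absolute deviation for dispersion
    med = statistics.median(xs)
    mad = statistics.median(abs(x - med) for x in xs)
    return med, mad

runs = [0.11, 0.08, 0.18, 0.09, 0.10]  # the 11%, 8%, 18%, 9%, 10% runs
med, mad = median_and_mad(runs)
print(f"median={med:.2f} MAD={mad:.2f}")  # median=0.10 MAD=0.01

The 18% run sits several MADs above the median, so with only five runs it gets flagged as atypical, but you genuinely need more runs to tell whether it is a second mode (signal) or a one-off (noise).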
> Any sufficiently complex statically compiled application will suffer from the same variance issues.
Sure, it's a rule of thumb.
> In your 5 sample example, you can't determine if there are any outliers. You need more samples
I think the same issue is present no matter how many samples we collect. The statistical apparatus of choice may indeed tell us that a given sample is an outlier in our experiment setup, but what I am wondering is: what if that sample was actual signal that we measured, and not noise?
Concrete example: in 10/100 test-runs you see a regression of 10%. The rest of the test-runs show 3%. You can 10x or 100x that example if you wish. Are those 10% regressions the outliers because "the environment was noisy" or did our code really run slower for whatever conditions/reasons in those experiments?
> Then using fairly standard nonparametric measures of dispersion and central tendency, a summary statistic should make sense, due to CLT.
In theory yes, and for sufficiently large N (samples). Sometimes you cannot afford to reach this "sufficiently large N" condition.