krona 3 hours ago

> Multiple runs do not solve the problem. For example, if you have your 1st test-run reporting 11%, 2nd test-run 8%, 3rd test-run 18%, 4th test-run 9% and 5th test-run 10% how do you get to decide if 18% from the 3rd test-run is noise or signal?

In your 5-sample example, you can't determine if there are any outliers. You need more samples, each containing a multitude of observations. Then, using fairly standard nonparametric measures of dispersion and central tendency, you can form a summary statistic that makes sense, thanks to the CLT.

Outliers are only important if you have to throw away data; good measures of central tendency should be robust to them unless your data is largely noise.
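To make that concrete, here's a minimal Python sketch (using the five hypothetical readings from the quoted example) of robust nonparametric summaries: the median and the median absolute deviation (MAD) barely move when the 18% reading is included, while the mean gets pulled toward it.

```python
import statistics

# Hypothetical regression readings (%) from the five test-runs quoted above
runs = [11, 8, 18, 9, 10]

mean = statistics.mean(runs)      # 11.2 -- pulled toward the 18% reading
median = statistics.median(runs)  # 10 -- robust to the single large value

# Median absolute deviation: a robust measure of dispersion
mad = statistics.median(abs(x - median) for x in runs)  # 1

print(f"mean={mean}% median={median}% MAD={mad}%")
```

With median=10% and MAD=1%, the 18% run is far out in robust terms, yet you don't have to throw it away for the summary to stay sensible.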

> Any sufficiently complex statically compiled application will suffer from the same variance issues.

Sure, it's a rule of thumb.

menaerus 1 hour ago

> In your 5 sample example, you can't determine if there are any outliers. You need more samples

I think the same issue is present no matter how many samples we collect. The statistical apparatus of choice may indeed tell us that a given sample is an outlier in our experimental setup, but what I am wondering is: what if that sample was actual signal that we measured, and not noise?

Concrete example: in 10 out of 100 test-runs you see a regression of 10%, and the rest of the test-runs show 3%. You can 10x or 100x that example if you wish. Are those 10% regressions outliers because "the environment was noisy", or did our code really run slower, under whatever conditions and for whatever reasons, in those experiments?
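One hedged way to frame this: a robust outlier rule will flag those runs, but *how* they are flagged matters. Ten flagged points scattered at random look like noise; ten flagged points clustered at the same value look like a second mode, i.e. a real slow path. A minimal Python sketch with hypothetical data matching the example:

```python
import statistics

# Hypothetical data matching the example: 90 runs near 3%, 10 runs at 10%
runs = [3.0] * 90 + [10.0] * 10

med = statistics.median(runs)                        # 3.0
mad = statistics.median(abs(x - med) for x in runs)  # 0.0 here; guard below
scale = mad if mad > 0 else statistics.stdev(runs)

# Rule of thumb: flag points more than ~3 scale units from the median
flagged = [x for x in runs if abs(x - med) > 3 * scale]

# All 10 flagged runs sit at the same value: that clustering is evidence
# of a second mode (signal), not of independent environmental noise.
print(len(flagged), statistics.median(flagged))
```

This doesn't decide noise vs. signal by itself, but it turns the question into something inspectable: is the flagged mass a cluster, and does it reproduce across experiments?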

> Then using fairly standard nonparametric measures of dispersion and central tendency, a summary statistic should make sense, due to CLT.

In theory, yes, for sufficiently large N (samples). Sometimes you cannot afford to reach that "sufficiently large N" condition.
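The cost of "sufficiently large N" can be sketched directly: the standard error of the mean shrinks like 1/sqrt(N), so halving your confidence-interval width costs roughly 4x the runs. A minimal Python illustration with a hypothetical noisy benchmark (`run_benchmark` is a stand-in, not a real harness):

```python
import math
import random
import statistics

random.seed(0)  # deterministic for the sketch

# Hypothetical benchmark: true regression 5%, Gaussian noise with sd 3%
def run_benchmark():
    return random.gauss(5.0, 3.0)

ci_half_width = {}
for n in (5, 50, 500):
    samples = [run_benchmark() for _ in range(n)]
    m = statistics.mean(samples)
    # Standard error shrinks like 1/sqrt(n), so each halving of the CI
    # width costs ~4x the runs -- the price of "sufficiently large N"
    se = statistics.stdev(samples) / math.sqrt(n)
    ci_half_width[n] = 1.96 * se
    print(f"n={n:3d}  mean={m:5.2f}%  95% CI half-width ±{ci_half_width[n]:.2f}%")
```

At n=5 the interval is wide enough that a 3% vs. 10% question may be genuinely undecidable, which is exactly the affordability problem.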