That's not a lot of samples for such a small effect; I don't think it's statistically significant (p-value of around 10%).
is there a shorthand/heuristic to calculate the p-value given n samples and effect size?
There are no great shorthands, but here are a few rules of thumb I use:
- for N=100, the worst-case standard error of the mean is ~5% (it only shrinks as p moves away from 50%, since the variance p(1-p) is a parabola peaking there)
- multiply by ~2 to go from standard error of the mean to 95% confidence interval
- the error scales with 1/sqrt(N), so a 10x larger sample shrinks it by a factor of ~3
So:
- N=100: +/- 10%
- N=1000: +/- 3%
- N=10000: +/- 1%
(And if comparing two independent distributions, multiply by sqrt(2). But if they’re measured on the same problems, then instead multiply by between 1 and sqrt(2) to account for them finding the same easy problems easy and hard problems hard - aka positive covariance.)
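For what it's worth, those +/- figures match what you get from exact binomial confidence intervals. A rough sketch in R, assuming the worst case of a 50% proportion:

  # Half-width of the exact binomial 95% CI at p_hat = 0.5
  for (n in c(100, 1000, 10000)) {
    ci <- binom.test(n / 2, n)$conf.int
    cat("N =", n, " +/-", round(100 * (ci[2] - ci[1]) / 2, 1), "%\n")
  }
  # prints roughly +/- 10%, +/- 3%, and +/- 1%, matching the rules of thumb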
p-value of 7.9% — so very close to statistical significance.
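(For reference, one way to land on a number in that ballpark - a guess at the calculation, not necessarily how it was actually done - is a one-sided normal-approximation test on a 110/200 split without continuity correction:

  # One-sided z-test on 110 wins out of 200, no continuity correction
  p_hat <- 110 / 200
  z <- (p_hat - 0.5) / sqrt(0.5 * 0.5 / 200)   # z is about 1.41
  pnorm(z, lower.tail = FALSE)                 # about 0.079, i.e. 7.9%

The exact binomial test mentioned downthread comes out a bit higher, at ~8.9%.)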
the p-value for the null hypothesis that GPT-4.1's win rate is below 49% comes out to 4.92%, so we can say conclusively that GPT-4.1 is at least (essentially) evenly matched with Claude Sonnet 3.7, if not better.
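A sketch of how a figure like that could be computed in R, assuming the blog post's 54.9% corresponds to 110 wins out of 200 (the exact test behind the 4.92% isn't given):

  # One-sided binomial test of 110/200 against a null win rate of 49%
  binom.test(110, 200, p = 0.49, alternative = "greater")
  # one-sided p-value comes out around 5%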
Given that Claude Sonnet 3.7 has been generally considered to be the best (non-reasoning) model for coding, and given that GPT-4.1 is substantially cheaper ($2/million input tokens, $8/million output vs. $3/million input, $15/million output), I think it's safe to say that this is significant news, although not a game changer.
I make it 8.9% with a binomial test[0]. I rounded that to 10%, because any more precision than that was not justified.
Specifically, the results from the blog post are impossible: with 200 samples, you can't possibly have the claimed 54.9/45.1 split of binary outcomes (54.9% of 200 would be 109.8 wins, which isn't a whole number). Either they didn't actually run 200 tests but some other number, they didn't actually get the results they reported, or they did some kind of undocumented data munging like excluding all tied results. In any case, the uncertainty about the input data is larger than the uncertainty from the rounding.
[0] In R, binom.test(110, 200, 0.5, alternative="greater")