pcwelder 5 days ago

Can someone explain to me why we should take Aider's polyglot benchmark seriously?

All the solutions are already available on the internet, and the various models have been trained on them, albeit in varying proportions.

Any variance between models could simply reflect differences in their training-data mix.

philipbjorge 5 days ago

If you're looking to test an LLM's ability to solve a coding task without prior knowledge of the task at hand, I don't think their benchmark is super useful.

If you care about understanding relative performance between models for solving known problems and producing correct output format, it's pretty useful.

- Even for well-known problems, we see a large distribution of quality between models (5% to 75% correctness)
- Additionally, we see a large distribution in models' ability to produce responses in the formats they were instructed to use
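The two measurements above can be sketched as a small scoring script. This is a hypothetical illustration, not Aider's actual harness: the model names and the `results` data are made up, and each run is assumed to record a pass/fail and whether the reply followed the instructed format.

```python
# Hypothetical per-run records: did the solution pass the tests,
# and did the reply use the instructed output format?
results = {
    "model_a": [{"correct": True,  "well_formatted": True},
                {"correct": False, "well_formatted": True},
                {"correct": True,  "well_formatted": False},
                {"correct": True,  "well_formatted": True}],
    "model_b": [{"correct": False, "well_formatted": True},
                {"correct": True,  "well_formatted": True}],
}

def rates(runs):
    """Return (correctness %, format-compliance %) for one model's runs."""
    n = len(runs)
    correct = 100 * sum(r["correct"] for r in runs) / n
    formatted = 100 * sum(r["well_formatted"] for r in runs) / n
    return correct, formatted

for model, runs in results.items():
    c, f = rates(runs)
    print(f"{model}: {c:.0f}% correct, {f:.0f}% format-compliant")
```

Tracking the two rates separately matters: a model can solve the problem yet fail the benchmark because its answer couldn't be applied in the requested edit format.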

At the end of the day, benchmarks are pretty fuzzy, but I always welcome a formalized benchmark as a way to understand model performance, as opposed to vibe checking.

meroes 5 days ago

To join in the faux rigor?