You're right. We've seen the "garbage in, garbage out" problem firsthand.
During testing we've seen the models hit typical statistical pitfalls like overfitting and data leakage. We've made progress by implementing strict validation protocols and guardrails around data handling. While we've fixed the agents getting stuck in recursive debugging loops, statistical validity remains an ongoing challenge. We're actively working on better detection of these issues, but ultimately we rely on users' domain expertise to evaluate model performance.
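To make the leakage point concrete, here's a simplified scikit-learn sketch (not our actual pipeline, just the classic failure mode we guard against): running feature selection against the labels of the whole dataset before cross-validation makes pure noise look predictive.

```python
# Toy sketch of target leakage, not our actual guardrail code:
# selecting features against the labels of the WHOLE dataset before
# cross-validation leaks test information and inflates the score.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X

# Leaky: pick the 20 features most correlated with y using ALL rows first.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Correct: do the selection inside each CV fold via a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
clean = cross_val_score(pipe, X, y, cv=5)

print("leaky CV accuracy:", leaky.mean(), "| clean CV accuracy:", clean.mean())
```

The leaky version scores well above chance even though the features are random noise, while the pipelined version stays around 0.5. That's the kind of thing that's easy for an agent (or a human) to get wrong without noticing.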
Yeah, I don't mean to imply that the model is bad. It's just that statistics is notoriously complicated (both in the mathematics itself and in intuitively understanding the impact of imperfect data and potential interactions), and most people really, really suck at it. Once you move from the maths to actually modelling data, you have to rely a lot on (niche) domain knowledge and experience.
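A toy example of the kind of intuition trap I mean (made-up numbers, just pandas): the pooled trend in a dataset can point one way while every subgroup points the other way, and nothing in the raw correlation warns you that a grouping variable is lurking.

```python
# Made-up numbers illustrating Simpson's paradox: the pooled correlation
# is strongly positive, but the relationship within each group is negative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
frames = []
for group, (x_center, y_center) in enumerate([(0, 0), (5, 10), (10, 20)]):
    x = x_center + rng.normal(size=200)
    y = y_center - (x - x_center) + rng.normal(size=200)  # negative slope within the group
    frames.append(pd.DataFrame({"group": group, "x": x, "y": y}))
df = pd.concat(frames, ignore_index=True)

print("pooled correlation:", round(df["x"].corr(df["y"]), 2))  # strongly positive
for g, sub in df.groupby("group"):
    print(f"group {g} correlation:", round(sub["x"].corr(sub["y"]), 2))  # all negative
```

Spotting that you even need to condition on the group takes domain knowledge, not maths.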
I'm a bit biased because I work in a space with actual statisticians, but I'd wager that it's almost impossible for current LLMs to distinguish between good and bad examples in their training data. After all, even very smart humans fail to do that.