That sounds incredibly disappointing given how high their benchmark scores are. It suggests they might be overtuned for those benchmarks, similar to Llama4.
Yeah, I think so too. They seemed to be better at specific tasks but worse overall on broader ones.