jjani 5 days ago

That sounds incredibly disappointing given how high their benchmark scores are. It suggests they might be overtuned for those benchmarks, similar to Llama4.

XCSme 4 days ago

Yeah, I think so too. They seemed to be better at specific tasks but worse overall on broader ones.