Item 43690517

jjani • 5 days ago

That sounds incredibly disappointing given how high their benchmark scores are. It suggests they might be overtuned for those benchmarks, similar to Llama 4.

XCSme • 4 days ago

Yeah, I think so too. They seemed to be better at specific tasks but worse overall at broader tasks.