rfoo 1 day ago

I think the most interesting result [0] is that, unlike our current benchmarks, where scaling laws are showing diminishing returns, their setup managed to tell large language models (Llama 405B, GPT-4.5) apart from not-so-large LMs.

This could be really interesting, provided it isn't due to some trivial f-up (e.g. a difference in inference speed).

[0] Assuming the paper isn't flawed; I haven't read it thoroughly yet.

nonfamous 1 day ago

According to the paper, the human and AI responses were both delayed by the same amount (depending on message length) to mask the effect of inference speed on the interrogator.

sterlind 1 day ago

It's not so surprising to me. It's like how Markov chains get better at passing for human the more N-grams they memorize. Larger models will keep getting marginally better at predicting the distribution (human language), but that doesn't translate into improved intelligence.
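
For what it's worth, the N-gram intuition is easy to sketch: a model that just maps each context of N-1 words to the words that followed it in a corpus, then samples. This is a toy illustration of the general idea (names and corpus are made up, not from the paper); a bigger N or corpus means more memorized contexts and more human-looking output.

```python
import random
from collections import defaultdict

def build_ngram_model(text, n=2):
    """Map each (n-1)-word context to the list of words that follow it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        model[context].append(words[i + n - 1])
    return model

def generate(model, seed, length=10, rng=None):
    """Walk the chain: repeatedly sample a continuation of the current context."""
    rng = rng or random.Random(0)
    out = list(seed)
    while len(out) < len(seed) + length:
        context = tuple(out[-len(seed):])
        choices = model.get(context)
        if not choices:  # dead end: context never seen in the corpus
            break
        out.append(rng.choice(choices))
    return " ".join(out)
```

Memorizing more (longer contexts, more text) makes the samples locally fluent without the model understanding anything, which is the analogy being drawn.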

rfoo 1 day ago

The point is, it isn't marginally better. I agree the setup is not a demonstration of intelligence, but the difference is pretty significant. Not to mention that on conventional benchmarks Llama 405B is usually worse than GPT-4o.