vidarh 6 days ago

I do contract work in the LLM space, which involves seeing a lot of human prompts, and it's made the magic of human reasoning fall away: humans are shockingly bad at reasoning in the large.

One of the things I find extremely frustrating is that almost no research on LLM reasoning ability benchmarks them against average humans.

A large proportion of humans struggle to comprehend even a moderately complex sentence with any level of precision.

meroes 6 days ago

Aren’t prompts seeking to offload reasoning though? Is that really a fair data point for this?

vidarh 5 days ago

When people claim LLMs can't reason, then yes, benchmarking against the average human should be a bare minimum. Arguably they should benchmark against below-average humans too, because the bar where we'd be willing to argue that a human can't reason is very low.

If you're testing whether it can replace certain types of work, then it depends on where you'd normally set the bar for that type of work. You could offload a whole lot of work to something that can reliably reason below the level of an average human.

dartos 6 days ago

Another one!

What’s the point of your argument?

AI companies: “There’s a new machine that can do reasoning!!!”

Some people: “actually they’re not very good at reasoning”

Some people like you: “well neither are humans so…”

> research on LLM reasoning ability benchmarks them against average humans

Tin foil hat says that it’s because it probably wouldn’t look great and most LLM research is currently funded by ML companies.

> A large proportion of humans struggle to comprehend even a moderately complex sentence with any level of precision.

So what? How does that assumption make LLMs better?

vidarh 5 days ago

The point of my argument is that the vast majority of tasks we carry out don't require good reasoning; if they did, most humans would be incapable of handling them. The point is also that a whole lot of people claim LLMs can't reason while setting the bar at a height a large portion of humanity wouldn't clear. If you actually benchmarked against average humans, many of those arguments would instantly look extremely unreasonable, and borderline offensive.

> Tin foil hat says that it’s because it probably wouldn’t look great and most LLM research is currently funded by ML companies.

Right now they're regularly benchmarked against expectations most humans can't meet. Benchmarking against average humans would make the models look a whole lot better.