I felt the same way about a year ago, but I’ve since changed my mind based on personal experience and new research.
Please elaborate.
I work in the LLM search space and echo OC’s sentiment.
The more I work with LLMs the more the magic falls away and I see that they are just very good at guessing text.
It’s very apparent when I want to get them to do a very specific thing. They get inconsistent about it.
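For what it's worth, one rough way to put a number on that inconsistency is to send the same prompt several times and see how often the most common answer comes back. A minimal sketch, assuming a hypothetical `ask_model` callable standing in for whatever client you actually use. The fake model below is made up and only simulates a flaky labeler:

```python
import random
from collections import Counter
from typing import Callable


def consistency_rate(ask_model: Callable[[str], str], prompt: str, runs: int = 20) -> float:
    """Fraction of runs that return the single most common (modal) answer."""
    answers = [ask_model(prompt).strip().lower() for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs


def make_fake_model(seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for a real LLM call: picks one of a few labels at random."""
    rng = random.Random(seed)
    return lambda prompt: rng.choice(["invoice", "invoice", "invoice", "receipt"])


if __name__ == "__main__":
    rate = consistency_rate(make_fake_model(), "Label this document: ...")
    print(f"agreement with the modal answer: {rate:.0%}")
```

A rate of 1.0 means every run agreed. Exact string matching only makes sense for short, constrained outputs like labels, so free-form answers would need some other comparison.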
Pretty much the same here. I work on some fairly specific document retrieval and labeling problems. After some initial excitement I’ve landed on using LLMs to help train smaller, more focused models for specific tasks.
Translation is a task I’ve had good results with, particularly with Mistral models. Which makes sense, as it’s basically just “repeat this series of tokens with modifications”.
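To make the "use an LLM to help train a smaller model" idea concrete, here is a rough sketch of one way it could look, not a description of the setup above. The assumption is that the LLM is used once to label a pile of text, and a tiny classifier is then trained on those labels. `llm_label` is a hypothetical stand-in (a keyword rule, so the example runs offline), and the example documents are made up:

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def llm_label(text: str) -> str:
    """Hypothetical stand-in for an expensive LLM labeling call (keyword rule here)."""
    lowered = text.lower()
    return "billing" if "invoice" in lowered or "payment" in lowered else "support"


class TinyNaiveBayes:
    """Small multinomial naive Bayes text classifier with add-one smoothing."""

    def fit(self, texts: List[str], labels: List[str]) -> "TinyNaiveBayes":
        self.label_counts = Counter(labels)
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text: str) -> str:
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(doc_count / total_docs)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    unlabeled = [
        "invoice overdue please send payment",
        "payment failed on my invoice",
        "app crashes when I log in",
        "cannot reset my password",
    ]
    silver_labels = [llm_label(t) for t in unlabeled]  # one-off "LLM" pass
    small_model = TinyNaiveBayes().fit(unlabeled, silver_labels)
    print(small_model.predict("question about my invoice"))
```

The appeal is that the small model is cheap, fast, and frozen once trained, so it behaves the same on Tuesday as it did on Monday. It also inherits whatever labeling mistakes the LLM made.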
The closed models are practically useless from an empirical standpoint, as you have no idea if the model you use Monday is the same one you use Tuesday. “Open” models at least negate this issue.
Likewise, I’ve found LLM code to be of poor quality. I think that has to do with being a very experienced and skilled programmer. What LLMs produce is at best the top answer in Stack Overflow-level skill. The top answers on Stack Overflow are typically not optimal solutions; they are solutions upvoted by novices.
I find LLM code is not only bad, but when I point this out the LLM then “apologizes” and gives better code. My worry is inexperienced people can’t even spot that and won’t get this best answer.
In fact try this - ask an LLM to generate some code then reply with “isn’t there a simpler, more maintainable, and straightforward way to do this?”
There have even been times where an LLM will spit out _the exact same code_ and you have to give it the answer or a hint about how to do it better.
Yeah. I had the same experience doing code reviews at work. Sometimes people just get stuck on a problem and can't think of alternative approaches until you give them a good hint.
> I’ve found LLM code to be of poor quality
Yes. That was my experience with most human-produced code I ran into professionally, too.
> In fact try this - ask an LLM to generate some code then reply with “isn’t there a simpler, more maintainable, and straightforward way to do this?”
Yes, that sometimes works with humans as well. Although you usually need to provide more specific feedback to nudge them onto the right track. It gets tiring after a while, doesn't it?
What is the point of your argument?
I keep seeing people say “yeah well I’ve seen humans that can’t do that either.”
What’s the point you’re trying to make?
The point is that the person I responded to criticized LLMs for making the exact sort of mistakes that professional programmers make all the time:
> I’ve found LLM code to be of poor quality. I think that has to do with being a very experienced and skilled programmer. What LLMs produce is at best the top answer in Stack Overflow-level skill. The top answers on Stack Overflow are typically not optimal solutions
Most professional developers are unable to produce code up to the standard of "the top answer in Stack Overflow" that the commenter was complaining about. On top of that, most developers' breadth of knowledge is limited to a very narrow range of APIs/platforms/etc., whereas these LLMs are comparable to decent programmers in just about any API/language/platform, all at once.
I've written code for thirty years and I wish I had the breadth and depth of knowledge of the free version of ChatGPT, even if I can outsmart it in narrow domains. It is already very decent and I haven't even tried more advanced models like o1-preview.
Is it perfect? No. But it is arguably better than most programmers in at least some aspects. Not every programmer out there is Fabrice Bellard.
But LLMs aren’t people. And people do more than just generate code.
The comparison is weird and dehumanizing.
I, personally, have never worked with someone who consistently puts out code that is as bad as LLM generated code either.
> Most professional developers are unable to produce code up to the standard of "the top answer in Stack Overflow"
How could you possibly know that?
All these types of arguments come from a belief that your fellow human is effectively useless.
It’s sad and weird.
> > Most professional developers are unable to produce code up to the standard of "the top answer in Stack Overflow"
> How could you possibly know that?
I worked at four multinationals and saw a bunch of their code. Most of it wasn't "the top answer in Stack Overflow". Was some of the code written by some of the people better than that? Sure. But a lot of it wasn't, in my opinion.
> All these types of arguments come from a belief that your fellow human is effectively useless.
Not at all. The top answers on Stack Overflow were written by humans, after all.
> It’s sad and weird.
You are entitled to your own opinion, no doubt about it.
> In fact try this - ask an LLM to generate some code then reply with “isn’t there a simpler, more maintainable, and straightforward way to do this?”
These are called "code reviews" and we do that amongst human coders too, although they tend to be less Socratic in nature.
I think it has been clear from day one that LLMs don't display superhuman capabilities, and a human expert will always outdo one in tasks related to their particular field. But the breadth of their knowledge is unparalleled. They're the ultimate jacks-of-all-trades, and the astonishing thing is that they're even "average Joe" good at a vast number of tasks, never mind "fresh college graduate" good.
The real question has been: what happens when you scale them up? As of now it appears that they scale decidedly sublinearly, but that was not clear at all two or three years ago, and it was definitely worth a try.
I do contract work in the LLM space, which involves seeing a lot of human prompts, and it's made the magic of human reasoning fall away: humans are shockingly bad at reasoning, by and large.
One of the things I find extremely frustrating is that almost no research on LLM reasoning ability benchmarks them against average humans.
Large proportions of humans struggle to comprehend even a moderately complex sentence with any level of precision.
Aren’t prompts seeking to offload reasoning though? Is that really a fair data point for this?
When people are claiming they can't reason, then yes, benchmarking against the average human should be a bare minimum. Arguably they should benchmark against below-average humans too, because the bar where we'd be willing to argue that a human can't reason is very low.
If you're testing to see whether it can replace certain types of work, then it depends on where you would normally set the bar for that type of work. You could offload a whole lot of work with something that can reliably reason below the level of an average human.
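To show what "benchmark against humans" could mean in practice, here is a small sketch that places a model's score within a distribution of human scores on the same task, instead of comparing it to a perfect score. The function name and all the numbers are made up for illustration. They are not real benchmark results:

```python
from bisect import bisect_left
from typing import Sequence


def fraction_of_humans_beaten(model_score: float, human_scores: Sequence[float]) -> float:
    """Share of human participants who scored strictly below the model."""
    ranked = sorted(human_scores)
    # bisect_left gives the number of entries strictly less than model_score.
    return bisect_left(ranked, model_score) / len(ranked)


if __name__ == "__main__":
    # Hypothetical accuracy scores from five human test-takers.
    humans = [0.2, 0.4, 0.5, 0.6, 0.9]
    hypothetical_model_score = 0.55
    share = fraction_of_humans_beaten(hypothetical_model_score, humans)
    print(f"model outscored {share:.0%} of the human sample")
```

Where you set the pass mark then becomes an explicit choice (median human, bottom quartile, domain expert) rather than an implicit "gets everything right".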
Another one!
What’s the point of your argument?
AI companies: “There’s a new machine that can do reasoning!!!”
Some people: “actually they’re not very good at reasoning”
Some people like you: “well neither are humans so…”
> research on LLM reasoning ability benchmarks them against average humans
Tin foil hat says that it’s because it probably wouldn’t look great and most LLM research is currently funded by ML companies.
> Large proportions of humans struggle to comprehend even a moderately complex sentence with any level of precision.
So what? How does that assumption make LLMs better?
The point of my argument is that the vast majority of tasks we carry out do not require good reasoning, because if they did most humans would be incapable of handling them. The point is also that a whole lot of people claim LLMs can't reason, based on setting the bar at a point where a large portion of humanity wouldn't clear it. If you actually benchmarked against average humans, a whole lot of the arguments against reasoning in LLMs would instantly look extremely unreasonable, and borderline offensive.
> Tin foil hat says that it’s because it probably wouldn’t look great and most LLM research is currently funded by ML companies.
They're currently regularly being benchmarked against expectations most humans can't meet. It'd make the models look a whole lot better.