But what kind of magic sauce is expertise really based on, in your opinion? Something that hasn't been written down in the thousands of books on any technical subject?
In my opinion it is ridiculous to keep claiming that there is anything fundamentally different between human intelligence and what LLMs would be capable of at another 10x or 100x of scale.
Valid question, and yes, I don't think there's any difference in performance.
However, I'm not talking about technical tasks with objectively measurable criteria of success (which is a very narrow subset; not even coding is entirely like this). I'm saying that you have to transfer some kind of human preference to the model, because unsupervised learning will never infer an accurate reference point for what you subjectively want from the pretraining data on its own, no matter the scale. Even if I'm somehow wrong about that, we're currently at 1x scale, and model finetuning right now is a pretty hands-on process. It's clear that the ML people who usually curate this process have only a vague idea of what looks/reads/feels good, which is why they produce slop. (A rough sketch of what I mean by preference transfer is at the end of this comment.)
TFA is talking about that:
>AI doesn’t understand why something matters, can’t independently prioritize what’s most important, and doesn’t bring the accountability or personal investment that gives work its depth and resonance.
Of course it doesn't, because it's not trained to understand it. Claude was finetuned for "human likeness" up to version 3, and Opus had a really deep understanding of why something matters; it had better agency than any current model and a great reference point for your priorities. That's what happens when you hand the curation to someone outside the core ML crowd who knows what she's doing (AFAIK she has since left Anthropic, and Anthropic seemingly dropped that "character training" policy).
Check 4o's image generation as well - it has a terrible yellow tint by default, thick constant-width linework in "hand-drawn" pictures, etc. You can somewhat steer it with a prompt and references, but it's pretty clear that the people who have been finetuning it had no good sense of whether their result was any good, so they made something instantly recognizable as slop. This is not just a botched training run or a dataset preparation bug; it's a recurring pattern for OpenAI, they simply do not care about this. The recurring pattern for Midjourney, for example, is to finetune their models on kitsch.
All of this could be fixed in no time, making these models far more usable as products right now, not someday when they maybe reach 100x scale (which is neither likely to happen nor likely to change anything).
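To make the "preference transfer" point concrete, here is a minimal, purely illustrative sketch of the kind of pairwise preference objective (a Bradley-Terry style reward-model loss, as used in RLHF-type pipelines) by which a human curator's subjective reference point gets into a model. The names and numbers are made up; this is not anyone's actual training code.

```python
import torch
import torch.nn.functional as F

def preference_loss(score_chosen: torch.Tensor,
                    score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss for training a reward model from human
    preference pairs: push the score of the human-preferred completion
    above the score of the rejected one."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Illustrative scores a reward model might assign to a batch of
# (chosen, rejected) completion pairs labeled by a human curator.
score_chosen = torch.tensor([1.3, 0.2, 0.9])
score_rejected = torch.tensor([0.4, 0.5, -0.1])
print(preference_loss(score_chosen, score_rejected))
```

The point being: whatever "taste" ends up in the model is bounded by the quality of those human-labeled pairs, which is exactly the curation problem above.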
Thanks for your reply. Well reasoned.
I am with you that the current dichotomy of training vs. inference seems unsustainable in the long run. We need ways for LLMs to learn from the interactions they are having; we might need introspection and self-modification.
I am not sure we need more diversity, though part of your argument sounds to me like we do. Slop (to me) is primarily the result of over-generalizing to everyone's taste. We get generic replies and generic images rather than consistently unique outcomes that we could call a personality.
>AI doesn’t understand why something matters.
I beg to differ. LLMs have seen all the reasons why something could matter. This is how they do everything. It is also how the brain works: you excite neurons with two concepts at roughly the same time and they become linked. That's the basis for causality/correlation/memory...
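For the "fire together, wire together" point, here is a minimal Hebbian update sketch in NumPy. It only illustrates the analogy being made; it is not a claim about how transformers are actually trained.

```python
import numpy as np

def hebbian_update(W: np.ndarray, activity: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """Hebb's rule: the connection between two units grows when they are
    active at the same time ("fire together, wire together")."""
    return W + lr * np.outer(activity, activity)

# Two concepts, represented by units 0 and 3, co-activate and strengthen W[0, 3].
W = np.zeros((4, 4))
activity = np.array([1.0, 0.0, 0.0, 1.0])
W = hebbian_update(W, activity)
print(W[0, 3])  # 0.01
```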
I also agree with you that too much reliance on RLHF has not been the best idea. We are overfitting to what people want rather than what people would want if they knew better. LLMs are too eager to please and haven't yet learned how much teenage rebellion is needed for progress.
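On "too eager to please": the usual guard in RLHF against over-optimizing the learned reward is a KL penalty that keeps the tuned policy close to a reference model. A rough sketch of that shaped per-token reward, with illustrative names and values:

```python
import torch

def kl_penalized_reward(reward: torch.Tensor,
                        logprob_policy: torch.Tensor,
                        logprob_reference: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """RLHF-style shaped reward: the learned preference reward minus a KL
    penalty for drifting away from the reference model. Too small a beta
    lets the policy over-optimize the reward model (sycophancy); too large
    a beta means it never deviates enough to improve."""
    return reward - beta * (logprob_policy - logprob_reference)

# Illustrative per-token values.
reward = torch.tensor([0.8, 0.8, 0.8])
logprob_policy = torch.tensor([-1.0, -0.5, -2.0])
logprob_reference = torch.tensor([-1.2, -1.5, -1.9])
print(kl_penalized_reward(reward, logprob_policy, logprob_reference))
```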