> The model genuinely believes it’s giving a correct reasoning chain, but the interpretability microscope reveals it is constructing symbolic arguments backward from a conclusion.
Sounds very human. It's quite common that we make a decision based on intuition, and the reasons we give are just post-hoc justification (for ourselves and others).
> Sounds very human
well yes, of course it does, that article goes out of its way to anthropomorphize LLMs, while providing very little substance
Isn't the point of computers to have machines that improve on default human weaknesses, not just reproduce them at scale?
They've largely been complementary strengths, with less overlap. But human language is state-of-the-art, after hundreds of thousands of years of "development". It seems like reproducing SOTA (i.e. the current ongoing effort) is a good milestone for a computer algorithm as it gains language overlap with us.
Why would computers have just one “point”? They have been used for endless purposes and those uses will expand forever.
The other very human thing to do is invent disciplines of thought so that we don't constantly spew bullshit. For example, you could have a discipline about "pursuit of facts", which means that before you say something you mentally check yourself and make sure it's actually factually correct. This is how large portions of the populace avoid walking around spewing made-up facts and bullshit. In our rush to anthropomorphize ML systems we often forget that there are a lot of disciplines that humans are painstakingly taught from birth, and those disciplines often give rise to behaviors that the ML-based system is incapable of, like saying "I don't know the answer to that" or "I think that might be an unanswerable question."
In a way, the main problem with LLMs isn't that they are wrong sometimes. We humans are used to that. We encounter people who are professionally wrong all the time. Politicians, con-men, scammers, even people who are just honestly wrong. We have evaluation metrics for those things. Those metrics are flawed because there are humans on the other end intelligently gaming those too, but generally speaking we're all at least trying.
LLMs don't fit those signals properly. They always sound like an intelligent person who knows what they are talking about, even when spewing absolute garbage. Even very intelligent people, including those in the field of AI research, are routinely bamboozled by the sheer swaggering confidence these models convey in their own results.
My personal opinion is that any AI researcher who was shocked by the paper lynguist mentioned ought to be ashamed of themselves and their credulity. That was all obvious to me; I couldn't have told you the exact mechanism by which the arithmetic was being performed (though what it was doing was well in the realm of what I would have expected from a linguistic AI trying to do math), but the fact that its chain of reasoning bore no particular resemblance to how it drew its conclusions was always obvious. A neural net has no introspection on itself. It doesn't have any idea "why" it is doing what it is doing. It can't. There's no mechanism for that to even exist. We humans are not directly introspecting our own neural nets; we're building models of our own behavior and then consulting the models, and anyone with any practice doing that should be well aware of how those models can still completely fail to predict reality!
Does that mean the chain of reasoning is "false"? No. If it were, how would we account for it improving performance on certain tasks? It means that the reasoning is occurring at a higher and different level. It is quite like humans imputing reasons to their gut impulses. With training, combining gut impulses with careful reasoning is actually a very, very potent way to solve problems. The reasoning system needs training or it flies around like an unconstrained fire hose, uncontrollably spraying everything around, but brought under control it is the most powerful system we know. But the models should always have been read as providing a rationalization rather than an explanation of something they couldn't possibly have been explaining. I'm also not convinced the models have that "training" either, nor is it obvious to me how to give it to them.
(You can't just prompt it into a human; it's going to be more complicated than just telling a model to "be carefully rational". Intensive and careful RLHF is a bare minimum, but finding humans who can get it right will itself be a challenge, and it's possible that what we're looking for simply doesn't exist in the bias-set of the LLM technology, which is my base case at this point.)