refulgentis 6 days ago

They're really squished for space, more than I expected :/ Good illustration here: Qwen2.5-1.5B trained to reason, i.e. the name it's released under is "DeepSeek R1 1.5B". https://imgur.com/a/F3w5ymp The first prompt was "What is 1048576^0.05" (1048576 = 2^20, so the answer is 2); it answered, then I said "Hi", then...well...

Fwiw, Claude Sonnet 3.5 100% had some sort of agentic loop x precise file editing trained into it. It wasn't obvious to me until I added an MCP file server to my client, and it still isn't well understood outside a small circle.
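
For anyone who hasn't watched it on the wire: MCP is JSON-RPC 2.0, and one hop of that loop is just a tools/call request plus a tool result. A minimal sketch in Python ("read_file" matches the reference filesystem server, other servers may name their tools differently; the path is made up):

    import json

    # One hop of the agentic loop, as an MCP tools/call request (JSON-RPC 2.0).
    # "read_file" is from the reference filesystem server; path is hypothetical.
    request = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "read_file",
            "arguments": {"path": "/project/src/main.py"},
        },
    }
    print(json.dumps(request, indent=2))
    # The client ships this to the server, feeds the file contents back to the
    # model as a tool result, and the model picks the next call. That's the loop.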

I'm not sure on-device models will be able to handle it any time soon, because the approach relies on just letting the model read the whole effing file.

Separately...

I say I don't understand why no other model is close, but it makes sense. OpenAI has been focused on reasoning; Mistral, I assume, is GPU-starved; and Google...well, I used to work there, so I have to stop myself from going on and on. Let's just say I assume there wouldn't be enough Consensus Built™ to do something "scary" and "experimental" like train that stuff in.

This also isn't going so hot for Sonnet IMHO.

There's vague displeasure and assumptions that it "changed" over the last week, but AFAICT the real problem is that the reasoning stuff isn't as "trained in" as, say, OpenAI's.

This'd be a good thing, except you see all kinds of wacky behavior.

One of my simple "read file and edit" queries yesterday did about 60 pages' worth of thinking, and the thinking contained 130+ separate tool calls that were never actually executed, so it was just wandering around in the wilderness, reacting to hallucinated responses it never actually got.

Which plays into another one of my hobbyhorses: chat is a "hack" on top of an LLM. Great. So is reasoning, especially in the way Anthropic implemented it. At what point are the abstractions too much, so much that the whole thing becomes unreliable? 3.7 Sonnet may be answering that, because when it fails, all that thinking looks like the agentic loop cooked into Sonnet 3.5. So maybe it's altogether too much to have chat, reasoning, and fully reliable agentic loops...
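
To make the stack-of-hacks point concrete: both chat and reasoning are string-formatting conventions layered over plain next-token prediction. A toy sketch (the special tokens are ChatML-style placeholders, not any vendor's actual template):

    # Both "chat" and "reasoning" reduce to formatting one token stream.
    # Tokens below are ChatML-style placeholders, not a real vendor template.
    def render(messages, reasoning=False):
        out = []
        for m in messages:
            out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
        out.append("<|im_start|>assistant\n")
        if reasoning:
            out.append("<think>")  # yet another convention stacked on top
        return "\n".join(out)

    print(render([{"role": "user", "content": "Hi"}], reasoning=True))

Every layer is one more convention the model has to honor simultaneously, which is presumably where the wacky behavior creeps in.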

AlexCoventry 6 days ago

I asked o1-pro what 99490126816810951552*23977364624054235203 is, yesterday. It took 16 minutes to get an answer which is off by eight orders of magnitude.

https://chatgpt.com/share/67e1eba1-c658-800e-9161-a0b8b7b683...

CamperBob2 5 days ago

What in the world is that supposed to prove? Let's see you do that in your head.

Tell it to use code if you want an exact answer. It should do that automatically, of course, and obviously it eventually will, but jeez, that's not a bad Fermi guess for something that wasn't designed to attempt such problems.
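
For what it's worth, the exact answer really is a one-liner; Python ints are arbitrary precision, so there's nothing to estimate:

    a = 99490126816810951552
    b = 23977364624054235203
    p = a * b
    print(p)            # exact product
    print(len(str(p)))  # 40 digits, consistent with ~2.3855e39
    print(f"{p:.4e}")   # sanity check against the Fermi estimate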

refulgentis 6 days ago

Sorry, I'm in a rush, could only afford a couple minutes looking at it, but I'm missing something:

Google: 2.385511e+39
Your chat: "Numerically, that’s about 2.3855 × 10^39"

Also curious how you think about LLM-as-calculator in relation to tool calls.

AlexCoventry 6 days ago

If you look at the precise answer, it has eight too many digits, despite getting the right number of digits in the estimate you looked at.

> Also curious how you think about LLM-as-calculator in relation to tool calls.

I just tried this because I heard all existing models are bad at this kind of problem, and wanted to try it with the most powerful one I have access to. I think it shows that you really want an AI to be able to use computational tools in appropriate circumstances.
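
Wiring that up isn't much code, either. A rough sketch with the OpenAI Python SDK's tool calling (the "multiply" tool and its schema are made up here for illustration, and the model name is a placeholder):

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical exact-arithmetic tool; string args avoid JSON number precision loss.
    tools = [{
        "type": "function",
        "function": {
            "name": "multiply",
            "description": "Multiply two arbitrary-precision integers exactly.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {"type": "string"},
                    "b": {"type": "string"},
                },
                "required": ["a", "b"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user",
                   "content": "What is 99490126816810951552*23977364624054235203?"}],
        tools=tools,
    )
    # If the model opts to call the tool, execute int(a) * int(b) locally
    # and return the result to it in a "tool" role message.
    print(resp.choices[0].message.tool_calls)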