refulgentis 6 days ago

I've been waiting since November for 1, just 1*, model other than Claude that can reliably do agentic tool-call loops. As long as the Chinese open models are chasing reasoning and benchmark-maxxing vs. mid-2024 US private models, I'm very comfortable with somewhat ignoring these models.

(This isn't idle prognostication hinging on my personal hobby horse. I've got skin in the game: I'm virtually certain I have the only AI client that is able to reliably do tool calls with open models in an agentic setting. llama.cpp got a massive contribution to make this happen, and the big boys who bother, like ollama, are still using a dated json-schema-forcing method that doesn't comport with recent local model releases that can do tool calls. IMHO we're comfortably past the point where products using these models can afford to focus on conversational chatbots; that's cute, but it's a commodity to give away, per standard 2010s SV thinking.)

* OpenAI's can, but they're a little less...grounded?...situated? i.e. they can't handle "read this file and edit it to do $X". Same-ish for Gemini, though sometimes I feel like the only person in the world who actually waits for the experimental models to go GA; per the letter of the law, I shouldn't deploy them until then.
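To make the json-schema-forcing point concrete, here's a hypothetical sketch (not llama.cpp's or anyone's actual code, and the `<tool_call>` markup is an assumed Qwen-style format): recent local models emit tool calls in their own chat-template markup, so a client has to parse that native output rather than grammar-force the model into bare JSON.

```python
import json
import re

# Assumed format: the model wraps each call in <tool_call>...</tool_call>,
# as some recent open-model chat templates do. The client parses what the
# model chose to emit, instead of constraining generation to a JSON schema.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_native_tool_calls(model_output: str) -> list[dict]:
    """Extract tool calls the model emitted in its native markup."""
    calls = []
    for blob in TOOL_CALL_RE.findall(model_output):
        try:
            calls.append(json.loads(blob))
        except json.JSONDecodeError:
            pass  # malformed call: in practice, surface as an error turn
    return calls

output = ('Let me check that file.\n<tool_call>'
          '{"name": "read_file", "arguments": {"path": "notes.txt"}}'
          '</tool_call>')
print(parse_native_tool_calls(output))
# [{'name': 'read_file', 'arguments': {'path': 'notes.txt'}}]
```

The upside is the model stays on the distribution it was trained on; the downside is each model family needs its own parser, which is presumably why clients reach for schema forcing.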

anon373839 5 days ago

A bit of a tangent, but what're your thoughts on code agents compared to the standard "blobs of JSON" approach? I haven't tried it myself, but it does seem like it would be a better fit for existing LLMs' capabilities.

cess11 5 days ago

You mean like https://manusai.ai/ is supposed to function?

refulgentis 5 days ago

Yes, exactly; and no, trivially so: Manus is Sonnet with tools.

cess11 4 days ago

Right. Apparently they also claim it's more than that:

https://xcancel.com/peakji/status/1898997311646437487

refulgentis 3 days ago

No, they don't; that's just a bunch of other stuff (e.g. something along the lines of "we don't differ from academic papers on agents" (???)).

throwawaymaths 6 days ago

is there some reason you can't train a 1B model to just do agentic stuff?

anon373839 6 days ago

The Berkeley Function Calling Leaderboard [1] might be of interest to you. As of now, it looks like Hammer2.1-3b is the strongest model under 7 billion parameters. Its overall score is ~82% of GPT-4o's. There is also Hammer2.1-1.5b at 1.5 billion parameters that is ~76% of GPT-4o.

[1] https://gorilla.cs.berkeley.edu/leaderboard.html

refulgentis 6 days ago

Worth noting:

- Those'll be single-turn scores: at multi-turn, 4o is 3x as good as the 3B

- BFCL is generally "turn natural language into an API call"; multi-turn then just involves making another API call

- I hope to inspire work towards an open model that can eat the paid models sooner rather than later

- It'd be trained quite specifically on an agent loop with tools read_files and edit_file (you'll also probably want at least read_directory and get_shared_directories; search_filenames and search_files_text are good too), bonus points for cli_command

- IMHO, this is much lower-hanging fruit than e.g. training an open computer-vision model, so I beseech thee, intrepid ML-understander, to fill this gap and hear your name resound throughout the age

refulgentis 6 days ago

They're really squished for space, more than I expected :/ Good illustration here: Qwen2.5-1.5B trained to reason, i.e. the name it was released under is "DeepSeek R1 1.5B". https://imgur.com/a/F3w5ymp The 1st prompt was "What is 1048576^0.05"; it answered, then I said "Hi", then...well...

Fwiw, Claude Sonnet 3.5 100% had some sort of agentic loop x precise file editing trained into it. It wasn't obvious to me until I added an MCP file server to my client, and it still isn't well understood outside a few circles.

I'm not sure on-device models will be able to handle it any time soon because it relies on just letting it read the whole effing file.

Separately...

I say I don't understand why no other model is close, but it makes sense. OpenAI has been focused on reasoning; Mistral, I assume, is GPU-starved; and Google...well, I used to work there, so I have to stop myself from going on and on. Let's just say I assume there wouldn't be enough Consensus Built™ to do something "scary" and "experimental" like train that stuff in.

This also isn't going so hot for Sonnet IMHO.

There's vague displeasure and assumptions it "changed" in the last week, but AFAICT the real problem is that the reasoning stuff isn't as "trained in" as, say, OpenAI's.

This'd be a good thing, except you see all kinds of wacky behavior.

One of my simple "read file and edit" queries yesterday did about 60 pages' worth of thinking, and the thinking contained 130+ separate tool calls that were never actually made, so it was just wandering around in the wilderness, reacting to hallucinated responses it never actually got.

Which plays into another one of my hobbyhorses: chat is a "hack" on top of an LLM. Great. So is reasoning, especially in the way Anthropic implemented it. At what point are the abstractions too much, so much that it's unreliable? 3.7 Sonnet may be answering that, because when it fails, all that thinking looks like the agentic loop cooked into Sonnet 3.5. So maybe it's altogether too much to have chat, reasoning, and fully reliable agentic loops...

AlexCoventry 6 days ago

I asked o1-pro what 99490126816810951552*23977364624054235203 is, yesterday. It took 16 minutes to get an answer which is off by eight orders of magnitude.

https://chatgpt.com/share/67e1eba1-c658-800e-9161-a0b8b7b683...

CamperBob2 5 days ago

What in the world is that supposed to prove? Let's see you do that in your head.

Tell it to use code if you want an exact answer. It should do that automatically, of course, and obviously it eventually will, but jeez, that's not a bad Fermi estimate for something that wasn't designed to attempt such problems.
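For reference, this is the exact computation a code tool would hand back (a plain Python sketch, since Python integers are arbitrary-precision; the model's estimate only had to get the magnitude right):

```python
# The two factors from the o1-pro prompt upthread.
a = 99490126816810951552
b = 23977364624054235203

product = a * b
print(product)             # the exact 40-digit integer
print(f"{product:.4e}")    # ~2.3855e+39, matching Google's 2.385511e+39
```

Whether the model writes this itself or a tool call runs it, the point is the same: exact integer arithmetic is a solved problem everywhere except inside the forward pass.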

refulgentis 6 days ago

Sorry, I'm in a rush, could only afford a couple minutes looking at it, but I'm missing something:

Google: 2.385511e+39

Your chat: "Numerically, that's about 2.3855 × 10^39"

Also curious how you think about LLM-as-calculator in relation to tool calls.

AlexCoventry 6 days ago

If you look at the precise answer, it's got 8 too many digits, despite it getting the right number of digits in the estimate you looked at.

> Also curious how you think about LLM-as-calculator in relation to tool calls.

I just tried this because I heard all existing models are bad at this kind of problem, and wanted to try it with the most powerful one I have access to. I think it shows that you really want an AI to be able to use computational tools in appropriate circumstances.