stavros 7 days ago

Have you guys had luck with tool calls? I made a simple assistant with access to my calendar, and most models fail to call the tool to add calendar events. GPT-4.1 also regularly tries to gaslight me into believing that it added the event when it didn't call the tool!

Overall, I found tool use extremely hit-and-miss, to the point where I'm sure I'm doing something wrong (I'm using the OpenAI Agents SDK, FWIW).
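For reference, here's roughly the shape of what I have, as a simplified sketch (the tool body and names are made up for illustration, but the Agents SDK calls are the documented ones):

    from agents import Agent, Runner, function_tool

    @function_tool
    def add_calendar_event(title: str, start: str) -> str:
        """Add an event to the user's calendar. start is an ISO 8601 timestamp."""
        # (stub -- the real version talks to my calendar API)
        return f"Created {title!r} at {start}"

    agent = Agent(
        name="Calendar assistant",
        instructions=(
            "You manage the user's calendar. Always call add_calendar_event "
            "to create events; never claim an event was added unless the "
            "tool was actually called."
        ),
        tools=[add_calendar_event],
    )

    result = Runner.run_sync(agent, "Add lunch with Maria tomorrow at noon")
    print(result.final_output)

Even with that "never claim" line in the instructions, the model will sometimes reply "Done, it's on your calendar!" with no tool call anywhere in the trace.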

simonw 7 days ago

I get the impression that the key to getting great performance out of tool calls is having a really detailed system prompt, with a bunch of examples.

Anthropic's system prompt just for their "web_search" tool is over 6,000 tokens long! https://simonwillison.net/2025/May/25/claude-4-system-prompt...
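To make that concrete, here's the kind of thing I mean, compressed into a toy example using the plain OpenAI chat completions API (the calendar tool and the wording of its description are invented, not Anthropic's):

    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "add_calendar_event",
            "description": (
                "Create a calendar event. ALWAYS call this when the user asks "
                "to schedule, book, or add anything to their calendar. Never "
                "tell the user an event was created unless this tool has been "
                "called and returned successfully. Example: 'dentist Friday "
                "at 3pm' -> add_calendar_event(title='Dentist', "
                "start='2025-05-30T15:00:00')."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "start": {"type": "string", "description": "ISO 8601 start time"},
                },
                "required": ["title", "start"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": "Add lunch with Maria at noon"}],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)

Anthropic's real prompt does this at vastly greater length, with worked examples of when to use the tool and when not to.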

xrd 7 days ago

Is no one else bothered by that way of using tools? Tools feel like a way to get deterministic behavior out of a very hallucinatory process, yet unless you write a very lengthy, comprehensive, and still non-deterministic English prompt, you can't use tools effectively. As we all know, more code means more bugs. These long, often hidden prompts seem like the wrong way to go.

And, this is why I'm very excited about this addition to the llm tool, because it feels like it moves the tool closer to the user and reduces the likelihood of the problem I'm describing.
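For instance (sketching from the release announcement, so treat the exact flag as approximate): you hand llm a plain Python function, and the prompt you write stays small and visible:

    # save as calendar_tools.py, then run something like:
    #   llm --functions calendar_tools.py "add lunch with Maria at noon tomorrow"
    def add_calendar_event(title: str, start: str) -> str:
        """Add an event to the user's calendar. start is an ISO 8601 timestamp."""
        # (stub -- a real version would call a calendar API here)
        return f"created {title!r} at {start}"

No 6,000-token hidden preamble; the function signature and docstring are the main thing the model gets told about the tool.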

simonw 7 days ago

As an experienced software engineer I'm bothered about pretty much everything about how we develop things on top of LLMs! I can't even figure out how to write automated tests for them.

See also my multi-year obsession with prompt injection and LLM security, which still isn't close to being a solved problem: https://simonwillison.net/tags/prompt-injection/

Yet somehow I can't tear myself away from them. The fact that we can use computers to mostly understand human language (and vision problems as well) is irresistible to me.

131012 7 days ago

This is exactly why I follow your work, this mix of critical thinking and enthusiasm. Please keep going!

th0ma5 7 days ago

> The fact that we can use computers to mostly understand human language

I agree it would be amazing if they did that, but they most certainly do not. I think this is the core of my disagreement: you believe this and let it guide you. They don't understand anything; they are matching and synthesizing patterns. I can see how that's enthralling, like watching a Rube Goldberg machine go through its paces, but there is no there there. The idea that there is an emergent something there is at best an unproven theory, is documented as being an illusion, and at worst has become an unfounded messianic belief.

simonw 7 days ago

That's why I said "mostly".

I know they're just statistical models, and that having conversations with them is like having a conversation with a stack of dice.

But if the simulation is good enough to be useful, the fact that they don't genuinely "understand" doesn't really matter to me.

I've had tens of thousands of "conversations" with these things now (I know because I log them all). Whether or not they understand anything they're still providing a ton of value back to me.

th0ma5 7 days ago

I guess I respect that you're stating it honestly, but this is a statement of belief or faith. I think you should disclose that more often, because it doesn't stem from first principles and is, in the end, tautological. This also pins down our fundamental disagreement more precisely: I wouldn't blog about beliefs as if they were the technology itself.

tessellated 7 days ago

I don't need belief or faith to get use and entertainment out of the transformers. As Simon said, good enough.

xrd 7 days ago

You put it so well! I agree wholeheartedly. LLMs are language toys we get to play with, and it's so much fun. But I'm bothered in the same way you are, and that's fine.

stavros 7 days ago

That's really interesting, thanks Simon! I was kind of expecting the LLM to already be trained for this; I'll try Claude's prompt and see. Thanks again.