omneity 5 days ago

I have been trying GPT-4.1 for a few hours now through Cursor on a fairly complicated codebase. For reference, my gold standard for a coding agent is Claude Sonnet 3.7, despite its tendency to diverge and lose focus.

My takeaways:

- This is the first model from OpenAI that feels relatively agentic to me (o3-mini sucks at tool use, 4o just sucks). It seems to be able to piece together several tools to reach the desired goal and follows a roughly coherent plan.

- There is still more work to do here. Despite OpenAI's cookbook[0] and some prompt engineering on my side, GPT-4.1 stops quickly to ask questions, falling into a fairly useless "convo mode". Its tool calls fail far too often as well, in my opinion. (See the prompt sketch after this list for the kind of instructions I tried.)

- It's also able to handle significantly less complexity than Claude, resulting in some comical failures. Where Claude would create server endpoints, frontend components, and routes, and connect the two, GPT-4.1 creates a simplistic UI that calls a mock API despite explicit instructions. When prompted to fix it, it went haywire and couldn't handle the multiple scopes involved in that test app.

- That said, within these limits it's much less unnerving than Claude, and it sticks to the request, as long as the request is not too complex.
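For reference, this is roughly the kind of persistence-and-planning system prompt the prompting guide[0] recommends. A minimal sketch in Python with the openai SDK; the prompt wording is paraphrased from memory (the guide has the exact text), and the read_file tool is a hypothetical placeholder:

```python
# Sketch of an agentic system prompt in the spirit of the GPT-4.1
# prompting guide[0]; wording paraphrased, tool schema is illustrative.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM_PROMPT = (
    "You are an agent: keep going until the user's request is fully "
    "resolved before ending your turn. If you are unsure about file "
    "contents or codebase structure, use your tools to read files; do "
    "not guess. Plan before each tool call and reflect on the results."
)

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Wire the frontend form to the real API."},
    ],
    tools=[{  # hypothetical file-reading tool, for illustration only
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from the repository",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
)
print(response.choices[0].message)
```

Even with reminders along these lines, it kept dropping back into Q&A mode instead of pushing through.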

My conclusion: I like it, and I totally see where it shines: narrow, targeted work. It slots in next to Claude 3.7 for creative work and Gemini 2.5 Pro for deep, complex tasks. GPT-4.1 does feel like a smaller model compared to those two, but maybe I just need to use it for longer.

0: https://cookbook.openai.com/examples/gpt4-1_prompting_guide

ttul 5 days ago

I feel the same way about these models as you conclude. Gemini 2.5 is where I paste whole projects for major refactoring efforts or for building big new bits of functionality. Claude 3.7 is great for most day-to-day edits. And 4.1 is okay for small things.

I hope they release a distillation of 4.5 that uses the same training approach; that might be a pretty decent model.

sreeptkid 4 days ago

I completely agree. My initial takeaway is that 3.7 Sonnet is still the superior coding model. I'm suspicious now of how they decide these benchmarks...