I've found sonnet-3.7 to be incredibly inconsistent. It can do very well, but it has a strong tendency to go off-track and do weird things.
3.5 is better for this, ime. I hooked Claude Desktop up to an MCP server to fake claude-code minus the extortionate pricing, and it works decently. I've been trying to apply it to Rust work; it's not great yet (it still doesn't really seem to "understand" Rust's concepts), but it can do some stuff if you make it run `cargo check` after each change and stop it when the check fails.
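For anyone curious, the cargo-check side is easy to wire up. Here's a rough sketch assuming the official TypeScript MCP SDK (@modelcontextprotocol/sdk) and zod; the tool name and shape are my own illustration, not anything standard:

```typescript
// Minimal MCP tool that runs `cargo check` and returns compiler diagnostics.
// Assumes @modelcontextprotocol/sdk and zod; names are illustrative.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);
const server = new McpServer({ name: "cargo-tools", version: "0.1.0" });

server.tool(
  "cargo_check",
  "Run `cargo check` in a crate and return the compiler output",
  { cwd: z.string().describe("Path to the crate root") },
  async ({ cwd }) => {
    try {
      const { stdout, stderr } = await run(
        "cargo",
        ["check", "--message-format=short"],
        { cwd }
      );
      return { content: [{ type: "text", text: "OK\n" + stdout + stderr }] };
    } catch (err: any) {
      // Non-zero exit: surface the errors so the model is forced to fix them.
      return {
        content: [
          { type: "text", text: "FAILED\n" + (err.stdout ?? "") + (err.stderr ?? "") },
        ],
        isError: true,
      };
    }
  }
);

await server.connect(new StdioServerTransport());
```

The important part is returning the failure output to the model instead of swallowing it; otherwise it happily moves on to the next change.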
I expect something like o3-high is the best out there (the aider leaderboards support this), either alone or in combination with 4.1, but tbh that's out of my price range. And frankly, I can't mentally get past paying a very high price for an LLM response that may or may not be useful; it leaves me incredibly resentful as a customer that your model can fail the task, require multiple "re-rolls", and pass that marginal cost on to me.
I am avoiding the cost of API access by using the chat UI instead, in my case Google Gemini 2.5 Pro with its large context window. Repomix a whole repo, paste it in with a standard prompt saying "return full source" (it tends to stop following this instruction after a few back-and-forths), and then apply the result back on top of the repo (I vibe-coded https://github.com/radekstepan/apply-llm-changes to help me with that). Otherwise, yeah: $5 spent on Cline with Claude 3.7 and, instead of fixing my tests, I end up with if/else statements in the source code to make the tests pass.
I decided to experiment with Claude Code this month. The other day it decided the best way to fix the spec was to add a conditional to the test that causes it to return true before getting to the thing that was actually supposed to be tested.
I’m finding it useful for really tedious stuff like doing complex, multi-step terminal operations. For the coding… it’s not been great.
I’ve had this happen in different ways many times. Instead of resolving the underlying issue behind an exception, it just suggests catching the exception and carrying on.
It also depends a lot on the mix of model, type of code, and libraries involved. Even on different days the models seem to be more or less capable (I’m assuming they get throttled internally; this is very noticeable sometimes in how they try to save on output tokens and summarize the code responses as much as possible, at least in the chat/non-API interfaces).
Cool tool. What format does it expect from the model?
I’ve been looking for something that can take “bare diffs” (unified diffs without line numbers) from the clipboard and apply them directly to a buffer (an open file in vscode).
None of the paste-diff extensions for vscode work, as they expect a full unified diff/patch.
I also tried a Google-developed patch tool, but it wasn’t very good at taking in the bare diffs either, and it definitely couldn’t do clipboard.
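Roughly what I have in mind, as a sketch only (the command name and the naive hunk-matching strategy are just illustration, not an existing extension):

```typescript
// VS Code command sketch: read a "bare" unified diff (context/+/- lines,
// no @@ headers or line numbers) from the clipboard and apply it to the
// active editor by matching each hunk's old text. Naive: first match only,
// whitespace has to line up exactly.
import * as vscode from "vscode";

export function activate(context: vscode.ExtensionContext) {
  context.subscriptions.push(
    vscode.commands.registerCommand("bareDiff.applyFromClipboard", async () => {
      const editor = vscode.window.activeTextEditor;
      if (!editor) return;

      const diff = await vscode.env.clipboard.readText();
      // Treat blank-line-separated runs of ' ', '-', '+' lines as hunks.
      const hunks = diff.split(/\n\s*\n/).filter((h) => h.trim().length > 0);

      await editor.edit((edit) => {
        const doc = editor.document;
        for (const hunk of hunks) {
          const lines = hunk.split("\n");
          // Old text = context + removed lines; new text = context + added lines.
          const oldText = lines.filter((l) => !l.startsWith("+")).map((l) => l.slice(1)).join("\n");
          const newText = lines.filter((l) => !l.startsWith("-")).map((l) => l.slice(1)).join("\n");

          const start = doc.getText().indexOf(oldText);
          if (start === -1) continue; // hunk didn't match, skip it
          const range = new vscode.Range(
            doc.positionAt(start),
            doc.positionAt(start + oldText.length)
          );
          edit.replace(range, newText);
        }
      });
    })
  );
}
```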
Markdown format with a comment saying what the file path is. So:
This is src/components/Foo.tsx

```tsx
// code goes here
```

OR

```tsx
// src/components/Foo.tsx
// code goes here
```
These seem to work the best.
I tried diff syntax, but Gemini 2.5 just produced way too many bugs.
I also tried using regex and building an AST of the markdown doc and going from there, but ultimately settled on calling gpt-4.1-mini-2025-04-14 with the opening of each code block (```) plus the 3 lines before and after it (rough sketch below). It's fast/cheap enough to work.
Though I still have to make edits sometimes. WIP.
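Something like this shape, heavily simplified, using the OpenAI Node SDK; the prompt and the "return a file path" framing are illustrative, not the tool's actual code:

```typescript
// Ask a small, cheap model which file a markdown code block belongs to,
// given the opening fence and 3 lines of context on each side.
// Sketch only; assumes the OpenAI Node SDK and OPENAI_API_KEY in the env.
import OpenAI from "openai";

const client = new OpenAI();

async function guessFilePath(window: string): Promise<string> {
  const res = await client.chat.completions.create({
    model: "gpt-4.1-mini-2025-04-14",
    temperature: 0,
    messages: [
      {
        role: "system",
        content:
          "You are shown the opening of a markdown code block plus a few " +
          "surrounding lines. Reply with only the file path it targets, or 'unknown'.",
      },
      { role: "user", content: window },
    ],
  });
  return res.choices[0].message.content?.trim() ?? "unknown";
}

// Usage: slice 3 lines above and below each fence and pass the window in.
// guessFilePath("This is src/components/Foo.tsx\n\n```tsx\n// ...").then(console.log);
```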
I've been using Mistral Medium 3 for the last couple of days, and I'm honestly surprised at how good it is. Highly recommend giving it a try if you haven't, especially if you are trying to reduce costs. I've basically switched from Claude to Mistral and honestly would prefer it even if costs were equal.
How are you running the model? Mistral’s API, a local version through Ollama, or something else?
Is mistral on open router?
Yup https://openrouter.ai/provider/mistral
I guess it can't really be run locally https://www.reddit.com/r/LocalLLaMA/comments/1kgyfif/introdu...
I seem to be alone in this, but the only models truly good at coding are slow, heavy test-time-compute models.
o1-pro and o1-preview are the only models I've ever used that can reliably update and work with 1000 LOC without error.
I don't let o3 write any code unless it's very small. Any "cheap" model will hallucinate or fail massively when pushed.
One good tip I've adopted lately: remove all comments from your code before passing it to an LLM, and don't let LLM-generated comments persist under any circumstance.
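If you want to automate the stripping, a naive pass like this covers C-style comments (sketch only: it skips string literals but ignores regex literals and other edge cases, so it's a cleanup helper, not a parser):

```typescript
// Strip // and /* */ comments from C-style source while leaving string
// literals alone, so things like "http://example.com" survive.
function stripComments(src: string): string {
  let out = "";
  let i = 0;
  while (i < src.length) {
    const ch = src[i];
    const next = src[i + 1];

    if (ch === '"' || ch === "'" || ch === "`") {
      // Copy the whole string literal verbatim, honoring backslash escapes.
      const quote = ch;
      out += ch;
      i++;
      while (i < src.length && src[i] !== quote) {
        if (src[i] === "\\") { out += src[i]; i++; }
        if (i < src.length) { out += src[i]; i++; }
      }
      if (i < src.length) { out += src[i]; i++; } // closing quote
    } else if (ch === "/" && next === "/") {
      while (i < src.length && src[i] !== "\n") i++; // drop to end of line
    } else if (ch === "/" && next === "*") {
      i += 2;
      while (i < src.length && !(src[i] === "*" && src[i + 1] === "/")) i++;
      i += 2; // skip past the closing marker
    } else {
      out += ch;
      i++;
    }
  }
  return out;
}

// stripComments('const url = "http://x"; // drop me') => 'const url = "http://x"; '
```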
Interesting. I've never tested o1-pro because it's insanely expensive but preview seemed to do okay.
I wouldn't be shocked if huge, expensive-to-run models performed better and if all the "optimized" versions were actually labs trying to ram cheaper bullshit down everyone's throat. Basically chinesium for LLMs; you can afford them but it's not worth it. I remember someone saying o1 was, what, 200B dense? I might be misremembering.
I'm positive they are pushing users to cheaper models due to cost. o1-pro is now in a submenu for pro users and labeled legacy. The big inference methods must be stupidly expensive.
o1-preview was, and possibly still is, the most powerful model they ever released. I only switched to pro for coding after months of them improving it and my API bill getting a bit crazy (like $0.50 per question).
I don't think parameter count matters anymore. I think the only thing that matters is how much compute a vendor will give you per question.
I never have LLMs work on 1000 LOC. I don't think that's the value-add. Instead I use it at the function and class level to accelerate my work. The thought of having any agent, human or computer, run amok in my code makes me uncomfortable. At the end of the day I'm still accountable for the work, and I have to read and comprehend everything. If I do it piecewise, it makes tracking the work easier.