Had a similar experience with Claude Code lately. I got a notice that some credits were expiring, so I opened up Claude Code and asked it to fix all the Credo errors in an Elixir project (style guide enforcement).
I gave it incredibly clear steps of what to run and in what order, maybe 6 steps, 4 of which were individual severity levels; roughly the shape of the workflow sketched below.
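For the curious, the per-severity part looked roughly like this. I'm reconstructing it from memory, so treat the exact flags as illustrative and check mix help credo for what your Credo version actually supports:

    # Illustrative sketch, not my exact prompt. --min-priority filters to issues
    # at or above the given priority, so each pass widens the net a little.
    mix credo --strict                  # full list of issues, including low-priority ones
    mix credo --min-priority higher     # pass 1: only the most severe issues; fix, re-run until clean
    mix credo --min-priority high       # pass 2
    mix credo --min-priority normal     # pass 3
    mix credo --min-priority low        # pass 4
    mix format                          # finish with a formatting pass and one last mix credo --strict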
Within a few minutes it would ask to commit code, create branches, run tests, start servers — always something new, none of which were in my instructions. It would also often run mix credo, get a list of warnings, deem them unimportant, then try to go do its own thing.
It was really cool: I basically worked through 1000 formatting errors in 2 hours with $40 of credits (that I would have had no use for otherwise).
But man, I can’t imagine letting this thing run a single command without checking the output.
So... I know that people frame these sorts of things as if it's some kind of quantization conspiracy, but as someone who started using Claude Code the _moment_ it came out, it felt particularly strong. Then it felt like they... tweaked something, whether in CC or Sonnet 3.7, and it went a little downhill. It's still very impressive, but something was lost.
I've found Gemini 2.5 Pro to be extremely impressive and much more able to run in an extended fashion by itself, although I've found very high variability in how well 'agent mode' works between different editors. Cursor has been very very weak in this regard for me, with Windsurf working a little better. Claude Code is excellent, but at the moment does feel let down by the model.
I've been using Aider with Gemini 2.5 Pro and found that it's very much able to 'just go' by itself. I shipped a mode for Aider that lets it do so (sibling comment here) and I've had it do some huge things that run for an hour or more, though it assuredly does get stuck and act stupidly on other tasks as well; a minimal invocation is sketched below.
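For reference, the basic setup is just Aider pointed at Gemini; something like the following, where the model string is a placeholder for whatever identifier your Aider version and provider expose for 2.5 Pro:

    # Illustrative only: the model name and env var depend on your Aider version
    # and provider configuration, so check aider --help and the Aider docs.
    export GEMINI_API_KEY=your-key-here
    aider --model gemini/gemini-2.5-pro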
My point, more than anything, is that I'd try different editors and different (stronger) models and see; small tweaks to prompt and tooling are making a big difference to these tools' effectiveness right now. Also, different models seem to excel at different problems, so switching models is often a good choice.
> I've had it do some huge things that run for an hour or more,
Can you clarify this? If I am reading this right, you let the LLM think/generate output for an hour? This seems bonkers to me.
Eh, I am happy waiting many years before any of that. If it only works right with the right model for the right job, and it’s very fuzzy which models work for which tasks, and the models change all the time (oftentimes silently)… at some point it’s just easier to do the easy task I’m trying to offload than to juggle all of this.
If and when I go about trying these tools in the future, I’ll probably look for an open source TUI, so keep up the great work on aider!