wrs 8 days ago

I’ve been using Cursor and Code regularly for a few months now and the idea of letting three of them run free on the codebase seems insane. The reason for the chat interface is that the agent goes off the rails on a regular basis. At least 25% of the time I have to hit the stop button and go back to a checkpoint because the automatic lawnmower has started driving through the flowerbed again. And paradoxically, the more capable the model gets, the more likely it seems to get random ideas of how to fix things that aren’t broken.

barrell 8 days ago

Had a similar experience with Claude Code lately. I got a notice some credits were expiring, so I opened up Claude Code and asked it to fix all the credo errors in an Elixir project (style guide enforcement).

I gave it incredibly clear steps of what to run and in what order, maybe 6 steps, 4 of which were individual severity levels.

Within a few minutes it would ask to commit code, create branches, run tests, start servers — always something new, none of which were in my instructions. It would also often run mix credo, get a list of warnings, deem them unimportant, then try to go do its own thing.

It was really cool, I basically worked through 1000 formatting errors in 2 hours with $40 of credits (that I would have had no use for otherwise).

But man, I can’t imagine letting this thing run a single command without checking the output

tekacs 8 days ago

So... I know that people frame these sorts of things as if it's some kind of quantization conspiracy, but as someone who started using Claude Code the _moment_ it came out, it felt particularly strong. Then it feels like they... tweaked something, whether in CC or Sonnet 3.7, and it went a little downhill. It's still very impressive, but something was lost.

I've found Gemini 2.5 Pro to be extremely impressive and much more able to run in an extended fashion by itself, although I've found very high variability in how well 'agent mode' works between different editors. Cursor has been very very weak in this regard for me, with Windsurf working a little better. Claude Code is excellent, but at the moment does feel let down by the model.

I've been using Aider with Gemini 2.5 Pro and found that it's very much able to 'just go' by itself. I shipped a mode for Aider that lets it do so (sibling comment here) and I've had it do some huge things that run for an hour or more, but assuredly it does get stuck and act stupidly on other tasks as well.

My point, more than anything, is that... I'd try different editors and different (stronger) models and see - and that small tweaks to prompt and tooling are making a big difference to these tools' effectiveness right now. Also, different models seem to excel at different problems, so switching models is often a good choice.

sdesol 7 days ago

> I've had it do some huge things that run for an hour or more,

Can you clarify this? If I am reading this right, you let the llm think/generate output for an hour? This seems bonkers to me.

barrell 7 days ago

Eh, I am happy waiting many years before any of that. If it only works right with the right model for the right job, and it's very fuzzy which models work for which tasks, and the models change all the time (oftentimes silently)... at some point it's just easier to do the easy task I'm trying to offload than to juggle all of this.

If and when I go about trying these tools in the future, I'll probably look for an open source TUI, so keep up the great work on aider!

cruffle_duffle 6 days ago

"At least 25% of the time I have to hit the stop button and go back to a checkpoint because the automatic lawnmower has started driving through the flowerbed again."

I absolutely love this analogy! And yes 25% seems right. Interestingly in like 50% of those cases all the models get into the same loop.

esperent 7 days ago

> letting three of them run free on the codebase seems insane

That seems like an unfair characterization of the process they described here.

They only allowed the agents to create pull requests for a specific bug. Both the bug report and the decision of which, if any, PR to accept is done by a human being.

wrs 7 days ago

Right, but it seems like that would just generate three PRs I don’t want to review, given the likelihood the agent went into the weeds without someone supervising it.