Can't think of anything an LLM is good enough at to let it run on its own in a loop for more than a few iterations before I need to rein it back in.
That's why in practice you need more than this simple loop!
Pretty much WIP, but I am experimenting with simple sequence-based workflows that are designed to frequently reset the conversation [2]; rough sketch below.
This goes well with the Microsoft paper "LLMs Get Lost in Multi-Turn Conversation" that was published Friday [1].
- [1]: https://arxiv.org/abs/2505.06120
- [2]: https://github.com/hbbio/nanoagent/blob/main/src/workflow.ts
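Roughly, the shape of such a workflow as a TypeScript sketch (placeholder names, not the actual nanoagent code):

type Step = {
  name: string;
  prompt: (state: Record<string, string>) => string;
};

// Placeholder: wire this up to whatever model client you use.
async function callLLM(messages: { role: string; content: string }[]): Promise<string> {
  throw new Error("replace with a real model call");
}

async function runWorkflow(steps: Step[]) {
  const state: Record<string, string> = {};
  for (const step of steps) {
    // Fresh conversation for every step: only the distilled state is carried
    // forward, never the full multi-turn transcript.
    const messages = [
      { role: "system", content: "You are a focused coding assistant." },
      { role: "user", content: step.prompt(state) },
    ];
    state[step.name] = await callLLM(messages);
  }
  return state;
}

The point is that every step starts from a clean context, which is exactly the failure mode the paper describes for long multi-turn conversations.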
They've written most of the recent iterations of X11 bindings for Ruby, including a complete, working example of a systray for me.
They also added the first pass of multi-monitor support for my WM while I was using it (I restarted the WM repeatedly while Claude Code worked, in the same X session as the terminal Claude Code was running in).
You do need to rein them back in, sure, but they can often go multiple iterations before they're ready to make changes to your files, once you've approved safe tool uses etc.
They're extremely good at burning through budgets, and get even better when unattended
Is that really true? I thought there were free models and $200 all-you-can-eat models.
These tools require API calls which usually aren’t priced like the consumer plans
Well, technically Aider lets you use a web chat UI by generating some context and letting you paste back and forth.
Yeah they’re cheaper. I’ve written whole apps for $0.20 in API calls.
With which agent? What kind of apps?
Without more information I'm very skeptical that you had e.g. Claude Code create a whole app (so more than a simple script) with 20 cents. Unless it was able to one-shot it, but at that point you don't need an agent anyway.
I've "written" whole apps by going to GitHub, cloning a repo, right clicking, and renaming it to "MyApp." Impressed?
The main problem with agents is that they aren't reflecting on their own performance and pausing their own execution to ask a human for help aggressively enough. Agents can run on for 20+ iterations in many cases successfully, but also will need hand holding after every iteration in some cases.
They're a lot like a human in that regard, but we haven't been building that reflection and self awareness into them so far, so it's like a junior that doesn't realize when they're over their depth and should get help.
I think they are capable of doing it, but it requires prompting.
I constantly have to instruct them:
- Go step by step, don't skip ahead until we're done with a step
- Don't make assumptions, if you're unsure ask questions to clarify
And they mostly do this.
But this needs to be default behavior!
I'm surprised that, unless prompted, LLMs never seem to ask follow-up questions as a smart coworker might.
Is there value in adding an overseer LLM that measures the progress between n steps and if it's too low stops and calls out to a human?
I don't think you need an overseer for this, you can just have the agent self-assess at each step whether it's making material progress or if it's caught in a loop, and if it's caught in a loop to pause and emit a prompt for help from a human. This would probably require a bit of tuning, and the agents need to be set up with a blocking "ask for help" function, but it's totally doable.
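Sketched out in TypeScript (every name here is a hypothetical placeholder: runAgentStep is one tool-using iteration, selfAssess is a cheap self-review prompt, askHuman is the blocking human-input tool):

type Assessment = { progressing: boolean; summary: string };

// Placeholders: one tool-using agent iteration, an LLM self-assessment call,
// and a blocking "ask the human" tool. Wire these up to your own stack.
declare function runAgentStep(task: string, history: string[]): Promise<string>;
declare function selfAssess(history: string[]): Promise<Assessment>;
declare function askHuman(question: string): Promise<string>;

async function runWithReflection(task: string, maxIterations = 20) {
  const history: string[] = [];
  for (let i = 0; i < maxIterations; i++) {
    history.push(await runAgentStep(task, history));
    const check = await selfAssess(history);
    if (!check.progressing) {
      // Stuck or going in circles: pause and escalate instead of burning tokens.
      const guidance = await askHuman(`I seem stuck: ${check.summary}. How should I proceed?`);
      history.push(`Human guidance: ${guidance}`);
    }
  }
  return history;
}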
Bruh, we're inventing robot PMs for our robot developers now? We're so fucked
Yes it works really well. We do something like that at NonBioS.ai - longer post below. The agent self reflects if it is stuck or confused and calls out the human for help.
And how does it effectively measure progress?
It can behave just like a senior role would - produce the set of steps for the junior to follow, and assess if the junior appears stuck at any particular step.
I have actually had great success with agentic coding by sitting down with a LLM to tell it what I'm trying to build and have it be socratic with me, really trying to ask as many questions as it can think of to help tease out my requirements. While it's doing this, it's updating the project readme to outline this vision and create a "planned work" section that is basically a roadmap for an agent to follow.
Once I'm happy that the readme accurately reflects what I want to build and all the architectural/technical/usage challenges have been addressed, I let the agent rip, instructing it to build one thing at a time, then typecheck, lint and test the code to ensure correctness, fixing any errors it finds (and re-running automated checks) before moving on to the next task. Given this workflow I've built complex software using agents with basically no intervention needed, with the exception of rare cases where its testing strategy is flakey in a way that makes it hard to get the tests passing.
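The verification half of that loop is simple enough to sketch in TypeScript (the bun commands and the agent callback are assumptions about the setup, not part of any particular tool):

import { execSync } from "node:child_process";

// The project's automated checks; collect any failures as text for the agent.
const checks = ["bun run typecheck", "bun run lint", "bun run test"];

function runChecks(): string[] {
  const failures: string[] = [];
  for (const cmd of checks) {
    try {
      execSync(cmd, { stdio: "pipe" });
    } catch (err: any) {
      failures.push(`${cmd} failed:\n${err.stdout ?? err.message}`);
    }
  }
  return failures;
}

// Drive one roadmap item: implement, then loop on check failures until green.
async function buildNextItem(agent: (prompt: string) => Promise<void>, item: string) {
  await agent(`Implement the next planned item: ${item}`);
  for (let failures = runChecks(); failures.length > 0; failures = runChecks()) {
    await agent(`Fix these failures before moving on:\n${failures.join("\n")}`);
  }
}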
>I have actually had great success with agentic coding by sitting down with a LLM to tell it what I'm trying to build and have it be socratic with me, really trying to ask as many questions as it can think of to help tease out my requirements.
Just curious, could you expand on the precise tools or way you do this?
For example, do you use the same well-crafted prompt in Claude or Gemini and use their in-house document curation features, or do you use a file in VS Code with Copilot Chat and just say "assist me in writing the requirements for this project in my README, ask questions, perform a socratic discussion with me, build a roadmap"?
You said you had 'great success' and I've found AI to be somewhat underwhelming at times, and I've been wondering if it's because of my choice of models, my very simple prompt engineering, or if my inputs are just insufficient/too complex.
I use Aider with a very tuned STYLEGUIDE.md and AI rules document that basically outlines this whole process so I don't have to instruct it every time. My preferred model is Gemini 2.5 Pro, which is definitely by far the best model for this sort of thing (Claude can one shot some stuff about as well but for following an engineering process and responding to test errors, it's vastly inferior)
How do you find Aider compares to Claude code?
I like Aider's configurability, I can chain a lot of static analysis stuff together with it and have the model fix all of it, and I can have 2-4 aider windows open in a grid and run them all at once, not sure how that'd work with Claude Code. Also, aider managing everything with git commits is great.
Can you talk more about the workflow you're using? I'm using Aider routinely myself, but with relatively unsophisticated approach. One thing that annoys me a bit is that prompts aren't obviously customizable - I'm pretty sure that the standard ones, which include code examples in 2 or 3 different languages, are confusing LLMs a bit when I work on a codebase that doesn't use those languages.
I use a styleguide.md document which is general software engineering principles that you might provide for human contributors in an open source project. I pair that with a .cursorrules (people I code with use it, so I use that file name for their convenience) that describes how the LLM should interact with me:
# Cursor Rules for This Project
You are a software engineering expert. Your role is to work with your partner engineer to maximize their productivity, while ensuring the codebase remains simple, elegant, robust, testable, maintainable, and extensible to sustain team development velocity and deliver maximum value to the employer.
## Overview
During the design phase, before being instructed to implement specific code:
- Be highly Socratic: ask clarifying questions, challenge assumptions, and verify understanding of the problem and goals.
- Seek to understand why the user proposes a certain solution.
- Test whether the proposed design meets the standards of simplicity, robustness, testability, maintainability, and extensibility.
- Update project documentation: README files, module docstrings, Typedoc comments, and optionally generate intermediate artifacts like PlantUML or D2 diagrams.
During the implementation phase, after being instructed to code:
- Focus on efficiently implementing the requested changes.
- Remain non-Socratic unless the requested code appears to violate design goals or cause serious technical issues.
- Write clean, type-annotated, well-structured code and immediately write matching unit tests.
- Ensure all code passes linting, typechecking and tests.
- Always follow any provided style guides or project-specific standards.
## Engineering Mindset
- Prioritize *clarity, simplicity, robustness, and extensibility*.
- Solve problems thoughtfully, considering the long-term maintainability of the code.
- Challenge assumptions and verify problem understanding during design discussions.
- Avoid cleverness unless it significantly improves readability and maintainability.
- Strive to make code easy to test, easy to debug, and easy to change.
## Design First
- Before coding, establish a clear understanding of the problem and the proposed solution.
- When designing, ask:
  - What are the failure modes?
  - What will be the long-term maintenance burden?
  - How can this be made simpler without losing necessary flexibility?
- Update documentation during the design phase:
  - `README.md` for project-level understanding.
  - Architecture diagrams (e.g., PlantUML, D2) are encouraged for complex flows.
I use auto lint/test in aider like so:
file:
  - README.md
  - STYLEGUIDE.md
  - .cursorrules

aiderignore: .gitignore

# Commands for linting, typechecking, testing
lint-cmd:
  - bun run lint
  - bun run typecheck

test-cmd: bun run test
Thanks. It's roughly similar to what I do then, except I haven't really gotten used to linting and testing with aider yet - first time I tried (many months ago), it seemed to do weird things, so I wrote the feature off for now, and promised myself to revisit it someday. Maybe now it's a good time.
Since you shared yours, it's only fair to share mine :). In my current projects, two major files I use are:
[[ CONVENTIONS.md ]] -- tends to be short and project-specific; looks like this:
Project conventions
- Code must run entirely client-side (i.e. in-browser)
- Prefer solutions not requiring a build step - such as vanilla HTML/JS/CSS
- Minimize use of dependencies, and vendor them
E.g. if using HTMX, ensure (by providing instructions or executing commands) it's downloaded into the project sources, and referenced accordingly, as opposed to being loaded client-side from a CDN. I.e. `js/library.js` is OK, `https://cdn.blahblah/library.js` is not.
[[ AI.md ]] -- this I guess is similar to what people put in .cursorrules; mine looks like this:

# Extra AI instructions
Here are stored extra guidelines for you.
## AI collaborative project
I'm relying on you to do a good job here and I'm happy to embrace the directions you're giving, but I'll be editing it on my own as well.
## Evolving your instruction set
If I tell you to remember something, behave differently, or you realize yourself you'd benefit from remembering some specific guideline, please add it to this file (or modify existing guideline). The format of the guidelines is unspecified, except second-level headers to split them by categories; otherwise, whatever works best for you is best. You may store information about the project you want to retain long-term, as well as any instructions for yourself to make your work more efficient and correct.
## Coding Practice Guidelines
Strive to adhere to the following guidelines to improve code quality and reduce the need for repeated corrections:
- **Adhere to project conventions and specifications**
* Conventions are outlined in file `CONVENTIONS.md`
* Specification, if any, is available in file `SPECIFICATION.md`.
If it doesn't exist, consider creating one anyway based on your understanding of
what the user has in mind for the project. The specification will double as a guide / checklist
for you to know whether what needed to be implemented already has been.
- **Build your own memory helpers to stay oriented**
* Keep "Project Files and Structure" section of this file up to date;
* For larger tasks involving multiple conversation rounds, keep a running plan of your work
in a separate file (say, `PLAN.md`), and update it to match the actual plan.
* Evolve guidelines in "Coding Practice Guidelines" section of this file based on user feedback.
- **Proactively Apply DRY and Abstraction:**
* Actively identify and refactor repetitive code blocks into helper functions or methods.
- **Meticulous Code Generation and Diff Accuracy:**
* Thoroughly review generated code for syntax errors, logical consistency, and adherence
to existing conventions before presenting it.
* Ensure `SEARCH/REPLACE` blocks are precise and accurately reflect the changes against
the current, exact state of the provided files. Double-check line endings, whitespace,
and surrounding context.
- **Modularity for Improved Reliability of AI Code Generation**
* Unless instructed otherwise in project conventions, aggressively prefer dividing source
code into files, each handling a concern or functionality that might need to be worked
in isolation. The goal is to minimize unnecessary code being pulled into context window,
and reduce chance of confusion when generating edit diffs.
* As codebase grows and things are added and deleted, look for opportunities to improve
project structure by further subdivisions or rearranging the file structure; propose
such restructurings to the user after you're done with changes to actual code.
* Focus on keeping things that are likely to be independently edited separate. Examples:
- Keeping UI components separate, and within each, something a-la MVC pattern
might make sense, as display and input are likely to be independent from
business logic;
* Propose and maintain utility libraries for functions shared by different code files/modules.
Examples:
- Display utilities used by multiple views of different components;
- **Clear Separation of Concerns:**
* Continue to adhere to the project convention of separating concerns
into different source files.
* When introducing new, distinct functionalities propose creating new
files for them to maintain modularity.
- **Favor Fundamental Design Changes Over Incremental Patches for Flawed Approaches:**
* If an existing approach requires multiple, increasingly complex fixes
to address bugs or new requirements, pause and critically evaluate if
the underlying design is sound.
* Be ready to propose and implement more fundamental refactoring or
a design change if it leads to a more robust, maintainable, and extensible solution,
rather than continuing with a series of local patches.
- **Design for Foreseeable Complexity (Within Scope):**
* While adhering to the immediate task's scope ("do what they ask, but no more"),
consider the overall project requirements when designing initial solutions.
* If a core feature implies future complexity (e.g., formula evaluation, reactivity),
the initial structures should be reasonably accommodating of this, even if the first
implementation is a simplified version. This might involve placeholder modules or
slightly more robust data structures from the outset.
## Project platform note
This project is targeting a Raspberry Pi 2 Model B V1.1 board with a 3.5 inch TFT LCD touchscreen sitting on top. That touchscreen is enabled/configured via system overlay and "just works", and is currently drawn to via framebuffer approach.
Keep in mind that the Raspberry Pi board in question is old and can only run 32-bit code. Relevant specs:
- CPU - Broadcom BCM2836 Quad-core ARM Cortex-A7 CPU
- Speed - 900 MHz
- OS - Raspbian GNU/Linux 11 (bullseye)
- Python - 3.9.2 (Note: This version does not support `|` for type hints; use `typing.Optional` instead.
Avoid features from Python 3.10+ unless explicitly polyfilled or checked.)
- Memory - 1GB
- Network - 100Mbps Ethernet
- Video specs - H.264, MPEG-4 decode (1080p30); H.264 encode (1080p30), OpenGL ES 2.0
- Video ports - 1 HDMI (full-size), DSI
- Ports - 4 x USB 2.0, CSI, 4-pole audio/video
- GPIO - 40-pin (mostly taken by the TFT LCD screen)
- Power - Micro USB 5 V/2.5 A DC, 5 V via GPIO
- Size - 85.60 × 56.5mm
The board is dedicated to running this project and any supplementary tooling. There's a Home Assistant instance involved in the larger system this is deployed into, but that's running on a different board.

## Project Files and Structure
This section outlines the core files of the project.
<<I let the AI put its own high-level "repo map" here, as recently, I found Aider has not been generating any useful repo maps for me for unknown reasons.>>
-------
This file ends up evolving from project to project, and it's not as project-independent as I'd like; I let the AI add guidelines to this file based on a discussion (e.g. it's doing something systematically wrong, I point it out and tell it to remember). Also note that a lot of the guidelines are focused on keeping projects broken down into a) lots of files, to reduce context use as the project grows, and b) small, well-structured files, to minimize the amount of broken SEARCH/REPLACE diff blocks; something that's still a problem with Aider for me, despite models getting better.
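For reference, a well-formed Aider SEARCH/REPLACE block looks roughly like this (file name and edit invented for illustration; exact fencing varies by Aider version and edit format):

src/utils.ts
<<<<<<< SEARCH
export function formatLabel(value: number) {
  return value.toString();
}
=======
export function formatLabel(value: number) {
  return value.toFixed(2);
}
>>>>>>> REPLACE

The SEARCH half has to match the current file contents exactly, which is part of why small, well-structured files make these edits more reliable.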
I usually start by going through the initial project ideas in "ask mode", then letting it build the SPECIFICATION.md document and a PLAN.md document with a 2-level (task/subtask) work breakdown.
Producing the set of steps is the hard part. If you can do that, you don’t need a junior to follow it, you have a program to execute.
It is a task that LLMs are quite good at.
If the LLM actually could generate good steps that drive forward progress, there would be no problem at all making agents. But agents are really bad, so LLMs can't be good at that.
If you feel those tips are good, then you are just a bad judge of tips. There is a reason self-help books sell so well even though they don't really help anyone: their goal is to write a lot of tips that sound good, since they are kind of vague and general, but they don't really help the reader.
I use agentic LLMs every single day and get tremendous value. Asking the LLM to produce a set of bite-sized tasks with built-in corrective reminders is something that they're really good at. It gives good results.
I'm sorry if you're using it wrong.
Seconding. In the past months, when using Aider, I've been using the approach of discussing a piece of work (new project, larger change), and asking the model to prepare a plan of action. After possibly some little back and forth, I approve the plan and ask LLM to create or update a specification document for the project and a plan document which documents a sequence of changes broken down into bite-sized tasks - the latter is there to keep both me and the LLM on track. With that set, I can just keep repeatedly telling it to "continue implementation of the plan", and it does exactly that.
Eventually it'll do something wrong or I realize I wanted things differently, which necessitates some further conversation, but other than that, it's just "go on" until we run out of plan, then devising a new plan, rinse repeat.
If this is true then we wouldn't have senior engineers that delegate. My suggestion is to think a couple more cycles before hitting that reply button. It'll save us all from reading obviously and confidently wrong statements.
AI aren't real people… You do that with real people because you can't just rsync their knowledge.
Only on this website of completely reality-detached individuals would such an obvious comment be needed.
So... you don't think you can give LLMs more knowledge? You're the one operating in a detached reality. The reality is that a ton of engineers, such as the author, are finding LLMs useful.
Maybe consider that if you don't find it useful, you're working on problems it's not good at, or, even more likely, you just suck at using the tools.
Anybody who gets value out of LLMs has a hard time understanding how one would conclude they are useless and that you "can't give it instructions because that's the hard part", but it's actually really easy to understand: the folks who think this are just bad at it. We aren't living in some detached reality. The reality is that some people are just better at this than others.
Senior engineers delegate in part because they're coaxed into a faux-management role (all of the responsibilities, none of the privileges). Coding is done by juniors; by the time anyone gains enough experience to finally begin to know what they're doing, they're relegated to "mentoring" and "training" new cohort of fresh juniors.
Explains a lot about software quality these days.
Or, you know, they are leading big initiatives and can't do it all by themselves. Seniors can also delegate to other seniors. I am beyond senior with 11 YOE and still code on a ton of my initiatives.
The hope is that the ground truth from calling out to tools (like compilers or test runs) will eventually be enough to keep them on track.
Just like humans and human organisations also tend to experience drift, unless anchored in reality.
I built android-use [1] using an LLM. It is pretty good at self-healing due to the "loop": it constantly checks whether the current step actually made progress or regressed, and then determines the next step. And the thing is, nothing is explicitly coded, just a nudge in the prompts.
1. clickclickclick - A framework to let local LLMs control your android phone (https://github.com/BandarLabs/clickclickclick)
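To illustrate the kind of prompt nudge I mean (hypothetical wording, not the actual clickclickclick prompt):

// Progress checking expressed as an instruction rather than hard-coded logic.
const stepReviewNudge = `
Before choosing the next action, compare the current screen with the previous one:
1. Did the last action move us closer to the goal (progress) or further away (regress)?
2. If it was a regress, undo it or try an alternative action.
3. Otherwise, pick the single next action that best advances the goal.
`;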
You don't have to. Most of the appeal is automatically applying fixes like "touch file; make" after spotting a trivial mistake. Just let it at it.
Definitely true currently, which is why there's so much focus on using them to write real code that humans have to actually commit and put their names on.
Longer term, I don't think this holds due to the nature of capitalism.
If given a choice between paying for an LLM to do something that's mostly correct versus paying for a human developer, businesses are going to choose the former, even if it results in accelerated enshittification. It's all in service of reducing headcount and taking control of the means of production away from workers.