libraryofbabel 1 day ago

Strongly recommend this blog post too which is a much more detailed and persuasive version of the same point. The author actually goes and builds a coding agent from zero: https://ampcode.com/how-to-build-an-agent

It is indeed astonishing how well a loop with an LLM that can call tools works for all kinds of tasks now. Yes, sometimes they go off the rails, there is the problem of getting that last 10% of reliability, etc. etc., but if you're not at least a little bit amazed then I urge you to go and hack together something like this yourself, which will take you about 30 minutes. It's possible to have a sense of wonder about these things without giving up your healthy skepticism of whether AI is actually going to be effective for this or that use case.

This "unreasonable effectiveness" of putting the LLM in a loop also accounts for the enormous proliferation of coding agents out there now: Claude Code, Windsurf, Cursor, Cline, Copilot, Aider, Codex... and a ton of also-rans; as one HN poster put it the other day, it seems like everyone and their mother is writing one. The reason is that there is no secret sauce and 95% of the magic is in the LLM itself and how it's been fine-tuned to do tool calls. One of the lead developers of Claude Code candidly admits this in a recent interview.[0] Of course, a ton of work goes into making these tools work well, but ultimately they all have the same simple core.

[0] https://www.youtube.com/watch?v=zDmW5hJPsvQ
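
To make the point concrete, here's roughly what that simple core looks like in Python against Anthropic's Messages API. This is a minimal sketch: the `read_file` tool, its schema, and the model alias are my own illustrative choices, and error handling is omitted.

    import anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

    # One illustrative tool; a real coding agent registers several
    # (read file, list directory, edit file, run command, ...).
    tools = [{
        "name": "read_file",
        "description": "Read and return the contents of the file at the given relative path.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }]

    def run_tool(name, args):
        if name == "read_file":
            with open(args["path"]) as f:
                return f.read()
        raise ValueError(f"unknown tool: {name}")

    messages = [{"role": "user", "content": "Summarize main.py for me."}]

    while True:  # this loop is the whole "agent"
        resp = client.messages.create(
            model="claude-3-7-sonnet-latest",  # model alias may need updating
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        # Echo the assistant turn back into the transcript.
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":
            print(resp.content[0].text)  # model is done talking
            break
        # Execute every requested tool call and hand the results back.
        messages.append({"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": run_tool(b.name, b.input)}
            for b in resp.content if b.type == "tool_use"
        ]})

Everything else - more tools, better prompts, permission checks, context management - is refinement of this loop.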

vidarh 12 hours ago

There's a Ruby port of the agent from the first article you linked as well. Feature-wise they're about the same, but if you (like me) enjoy Ruby more than Python, it's worth reading both articles:

https://news.ycombinator.com/item?id=43984860

https://radanskoric.com/articles/coding-agent-in-ruby

forgingahead 12 hours ago

Love to see the Ruby implementations! Thanks for sharing.

ichiwells 5 hours ago

Thank you so much for sharing this!

We're using Ruby to build a powerful AI toolset in the construction space. We love how simple all of the SaaS parts are and that we're not reinventing the wheel, but the Ruby LLM SDK ecosystem is lagging a bit, so we've written a lot of our own low-level tools.

(btw we are also hiring rubyists https://news.ycombinator.com/item?id=43865448)

datpuz 22 hours ago

Can't think of anything an LLM is good enough at to let it run on its own in a loop for more than a few iterations before I need to rein it back in.

hbbio 20 hours ago

That's why in practice you need more than this simple loop!

It's still pretty much WIP, but I am experimenting with simple sequence-based workflows that are designed to frequently reset the conversation [2].

This goes well with the Microsoft paper "LLMs Get Lost in Multi-Turn Conversation", published Friday [1].

- [1]: https://arxiv.org/abs/2505.06120

- [2]: https://github.com/hbbio/nanoagent/blob/main/src/workflow.ts

vidarh 12 hours ago

They've written most of the recent iterations of X11 bindings for Ruby, including a complete, working example of a systray for me.

They also added the first pass of multi-monitor support for my WM while I was using it (I restarted the WM repeatedly while Claude Code worked, in the same X session as the terminal it was running in).

You do need to rein them back in, sure, but they can often go multiple iterations before they're ready to make changes to your files, once you've approved safe tool uses etc.

TZubiri 1 hour ago

How do they read the screen?

datpuz 8 hours ago

Agents? Doubt.

vidarh 8 hours ago

You can doubt it all you want - it doesn't make it any less true.

datpuz 6 hours ago

Can you provide a source?

Groxx 21 hours ago

They're extremely good at burning through budgets, and get even better when unattended

_kb 20 hours ago

Maximising paperclip production too.

mycall 20 hours ago

Is that really true? I thought there were free models and $200 all-you-can-eat models.

nsomaru 19 hours ago

These tools require API calls which usually aren’t priced like the consumer plans

never_inline 5 hours ago

Well, technically Aider lets you use a web chat UI by generating some context and letting you paste back and forth.

adastra22 19 hours ago

Yeah they’re cheaper. I’ve written whole apps for $0.20 in API calls.

monsieurbanana 9 hours ago

With which agent? What kind of apps?

Without more information I'm very skeptical that you had e.g. Claude Code create a whole app (so more than a simple script) with 20 cents. Unless it was able to one-shot it, but at that point you don't need an agent anyway.

adastra22 6 hours ago

Aider, Claude 3.7.

datpuz 6 hours ago

I've "written" whole apps by going to GitHub, cloning a repo, right clicking, and renaming it to "MyApp." Impressed?

jfim 19 hours ago

Claude Code is now part of the consumer $100/mo Max plan.

Aeolun 16 hours ago

If they give me API access too I’m sold xD

piuantiderp 16 hours ago

I've read that you can very quickly blow through the budget on the $200/mo ones too.

CuriouslyC 22 hours ago

The main problem with agents is that they aren't reflecting on their own performance and pausing their own execution to ask a human for help aggressively enough. Agents can run on for 20+ iterations in many cases successfully, but also will need hand holding after every iteration in some cases.

They're a lot like a human in that regard, but we haven't been building that reflection and self-awareness into them so far, so it's like a junior who doesn't realize when they're in over their depth and should get help.

vendiddy 14 hours ago

I think they are capable of doing it, but it requires prompting.

I constantly have to instruct them:

- Go step by step, don't skip ahead until we're done with a step
- Don't make assumptions; if you're unsure, ask questions to clarify

And they mostly do this.

But this needs to be default behavior!

I'm surprised that, unless prompted, LLMs never seem to ask follow-up questions as a smart coworker might.

ariwilson 20 hours ago

Is there value in adding an overseer LLM that measures the progress every n steps and, if it's too low, stops and calls out to a human?

CuriouslyC 16 hours ago

I don't think you need an overseer for this, you can just have the agent self-assess at each step whether it's making material progress or if it's caught in a loop, and if it's caught in a loop, pause and emit a prompt for help from a human. This would probably require a bit of tuning, and the agents need to be set up with a blocking "ask for help" function, but it's totally doable.
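
As a purely hypothetical sketch (the tool name and wiring are mine, not from any particular agent), that blocking "ask for help" function can be exposed as just another tool in the loop:

    # Expose "ask for help" as an ordinary tool; calling it blocks the loop
    # until a person answers, which is the whole point.
    ask_human_tool = {
        "name": "ask_human",
        "description": (
            "Call this when you are stuck, looping, or not making material "
            "progress. Describe what you tried and what you need; execution "
            "pauses until a human responds."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"question": {"type": "string"}},
            "required": ["question"],
        },
    }

    def run_tool(name, args):
        if name == "ask_human":
            print(f"\n[agent is asking for help] {args['question']}")
            return input("your answer> ")  # blocks deliberately
        ...  # dispatch to the other tools

The self-assessment half can then be a single system-prompt instruction, e.g. "after each step, state whether you made material progress; if you are stuck or looping, call ask_human".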

p_v_doom 15 hours ago

Bruh, we're inventing robot PMs for our robot developers now? We're so fucked

suninsight 14 hours ago

Yes, it works really well. We do something like that at NonBioS.ai - longer post below. The agent self-reflects on whether it is stuck or confused and calls out to the human for help.

solumunus 20 hours ago

And how does it effectively measure progress?

NotMichaelBay 20 hours ago

It can behave just like a senior role would - produce the set of steps for the junior to follow, and assess if the junior appears stuck at any particular step.

CuriouslyC 16 hours ago

I have actually had great success with agentic coding by sitting down with an LLM to tell it what I'm trying to build and have it be Socratic with me, really trying to ask as many questions as it can think of to help tease out my requirements. While it's doing this, it's updating the project README to outline this vision and create a "planned work" section that is basically a roadmap for an agent to follow.

Once I'm happy that the README accurately reflects what I want to build and all the architectural/technical/usage challenges have been addressed, I let the agent rip, instructing it to build one thing at a time, then typecheck, lint and test the code to ensure correctness, fixing any errors it finds (and re-running automated checks) before moving on to the next task. Given this workflow I've built complex software using agents with basically no intervention needed, with the exception of rare cases where its testing strategy is flaky in a way that makes it hard to get the tests passing.
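
Mechanically, the "build one thing at a time, then check" part of that workflow is just a loop around the toolchain. A rough sketch, where `agent.run` and `roadmap_tasks` are placeholders for whatever agent harness and README-derived task list you use:

    import subprocess

    CHECKS = [  # assumed project commands; substitute your own
        ["bun", "run", "lint"],
        ["bun", "run", "typecheck"],
        ["bun", "run", "test"],
    ]

    def first_failure():
        """Run each check in order; return its output on failure, None if all green."""
        for cmd in CHECKS:
            proc = subprocess.run(cmd, capture_output=True, text=True)
            if proc.returncode != 0:
                return proc.stdout + proc.stderr
        return None

    for task in roadmap_tasks:  # parsed from the README's "planned work" section
        agent.run(f"Implement exactly this task, nothing more: {task}")
        while (failure := first_failure()) is not None:
            agent.run(f"These checks failed; fix the code:\n{failure}")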

Xevion 16 hours ago

>I have actually had great success with agentic coding by sitting down with an LLM to tell it what I'm trying to build and have it be Socratic with me, really trying to ask as many questions as it can think of to help tease out my requirements.

Just curious, could you expand on the precise tools or way you do this?

For example, do you use the same well-crafted prompt in Claude or Gemini and use their in-house document curation features, or do you use a file in VS Code with Copilot Chat and just say "assist me in writing the requirements for this project in my README, ask questions, perform a socratic discussion with me, build a roadmap"?

You said you had 'great success' and I've found AI to be somewhat underwhelming at times, and I've been wondering if it's because of my choice of models, my very simple prompt engineering, or if my inputs are just insufficient/too complex.

CuriouslyC 15 hours ago

I use Aider with a very tuned STYLEGUIDE.md and an AI rules document that basically outlines this whole process, so I don't have to instruct it every time. My preferred model is Gemini 2.5 Pro, which is definitely by far the best model for this sort of thing (Claude can one-shot some stuff about as well, but for following an engineering process and responding to test errors it's vastly inferior).

vendiddy 14 hours ago

How do you find Aider compares to Claude Code?

CuriouslyC 11 hours ago

I like Aider's configurability: I can chain a lot of static analysis stuff together with it and have the model fix all of it, and I can have 2-4 Aider windows open in a grid and run them all at once - not sure how that'd work with Claude Code. Also, Aider managing everything with git commits is great.

TeMPOraL 10 hours ago

Can you talk more about the workflow you're using? I'm using Aider routinely myself, but with a relatively unsophisticated approach. One thing that annoys me a bit is that the prompts aren't obviously customizable - I'm pretty sure the standard ones, which include code examples in 2 or 3 different languages, confuse LLMs a bit when I work on a codebase that doesn't use those languages.

CuriouslyC 9 hours ago

I use a STYLEGUIDE.md document, which contains general software engineering principles like you might provide for human contributors in an open source project. I pair that with a .cursorrules file (people I code with use it, so I use that file name for their convenience) that describes how the LLM should interact with me:

# Cursor Rules for This Project

  You are a software engineering expert. Your role is to work with your partner engineer to maximize their productivity, while ensuring the codebase remains simple, elegant, robust, testable, maintainable, and extensible to sustain team development velocity and deliver maximum value to the employer.
## Overview

  During the design phase, before being instructed to implement specific code:
  - Be highly Socratic: ask clarifying questions, challenge assumptions, and verify understanding of the problem and goals.
  - Seek to understand why the user proposes a certain solution.
  - Test whether the proposed design meets the standards of simplicity, robustness, testability, maintainability, and extensibility.
  - Update project documentation: README files, module docstrings, Typedoc comments, and optionally generate intermediate artifacts like PlantUML or D2 diagrams.

  During the implementation phase, after being instructed to code:
  - Focus on efficiently implementing the requested changes.
  - Remain non-Socratic unless the requested code appears to violate design goals or cause serious technical issues.
  - Write clean, type-annotated, well-structured code and immediately write matching unit tests.
  - Ensure all code passes linting, typechecking and tests.
  - Always follow any provided style guides or project-specific standards.
## Engineering Mindset

  - Prioritize *clarity, simplicity, robustness, and extensibility*.
  - Solve problems thoughtfully, considering the long-term maintainability of the code.
  - Challenge assumptions and verify problem understanding during design discussions.
  - Avoid cleverness unless it significantly improves readability and maintainability.
  - Strive to make code easy to test, easy to debug, and easy to change.

## Design First

  - Before coding, establish a clear understanding of the problem and the proposed solution.
  - When designing, ask:
    - What are the failure modes?
    - What will be the long-term maintenance burden?
    - How can this be made simpler without losing necessary flexibility?
  - Update documentation during the design phase:
    - `README.md` for project-level understanding.
    - Architecture diagrams (e.g., PlantUML, D2) are encouraged for complex flows.

I use auto lint/test in aider like so:

    file:
      - README.md
      - STYLEGUIDE.md
      - .cursorrules

    aiderignore: .gitignore

    # Commands for linting, typechecking, testing
    lint-cmd:
      - bun run lint
      - bun run typecheck

    test-cmd: bun run test

TeMPOraL 4 hours ago

Thanks. It's roughly similar to what I do then, except I haven't really gotten used to linting and testing with aider yet - first time I tried (many months ago), it seemed to do weird things, so I wrote the feature off for now, and promised myself to revisit it someday. Maybe now it's a good time.

Since you shared yours, it's only fair to share mine :). In my current projects, two major files I use are:

[[ CONVENTIONS.md ]] -- tends to be short and project-specifics; looks like this:

Project conventions

- Code must run entirely client-side (i.e. in-browser)

- Prefer solutions not requiring a build step - such as vanilla HTML/JS/CSS

- Minimize use of dependencies, and vendor them

  E.g. if using HTMX, ensure (by providing instructions or executing commands) it's downloaded into the project sources, and referenced accordingly, as opposed to being loaded client-side from a CDN. I.e. `js/library.js` is OK, `https://cdn.blahblah/library.js` is not.

[[ AI.md ]] -- this I guess is similar to what people put in .cursorrules; mine looks like this:

# Extra AI instructions

Here are stored extra guidelines for you.

## AI collaborative project

I'm relying on you to do a good job here and I'm happy to embrace the directions you're giving, but I'll be editing it on my own as well.

## Evolving your instruction set

If I tell you to remember something, behave differently, or you realize yourself you'd benefit from remembering some specific guideline, please add it to this file (or modify existing guideline). The format of the guidelines is unspecified, except second-level headers to split them by categories; otherwise, whatever works best for you is best. You may store information about the project you want to retain long-term, as well as any instructions for yourself to make your work more efficient and correct.

## Coding Practice Guidelines

Strive to adhere to the following guidelines to improve code quality and reduce the need for repeated corrections:

    - **Adhere to project conventions and specifications**
      * Conventions are outlined in file `CONVENTIONS.md`
      * Specification, if any, is available in file `SPECIFICATION.md`.
        If it doesn't exist, consider creating one anyway based on your understanding of
        what user has in mind wrt. the project. Specification will double as a guide / checklist
        for you to know if what needed to be implemented already is.

    - **Build your own memory helpers to stay oriented**
      * Keep "Project Files and Structure" section of this file up to date;
      * For larger tasks involving multiple conversation rounds, keep a running plan of your work
        in a separate file (say, `PLAN.md`), and update it to match the actual plan.
      * Evolve guidelines in "Coding Practice Guidelines" section of this file based on user feedback.

    - **Proactively Apply DRY and Abstraction:**
      * Actively identify and refactor repetitive code blocks into helper functions or methods.

    - **Meticulous Code Generation and Diff Accuracy:**
      * Thoroughly review generated code for syntax errors, logical consistency, and adherence
        to existing conventions before presenting it.
      * Ensure `SEARCH/REPLACE` blocks are precise and accurately reflect the changes against
        the current, exact state of the provided files. Double-check line endings, whitespace,
        and surrounding context.

    - **Modularity for Improved Reliability of AI Code Generation**
      * Unless instructed otherwise in project conventions, aggressively prefer dividing source
        code into files, each handling a concern or functionality that might need to be worked
        in isolation. The goal is to minimize unnecessary code being pulled into context window,
        and reduce chance of confusion when generating edit diffs.
      * As codebase grows and things are added and deleted, look for opportunities to improve
        project structure by further subdivisions or rearranging the file structure; propose
        such restructurings to the user after you're done with changes to actual code.
      * Focus on keeping things that are likely to be independently edited separate. Examples:
        - Keeping UI components separate, and within each, something a-la MVC pattern
          might make sense, as display and input are likely to be independent from
          business logic;
      * Propose and maintain utility libraries for functions shared by different code files/modules.
        Examples:
        - Display utilities used by multiple views of different components;

    - **Clear Separation of Concerns:**
    *   Continue to adhere to the project convention of separating concerns
        into different source files.
    *   When introducing new, distinct functionalities propose creating new
        files for them to maintain modularity.

    - **Favor Fundamental Design Changes Over Incremental Patches for Flawed Approaches:**
      * If an existing approach requires multiple, increasingly complex fixes
        to address bugs or new requirements, pause and critically evaluate if
        the underlying design is sound.
      * Be ready to propose and implement more fundamental refactoring or
        a design change if it leads to a more robust, maintainable, and extensible solution,
        rather than continuing with a series of local patches.

    - **Design for Foreseeable Complexity (Within Scope):**
      * While adhering to the immediate task's scope ("do what they ask, but no more"),
        consider the overall project requirements when designing initial solutions.
      * If a core feature implies future complexity (e.g., formula evaluation, reactivity),
        the initial structures should be reasonably accommodating of this, even if the first
        implementation is a simplified version. This might involve placeholder modules or
        slightly more robust data structures from the outset.
## Project platform note

This project is targeting a Raspberry Pi 2 Model B V1.1 board with a 3.5 inch TFT LCD touchscreen sitting on top. That touchscreen is enabled/configured via a system overlay and "just works", and is currently drawn to via a framebuffer approach.

Keep in mind that the Raspberry Pi board in question is old and can only run 32-bit code. Relevant specs:

    - CPU - Broadcom BCM2836 Quad-core ARM Cortex-A7 CPU
    - Speed - 900 MHz
    - OS - Raspbian GNU/Linux 11 (bullseye)
    - Python - 3.9.2 (Note: This version does not support `|` for type hints; use `typing.Optional` instead.
      Avoid features from Python 3.10+ unless explicitly polyfilled or checked.)
    - Memory - 1GB
    - Network - 100Mbps Ethernet
    - Video specs - H.264, MPEG-4 decode (1080p30); H.264 encode (1080p30), OpenGL ES 2.0
    - Video ports - 1 HDMI (full-size), DSI
    - Ports - 4 x USB 2.0, CSI, 4-pole audio/video
    - GPIO - 40-pin (mostly taken by the TFT LCD screen)
    - Power - Micro USB 5 V/2.5 A DC, 5 V via GPIO
    - Size - 85.60 × 56.5mm
The board is dedicated to running this project and any supplementary tooling. There's a Home Assistant instance involved in the larger system this is deployed to, but that's running on a different board.

## Project Files and Structure

This section outlines the core files of the project.

<<I let the AI put its own high-level "repo map" here, as recently, I found Aider has not been generating any useful repo maps for me for unknown reasons.>>

-------

This file ends up evolving from project to project, and it's not as project-independent as I'd like; I let the AI add guidelines to this file based on discussion (e.g. when it's doing something systematically wrong, I point it out and tell it to remember). Also note that a lot of the guidelines are focused on keeping projects broken down into a) lots of files, to reduce context use as the project grows, and b) small, well-structured files, to minimize the number of broken SEARCH/REPLACE diff blocks - something that's still a problem with Aider for me, despite models getting better.

I usually start by going through the initial project ideas in "ask mode", then letting it build the SPECIFICATION.md document and a PLAN.md document with a 2-level (task/subtask) work breakdown.

chongli 20 hours ago

Producing the set of steps is the hard part. If you can do that, you don’t need a junior to follow it, you have a program to execute.

adastra22 19 hours ago

It is a task that LLMs are quite good at.

Jensson 16 hours ago

If the LLM could actually generate good steps that drive forward progress, then there would be no problem making agents; but agents are really bad, so LLMs can't be good at that.

If you feel those tips are good, then you are just a bad judge of tips. There is a reason self-help books sell so well even though they don't really help anyone: their goal is to write a lot of tips that sound good because they're vague and general, but they don't really help the reader.

adastra22 15 hours ago

I use agentic LLMs every single day and get tremendous value. Asking the LLM to produce a set of bite-sized tasks with built-in corrective reminders is something that they're really good at. It gives good results.

I'm sorry if you're using it wrong.

TeMPOraL 15 hours ago

Seconding. In the past months, when using Aider, I've been using the approach of discussing a piece of work (new project, larger change), and asking the model to prepare a plan of action. After possibly some little back and forth, I approve the plan and ask LLM to create or update a specification document for the project and a plan document which documents a sequence of changes broken down into bite-sized tasks - the latter is there to keep both me and the LLM on track. With that set, I can just keep repeatedly telling it to "continue implementation of the plan", and it does exactly that.

Eventually it'll do something wrong or I realize I wanted things differently, which necessitates some further conversation, but other than that, it's just "go on" until we run out of plan, then devising a new plan, rinse repeat.

adastra22 6 hours ago

This is pretty much what I do. It works very well.

abletonlive 19 hours ago

If this were true, then we wouldn't have senior engineers who delegate. My suggestion is to think a couple more cycles before hitting that reply button. It'll save us all from reading obviously and confidently wrong statements.

guappa 16 hours ago

AIs aren't real people… You delegate to real people because you can't just rsync their knowledge.

Only on this website of completely reality-detached individuals would such an obvious comment be needed.

abletonlive 4 hours ago

So... you don't think you can give LLMs more knowledge? You're the one operating in a detached reality. The reality is that a ton of engineers are finding LLMs useful, such as the author.

Maybe consider that if you don't find them useful, you're working on problems they're not good at, or, even more likely, you just suck at using the tools.

Anybody who gets value out of LLMs has a hard time understanding how one would conclude they are useless and that you can't "give them instructions because that's the hard part", but it's actually really easy to understand: the folks who think this are just bad at it. We aren't living in some detached reality. The reality is that some people are just better than others.

TeMPOraL 16 hours ago

Senior engineers delegate in part because they're coaxed into a faux-management role (all of the responsibilities, none of the privileges). Coding is done by juniors; by the time anyone gains enough experience to finally begin to know what they're doing, they're relegated to "mentoring" and "training" new cohort of fresh juniors.

Explains a lot about software quality these days.

abletonlive 4 hours ago

Or, you know, they are leading big initiatives and can't do it all by themselves. Seniors can also delegate to other seniors. I am beyond senior, with 11 YOE, and still code on a ton of my initiatives.

eru 18 hours ago

The hope is that the ground truth from calling out to tools (like compilers or test runs) will eventually be enough to keep them on track.

Just like humans and human organisations also tend to experience drift, unless anchored in reality.

mkagenius 22 hours ago

I built android-use[1] using an LLM. It is pretty good at self-healing due to the "loop": it constantly checks whether the current step actually made progress or regressed, and then determines the next step. And the thing is, nothing is explicitly coded; it's just a nudge in the prompts.

1. clickclickclick - A framework to let local LLMs control your Android phone (https://github.com/BandarLabs/clickclickclick)
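
The progress/regress check can literally be one extra model call per step. A hypothetical sketch of the idea (not the actual clickclickclick code; `llm.complete` is a stand-in for whatever completion call you use):

    def step_was_progress(llm, goal, before, after):
        """One extra model call to judge the last action; nothing hard-coded."""
        verdict = llm.complete(
            f"Goal: {goal}\n"
            f"Screen state before the last action: {before}\n"
            f"Screen state after the last action: {after}\n"
            "Did the last action move us toward the goal? "
            "Answer PROGRESS or REGRESS, then propose the next step."
        )
        return verdict.startswith("PROGRESS"), verdict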

loa_in_ 16 hours ago

You don't have to. Most of the appeal is automatically applying fixes like "touch file; make" after spotting a trivial mistake. Just let it at it.

JeremyNT 9 hours ago

Definitely true currently, which is why there's so much focus on using them to write real code that humans have to actually commit and put their names on.

Longer term, I don't think this holds due to the nature of capitalism.

If given a choice between paying for an LLM to do something that's mostly correct versus paying for a human developer, businesses are going to choose the former, even if it results in accelerated enshittification. It's all in service of reducing headcount and taking control of the means of production away from workers.

meander_water 1 day ago

There's also this one, which uses PocketFlow, a graph abstraction library, to create something similar [0]. I've been using it myself and love the simplicity of it.

[0] https://github.com/The-Pocket/PocketFlow-Tutorial-Cursor/blo...

wepple 1 day ago

Ah, it’s Thorsten Ball!

I thoroughly enjoyed his “writing an interpreter”. I guess I’m going to build an agent now.

orange_puff 17 hours ago

I have been trying to find such an article for so long, thank you! I think a common reaction to agents is "well, it probably cannot solve a really complex problem very well". But to me, that isn't the point of an agent. LLMs function really well with a lot of context, and an agent allows the LLM to discover more context and improve its ability to answer questions.

xnx 10 hours ago

> The reason is that there is no secret sauce and 95% of the magic is in the LLM itself

Makes that "$3 billion" valuation for Windsurf very suspect

rrrx3 9 hours ago

Indeed. But keep in mind they weren't just buying the tooling - they get the team, the brand, and the positional authority as well. OpenAI could have spun up a team to build an agentic code IDE, but they would have been starting on the back foot with users and been compared to Cursor/Windsurf...

The price tag is hefty but I figure it'll work out for them on the backside because they won't have to fight so hard to capture TAM.

TonyEx 9 hours ago

The value in the Windsurf acquisition isn't the code they've written, it's the ability to see what people are coding and use that information to build better LLMs -- product development.

sesm 1 day ago

Should we change the link above to use `?utm_source=hn&utm_medium=browser` before opening it?

libraryofbabel 1 day ago

fixed :)

deadbabe 12 hours ago

Generally, when LLMs are effective like this, it means a more efficient non-LLM-based solution to the problem exists using the tools you have provided. The LLM helps you find the series of steps and the synthesis of inputs and outputs to make it happen.

It is expensive and slow to have an LLM use tools all the time to solve the problem. The next step is to convert frequent patterns of tool calls into a single pure function, performing whatever transformations of inputs and outputs are needed along the way (an LLM can help you build these functions), and then perhaps train a simple, cheap classifier to always send incoming data to this new function, bypassing LLMs altogether.

In time, this will mean you use LLMs less and less, limiting their use to new problems that can't be classified. This is basically a "cache" for LLM-based problem solving, where the keys are the shapes of problems.

The idea of LLMs running 24/7 solving the same problems in the same way over and over again should become a distant memory, though not one that an AI company with a vested interest in selling as many API calls as possible will want people to envision. Ideally, LLMs only need to be employed once or a few times per novel problem before being replaced with cheaper code.
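
A sketch of what that "cache" dispatcher might look like, assuming you've already distilled some frequent tool-call patterns into pure functions (all names here are illustrative, not from any real library):

    def handle(request):
        # A cheap classifier (e.g. logistic regression over embeddings)
        # recognizes the "shape" of the incoming problem.
        label, confidence = classifier.predict(request)
        if label in COMPILED_HANDLERS and confidence > 0.9:
            # Known shape: run the distilled pure function, no LLM involved.
            return COMPILED_HANDLERS[label](request)
        # Novel shape: fall back to the full agent loop, and keep the
        # transcript so the tool-call pattern can be distilled later.
        result, transcript = agent_loop(request)
        pattern_log.append((request, transcript))
        return result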

TZubiri 1 hour ago

> there is the problem of getting that last 10% of reliability.

In my experience, that next 9% will take 9 times the effort.

And that next 0.9% will take 9 times the effort.

And so on.

So 90% is very far off from 99.999% reliability - which would still be less reliable than an EC2 instance.

gchamonlive 11 hours ago

How far can you go with the best models that fit in a consumer grade GPU (24GB vram)?

aibrother 1 day ago

thanks for the rec. and yeah agreed with the observations as well

kcorbitt 1 day ago

For "that last 10% of reliability" RL is actually working pretty well right now too! https://openpipe.ai/blog/art-e-mail-agent