suninsight 16 hours ago

It only seems effective until you start using it for actual work. The biggest issue is context. All tool use creates context, and large code bases come with large context right off the bat. LLMs seem to work fine until they are hit with sizeable context. Anything above 10k tokens and the quality seems to deteriorate.

The other issue is that LLMs can go off on a tangent. As context builds up, they forget what their objective was. One wrong turn, and down the rabbit hole they go, never to recover.

The reason I know is that we started solving these problems a year back. And we aren't done yet. But we have covered a lot of distance.

[Plug]: Try it out at https://nonbios.ai:

- Agentic memory → long-horizon coding

- Full Linux box → real runtime, not just toy demos

- Transparent → see & control every command

- Free beta — no invite needed. Works with throwaway email (mailinator etc.)

bob1029 13 hours ago

> One wrong turn, and down the rabbit hole they go, never to recover.

I think this is probably at the heart of the best argument against these things as viable tools.

Once you have sufficiently described the problem such that the LLM won't go the wrong way, you've likely already solved most of it yourself.

Tool use with error feedback sounds autonomous, but you'll quickly find that the error-handling layer is a thin proxy for the human operator's intentions.

suninsight 12 hours ago

Yes, but we don't believe this is a 'fundamental' problem. We have learnt to guide their actions a lot better, and they go down the rabbit hole a lot less now than when we started out.

k__ 13 hours ago

True, but on the other hand, there are a bunch of tasks that are just very typing-intensive and not really complex.

Especially in GUI development, building forms, charts, etc.

I could imagine that LLMs are a great help here.

toit4wing 15 hours ago

Looks interesting. How do you manage context?

suninsight 15 hours ago

So managing context is what takes the most effort. We use a bunch of strategies to reduce it, including, but not limited to:

1. A custom MCP server for working on the Linux command line. It wasn't really an 'MCP' server originally, because we started working on it before MCP was a thing, but that's the easiest way to explain it now. The MCP server is optimised to reduce context.

2. Guardrails to reduce context. Think of them as prompt alterations that give the LLM subtle hints to work with less context. The hints can be at a behavioural level or a task level.

3. Continuously pruning the built-up context to make the agent 'forget'. We believe that forgetting what is not important is a foundational capability.

This is loosely inspired by the science suggesting that humans use sleep to 'forget' memories that aren't useful, and that this is critical to keeping the brain healthy. It translates directly to LLMs: making them forget is critical to keeping them focused on the larger task and their actions aligned. A rough sketch of the pruning idea is below.
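To give a concrete flavour of the pruning idea, here is an illustrative sketch in Python. It is not our actual implementation; the message format, thresholds, and function names are made up. The idea is simply to always keep the objective and the most recent turns, and truncate stale tool output in between:

```python
# Illustrative sketch only: make an agent "forget" low-value context by
# collapsing stale tool output while keeping the objective and recent turns.
from typing import TypedDict


class Message(TypedDict):
    role: str      # "system", "user", "assistant", or "tool"
    content: str


def prune_context(messages: list[Message],
                  keep_recent: int = 10,
                  max_tool_chars: int = 500) -> list[Message]:
    """Shrink the conversation so the model stays focused on the task."""
    if len(messages) <= keep_recent + 1:
        return messages  # nothing old enough to prune

    head = messages[:1]               # assumed to hold the objective
    body = messages[1:-keep_recent]   # older turns: candidates for pruning
    tail = messages[-keep_recent:]    # recent turns: kept verbatim

    pruned: list[Message] = []
    for msg in body:
        if msg["role"] == "tool" and len(msg["content"]) > max_tool_chars:
            # Keep only a stub of large, stale tool output.
            pruned.append({"role": "tool",
                           "content": msg["content"][:max_tool_chars] + " ...[pruned]"})
        else:
            pruned.append(msg)

    return head + pruned + tail
```

In practice the pruning rules are more involved than a character cutoff, but the shape is the same: decide what the agent no longer needs and drop it before the next call.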

moffkalast 14 hours ago

Some of the thinking models might recover... with an extra 4k tokens used up in <thinking>. And even if they were stable at long contexts, the speed drops massively. You just can't win with this architecture lol.

suninsight 14 hours ago

That matches what we have found very closely. <thinking> models do a lot better, but with huge speed drops. For now, we have chosen accuracy over speed. But the speed drop is something like 3-4x, so we might move to an architecture where we 'think' only sporadically.
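Roughly, "thinking sporadically" would look something like the hypothetical sketch below: run fast by default, and switch on extended thinking on a fixed cadence or right after a failed step. `llm_step` is a stand-in, not a real API.

```python
# Hypothetical sketch of sporadic thinking: fast mode by default, extended
# thinking every few steps or whenever the previous step failed.
import random


def llm_step(extended_thinking: bool) -> bool:
    """Stand-in for one agent step; returns True if the step succeeded."""
    # Pretend extended thinking is more reliable (and slower).
    return random.random() < (0.9 if extended_thinking else 0.7)


def run_agent(total_steps: int = 20, think_every: int = 5) -> None:
    last_step_failed = False
    for step in range(total_steps):
        # Think on a fixed cadence, or whenever the previous step failed.
        use_thinking = last_step_failed or step % think_every == 0
        ok = llm_step(extended_thinking=use_thinking)
        print(f"step {step}: thinking={use_thinking} ok={ok}")
        last_step_failed = not ok


if __name__ == "__main__":
    run_agent()
```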

Everything happening in the LLM space is so close to how humans think naturally.