fxtentacle 8 days ago

For me, a team of junior developers that refuse to learn from their mistakes is the fuel of nightmares. I'm stuck in a loop where every day I need to explain to a new hire why they made the exact same beginner's mistake the last person made the day before. Eventually, I'd rather spend half an hour of my own time than explain the problem once more...

Why anyone thinks having 3 different PRs for each Jira ticket might boost productivity is beyond me.

Related anime: I May Be a Guild Receptionist, But I'll Solo Any Boss to Clock Out on Time

simonw 8 days ago

One of the (many) differences between junior developers and LLM assistance is that humans can learn from their mistakes, whereas with LLMs it's up to you as the prompter to learn from their mistakes.

If an LLM screws something up you can often adjust the prompt to avoid that particular problem in the future.

skerit 7 days ago

> One of the (many) differences between junior developers and LLM assistance is that humans can learn from their mistakes

One would think so, but I've had some developers repeat the same mistake a hundred times, until eventually they admit they just keep forgetting it.

The frustration you feel when telling a human for the Xth time that we do not allow Yoda conditions in our codebase is incredibly similar to the frustration when an AI does something wrong.

albrewer 7 days ago

> often

Often being about 30% of the time in my experience

noodletheworld 8 days ago

It may not be as stupid as it sounds.

Randomising LLM outputs (temperature) results in outputs that will always have some degree of hallucination.

That’s just math. You can’t mix a random factor in and magically expect it to not exist. There will always be p(generates random crap) > 0.

However, in any probabilistic system, you can run a function k times and you’ll get an output distribution that is meaningful if k is high enough.

3 is not high enough.

At 3, this is stupid; all you’re observing is random variance.

…but, in general, running the same prompt multiple times and taking some kind of general solution from the distribution isn’t totally meaningless, I guess.

The thing with LLMs is they scale in a way that actually allows this to be possible, in a way that scaling with humans can’t.

… like the monkeys and Shakespeare, there's probably a limit to the value it can offer; but it's not totally meaningless to try it.
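
To make the best-of-k idea concrete, here's a minimal sketch; generate() and score() are placeholders (the former would be an LLM call at temperature > 0, the latter could be as dumb as "how many tests pass"):

  # Minimal best-of-k sketch. generate() and score() are placeholders:
  # generate() would hit an LLM with temperature > 0, score() could be
  # "number of tests passing" or an LLM-based judge.
  def best_of_k(prompt, k, generate, score):
      candidates = [generate(prompt) for _ in range(k)]
      return max(candidates, key=score)

The whole argument about whether k=3 is enough is really an argument about how often max() lands on something genuinely good rather than the least-bad of three bad draws.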

horsawlarway 8 days ago

I think this is an interesting idea, but I also somewhat suspect you've replaced a tedious problem with a harder, more tedious problem.

Take your idea further. Now I've got 100 agents, and 100 PRs, and some small percentage of them are decent. The task went from "implement a feature" to "review 100 PRs and select the best one".

Even assuming you can ditch 50 percent right off the bat as trash... Reviewing 50 potentially buggy implementations of a feature and selecting the best genuinely sounds worse than just writing the solution.

Worse... If you haven't solved the problem before anyways, you're woefully unqualified as a reviewer.

namaria 7 days ago

Theory of constraints is clear: speeding up something that isn't a bottleneck worsens system performance.

The idea that too little code is the problem is the problem. Code is liability. Making more of it faster (and probabilistic) is a fantastically bad idea.

cyanydeez 6 days ago

What if we replace the PR with a QA test rig? Then hire a bunch of QA monkeys to find bugs in them and select the "bug free" one.

skeledrew 8 days ago

There should be test cases run and coverage ensured. This is trivially automated. LLMs should also review the PRs, at least initially, using the test results as part of the input.
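
A rough sketch of what "test results as part of the reviewer's input" could look like, assuming pytest with pytest-cov as the runner and a hypothetical call_llm() helper:

  import subprocess

  def review_pr(diff_text, call_llm):
      # Run the test suite with coverage; pytest-cov is assumed here.
      result = subprocess.run(["pytest", "--cov", "-q"],
                              capture_output=True, text=True)
      prompt = ("Review this PR. Use the test and coverage output below "
                "as evidence.\n\nDIFF:\n" + diff_text +
                "\n\nTEST RESULTS:\n" + result.stdout)
      # call_llm() is a placeholder for whatever model API is in use.
      return call_llm(prompt)

Wiring this up is the easy part; whether the review output is worth anything is the contested bit.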

lolinder 7 days ago

Who tests the tests? How do you know that the LLM-generated tests are actually asserting anything meaningful and cover the relevant edge cases?

The tests are part of the code that needs to be reviewed in the PR by a human. They don't solve the problem, they just add more lines to the reviewer's job.

horsawlarway 8 days ago

So now either the agent is writing the tests, in which case you're right back to the same issue (which tests are actually worth running?) or your job is now just writing tests (bleh...).

And for the LLM review of the PR... Why do you assume it'll be worth any more than the original implementation? Or are we just recursing down a level again (if 100 LLMs review each of the 100 PRs... To infinity and beyond!)

This by definition is not trivially automated.

skeledrew 8 days ago

The LLMs can help with the writing of the tests, but you should verify that they're testing critical aspects and that known edge cases are covered. A single review-promoted LLM can then utilize those across the PRs and provide a summary recommending the best for acceptance. Or discard them all and do it manually; that initial process should only have taken a few minutes, so there's minimal wastage in the grand scheme of things, given that over time there will be a decent number of acceptances, compared to the alternative of 100% manual effort and the associated time sunk.

CognitiveLens 7 days ago

The linked article from Steve Yegge (https://sourcegraph.com/blog/revenge-of-the-junior-developer) provides a 'solution', which he thinks is also imminent - supervisor AI agents, where you might have 100+ coding agents creating PRs, but then a layer of supervisors that are specialized on evaluating quality, and the only PRs that a human being would see would be the 'best', as determined by the supervisor agent layer.

From my experience with AI agents, this feels intuitively possible - current agents seem to be ok (though not yet 'great') at critiquing solutions, and such supervisor agents could help keep the broader system in alignment.

kgeist 7 days ago

>but then a layer of supervisors that are specialized on evaluating quality

Why would supervisor agents be any better than the original LLMs? Aren't they still prone to hallucinations and subject to the same limitations imposed by training data and model architecture?

It feels like it just adds another layer of complexity and says, "TODO: make this new supervisor layer magically solve the issue." But how, exactly? If we already know the secret sauce, why not bake it into the first layer from the start?

pama 7 days ago

Similar to how human brains behave, it is easier to train a model to select the better solution among many choices than to check an individual solution for correctness [1], which is in turn an easier task to learn than writing a correct single solution in the first place.

[1] the diffs in logic can suggest good ideas that may have been missed in subsets of solutions.

immibis 7 days ago

Just add a CxO layer that monitors the supervisors! And the board of directors watches the CEO and the shareholders monitor the board of directors. It's agents all the way up!

CuriouslyC 7 days ago

LLMs are smarter in hindsight than going forward, sort of like humans! Only they don't have such flexible self-reflection loops, so they tend to fall into local minima more easily.

sdesol 7 days ago

This reads like it could result in "the blind, leading the blind". Unless the Supervisor AI agents are deterministic, it can still be a crapshoot. Given the resources that SourceGraph has, I'm still surprised they missed the most obvious thing, which is "context is king" and we need tooling that can make adding context to LLMs dead simple. Basically, we should be optimizing for the humans in the loop.

Agents have their place for trivial and non-critical fixes/features, but the reality is, unless the agents can act in a deterministic manner across LLMs, you really are coding with a loaded gun. The worst is, agents can really dull your senses over time.

I do believe in a future where we can trust agents 99% of the time, but the reality is, we are not training on the thought process required for this to become a reality. That is, we are not focused on conversation-to-code training data. I would say 98% of my code is AI generated, and it is certainly not vibe coding. I don't have a term for it, but I am literally dictating to the LLM what I want done and having it fill in the pieces. Sometimes it misses the mark, sometimes it aligns, and sometimes it introduces whole new ideas that I had never thought of, which lead to a better solution. The instructions that I provide are based on my domain knowledge, and I think people are missing the mark when they talk about vibe coding in a professional context.

Full Disclosure: I'm working on improving the "conversation to code" process, so my opinions are obviously biased, but I strongly believe we need to first focus on better capturing our thought process.

CognitiveLens 5 days ago

I'm skeptical that we would need determinism in a supervisor in order for it to be useful. I realize it's not exactly analogous, but the current human parallel, with senior/principal/architect-level SWEs reviewing code from less experienced devs (or even similarly- or more-experienced devs), is far from deterministic, yet it certainly improves quality.

Think about how differently a current agent behaves when you say "here is the spec, implement a solution" vs "here is the spec, here is my solution, make refinements" - you get very different output, and I would argue that the 'check my work' approach tends to have better results.

AdieuToLogic 7 days ago

> It may not be as stupid as it sounds.

It is.

> However, in any probabilistic system, you can run a function k times and you’ll get an output distribution that is meaningful if k is high enough.

This is the underlying flaw in this approach. Attempting to use probabilistic algorithms to produce a singular verifiably correct result requires an external agent to select what is correct in the output of "k times" invocations. That agent has to be a person capable of making said determination.

> The thing with LLMs is they scale in a way that actually allows this to be possible, in a way that scaling with humans can’t.

For the "k times" generation of text part, sure. Not for the determination of which one within k, if any, are acceptable for the problem at hand.

EDIT: clarified "produce a verifiably correct result" to be "produce a singular verifiably correct result"

AdieuToLogic 7 days ago

> like the monkeys and Shakespeare, there's probably a limit to the value it can offer; but it's not totally meaningless to try it.

Whenever someone uses this analogy, a question never addressed is:

  Assuming sufficient monkeys, typewriters, and time, how
  would anyone know if a Shakespearean work was produced
  unless one reviewed *all* output?

fn-mote 7 days ago

> taking some kind of general solution from the distribution

My instinct is that this should be the temperature 0K response (no randomness).

regularfry 8 days ago

Three is high enough, in my eyes. Two might be. Remember that we don't care about any but the best solution. With one sample you've only got a 50/50 shot of getting above the median. With three, the odds of the best of the three being above the median are 87.5%.

Of course picking the median as the random crap boundary is entirely arbitrary, but it'll do until there's a justification for a better number.
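
The arithmetic, for the curious: the best of k independent samples falls below the median only if all k do, so P(best above median) = 1 - 0.5^k. A quick check:

  for k in (1, 2, 3, 5, 10):
      print(k, 1 - 0.5 ** k)   # 0.5, 0.75, 0.875, ~0.97, ~0.999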

bbatchelder 8 days ago

Even with human junior devs, ideally you'd maintain some documentation about common mistakes/gotchas so that when you onboard new people to the team they can read that instead of you having to hold their hand manually.

You can do the same thing for LLMs by keeping a file with those details available and included in their context.

You can even set up evaluation loops so that entries can be made by other agents.
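
In its simplest form that's just a file prepended to every task prompt; a sketch, where LESSONS.md and call_llm() are placeholder names:

  from pathlib import Path

  def build_prompt(task, lessons_path="LESSONS.md"):
      # Common mistakes/gotchas, maintained by humans (or, with an
      # evaluation loop, appended to by other agents).
      lessons = Path(lessons_path).read_text()
      return ("Project conventions and known gotchas:\n" + lessons +
              "\n\nTask:\n" + task)

An evaluation loop then just becomes something that appends to (or prunes) that file.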

cyanydeez 6 days ago

The problem with LLMs is they have a de facto recency bias, so they only really remember the last few things you said. I've yet to see any real LLM manage a proper listing of rules.

ghuntley 7 days ago

correct.

abc-1 8 days ago

Darn I wonder if systems could be modified so that common mistakes become less common or if documentation could be written once and read multiple times by different people.

danielbln 8 days ago

We feed it conventions that are automatically loaded for every LLM task, so that the LLM adheres to coding style, comment style, common project tooling and architecture, etc.

These systems don't do online learning, but that doesn't mean you can't spoon-feed them what they should know and mutate that knowledge over time.

nico 8 days ago

> For me, a team of junior developers that refuse to learn from their mistakes is the fuel of nightmares. I'm stuck in a loop where every day I need to explain to a new hire why they made the exact same

This is a huge opportunity, maybe the next big breakthrough in AI when someone figures out how to solve it

Instead of having a model that knows everything, have a model that can learn on the go from the feedback it gets from the user

Ideally a local model too. So something that runs on my computer that I train with my own feedback so that it gets better at the tasks I need it to perform

You could also have one at team level, a model that learns from the whole team to perform the tasks the team needs it to perform

freeone3000 8 days ago

Continual feedback means continual training. No way around it. So you'd have to scope down the functional unit to a fairly small LoRA in order to get reasonable re-training costs here.
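
For a sense of scale, a small LoRA with the peft library looks roughly like this (model name and target modules are illustrative, not a recommendation):

  from transformers import AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model

  # Illustrative base model; any causal LM would do.
  base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
  config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
  model = get_peft_model(base, config)
  model.print_trainable_parameters()  # a tiny fraction of the base weights

Even re-training just that on every piece of feedback still isn't free, which is the point.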

regularfry 8 days ago

That's not quite true. The system prompt is state that you can use for "training" in a way that fits the problem here. It's not differentiable so you're in slightly alien territory, but it's also more comprehensible than gradient-descending a bunch of weights.

immibis 7 days ago

If you treat it as vectors instead of words, it might be differentiable.

nico 8 days ago

Or maybe figure out a different architecture

Either way, the end user experience would be vastly improved

sdesol 8 days ago

> This is a huge opportunity, maybe the next big breakthrough in AI when someone figures out how to solve it

I am not saying I solved it, but I believe we are going to experience a paradigm shift in how we program and teach, and some people are really going to hate it. With AI, we can now easily capture the thought process for how we solve problems, but there is a catch. For this to work, senior developers will need to come to terms that their value is not in writing code, but solving problems.

I would say 98% of my code is now AI generated and I have 0% fear that it will make me dumber. I will 100% become less proficient in writing code, but my problem solving skills will not go away and will only get sharper. In the example below, 100% of the code/documentation was AI generated, but I still needed to guide Gemini 2.5 Pro.

https://app.gitsense.com/?chat=c35f87c5-5b61-4cab-873b-a3988...

After reviewing the code, it was clear what the problem was, and since I didn't want to waste tokens and time, I literally suggested the implementation and told it not to generate any code, but asked it to explain the problem and the solution, as shown below.

> The bug is still there. Why does it not use states to track the start @@ and end @@? If you encounter @@, you can do an if else on the line by asking if the line ends with @@. If so, you can change the state to expect replacement start delimiter. If it does not end with @@ you can set the state to expect line to end with @@ and not start with @@. Do you understand? Do not generate any code yet.
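
(For illustration only, this is roughly the state machine that prompt describes; the actual @@ delimiter format is specific to the tool, so every condition below is an assumption, not its real parser.)

  def scan(lines):
      # States follow the prompt above: look for a line starting with @@;
      # if it also ends with @@ the delimiter is complete, otherwise wait
      # for a later line that ends with @@ but does not start with one.
      state = "expect_search_start"
      for line in (l.rstrip() for l in lines):
          if state == "expect_search_start" and line.startswith("@@"):
              state = ("expect_replacement_start"
                       if line.endswith("@@") and len(line) > 2
                       else "expect_search_end")
          elif state == "expect_search_end":
              if line.endswith("@@") and not line.startswith("@@"):
                  state = "expect_replacement_start"
          elif state == "expect_replacement_start":
              # Hand off to replacement-block handling (tool-specific).
              state = "expect_search_start"
      return state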

How I see things evolving over time is, senior developers will start to code less and less, and the role of junior developers will not only be to code but to review conversations. As we add new features and fix bugs, we will start to link to conversations that junior developers can learn from. The doomsday scenario is, obviously, that with enough conversations we may reach the point where AI can solve most problems in one shot.

Full Disclosure: This is my tool

ghuntley 7 days ago

> For this to work, senior developers will need to come to terms that their value is not in writing code, but solving problems.

This is the key reason behind authoring https://ghuntley.com/ngmi - developers who come to terms with the new norm will flourish, yet those who don't will struggle in corporate...

arrosenberg 7 days ago

Seems more likely that Orange would just be producing 16x the mediocre C-tier work they were doing before. This scenario assumes that the problem from the lower-tier programmers is their volume of work, but if you aren't able to properly understand the problems or solutions in the first place, LLMs are not going to fix it for you.

Chances are Grape and Apple will eventually adopt LLMs because they need to in order to fix the mistakes Orange is now producing at scale.

sdesol 7 days ago

Nice write-up. I don't necessarily think it is "if you adopt it, you will flourish" so much as: if you have this type of personality you will easily become 10x, if you have that type of personality you will become 2x, and if you have this other type, you will become 0.5x.

I'm obviously biased, but I believe developers with a technical entrepreneur mindset will see the most benefit. This paradigm shift requires the ability to properly articulate your thoughts and be able to create problem statements for every action. And honestly, not everybody can do this.

Obviously, a lot depends on the problems being solved and how well trained the LLM is in that person's domain. I had Claude and a bunch of other models write my GitSense Chat Bridge code, which makes it possible to bring Git's history into my chat app, and it is slow as hell. It works most of the time, but it was obvious that the design pattern was based on simple CRUD apps. And this is where LLMs will literally slow you down, and I know this because I already solved this problem. The LLM-generated chat bridge code will be free and open sourced, but I will charge for my optimized indexing engine.