mattmanser 6 days ago

I had a quick look at the repo as I wondered what you meant by multiple specialized agents.

Fundamentally, each of those 24 agents seems to be just:

"load from pdf > put text into this prompt > Call OpenAI API"

So is it actually just posting 24 different prompts to a generalist AI?

I'm also wondering about the prompts; one I read said "find 3-4 problems per section....find 10-15 problems per paper". What happens when you put in a good paper? Does this force it to find meaningless, nit-picky problems? Have you tried it on papers that are acknowledged to be well written?

From a programming perspective, the code has a lot of room for improvement.

The big one: if you'd used the same interface for each "agent", you could have had them all self-register and be called in a loop (rough sketch below), rather than having to do what you've done in this file:

https://github.com/robertjakob/rigorous/blob/main/Agent1_Pee...

TBH, that's a bit of a WTF file. The `def _determine_research_type` method looks like a placeholder you've forgotten about too, as it uses a pretty wonky way to determine the paper type.
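Something like this would let you drop that dispatch file entirely. It's only a rough sketch with made-up names (BaseAgent, MethodologyAgent, call_llm are not from your repo), but it shows the idea:

    # hypothetical sketch of a self-registering agent interface (names made up, not the repo's)
    AGENT_REGISTRY = []

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in the existing OpenAI call here")

    class BaseAgent:
        name = "base"
        prompt_template = "Review the following paper:\n\n{paper}"

        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            AGENT_REGISTRY.append(cls)  # every subclass registers itself on import

        def review(self, paper_text: str) -> str:
            return call_llm(self.prompt_template.format(paper=paper_text))

    class MethodologyAgent(BaseAgent):
        name = "methodology"
        prompt_template = "Check the methodology of this paper:\n\n{paper}"

    # the orchestrator becomes a loop instead of a 24-way dispatch
    def run_all(paper_text: str) -> dict:
        return {agent.name: agent().review(paper_text) for agent in AGENT_REGISTRY}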

Also, you really didn't need specialized classes for each prompt; you could have had the prompts as plain text files that a single class loads as templates and substitutes values into. As it stands, you'll have a lot of work whenever you need to change the way your prompting works: editing 24 files each time, probably by cut/paste, which is error prone.

I've done this before: keep the templates in a folder and have the program load them dynamically, so you can add more really easily. The next stage is to add pre-processor directives to your loader so you can put some config at the top of each text file.
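For example, something along these lines (the folder layout and the "#!" directive syntax are just things I've made up for illustration):

    # prompt_loader.py - hypothetical loader: one function, prompts live in text files
    from pathlib import Path

    def load_prompts(folder: str) -> dict:
        prompts = {}
        for path in sorted(Path(folder).glob("*.txt")):
            config, body = {}, []
            for line in path.read_text().splitlines():
                if line.startswith("#!"):  # pre-processor directive, e.g. "#! model: gpt-4.1-nano"
                    key, _, value = line[2:].partition(":")
                    config[key.strip()] = value.strip()
                else:
                    body.append(line)
            prompts[path.stem] = {"config": config, "template": "\n".join(body)}
        return prompts

    # usage: adding a 25th agent is just dropping another .txt into the folder
    # prompts = load_prompts("prompts")
    # filled = prompts["methodology"]["template"].format(paper=paper_text)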

I'm also not looking that hard at the code, but it seems you dump the entire paper into each prompt rather than just the section it needs to review. An easy money saver would be to have an AI chop up the paper first and then inject only the section each agent needs, cutting your token costs. Although you then run the risk of it chopping the paper up badly.
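Roughly like this (the split prompt, the fallback, and call_llm are all illustrative, not from your code):

    import json

    SPLIT_PROMPT = ("Split the paper below into its sections and return JSON "
                    "mapping section title to section text.\n\n{paper}")

    def split_sections(paper_text: str, call_llm) -> dict:
        # one cheap up-front call instead of sending the full paper to all 24 agents
        return json.loads(call_llm(SPLIT_PROMPT.format(paper=paper_text)))

    # each agent then gets only the section it cares about,
    # falling back to the full paper if the model chopped it up badly:
    # sections = split_sections(paper_text, call_llm)
    # methods_review = methods_agent.review(sections.get("Methods", paper_text))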

Finally, and this is a real nitpick, but it's twitch-inducing when reading the prompts: comments in JavaScript are two forward slashes, not a hash.

rjakob 5 days ago

Best feedback so far!

You're right: in the current version, each "agent" essentially loads the whole paper, applies a specialized prompt, and calls the OpenAI API. The specialization lies in how each prompt targets a specific dimension of peer review (e.g., methodological soundness, novelty, citation quality). It's not specialization via architecture yet (i.e., different models), but prompt-driven specialization, essentially simulating a review committee where each member focuses on a distinct concern. We're currently using a long-context, cost-efficient model (GPT-4.1-nano style) for these specialized agents to keep it viable for now. Think of it as an army of reviewers flagging areas for potential improvement.

To synthesize and refine the feedback, we also run Quality Control agents (acting like an associate editor) that review all prior outputs from the individual agents to reduce redundancy, surface the most constructive insights, and filter out less relevant feedback.
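In rough pseudocode, the flow is something like this (a simplified illustration, not the actual implementation):

    # illustration of the flow described above, not the actual repo code
    def review_paper(paper_text, specialist_agents, quality_control_agent):
        # each specialist sees the whole paper with its own specialized prompt
        feedback = {a.name: a.review(paper_text) for a in specialist_agents}
        # the QC agent de-duplicates, ranks, and drops weak or misaligned points
        return quality_control_agent.synthesize(feedback)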

On your point about nitpicking: we’ve tested the system on several well-regarded, peer-reviewed papers. The output is generally reasonable and we haven't seen "made up" issues yet, but there are occasional instances where the feedback is misaligned. We're convinced, however, that we can reduce such noise almost entirely in future iterations (community feedback is super important for this).

On the code side: 100% agree. This is very much an MVP focused on testing potential value to researchers, and the repeated agent classes were helpful for fast iteration. However, your suggestion of switching to template-based prompt loading and dynamic agent registration is great and would improve maintainability and scalability. We'll 100% consider it in the next version.

The _determine_research_type method is indeed a stub. Good catch. Also, lol @ the JS comment hashes, touché.

If you're open to contributing or reviewing, we’d love to collaborate!