So now either the agent is writing the tests, in which case you're right back to the same issue (which tests are actually worth running?), or your job is now just writing tests (bleh...).
And for the LLM review of the PR... Why do you assume it'll be worth any more than the original implementation? Or are we just recursing down a level again (if 100 LLMs review each of the 100 PRs... To infinity and beyond!)
This by definition is not trivially automated.
The LLMs can help with writing the tests, but you should verify that they cover the critical aspects and known edge cases. A single review-prompted LLM can then apply those tests across the PRs and provide a summary recommending the best one for acceptance. Or discard them all and do it manually; that initial process should only have taken a few minutes, so there's minimal wastage in the grand scheme of things, provided that over time you get a decent rate of acceptances, compared to the alternative of 100% manual effort and the associated time sunk.
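To make the "apply the tests across the PRs" step concrete, here's a minimal sketch of the mechanical part: run the human-verified suite against each candidate branch and report which ones pass, so the reviewer (or a review LLM) only looks closely at the survivors. The branch naming scheme, repo layout, and pytest invocation are assumptions for illustration, not a prescribed setup.

```python
# Sketch: run the verified test suite against each candidate PR branch
# and summarize which branches pass, so only those need close review.
# Branch names, repo layout, and the pytest command are assumptions.
import subprocess

CANDIDATE_BRANCHES = [f"agent/pr-{i}" for i in range(1, 101)]  # hypothetical naming

def run_suite(branch: str) -> bool:
    """Check out the branch and run the verified tests; True means all passed."""
    subprocess.run(["git", "checkout", "--quiet", branch], check=True)
    result = subprocess.run(["pytest", "--quiet", "tests/"])
    return result.returncode == 0

def main() -> None:
    passing = [b for b in CANDIDATE_BRANCHES if run_suite(b)]
    print(f"{len(passing)}/{len(CANDIDATE_BRANCHES)} branches pass the verified suite:")
    for branch in passing:
        print(f"  {branch}")

if __name__ == "__main__":
    main()
```

The point is that the expensive human judgment is spent once, on the test suite, and the per-PR filtering is cheap and repeatable.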