That's an interesting space to explore! I'm wondering about the baseline in the benchmarks: which prompts did you use for those? I ask because some of the resulting prompts seem fairly generic, and I wonder whether you could just add them wholesale to every prompt and see a similar improvement. Things like "Identify the question (what are you trying to find?)".
In the same vein, wouldn't it be interesting to measure which parts of the prompt contribute most to solving the problem? Surely some parts are just noise and could be trimmed away.
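What I have in mind is a rough leave-one-out ablation, something like the sketch below (purely illustrative; `run_benchmark` and the listed parts are placeholders, not taken from your setup):

```python
from typing import Callable, Dict, List

# Hypothetical prompt components; the real ones would come from the learned prompts.
PROMPT_PARTS: List[str] = [
    "Read the problem carefully (multiple times).",
    "Identify the question (what are you trying to find?)",
    "Show your reasoning step by step.",
]

def build_prompt(parts: List[str], question: str) -> str:
    return "\n".join(parts) + "\n\n" + question

def leave_one_out(run_benchmark: Callable[[Callable[[str], str]], float]) -> Dict[str, float]:
    """Score the full prompt, then score it again with each part removed."""
    scores = {"full": run_benchmark(lambda q: build_prompt(PROMPT_PARTS, q))}
    for i, part in enumerate(PROMPT_PARTS):
        reduced = PROMPT_PARTS[:i] + PROMPT_PARTS[i + 1:]
        scores[f"minus: {part}"] = run_benchmark(lambda q, r=reduced: build_prompt(r, q))
    return scores
```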
Also wondering what this does, since the model probably won't (can't?) actually read the problem multiple times:
> Read the problem carefully (multiple times).
Re-reading the problem apparently works well - https://arxiv.org/abs/2309.06275
Here the system seems to have discovered this strategy by itself. The prompts end up generic because part of the learning process refines and combines them. I haven't yet experimented with adding all prompts to every query; given the large context that would require, it will be interesting to see.
Okay, but it looks like in the paper, they are actually adding the question twice in the prompt, not just instructing the model to read it twice. Or am I missing something?
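If I'm reading it right, the difference is roughly this (a sketch only; the wording of the paper's template is approximate, not copied verbatim):

```python
def instruct_only(question: str) -> str:
    # What the discovered prompt does: an instruction, the question appears once.
    return f"Read the problem carefully (multiple times).\n\n{question}"

def repeat_question(question: str) -> str:
    # What the paper appears to do: the question text is literally included twice,
    # so the model actually processes it a second time. Wording is approximate.
    return f"{question}\nRead the question again: {question}"
```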