Aren’t prompts seeking to offload reasoning, though? Is that really a fair data point for this?
When people are claiming they can't reason, then yes, benchmarking against the average human should be the bare minimum. Arguably the benchmark should include below-average humans too, because the bar at which we'd be willing to argue that a human can't reason is very low.
If you're testing to see whether it can replace certain types of work, then it depends on where you would normally set the bar for that type of work. You could offload a whole lot of work to something that can reliably reason at a below-average human level.