Trope subversion is always difficult, since most checkpoints (SD, SDXL, etc.) naturally have very little training data that inverts defaults like these (a mermaid with human legs and a fish head, a Cerberus with five heads, a piano where the natural keys are black and the sharps/flats are white, etc.).
Outside of manual control like ControlNet / inpainting or a custom LoRA, there's not much you can do except re-roll and hope you get lucky.
I wonder whether this will get any better once language models are more tightly integrated with image generation models. Or is it a fundamental issue that can't be solved? It's not like we're going to add these edge cases to the training data.