This is a fun idea, but whatever version of Stable Diffusion is being used isn't very prompt-adherent, so any kind of complex scene or interaction between characters is mostly ignored, which reduces the fun a bit.
Whenever I show things like this to my kids they always ask for something that'll be really hard, like 'a unicorn riding a princess', and then everything comes back as a princess riding a unicorn and they say 'this sucks'.
Trope subversion is always difficult: the data most checkpoints (SD 1.5, SDXL, etc.) were trained on naturally contains very few examples that reverse things like this (a mermaid with human legs and a fish head, a Cerberus with five heads, a piano where the natural keys are black and the sharps/flats are white, etc.).
Outside of manual control like ControlNet / inpainting or a custom LoRA, there's not much you can do except "re-roll" and hope you get lucky.
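For what it's worth, the "re-roll" loop is literally just keeping the prompt fixed and varying the seed. A minimal sketch with Hugging Face diffusers, assuming the runwayml/stable-diffusion-v1-5 checkpoint and a CUDA GPU (swap in whatever model you're actually using):

```python
# Re-roll sketch: same prompt, different seeds, save every candidate
# and eyeball the batch for the one that (maybe) got the scene right.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a unicorn riding on the back of a princess"  # the trope-subverting ask

for seed in range(8):
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator, num_inference_steps=30).images[0]
    image.save(f"reroll_{seed}.png")  # most will still swap the riders
```

Even with a batch like this you're mostly at the mercy of how strongly the "princess rides unicorn" prior is baked into the model.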
I wonder if image generation will get any better at this once language models are more tightly integrated with the image models. Or is this a fundamental issue that can't be solved? It's not like we're going to add all of these edge cases to the training data.