It’s quite difficult to explain why this isn’t how diffusion models work without giving people (and myself!) a headache, but I think understanding how these tools work is extremely important.
Here’s my mildly-informed attempt at explaining it:
DALL·E is not a multi-modal large language model. It does not reason or plan.
Its sole job is to iteratively denoise a field of random pixels into an image that matches the text prompt it was given (see the rough sketch below).
It is not a graphic designer, or a skilled artist drawing something from scratch.
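To make the "denoising" idea concrete, here's a minimal, purely illustrative sketch of the kind of loop a diffusion model runs at generation time. The function names (`predict_noise`, `generate`) and the dummy text embedding are my own placeholders, not DALL·E's actual API or architecture:

```python
import numpy as np

def predict_noise(noisy_image, text_embedding, t):
    # Hypothetical stand-in for the trained denoising network.
    # A real diffusion model would predict the noise present in the
    # image at step t, conditioned on the prompt embedding.
    return np.zeros_like(noisy_image)  # placeholder prediction

def generate(text_embedding, steps=50, shape=(64, 64, 3)):
    # Start from pure Gaussian noise...
    image = np.random.randn(*shape)
    for t in reversed(range(steps)):
        # ...and at each step subtract a little of the predicted noise,
        # nudging the pixels toward something that matches the prompt.
        predicted_noise = predict_noise(image, text_embedding, t)
        image = image - predicted_noise / steps
    return image

# The text embedding would normally come from encoding the prompt
# (e.g. with a CLIP-style text encoder); here it's just a dummy vector.
img = generate(text_embedding=np.zeros(512))
```

The point of the sketch is that every pixel comes out of this repeated noise-removal process: there's no step where the model "lays out" a chart or "decides" what facts to include.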
Adding to what @Foxalabs said, an infographic (or any other form of media dense with facts and figures) is not something you can usually get a quality version of by just "denoising". As GPT-4 and other SOTA multi-modal models get better at producing multi-modal outputs, they should start to fill that gap.
Hopefully that explanation wasn’t too far from the truth!