Dall-E is sooo bad at recognizing letters and numbers - any advice?

It’s quite difficult to explain why diffusion models don’t work that way without giving people (and myself!) a headache, but I think understanding how these tools work is extremely important.

Here’s my mildly-informed attempt at explaining it:
DALL·E is not a multi-modal large language model. It does not reason or plan.

Its sole job is to try to denoise a bunch of pixels to create an image based on the text prompt it was given.

It is not a graphic designer, or a skilled artist drawing something from scratch.
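
To make that concrete, here’s a toy sketch of what a text-conditioned diffusion sampling loop roughly looks like. This is *not* DALL·E’s actual code; the function names and numbers are made up purely to illustrate the idea that the model only nudges pixels toward “less noisy” over many steps.

```python
import numpy as np

# Hypothetical stand-ins for the real networks. DALL·E's actual components
# are not public, so these are illustrative placeholders only.
def text_encoder(prompt: str) -> np.ndarray:
    """Turn the prompt into a fixed-size conditioning vector (toy version)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(16)

def predict_noise(image: np.ndarray, step: int, cond: np.ndarray) -> np.ndarray:
    """A real denoiser is a huge neural net; here we just return zeros."""
    return np.zeros_like(image)

# Start from pure noise and repeatedly "denoise" toward an image.
image = np.random.standard_normal((64, 64, 3))
cond = text_encoder("an infographic with readable labels and numbers")

for step in reversed(range(50)):
    noise_estimate = predict_noise(image, step, cond)
    # Nudge the pixels a little closer to something image-like.
    image = image - 0.1 * noise_estimate

# Nothing in this loop "plans" letters, words, or layout. Legible text only
# appears if the denoiser happens to reproduce letter-shaped pixel patterns
# it saw during training.
```

The point of the sketch: there’s no step where the model spells anything out or checks that a number is correct, which is why text and figures in the output so often come out mangled.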

Adding to what @Foxalabs said, an infographic (or any other form of media with lots of facts and figures) is not something you can usually get a quality version of by just “denoising”. As GPT-4 and other SOTA multi-modal models get better at producing multi-modal outputs, they should start to fill that gap.

Hopefully that explanation wasn’t too far from the truth!
