I have tried many times to get DALL-E 3 to put text in an image. For example:
draw a large file icon and write "ZIP" on it
But DALL-E 3 often renders the wrong text, such as “ZP”, “Z”, or “ZAP”; out of 10 retries, only about 4 produce the correct text. Why? How can I make DALL-E 3 produce the exact text I ask for in the prompt?
There’s a specific reference to this limitation in the DALL-E 3 paper:
5.2 Text rendering
When building our captioner, we paid special attention to ensuring that it was able to include prominent words found in images in the captions it generated. As a result, DALL-E 3 can generate text when prompted. During testing, we have noticed that this capability is unreliable as words have missing or extra characters. We suspect this may have to do with the T5 text encoder we used: when the model encounters text in a prompt, it actually sees tokens that represent whole words and must map those to letters in an image. In future work, we would like to explore conditioning on character-level language models to help improve this behavior.
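In short, the text encoder sees whole-word tokens rather than letters, so the image model has to "guess" the spelling. As a quick illustration (a sketch assuming the Hugging Face `transformers` and `sentencepiece` packages; the paper does not name the exact T5 variant, so `t5-base` here is an assumption), you can look at what the encoder actually receives:

```python
from transformers import T5Tokenizer  # requires `transformers` and `sentencepiece`

# "t5-base" is chosen for illustration only; the paper doesn't say which variant was used.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
pieces = tokenizer.tokenize('draw a large file icon and write "ZIP" on it')
print(pieces)  # subword pieces, not individual characters; whether "ZIP" stays
               # one piece or gets split depends entirely on the vocabulary
```

There is no API parameter that forces exact text rendering, so the practical workaround is the one you are already doing by hand: generate several candidates and keep the one whose text is correct. A minimal sketch using the official `openai` Python SDK (v1+, assuming `OPENAI_API_KEY` is set in your environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = 'draw a large file icon and write "ZIP" on it'

# DALL-E 3 only accepts n=1 per request, so loop to collect candidates.
candidate_urls = []
for _ in range(5):
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        n=1,
    )
    candidate_urls.append(response.data[0].url)

for url in candidate_urls:
    print(url)  # inspect each image and keep the one with the correct text
```

You still have to check the output yourself (or run an OCR pass over the candidates); the model cannot verify its own rendering.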