When ChatGPT creates an image with incorporated text labels, the text is often misspelled or garbled. ChatGPT explains that this happens because the image generator prioritizes visual rendering over text accuracy. My proposed solution (which ChatGPT agrees with) is to generate the image with blank speech bubbles and labels, then pass the location, size, and shape of these text areas to a subsequent step where proofread text is correctly generated and superimposed. This would significantly improve the usability of AI-generated infographics, comics, and labeled illustrations.
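As a rough illustration of the second step, here is a minimal Python sketch of how proofread text might be fitted into a blank area once its position and size are known. Everything named here is an assumption for illustration only: the `TextRegion` format, the `layout_text` helper, and the fixed-width font metrics (`char_width`, `line_height`) are hypothetical and not something ChatGPT or DALL-E actually provides. It treats every area as a simple rectangle and ignores bubble shape.

```python
import textwrap
from dataclasses import dataclass


@dataclass
class TextRegion:
    """A blank bubble or label area (hypothetical format, pixel units)."""
    x: int       # left edge
    y: int       # top edge
    width: int
    height: int


def layout_text(region, text, char_width=8, line_height=16, padding=4):
    """Wrap text to fit the region and return (x, y, line) draw positions.

    Assumes a fixed-width font. Raises ValueError if the text cannot fit.
    """
    usable_width = region.width - 2 * padding
    usable_height = region.height - 2 * padding
    max_chars = usable_width // char_width
    if max_chars < 1:
        raise ValueError("region too narrow for any text")

    # textwrap breaks over-long words by default, so no line exceeds max_chars.
    lines = textwrap.wrap(text, width=max_chars)
    if len(lines) * line_height > usable_height:
        raise ValueError("text does not fit in region")

    # Centre the block vertically and each line horizontally.
    top = region.y + padding + (usable_height - len(lines) * line_height) // 2
    placements = []
    for i, line in enumerate(lines):
        line_x = region.x + padding + (usable_width - len(line) * char_width) // 2
        placements.append((line_x, top + i * line_height, line))
    return placements


if __name__ == "__main__":
    bubble = TextRegion(x=100, y=50, width=108, height=60)
    for x, y, line in layout_text(bubble, "Hello there world"):
        print(x, y, line)
```

The returned positions could then be handed to whatever imaging library does the actual drawing. A real version would need true font metrics and some handling of non-rectangular bubbles, which this sketch leaves out.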
That sounds like the kind of thing you can ask the AI to produce when it makes the image. Alternatively, you can instruct ChatGPT never to ask for speech or captions in the image description it sends. Getting an image with no text at all will take careful wording, though: the DALL-E image generation model has an uncanny ability to insert words from the prompt into the image (like “coffee coffee”) even when not requested, and it doesn’t really understand “blank” or “empty”.