You’re on the right track and explain it very well. I might be able to add more details for you.
**Most of these comments are about controlling the prompt sent to DALL-E down to every character of input, which significantly reduces the factors behind the issues you encountered.**
Bugs:
Nonsensical Text Insertion: When pushing DALL-E to its creative limits, nonsensical text suddenly appears: DALL-E inserts the prompt into the image, probably to describe it. This has been the strangest behavior so far. You cannot get rid of it with “don’t add any text”; on the contrary, you get more text. You have to change the prompt itself. It seems DALL-E starts describing the image instead of trying to find a graphic solution if it has no idea how to realize it. So the more creatively challenging the realization is, the more likely you get nonsense text in the image. Some styles are probably more susceptible to unwanted text because text is often included in such images in the training data, for example in “drawing” or “cartoon style”. (Very tiresome and frustrating sometimes!)
- In this case, I see it as creating an interactive element that informs us of the problem; we just need to interpret it correctly. I consider it a parameter; let’s call it ‘Chaos.’ The effects range from blurry lines to distorted straight lines.
When compared with images without Chaos, even complex ones remain sharp and clearly detailed.
It can also create two low-quality images, as if asking you to choose. I can explain this in more detail later.
Image Orientation Issues: DALL-E has issues orienting images correctly if they are in portrait/vertical mode. It sometimes creates a horizontal image and simply rotates it the wrong way, or it creates a square and fills up the rest with nonsense. It seems some people could overcome this with a directive like “rotate 90°,” but it is not stable.
- In this case, I set aside the image-size command issue, which may stem from a misunderstanding between the user and ChatGPT. This confusion is similar to the one above, but it doesn’t necessarily arise solely from mistakes or misunderstandings. Conflicts between the image to be generated and the system also play a part, such as requesting an image of a woman in a sexy outfit in vertical orientation, or asking for a full-body view: DALL-E would generate only the front portion in widescreen format, rotated 90 degrees.
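If you generate via the API instead of ChatGPT, you can sidestep this prompt-level ambiguity by requesting the orientation explicitly through the `size` parameter rather than describing it in words. A minimal sketch with the official `openai` Python SDK (the prompt text is just a placeholder of mine):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask for portrait orientation via the API parameter, not the prompt.
# DALL-E 3 accepts "1024x1024" (square), "1792x1024" (landscape),
# and "1024x1792" (portrait).
result = client.images.generate(
    model="dall-e-3",
    prompt="A full-length view of a woman in an elegant red dress, studio lighting",
    size="1024x1792",
    n=1,
)
print(result.data[0].url)
```

In ChatGPT itself there is no such parameter; there, rewording the request (e.g. asking for a “tall, vertical composition”) is the main lever, which is exactly why the misunderstanding above happens.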
Geometric Understanding: Geometries are not yet fully understood. For example, a snake sometimes simply consists of a closed ring. Fingers are better now but can still have mistakes. The system is still not perfect…
- In this part, I found that there aren’t many issues; the model does understand direction and spatial dimensions, but it hasn’t been developed to its full potential (information from February, reported to the help center; the same problem carries over to Sora, because the same tool is used in the language-processing pipeline).
Lack of Metadata: Not a bug in that sense, but kind of… Files created by DALL-E do not include any meaningful metadata. The prompt, seed, date, or any other info is not included in the files, so you must save the prompts manually if you want to keep them. (I have now spent several hours UNSUCCESSFULLY trying to add the missing metadata to the WEBP files. WEBP is absolute garbage.)
- I have no issues with this point, and I already have my own way of managing the data. However, I have some recommendations that will make your life easier, which I will include with the related points. One workaround is sketched below.
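Since embedding metadata in WEBP is painful, one workaround (a sketch of my own, assuming Pillow is installed; `save_with_prompt` is a hypothetical helper name) is to convert the download to PNG and store the prompt in a standard text chunk:

```python
from PIL import Image
from PIL.PngImagePlugin import PngInfo

def save_with_prompt(webp_path: str, prompt: str, out_path: str) -> None:
    """Convert a DALL-E WEBP download to PNG and embed the prompt as a text chunk."""
    image = Image.open(webp_path)
    meta = PngInfo()
    meta.add_text("prompt", prompt)  # readable later via Pillow or exiftool
    image.save(out_path, "PNG", pnginfo=meta)

save_with_prompt("download.webp", "a watercolor fox in a misty forest", "fox.png")

# Reading the prompt back later:
print(Image.open("fox.png").text["prompt"])
```

A sidecar JSON file next to each image works just as well and survives format conversions, if you prefer not to touch the files themselves.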
Content Policy Issues: The content-policy security system of DALL-E does not make much sense and gives no feedback; it sometimes blocks absolutely harmless prompts. I have another post about this. (Bug Report: Image Generation Blocked Due to Content Policy)
- It is important to distinguish the cause: the decision-maker here is DALL-E. When ChatGPT refuses to create something, I always attribute it to this, because DALL-E often does create the image but substitutes other elements for the blocked ones, similar to the first two cases above.
Issues and weaknesses:
Here are some tips on how to work around some weaknesses that DALL-E still has.
It is also interesting to know that even GPT does not recognize some of these weaknesses and generates prompts for DALL-E that could be improved.
Negation Handling: DALL-E cannot process negations; whatever is in the text mostly ends up in the picture. DALL-E does not understand “not,” “no,” “don’t,” or “without,” so always describe the desired properties positively to prevent DALL-E from getting the idea of adding something unwanted.
- Initially, I thought so too, but recently I have noticed some responses to negations. It is possible that the interpretation has changed.
Avoid Possibility Forms: It is also good to avoid possibility forms like “should” or “could” and instead directly describe what you want to have in the image.
Prompt Accuracy: DALL-E takes everything in the prompt and tries to implement it, even if it doesn’t make sense or is contradictory. The more complex the realization, the more likely errors are. For example, “close to the camera” or “close to the viewer” resulted in DALL-E adding a camera or a hand to the image, instead of placing the desired element close to the viewpoint. So far, “close to us” has worked.
Also, the instructions “create” or “visualize an image” sometimes lead DALL-E to add brushes and drawing tools, even a hand that literally creates the image. A phrase like “An Image…” or “A Scene…” sometimes leads DALL-E to literally create an image within the image, or a scene on a stage in a theater.
Just describe the image itself and avoid instructing DALL-E to “create/generate/visualize the image” or opening with “an image / a scene / a setting …”.
Instead, say “The Scene is…”; if you want an overall effect, say “All is…”.
- These three topics I consider as one, but they must be separated from the claim that DALL-E “takes everything in the prompt.” You observed this very well, but DALL-E does not process all the words.
- These issues arise from the model’s interpretation and some external factors. Words like “should,” “could,” or similar terms add Chaos; the model can choose to do everything or nothing. It also happens because of words that can be interpreted with different meanings, sentence segmentation, and phrasing. This is crucial, and most people don’t realize it because they unconsciously think in human language with a single interpretation. Even though an LLM thinks in vectors, finding the next weighted word to create a suitable image, it doesn’t work the same way with DALL-E. I noticed this because I am not proficient in English: if I am confused when translating, DALL-E will also be confused by the misinterpretation. Personally, I think this is a characteristic of the model: if ChatGPT tends to be agreeable but deceptive, DALL-E is the opposite, resistant but straightforward.
There are also a few external factors I mentioned earlier that affect other parts beyond their intended function, and they affect the so-called templates you mentioned below.
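To make these wording tips concrete, here is a small linting sketch of my own (the word lists are illustrative, not exhaustive, and `lint_prompt` is a hypothetical helper) that flags negations and possibility forms before a prompt is sent off:

```python
import re

# Words the tips above suggest avoiding: negations that DALL-E tends to
# ignore or invert, and possibility forms that add Chaos.
NEGATIONS = {"no", "not", "don't", "never", "without"}
HEDGES = {"should", "could", "would", "might", "maybe"}

def lint_prompt(prompt: str) -> list[str]:
    """Return a warning for each word that tends to confuse DALL-E."""
    warnings = []
    for word in re.findall(r"[a-z']+", prompt.lower()):
        if word in NEGATIONS:
            warnings.append(f"negation '{word}': describe the desired property positively instead")
        elif word in HEDGES:
            warnings.append(f"possibility form '{word}': state directly what the image contains")
    return warnings

# Example: this prompt would trigger three warnings.
print(lint_prompt("A forest path, there should be no people, don't add text"))
```

This obviously cannot catch the segmentation or multiple-meaning problems described above, but it handles the mechanical part of the checklist.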
I’m going to take a break. It’s possible that I might borrow some of the content in this thread as an example of a problem that can be solved when making my own content. And I will explain everything, no matter how much it contradicts what many people believe, because my knowledge is not based on textbook fundamentals or general research.