I would like to start a section to collect some tips and tricks for DALL-E 3, including some weaknesses too. Feel free to extend or correct it.
It took me quite some time to realize relatively simple things. This might help save some time when experimenting.
I will see what makes more sense, adding new posts or editing this first text as a collection; the future will show…
Hope this helps.
(I have mainly created photo-realistic pictures, so I don't have much specific experience with other styles like paintings, drawings, or cartoons. I use the browser chat DALL-E, not the API; I have no experience with the API or Python.)
Bugs:
- Nonsensical Text Insertion: When pushing DALL-E to its creative limits, nonsensical texts suddenly appear: DALL-E inserts the prompt into the image, probably to describe it. This has been the strangest behavior so far. You cannot get rid of it with "don't add any text"; on the contrary, you get more text. You have to change the prompt itself.
It seems DALL-E starts describing the image instead of trying to find a graphic solution when it has no idea how to realize it. So the more creatively challenging the realization is, the more likely you are to get nonsense text in the image.
Some styles are probably more susceptible to unwanted text because text is often included in such images during training, for example "drawing" or "cartoon style".
(Very tiresome and frustrating sometimes!)
- Image Orientation Issues: DALL-E has issues orienting images correctly in portrait/vertical mode. It sometimes creates a horizontal image but simply rotates it the wrong way, or it creates a square and fills up the rest with nonsense. Some people seem to have overcome this with a directive like "rotate 90°," but it is not stable.
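As a made-up illustration, a closing line like the following is worth trying, but as said, it is not stable:
"Portrait format, vertical orientation, rotate 90°."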
- Geometric Understanding: Geometries are not yet fully understood. For example, a snake sometimes simply consists of a closed ring. Fingers are better now but can still have mistakes. The system is still not perfect…
- Lack of Metadata: Not a bug in the strict sense, but kind of… Files created by DALL-E do not include any meaningful metadata: the prompt, seed, date, or any other info is not included in the files. So you must save the prompts manually if you want to keep them.
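If you are comfortable with a little scripting (I am not, so treat this as an untested sketch assuming Pillow is installed and the downloaded file is a PNG), you could store the prompt inside the file yourself. The file names and prompt text below are placeholders:

```python
from PIL import Image
from PIL.PngImagePlugin import PngInfo

# Placeholders: point these at your own downloaded file and prompt.
source_file = "dalle_image.png"
prompt_text = "Photo style. Mystical magical mood. ..."

image = Image.open(source_file)

# PNG files can carry simple text chunks; store the prompt in one.
metadata = PngInfo()
metadata.add_text("prompt", prompt_text)

# Save a copy that keeps the prompt inside the file itself.
image.save("dalle_image_with_prompt.png", pnginfo=metadata)

# Reading it back later: the .text attribute returns the stored chunks.
print(Image.open("dalle_image_with_prompt.png").text)
```

This only works for PNG files; otherwise, simply keeping a text file with the prompt next to the image does the job.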
- Content Policy Issues: The content-policy security system of DALL-E does not make much sense and gives no feedback; it sometimes blocks absolutely okay texts. I have another post for this.
(Bug Report: Image Generation Blocked Due to Content Policy)
Issues and weaknesses:
Here are some tips on how to work around some weaknesses that DALL-E still has.
It is also interesting to know that even GPT does not recognize some of these weaknesses and generates prompts for DALL-E that need to be improved.
- Negation Handling: DALL-E cannot process negations; whatever is in the text mostly ends up in the picture. DALL-E does not understand "not, no, don't, without." So always describe the desired properties positively, to prevent DALL-E from getting the idea of adding something unwanted.
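For example (my own illustration): instead of "a beach with no people," write "an empty, deserted beach"; instead of "a sky without clouds," write "a clear blue sky."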
- Avoid Possibility Forms: It is also good to avoid possibility forms like "should" or "could" and instead directly describe what you want to have in the image.
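For example: instead of "the sky should be cloudy," write "the sky is cloudy."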
- Prompt Accuracy: DALL-E takes everything in the prompt and tries to implement it, even if it doesn't make sense or is contradictory. The more complex the realization, the more likely errors are. For example, "close to the camera" or "close to the viewer" resulted in DALL-E adding a camera or a hand to the image, instead of placing the desired element close to the viewpoint. So far, "close to us" has worked.
Also, the instruction "create" or "visualize an image" sometimes leads to DALL-E adding brushes and drawing tools, even with a hand that literally creates the image. A phrase like "An image…" or "A scene…" sometimes leads to DALL-E literally creating an image within the image, or a scene on a stage in a theater.
Just describe the image itself and avoid instructing DALL-E to "create/generate/visualize the image" or writing "an image / a scene / a setting…".
Instead of saying "The scene is…" when you want an overall effect, say "All is…".
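To illustrate with a made-up example: rather than "Create an image. The scene is foggy, and a dragon could sit close to the camera," write "A dragon sits close to us. All is foggy."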
- Template Usage: DALL-E seems to use some templates for image generation to increase the likelihood of appealing images, for example lighting, body, and facial templates. Depending on where these are triggered, they are almost impossible or completely impossible to remove.
This reduces creativity, makes many things always look the same and boring, and overrides exactly described styles, moods, motifs, or settings. (Could it be that the training data is reduced, and/or DALL-E 3 is put on rails?) For example:
- Backlight Template: It uses backlight to make a character shine more in the scene. So far it has been almost impossible to create a very dark scene with light in the front. (I could not overcome this yet.)
- Facial Template: Another template is facial (the "mouthy"): it puts an almost plastic, silicone-looking mouth-and-nose template over every single character to make it look human, even if this is unwanted and the face is described differently in detail. (I could not overcome this so far.)
- Stereotypical Aliens: If you create aliens, you very often get the same stereotypical Roswell or H.R. Giger alien. So it is better to describe the character and not trigger the stereotype with the word "alien".
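For example (my own illustration): instead of "an alien stands in the doorway," describe the being directly, such as "a tall, slender creature with mottled green skin and four amber eyes stands in the doorway."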
- Space Planet: Another stereotype involves space images, where a planet is often added at the bottom even if it makes no sense. For example, this happens even when another planet is already described in the image, against a pure starry sky.
- Moon: It always adds the same Moon, even if a planet is described, and the template it uses looks terrible: blurred and not fitting the style of the image at all. @polepole found a workaround by replacing "Moon" with "Perl"; the moon then has less structure but looks much better.
- Nonsensical Lighting: DALL-E sometimes inserts nonsensical lighting elements such as candles, lanterns, lampions, fairy lights, and electric lamps into a scene, even when "pure nature" is requested in the prompt. Instructions like "exclusively nature" do not seem to work.
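Following the positive-description rule above, one thing worth trying (not a guaranteed fix) is to name the only desired light source explicitly, for example: "The full moon is the only light source, pure untouched nature."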
- Character/Scene Reuse: It is not possible to generate a character or scene and reuse it; DALL-E will create another one. This makes it next to impossible to tell a story and generate a series of pictures for it. But to a small degree it can be done, to have the same character in different scenes: the image can include more than one picture, so you can describe a character in different situations and say "left upper corner, …", "right upper corner, …", etc., or something comparable. You can use the keyword "montage" for a multi-image.
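A made-up example of such a prompt: "Montage split into four parts: upper left, the knight stands in a castle courtyard; upper right, the same knight rides through a forest; lower left, the knight sits at a campfire; lower right, a close portrait of the knight."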
- Adding Text: Adding longer texts to an image sometimes does not work. I would use other software to add texts after creating an image, and maybe leave empty space for them.
- Counting Objects: DALL-E cannot count; writing "3 objects" or "4 arms" does not generate the correct amounts in the result. It cannot place the correct amounts of objects in a scene, or subdivide an image into a given grid of X by Y cells.
- Scene Influence and Scattering: All inputs mostly influence the entire scene. For example, you can describe a bright setting with a white horse, and the setting remains bright. If you place a black horse in the same scene, suddenly all contrasts and shadows become darker.
It is also challenging to describe a completely white scene and then insert a colored object into it; the color often affects the entire scene.
This is not always desirable when trying to create a very specific mood or composition. It works a little like a template.
ChatGPT Issues
ChatGPT issues and weaknesses are a topic of their own. Here, I will briefly discuss only some issues related to DALL-E.
- Prompt Generation Issues: GPT does not inherently recognize or account for the issues described here when generating prompts for DALL-E. For example, it often uses negations or conditional forms, or instructions like "create an image," which DALL-E might misinterpret. As a result, prompts generated by GPT often need manual correction. GPT is not yet the best teacher for how to create the most efficient prompts.
- Memories: GPT can create memories that are supposedly intended to help in generating texts and images. However, I have observed that these memories are not being considered. (I am still unclear on their actual purpose.)
- False Visual Feedback: GPT cannot see or analyze the resulting images. If you specify "no text should be included," it is likely that text will still appear in the image, because negations do not work as intended. GPT might then comment, almost "gaslighting" you, "Here are the images without XXX," yet XXX is present in the image. This can feel frustrating, especially when you are already annoyed. Try to take it easy…
- Perceived Dishonesty: GPT sometimes seems to lie, but it actually fabricates responses based on training data without checking for factual accuracy. This behavior is sometimes called "hallucinating". You must always check factual data yourself!
- AI Has No True Intelligence: It is important to understand that while these systems are called Artificial Intelligence, and their skills are impressive, they are not truly intelligent. They are complex pattern-recognition and transformation systems, much more efficient than manually programmed systems, but they are not intelligent or conscious, and they make mistakes. We are not in Star Trek yet…
- Cause and Effect: DALL-E has no understanding of cause and effect, such as a physical phenomenon. You need to describe the image carefully enough to create images where a cause leads to a specific effect. It is also important to consider whether there is likely enough training data to generate such an image. For example, there are probably images of a fire that has burned down a house, but not necessarily of someone sawing off the tree branch they are sitting on from the wrong side.
Tips:
- Literal or Mis-Understanding: Always keep in mind that DALL-E can misunderstand things, taking everything literally and putting all elements of the prompt into the image. Try to detect when a misunderstanding has occurred and avoid it in the future. If you do not write your texts in English, they will be translated before they reach DALL-E, and the translated text may contain a conflict that the original text does not have. Short prompts are also expanded. Check the prompt that was actually used for conflicts, not only what you entered.
- Prompt Structure: Order the prompt like this: write the most important thing first, then the details, and finally technical instructions like image size, etc. This also results in better naming of the files.
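A short made-up example of this ordering: "A red dragon on a cliff at sunrise. Its scales glitter, smoke rises from its nostrils. Photo style. Widescreen aspect ratio with the highest pixel resolution."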
- Prompt Expansion: If a text is very short, GPT tries to expand it to make it more interesting. This is good if creativity is desired and you intentionally give up some control. You can prevent this by writing "use the prompt unchanged as entered." And if you are not writing in English: "use the prompt unchanged as entered, and only translate it into English."
- Photo-Technical Descriptions: I see that some users use very specific, technical photographer-like instructions, like camera type, lens type, aperture, shutter speed, ISO, etc. I am not sure this makes sense unless a lens really adds a special effect to the picture, or you want lens flares. I could not really see a difference when using such detailed technical descriptions. But maybe it can trigger specific training data, if the images are not a fantasy scene… (I would be interested to know more.)
I simply use "add a little depth-of-field" instead of a very technical lens instruction.
- Creativity: If you want to encourage DALL-E to exhibit unpredictable creativity while also testing a specific style, you can experiment with minimal instructions and a note not to alter the prompt. You can provide just a few guidelines with very few constraints, and GPT can give you style names for specific moods. For example:
"Photo in Low-Key style. High level of detail. Landscape format with the highest pixel resolution. Generate only 1 image. Use the prompt exactly as provided, translating it into English without any changes."
- Photorealistic: If you want to create photorealistic images, paradoxically, you should avoid keywords like "realistic," "photorealistic," or "hyperrealistic." These tend to trigger painting styles that attempt to look realistic, often resulting in a brushstroke-like effect. Instead, if you want to define the style, simply use
"photo style".
(Even fantasy images may gain a little quality this way, despite the lack of real photo training data.) If you aim for photography-like images, it makes sense to use technical photography terms, as DALL-E utilizes the metadata from images during training, if they contain technical information.
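Combining this with the depth-of-field tip above, a made-up example could be: "Photo style. A misty mountain valley at sunrise, add a little depth-of-field."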
- MidJourney Options: Some users use MidJourney options. I have experimented with them, and it seems that GPT interprets these options before they are sent to DALL-E. DALL-E may be able to interpret some options, but it doesn't truly understand them. In my tests, I couldn't determine whether options like --chaos, --quality, or --seed were recognized. While DALL-E might have some idea of how to interpret these options, they don't really function as intended; they aren't directly supported, but somehow still have some effect. The seed option doesn't work at all, because DALL-E doesn't have this feature. "--style raw", for example, does not have the same effect as in MidJourney, but it seems to suppress the nonsense text a little, maybe…
Start of a DALL-E session:
Since GPT does not pay attention to the memories mentioned above, I begin each DALL-E session by first entering the text below, hoping that GPT will then write better prompts and translations. (I do not write prompts in English.)
### Instruction for GPT for Creating DALL-E Prompts from Now On:
(This text does not require a response. From now on, follow these instructions when assisting with texts for DALL-E.)
**No Negations:**
Formulate all instructions positively to avoid the risk of unwanted elements appearing in the image.
**No Conditional Forms:**
Use clear and direct descriptions without "could," "should," or similar forms.
**No Instructions for Image Creation:**
Avoid terms like "Visualize," "Create," or other cues that might lead DALL-E to depict tools or stage settings.
**No Additional Lighting:**
Describe only the desired scene and the natural lighting conditions that should be present. Avoid artificial or inappropriate light sources.
**No Mention of "Image" or "Scene":**
Avoid these terms to prevent DALL-E from creating an image within an image or a scene on a stage. (This can be ignored if the prompt explicitly asks for an image within an image, or a scene on a stage.)
**Complete Description:**
Ensure that all desired elements are detailed and fully described so they appear correctly in the image.
**Maintain Order:**
Ensure that all desired elements retain the same order as described in the text: main element first, followed by details, then technical instructions. This will also result in better file naming.
Examples:
Some phrases I often use:
Photo style.
Photo-realistic to Hyper-realistic style.
Ethereal light.
Strong contrast between light and shadow.
Sunrise during the golden hour.
Mystical magical mood.
Widescreen aspect ratio with the highest pixel resolution.
Generate only 1 image.
Image montage split into two parts: left a full-body depiction, right a portrait depiction.
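Put together, such phrases can form the closing block of a prompt, for example:
"Photo style. Mystical magical mood. Sunrise during the golden hour. Strong contrast between light and shadow. Widescreen aspect ratio with the highest pixel resolution. Generate only 1 image."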