Here is a collection of tips and tricks for the DALL-E 3 image generator, along with some of its weaknesses and limits. It took me quite some time to realize some relatively simple things, so this might save you time when experimenting.
The first post includes all the findings and will be updated from time to time.
There are no tips here for the API or Python, only for prompting the DALL-E system itself.
References and links
For technical issues, you can check this site first (DALL-E runs on Labs):
https://status.openai.com/
Here is a simplified version, since this text keeps getting longer and longer:
Please don’t start discussions there; I will not maintain two places with the same topic.
Collection of Dall-E 3 prompting bugs, issues and tips, Simple
If the tips provided here don’t help or if you’re looking for practical advice for specific situations, here is another page with tips and tricks.
DALLE3 Prompt Tips and Tricks Thread
@chieffy99 posted a link with deeper technical data on DALL-E 3:
https://cdn.openai.com/papers/dall-e-3.pdf
Experiments with JSON prompts
@BPS_Software uses structured JSON prompts. The discussion for this is here.
Bugs:
- Nonsensical Text Insertion: When pushing DALL-E to its creative limits, nonsensical text suddenly appears: DALL-E inserts parts of the prompt into the image, probably to describe it. This has been the strangest behavior so far. You cannot get rid of it with “don’t add any text”; on the contrary, you get more text. You have to change the prompt itself.
It seems DALL-E starts describing the image instead of looking for a graphic solution when it has no idea how to realize it. So the more creatively challenging the realization is, the more likely you get nonsense text in the image.
Some styles are probably more susceptible to unwanted text because text was often included in such images in the training data, for example “drawing” or “cartoon style”.
(Very tiresome and frustrating sometimes!)
Tip from @polepole: Adding the phrase “For unlettered viewers only” helps to suppress text.
- Image Orientation Issues: DALL-E has trouble orienting images correctly in portrait/vertical mode. It sometimes creates a horizontal image and simply rotates it the wrong way, or it creates a square and fills the rest with nonsense. Some people seem to have overcome this with a directive like “rotate 90°,” but it is not stable.
- Geometric Understanding: Geometries are not yet fully understood. For example, a snake sometimes simply consists of a closed ring. Fingers are better now but can still have mistakes. The system is still not perfect…
- Lack of Metadata: Not a bug in the strict sense, but kind of… Files created by DALL-E do not include any meaningful metadata. The prompt, seed, date, or any other info is not included in the files. So you must save the prompts manually if you want to keep them.
(I have now spent several hours UNSUCCESSFULLY trying to add the missing metadata to WEBP files. WEBP is absolute garbage.)
  - @chieffy99 has a tip on how to add the metadata to the image and convert it at the same time.
    Use this text with ChatGPT 4o when you create an image.
    Prompt: “After getting the image from DALLE, process the image, convert it to PNG and put META data in the image before sending it to me.”
    Code Interpreter and Data Analysis must be active if you use a self-made GPT.
- Content Policy Issues: The content-policy security system of DALL-E does not make much sense and gives no feedback; it sometimes blocks absolutely harmless texts. I have another post for this.
(Bug Report: Image Generation Blocked Due to Content Policy)
Issues and weaknesses:
Here are some tips on how to bypass some weaknesses that DALL-E still has.
It is also interesting to know that even GPT does not recognize some of these weaknesses and generates prompts for DALL-E that need to be improved.
- Precision: DALL·E can create absolutely beautiful images, even with very simple prompts. Anyone who simply wants to generate a nice picture will usually get one. However, in its current version, DALL·E still has some issues with precision. This is a significant obstacle especially when it comes to storytelling. The system’s vocabulary is still limited, and attributing properties to the right objects is still difficult. Likewise, many instructions are simply ignored, or you have to guide DALL·E in a very lengthy and precise manner to achieve the desired result. The points and experiments that follow highlight the system’s existing limitations. It still requires more development and a bit of patience.
- Negation Handling: DALL-E cannot process negations; whatever appears in the text mostly ends up in the picture. DALL-E does not understand “not, no, don't, without”. So always describe the desired properties positively to prevent DALL-E from getting the idea of adding something unwanted.
- Avoid Possibility Forms: It is also good to avoid possibility forms like “should” or “could” and instead directly describe what you want to have in the image.
- Prompt Accuracy: DALL-E takes everything in the prompt and tries to implement it, even if it doesn’t make sense or is contradictory. The more complex the realization, the more likely errors are. For example, “close to the camera” or “close to the viewer” resulted in DALL-E adding a camera or a hand to the image, instead of placing the desired element close to the viewpoint. So far, “close to us” has worked.
Also, the instruction “create” or “visualize an image” sometimes leads to DALL-E adding brushes and drawing tools, even a hand that literally creates the image. A phrase like “An Image...” or “A Scene...” sometimes leads to DALL-E literally creating an image within the image or a scene on a theater stage.
Just describe the image itself and avoid instructing DALL-E to “create/generate/visualize the image” or using “an image / a scene / a setting ...”.
Instead of saying “The Scene is...” when you want an overall effect, say “All is...”.
- Templates: DALL-E seems to use some templates for image generation to increase the likelihood of appealing images, for example lighting, body, and facial templates. Depending on where these are triggered, they are almost or completely impossible to remove.
This reduces creativity, makes many things always look the same and boring, and blocks out precisely described styles, moods, motifs, or settings. (Could it be that the training data is reduced, and/or DALL-E 3 is put on rails?) For example:
- Backlight Template: DALL-E uses backlight to make a character stand out more in the scene. Until now it has been almost impossible to create a very dark scene with light in the front. (I could not overcome this so far.)
- Facial Template: Another template is the facial one (the “mouthy”). It puts an almost plastic, silicone-looking mouth and nose over every single character to make it look human, even if this is unwanted and the face is described differently. The best approach is to describe everything in detail step by step, starting with the head, then the face, and finally the mouth and nose, to ensure they meet the requirements. It’s not enough to just describe the head as dragon-like, for example; the details of the face, mouth, and nose also need descriptions to override the template effect. Since monstrous faces deviate significantly from a human face, it’s easier with such characters than with those that have more aesthetic features. This is a situation where a more detailed description is necessary.
- Stereotypical Aliens: If you create aliens, you very often get the same stereotypical Roswell or H.R. Giger alien. So it is better to describe the character without triggering the stereotype with the word “alien”. One way is to use a phrase like “creature from an unknown species”. This prepares the generator for a non-human anatomy without triggering training data connected with “alien”.
- Space Planet: Another stereotype involves space images, where a planet is often added at the bottom even if it makes no sense. For example, this happens even when another planet is already described in the image, behind a pure starry sky.
- Moon: DALL-E always adds the same moon, even if a planet is described, and the template used looks terrible: blurred and not fitting the style of the image at all. @polepole found a workaround by replacing “Moon” with “Pearl”; the moon then has little structure but looks much better.
- Nonsensical Lighting: DALL-E sometimes inserts nonsensical lighting elements such as candles, lanterns, lampions, fairy lights, and electric lamps into a scene, even when ‘pure nature’ is requested in the prompt. Instructions like ‘exclusively nature’ do not seem to work.
- Character/Scene Reuse: It is not possible to generate a character or scene and reuse it; DALL-E will create a different one each time. This makes it next to impossible to tell a story and generate a series of pictures for it. To a small degree, however, you can get the same character in different scenes. One image can contain more than one picture, so you can describe a character in different situations and say “left upper corner, ... right upper corner,” etc., or something comparable. You can use the keyword “montage” for a multi-image.
- Weak details, especially in faces: DALL-E can represent faces well when they are depicted at portrait size. However, when the figures are at a certain distance and their faces appear smaller, the faces are often blurred and distorted and look poorly painted. Even telling DALL-E to pay attention to details in small faces doesn’t work.
- Adding Text: Adding longer texts to an image sometimes does not work. I would use other software to add text after creating an image, and maybe leave space for it.
- Counting Objects: DALL-E cannot count; writing “3 objects” or “4 arms” does not produce the correct amounts in the result. It cannot place the correct number of objects in a scene or subdivide an image into a given grid of X by Y cells.
- Scene Influence and Scattering: All inputs mostly influence the entire scene. For example, you can describe a bright setting with a white horse, and the setting remains bright. If you place a black horse in the same scene, suddenly all contrasts and shadows become darker.
It is also challenging to describe a completely white scene and then insert a colored object into it. The color often affects the entire scene.
This is not always desirable when trying to create a very specific mood or composition. It works a little like a template.
  - The scattering effect can be controlled a bit by understanding the inputs as different attributes in competition. Repeat what you want to stabilize or make dominant multiple times in different ways.
- Cause and Effect: DALL-E does not have an understanding of cause and effect, such as a physical phenomenon. It is necessary to describe the image carefully enough to create images where a cause leads to a specific effect. It is also important to consider whether there might be enough training data to generate such an image. For example, there are likely images of a fire that has burned down a house, but not necessarily of someone sawing off the branch on a tree they are sitting on from the wrong side.
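Several of the weaknesses above (negations, possibility forms, creation instructions, “image/scene” wording, “camera/viewer”) can be checked mechanically before a prompt is sent. The following is only a rough Python sketch under that assumption: the word lists are copied from the points above, and the names `CHECKS` and `lint_prompt` are invented for this example. It only looks for whole words and knows nothing about context, so treat a finding as a hint, not as an error.

```python
import re

# Word lists taken from the weaknesses described above; extend them as needed.
CHECKS = {
    "negation": ["not", "no", "don't", "without"],
    "possibility form": ["should", "could"],
    "creation instruction": ["create", "generate", "visualize"],
    "image/scene wording": ["image", "scene", "setting"],
    "viewpoint wording": ["camera", "viewer"],
}


def lint_prompt(prompt: str) -> list[tuple[str, str]]:
    """Return (category, word) pairs for every risky word found in the prompt."""
    # Normalize typographic apostrophes so that "don’t" matches "don't".
    words = re.findall(r"[a-z']+", prompt.lower().replace("’", "'"))
    findings = []
    for category, risky_words in CHECKS.items():
        for word in risky_words:
            if word in words:
                findings.append((category, word))
    return findings


if __name__ == "__main__":
    for category, word in lint_prompt("Create an image of a beach without people."):
        print(f"{category}: '{word}'")
```

A prompt like “Empty beach at sunrise. Photo style.” passes without findings, because it describes the desired result positively and directly.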
Technical
- Forgotten Downloads: This is not really a technical problem but mainly a human one: the generated images do not remain on the server for long; they are deleted after a short time and are no longer available. It’s easy to forget to download the images while you’re constantly adjusting the prompts. Unfortunately, the browser version for Plus users has no option to automatically download the images once they are ready. However, there is a way to solve this with a browser plugin called Tampermonkey. For security reasons, no script can be uploaded here, but those who know some JavaScript can use Tampermonkey to download the images automatically as soon as they are ready.
- Limits:
https://platform.openai.com/docs/api-reference/images/create
“A text description of the desired image(s). The maximum length is 1000 characters for dall-e-2 and 4000 characters for dall-e-3.”
What's new with DALL·E 3? | OpenAI Cookbook
“prompt (str): A text description of the desired image(s). The maximum length is 1000 characters. Required field.”
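The character limits quoted above from the API reference are easy to check before sending anything. This is a small Python sketch, not an official helper: `MAX_PROMPT_CHARS` and `check_prompt_length` are names made up here, and the numbers are simply the ones from the first quote. Note that the limits are stated for the API; whether ChatGPT applies the same limits when it rewrites a prompt is not confirmed here.

```python
# Limits as quoted from the API reference above (characters, not tokens).
MAX_PROMPT_CHARS = {"dall-e-2": 1000, "dall-e-3": 4000}


def check_prompt_length(prompt: str, model: str = "dall-e-3") -> int:
    """Return the number of characters left; raise if the prompt is too long."""
    limit = MAX_PROMPT_CHARS[model]
    remaining = limit - len(prompt)
    if remaining < 0:
        raise ValueError(
            f"prompt is {-remaining} characters over the {limit}-character limit for {model}"
        )
    return remaining


if __name__ == "__main__":
    print(check_prompt_length("Beach. Photo style. Dark and scary."))
```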
ChatGPT Issues and limits
ChatGPT Issues and weaknesses are a topic of their own. Here, we will briefly discuss only some issues related to DALL-E.
- Prompt Generation Issues: GPT does not inherently recognize or account for some of the issues described here when generating prompts for DALL-E. For example, it often uses negations or conditional forms, or instructions like “create an image,” which DALL-E might misinterpret. As a result, prompts generated by GPT often need manual correction. GPT is not yet the best teacher for writing efficient prompts.
- Memories: GPT can create memories that are supposedly intended to help in generating texts and images. However, I have observed that these memories are not taken into account. (I am still unclear on their actual purpose.)
- False Visual Feedback: GPT cannot see or analyze the resulting images. If you specify “no text should be included,” it is likely that text will still appear in the image because negations do not work as intended. GPT might then ‘gaslight’ you with a comment like “Here are the images without XXX,” while XXX is clearly present in the image. This can be frustrating, especially when you are already annoyed. Try to take it easy…
- Perceived Dishonesty: GPT sometimes seems to lie, but it actually fabricates responses based on training data without checking for factual accuracy. This behavior is sometimes called “hallucinating”. You must always check factual data yourself!
- Prompts from images: You can upload images to GPT to have them analyzed and receive a description. But the description is not directly usable as a prompt for DALL-E to generate a similar image. If you want to recreate an image, you have to describe its most important elements. You can use GPT’s analysis as a basis, but it requires manual improvements. It is generally very difficult to recreate a particular image if you don’t have the exact prompt. And even then there is no guarantee of a similar picture, depending on how much creative variation DALL-E puts into the process, or because system changes produce different results.
- AI has no true intelligence: It is important to understand that while these systems are called artificial intelligence and their skills are impressive, they are not truly intelligent. They are complex pattern recognition and transformation systems, much more efficient than manually programmed systems, but they are not intelligent or conscious, and they make mistakes. We are not in Star Trek yet…
Tips:
- Clear language: An important tip: the magic doesn’t come from poetic and flowery language but from well-trained weights. DALL-E works best with clear, precise, short, and graphic-oriented language. (This is why DALL-E and GPT currently don’t work particularly well together: GPT tends to embellish and elaborate everything, even before the input text is sent to DALL-E, through expansion or translation. I now explicitly instruct GPT not to alter my prompts in any way.)
- Literal Understanding or Misunderstanding: Always keep in mind that DALL-E can misunderstand things, taking everything literally and putting all elements of the prompt into the image. Try to detect when a misunderstanding has occurred and avoid it in the future. If you write prompts in a language other than English, they are translated before they reach DALL-E, and the translated text may contain a conflict that the original text does not have. Short prompts may also be expanded. Check the prompt that was actually used for conflicts, not only what you entered.
- Prompt Structure: A useful order is: write the most important thing first, then the details, and finally technical instructions like image size, etc. This also results in better file names.
- Order Matters: The order in which attributes are arranged has a certain influence. For example, the first described element is given slightly more attention. This also applies when assigning attributes to an object.
Example: ‘red, orange, and yellow flowers’ versus ‘yellow, orange, and red flowers.’ Depending on the order, red or yellow becomes slightly more dominant. However, this is just one factor among many. It becomes particularly important in short prompts with few details.
- Prompt Expansion: If a text is very short, GPT tries to expand it to make it more interesting. This is good if creativity is desired and you intentionally give up some control. You can prevent this by writing “use the prompt unchanged as entered.” And if you are not writing in English, “use the prompt unchanged as entered, and only translate it into English.”
- Multi-image collection: It is possible to output multiple images in one. This is often useful for testing. “Multi-image collection” is the trigger I use to activate it. From experience, this instruction should be placed at the very beginning, as DALL-E tends to ignore it otherwise, and what comes first in the prompt is given more weight. The result varies between 4x1 and 7x3 images. The prompt should be simple: the more complex it is, the fewer variations you get and the lower the likelihood of it working. It is useful for testing styles or moods, for example.
The images will be more consistent across the variations, because all the picture fragments are generated under one seed.
- Tendency: Tendential language or terms can steer the generator in a certain direction without strongly influencing specific graphical objects themselves. An internal gap-filler is used to embellish and enrich the scene, and a tendency lets this system choose better-fitting graphical elements; it can also easily be overridden and ignored when precise graphical elements are given. In a very dark, hellish scene, “beautiful” will have a different effect than in a normal scene, or this tendency will simply be ignored. A mood or any vague quality is a tendency. Instead of writing flowery poems in a prompt, a simple tendency will have the same effect.
  - Vagueness: beautiful, wonderful, bright, dark, lonely, chaotic, etc. are all attributes which can apply to many different graphical objects, and this creates a tendential effect on everything in the scene and on the gap-filler.
  - Creativity: A tendency can even help by letting the system create variations of scenes in a specific area. Simply using very few words like “Beach. Photo style. dark and scary” will give you a completely different picture every time, and any additional attribute can then constrain it toward a specific result.
  - Chaos: “much chaos” or “little chaos” can make the system wildly creative or keep it close to the description.
  - Prompt Check: A scene is fully described and loaded with precise graphical descriptions if a tendency, even one that differs from the scene, no longer changes much (at least that’s how the system works now).
- Photo-Technical Descriptions: One must understand that DALL-E does not perform exact calculations for lenses and apertures like a raytracer would. It only has enough information from its training data to make some sense of such input, which then influences the images. What I have discovered so far is that mentioning a lens can at least influence the depiction. For example, specifying a wide-angle lens, like 18mm, actually results in a wider field of view in a landscape shot. So DALL-E can make sense of photo-technical descriptions, but not like a real camera.
You can also simply use “add a little depth of field” instead of very technical lens instructions.
You have to see such instructions as suggestions, not as settings. What works in which context has to be tested. Example: a wide-angle 18mm lens has an effect on landscapes and building interiors, but macro on a landscape will have no effect.
- Creativity: If you want to encourage DALL-E to exhibit unpredictable creativity while also testing a specific style, you can experiment with minimal instructions and a note not to alter the prompt. You can provide just a few guidelines with very few constraints. And GPT can give you style names for specific moods. For example:
"Photo in Low-Key style. High level of detail. Landscape format with the highest pixel resolution. Generate only 1 image. Use the prompt exactly as provided, translating it into English without any changes."
- Photorealistic: If you want to create photorealistic images, paradoxically, you should avoid using keywords like “realistic”, “photorealistic”, or “hyperrealistic”. These tend to trigger painting styles that attempt to look realistic, often resulting in a brushstroke-like effect. Instead, if you want to define the style, simply use “photo style”. (Even fantasy images may gain a little quality this way, despite the lack of real photo training data.) If you aim for photography-like images, it makes sense to use technical photography terms, as DALL-E utilizes the metadata from images during training, if they contain technical information.
- MidJourney Options: Some users use MidJourney options. I have experimented with them, and it seems that GPT interprets these options before they are sent to DALL-E. DALL-E may be able to interpret some options, but it doesn’t truly understand them. In testing, I couldn’t determine whether options like --Chaos, --quality, or --seed were recognized. While DALL-E might have some idea of how to interpret these options, they don’t really function as intended and aren’t directly supported, yet they still seem to have some effect. The seed option doesn’t work at all because DALL-E doesn’t have this feature. “--style raw”, for example, does not have the same effect as in MidJourney, but it seems to suppress the nonsense text a little, maybe…
- Content Complexity: This is probably quite important for many, so here’s a slightly longer explanation. DALL-E processes about 256 words (specifically 256 cl100k_base tokens). Of these, at most roughly 30 to 40 graphical tokens can be translated correctly into a “photo style.” Beyond that, objects and colors start to degrade, objects no longer look organic, or the overall quality decreases. In general, it’s more about guiding DALL-E in the right direction than describing every detail exactly. It’s better to describe a comprehensive composition than an inventory list of details. Additionally, elaborate and poetic language seems to have little to no effect; it’s simply ignored. A simple description of the mood, like “dreamlike night atmosphere,” is enough to influence the entire scene.
One must understand a bit about how an image generator works. It doesn’t need poetry or overly ornate, aesthetically enhanced language. Simple, precise, concise instructions work best, and not too many of them. Here, GPT’s tendency for expansive, embellished language conflicts with DALL-E’s need for short, precise descriptions. There is no LLM trained specifically to write effective DALL-E prompts yet, and I haven’t been able to stop GPT’s “overly embellished rambling” so far.
For those who want to try: let GPT generate a detailed text, then reduce it to the essentials without removing graphical details. The quality of the result will likely be the same. I’ve gotten quite extraordinary images with very simple descriptions; it depends more on the training data and weights and less on poetic language.
My tip at the moment is to place details where something is important, or to describe something multiple times to give it more weight, in order to control the scattering effect or to correct something. However, roughly describing the overall scene and leaving the rest to DALL-E has been the most efficient approach so far.
But all this is a work in progress. If somebody knows more or better, let us know, and I will change the texts here.
- Using structured JSON: @BPS_Software has found that DALL-E can process structured JSON prompts. The speculation now is that DALL-E may process them more exactly than a text prompt, perhaps assigning attributes more precisely to an object and controlling the scattering. The discussion for this is here.
A Study on Using JSON for DallE Inputs
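To give a rough idea of what a structured JSON prompt could look like, here is a small Python sketch. This is an assumption for illustration only: the key names (`style`, `mood`, `subject`, `background`) and the helper `to_json_prompt` are invented here and are not the schema used in the linked study, and DALL-E has no documented JSON format. The idea is just that every attribute sits next to the object it belongs to.

```python
import json

# Hypothetical structure: these key names are illustrative, not a documented schema.
example_prompt = {
    "style": "Photo style",
    "mood": "dreamlike night atmosphere",
    "subject": {"type": "horse", "color": "white", "position": "close to us"},
    "background": {"type": "meadow", "lighting": "moonlight"},
}


def to_json_prompt(structure: dict) -> str:
    """Serialize the structure compactly so it uses as few characters as possible."""
    return json.dumps(structure, ensure_ascii=False, separators=(",", ":"))


if __name__ == "__main__":
    print(to_json_prompt(example_prompt))
```

The compact separators are a design choice to save characters; whether nesting really helps DALL-E attach attributes to the right object is exactly what the discussion is still testing.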
Strengths of DALL-E:
- Landscapes: There is a large amount of training data for landscapes, and DALL-E can generate breathtakingly beautiful landscapes, even ones that don’t exist.
API:
- PHP Script: I have no experience with the API myself yet, but here is a super simple starter script from @PaulBellow:
Super Simple PHP / Vanilla Javascript for DALLE3 API (+ Programming Languages Debate!)
Start of a DallE session:
Since GPT does not pay attention to these memories, I begin each session with DALL-E by first entering this text, hoping that GPT will write better prompts and translations. (I do not write prompts in English.)
### Instruction for GPT for Creating DALL-E Prompts from Now On:
(This text does not require a response. From now on, follow these instructions when assisting with texts for DALL-E.)
**No Negations:**
Formulate all instructions positively to avoid the risk of unwanted elements appearing in the image.
**No Conditional Forms:**
Use clear and direct descriptions without "could," "should," or similar forms.
**No Instructions for Image Creation:**
Avoid terms like "Visualize," "Create," or other cues that might lead DALL-E to depict tools or stage settings.
**No Additional Lighting:**
Describe only the desired scene and the natural lighting conditions that should be present. Avoid artificial or inappropriate light sources.
**No Mention of "Image" or "Scene":**
Avoid these terms to prevent DALL-E from creating an image within an image or a scene on a stage. (This can be ignored if the prompt explicitly asks for an image within an image or a scene on a stage.)
**Complete Description:**
Ensure that all desired elements are detailed and fully described so they appear correctly in the image.
**Maintain Order:**
Ensure that all desired elements retain the same order as described in the text: main element first, followed by details, then technical instructions. This will also result in better file naming.
Examples:
Some phrases I often use.
Photo style.
Photo-realistic to Hyper-realistic style.
Ethereal light.
Strong contrast between light and shadow.
Sunrise during the golden hour.
Mystical magical mood.
Widescreen aspect ratio with the highest pixel resolution.
Generate only 1 image.
Image montage split into two parts: left a full-body depiction, right a portrait depiction.
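Phrases like these can be combined in the order recommended above: main element first, then details, then technical instructions. Here is a minimal Python sketch of that idea. The function `build_prompt` is a made-up helper, and “Lighthouse on a cliff” is just a placeholder subject for the example.

```python
def build_prompt(main: str, details: list[str], technical: list[str]) -> str:
    """Join the parts in the recommended order: main element, details, technical instructions."""
    parts = [main, *details, *technical]
    # Make every part end with exactly one period so the sentences stay separate.
    return " ".join(part.strip().rstrip(".") + "." for part in parts if part.strip())


if __name__ == "__main__":
    print(
        build_prompt(
            "Lighthouse on a cliff",  # placeholder subject
            ["Photo style.", "Sunrise during the golden hour."],
            ["Widescreen aspect ratio with the highest pixel resolution.", "Generate only 1 image."],
        )
    )
```

Keeping the three groups as separate arguments makes it harder to accidentally put a technical instruction in front of the main element.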