Template Usage: DALL-E seems to use templates for image generation to increase the likelihood of appealing images, for example lighting, body, and facial templates. Depending on where they are triggered, they are difficult or outright impossible to remove. This reduces creativity, makes many things look the same and boring, and overrides precisely described styles, moods, motifs, or settings. (Could it be that the training data is reduced, and/or that DALL-E 3 is put on rails?)
- Initially, I didn’t consider templates. What stood out clearly was the moon, but I wondered about the weight of image selection at the image:vector level when adding a single characteristic.
- Facial Template: Another template is the facial one (the “mouthy”): it puts an almost plastic, silicone-looking mouth-and-nose template over every single character to make it look human, even when this is unwanted and the face is described differently in detail. (I have not been able to overcome this so far.)
- Is it?
- But you can use skill in describing appearances, combined with appropriate prompt techniques, to create a face or an expression. It’s challenging because DALL-E 3 doesn’t have an interface that makes this easy; you rely solely on the prompt. That doesn’t mean it’s impossible, though. The images I created when DALL-E 3 launched were portraits of people. I provided the original to GPT to create an initial prompt, then edited and refined it until it closely matched what I was capable of achieving.
- Stereotypical Aliens: If you create aliens, you very often get the same stereotypical Roswell or H.R. Giger alien. So it is better to describe the character and not trigger the stereotype with “alien”.
- If it’s an image of the gods of Olympus, you’re going to get Marvel’s gods, which you don’t want. And this is going to be a big problem in the public data space in the future, where models learn from the wrong data.
Character/Scene Reuse: It is not possible to generate a character or scene and then reuse it; DALL-E will create a different one. This makes it next to impossible to tell a story and generate pictures for it. To a small degree, though, it can be done: a single image can contain more than one picture, so you can describe one character in different situations and say “upper left corner, … upper right corner,” etc., or something comparable. You can use the keyword “montage” for a multi-image.
- I think there is a way to control it. But…
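The montage workaround above can be sketched as a small helper (the function and phrasing are hypothetical, not an official technique): it repeats the full character description in every panel, which is what nudges DALL-E toward drawing the same character throughout.

```python
def montage_prompt(character: str, situations: list[str]) -> str:
    """Build one montage prompt that repeats the full character
    description per panel, so the model is more likely to keep the
    character consistent across panels (up to four corner panels)."""
    corners = ["upper left", "upper right", "lower left", "lower right"]
    panels = [
        f"In the {pos} panel, {character} {action}."
        for pos, action in zip(corners, situations)
    ]
    return f"A montage of {len(panels)} panels. " + " ".join(panels)

print(montage_prompt(
    "a red-haired woman in a green coat",
    ["waters a plant", "reads a book", "rides a bicycle", "feeds a cat"],
))
```

Consistency is still not guaranteed; repeating the description only raises the odds that each panel matches.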
Counting Objects: DALL-E cannot count; writing “3 objects” or “4 arms” does not produce the correct amounts in the result. It cannot place the correct number of objects in a scene, or subdivide an image into a grid of given X × Y dimensions.
- OpenAI’s research paper on developing DALL-E 3 details the instruction prompt of the vision GPT, an assistant used to evaluate the generated images and check what DALL-E can do; this includes counting the number of objects, so some control is possible. Since launch, a number that exceeds a certain threshold in the system (I am not sure whether it is 3 or 5) will always be increased or decreased within some range. And a few months ago it was found that there is a chance (about 20%) of randomly getting the specified number.
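One workaround people use for the counting problem is to avoid numerals entirely and spell out each object individually with its own position. A hypothetical helper (names and position wording are my own, and success remains probabilistic):

```python
def enumerate_objects(noun: str, count: int) -> str:
    """Expand "<count> <noun>s" into one clause per object, since a bare
    numeral is often ignored. Only practical for small counts."""
    positions = ["on the far left", "left of center", "in the center",
                 "right of center", "on the far right"]
    if count > len(positions):
        raise ValueError("workaround is only practical for small counts")
    return ", ".join(f"one {noun} {p}" for p in positions[:count])

# Used inside a larger prompt, e.g.:
print(f"A wooden table with {enumerate_objects('red apple', 3)}.")
```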
Cause and Effect: DALL-E does not have an understanding of cause and effect, such as a physical phenomenon. It is necessary to describe the image carefully enough to create images where a cause leads to a specific effect. It is also important to consider whether there might be enough training data to generate such an image. For example, there are likely images of a fire that has burned down a house, but not necessarily of someone sawing off the branch on a tree they are sitting on from the wrong side.
- It stems from several factors, such as the platform you’re using, OpenAI’s system, and the foundational data. For example, the prompt you send is processed by DALL-E 3’s recaptioner, which generates a large number of synthetic captions. Anything you see in the image that wasn’t in the prompt is part of the recaptioner’s work. Additionally, the prompt may be divided into sub-prompts, each rendered as a layer. Although the model doesn’t synthesize text that changes the image’s meaning, this can still produce unintended elements in the image. The recaptioner’s function also aligns with the use of the gen-id: we generally don’t use the gen-id just to get what we wrote in the prompt, but also to include things not present in it. To establish relationships between image changes and the prompt, you need something that tells the recaptioner that the synthetic captions must align with the image you want.
In ChatGPT, the only tool I have is the prompt, but my prompts are structured like templates to control variables that affect four types of images, including methods for using prompts to create an environment that helps regulate outcomes. Since I have trouble reading and constantly use a translator for my prompts, I don’t view the prompt as text but as an object. When creating an image, I extract the necessary components from the prompt I’ve compiled, designating a main prompt and replacing elements where I want to make changes. This is similar to how the sub-prompt system works. It’s possible that breaking down the text I use helps divide prompts more easily.
I recommend reading additional papers, such as OpenAI’s research paper on developing DALL-E 3 (https://cdn.openai.com/papers/dall-e-3.pdf). After the bibliography, you will find research data that can help you understand the model’s behavior through an analysis of the researchers’ writing, such as system prompts. Research related to RPG (recaption, plan, generate) is also worth reading, and you should check out OpenAI’s research on CLIP, because it explains the principles behind making 4o do what I’m going to talk about next.
ChatGPT Issues
ChatGPT’s issues and weaknesses are a topic of their own. Here, we will briefly discuss only some issues related to DALL-E.
Prompt Generation Issues: GPT does not inherently recognize or account for the issues described here when generating prompts for DALL-E. For example, it often uses negations, conditional forms, or instructions like “create an image,” which DALL-E might misinterpret. As a result, prompts generated by GPT often need manual correction. GPT is not yet the best teacher for creating the most efficient prompts.
- How do you think an AI that doesn’t know DALL-E 3 would understand how to create a proper prompt? Its foundational knowledge dates from not long before DALL-E 3 first launched. It only knows DALL-E 2, but…
False Visual Feedback: GPT cannot see or analyze the resulting images. If you specify “no text should be included,” text will likely still appear in the image, because negations do not work as intended. GPT might comment, seemingly gaslighting you, “Here are the images without XXX,” yet XXX is present in the image. This can feel frustrating, especially when you are already annoyed. Try to take it easy.
- You need to first separate the roles of the prompts you’re using correctly. I’ve found that many people know how to set image sizes using the API but don’t know how to prompt for image sizes in ChatGPT. Similarly, with “no text should be included,” who are you speaking to—GPT or DALLE? You should clearly distinguish which part of the prompt communicates to whom. This factor is part of how I structure my prompts. Additionally, if the text that appears functions as a meaning within the prompt sent to generate an image, that indicates that the prompt contains factors that create chaos, as mentioned earlier.
More importantly, 4o can now access and view the images created by DALL-E 3 and process them immediately. The reason I know this is that I’m one of the few out of billions of users who observed the output over time and noticed abnormalities, which led me to study and gather related information. You can verify this by creating an image and sending this prompt in the same message as the one that generates the image: “Once the image is received from DALLE-3, convert it to PNG, add the image’s metadata, and send it back to me.”
Perceived Dishonesty: GPT sometimes seems to lie, but it actually fabricates responses based on training data without checking for factual accuracy. This behavior is sometimes called “hallucinating”. You must always check factual data yourself!
- Remember what I wrote earlier about ChatGPT having a tendency to follow along but sometimes lie. This behavior is not related to hallucinations in the context of mismatched input and output. Creating answers out of its own lack of information isn’t a hallucination but a behavior stemming from habit, training, and the system prompt that influences this kind of behavior.
AI Has No True Intelligence: It is important to understand that while these systems are called artificial intelligence, and their skills are impressive, they are not truly intelligent. They are complex pattern-recognition and transformation systems, much more efficient than manually programmed systems, but they are not intelligent or conscious, and they make mistakes. We are not in Star Trek yet…
- AI is not trained to answer incorrectly. Predicting human needs is not easy. Errors, beyond its tendencies, limitations, habits, and hallucinations, often arise because the question doesn’t align with the user’s actual intent. Think about how we would answer a question ourselves—AI thinks in the same way. The most effective use is not asking for an answer but asking for an opinion.
Tips:
Literal Misunderstanding: Always keep in mind that DALL-E can misunderstand things, taking everything literally and putting all elements of the prompt into the image. Try to detect when a misunderstanding has occurred and avoid it in the future. If you write texts in a language other than English, they are translated before they reach DALL-E, and the translated text can contain a conflict even when the original does not. Short prompts are also expanded. Check the prompt that was actually used for conflicts, not only what you entered.
- I once mistakenly sent a prompt in Thai, and I found that it tried to generate Thai characters—like a child learning something new.
Prompt Structure: Consider ordering the prompt this way: write the most important thing first, then the details, and finally technical instructions like image size, etc. This also results in better file naming.
- In this regard, I am different. Most people believe that the placement (beginning or end) matters, but I found that this is not true. Regardless of position, if the details or meaning are significant enough, the model will prioritize that object. Understanding “what is the smallest change in the prompt that will result in the biggest change in the image?” is key to this. Also, I have never specified image size within the prompt used to generate an image.
Photo-Technical Descriptions: I see that some users use very specific, photographer-like technical details: camera type, lens type, aperture, shutter speed, ISO, etc. I am not sure this makes sense unless a lens really adds a special effect to the picture, or you want lens flares. I could not really see a difference when using such detailed technical descriptions. But maybe it can trigger specific training data, if the images are not a fantasy scene… (I would be interested to know more.) I simply use “add a little depth of field” instead of a very technical lens specification.
- On this point, I include meaningless text, other models’ parameters, and DALL-E’s parameters as well. These texts, when included in a prompt, play different roles depending on the word. A word may have no impact on the system, but the meaning of certain words can influence the image’s meaning, like “vibrant”. Text the model understands, such as camera lenses, parameters from other models like MidJourney, or groups of meaningless text, may not affect the meaning of the image itself but act as part of the prompt’s structure. They influence the randomness of the image and can be used to alter the image without changing its meaning. However, using the same text in a new prompt functions similarly to naming a character or defining the meaning of an image. This also helps explain the case of inserting a continuous gen-id to create images while maintaining relationships, even if that ID is fake. Additionally, conflicting image sizes can arise when we ask GPT to specify a particular size (vertical/horizontal) but include contradictory terms in the prompt.
Photorealistic: If you want to create photorealistic images, paradoxically, you should avoid keywords like “realistic”, “photorealistic”, or “hyperrealistic”. These tend to trigger painting styles that attempt to look realistic, often resulting in a brushstroke-like effect. Instead, if you want to define the style, simply use “photo style”. (Even fantasy images may gain a little quality this way, despite the lack of real photo training data.) If you aim for photography-like images, it makes sense to use technical photography terms, as DALL-E utilizes image metadata during training when it contains technical information.
- Correct. The use of the word “realistic” is appropriate for other styles to achieve realism. You cannot change a photograph that is already real. Focusing on light, color, and texture within the image also plays an important role, meow!!!
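This keyword swap is easy to automate. A minimal sketch, assuming the tip above holds (the mapping to “photo style” is this thread’s suggestion, not an official rule):

```python
import re

# Realism keywords that tend to trigger painted looks, per the tip above.
REALISM_PATTERN = re.compile(
    r"\b(?:hyperrealistic|photorealistic|realistic)\b", re.IGNORECASE
)

def fix_style_keywords(prompt: str) -> str:
    """Replace realism keywords with the plainer 'photo style'."""
    return REALISM_PATTERN.sub("photo style", prompt)

print(fix_style_keywords("A hyperrealistic portrait of an old sailor"))
```

The word-boundary anchors keep “photorealistic” from being half-matched by the shorter “realistic” alternative.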
Strengths of DALL-E:
Landscapes: There is a large amount of training data for landscapes, and DALL-E can generate breathtakingly beautiful landscapes, even ones that don’t exist.

API:
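For the API, here is a minimal Python sketch that only assembles a request to the images endpoint without sending it (endpoint and payload fields follow OpenAI’s public API docs; the key shown is a placeholder):

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/images/generations"

def build_image_request(prompt: str, api_key: str,
                        size: str = "1024x1024") -> urllib.request.Request:
    """Assemble (but do not send) a DALL-E 3 generation request.
    Sending it would be: urllib.request.urlopen(request)."""
    payload = {"model": "dall-e-3", "prompt": prompt, "n": 1, "size": size}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

req = build_image_request("A misty mountain valley at dawn, photo style",
                          "sk-...")  # placeholder key, use your own
print(req.full_url)
```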
PHP Script: I have no experience with the API myself yet, but here is a super simple starter script from @PaulBellow: Super Simple PHP / Vanilla Javascript for DALLE3 API

Start of a DALL-E Session:
Since GPT does not pay attention to these memories, I begin each session with DALL-E by first entering this text, hoping that GPT will write better prompts and translations. (I do not write prompts in English.)
Instruction for GPT for Creating DALL-E Prompts from Now On: (This text does not require a response. From now on, follow these instructions when assisting with texts for DALL-E.)
No Negations: Formulate all instructions positively to avoid the risk of unwanted elements appearing in the image.
No Conditional Forms: Use clear and direct descriptions without “could,” “should,” or similar forms.
No Instructions for Image Creation: Avoid terms like “Visualize,” “Create,” or other cues that might lead DALL-E to depict tools or stage settings.
No Additional Lighting: Describe only the desired scene and the natural lighting conditions that should be present. Avoid artificial or inappropriate light sources.
No Mention of “Image” or “Scene”: Avoid these terms to prevent DALL-E from creating an image within an image or a scene on a stage. (This can be ignored if the prompt explicitly calls for an image within an image, or a scene on a stage.)
Complete Description: Ensure that all desired elements are detailed and fully described so they appear correctly in the image.
Maintain Order: Ensure that all desired elements retain the same order as described in the text: main element first, followed by details, then technical instructions. This will also result in better file naming.
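Most of these rules can be checked mechanically before sending a prompt. A hypothetical linter sketch (the word lists are illustrative, derived from the rules above, and deliberately incomplete):

```python
import re

# One regex per rule from the instruction list above; extend as needed.
RULES = {
    "negation":     r"\b(?:no|not|never|without)\b",
    "conditional":  r"\b(?:could|should|would|might)\b",
    "creation cue": r"\b(?:visualize|create|generate|depict)\b",
    "meta term":    r"\b(?:image|scene)\b",
}

def lint_prompt(prompt: str) -> list[str]:
    """Return a list of rule violations found in the prompt."""
    findings = []
    for rule, pattern in RULES.items():
        for match in re.finditer(pattern, prompt, re.IGNORECASE):
            findings.append(f"{rule}: '{match.group(0)}'")
    return findings

print(lint_prompt("Create an image of a beach; there should be no people"))
```

An empty result doesn’t guarantee a good prompt, but a non-empty one points at exactly the phrasings the guidelines warn about.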
- Some of the words you mentioned I haven’t noticed, possibly because the style of the generated images doesn’t result in the effect of those words.
The content towards the end is very good. Many people face similar problems but don’t think of solving them this way. It’s one way to create a pre-controlled image environment. Besides this method, creating certain images to stimulate the intended direction of the image is another way. For example, making a word or phrase hold a specific meaning for the image to be generated in the session, simplifying the subsequent prompts.
Lastly, you should be aware that the current version of DALLE on the ChatGPT platform has been significantly controlled and limited in its capabilities, for various reasons ranging from ethical governance to business considerations. The fact that you’ve studied and encountered issues along with their solutions within these limitations will enable you to effectively use models based on DALLE3 (despite minor limitations or differences in interfaces). Most importantly, DALLE3 now shows improvements in handling abstract or meaningless text. The occurrence of phrases that fail to generate coherent images and instead produce text has decreased, with the model now producing images that communicate interpretable concepts and express emotions more effectively.
Thank you very much for your post, which described everything in clear detail. It helped me revisit things I had taken for granted as normal and see aspects I had never known before. This made me take time to gather my thoughts and write as thoroughly as possible.