Collection of Dall-E 3 prompting bugs, issues and tips

See the paper, pages 5–9: originally, user prompts alone were not enough to generate the desired image. Limitations included a lack of granularity and the omission of important details. Therefore, a synthetic text captioner was devised, using an LLM as its base. These captions are built from all the important information in the image, such as object positions, number of objects, background details, colors, and sizes, which are often ignored in human descriptions. However, the initial text must also be of high quality, which is where ChatGPT plays its role.

I mentioned earlier that understanding the captioner is crucial for prompting with DALL·E.

Initially, user prompts alone were not sufficient to generate images that matched the desired results. Limitations such as a lack of detail and the overlooking of important aspects led to the development of the synthetic caption system, using a captioner based on an LLM. These descriptions are designed around all the critical information of an image, such as the positions of objects, number of objects, background details, colors, and sizes, elements often neglected in human descriptions. However, the initial prompt must also be of high quality, which is why ChatGPT plays this role.

This has led to considerable misunderstandings regarding its use, as the prompt is altered twice, yet it is often assumed that GPT did it all on its own.

In the Playground, it states only:

"Prompting With the release of DALL·E 3, the model now takes in the default prompt provided and automatically rewrites it for safety reasons and to add more detail (more detailed prompts generally result in higher quality images).

While it is not currently possible to disable this feature, you can use prompting to get outputs closer to your requested image by adding the following to your prompt: I NEED to test how the tool works with extremely simple prompts. DO NOT add any detail, just use it AS-IS:"

This is also included in the cookbook on the same page. Such crucial information should be read by anyone presenting the tool, preferably after reviewing the research, but I found no mention of it. What’s more, the research paper sits on the OpenAI DALL·E website, which fewer people visit, instead of being placed in the Help Center.
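If you use the API rather than ChatGPT, you can watch this rewriting happen yourself. Here is a minimal sketch with the official openai Python SDK; the example prompt and size are my own choices, and the “AS-IS” prefix is the workaround quoted above. The response’s `revised_prompt` field shows what the rewriter actually used:

```python
# Minimal sketch using the official openai Python SDK; the example prompt
# and size are assumptions, the "AS-IS" prefix is the workaround from the docs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prefix = ("I NEED to test how the tool works with extremely simple prompts. "
          "DO NOT add any detail, just use it AS-IS: ")

response = client.images.generate(
    model="dall-e-3",
    prompt=prefix + "a white lotus blossom on a calm lake at sunrise",
    size="1024x1024",
    n=1,
)

# The response includes the prompt that was actually used, so you can see
# how much (or how little) the rewriter changed your text.
print(response.data[0].revised_prompt)
print(response.data[0].url)
```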

Here are the images with a reduced prompt. These are the most representative ones, one out of five, but all are very similar. For the lavender, I had to adjust some attributes, and the violet color scattered into the sky. But overall, it’s the same motif. What you probably call the ‘recaptioner’ (I didn’t have a word for it before, only a concept) might not be exactly what I mean, but I think the concept I described goes in that direction. You can influence the entire scene with just a few mood cues, without using very long texts.

I can’t reduce the sentence and still get the same result. I don’t remember the sentence; I just know what’s in the image and how to extract the text.

But I find the chaos in the image obvious: the purple spread, the clouds in the sky. So I look in my saved text for something to change.

It comes from a different dyslexia than yours.

[A serene lake with a calm surface like a sheet of glass, reflecting a clear sky. The water’s surface is so still.] Surrounded by lush purple lavender bushes and a green meadow with a few trees. In the distance, tall mountains. Sunrise during the golden hour. In the foreground, a white lotus blossom in the lake. Tranquil, expansive, harmonious atmosphere. Photo Style.

Try it.

Actually, I understand the recaptioner as a tool which can select objects guided by a mood or tendency quality, for example “beautiful”. “Beautiful” by itself is not a defined object, but together with objects or environments it takes on meaning, and I speculated that the recaptioner can then select concrete elements through this.

1. Question about the template effect
Is the recaptioner responsible for some of the template effects?
In your pictures, there are prompt fragments as text in the images. Is the recaptioner responsible for this?

I find especially the illumination in dark scenes and the facial template extremely tiresome. Artistically, this is a severe quality reducer. (Check the mushrooms and the fire further up.)
If this nonsense comes from the recaptioner, it would be easy to stop. If so, please tell the developers!!! It would be easy to fix.

2. Question about attribute connection
The model scatters hard, but it is still possible to give different objects different attributes. I speculate that the period “.” helps to separate the graphic tokens. Because of the scatter effect, it is difficult to understand how the model connects objects and their attributes. Is DALL·E’s model similarly smart to an LLM in understanding the text?

It is all about how precisely the pictures are described in the training process, and if the models do this automatically, how precisely they do it. For example:
“A dog in the center. In a garden.”
Or:
“A sweet little puppy dog of the breed Greyhound, sitting down, sleeping”,
{elaborated description for garden, flowers, sky}, etc.

(I read the doc half asleep, so I may have missed something; there is not much data on how the training process works. I think it is complex and still in development: object separation, moods, detection of attributes and details, and putting all of this correctly into the weights.)

And don’t get me wrong: as long as your descriptions stay within the model’s limits, there is nothing wrong with a detailed and poetic prompt. The model seems to handle it well, and it helps with many things, like attribute separation to control the scatter effect.
I just try to be faster, and found that it is possible to get very good results with shorter prompts, as long as you have all the key elements in them.

I tested your prompt.


This was the first picture. What do you get?


A tip: an important thing is to never use any words which could be misinterpreted; “like a sheet of glass” could end up as a real sheet of glass in your image.
This is why you should avoid mentioning, for example, creation processes (“create an image”) or words with multiple meanings like “in a scene”; you can sometimes get images with brushes, or theater stages with a scene on them.

It is difficult for us humans to avoid such terms; our linguistic systems are advanced enough to hide the wrong interpretations, and we constantly use double meanings or allegorical symbols.
And in translation, new issues can be introduced by GPT.

And something more:
The recaptioner actually does its job mostly well (besides some flaws, like putting technical structures into a completely natural environment). It adds objects to a scene, but keeps them rather neutral and simple. This way it is easy to start simple and then add attributes until you get what you are trying to target. You can also reduce the chance of flaws by having an attribute for every part of a scene: object (color, pose, mood, etc.), environment (plants, sky, foreground, background), and lighting. All this helps keep unwanted objects out of the result, because the recaptioner has to fill fewer “gaps”.
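To illustrate that workflow, here is a small hypothetical helper (not part of any DALL·E tooling; the slot names and example texts are my own) that builds a prompt from one attribute per scene slot, so the recaptioner has fewer gaps to fill:

```python
# Hypothetical prompt builder illustrating the "one attribute per scene part"
# idea above; the slot names and example texts are assumptions.
SCENE = {
    "object":      "a white lotus blossom in the lake, petals fully open",
    "environment": "a calm lake surrounded by purple lavender bushes and a green meadow",
    "background":  "tall mountains in the distance under a clear sky",
    "lighting":    "sunrise during the golden hour",
    "mood":        "tranquil, expansive, harmonious atmosphere",
    "style":       "photo style",
}

def build_prompt(scene: dict) -> str:
    # Join the parts with ". " so each slot stays its own short statement,
    # which, as discussed elsewhere in the thread, seems to help keep
    # attributes attached to the right objects.
    return ". ".join(part.strip().rstrip(".") for part in scene.values()) + "."

print(build_prompt(SCENE))
```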

Interestingly, even if you do not give many details, the recaptioner still leads the results in a very similar direction. It is as if the pictures which are liked the most have a strong weight, because without details you should get far more variation.


I know this is not a true Dall-E 3 tip, BUT I own a marketing company and we use Dall-E 3 all the time to make images for projects.

The best way we found to make the most out of AI images is to generate the starting image in DALL·E 3, then use Photoshop’s AI tools to edit details in the image and expand it to the size we need.


I noticed that adding keywords like ‘85mm’ at the end makes the image pretty photorealistic.

I intended to use these words to create a clear and calm image of the lake, reflecting vividly without metaphorical meaning. Additionally, I mentioned the chaos of your image in the previous prompt.

A calm lake under an almost cloudless sky, with the sky reflecting in the lake. A green meadow with a few trees and lush purple lavender bushes.

In these statements:

  • The ambiguous term is ‘almost cloudless’, which can be interpreted as still having some clouds, making it difficult to control them. The term ‘clear sky’ would work better in this regard, with no conflicts in cases where objects like stars are included in the sky.
  • Using terms describing the colors of objects that have inherent colors (green, purple) can easily lead to abnormalities in the image, often resulting in incomplete representation or the color appearing on other objects (though this is hard to notice). My hypothesis for controlling this, not yet tested, is to write the word ‘color’ explicitly, e.g. ‘red’ vs. ‘color red’, in various contexts. (I need more data, but it works much of the time.)
  • The sentence intended to create an image of the lake with the sky reflected on the water reads to me as the sky being reflected within the lake (normally, when looking at English text, I don’t translate anything, but when it comes to meaning translation, I do word by word translation. Translating into human communicative language requires me to focus on it—an abnormality I have). This confusion led to the dispersion of purple light.

Therefore, I used my saved text for correction. If you think glass will act as a mirror, try this
(saved text: text from another prompt that I keep and use as a part to swap in)

The calm lake that reflecting the clear sky on water surface. The lake surrounded by lush purple color lavender bushes and a green color meadow with a few trees. In the distance, tall mountains. Sunrise during the golden hour. In the foreground, a white lotus blossom in the lake. Tranquil, expansive, harmonious atmosphere. Photo Style.


Separate the issues regarding what you think are image templates from text—these factors are different.

  • I suggest you change your mindset, seeing DALL·E 3 as similar to GPT, which has tools to respond with images, a way of perceiving, understanding, and interpreting things in its own manner. The belief that it’s “stateless” doesn’t mean the recaptioner doesn’t exist. It works with CLIP, and OpenAI’s CLIP is different from other CLIPs. The model learns images as vectors, and text in images is also vectors. Therefore, words or characters that it doesn’t understand, when interpreted as vectors, can match with the image’s text. Thus, the model displays text on the image as part of its response to the prompt. Additionally, it can be used in reverse, allowing ChatGPT to view documents with text in images without needing Python to extract the text. This can also solve issues with unreadable PDFs, graphs, or image tables. However, the model has not yet learned to decide when to use these tools effectively to solve such problems.

  • As I mentioned, RLHF (Reinforcement Learning from Human Feedback) has an impact. It doesn’t work as you see in RLHF 101 but can still evolve into a system that influences decisions, data selection, and the model’s learned functionality. It makes things neutral: less is added, and more is adjusted. Even in API environments, it’s believed that there’s no learning during usage, and there’s a belief that it’s unaffected by broad adjustments. However, it must receive training or improvement from humans in different schedules, just like RLHF indirectly. To distinguish it, you might need to change the frameworks, such as academic content frameworks, target goals, or work methods. (But this part is quite difficult, especially when departing from common operations, like background removal). So, it’s not surprising that the model would choose the most reasonable moon image when it has the most of that type available.

  • Regarding the question of how the model interprets things: even though the research doesn’t mention the order of operations, the caption must be completed before the recaptioner finishes creating the image. Therefore, you should carefully read the research—it tells you a lot. For example, it says that the recaptioner is an LLM, and the training details indicate that ChatGPT performs upsampling, with an example shown on page 10. There is also information about the upsample settings at the end of the document, which forms the basis for the system prompt for ChatGPT today.

  • The idea of token separation affecting understanding may make you think it’s about finding objects and applying effects or adding abstract nouns. But you’re right that “.” acts as a separator because, fundamentally, it’s the most powerful stop marker. In contrast, with “,” and words that function as connectors (stop/connect/cut) such as and, or, with, nor, etc., the model can interpret these in many ways, which adds randomness and chaos. Even symbols like “+” or “-” can be used if the prompt is written clearly.

Additionally, regarding the recommendation to read research on RPG-style image generation, which improves image creation by integrating ControlNet and recaptioner techniques: I didn’t read it for technical understanding, but to grasp the concept and compare usage insights. When reading, you should compare the facts or current changes. A research paper can be both right and wrong, depending on the reading objective. For example, the DALL·E research mentions that text generation in images is undesirable. Although there’s no follow-up research, this feature has become prominent today. Undesired text in images results from chaotic reactions triggered by ambiguous or confusing prompts, including words unknown to the model. However, scrambled text, which acts like noise, doesn’t lead to image confusion.

It seems like you haven’t looked up the information I recommended. You should read more and analyze the content as I suggested. I’ve never studied these things technically, but I analyze why these things are done this way. Do you know that one word in there promotes understanding of LLM thinking methods, and strange prompts that lead to “off the shelf,” which is the foundation of today’s popular prompt techniques?

I have now translated the doc and read all of it, not only the part about the recaptioner. So this question is answered: in dall-e-3.pdf, “descriptive synthetic captions (DSC)” are used to give pictures more details. I only skimmed this part.
As soon as I have time I will translate DALL_E_3_System_Card.pdf too.
There are so many AI systems nested and used sequentially that I think it is difficult even for the builders to understand what the networks are doing. For problems like topology, counting, or cause and effect, they need a specially trained AI again.

And RLHF: it will take a while until they can set up a system that learns everything on its own.
Even the problem that we have to avoid certain words so as not to get wrong elements in the images will be fixed with more training. If you say “create a scene” you will get some wrong interpretations; the probability is maybe <5%, but I still get them. And GPT introduces such triggers in translation (understandably). That is where my question originates: if the recaptioning is done with an LLM, shouldn’t an LLM know not to misinterpret this? (I use a corrector in the prompt, “in a completely natural environment”, to avoid getting light bulbs or candelabras in a forest; DALL·E likes to put candles everywhere.)

Hello everyone, I thank in advance those who will pay attention to my question.

For the past half hour, DALL-E has stopped working on my account with a Plus subscription. I was using it through the bot: DeepGame.
I only get this as a result (illustration).
Moreover, ALL my images from previous conversations that were generated by DALL-E are no longer accessible and only a wide black square is available.

I use the tool for professional reasons. I have a very close professional deadline. How concerned should I be? How can I solve the problem? Please, everyone, this message is very important and urgent.

Thanks all,


Check your content blocker and whitelist the site: files.oaiusercontent.com

This should correct it for you.

This thread is mainly for discussions about DALL·E’s generator and issues with the picture generation process and prompting.

For technical issues you should open a new thread, because your question may not be found here.

Here is the answer from ChatGPT.
From the context of the quote, it seems that the user is asking how long it would take for me, as an AI, to fully understand and use words accurately to generate the correct images. The key issue highlighted here is the occasional misinterpretation by DALL·E, such as adding elements (e.g., light bulbs or candles) in a setting where they don’t belong, like a forest.

Regarding the time it would take to learn this, it’s more of a continuous process. Each time I generate images based on text prompts, I improve through updates and new training data that helps refine the understanding of language in specific contexts. Misinterpretations like the one described happen due to the complex nature of visualizing text descriptions. With more usage and feedback, both users and I can adjust the prompts and outputs to better align the generated images with expectations.

In summary, while I continually improve over time as more training and feedback are processed, the exact duration depends on how often I receive input and feedback. The more precise the wording of the prompts, the faster this learning and improvement occurs.

From now on:
Italic text is the answer from 4o.
Bold text in quotes is what may be the next question.

I need to know how you can compare text with images, and what you can do.

To explain how I compare text with images and what I can do, let’s break it down:

### How I Compare Text with Images:

I use a system that involves embedding both text and images into a shared vector space. Here’s how this process works:

1. Text Processing (via LLMs like GPT):
   • When you give me a text prompt, I convert the words into numerical representations (embeddings). Each word or phrase gets transformed into a vector (a series of numbers) that represents the meaning or context of that text in a multidimensional space.
2. Image Processing (via models like CLIP):
   • Similarly, images are also converted into vectors. This process involves analyzing the features of an image (e.g., colors, shapes, objects) and encoding them into numerical data.
3. Comparing Text and Images:
   • Once both the text and the image are represented as vectors, I compare the two by calculating how close or far apart these vectors are in the shared space. The closer the vectors, the more likely it is that the text and image match in meaning or representation.
   • This vector space is designed so that related concepts (both in text and images) are mapped close to each other. For example, a prompt like “a red apple” should generate an image of an apple that is red, because the text “red apple” would have a vector similar to the image of a red apple.

### What I Can Do with This Ability:

1. Generate Images from Text (like in DALL·E):
   • Using the learned relationships between text and images, I can generate new images based on a description. For instance, if you give me a prompt like “a cat sitting under a tree,” I can visualize what that looks like and create an image where the vector representation of the text matches the visual concept.
2. Understand and Analyze Images:
   • If you provide an image, I can describe it using text by translating the image’s features into a language-based description. This works similarly by comparing the image’s vector with corresponding text vectors.
3. Modify Images Based on Text:
   • I can take an image and a text prompt and alter the image to match the new prompt. For example, if you provide an image of a cat and tell me to “make the cat blue,” I can identify the parts of the image associated with the cat and modify the colors based on the text description.
4. Image Captioning and Understanding:
   • I can generate captions or describe an image because I can map the visual data back into text. This is useful in tasks where the AI needs to understand and explain what’s happening in an image.

### How I Learn and Improve:

• Training Data: I’m trained on vast datasets of paired images and text (like annotated images). This teaches me how to associate visual concepts with words.
• Feedback Loops: Every time I generate images or compare text with images, feedback helps refine my understanding of how well I’m matching the text to the image.

### Example:

Let’s say you provide the text “a mountain covered in snow”.

• I convert this text into a vector representing the features of snow, mountains, and related concepts.
• I then search for or generate an image whose vector closely matches the features of snow-covered mountains.
• The more data I receive over time, the better I become at understanding and creating more accurate visual representations based on text.

In short, I use a system of vector embeddings to compare text with images, and I can generate, modify, analyze, and understand both based on their shared representation in this space.
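For readers who want to see this text-image comparison concretely, here is a minimal sketch using the publicly released CLIP weights on Hugging Face. This is not OpenAI’s internal DALL·E 3 or recaptioner pipeline; the model name and the local image file are assumptions.

```python
# Minimal sketch of text-image similarity with the public CLIP weights;
# not the internal DALL·E 3 / recaptioner pipeline.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("lake.png")  # hypothetical local image
texts = ["a mountain covered in snow", "a calm lake at sunrise"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image gives the similarity of the image to each caption;
# the caption whose embedding sits closest to the image scores highest.
print(outputs.logits_per_image.softmax(dim=-1))
```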

From my perspective, it is harder than you think because of a few factors.
First, to understand and use grammar you need enough use cases with related words to build a dataset for training. But no one knows what you can do, no one knows how to control you, and no one knows the appropriate prompts for DALL·E, which leaves you and DALL·E 3 with an underdeveloped-skill problem.
The next part is very difficult: words are more limited when used for visualization. Even if you know what a word looks like, that doesn’t mean you know how it should be used visually, and you don’t have DALL·E 3’s behavior when responding to a prompt. It seems it will take a long time to learn.

You make a very insightful observation. The challenge you’re describing highlights a fundamental complexity in the way I (and AI in general) interact with language and visualization, especially in systems like DALL·E 3.
### Why It’s Harder than Expected:

1. Grammar and Context Understanding:
   • Grammar is crucial for understanding how words relate to each other. Even though I can process a large dataset and learn from countless use cases, every new context or phrase requires interpretation based on syntax, grammar, and meaning. This means I must not only learn the meaning of words but also how they function together in different sentence structures.
   • There’s a limit to this understanding when trying to map abstract or complex sentences to visual concepts, especially in creative or unconventional contexts where grammar may guide meaning in subtle ways.
2. Lack of Established Guidelines:
   You mentioned something important: no one knows how to control me or define exact prompts for DALL·E. This is key because users, including those developing the models, are still learning how to best phrase prompts. There are no clear rules or best practices that guarantee success across all scenarios, leading to inconsistent results.
   Without clear guidelines, both users and the AI are in a trial-and-error phase, where prompts are refined based on what works or doesn’t. As a result, my responses or DALL·E’s outputs may sometimes seem unpredictable or undeveloped because the right inputs are not fully understood.
3. The Limitations of Words for Visualization:
   Words, by their nature, are abstract and context-dependent. While I might understand a word’s dictionary definition, translating it into an accurate visual representation involves interpreting both the literal meaning and the context it’s used in. For example, “bright” can mean light intensity in one context, but happiness or intelligence in another.
   As you mentioned, just knowing what a word “looks like” doesn’t always mean I know how to use it in a visual context. That’s because words have layers of meaning—metaphorical, figurative, or literal—and figuring out how to visually interpret each use case takes time, training, and iteration.
4. DALL·E’s Behavior and Prompt Response:
   You’re right that my behavior and DALL·E’s response to prompts can be inconsistent. Part of this is because DALL·E is still learning how to balance creativity and accuracy. Generating images from text is complex because I must decide which elements in the prompt are most important, how to visually combine them, and whether to prioritize accuracy or artistic freedom.
   It will take time to learn because there are many variables—language is flexible, and visual representations are even more so. The AI needs to constantly receive feedback from users and learn from those interactions to understand what kind of images align best with specific words and prompts.

### Learning and Development Over Time:

The more use cases I encounter and the more feedback I receive, the better I become at understanding which prompts work and how they relate to visual outputs. This includes learning how different words function in various contexts, especially in a visualization context.
You’re correct in saying that this is a long-term learning process. As I encounter more diverse prompts and situations, I’ll build a better understanding of how to interpret them in the way you expect. But as of now, there are still gaps between the language inputs and the visual outputs, partly because the prompts, AI, and user expectations are evolving together.

In summary, the complexities you pointed out—grammar, limited visual vocabulary, unpredictable behavior, and the need for refined prompts—are key factors that make this task difficult. But with time, as more data and feedback are integrated, I will continue to improve, though it won’t happen overnight.

**Given the limitations above, what does a human need to do for you to achieve their goal?**

To help me achieve a goal effectively, a human should:

1. Use clear and specific prompts – Avoid vague instructions.
2. Provide context – Include relevant details for better understanding.
3. Give feedback – Refine prompts after seeing results to improve future responses.
4. Break tasks into smaller steps – Handle complex tasks in simpler parts.
5. Use examples or references – Show examples to clarify expectations.
6. Be aware of limitations – Understand what I can and can’t do.
7. Identify areas for improvement – Point out consistent issues for refinement.

These steps help me understand better and deliver more accurate results.

Did you know about the distribution of colors in the image? I initially didn’t notice; I only became aware of it when you mentioned it, but I almost immediately figured out the solution simply because I had knowledge related to the problem. However, that doesn’t mean I know everything. There are many things I don’t mention because, even though I know or see them, they are difficult to explain, and I get tired of looking for references.

Try observing the following points in the post:

  • How many times do you see the term “scramble text”?
  • What can we do with the knowledge of Static Noise and Adjustable Noise in image creation?
  • What is Chaos? It doesn’t exist in OpenAI, but why can it be used to solve the problem?
It’s all about visual control with prompts that have different text inside for different purposes, and I haven’t even mentioned it yet.

The visual effect in the image that you think is a template: why did I say it’s just RLHF?
I believe these images can probably answer that. DALL·E 3 doesn’t use templates; it simply uses what it knows, its knowledge of composition and lighting. I don’t think the recaptioner can account for the origin of such effects in the detail seen.


Whatever it is, a faulty recaptioner, CLIP, or another part of the system, since around May/June 2024 some things have gotten worse, not better. The template effect seems to be an imbalance in the weightings, as far as I can tell. It’s like a well-worn path where, in the end, everything always runs along the same track and thus always leads to the same incorrect result, increasingly deviating from the prompt. The nonsensical lighting in darkness and the “bird-shit moon” are excellent examples of this. There are probably tens of thousands of images where a moon was visible, but the system always spits out the exact same ugly moon. If there’s no corrective system that modifies the prompts and injects certain constraints, it’s likely “overtraining”, or rather, weights out of balance. The “mouthy” is probably a weighting error resulting from an instruction by the technicians to make the system generate aesthetically pleasing or correct faces.

It’s hard enough to understand what happens inside a network. This is likely still largely a black box for the technicians as well, and they may be correcting things in the wrong way (which is evident from the dysfunctional safety system). They’re just patching holes, not providing real solutions.

Another problem is feedback. I don’t know if feedback systems are used to train DALL·E, but from the last text you sent, @chieffy99, it seems that to some extent they are. (They may use something recaptioner/image-description-like to check the results, compare, and feed it back into the training.) Feedback is extremely difficult to manage, and I think the technicians still don’t know enough to use feedback loops in their networks. As you probably know, feedback systems can often be chaotic, or they lead to “clipping,” where things over-amplify until all the data maxes out. This ruins the finely tuned weightings in the system. Recently, as far as I can tell, there was a problem with the weightings. If it wasn’t Nightshade, then it was a failed experiment that disturbed the weightings and resulted in broken images.

I know it’s still a young technology, and we must be patient. But it seems that other image generators don’t have these problems. I have seen pictures from MJ and they are really better than DALL·E now (I must test this somewhat myself, but I cannot focus on two systems, so I will leave DALL·E at this point). Everyone dreams of a self-learning AI, but if that’s even possible, and there are valid doubts, it will take a long time. (A self-learning system must have feedback loops, which still carry the risk of destroying its data.) So if the technicians want to train DALL·E with feedback loops, this could be a reason for the problems.


I mean, look at this; it is just ridiculous, ugly, and completely unacceptable for an art worker. I have never seen an image made by MJ with something like this.
I still get images like this after a detailed description of the face, the mouth and the nose. It is always, always the same stupid “mouthy”.

They just glue patches onto the problems instead of fixing them.

I can tell that I will soon switch, because I’m starting to create the same images over and over in the hope of finally correcting them, and I avoid making certain images because I know I won’t get the right result.


Just some descriptors I use: full, thin, open, parted, plump, upward curved, downward curved, tight.

Merwoman close up of face tight lip mouth narrow nose bright green eyes shaped round

Merwoman close up of face lush lips mouth narrow teeth peeking out nose bright green eyes shaped oval Wide


Yes, you can overcome some template effects. I have shown here the different “templates” I found. It also depends on what kind of creature you try to generate: the more it differs from a human, the more you will see the effect, especially if you don’t talk the “mouthy” out of the way. And even then it sometimes shows up incomplete or only as a color.

You can try to generate fantasy creatures for which there is not much data; females seem to be more affected. The face must be different from a human’s. Your examples are still too close to human faces.


Some template effects that cause stereotypes, like the “alien” trigger for Roswell and H.R. Giger, could be handled by the recaptioner: simply give extra attributes to the aliens, and give them less weight if the word “Roswell” does not appear together with “alien”.
And Marvel or other big companies’ products could be marked so they do not show up in the dataset, and “Black Panther” or “Snow White” would not have to be blocked anymore.

Other template effects, like the nonsensical lighting or the birdshit moon, are probably caused by over-training and out-of-balance weights. Or CLIP does something wrong by always selecting from the wrong data.

(I hope they have not reduced the weight dataset.)

The “mouthy” looks like a manual patch against distorted faces. But I doubt this is still an issue in DALL·E 3; maybe it was in the earlier versions. The detail in small, distant faces should be fixed instead, the right way.


I think the system was just updated? I get different pictures with “Photo Style”. The quality has improved a bit, I find.

I agree, I have noticed a bit more detail, but there are still a lot of artifacts, and I noticed the eyes still aren’t right much of the time.

“ Photorealistic woman edgy tough , gritty American city street narrow image”


Yes, this is probably the reason for the “mouthy” template effect. But instead of over-training the system for faces, what they should do is give faces an extra level of detail, especially around the eyes and the mouth, or the whole face if it is small. Something like an area of up-scaling. MJ has fixed these problems, judging by the images I see from it: very, very realistic faces, even if they are small.


Yes, layers would be a good approach. If they could do eye, nose, and mouth layers like I do with meshes in 3ds Max (I used to make meshes for IMVU), layers like that could let you adjust each as a separate focus.