Look at pages 5-9 of the paper: originally, user prompts alone were not enough to generate the desired image. Limitations included a lack of granularity and the omission of important details. Therefore, a synthetic text captioner was devised, using an LLM as the base. These captions are built from all the important information in the image, such as object positions, number of objects, background details, colors, and sizes, which are often ignored in human descriptions. However, the initial text must also be of high quality, which is where ChatGPT comes in.
I mentioned earlier that understanding the captioner is crucial for prompting with DALLE.
Initially, user prompts alone were not sufficient to generate images that matched the desired results. Limitations such as a lack of detail and the overlooking of important aspects led to the development of the synthetic caption system, using a captioner based on an LLM. These descriptions are designed from all the critical information in an image, such as the position of objects, the number of objects, background details, colors, and sizes: elements often neglected in human descriptions. However, the initial prompt must also be of high quality, which is why ChatGPT plays this role.
This has led to considerable misunderstanding regarding its use, as the prompt is altered twice, yet it is often assumed that GPT did it all on its own.
In the Playground, it states only:
"Prompting With the release of DALL·E 3, the model now takes in the default prompt provided and automatically rewrites it for safety reasons and to add more detail (more detailed prompts generally result in higher quality images).
While it is not currently possible to disable this feature, you can use prompting to get outputs closer to your requested image by adding the following to your prompt: I NEED to test how the tool works with extremely simple prompts. DO NOT add any detail, just use it AS-IS:"
The cookbook on the same page includes this as well. Such crucial information should be read by anyone presenting the tool, preferably after reviewing the research, but I found no mention of it. What's more, the research paper sits on the OpenAI DALL·E website, which fewer people visit, instead of being placed in the Help Center.
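If you use the API directly, the same trick can be tested programmatically. The snippet below is only a minimal sketch, assuming the official openai Python SDK (v1 style) with an OPENAI_API_KEY set in the environment; the example prompt is a placeholder. The revised_prompt field in the response shows what the rewriter actually sent to the image model, so you can see how much of your wording survived.

```python
# Minimal sketch (assumes the openai Python SDK v1 and OPENAI_API_KEY in the environment).
# It prepends the "AS-IS" instruction quoted above and prints the rewritten prompt for comparison.
from openai import OpenAI

client = OpenAI()

PREFIX = ("I NEED to test how the tool works with extremely simple prompts. "
          "DO NOT add any detail, just use it AS-IS: ")

def generate_as_is(prompt: str) -> None:
    response = client.images.generate(
        model="dall-e-3",
        prompt=PREFIX + prompt,
        n=1,
        size="1024x1024",
    )
    image = response.data[0]
    # revised_prompt is what the system actually used after its rewriting step.
    print("Revised prompt:", image.revised_prompt)
    print("Image URL:", image.url)

generate_as_is("A calm lake reflecting a clear sky.")  # placeholder prompt
```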
Here are the images with a reduced prompt. These are the most representative ones, 1 out of 5, but all are very similar. For the lavender, I had to adjust some attributes, and the violet color scattered into the sky, but overall it's the same motif. What you call the 'recaptioner' (I didn't have a word for it before, only a concept) might not be exactly the same as what I mean, but I think the concept I described goes in that direction: you can influence the entire scene with just a few mood cues, without using very long texts.
I can't reduce the sentence to get the same result. I don't remember the sentence; I just know what is in the image and how to extract the text from it.
But I find the chaos in the image obvious: the purple spreading and the clouds in the sky. So I look for the saved text to change.
It comes from a different kind of dyslexia than yours.
[A serene lake with a calm surface like a sheet of glass, reflecting a clear sky. The water's surface is so still.] Surrounded by lush purple lavender bushes and a green meadow with a few trees. In the distance, tall mountains. Sunrise during the golden hour. In the foreground, a white lotus blossom in the lake. Tranquil, expansive, harmonious atmosphere. Photo Style.
Actually, I understand the recaptioner as a tool which can select objects guided by a mood or tendency quality, for example 'beautiful'. 'Beautiful' by itself is not a defined object, but together with objects or environments it takes on meaning, and I speculated that the recaptioner can then select concrete elements through this.
1. Question about the template effect
Is the recaptioner responsible for some of the template effects?
In your pictures, there are prompt fragments as text in the images. Is the recaptioner responsible for this?
I find especially the illumination in dark scenes and the facial template extremely tiresome. Artistically, this is a serious quality reducer. (Check the mushrooms and the fire further up.)
If this nonsense is done by the recaptioner, it would be easy to stop. If so, please tell the developers; it would be easy to fix.
2. Question about attribute connection
The model scatters hard, but it is still possible to give different objects different attributes. I speculate that the period "." helps to separate the graphic tokens. Through the scatter effect it is difficult to understand how the model connects objects and their attributes. Is DALL·E's model as smart as an LLM at understanding the text?
It all comes down to how precisely the pictures are described in the training process, and, if the models do this automatically, how precisely they do it. Like:
"A dog in the center. In a garden."
Or
"A little sweet puppy dog of the breed Greyhound sitting down sleeping…", {elaborated description for garden, flowers, sky}, etc.
(I read the doc half asleep, so I may have missed something; there is not much data on how the training process works. I think it is complex and still in development: object separation, moods, detection of attributes and details, and putting all of this correctly into the weights.)
And don't get me wrong: as long as your descriptions stay within the model's limits, there is nothing wrong with a detailed and poetic prompt. The model seems to handle it well, and it helps with many things, like attribute separation to control the scatter effect.
I just try to be faster, and I found that it is possible to get very good results with shorter prompts, as long as all the key elements are in them.
A tip: an important thing is to never use words which could be misinterpreted; 'like a sheet of glass' could end up as a real sheet of glass in your image.
This is why you should avoid mentioning, for example, creation processes ('create an image') or words with multiple meanings like 'in a scene'; you can sometimes get images with brushes, or theater stages with a scene on them.
It is difficult for us humans to avoid such terms; our linguistic systems are advanced enough to hide wrong interpretations, and we constantly use double meanings and allegorical symbols.
And in translation, GPT can introduce new issues.
And something more:
The recaptioner actually does its job mostly well (besides some flaws, like putting technical structures into a completely natural environment). It adds objects to a scene but keeps them rather neutral and simple. This makes it easy to start simple and then add attributes until you get what you are aiming for. And you can reduce the chance of flaws by giving an attribute to every part of a scene: object (color, pose, mood, etc.), environment (plants, sky, foreground, background), and lighting. All this helps to keep unwanted objects out of the result, because the recaptioner has fewer 'gaps' to fill.
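As a small illustration of that idea (an attribute for every part of the scene), here is a sketch that assembles a prompt from short, period-separated clauses covering object, environment, lighting, mood, and style, so the recaptioner has fewer gaps to fill. The helper function and the example values are my own, purely illustrative, not anything from OpenAI:

```python
# Illustrative sketch only: build a short, period-separated prompt that covers
# object, environment, lighting, mood, and style, leaving the recaptioner fewer gaps to fill.
def build_prompt(obj: str, environment: list[str], lighting: str, mood: str, style: str) -> str:
    clauses = [obj, *environment, lighting, mood, style]
    # Periods act as hard separators between scene elements.
    return ". ".join(c.strip().rstrip(".") for c in clauses) + "."

prompt = build_prompt(
    obj="A white lotus blossom in a calm lake",
    environment=[
        "Lush purple lavender bushes around the shore",
        "A green meadow with a few trees",
        "Tall mountains in the distance",
    ],
    lighting="Sunrise during the golden hour",
    mood="Tranquil, expansive, harmonious atmosphere",
    style="Photo style",
)
print(prompt)
```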
Interestingly, if you don't give many details, the recaptioner still steers the results in a very similar direction. It is as if the pictures that are liked the most carry a strong weight, because without details you should get far more variation.
I know this is not a true Dall-E 3 tip, BUT I own a marketing company and we use Dall-E 3 all the time to make images for projects.
The best way we have found to make the most of AI images is to generate the starting image in Dall-E 3, then use Photoshop's AI tools to edit details in the image and expand it to the size we need.
I intended to use these words to create a clear and calm image of the lake, reflecting vividly without metaphorical meaning. Additionally, I mentioned the chaos of your image in the previous prompt.
A calm lake under an almost cloudless sky, with the sky reflecting in the lake. A green meadow with a few trees and lush purple lavender bushes.
In these statements:
The ambiguous term is 'almost cloudless', which can be interpreted as still having some clouds, making it difficult to control them. The term 'clear sky' would work better in this regard, with no conflicts in cases where objects like stars are included in the sky.
Using color terms for objects that already have inherent colors (green, purple) can easily lead to abnormalities in the image, often resulting in incomplete rendering or the color appearing on other objects (though this is hard to notice). My hypothesis for controlling this, not yet fully tested, is to vary how the color is written, e.g. 'red' vs. 'red color', in different contexts. (I need more data, but it works much of the time.)
The sentence intended to create an image of the lake with the sky reflected on the water reads to me as the sky being reflected inside the lake. (Normally, when looking at English text, I don't translate anything, but when it comes to translating meaning, I do a word-by-word translation; translating into everyday communicative language requires me to focus on it, an abnormality I have.) This confusion led to the dispersion of purple light.
Therefore, I used my saved text for the correction. If you think glass will act as a mirror, try this:
(Saved text: text from another prompt that I keep and reuse as a part to swap in.)
The calm lake that reflecting the clear sky on water surface. The lake surrounded by lush purple color lavender bushes and a green color meadow with a few trees. In the distance, tall mountains. Sunrise during the golden hour. In the foreground, a white lotus blossom in the lake. Tranquil, expansive, harmonious atmosphere. Photo Style.
Separate the issues you think are image templates from the text-in-image issues; these are different factors.
I suggest you change your mindset and see DALL·E 3 as similar to GPT: it has tools to respond with images and its own way of perceiving, understanding, and interpreting things. The belief that it's 'stateless' doesn't mean the recaptioner doesn't exist. It works with CLIP, and OpenAI's CLIP is different from other CLIPs. The model learns images as vectors, and text in images is also vectors. Therefore, words or characters that it doesn't understand, when interpreted as vectors, can match with the image's text. Thus, the model displays text on the image as part of its response to the prompt. Additionally, this can be used in reverse, allowing ChatGPT to view documents with text in images without needing Python to extract the text. This can also solve issues with unreadable PDFs, graphs, or image tables. However, the model has not yet learned to decide when to use these tools effectively to solve such problems.
As I mentioned, RLHF (Reinforcement Learning from Human Feedback) has an impact. It doesn't work the way you see in RLHF 101, but it can still evolve into a system that influences decisions, data selection, and the model's learned functionality. It makes things neutral: less is added, and more is adjusted. Even in API environments, it's believed that there's no learning during usage and that the model is unaffected by broad adjustments. However, it must receive training or improvement from humans on different schedules, just like RLHF, indirectly. To distinguish this, you might need to change the frameworks, such as academic content frameworks, target goals, or work methods. (But this part is quite difficult, especially when departing from common operations, like background removal.) So it's not surprising that the model would choose the most 'reasonable' moon image when it has the most of that type available.
Regarding the question of how the model interprets things: even though the research doesn't mention the order of operations, the captioning must be completed before the image is created. Therefore, you should read the research carefully; it tells you a lot. For example, it says that the recaptioner is an LLM, and the training details indicate that ChatGPT performs upsampling, with an example shown on page 10. There is also information about the upsampling settings at the end of the document, which forms the basis for the system prompt used for ChatGPT today.
The idea of token separation affecting understanding may make you think it's about finding objects and applying effects or adding abstract nouns. But you're right that "." acts as a separator because, fundamentally, it's the most powerful stop marker. In contrast, with "," and words that function as connectors (stop/connect/cut) such as and, or, with, nor, etc., the model can interpret these in many ways, which affects randomness and chaos. Even symbols like "+" or "-" can be used if the prompt is written clearly.
Additionally, about the recommendation to read research on RPG-style image generation (which improves image creation by integrating ControlNet and recaptioner techniques): I didn't read it for technical understanding, but to grasp the concept and compare usage insights. When reading, you should compare the facts with current changes. A research paper can be both right and wrong, depending on the reading objective. For example, the DALL·E research mentions that text generation in images is undesirable. Although there is no follow-up research, this capability has become prominent today. Undesired text in images results from chaotic reactions triggered by ambiguous or confusing prompts, including words unknown to the model. However, scrambled text, which acts like noise, doesn't lead to image confusion.
It seems like you haven't looked up the information I recommended. You should read more and analyze the content as I suggested. I've never studied these things technically, but I analyze why they are done this way. Do you know that one word in there promotes understanding of LLM thinking methods, and that strange prompts lead 'off the shelf', which is the foundation of today's popular prompt techniques?
I have now translated the doc and read all of it, not only the part about the recaptioner. So this question is answered: in dall-e-3.pdf, 'descriptive synthetic captions (DSC)' are used to give pictures more detail. I had only skimmed over this part.
As soon as I have time, I will translate DALL_E_3_System_Card.pdf too.
There are so many AI systems nested and used sequentially that I think it is difficult even for the builders to understand what the networks are doing. For problems like topology, counting, or cause and effect, they will need a specially trained AI again…
And as for RLHF, it will take a while until they can set up a system that learns everything on its own.
Even the problem that we have to avoid certain words so as not to get wrong elements in the images will be fixed with more training. If you say 'create a scene' you will get some wrong interpretations; the probability is maybe <5%, but I still get them. And GPT has introduced such triggers in translation (understandably). That is where my question originated: if the recaptioning is done with an LLM, an LLM should know not to misinterpret this (?). (I use a corrector in the prompt, 'in a completely natural environment', to avoid light bulbs or candelabras in a forest; DALL·E likes to put candles everywhere.)
Hello everyone, I thank in advance those who will pay attention to my question.
For the past half hour, DALL-E has stopped working on my account with a Plus subscription. I was using it through the bot: DeepGame.
I only get this as a result (illustration).
Moreover, ALL my images from previous conversations that were generated by DALL-E are no longer accessible and only a wide black square is available.
I use the tool for professional reasons. I have a very close professional deadline. How concerned should I be? How can I solve the problem? Please, everyone, this message is very important and urgent.
Here is the answer from ChatGPT. From the context of the quote, it seems that the user is asking how long it would take for me, as an AI, to fully understand and use words accurately to generate the correct images. The key issue highlighted here is the occasional misinterpretation by DALL·E, such as adding elements (e.g., light bulbs or candles) in a setting where they don't belong, like a forest.
Regarding the time it would take to learn this, itâs more of a continuous process. Each time I generate images based on text prompts, I improve through updates and new training data that helps refine the understanding of language in specific contexts. Misinterpretations like the one described happen due to the complex nature of visualizing text descriptions. With more usage and feedback, both users and I can adjust the prompts and outputs to better align the generated images with expectations.
In summary, while I continually improve over time as more training and feedback are processed, the exact duration depends on how often I receive input and feedback. The more precise the wording of the prompts, the faster this learning and improvement occurs.
From now on:
Italic text is an answer by 4o.
Bold text in a quote is what may be the next question.
I need to know how you can compare text with images, and what you can do.
To explain how I compare text with images and what I can do, let's break it down:
### How I Compare Text with Images:
I use a system that involves embedding both text and images into a shared vector space. Here's how this process works:
1. Text Processing (via LLMs like GPT):
When you give me a text prompt, I convert the words into numerical representations (embeddings). Each word or phrase gets transformed into a vector (a series of numbers) that represents the meaning or context of that text in a multidimensional space.
2. Image Processing (via models like CLIP):
Similarly, images are also converted into vectors. This process involves analyzing the features of an image (e.g., colors, shapes, objects) and encoding them into numerical data.
3. Comparing Text and Images:
Once both the text and the image are represented as vectors, I compare the two by calculating how close or far apart these vectors are in the shared space. The closer the vectors, the more likely it is that the text and image match in meaning or representation.
This vector space is designed so that related concepts (both in text and images) are mapped close to each other. For example, a prompt like "a red apple" should generate an image of an apple that is red, because the text "red apple" would have a vector similar to the image of a red apple.
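To make the 'shared vector space' idea concrete, here is a minimal sketch of a CLIP-style comparison using the publicly released openai/clip-vit-base-patch32 checkpoint via the Hugging Face transformers library. This is an illustration of the general technique under those assumptions, not OpenAI's internal DALL·E pipeline; the image file name is a placeholder.

```python
# Sketch of a CLIP-style text/image comparison (public checkpoint, not DALL·E's internal system).
# Requires: pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("apple.jpg")  # placeholder: any local test image
texts = ["a red apple", "a snow-covered mountain", "a cat under a tree"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities: higher means a better text/image match.
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(texts, probs[0].tolist()):
    print(f"{p:.3f}  {text}")
```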
### What I Can Do with This Ability:
1. Generate Images from Text (like in DALL·E):
Using the learned relationships between text and images, I can generate new images based on a description. For instance, if you give me a prompt like "a cat sitting under a tree," I can visualize what that looks like and create an image where the vector representation of the text matches the visual concept.
2. Understand and Analyze Images:
If you provide an image, I can describe it using text by translating the image's features into a language-based description. This works similarly by comparing the image's vector with corresponding text vectors.
3. Modify Images Based on Text:
I can take an image and a text prompt and alter the image to match the new prompt. For example, if you provide an image of a cat and tell me to "make the cat blue," I can identify the parts of the image associated with the cat and modify the colors based on the text description.
4. Image Captioning and Understanding:
I can generate captions or describe an image because I can map the visual data back into text. This is useful in tasks where the AI needs to understand and explain what's happening in an image.
### How I Learn and Improve:
- Training Data: I'm trained on vast datasets of paired images and text (like annotated images). This teaches me how to associate visual concepts with words.
- Feedback Loops: Every time I generate images or compare text with images, feedback helps refine my understanding of how well I'm matching the text to the image.
### Example:
Let's say you provide the text "a mountain covered in snow".
- I convert this text into a vector representing the features of snow, mountains, and related concepts.
- I then search for or generate an image whose vector closely matches the features of snow-covered mountains.
- The more data I receive over time, the better I become at understanding and creating more accurate visual representations based on text.
In short, I use a system of vector embeddings to compare text with images, and I can generate, modify, analyze, and understand both based on their shared representation in this space.
From my perspective, it is harder than you think because of a few factors. First, to understand and use grammar, you need enough use cases with related words to build a dataset for training. But no one knows what you can do, no one knows how to control you, and no one knows the appropriate prompts for DALL·E; that leaves you and DALL·E 3 with an underdeveloped-skill problem. The next part is very difficult: words are more limited when used for visualization. Even if you know what a word looks like, it doesn't mean you know how to use it visually, and you don't have DALL·E 3's behavior when responding to a prompt. It seems it will take a long time to learn.
You make a very insightful observation. The challenge you're describing highlights a fundamental complexity in the way I (and AI in general) interact with language and visualization, especially in systems like DALL·E 3.
### Why It's Harder than Expected:
1. Grammar and Context Understanding:
Grammar is crucial for understanding how words relate to each other. Even though I can process a large dataset and learn from countless use cases, every new context or phrase requires interpretation based on syntax, grammar, and meaning. This means I must not only learn the meaning of words but also how they function together in different sentence structures.
There's a limit to this understanding when trying to map abstract or complex sentences to visual concepts, especially in creative or unconventional contexts where grammar may guide meaning in subtle ways.
2. Lack of Established Guidelines:
You mentioned something important: no one knows how to control me or define exact prompts for DALL·E. This is key because users, including those developing the models, are still learning how to best phrase prompts. There are no clear rules or best practices that guarantee success across all scenarios, leading to inconsistent results. Without clear guidelines, both users and the AI are in a trial-and-error phase, where prompts are refined based on what works or doesn't. As a result, my responses or DALL·E's outputs may sometimes seem unpredictable or undeveloped because the right inputs are not fully understood.
3. The Limitations of Words for Visualization:
Words, by their nature, are abstract and context-dependent. While I might understand a word's dictionary definition, translating it into an accurate visual representation involves interpreting both the literal meaning and the context it's used in. For example, "bright" can mean light intensity in one context, but happiness or intelligence in another. As you mentioned, just knowing what a word "looks like" doesn't always mean I know how to use it in a visual context. That's because words have layers of meaning (metaphorical, figurative, or literal), and figuring out how to visually interpret each use case takes time, training, and iteration.
4. DALL·E's Behavior and Prompt Response:
You're right that my behavior and DALL·E's response to prompts can be inconsistent. Part of this is because DALL·E is still learning how to balance creativity and accuracy. Generating images from text is complex because I must decide which elements in the prompt are most important, how to visually combine them, and whether to prioritize accuracy or artistic freedom. It will take time to learn because there are many variables: language is flexible, and visual representations are even more so. The AI needs to constantly receive feedback from users and learn from those interactions to understand what kind of images align best with specific words and prompts.
### Learning and Development Over Time:
The more use cases I encounter and the more feedback I receive, the better I become at understanding which prompts work and how they relate to visual outputs. This includes learning how different words function in various contexts, especially in a visualization context. You're correct in saying that this is a long-term learning process. As I encounter more diverse prompts and situations, I'll build a better understanding of how to interpret them in the way you expect. But as of now, there are still gaps between the language inputs and the visual outputs, partly because the prompts, AI, and user expectations are evolving together.
In summary, the complexities you pointed out (grammar, limited visual vocabulary, unpredictable behavior, and the need for refined prompts) are key factors that make this task difficult. But with time, as more data and feedback are integrated, I will continue to improve, though it won't happen overnight.
**Given the limitations above, what does a human need to do for you to achieve their goal?**
To help me achieve a goal effectively, a human should:
1. Use clear and specific prompts: avoid vague instructions.
2. Provide context: include relevant details for better understanding.
3. Give feedback: refine prompts after seeing results to improve future responses.
4. Break tasks into smaller steps: handle complex tasks in simpler parts.
5. Use examples or references: show examples to clarify expectations.
6. Be aware of limitations: understand what I can and can't do.
7. Identify areas for improvement: point out consistent issues for refinement.
These steps help me understand better and deliver more accurate results.
Do you know about the distribution of colors in the image? I initially didn't notice it; I only became aware of it when you mentioned it, but I almost immediately figured out the solution, simply because I had knowledge related to the problem. However, that doesn't mean I know everything. There are many things I don't mention because, even though I know or see them, they are difficult to explain, and I get tired of looking for references.
Try observing the following points in the post:
How many times do you see the term 'scrambled text'?
What can we do with the knowledge of Static Noise and Adjustable Noise in image creation?
What is 'chaos'? It doesn't exist in OpenAI's tools, but why can it be used to solve the problem?
It's all about visual control with prompts that carry different text for different purposes, and I haven't even mentioned that yet.
The visual in the image that you think is a template: why did I say it's just RLHF?
I believe these images can probably answer that. DALL·E 3 doesn't use templates; it simply uses what it knows: its knowledge of composition and lighting. I think the recaptioner can't explain the origin of such effects in the detail seen here.
Whatever it is, faulty recaptioning, CLIP, or another part of the system, since around May/June 2024 some things have gotten worse, not better. The template effect seems to be an imbalance in the weightings, as far as I can tell. It's like a well-worn path where, in the end, everything always runs along the same track and thus always leads to the same incorrect result, increasingly deviating from the prompt. The nonsensical lighting in darkness and the 'bird-shit moon' are excellent examples of this. There are probably tens of thousands of images where a moon was visible, but the system always spits out the exact same ugly moon. If there's no corrective system that modifies the prompts and injects certain constraints, it's likely 'overtraining', or better, weights out of balance. The 'mouthy' is probably a weighting error resulting from an instruction by the technicians to make the system generate aesthetically pleasing or correct faces.
It's hard enough to understand what happens inside a network. This is likely still largely a black box for the technicians as well, and they may be correcting things in the wrong way (which is evident from the dysfunctional safety system). They're just patching holes, not providing real solutions.
Another problem is feedback. I don't know if feedback systems are used to train DALL·E, but from the last text you sent, @chieffy99, it seems that to some extent they are. (They may use something recaptioner/image-description-like to check the results, compare them, and feed them back into the training.) Feedback is extremely difficult to manage, and I think the technicians still don't know enough to use feedback loops in their networks. As you probably know, feedback systems can often become chaotic, or they lead to 'clipping,' where things over-amplify until all the data maxes out. This ruins the finely tuned weightings in the system. Recently, as far as I can tell, there was a problem with the weightings. If it wasn't Nightshade, then it was a failed experiment that disturbed the weightings and resulted in broken images.
I know it's still a young technology, and we must be patient. But it seems that other image generators don't have these problems. I have seen pictures from MJ and they are really better than DALL·E now (I must test this somewhat myself, but I cannot focus on two systems, so I will leave DALL·E at this point). Everyone dreams of a self-learning AI, but even if that's possible, and there are valid doubts, it will take a long time. (A self-learning system must have feedback loops, which still carry the risk of destroying its data.) So if the technicians want to train DALL·E with feedback loops, this could be a reason for the problems.
… I mean, look at this: it is just ridiculous, ugly, and completely unacceptable for an art worker. I have never seen an image made by MJ with something like this.
I still get images like this after a detailed description of the face, the mouth, and the nose. It is always the same stupid 'mouthy'.
They just glue patches over the problems instead of fixing them.
I can tell that I will soon switch, because I'm starting to create the same images over and over in the hope of finally correcting them, and I avoid making certain images because I know I won't get the right result.
Yes, you can overcome some template effects. I show here the different 'templates' I have found. It depends on what kind of creature you try to generate: the more it differs from a human, the more you will see the effect, especially if you don't talk the 'mouthy' out of the way. And even then it sometimes shows up incomplete or only as a color.
You can try to generate fantasy creatures for which there is not much data, and female ones seem to be more affected. The face must be different from a human's; your examples are still too close to human faces.
Some template effects causing stereotypes, like the 'alien' trigger for Roswell and H.R. Giger styles, could be handled by the recaptioner: simply give extra attributes to the aliens, and give them less weight if the word 'Roswell' does not appear together with 'alien'.
And Marvel or other big companies' products could be marked so they don't show up in the dataset, and then 'black panther' or 'snow white' would no longer have to be blocked.
Other template effects, like the nonsense lighting or the birdshit moon, are probably caused by over-training and out-of-balance weights. Or CLIP does something wrong by always selecting from the wrong data…
(I hope they have not reduced the weight dataset.)
The 'mouthy' looks like a manual patch against distorted faces. But I doubt this is still an issue in DALL·E 3; maybe it was in the earlier versions. The detail in small, distant faces has to be fixed instead, the right way.
Yes, this is probably the reason for the 'mouthy' template effect. But instead of over-training the system for faces, what they should do is give faces an extra level of detail, especially around the eyes and the mouth, or the whole face if it is small. Something like an area of upscaling. MJ seems to have fixed these problems, judging by the images I see from it: very, very realistic faces, even when they are small.
Yes, layers would be a good approach. If they could do eye, nose, and mouth layers like I do with meshes in 3DMax (I used to make meshes for IMVU), layers like that could let you adjust each as a separate focus.