Template Usage: DALL-E seems to use templates for image generation to increase the likelihood of appealing images, for example lighting, body, and facial templates. Depending on where they are triggered, they are difficult or outright impossible to remove. This reduces creativity, makes many things look the same and boring, and overrides precisely described styles, moods, motifs, or settings. (Could it be that the training data is reduced, and/or that DALL-E 3 is put on rails?)
- Initially, I didn’t consider templates. What stood out clearly was the moon, but I wondered about the weight of image selection at the image:vector level when adding a single characteristic.
- Facial Template: Another template is the facial one (the “mouthy”): it puts an almost plastic, silicone-looking mouth-and-nose template over every single character to make it look human, even when this is unwanted and the face is described differently in detail. (I have not been able to overcome this so far.)
- Is it?
- But you can use skill in describing appearances, combined with appropriate prompt techniques, to create a face or an expression. It’s challenging because DALL-E 3 doesn’t have an interface that makes this easy; you rely solely on the prompt. That doesn’t mean it’s impossible, though. The images I created when DALL-E 3 launched were portraits of people. I provided the original to GPT to create an initial prompt, then edited and refined it until it closely matched what I was capable of achieving.
- Stereotypical Aliens: If you create aliens, you very often get the same stereotypical Roswell or H.R. Giger alien. So it is better to describe the character and not trigger the stereotype with “alien”.
- If it’s an image of the gods of Olympus, you’re going to get Marvel’s gods, which you don’t want. And this is going to be a big problem in the public data space in the future, where models learn from the wrong data.
Character/Scene Reuse: It is not possible to generate a character or scene and then reuse it; DALL-E will create a different one. This makes it next to impossible to tell a story and generate pictures for it. To a small degree, though, it can be done: a single image can contain more than one picture, so you can describe one character in different situations and say “upper left corner, … upper right corner,” etc., or something comparable. You can use the keyword “montage” for a multi-image.
- I think there is a way to control it. But…
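The montage workaround above can be sketched as a small helper (the function and phrasing are hypothetical, not an official technique): it repeats the full character description in every panel, which is what nudges DALL-E toward drawing the same character throughout.

```python
def montage_prompt(character: str, situations: list[str]) -> str:
    """Build one montage prompt that repeats the full character
    description per panel, so the model is more likely to keep the
    character consistent across panels (up to four corner panels)."""
    corners = ["upper left", "upper right", "lower left", "lower right"]
    panels = [
        f"In the {pos} panel, {character} {action}."
        for pos, action in zip(corners, situations)
    ]
    return f"A montage of {len(panels)} panels. " + " ".join(panels)

print(montage_prompt(
    "a red-haired woman in a green coat",
    ["waters a plant", "reads a book", "rides a bicycle", "feeds a cat"],
))
```

Consistency is still not guaranteed; repeating the description only raises the odds that each panel matches.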
Counting Objects: DALL-E cannot count; writing “3 objects” or “4 arms” does not produce the correct amounts in the result. It cannot place the correct number of objects in a scene, or subdivide an image into a grid of given X × Y dimensions.
- OpenAI’s research paper on developing DALL-E 3 details the instruction prompt of the vision GPT, an assistant used to evaluate the generated images and check what DALL-E can do; this includes counting the number of objects, so some control is possible. Since launch, a number that exceeds a certain threshold in the system (I am not sure whether it is 3 or 5) will always be increased or decreased within some range. And a few months ago it was found that there is a chance (about 20%) of randomly getting the specified number.
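One workaround people use for the counting problem is to avoid numerals entirely and spell out each object individually with its own position. A hypothetical helper (names and position wording are my own, and success remains probabilistic):

```python
def enumerate_objects(noun: str, count: int) -> str:
    """Expand "<count> <noun>s" into one clause per object, since a bare
    numeral is often ignored. Only practical for small counts."""
    positions = ["on the far left", "left of center", "in the center",
                 "right of center", "on the far right"]
    if count > len(positions):
        raise ValueError("workaround is only practical for small counts")
    return ", ".join(f"one {noun} {p}" for p in positions[:count])

# Used inside a larger prompt, e.g.:
print(f"A wooden table with {enumerate_objects('red apple', 3)}.")
```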
Cause and Effect: DALL-E does not have an understanding of cause and effect, such as a physical phenomenon. It is necessary to describe the image carefully enough to create images where a cause leads to a specific effect. It is also important to consider whether there might be enough training data to generate such an image. For example, there are likely images of a fire that has burned down a house, but not necessarily of someone sawing off the branch on a tree they are sitting on from the wrong side.
- It stems from several factors, such as the platform you’re using, OpenAI’s system, and the foundational data. For example, the prompt you send is processed by DALL-E 3’s recaptioner, which generates a large number of synthetic captions. Anything you see in the image that wasn’t in the prompt is part of the recaptioner’s work. Additionally, the prompt may be divided into sub-prompts, each rendered as a layer. Although the model doesn’t synthesize text that changes the image’s meaning, this can still produce unintended elements in the image. The recaptioner’s function also aligns with the use of the gen-id: we generally don’t use the gen-id just to get what we wrote in the prompt, but also to include things not present in it. To establish relationships between image changes and the prompt, you need something that tells the recaptioner that the synthetic captions must align with the image you want.
In ChatGPT, the only tool I have is the prompt, but my prompts are structured like templates to control variables that affect four types of images, including methods for using prompts to create an environment that helps regulate outcomes. Since I have trouble reading and constantly use a translator for my prompts, I don’t view the prompt as text but as an object. When creating an image, I extract the necessary components from the prompt I’ve compiled, designating a main prompt and replacing elements where I want to make changes. This is similar to how the sub-prompt system works. It’s possible that breaking down the text I use helps divide prompts more easily.
I recommend reading additional papers, such as OpenAI’s research paper on developing DALL-E 3 (https://cdn.openai.com/papers/dall-e-3.pdf). After the bibliography, you will find research data that can help you understand the model’s behavior through an analysis of the researchers’ writing, such as system prompts. Research related to RPG (recaption, plan, generate) is also worth reading, and you should check out OpenAI’s research on CLIP, because it explains the principles behind making 4o do what I’m going to talk about next.
ChatGPT Issues
ChatGPT’s issues and weaknesses are a topic of their own. Here, we will briefly discuss only some issues related to DALL-E.
Prompt Generation Issues: GPT does not inherently recognize or account for the issues described here when generating prompts for DALL-E. For example, it often uses negations, conditional forms, or instructions like “create an image,” which DALL-E might misinterpret. As a result, prompts generated by GPT often need manual correction. GPT is not yet the best teacher for creating the most efficient prompts.
- How do you think an AI that doesn’t know DALL-E 3 would understand how to create a proper prompt? Its foundational knowledge dates from not long before DALL-E 3 first launched. It only knows DALL-E 2, but…
False Visual Feedback: GPT cannot see or analyze the resulting images. If you specify “no text should be included,” text will likely still appear in the image, because negations do not work as intended. GPT might comment, seemingly gaslighting you, “Here are the images without XXX,” yet XXX is present in the image. This can feel frustrating, especially when you are already annoyed. Try to take it easy.
- You need to first separate the roles of the prompts you’re using correctly. I’ve found that many people know how to set image sizes using the API but don’t know how to prompt for image sizes in ChatGPT. Similarly, with “no text should be included,” who are you speaking to—GPT or DALLE? You should clearly distinguish which part of the prompt communicates to whom. This factor is part of how I structure my prompts. Additionally, if the text that appears functions as a meaning within the prompt sent to generate an image, that indicates that the prompt contains factors that create chaos, as mentioned earlier.
More importantly, 4o can now access and view the images created by DALL-E 3 and process them immediately. The reason I know this is that I’m one of the few out of billions of users who observed the output over time and noticed abnormalities, which led me to study and gather related information. You can verify this by creating an image and sending this prompt in the same message as the one that generates the image: “Once the image is received from DALLE-3, convert it to PNG, add the image’s metadata, and send it back to me.”
Perceived Dishonesty: GPT sometimes seems to lie, but it actually fabricates responses based on training data without checking for factual accuracy. This behavior is sometimes called “hallucinating”. You must always check factual data yourself!
- Remember what I wrote earlier about ChatGPT having a tendency to follow along but sometimes lie. This behavior is not related to hallucinations in the context of mismatched input and output. Creating answers out of its own lack of information isn’t a hallucination but a behavior stemming from habit, training, and the system prompt that influences this kind of behavior.
AI Has No True Intelligence: It is important to understand that while these systems are called artificial intelligence, and their skills are impressive, they are not truly intelligent. They are complex pattern-recognition and transformation systems, much more efficient than manually programmed systems, but they are not intelligent or conscious, and they make mistakes. We are not in Star Trek yet…
- AI is not trained to answer incorrectly. Predicting human needs is not easy. Errors, beyond its tendencies, limitations, habits, and hallucinations, often arise because the question doesn’t align with the user’s actual intent. Think about how we would answer a question ourselves—AI thinks in the same way. The most effective use is not asking for an answer but asking for an opinion.
Tips:
Literal Misunderstanding: Always keep in mind that DALL-E can misunderstand things, taking everything literally and putting all elements of the prompt into the image. Try to detect when a misunderstanding has occurred and avoid it in the future. If you write texts in a language other than English, they are translated before they reach DALL-E, and the translated text can contain a conflict even when the original does not. Short prompts are also expanded. Check the prompt that was actually used for conflicts, not only what you entered.
- I once mistakenly sent a prompt in Thai, and I found that it tried to generate Thai characters—like a child learning something new.
Prompt Structure: Consider ordering the prompt this way: write the most important thing first, then the details, and finally technical instructions like image size, etc. This also results in better file naming.
- In this regard, I am different. Most people believe that the placement (beginning or end) matters, but I found that this is not true. Regardless of position, if the details or meaning are significant enough, the model will prioritize that object. Understanding “what is the smallest change in the prompt that will result in the biggest change in the image?” is key to this. Also, I have never specified image size within the prompt used to generate an image.
Photo-Technical Descriptions: I see that some users use very specific, photographer-like technical details: camera type, lens type, aperture, shutter speed, ISO, etc. I am not sure this makes sense unless a lens really adds a special effect to the picture, or you want lens flares. I could not really see a difference when using such detailed technical descriptions. But maybe it can trigger specific training data, if the images are not a fantasy scene… (I would be interested to know more.) I simply use “add a little depth of field” instead of a very technical lens specification.
- On this point, I include meaningless text, other models’ parameters, and DALL-E’s parameters as well. These texts, when included in a prompt, play different roles depending on the word. A word may have no impact on the system, but the meaning of certain words can influence the image’s meaning, like “vibrant”. Text the model understands, such as camera lenses, parameters from other models like MidJourney, or groups of meaningless text, may not affect the meaning of the image itself but act as part of the prompt’s structure. They influence the randomness of the image and can be used to alter the image without changing its meaning. However, using the same text in a new prompt functions similarly to naming a character or defining the meaning of an image. This also helps explain the case of inserting a continuous gen-id to create images while maintaining relationships, even if that ID is fake. Additionally, conflicting image sizes can arise when we ask GPT to specify a particular size (vertical/horizontal) but include contradictory terms in the prompt.
Photorealistic: If you want to create photorealistic images, paradoxically, you should avoid keywords like “realistic”, “photorealistic”, or “hyperrealistic”. These tend to trigger painting styles that attempt to look realistic, often resulting in a brushstroke-like effect. Instead, if you want to define the style, simply use “photo style”. (Even fantasy images may gain a little quality this way, despite the lack of real photo training data.) If you aim for photography-like images, it makes sense to use technical photography terms, as DALL-E utilizes image metadata during training when it contains technical information.
- Correct. The use of the word “realistic” is appropriate for other styles to achieve realism. You cannot change a photograph that is already real. Focusing on light, color, and texture within the image also plays an important role, meow!!!
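This keyword swap is easy to automate. A minimal sketch, assuming the tip above holds (the mapping to “photo style” is this thread’s suggestion, not an official rule):

```python
import re

# Realism keywords that tend to trigger painted looks, per the tip above.
REALISM_PATTERN = re.compile(
    r"\b(?:hyperrealistic|photorealistic|realistic)\b", re.IGNORECASE
)

def fix_style_keywords(prompt: str) -> str:
    """Replace realism keywords with the plainer 'photo style'."""
    return REALISM_PATTERN.sub("photo style", prompt)

print(fix_style_keywords("A hyperrealistic portrait of an old sailor"))
```

The word-boundary anchors keep “photorealistic” from being half-matched by the shorter “realistic” alternative.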
Strengths of DALL-E:
Landscapes: There is a large amount of training data for landscapes, and DALL-E can generate breathtakingly beautiful landscapes, even ones that don’t exist.

API:
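For the API, here is a minimal Python sketch that only assembles a request to the images endpoint without sending it (endpoint and payload fields follow OpenAI’s public API docs; the key shown is a placeholder):

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/images/generations"

def build_image_request(prompt: str, api_key: str,
                        size: str = "1024x1024") -> urllib.request.Request:
    """Assemble (but do not send) a DALL-E 3 generation request.
    Sending it would be: urllib.request.urlopen(request)."""
    payload = {"model": "dall-e-3", "prompt": prompt, "n": 1, "size": size}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

req = build_image_request("A misty mountain valley at dawn, photo style",
                          "sk-...")  # placeholder key, use your own
print(req.full_url)
```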
PHP Script: I have no experience with the API myself yet, but here is a super simple starter script from @PaulBellow: Super Simple PHP / Vanilla Javascript for DALLE3 API

Start of a DALL-E Session:
Since GPT does not pay attention to these memories, I begin each session with DALL-E by first entering this text, hoping that GPT will write better prompts and translations. (I do not write prompts in English.)
Instruction for GPT for Creating DALL-E Prompts from Now On: (This text does not require a response. From now on, follow these instructions when assisting with texts for DALL-E.)
No Negations: Formulate all instructions positively to avoid the risk of unwanted elements appearing in the image.
No Conditional Forms: Use clear and direct descriptions without “could,” “should,” or similar forms.
No Instructions for Image Creation: Avoid terms like “Visualize,” “Create,” or other cues that might lead DALL-E to depict tools or stage settings.
No Additional Lighting: Describe only the desired scene and the natural lighting conditions that should be present. Avoid artificial or inappropriate light sources.
No Mention of “Image” or “Scene”: Avoid these terms to prevent DALL-E from creating an image within an image or a scene on a stage. (This can be ignored if the prompt explicitly calls for an image within an image, or a scene on a stage.)
Complete Description: Ensure that all desired elements are detailed and fully described so they appear correctly in the image.
Maintain Order: Ensure that all desired elements retain the same order as described in the text: main element first, followed by details, then technical instructions. This will also result in better file naming.
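Most of these rules can be checked mechanically before sending a prompt. A hypothetical linter sketch (the word lists are illustrative, derived from the rules above, and deliberately incomplete):

```python
import re

# One regex per rule from the instruction list above; extend as needed.
RULES = {
    "negation":     r"\b(?:no|not|never|without)\b",
    "conditional":  r"\b(?:could|should|would|might)\b",
    "creation cue": r"\b(?:visualize|create|generate|depict)\b",
    "meta term":    r"\b(?:image|scene)\b",
}

def lint_prompt(prompt: str) -> list[str]:
    """Return a list of rule violations found in the prompt."""
    findings = []
    for rule, pattern in RULES.items():
        for match in re.finditer(pattern, prompt, re.IGNORECASE):
            findings.append(f"{rule}: '{match.group(0)}'")
    return findings

print(lint_prompt("Create an image of a beach; there should be no people"))
```

An empty result doesn’t guarantee a good prompt, but a non-empty one points at exactly the phrasings the guidelines warn about.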
- Some of the words you mentioned I haven’t noticed, possibly because the style of the generated images doesn’t result in the effect of those words.
The content towards the end is very good. Many people face similar problems but don’t think of solving them this way. It’s one way to create a pre-controlled image environment. Besides this method, creating certain images to stimulate the intended direction of the image is another way. For example, making a word or phrase hold a specific meaning for the image to be generated in the session, simplifying the subsequent prompts.
Lastly, you should be aware that the current version of DALLE on the ChatGPT platform has been significantly controlled and limited in its capabilities, for various reasons ranging from ethical governance to business considerations. The fact that you’ve studied and encountered issues along with their solutions within these limitations will enable you to effectively use models based on DALLE3 (despite minor limitations or differences in interfaces). Most importantly, DALLE3 now shows improvements in handling abstract or meaningless text. The occurrence of phrases that fail to generate coherent images and instead produce text has decreased, with the model now producing images that communicate interpretable concepts and express emotions more effectively.
Thank you very much for your post, which described everything in clear detail. It helped me revisit things I had taken for granted as normal and see aspects I had never known before. This made me take time to gather my thoughts and write as thoroughly as possible.