DALL-E 3 Generating Incorrect Colors and Details Since November 11, 2024

I started having problems around November 25. I mainly use Bing, but Microsoft Designer and Copilot show similar problems. Here is my observation thread:

1 Like

I guess there may be a problem with the rendering logic, which leads to a miscommunication in the step that summarizes and expresses the keywords, and then a problem in the link that issues instructions to DALL-E 3. This has little to do with DALL-E 3 drawing the picture itself; very likely the model that issues instructions to DALL-E 3 is at fault. Considering that the same error occurs when Bing Image Creator and ChatGPT call DALL-E 3, I think the problem lies in that link. However, last night I used the third-party website coze.com to call ChatGPT-4V + DALL-E to render a picture, and it understood the keywords and composition accurately. This suggests that DALL-E 3 itself is probably fine, and that the upstream model which issues instructions to it has a logic problem, which leads to the polluted pictures.

Of course, this is just my assumption; I tested it myself and came to this conclusion after discussing it with ChatGPT-4o and Gemini 2.0.

1 Like

I’ve noticed this change in quality as well.
There are chroma shifts, for example in images with deserts, where the shadows have a blue tint. I have also observed that in a series of images, sometimes one appears that looks like it is from a different model: it has a slightly different style, and the typical new errors are not present. Additionally, errors have mainly appeared in bright spots, such as white spots in the eyes or strange reflections on reflective surfaces. Also, in dark scenes, volumetric lights suddenly appear, making it nearly impossible to achieve a dark background without prompt tricks that have side effects.

There was a model update that didn’t improve the model but instead reduced the quality.

But the clearest indication is what I call the 'bird-shit moon'. This ugly moon with very dark spots is the clearest indicator that there are two different models in use, or of when the update happened.

Here’s another post that describes the same problem. It seems like there are two versions of the weights in circulation, and the older model appears to contain fewer errors.

And I speculate there was a Nightshade infection; maybe some of these effects are a holdover from it.

1 Like

Here are two examples:


White spots on reflections (head, hand, knee).
A blue shift in the shadows (sometimes it is even stronger).


Here is a chromatic aberration error, the kind caused by cheap lenses.
And it is blurry, for the same reason: cheap lenses and bad source photos.

This is a weights problem: they have trained the system with the wrong data. It needs filters or correctors to keep such data sources out of the training.

1 Like

Hi Daller,
I completely agree with your observations—especially the “bird-shit moon” and chromatic aberrations. I’ve noticed that issues like shadows with blue tints, overly bright spots, and misaligned volumetric lights have been consistent since around mid-November.

From my testing, it seems the issue lies partly in the NLP model issuing commands to the renderer. For instance, prompts for specific styles often include unintended elements. Moreover, backgrounds and textures like silk or metal often get misinterpreted, as you pointed out.

I also noticed improvements in specific aspects (e.g., clothing details), but the rendering problems persist. Your theory about training data or weight discrepancies makes sense. Hopefully, with continued feedback, the team can address these issues and perhaps revert to an earlier, more stable version of the model.

Let’s hope they take this feedback seriously—we all miss the consistent quality from before November!

2 Likes

I was just testing old prompts that previously generated beautiful images, but the prompts no longer work. The images are comparatively boring now. Everything has a kind of plastic look. On the one hand, I can still create very beautiful images with the right triggers, and some images have become photorealistic. But that is exactly where the problem lies: they probably included more photos in the training data, with the effect that all the typical mistakes from bad photography have flowed into the weights.

There are problems in NLP as well, but there is still room for improvement everywhere. I think the NLP might even have been improved, but at the same time the training data may have been weakened. NLP, CLIP, re-captioning, GANs, and who knows what else are still in development.
GPT is not even the best prompt writer, and it rewrites prompts before sending them to DALL-E; even this can fool us, especially if prompts are translated (see the sketch below).
OpenAI does not give us many extras: no seeds, no selection of model version, no options, no feedback, etc. Other companies are more advanced there.
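For anyone who wants to see that rewriting step directly, here is a minimal sketch using the OpenAI Python SDK. With DALL-E 3 the API response includes a `revised_prompt` field, which is the prompt the image model actually received after the upstream rewrite; the prompt text below is just a placeholder, not one of the prompts discussed in this thread.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder test prompt; substitute one of your own.
result = client.images.generate(
    model="dall-e-3",
    prompt="A dark desert at night under a thin crescent moon, no visible light sources",
    size="1024x1024",
    n=1,
)

image = result.data[0]
print("Prompt the model actually used:", image.revised_prompt)
print("Image URL:", image.url)
```

Comparing `revised_prompt` with the prompt you typed makes it easier to tell whether a bad result comes from the rewriting layer or from the image model itself.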

It is also possible that customers are now receiving reduced computing power due to the high number of requests, so it is likely several problems overlapping with each other. However, the fact that the system always generates the same moon, and that there are template effects, can really only be due to the weights.

The system needs a filter before training, and the model needs to be retrained.
I see that more and more people who have experimented with it for a long time are detecting the differences.

DALL-E is an art tool. I hope they have at least one artist on the team, and not only technicians.

Maybe you find this interesting too.

1 Like

I also noticed graphical errors in early September when I was testing Bing queries in GPT. It seems that Bing and other Microsoft services now use a common DALL-E model. I was very disappointed with the results then and decided not to subscribe. Why would I, when Microsoft Designer, Copilot, and Image Creator provide more detailed results?

For me, the best version was the mid-summer version: it kept the detail and had a well-balanced filter on censorship. This filter wasn’t as stringent as the one before it, yet it didn’t break the overall quality.

It’s ridiculous to hear excuses about a lack of capacity, because DALL-E was not originally designed for photorealistic images. Grok, which is probably running at a similar load now, has no problem creating photorealistic images without artefacts.

DALL-E has always stood out for its ability to easily implement any style of image, but recent updates destroy that feature. In every style, detail is important, and with blurred objects in the foreground, DALL-E loses its value.

1 Like

I hope we can soon use image generators on our own systems, and that training is done in public. (For now I am not so happy about what exists.)

And if somebody has a source where prompts are compared across many systems, that would be interesting too.

This actually does not even matter in terms of performance; it is all about training data.

And yes, the “security system” is absolutely ridiculous; it gets on my nerves so much.

1 Like

As far as I know, the understanding of the number of skirt layers has only improved in recent works. Originally, around the September period you mentioned, I found that its understanding of court dresses was as simple as Spanish dance skirts, rather than being able to distinguish between Baroque, Rococo, Victorian, and other styles as it did from January to June 2024. In that period I also encountered many unsatisfactory generations, but at least the images were not as incomprehensible as the blackboard-like or painterly results now. I also just searched for Bing Image Creator on pixiv and found that its recently generated images are likewise rough and low-quality, which means the problem continues. Another thing I found is that many authors on pixiv who originally used Bing Image Creator no longer post updates. Based on my high-frequency testing and marking of wrong pictures for 12 consecutive days, I found one thing: the composition is actually still excellent, but the rendering cannot maintain a high level at all.

1 Like

One problem I found in recent high-frequency tests is that the wrong images are basically rendering problems. The understanding of the prompts is not as good as before, but the composition of the wrong pictures is still beautiful; it is just that the rendering of these beautiful compositions is destroyed by the weird rendering logic.

However, between January 25, 2024 and June 15, 2024, the generation of these images was very advanced and productive. One thing I found is that November 2024 was the last period of normal operation, as listed in the timeline I showed in my post.

The failures began to break out before December 4, 2024, and as far as I could tell the peak was around December 15, 2024. Many works from this period can be said to be largely in a watercolor style rather than normally rendered. The problems in this period involve the understanding of the prompt words, the rendering logic, and composition splicing; yes, when I used Bing Image Creator, I found that Microsoft could sometimes paste the moon onto the window frame.

Many good pictures were ruined like this. As shown in the wrong sample photos I shared on Microsoft OneDrive, many images are well composed, but the rendering ruins their beauty. These images are rendered in the wrong style, and then there are overlapping landscape elements and individual glitches. It seems that the landscape is not used as well as before, and the depth of field seems large, but the effect is extremely weak, because those backgrounds do not set off the beauty of the characters.

2 Likes

The most reliable way to check the quality of generation is a mini skirt with a small check pattern. The pattern is only visible if the generation is successful. In other cases, it will just be the main colour, and occasionally black, smeared lines will be visible.

I also noticed that rendering in 4:3 is almost perfect. Even if a 1:1 image is converted to 4:3, the result will be impressive, but not perfect. For example, here is the latest generation:

1 Like

Conversely, I liked the version that came out for Bing in mid-summer. There were no phantom censorship barriers. Even though support for many languages was disabled, which scared me at first since I was writing queries in them at the time :sweat_smile:, it eventually reduced the number of queries blocked for no reason. And the filter holds that bar even now. The whole problem is in the final stages of rendering.

If you’re interested, here’s my thread on the Microsoft forum.

1 Like

Yes, I agree that generation worked best from January 25 to June 15, 2024, just like the pictures of the princesses I posted on this page. And you’re right, it is just the final rendering step that goes wrong.

That was a pretty good time frame, and quality already started to plummet around November 11, 2024.

1 Like

That is not what the material I am talking about means. I mean that the style produced around September had a problem with the understanding of frills: several layers of skirts were rendered into a style similar to the “Spanish-style dance skirt”, that is, the Mexican-style skirt, because the layers could not be understood correctly. There was no such problem around June 15, 2024. The latest results do not have much of a problem with understanding; there are only rendering style issues and glitches. The composition of even the failed pieces is still very good. That is the problem I have encountered.

1 Like

On the contrary, I like it from mid-summer to today. There are no unnecessary items in the generation.

But I remember funny moments at the beginning of the year. If you generated on a military theme, there would be a grenade somewhere in some of the four results, even though there was nothing about it in the prompt.

1 Like

I’ve now noticed another problem. Before, you could describe the number of people and their identical costumes; now only one character ends up in the outfit as described. It’s not a common problem, but I had never had it before.

1 Like

Maybe the problem is still with the rendering algorithm and the link that gives orders to DALL-E 3. DALL-E 3 is still Rambo, but the general giving him orders has become Private Snafu. Maybe General Patton was in charge of him before, and now Snafu is in charge. That may explain why the pictures we see are quite strange.

1 Like

I have only generated this kind of picture. I don’t quite understand your style, and I rarely dabble in that kind of thing. I don’t know if this picture is helpful to you. This is the only picture I have made; the setting is a shooting range.

1 Like

OLD

DALL-E 3

.https://1drv.ms/i/c/d11a9eb3058eacfe/EWBy4vNowo9Bo08wVbNnSuoBlbF9XhrGCbX0VjkerSUrug?e=WZwLpE

DALL-E 3

.https://1drv.ms/i/c/d11a9eb3058eacfe/EYV6LZz-191CreLK-31HXgYBWVxQfX5bjUf8127AAoeFEg?e=aGxGmh

Now…

DALL-E 3 PR16

.https://1drv.ms/i/c/d11a9eb3058eacfe/EXAdrvszFbxFjjmeooay_-oB92PLotZrxdT7hj9qQ6nm6A?e=fQPEfh

DALL-E 3 PR16

.https://1drv.ms/i/c/d11a9eb3058eacfe/ERfhoutnMb9Fm2xIAVjyInUBLPO3QAL8r5o8Izp0cIswKw?e=jB2qNw

As you can see, the upper image meticulously follows every detail I provided, while the lower one focuses only on the character and bold lines. In a word, the image screams that it was made by AI. Even my special prompts don’t work.

Rest assured, the prompts for both images are the same.

I believe the person who made this update creates images at most six times a month. The images they create are probably very specific things… Therefore, they will never see how terrible this update is…

I’ve been waiting for it to be improved for days, but it’s still the same…

They might have thought of this update to make it work faster, but…

We want quality and originality! Not speed.

If the authorities care about us, please create an option for us to use the old version.

Finally…

I am sure that the person responsible for this update, who did not conduct feedback tests beforehand, is consoling themselves with various excuses. And I hope that someone who thinks like us will tell them all these facts to their face and revert everything to its original state. Or, the responsible person will understand the feedback and allow us to use the old model until the new one is fixed.

1 Like

Perhaps they think that faster generation will let users get results faster, but if users cannot get the desired details and quantity of works within their psychological threshold, they will expand the statistical range and screen large batches of generations to find the results they may want.

For example, this is like what we do when taking deep-space pictures. If the meteorological conditions, photometry, and flat-field data are normal within an observation session, then a poor shot must be due to problems with the observation window, or the telescope is not working as required.

The same is true for image generation. When users cannot get the results they want, they will sample over a larger range. Once prompt engineering does not work, they will use large-scale generation to achieve their goals, which is the opposite of what was intended. Therefore, it is not advisable to blindly pursue generation speed while ignoring the model’s ability to interpret prompts.

Number of images generated × effective quality rate = total number of usable images.

The number of images generated can tend to infinity, but this is just a proportional function. If the effective quality rate is a small fraction, or even close to 0, then the total number of usable images stays close to 0.

I just want results: if there are more good pictures, the curve goes up proportionally; if there are fewer good pictures, or close to none, then the total number of usable pictures will always stay near 0.

Because, in effect, ∞ × 0 = 0.
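A minimal way to write that argument down, with N and q as my own placeholder symbols for the count of generated images and the usable fraction (not notation from this thread):

```latex
% N = number of images generated (placeholder symbol)
% q = fraction of generations that are actually usable (placeholder symbol)
% U = number of usable images
\[
  U = N \cdot q
\]
% U grows proportionally with N only while q stays fixed;
% if q falls to 0, then U = N \cdot 0 = 0 no matter how large N becomes.
```

So raising the generation count cannot compensate for a quality rate that has collapsed.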

3 Likes