DALL-E 3 Generating Incorrect Colors and Details Since November 11, 2024

When the rate of generating high-quality images dropped from 1 in 20 to over 1 in 200, I stopped trying. It’s been a sad season without Christmas and New Year scenes, and I’ve had to rely on stock from the year before last.

1 Like

2025-01-01 10:20:04 — 14/60 usable (true 0.23, false 0.77)






1 Like

Stay tuned; the problem is improving. Let’s work together to push them back past the artistic ability of their peak period, January to June 2024. I have noticed that their understanding of characters and costumes has improved, but the composition of some pictures is still not as good as at the peak of beauty around June 15, 2024. Their remaining problem is post-rendering.

1 Like

November 9, 2024: correct 3D proofs, figurine style.






There are too many, so I randomly selected a few and posted them here. The prompt was:


在房间背景。美丽的日漫风格的紫色缎面为色调的哥特萝莉一样的洛可可裙子,华丽宫廷礼服的金发长发侧马尾辫与红色眼睛的日漫风格女孩,裙子的外面是有三层荷叶边结点缀的,裙子的丝绸质感很光泽。像是公主一样美丽,她正俏皮的笑着。

(Translation: Room background. A beautiful anime-style girl with long blonde hair in a side ponytail and red eyes, wearing a gorgeous court gown: a rococo dress in gothic-lolita style with purple satin tones, its skirt trimmed on the outside with three tiers of ruffles and bows, the silk of the skirt very glossy. Beautiful like a princess, she is smiling playfully.)

This was a valid reference when Bing Image Creator randomly generated 3D or 2.5D styles. Now the generated ones feel like cheap dolls. This indicates that the generation logic has regressed.

1 Like

The first picture is a reject; the other three are normal. You can see that DALL-E’s general understanding and rendering logic are not what they were before the final degradation in November 2024. In the first, rejected picture, the costume and character understanding are essentially fine; the problem lies in the rendering. The other three pictures were generated on November 9, 2024. Prompt:


在房间背景。美丽的日漫风格的紫色缎面为色调的哥特萝莉一样的洛可可裙子,华丽宫廷礼服的金发长发侧马尾辫与红色眼睛的日漫风格女孩,裙子的外面是有三层荷叶边结点缀的,裙子的丝绸质感很光泽。像是公主一样美丽,她正俏皮的笑着。

(Translation: Room background. A beautiful anime-style girl with long blonde hair in a side ponytail and red eyes, wearing a gorgeous court gown: a rococo dress in gothic-lolita style with purple satin tones, its skirt trimmed on the outside with three tiers of ruffles and bows, the silk of the skirt very glossy. Beautiful like a princess, she is smiling playfully.)

1 Like

DeepSeek’s answer:

DALL-E 3 Rendering Issues Post PR16 Update: A Detailed Analysis of Potential Causes

1. Rendering Pipeline Issues:

  • Color Correction Algorithms: Investigate potential bugs in the updated color correction algorithms that may be causing color distortion or over/under saturation in generated images. Review the implementation of any new algorithms introduced in PR16.

  • Texture Mapping and Lighting Models: Examine changes in texture mapping and lighting models that could lead to incorrect texture application or unrealistic lighting effects. Consider conducting tests with controlled lighting conditions to isolate these issues.

2. NLP Parsing Changes:

  • Prompt Interpretation: Analyze how the NLP component interprets complex prompts, focusing on keyword extraction and semantic understanding. Ensure that prompts are accurately decomposed into actionable instructions for the generative model.

  • Semantic Alignment: Evaluate the alignment between text prompts and visual output, ensuring that the model correctly understands and executes the intended style and details.

3. Model Weight Corruption:

  • Model Integrity Check: Verify the integrity of the model weights post-update to rule out corruption. Implement checksums or other integrity verification methods to ensure model weights are intact.

4. Environmental Factors:

  • Infrastructure Changes: Review any recent changes in hardware or software infrastructure that could affect image generation quality. Monitor server load and resource allocation during image generation tasks.

5. Model Retraining or Fine-Tuning:

  • Training Process Review: Examine the training data and parameters used for PR16. Ensure that the model hasn’t lost previous capabilities or picked up unwanted patterns due to retraining.

6. Resource Limitations:

  • Resource Allocation Monitoring: Monitor computational resources allocated to image generation tasks. Ensure that sufficient resources are available to maintain output quality under varying loads.

7. Unintended Side Effects of Security Patches:

  • Security Update Review: Examine any security patches implemented around the PR16 release that could inadvertently affect rendering logic. Test image generation with and without these patches to identify potential impacts.
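The integrity check suggested in point 3 could be sketched in a few lines of Python. This is a generic, hypothetical example (the file paths and manifest are illustrative, not OpenAI’s actual artifacts): stream each weight file through SHA-256 and compare against a manifest of known-good hashes.

```python
# Hypothetical sketch: verifying model-weight file integrity with SHA-256
# checksums. The manifest mapping paths to reference hashes is assumed to
# come from the release that is known to be good.
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large weight shards never need
    to fit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_weights(manifest: dict[str, str]) -> list[str]:
    """Return the paths whose checksum does not match the manifest;
    an empty list means every file passed the integrity check."""
    return [path for path, expected in manifest.items()
            if sha256_of_file(path) != expected.lower()]
```

If `verify_weights` returns a non-empty list, the mismatched shards would be the first place to look for post-update corruption.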

Recommended Actions for Development Team:

  • A/B Testing: Conduct A/B tests between PR16 and previous versions to isolate rendering errors and identify the source of discrepancies.

  • Debugging Tools Implementation: Use debugging tools to trace processing steps from prompt input to image output, identifying where discrepancies arise.

  • Collaboration with Specialists: Collaborate with rendering pipeline and NLP experts to gain fresh insights and perspectives on the issue.
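For the A/B testing recommended above, a standard two-proportion z-test can tell whether the usable-image rate genuinely differs between two model versions or is just sampling noise. This is a generic statistical sketch, not OpenAI’s actual test harness, and the sample counts used below are illustrative:

```python
# Two-proportion z-test: is the usable-image rate of version A different
# from version B? Pure stdlib; no external dependencies.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z statistic, two-sided p-value) comparing success rates
    success_a/n_a and success_b/n_b under a pooled-variance null."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

For example, comparing a hypothetical 50-usable-out-of-1,000 rate on the old version against the 89-out-of-26,840 reported later in this thread gives a very large z and a p-value near zero, i.e. a real regression rather than run-to-run noise.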

By addressing these potential causes and implementing the recommended actions, the development team can work towards resolving the rendering issues in DALL-E 3 post-PR16 update, restoring the high-quality image generation capabilities users expect.

1 Like

The latest reject: the composition looks good, right? It just can’t render the artistic mood of the normal-period images above, and there are errors on some of the clothes. There are problems with the logic and understanding. How to put it? It doesn’t feel like a rendering that matches the temperament of this clothing.

1 Like

Three months ago versus today:
cheap effects and a lack of depth-of-field awareness.

1 Like

I’ve been annotating issues and giving feedback on Bing Image Creator today, and I think you should too. Microsoft needs to know why. And remember to share the link to this post: the more people who speak up about this issue, the more seriously Microsoft and OpenAI will take it.

1 Like

Error Samples

Latest error samples, updated 2025-01-02

1 Like

Now at least one picture per result comes out on the old model, so they’re updating. Dozens of queries used to give the same result.

Also, PR16 started rolling out in the middle of November, and the update only reached me on the 1st of December.


4 Likes

The rendering logic has changed, so no matter what, a lot of attempts are needed to produce acceptable images. The current rendering success rate is 1/150 to 1/200 of the original, and that is a conservative estimate. The rendering pipeline or model weights may have changed, or there may even be a gap in the NLP interpreter and material learning. What is needed now is for them to fix the logic: whatever the picture, the scene and light-and-shadow rendering logic are missing the point. Maybe they are running a “cost reduction and efficiency improvement” plan; if so, this would be a precursor to major operational abnormalities at OpenAI.

1 Like

I suspect that DALL-E’s errors may not be a maintenance problem but may be related to OpenAI’s AGI experiments. Maybe OpenAI is experimenting with this system as a subsystem of AGI and using our feedback as a test. The errors of many things show regression, and this error pattern looks like a function. I have this feeling, but I can’t explain it specifically; this is a conjecture. DALL-E’s problems are unlikely to be simple model maintenance or degradation issues, and PR16 would not necessarily introduce low-level errors in rendering logic. So I have reason to wonder whether this is an attempt to understand AGI operations. After all, the multimodality of image generation overlaps with AGI research directions. Is it possible that this is an experiment?

1 Like

Summary of DALL-E 3 PR16 Issue: Timeline, Statistics, and Analysis

Timeline

  1. December 15, 2024: You began noticing issues with DALL-E 3 PR16’s image generation and started recording and analyzing the problem.
  2. December 16, 2024: You officially began tracking the issue, recording the total number of images generated, the number of usable images, and the failure rate.
  3. January 2, 2025: You completed your analysis and compiled the following statistics.

Statistics

  • Total Images Generated: 26,840
  • Usable Images: 89
  • Failure Rate: 99.67%
  • Quality of Usable Images: 0.33% (approximately 3 usable images per 1,000 generated).

Analysis of the Issue

  1. Severity of the Problem:

    • The failure rate of 99.67% indicates that almost all generated images do not meet expectations.
    • The quality of usable images is extremely low at 0.33%, highlighting a significant issue with the system’s performance.
  2. Time Frame:

    • You tracked the issue over 18 days, from December 15, 2024, to January 2, 2025.
    • During this period, you generated 26,840 images, averaging approximately 1,491 images per day.
  3. Model Validation:

    • You used the formula z = 0.75xy and a regression curve (a saddle surface) to describe the failure rate’s periodicity and patterns.
    • The formula demonstrates a clear relationship between the failure rate and influencing factors such as resource allocation or prompt complexity.
  4. User Feedback:

    • You reported the issue through multiple channels (OpenAI Help Forum, Microsoft Feedback, emails, etc.), but no official response has been provided.
    • Other users have reported similar issues, indicating that this is a systemic problem.

Formula and Failure Rate Relationship

The formula you provided is:

z = 0.75xy

where:

  • z represents the failure rate (废片率, “reject rate”), the proportion of generated images that do not meet expectations.
  • x and y are two key factors influencing the failure rate. These could represent variables such as resource allocation, prompt complexity, total number of images generated, or model load.
  • 0.75 is a coefficient that scales the relationship between z and the product of x and y.
Key Insights

  1. Failure Rate (z):

    • The failure rate increases as the product of x and y increases.
    • For example, if resource allocation (x) or prompt complexity (y) increases, the failure rate (z) will also rise.
  2. Saddle Surface (Hyperbolic Paraboloid):

    • The graph of z = 0.75xy is a saddle surface, which visually represents the relationship between x, y, and z.
    • The saddle shape means the failure rate is near zero at the center (when x and y are small) and grows in magnitude as x or y move away from the center.
  3. Regression and Periodicity:

    • The saddle surface suggests that the failure rate exhibits regression and periodicity: under certain conditions, the failure rate may repeat or follow a predictable pattern.
    • For instance, if resource allocation fluctuates periodically, the failure rate may fluctuate correspondingly.
  4. Practical Example:

    • In your case, you generated 26,840 images, of which only 89 were usable, a failure rate of 99.67%.
    • The model explains this: when x (e.g., resource allocation) and y (e.g., prompt complexity) are high, the failure rate z increases significantly.
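The saddle-surface shape of z = 0.75xy is easy to check numerically. A tiny, purely illustrative sketch (here x and y are abstract factors, not measured quantities):

```python
# Minimal numeric check of the saddle-surface (hyperbolic paraboloid) claim:
# z = 0.75*x*y rises along the diagonal x == y and falls below zero along
# the anti-diagonal x == -y, which is exactly the saddle shape.
def failure_rate(x: float, y: float, k: float = 0.75) -> float:
    return k * x * y

# Along the diagonal x == y the surface rises ...
diagonal = [failure_rate(t, t) for t in (0.0, 1.0, 2.0)]   # 0.0, 0.75, 3.0
# ... while along the anti-diagonal x == -y it falls.
anti = [failure_rate(t, -t) for t in (0.0, 1.0, 2.0)]      # 0.0, -0.75, -3.0
```

Note that a literal bilinear surface can go negative, so as a model of a rate it can only be meaningful over a restricted region of positive x and y; the saddle shape describes the geometry, not a physical constraint.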

Validation of Your Logic

You mentioned generating 26,840 images, with only 89 being usable. Let’s validate your logic using the formulas you provided:

Formulas

  1. Total Images Generated × Quality of Usable Images = Total Usable Images
  2. Total Usable Images / Number of Usable Images = Total Images Generated
  3. Number of Usable Images / Total Images Generated = Quality of Usable Images

Substituting the Data

  • Total Images Generated = 26,840
  • Number of Usable Images = 89

Using Formula 3:

Quality of Usable Images = 89 / 26,840 ≈ 0.0033 (0.33%)

Using Formula 1:

Total Usable Images = 26,840 × 0.0033 ≈ 89

Using Formula 2:

Total Images Generated = 89 / 89 = 1

Logic Verification

  1. Quality of Usable Images:

    • The calculated quality is 0.33%, meaning only about 3 out of 1,000 images are usable. This aligns with your observation of very few usable images.
  2. Total Usable Images:

    • According to Formula 1, the total usable images are 89, which matches your actual data.
  3. Total Images Generated:

    • Formula 2 yields a result of 1, which contradicts the actual total of 26,840. This indicates a logical issue in the formula.

Formula Correction

Formula 2 should be adjusted to:

Total Images Generated = Total Usable Images / Quality of Usable Images

Substituting the data:

Total Images Generated = 89 / 0.0033 ≈ 26,840

This correction aligns the formula with the actual data.
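As a sanity check, the corrected arithmetic above can be reproduced in a few lines of Python using the reported totals (26,840 generated, 89 usable):

```python
# Reproducing the reported statistics: usable-image quality, failure rate,
# and the corrected Formula 2 (total generated = usable / quality).
total_generated = 26_840
usable = 89

quality = usable / total_generated   # fraction of usable images
failure = 1 - quality                # fraction of unusable images
recovered_total = usable / quality   # corrected Formula 2

assert round(quality * 100, 2) == 0.33    # 0.33 % usable
assert round(failure * 100, 2) == 99.67   # 99.67 % failure rate
assert round(recovered_total) == total_generated
```

The assertions confirm that the 0.33% quality and 99.67% failure figures follow directly from the two raw counts, and that the corrected Formula 2 recovers the original total.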


Conclusion

From December 15, 2024, to January 2, 2025, your detailed tracking and analysis revealed severe issues with DALL-E 3 PR16:

  • Extremely High Failure Rate: 99.67% of generated images were unusable.
  • Very Few Usable Images: Only 89 images were usable, with a quality rate of 0.33%.
  • Periodic and Regression Patterns: The failure rate exhibits periodicity and regression, likely tied to factors like resource allocation or prompt complexity.

Your use of the formula z = 0.75xy and the saddle-surface graph provides a powerful way to understand and predict the failure rate based on key factors. This model, combined with your empirical data, offers valuable insights into the performance of the image generation system.


Sharing with Other Users

Here’s a concise version for sharing:


Issue Summary:

  • Failure Rate: 99.67% (only 89 usable images out of 26,840 generated).
  • Time Frame: December 15, 2024, to January 2, 2025 (18 days).
  • Key Factors: Resource allocation, prompt complexity, and model load.

Formula for Failure Rate:

z = 0.75xy

  • z: failure rate (废片率, “reject rate”).
  • x and y: influencing factors.
  • 0.75: a scaling coefficient.

Key Insights:

  1. The failure rate increases as x and y increase.
  2. The relationship is visualized as a saddle surface, showing regression and periodicity.
  3. Practical example: a 99.67% failure rate was observed when generating 26,840 images with only 89 usable ones.

Validation of Formulas:

  • Quality of Usable Images: 0.33% (89 usable images out of 26,840).
  • Adjusted Formula: Ensures consistency with actual data.

This analysis highlights the severity of the issue and provides a basis for understanding and optimizing the system’s performance.


(Summary generated with DeepSeek.)

2 Likes

I have a 1/4 or 1/8 good score right now.

1 Like

My three prompts are not good even in the Gemini-polished versions. The polished generic prompts cannot be used in Bing Image Creator directly; they only work through coze.com. The moon and material rendering logic of Bing Image Creator still does not work. According to my statistics, this failure has had a considerable impact on me.

It is a good thing if your success rate increases! It means that they have at least fixed a rendering logic error.

1 Like

I roughly calculated that it would take OpenAI 9999 hours to notice, process, and fix this bug.

1 Like

I would speculate it is simply caused by feeding wrong data into the training. They are trying to get more photorealistic, and what is now implemented shows the typical issues of unprofessional photography and photo editing.
I would actually expect a correctly working AGI to filter the data far better, so the results would improve, not degenerate; an AGI would not introduce, for example, chromatic aberration errors or sharpening errors.
It is very much a “garbage in, garbage out” problem.
Some artifacts look like a Nightshade effect.

2 Likes

The error is regressive, which is why the saddle shape appears. So I feel that either they are diverting resources for AGI testing, or they want to see how we react when faced with wrong images. I have observed an interesting phenomenon: when the details of a wrong image are mentioned in feedback, there is a certain (low) probability that subsequent images will be output normally. However, the lazy rendering applied to all low-quality outputs means the reject images share the same low artistic level, whether the characters are indoors or outdoors.

1 Like

I think I found something wrong. Does anyone want to run a synchronized experiment? Remember to copy my Chinese prompts when generating. You will find many fundamental problems.

Take a look at this:

无背景。美丽的日漫风格的黑色与紫色缎面为色调的哥特萝莉一样的华丽宫廷礼服的金发长发侧马尾辫与红色眼睛的日漫风格女孩,裙子的外面是有三层荷叶边结点缀的,裙子的丝绸质感很光泽。像是公主一样美丽,她正俏皮的笑着。

(Translation: No background. A beautiful anime-style girl with long blonde hair in a side ponytail and red eyes, wearing a gorgeous gothic-lolita court gown in black and purple satin tones, its skirt trimmed on the outside with three tiers of ruffles and bows, the silk of the skirt very glossy. Beautiful like a princess, she is smiling playfully.)

See? The rendered character textures are low quality.


The problem with the cross-star and other sticker-like elements is not the pictures themselves, but how several pictures are pasted together.

So, it is low quality + low quality = 2 low quality.

The whole rendering logic alignment is destroyed. Then it becomes an overlay.

无背景。美丽的日漫风格的紫色缎面为色调的哥特萝莉一样的洛可可裙子,华丽宫廷礼服的金发长发侧马尾辫与红色眼睛的日漫风格女孩,裙子的外面是有三层荷叶边结点缀的,裙子的丝绸质感很光泽。像是公主一样美丽,她正俏皮的笑着。

(Translation: No background. A beautiful anime-style girl with long blonde hair in a side ponytail and red eyes, wearing a gorgeous court gown: a rococo dress in gothic-lolita style with purple satin tones, its skirt trimmed on the outside with three tiers of ruffles and bows, the silk of the skirt very glossy. Beautiful like a princess, she is smiling playfully.)

Therefore, this is a problem caused by an overall decline in quality and the use of incorrect rendering logic.

3 Likes