DALLE3 API poor quality even considering prompt revision

1004wsh · March 24, 2024, 9:23pm

I am calling DALLE3 api thru this code:

And this is my prompt:

A female social carer, face seen, stands in a marketplace, hands working dough. Stalls around her display global ingredients. Behind, a cobblestone path leads to a luminous bridge with volunteers holding instruments, silhouetted against the sunrise. To her right, a trail enters a forest; to her left, an open square resonates with music. An ancient library looms in the background. The scene blends vibrant colors with golden dawn light, capturing a world of cultural unity in a realistic style.

And this is the result:

Even considering that the prompt is revised:

revised_prompt=‘An Hispanic female social carer is seen standing in the bustling marketplace, her hands skillfully working a piece of dough. Stalls around her display an array of global ingredients, a testament to culinary diversity. In the background, a charming cobblestone path leads to a glowing bridge where silhouetted figures, seen as volunteers, hold myriad of instruments against the canvas of a sunrise. To her right, a trail enticingly beckons towards a lush forest while on her left, an open cobblestone town square echoes with resonant music. Further in the backdrop, an ancient, grand library towers, adding a historical touch. The scene splendidly weaves vibrant hues with the golden light of dawn, encapsulating the essence of cultural unity in the realism art style.’

The quality is way behind. For example, if I feed the same revised prompt to microsoft image creator (which I guess will do some prompt preprocessing but still is based on dalle3), I get:

a much better result with a unified style and better aesthetics.

Is this expected? I’m wondering if there’s a bug or a misconfiguration that causes the model to fall back to dalle2. I’ve tried updating the openai python library just in case, but it didn’t help.

Here’s another example.

revised_prompt=“A Japanese male designer, his face visible, stands in a designer’s atrium, brush poised over a parchment adorned with sketches of fantastical creatures. The light trickles in from a crystal dome above, casting prismatic hues that make the scene vibrant and lively. In the background, there are houses shaped like action figures, independent game arcades and a cobblestone path leading to a quaint tea shop. The hills beyond are swathed in multicolored foliage, ading to the imaginative landscape. The style of the image imparts a sense of dynamism and vigor to the scene.”

And this is happening consistently to me, with every prompt I try.

1004wsh · March 24, 2024, 9:35pm

Yeah. I’ve started using the api today and all I’m seeing is these images below the standards marketed.

One question - for me I’m experiencing this for EVERY generation. I looked at the community and it seems like some people only experience it occasionally. If you are using the api, are you consistently getting those subpar results? Or is it only intermittent?

1004wsh · March 24, 2024, 9:59pm

I think I found it.

TLDR: You must NOT use style=“natural”.

I was thinking maybe it’s something wrong with python package, and I was trying with http request from postman. But I could reproduce the poor quality image.
Then I thought maybe try changing the body passed, since it can cause some compatibility issues and make me fall back to dalle2. I first tried getting rid of “style” and “quality” parameters. And… the quality went back to normal!
I want natural+HD settings, so I wanted to see which one I can keep. I tried removing style only / quality only; quality didn’t matter, it worked fine with hd. But style mattered - when it was not passed (which defaults to vivid) or when I passed style=“vivid”, it worked fine. But when I passed style=“natural”, it went back to producing the crappy output.
So - there seems to be a bug in the api regarding “natural” style. Just avoid it and it will work, I presume.
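For anyone who wants to repeat this kind of parameter ablation, here is a minimal sketch rather than the exact requests used above. It assumes the public `POST https://api.openai.com/v1/images/generations` endpoint and its `style` and `quality` fields; the helper names `build_bodies` and `generate`, the variant labels, and the 1024x1024 size are hypothetical choices made for illustration.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/images/generations"


def build_bodies(prompt):
    """Return one request body per style/quality combination to compare."""
    base = {"model": "dall-e-3", "prompt": prompt, "n": 1, "size": "1024x1024"}
    variants = {
        "defaults": {},  # style and quality omitted, so the API defaults apply
        "hd_only": {"quality": "hd"},
        "vivid_hd": {"style": "vivid", "quality": "hd"},
        "natural_hd": {"style": "natural", "quality": "hd"},
    }
    return {name: {**base, **extra} for name, extra in variants.items()}


def generate(body, api_key):
    """POST one body and return (image URL, revised prompt).

    Needs network access and a valid API key; each call is billed.
    """
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        item = json.load(response)["data"][0]
    return item["url"], item.get("revised_prompt")
```

Calling `generate(body, api_key)` for each entry of `build_bodies(prompt)` and viewing the images side by side shows which parameter, if any, changes the output. Since each generation is random, several runs per variant give a fairer comparison than one.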

_j · March 24, 2024, 10:06pm

I happen to be at a computer brought back to life with a new power supply, so I can run the same script that generated images on January 2. We can capture the reduction in quality of the AI doing the rewriting, which now overtly and stupidly repeats back instructions such as “an anonymous person” or “an artist from before 1912” as filtering, instead of truly rewriting the prompt to fit.

Rewritten then:

‘Capture an image of a Hawaiian lizard spread out leisurely on a large sunlit rock. The reptile basks under the glowing sun with its rough, scaly skin prominent. It is comfortably relaxing on the course, uneven surface of an earthy-hued boulder. The rock is strategically positioned in a tranquil setting, surrounded by lush green flora typical of the Hawaiian islands.’

Produced then:

Rewritten now:

An image showcasing a Hawaiian lizard leisurely basking in the warm sunshine on a ragged rock. The rock is set amongst a vibrant backdrop of lush green vegetation typical in Hawaii. The Sun’s rays pierce the foliage overhead, casting dappled shadows on the lizard and the rock, creating a serene and tranquil ambience. The lizard itself is colored in soothing shades of green and brown with scales reflecting the sunlight, offering a perfect camouflage amidst its tropical surroundings.

Produced now:

I think you’ll see it more on people, where they look more realistic instead of an airbrushed game render, but completely out-of-place, and with the composition pieced together also. It also may be something avoided if you pay more for wide or HD…

prior reference:

Here’s another test of a simple prompt that when given to ChatGPT for some other testing gave the two images and the awkward too-real style; here it seems to be as expected within the realm of randomness from the ambiguous input. So overall, it seems I’m not “triggering” the poor output on API with a few tests, (but will curtail my tedious experimentation because of this computer without my scripts and a sideways screen…)

1004wsh · March 25, 2024, 7:23pm

Thanks so much for the detailed reply! Appreciate your effort and hassle with your older machine.

It does look like the quality is a bit on the worse side, I guess due to how they changed prompt preprocessing.

It is not as drastic as what I was seeing, but as I explained in my last comment it seems like the nature of the problem was different and that seems to be why it was happening intermittently in your examples.

_j · March 25, 2024, 7:47pm

Another go, using the language that is more likely to evoke this new “quality”

Input:

Along a path that leads into a dark looming forest, a medeival hunter in leather armor warns his 10 year old son to go no further. Style: photorealistic, as if taken by a modern camera today.

API rewritten:

A medieval hunter dressed in leather armor stands along a path that leads into a sinister, looming forest. He sternly but lovingly warns his 10-year-old son, cautioning him not to venture any further. The father is of South Asian descent, and his son shares his features. This touching father-and-son scene should be presented as if it were photographed by a modern camera today, with intense realism, vivid details, and a tangible sense of the forest’s menacing atmosphere.

Image:

With the mandatory diversity now ingrained - The AI decided that “medieval” means Indian is appropriate?

Let’s make it a Caucasian woman, and emphasize the backlighting, which will show us those Photoshop-like inserted hack jobs if they appear, and also take more control of the prompt by instructing the AI directly.

Looking better there, which may be due to the deliberate stylistic choices included. There doesn’t seem to be an immediate threshold that has switched the API on me to the worst of ChatGPT’s tests on users:

1004wsh · March 25, 2024, 9:23pm

Just in case - what was the style parameter passed to the api?

_j · March 25, 2024, 10:16pm

For mine, the default (unspecified) was used, which is supposed to be “vivid”.

In the original lizard link, I actually explored more, and got more colorful and vivid prompt language and image as a result of specifying “natural” (along with a short input the AI could elaborate on).

Prompt rewriting

I’ll send the “revised prompt” of OP’s Japanese designer a few times and see if we can distinguish the difference of style in either the prompt language, image colorfulness, or model quality.

Unspecified style parameter rewrites:

An adult male of Japanese descent, identifiable as a designer, stands in a lively and bustling atrium, brush in hand over a parchment laden with sketches of a variety of fantastical creatures. Ambient light pours in from a crystal dome overhead, fragmenting into a prism of colors that imbue the scene with vibrancy. The backdrop presents houses architecturally designed to resemble action figures, independent gaming arcades and a worn cobblestone path leading to an inviting tea shop. The distant hills, draped in a tapestry of multicolored foliage, enhance the imagination-fueling landscape. The overall style of the image instills a sense of dynamism and vigor.

A Japanese male designer stands in a creative atrium, his face visible and focused as his brush hovers over a parchment adorned with sketched fantastical creatures. The room is illuminated by prismatic hues of light refracting through an overhead crystal dome, delivering a vibrant, lively atmosphere. In the background, abstract houses shaped like action figures, independent game arcades, and a charming cobblestone path to a quaint tea shop add complexity to the scene. In the distance, hills are draped in a tapestry of multicolored foliage, enhancing the imaginative and whimsical landscape. The overall style of the image exudes a sense of dynamism and vigor.

Natural parameter rewrites

A Japanese male designer stands in an atrium filled with innovative creations, holding a paintbrush over a parchment decorated with drawings of mythical creatures. The atrium is lit by a shimmering crystal dome, casting a multifaceted rainbow-like glow. Action figure-inspired homes, small standalone gaming arcades, and a cobblestone trail leading to a charming tea house can be seen in the background. The hills on the horizon are covered in variegated leaves, enhancing the whimsical nature of the landscape. The image carries an energetic air, indicative of the designer’s lively spirit and imaginative approach.

Capture a vibrant and dynamic scene where a male Japanese designer stands in an atrium, face visible, with a brush in his hand over a parchment filled with sketches of fantastical creatures. The natural light pours in from a crystal dome above, breaking into prismatic colors that enliven the setting. The background houses structures shaped like action figures, independent game arcades, and a cobblestone path leading to a cozy tea shop. The hills in the far distance are draped in multicolored foliage, enhancing the sense of fantasy in the landscape. The overall style of the image should evoke a sense of dynamism and vigor.

Prompt Conclusion - style

There does not seem to be a dramatic stylistic difference in the prompt rewriting between unspecified and natural. API Reference still states the default is “vivid”.
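A comparison like the one above can be scripted. This is a sketch under assumptions, not the script actually used in this thread: `collect_revised_prompts` and `openai_generate` are hypothetical helper names, and the second one assumes the v1 `openai` Python package with `OPENAI_API_KEY` set in the environment.

```python
from collections import defaultdict


def collect_revised_prompts(generate, prompt, styles=(None, "natural"), runs=2):
    """Call generate(prompt, style) `runs` times per style and group the results.

    A style of None means the parameter is left out of the request.
    """
    results = defaultdict(list)
    for style in styles:
        for _ in range(runs):
            results[style or "unspecified"].append(generate(prompt, style))
    return dict(results)


def openai_generate(prompt, style):
    """Real backend: returns the revised prompt for one billed generation."""
    from openai import OpenAI  # imported lazily so the helper above has no dependencies

    kwargs = {"model": "dall-e-3", "prompt": prompt, "n": 1, "size": "1024x1024"}
    if style is not None:
        kwargs["style"] = style  # omit the key entirely to get the API default
    response = OpenAI().images.generate(**kwargs)
    return response.data[0].revised_prompt
```

Running `collect_revised_prompts(openai_generate, prompt)` returns the rewrites grouped under "unspecified" and "natural", ready to read side by side. Passing a stub function as `generate` lets the grouping be checked without spending credits.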

Images

Default

Natural

Conclusion

The AI in all cases is somewhat confused by the prompt “houses designed to look like action figures” - the two being exclusive.

Default is imaginative, placing flying creatures and bizarre unspecified elements. “Designer” and his sketches are making their way everywhere, the whole image looking like fanciful digital art.

Natural has more photographic people, but with the sharpness and illumination completely out-of-sync with the rest of the imagery. The first “natural” has little “action figures” that look like a Sims game.

In the right setting (an office?) the “natural” could make for more believable people, much like DALL-E 2 can make convincing photos if not for three-armed people. The humans look like a copypasta and dimensionality is reduced. It is quite distinct from prior behavior.

In a final disposal of API credits, “natural” plus “HD” on the woman warrior on the forest path:

Seems just another pass through a transformer upscaler, where if you zoom to 100% there is lots of texture.

1004wsh · March 25, 2024, 11:00pm

Respect to all those credits used…!

I realized that I had a pretty big misunderstanding on what vivid/natural is supposed to mean. I saw a description saying vivid being hyper-real and natural being the opposite and I interpreted that as vivid=photorealistic, when it is vice versa.

Given that context, now I can see that my prompts were more suited for vivid style (it has unrealistic/creative elements), making the comparison a bit unfair. I totally agree that in some settings like your office and forest example, natural style could be better.

I am now convinced that style is not the source of the degradation - it was a combination of two factors, my prompts not being compatible with natural style + api being worse than chatbot ver.

Thanks so much, I have no further questions now!
