Collection of GPT-4o-images prompting tips, issues and bugs

Here is a collection of tips and tricks, as well as some weaknesses and limitations, for the GPT image generator. It took me quite some time to figure out relatively simple things, so this might save you some time when experimenting.

The first post includes all the findings and will be updated from time to time.

There are no tips here for the API or Python, only prompting for the image generation system itself.


References and links:

For API users:
Generate images with GPT Image | OpenAI Cookbook

Check this link too for the old DallE-3 model, in case you still have access to it:
Collection of Dall-E 3 prompting tips, issues and bugs


General:

  • Too dark images: The images in DallE 3 were generally too bright and contained light sources that couldn’t be turned off. That was probably an attempt to solve the “too dark” problem seen in earlier versions. Now the images are often too dark, especially when darkness is part of the motif.
    You can try to describe your own light sources in the prompt to better control the mood, but this often doesn’t work well, or you end up with an image that is otherwise fine but still too dark, and that you would like to keep.
    Overly dark images can be manually corrected afterward using something like gamma 1.5 in an image editing program. However, this reduces the color space (Banding effect), which can become an issue during further editing.
    So it’s better to try adjusting the lighting through the prompt itself.

    • For a safer correction (though this is still not the same as a well-lit image):
    • Convert 8-bit to 16-bit.
    • Apply a debanding algorithm (smooth out the steps).
    • Apply all corrections (for example, gamma 1.5).
    • Convert back to 8-bit.
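A minimal sketch of the four steps above in Python with NumPy. The helper name `gamma_correct_16bit` is my own, and the 3x3 box blur is only a crude stand-in for a real debanding filter; image editors use smarter, edge-preserving ones.

```python
import numpy as np

def gamma_correct_16bit(img_8bit: np.ndarray, gamma: float = 1.5) -> np.ndarray:
    """Brighten an 8-bit RGB image with less banding by working at 16-bit precision."""
    # 1. Convert 8-bit to 16-bit range (0..255 -> 0..65535; note 255 * 257 = 65535).
    img16 = img_8bit.astype(np.float64) * 257.0

    # 2. Crude "debanding": a 3x3 box blur to smooth the quantization steps.
    h, w = img16.shape[:2]
    pad = np.pad(img16, ((1, 1), (1, 1), (0, 0)), mode="edge")
    smooth = sum(pad[dy:dy + h, dx:dx + w]
                 for dy in range(3) for dx in range(3)) / 9.0

    # 3. Apply gamma (e.g. 1.5): normalize, raise to 1/gamma, rescale.
    #    For gamma > 1 this brightens the midtones.
    corrected = (smooth / 65535.0) ** (1.0 / gamma) * 65535.0

    # 4. Convert back to 8-bit.
    return np.clip(np.round(corrected / 257.0), 0, 255).astype(np.uint8)
```

Doing the smoothing and gamma correction at high precision (step 2 before step 3) is what avoids re-introducing visible banding when the brightened values are quantized back to 8-bit.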
  • Distortions caused by a poor post-processing effect: Currently, a low-quality post-processing generator is being used, presumably because the images, unlike those from DallE 3, are often quite flat and lacking in detail. However, this effect is poor: it creates unnatural textures, oversharpening, and additional darkening. Worst of all, it introduces distortions that partially alter the structures in the image. You can recognize it as a final pass that appears to “add details.”
    Since this effect is system-based, it cannot be disabled.

  • More consistency: GPT-4o-Image is more consistent than the old DallE 3 system was. And just like the memory of past images, this consistency can be both helpful and limiting. More consistency also limits creativity. Previously, a well-crafted prompt could lead to very different, creative images. With the new system, you often get more or less the same thing repeatedly.

  • Memory over generations: GPT-4o-Image remembers previously created images, in order to maintain consistency across image generations. This is useful for telling consistent stories, but the system currently does not recognize when a completely new theme is being created, or when each image is meant to be independent. This leads to elements from earlier images being reused, sometimes in ways that are unwanted.
    The solution is to open a new browser window every time (this annoyed me so much that gaining insights here will probably take a while).

  • Photo-Realism: GPT-4o-Image is more photo-realistic. It can now generate realistic-looking people, which DallE 3 never quite managed. This is mostly because the training clearly included many real-world photo datasets.
    But where there’s an advantage, there’s also a downside. Compared to DallE 3, GPT-4o-Image is far less creative, especially on its own. It needs more detailed prompting for creativity, and often still doesn’t reach the imaginative quality that DallE 3 had. Depending on the type of image you want to generate, one system or the other might be more useful. More realism comes at the cost of less fantasy.

  • Fallback from photorealism to painting: When generating more fantastical images, DallE 3 tends to fall back to a style reminiscent of airbrush art, which is often actually quite desirable. (This is probably because many of the images in its training data came from creative sources, and those tend to have that kind of style when they’re really well done.)
    GPT-4o-Image, on the other hand, falls back to a painterly style. For me, that’s quite undesirable, but for people who enjoy painted art, it might be just what they want.
    However, these images don’t reach the same level of creativity or detail as DallE 3, at least not right away. How much of that can be compensated for through prompting remains to be seen.

  • Photo weaknesses: Because of all the photo data, the weaknesses of such material have also made their way into the training. The images are grainy, have unnaturally sharpened edges, and show many typical flaws known from digital camera content.
    How much of this can be corrected remains to be seen. These image flaws cannot currently be influenced via prompting; they stem from the training data.

  • Comparing Dalle 3 and 4o: At the moment, GPT-4o-Image cannot replace DallE 3 for me. The two systems complement each other, but they are not the same. In fact, it would actually be desirable to have multiple systems with different specializations. Trying to pack all capabilities into a single system would require something that doesn’t currently exist, and maybe doesn’t even need to.
    And it shows clearly how it often is: what is a strength on one side is, at the same time, a weakness on the other. You cannot have all the strengths at once, because they often exclude each other.
    There’s no reason why we shouldn’t use different systems for different tasks. Personally, I prefer the DallE 3 system because of the types of images I usually generate, and I really hope it won’t just be deleted. Ideally, it would continue to exist as open source if OpenAI decides to retire it.
    For me, GPT-4o-Image cannot (yet) replace the DallE 3 system.

Technical:

  • PNG Format: GPT-4o-Image supports PNG, which allows for lossless compression and transparent backgrounds.
    (I’ve done some tests with AVIF, and it currently seems to offer the best compression. Unfortunately, it’s still not widely supported. However, it compresses about 20-30% better in lossless mode, and with almost no visible quality loss it can achieve compression ratios of 4x or more. If you’re looking to save space for archiving purposes, this open format is worth considering. Just note that in the software I used for compression, metadata wasn’t preserved.)

Here are two examples of typical digital camera quality in the training data:

This is for users to understand: it is not fixable by prompting, at least not as such, since it depends on the training data. The only way around it is to try to trigger other data and hope for the best.

A grainy image that has been enhanced using some poor algorithms.

And the typical edge artifacts caused by poor sharpening algorithms.


The new system is not a continuation of DALL·E 3 but a reboot. It produces a completely different style that resembles generators like Flux more than the old DALL·E 3 model.
It also seems that the developers consider the images too flat and have added something like a structure enhancer in the final phase. However, this often worsens the results. The images may become sharper and seemingly more detailed, but at the same time even darker than they already were, and above all, the changes lead to distortions and visual artifacts. After this phase, the images also appear less realistic.

Here are two examples: in the tiger’s eye you can clearly see the distortion, as well as in the fennec’s snout.

That the pictures are often too dark can be balanced out by simply describing light and light sources in the image. But the distortion effect cannot be switched off in the prompt.

The extra phase looks like a bad photo-editing job; most of the pictures would be better without it. It would be better if such functions were optional. (They still have no artist on the team.)

A tiger and a deer are standing close together in front of a waterfall in a lush, misty forest. (Captioned by AI)

A fennec fox with large ears is peacefully resting with its eyes closed in a lush, green environment. (Captioned by AI)


I have noticed this too: the preview can look very natural, but the final output can almost become ghoulish.

Here’s a new one.

A girl in a tank top stands in front of a mirror, but her reflection appears as a surprised fairy with wings in a magical forest. (Captioned by AI)

A young woman with fairy wings and pointed ears looks surprised with her mouth open and wide eyes. (Captioned by AI)


Wow, some of the distortions are horrible, especially if the face is small…
I would say 98-99% of the images lose quality with this effect. They cannot add details this way if the pictures are flat.
I would like to switch this effect off.

Another thing: DallE 3 was generally too bright; now it is too dark, and sometimes it is even difficult to get extra light into the scene if it competes with darkness in the prompt. And this final effect makes everything even darker.

(What is GPT-4o-Image? Is it a small Flux-like model with this extra pass, or even less? Did they reduce the parameters? Is it an effect in all multimodal models to generate flat results? It is now good for ads, but not so much for fantasy. Time will tell.)