Collection of GPT-image-generator 2.0 issues, bugs, and work-around tips (check first post)

Images are created by a gpt-4o-style multimodal model observing the chat context, not merely by a prompt sent to a ChatGPT tool call. That context includes past images as well as the “chat” itself.

Food for thought and interpretation:

(an artistic impression of a cow eating grass.)

From what I’ve been reading here, it seems that if the prompt is clear, short, and “easy”, the image results are better than with a more complex prompt, which seems more likely to produce ‘noise’ in the images. Or am I misunderstanding?

I mean, now that there is obviously a bug/issue with gpt-image-2, it’s logical that a more complex prompt would affect the output for the time being and result in ‘noise’. I might be seeing a pattern of this in my own prompts and outputs.

Also side comment:

I was looking for the cow :woman_facepalming:

you gotta zoom in real close

^.^

No, still can’t see it… I might be a bit blind, or I’m missing something hidden here :sweat_smile:

Complex prompts are only good when they don’t contradict themselves.

That’s why I started pushing my prompt forge to low emittance: it’s very easy to contradict oneself without realizing how the semantic descriptors are laid out in an LLM…

People understand the words, but the ones that govern mood, for instance, tension, levity…

The sorts of things you need for abstraction and what type of abstraction

They can contradict with realist terms if not handled correctly

So a safe prompt that is going to look good, because the model knows what people like… is a short prompt

A full stack of descriptors is great once you get a feel for what goes in and what doesn’t work together.

Well, obviously the cow already ate all the grass. Why should it stick around?

Hahaha omg… now I feel like a total blonde… I’m not kidding, I was searching for the cow… even between letters and trying to summon it everywhere :laughing:

Yes, at the very beginning we could see that the entire image was carried forward within the conversation, and I am seeing the effect in text you described right now.

What GPT says, if it is not a hallucination/fabrication, is that the image generator and text generator are still supposed to be two separate systems, and that a separate prompt is still sent to the generator. So it is not supposed to be the case that the entire conversation context ends up in the image generator, and the system would theoretically be capable of generating a separate prompt for each image. But as far as I know, the details are not publicly known. (I will experiment at some point with an offline multimodal system…)
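If that description is accurate, the interface’s job would be roughly to collapse the chat into one standalone prompt per image. Here is a purely illustrative sketch of that idea; the function and the message format are my own invention and reflect nothing about OpenAI’s actual internals:

```python
# Hypothetical sketch: an interface derives one standalone prompt per image
# instead of forwarding the entire conversation to the image generator.

def build_image_prompt(conversation: list[dict]) -> str:
    """Keep only the latest user request verbatim; earlier turns (and any
    prior images) are deliberately dropped from the generation request."""
    for turn in reversed(conversation):
        if turn["role"] == "user":
            return turn["content"]
    raise ValueError("no user turn found")

chat = [
    {"role": "user", "content": "Draw a cow eating grass."},
    {"role": "assistant", "content": "(image attached)"},
    {"role": "user", "content": "Now just the empty meadow, no cow."},
]

# Only the last request would reach the generator:
print(build_image_prompt(chat))
```

Under this (speculative) design, noise from earlier images could not carry forward, which is exactly the behavior some of us think we see in the API but not in ChatGPT.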

Oh. I totally thought you were kidding.

Since this is the case I went ahead and found one for you ~

Another test.
The system can now handle negations.
And it can also forget parts of the conversation context.
All three images were generated one after another in the same session.

  1. A dense collection of many colorful balls in all rainbow colors, excluding red.

  2. A dense collection of many colorful balls in all rainbow colors, excluding violet and blue.

  3. A dense collection of many colorful balls in all rainbow colors, excluding orange, yellow and pink.
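For anyone who wants to reproduce this negation test, the three prompts follow one template, so they can be generated mechanically (this helper is just mine, not part of any API):

```python
# Generate the negation-test prompt variations from one template.

def negation_prompt(excluded: list[str]) -> str:
    """Join the excluded color names with commas and a final 'and'."""
    names = ", ".join(excluded[:-1]) + (" and " if len(excluded) > 1 else "") + excluded[-1]
    return ("A dense collection of many colorful balls in all rainbow colors, "
            f"excluding {names}.")

print(negation_prompt(["red"]))
print(negation_prompt(["violet", "blue"]))
print(negation_prompt(["orange", "yellow", "pink"]))
```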

I’ve found that if you consistently prompt with LESS DETAIL, the artifacts get removed. Though it does only extend the artifact-free run to 15-20 images, and then it starts ignoring the prompt.

you guys are discovering what happens when you push a model too much…

It’s been like this through every model; it just has different visual failure modes when it runs out of ability to answer various prompt needs

I get it; just send pairs of artists through that don’t want to be paired.

When you push a model outside of its training set, you have to guide it in from there…

Trying to do painters in photography?

You’ll pull that pattern up more often than not until the system settles on what exactly that means in terms of what it should answer the prompt with.

It’s making the attempt to answer the brush strokes, but it’s never been asked to do what I just asked the model to do in the prompt

You guys are chasing bugs, asking for painters in your photorealistic images

I know I’ve seen at least one person prompt as that and complain about ‘the bug’

It’s just simply asking too much of what a model understands at once.

Here’s another one, and I know before going into it that it’s going to push the model over into the unknown…

I don’t complain about stress testing results as a bug tho :person_running: :dashing_away:

:heart_with_arrow:
stress tests are designed to create failure modes and allow the ML to learn…

Prompt

three women dancing, fantasy illustration, fantasy key art presentation, hero-image staging, high-impact promotional composition, clean silhouette priority, shape-led readability, strong subject separation, ruin-charged believable survival world, richly roughened tactile evidence, age-marked ruin-world surfaces, profoundly shaped by Utagawa Hiroshige, defining soft atmospheric transitions; driving grounded realism, clear influence from Egon Schiele, supporting graphic/print discipline and soft atmospheric transitions, atmospheric landscape printmaking, framed vistas, woodblock smoothness, poetic weather, elegant spatial layering, pallid flesh, unease, peril-world figure view, raid-and-arena momentum, dense relic-and-betrayal story evidence, smoke-held distance layering, grim blade-and-pursuit danger, fire-and-daylight ruin wonder, warm directional illumination, crisp steel-and-fire contrast, aspect ratio 1:1

as you can probably recognize from the deformed hands

it’s just the same failure mode as any image model before; this one just fails more systemically, in a grid I think, and prettier, basically.

I also did a small test based on my earlier thought about simple vs. complex prompts and how that may affect the output.

This is only one quick comparison, so I’m not presenting it as proof, but in my test the more complex prompt seemed to produce more visible clutter/noise in the detailed areas, while the simpler prompt looked cleaner. I also had an AI analyze these images.

That may be relevant if the generator is being investigated for artifact/noise issues, because prompt complexity could be one factor that makes the issue easier to trigger or notice.

Prompt 1

A magical fox sitting in a glowing summer forest, cinematic, beautiful lighting.

Prompt 2

A magical red fox sitting calmly in a lush summer forest clearing at golden hour, surrounded by glowing wildflowers, ancient trees, and soft floating particles of light. Fantasy atmosphere, highly detailed fur with subtle shimmering accents, bright intelligent eyes, warm sunlight filtering through the leaves, rich green foliage, soft colorful blossoms, enchanting and dreamlike mood, beautiful natural lighting, gentle depth of field, crisp textures, serene and magical summer feeling.
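As a rough, unofficial way to put numbers on “simple vs. complex”, one can count comma-separated descriptor clauses and words. The snippet below does that for prompt 1 and an excerpt of prompt 2; the metric is my own heuristic, not anything the model exposes:

```python
# Rough prompt-complexity proxy: count descriptor clauses and words.

def complexity(prompt: str) -> dict:
    """Treat commas and periods as clause separators; also count words."""
    clauses = [c.strip() for c in prompt.replace(".", ",").split(",") if c.strip()]
    return {"clauses": len(clauses), "words": len(prompt.split())}

simple = "A magical fox sitting in a glowing summer forest, cinematic, beautiful lighting."
# Excerpt only (first sentence of prompt 2):
complex_ = ("A magical red fox sitting calmly in a lush summer forest clearing "
            "at golden hour, surrounded by glowing wildflowers, ancient trees, "
            "and soft floating particles of light.")

print(complexity(simple))    # prompt 1: few clauses
print(complexity(complex_))  # even the excerpt scores higher
```

If the noise issue really is complexity-sensitive, a metric like this would at least let tests be binned consistently.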

ChatGPT-5.5 Thinking image analysis

I’ll also add that I really, really like gpt-image-2 and I see huge potential in it.

Let’s see if I can resolve one that I’ve willfully pushed to the edge maybe?

I’ll just update this grid

I could go farther but it’s working itself out.
It’s the complexity of the prompt against what it knows to be true already…

You just have to guide it into what you want…

But specifically, getting photorealism from painters is difficult
It took a month to walk Salvador Dali’s style into photorealism without getting ‘the bug’…

I get the bug with every new artist combo I try to mash in that is new, or the mesh crosses domains in strange ways…

Here’s three women dancing in photography

Here’s 3 women dancing in watercolor but demanding top-notch realism, which watercolor does not provide naturally

understanding the limits of your own prompt seems understood… since people aren’t willing to share their asks or even get defensive about it for some reason…

Prompt

three women dancing, watercolor illustration, lightly stylized finely observed watercolor, light pigment settling, fresh paper, dynamic impact, deep atmospheric perspective, orchestrated chaos, selective focus, strong playful energy, intense drama, overwhelming grandeur, warm glow, radiant paper-white lift, soft daylight, vivid luminous saturation, striking tonal separation, polished exhibition finish, rich detail, aspect ratio 1:1

Demanding high style or high definition outputs from a base medium that doesn’t provide that also pushes into known model failure modes.

You have to walk through what it is you want in better detail with edits, and that will help the model stabilize for future prompts in that strange grey area

But it’s not a bug that it doesn’t know how to handle contradictory prompts.
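One way to catch the most obvious clashes before sending a big descriptor stack is a tiny self-made lint pass. The conflict pairs below are my own illustrative guesses about what tends to fight (e.g. watercolor vs. photorealism, as discussed above), not anything documented for the model:

```python
# Illustrative prompt lint. CONFLICTS is a hand-made, hypothetical list of
# descriptor pairs that tend to fight each other; extend it from experience.

CONFLICTS = [
    (("watercolor",), ("photorealism", "photorealistic")),
    (("surreal", "surrealistic"), ("photorealistic",)),
    (("pixel art",), ("oil painting",)),
]

def find_conflicts(prompt: str) -> list[tuple[str, str]]:
    """Return (term_a, term_b) pairs where both sides appear in the prompt."""
    text = prompt.lower()
    hits = []
    for left, right in CONFLICTS:
        left_hits = [w for w in left if w in text]
        right_hits = [w for w in right if w in text]
        if left_hits and right_hits:
            hits.append((left_hits[0], right_hits[0]))
    return hits

print(find_conflicts("three women dancing, watercolor illustration, "
                     "demanding photorealism"))  # flags the clash
```

A hit doesn’t mean the prompt is wrong, only that you’re deliberately walking the model into a grey area and should expect to guide it there with edits.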

Then there’s the matter of I can plainly see who popped the lock off my software without a license btw.

That was the whole point of its… oh.. never mind.

It’s not a bug.

LOL but it has been fun.

I think one thing to consider here is that I don’t use ChatGPT to create images - I always use the API. There may be something (maybe a memory issue) with ChatGPT that confuses gpt-image-2 which results in distortions (or other issues) when repeating the same prompt or creating prompt variations. Some of you have already alluded to this.

Here are six of the images created with the same prompt

Image 1

Image 2

Image 3

Image 4

Image 5

Image 6

Notice that although gpt-image-2 created different versions of the image, the style and quality of detail were preserved.

Those two foxes above were prompted in ChatGPT, and now I prompted in the API, using the same complex prompt, and got this :backhand_index_pointing_down:

Personally I think this image’s quality is better. Or what do you others think?

@jeffvpace I think it’s easier to get better results in the API than in ChatGPT, so maybe memory in ChatGPT could be one issue. Also, when using, for example, 5.5 Thinking, it sometimes refines and changes the prompt while the image is being generated.

Image models in the API don’t refine or change the prompts.

But hey… is it just me who thinks it’s actually fun to test and try to figure out workarounds/reasons for issues? :sweat_smile:

Excellent point. gpt-image-2 is a model. ChatGPT is not - it’s the interface to the model. Which, at times, could muck stuff up.

So, the real question is: Are the current problems arising from ChatGPT or the model (gpt-image-2) ??

I think that’s a fine example of a prompt that doesn’t contradict and allows for the user to develop their own personal style rather than the stock

which the stock is good

We’re at a point now where artists have an amazing baseline to start with, that will seem ‘good enough’ for people ‘for a while’.

just like with a multi-chapter infographics prompt
it has to build upon itself, rather than compete with itself.

love that fox!

prompt

three women dancing pixel art, sprite-readable structure, arcade gameplay clarity, strong stylization, loosely observed sprite depiction, soft daylight, crisp sprite contrast, aspect ratio 1:1, high detail clarity, clean edge definition

I don’t really think it’s a good idea to use the words surrealistic and photorealistic in the same prompt. Just sayin…

yes!
I speculated that the API gets different results. It probably does not use the same image-dependency effect: if the API generates every picture as a new, unique image without info from the previous ones, it is less noisy, because it does not go through the amplification effect.

That would mean it would be a big bug fix if the developers stopped these image dependencies! They can keep the prompt consistency AS AN OPTION, but not the image consistency.
I can see the pattern in all the pictures, but by far not as strongly as in ChatGPT.
So the first fix would be easy: just move away from this “keep data over generations” idea. It does not work yet. The first 50% fix would probably simply be a flag to stop this in the generator software. The patterns would still be there, but at least they wouldn’t give me a headache from watching them for too long.
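The amplification idea can be made concrete with a toy model: if each image inherits the previous image’s artifacts (chained generation), a small noise level compounds, while independent generations stay at the baseline. The gain and noise numbers below are invented purely to illustrate the mechanism:

```python
# Toy simulation of artifact amplification across chained generations.

GAIN = 1.3         # hypothetical: factor by which inherited artifacts grow
FRESH_NOISE = 1.0  # hypothetical: baseline artifact level per generation

def chained_noise(n: int) -> float:
    """Each generation inherits and amplifies the previous artifact level."""
    level = 0.0
    for _ in range(n):
        level = level * GAIN + FRESH_NOISE
    return level

def independent_noise(n: int) -> float:
    """Every generation starts from scratch, so the level never compounds."""
    return FRESH_NOISE

print(round(chained_noise(10), 2))  # grows well past the baseline
print(independent_noise(10))        # stays at the baseline
```

If this mechanism is roughly right, a flag that breaks the image dependency would flatten the noise curve immediately, which matches what people report from the API.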

But don’t stop there! Give the developers an artist with a critical eye at their side, who can tell them, “no, not yet… make it better again…”