[Bug Report] Severe Context Contamination in GPT-4o Image Generation (Upcoming API Implications)

Hello OpenAI Community,

My colleagues and I have been extensively testing GPT-4o’s new image-generation capabilities integrated into ChatGPT, as we’re evaluating it for use in some of our development projects. While we’ve found the technology impressive overall, we’ve uncovered a significant issue related to context persistence that we believe is important to highlight - not only for current ChatGPT users but especially for developers anticipating the upcoming API release.

Specifically, we’ve repeatedly observed a phenomenon we’ve come to call “sticky visual context.” Once GPT-4o generates an image within a session, certain visual elements or stylistic choices from that initial image tend to persist stubbornly through subsequent generations - even when explicitly instructed otherwise. The behavior is consistent even when later prompts include detailed descriptions that clearly specify different elements, styles, or layouts. The earlier visual choices seem to “contaminate” all subsequent outputs, making iterative refinement or precise adjustments extremely challenging.

In practical terms, this means that once certain unwanted visual details appear in an early image, they become nearly impossible to remove or significantly alter in later attempts, regardless of how explicit or detailed our instructions become. We’ve also heard from other users experiencing similar frustrations: visual elements or styles from early images persistently reappearing, even when explicitly contradicted by new prompts. Additionally, we’ve occasionally noticed spontaneous text fragments - clearly derived from earlier conversational context but entirely unrequested - randomly appearing in subsequent images. These hallucinations further illustrate the model’s difficulty in isolating each new generation from the established session context.

Given some confusion we’ve seen online (Twitter, Reddit, etc.), I also want to clarify how GPT-4o image generation actually works within ChatGPT. When people noticed that GPT-4o internally calls an image-generation tool with a schema like this:

image_gen.text2im({
  prompt?: string,
  size?: string,
  n?: number,
  transparent_background?: boolean,
  referenced_image_ids?: string[],
})

many understandably assumed it was similar to previous DALL-E integrations, where the prompt string was the primary instruction sent to a separate, external model. However, our testing (and OpenAI’s documentation) clearly shows that’s not the case. GPT-4o’s image generation is deeply integrated within its multimodal conversational context in ChatGPT. In fact, we found we could even leave the prompt string completely empty, and GPT-4o would still produce images fully consistent with the ongoing conversation. This strongly suggests the schema call - especially the prompt parameter - is more of a procedural or UI-level trigger for ChatGPT to initiate and display the image generation, and perhaps serves as an internal thinking “scratchpad” for the model, rather than being the primary instruction for the image content itself. The actual image generation relies heavily on the broader session context (previous text and images), not just the immediate prompt text.
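
To make this concrete, here is roughly what one of our test invocations looked like when we asked GPT-4o to leave the prompt empty. The argument values are illustrative, taken from our own sessions rather than any official documentation:

image_gen.text2im({
  // Deliberately empty prompt: the resulting image still matched the ongoing
  // conversation, which suggests the session context (prior text and images),
  // not this string, is what actually drives the generation.
  prompt: "",
  size: "1024x1024",
  n: 1,
  transparent_background: false,
})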

After discovering this persistent visual context (“stickiness”), we looked for ways to mitigate or control it. Initially, we hoped the referenced_image_ids parameter in the schema might let us explicitly control or limit the context - for example, by instructing GPT-4o to reference specific images exclusively and ignore all others. Unfortunately, extensive testing showed this parameter had no measurable impact on reducing or controlling the persistent context. We also experimented with ChatGPT’s direct editing UI, masking specific areas of previously generated images to guide refinements interactively, but once the visual context contamination set in, these masking attempts were similarly ineffective at reducing the unwanted persistence.
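
For reference, this is the shape of the call we instructed the model to issue when testing referenced_image_ids. The ID shown is a placeholder, not a real identifier from our sessions:

image_gen.text2im({
  prompt: "Same scene, but with a plain white background and no text anywhere",
  size: "1024x1024",
  n: 1,
  // Placeholder ID: we asked the model to reference only this one image and
  // ignore everything else, but it made no measurable difference to the
  // persistent context described above.
  referenced_image_ids: ["img_placeholder_id"],
})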

Given how pronounced and consistent this behavior is, it seems clear this is an inherent characteristic of the current GPT-4o model snapshot rather than merely a UI or implementation quirk. Perhaps future post-training or fine-tuning could reduce the strength of prior context influence, improving the model’s ability to respond to explicit instructions without being overly biased by earlier images. However, as things stand now, the context persistence is simply too strong and poses real challenges for iterative creative workflows, reliable corrections, and predictable application development.

Within the ChatGPT interface, addressing this could involve straightforward UI solutions. For example, allowing users to create explicitly sandboxed sub-conversations for isolated tasks, or providing clear UI elements to select exactly which prior images and conversation segments should influence a new generation. Another approach could be to actually make the existing referenced_image_ids parameter functional within the tool call, allowing users (or the model acting on user instruction) to specify precisely which images should serve as context, thereby overriding the default persistent memory.

Regarding the upcoming API, my primary concern is how developers will manage this context programmatically. I can only assume the API will provide control over visual context, perhaps analogous to how the messages array allows precise control over conversational history (e.g., by including images within the message objects). However, given that we’ve observed prior text context also contributing to this contamination, simply omitting images in the message history might not be a complete solution. Therefore, for the API, perhaps an additional parameter to control the strength of influence from the provided context (both images and text in the message history) could be a necessary tool for developers to manage this behavior effectively and build predictable applications.
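
Purely as a sketch of what we mean - none of these parameter names are confirmed, and image_context_strength in particular is something we are wishing for, not an existing field - a request might look something like this:

{
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "A poster in the same style as this reference, but with a blue palette" },
        // The developer explicitly chooses which prior image(s) enter the context.
        { type: "image_url", image_url: { url: "https://example.com/reference.png" } },
      ],
    },
  ],
  // Hypothetical parameter: how strongly the provided context (images and text
  // above) should bias the new image, where 0 would mean "ignore it entirely".
  image_context_strength: 0.3,
}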

We’re posting this because we believe GPT-4o Image Gen is an incredible leap forward, but the current lack of visual context “strength” control is a significant limitation. I carefully reviewed the GPT-4o System Card and the recent addendum on native image generation, and while they detail many capabilities and known limitations, I couldn’t find any specific mention of this persistent context behavior. Given its significant impact on usability, especially for iterative work, I felt it was crucial to bring this to the community’s and the team’s attention. I find it hard to believe the OpenAI team isn’t already aware of this issue, but we wanted to clearly document our experience and stress its importance. We’d greatly appreciate hearing the team’s perspective on this and whether mechanisms for finer-grained context control might be planned for the upcoming API release. I’d also like to hear if anyone else here is facing similar issues.

Thanks,

NN

You can actually generate images with this model through the API?

How is pricing calculated for those images?
Which model are you using specifically to enable this?
How did you write the API call, and how do you receive the images?