Feature Request – Internal Multimodal Validation Before Image Display

Dear OpenAI Team,

I would like to suggest a structural improvement regarding the integration between language and image generation models within your multimodal systems.

Problem

Currently, there appears to be no internal mechanism for validating whether a generated image matches the user’s prompt before the image is displayed. The image generation model operates independently of the language model, and the language model cannot semantically evaluate or verify the image before it is shown to the user.

This lack of internal validation leads to frequent discrepancies between what the user requested and what is shown, often forcing multiple correction loops and manual confirmations from the user. This slows down workflows and undermines the system’s reliability in tasks that demand visual precision or spatial accuracy.

Proposed Solution

Introduce an internal validation step immediately after image generation and before the image is displayed to the user. This step could involve:

Using a vision-language model (e.g., CLIP or a similar model) to score the image’s semantic alignment with the prompt.

Detecting mismatches in object presence, count, spatial relationships, or layout orientation.

Automatically triggering a regeneration, or flagging the output as potentially misaligned with the instruction (a sketch of such a gate follows this list).
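
To make the first and third points concrete, here is a minimal sketch of such a validation gate, using an open-source CLIP checkpoint via the Hugging Face transformers library as a stand-in for whatever internal scorer would actually be used. The generate_image callable, the checkpoint choice, and the 0.28 threshold are all illustrative assumptions, not references to any existing OpenAI component.

```python
# Minimal sketch of a pre-display validation gate, assuming an
# open-source CLIP checkpoint as a stand-in for an internal scorer.
import torch
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # illustrative checkpoint choice
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def clip_alignment_score(prompt: str, image) -> float:
    """Cosine similarity between the prompt and the image in CLIP space."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # image_embeds / text_embeds come back L2-normalized, so the dot
    # product is already the cosine similarity.
    return (out.image_embeds @ out.text_embeds.T).item()

def generate_with_validation(prompt, generate_image,
                             threshold=0.28, max_attempts=3):
    """Regenerate until the score clears the threshold, else return the
    best attempt flagged as potentially misaligned.

    generate_image is a hypothetical callable wrapping the image model;
    0.28 is an illustrative threshold for ViT-B/32, not a tuned value.
    """
    best_image, best_score = None, float("-inf")
    for _ in range(max_attempts):
        image = generate_image(prompt)
        score = clip_alignment_score(prompt, image)
        if score >= threshold:
            return image, score, True   # aligned: safe to display
        if score > best_score:
            best_image, best_score = image, score
    return best_image, best_score, False  # flag instead of silently showing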
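```

One caveat worth noting: a single global CLIP similarity score is a coarse check, and CLIP is known to be unreliable for object counts and fine spatial relations, so the finer-grained mismatch detection in the second point would likely require an additional open-vocabulary detector or VQA-style model on top of this gate.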

Long-Term Recommendation

Move toward a more tightly integrated multimodal architecture in which the model can:

Reason about both text and image simultaneously,

Understand spatial and contextual relationships,

Self-evaluate and correct outputs before the user has to intervene.

This enhancement would significantly improve output consistency, reduce user frustration, and better support professional use cases where precision is essential.

Thank you for your time and continued innovation.

Best regards,
Slobodan Stojanovic
Senior UI / UX Designer
Switzerland