Inconsistent character images despite input_fidelity=high

I’m trying to build a storybook for my daughter, but I’m having pretty major issues getting any sort of character consistency. My workflow with the API is:

  1. Generate character images
  2. Generate a storybook page using the character plus a scene description

Here are two images generated with similar input prompts and characters. There are two major issues:

  1. The generated character does not bear any similarity to the input
  2. The art style changes from image to image (though the examples below really only demonstrate #1)
```json
{
  "model": "gpt-4o",
  "tools": [
    {
      "type": "image_generation",
      "size": "1536x1024",
      "quality": "high",
      "output_format": "jpeg",
      "background": "auto",
      "moderation": "auto",
      "input_fidelity": "high"
    }
  ],
  "tool_choice": { "type": "image_generation" },
  "background": true,
  "input": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_text",
          "text": "${page_prompt}"
        },
        { "type": "input_image", "image_url": "data:image/jpeg;base64,...", "detail": "high" },
        { "type": "input_image", "image_url": "data:image/jpeg;base64,...", "detail": "high" }
      ]
    }
  ]
}
```

page_prompt:

```
Your most important task is to maintain all facial features, skin tone, outfits and hairstyles from the input images, and use them per the scene descriptions. Adjust poses and expressions to suit the scene. Produce images consistent with input images and match the following Art Style: Soft watercolour with pencil outlines, warm pastel palette, gentle textures, cosy indoor light, expressive toddler-friendly faces.

SCENE DESCRIPTION
Medium close-up on the child detective opening a small notebook and holding a pencil. The tan dwarf hamster inside the clear travel ball looks up eagerly. Background shows the empty plate and the small silver fan on the counter. Soft afternoon light, slight vignette focus.
```

The log in the attached screenshot shows the input images and the output images.

Any clues or guidance on what can be done to improve how closely the output respects the input images?


Any ideas on this?

I’ve tried doing the same thing in the Playground and the results align with my expectations, so I’m wondering what I’m doing wrong in the API.

The API request generated by the Playground looks close to identical, with the exception of:
`store: true, include: ["web_search_call.action.sources"]`

```javascript
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

const response = await openai.responses.create({
  model: "gpt-4.1-mini",
  input: [
    {
      "role": "user",
      "content": [
        {
          "type": "input_text",
          "text": "Your most important task is to maintain all facial features, skin tone, outfits and hairstyles from the input images, and use them per the scene descriptions. Adjust poses and expressions to suit the scene. Produce images consistent with input images and match the following Art Style: Soft watercolour with pencil outlines, warm pastel palette, gentle textures, cosy indoor light, expressive toddler-friendly faces. SCENE DESCRIPTION Medium close-up on the child detective opening a small notebook and holding a pencil. The tan dwarf hamster inside the clear travel ball looks up eagerly. Background shows the empty plate and the small silver fan on the counter. Soft afternoon light, slight vignette focus."
        },
        {
          "type": "input_image",
          "image_url": "data:image/jpeg;base64,..."
        },
        {
          "type": "input_image",
          "image_url": "data:image/jpeg;base64,..."
        }
      ]
    }
  ],
  text: {
    "format": {
      "type": "text"
    }
  },
  reasoning: {},
  tools: [
    {
      "type": "image_generation",
      "size": "1536x1024",
      "quality": "high",
      "output_format": "jpeg",
      "background": "auto",
      "moderation": "auto"
    }
  ],
  tool_choice: {
    "type": "image_generation"
  },
  temperature: 1,
  max_output_tokens: 2048,
  top_p: 1,
  store: true,
  include: ["web_search_call.action.sources"]
});
```

You will get the highest performance when you use the “edits” API endpoint with the model “gpt-image-1”. It lets you supply exactly the input images, along with prompt language that talks about them. It is not a “chat”: there are no distractions like “web_search”, and your language is not sent to a middleman AI that has to decide to call a tool to make an image.

Note that when you use “input_fidelity” high, it has the characteristic of direct replication of the input, such as ensuring that a person in a photograph still looks like themselves. Costing $0.06 more per input image, it actually prevents transformations and re-imaginings, such as a character seen from a different viewpoint or in a different style.

The best consistency comes from having one or two “reference character” images (plain pictures of the subject) and then referring to them in the prompting: “I’ve provided two views of the character ‘Marisa’ that will appear in the new comic, in different poses and situations…” Then keep the rest of the prompting consistent, such as providing an “example cartoon panel” image, or consistent language describing the product desired, with only the situation changing.
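A minimal sketch of going through the edits endpoint directly, assuming the OpenAI Node SDK. The helper `buildEditRequest` is my own naming, and the commented-out call shows where the actual `openai.images.edit` request would go:

```typescript
// Hypothetical sketch: calling the Images "edits" endpoint directly with
// gpt-image-1 instead of routing through the Responses API. Field names
// follow the OpenAI Node SDK; verify against your installed SDK version.

type EditRequest = {
  model: string;
  prompt: string;
  input_fidelity?: "low" | "high";
  size?: string;
  quality?: string;
};

// Pure helper so the request shape can be inspected (and tested) before
// attaching the image files and sending.
export function buildEditRequest(prompt: string): EditRequest {
  return {
    model: "gpt-image-1",
    prompt,
    input_fidelity: "high", // keeps faces/outfits close to the references
    size: "1536x1024",
    quality: "high",
  };
}

// Usage (network call, requires OPENAI_API_KEY; untested here):
// const result = await openai.images.edit({
//   ...buildEditRequest(pagePrompt),
//   image: [fs.createReadStream("marisa-front.jpg"),
//           fs.createReadStream("marisa-side.jpg")],
// });
```

Keeping the request construction pure makes it easy to log exactly what is being sent, which matters given the debugging below.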

I’ve tried removing input_fidelity, but still no luck.
I understand your point about using the edit API directly, but I’ve found, at least in the Playground, that the added layer of a model to optimise prompts is generally beneficial for my use case.

I think I’ve narrowed down the issue a bit: the token count suggests the input images aren’t being considered at all. This is my request, but I only see 5,466 tokens total (the output alone should be 6,208 tokens):

```json
{
  "model": "gpt-4o",
  "tools": [
    {
      "type": "image_generation",
      "size": "1536x1024",
      "quality": "high",
      "output_format": "jpeg",
      "background": "auto",
      "moderation": "auto"
    }
  ],
  "tool_choice": { "type": "image_generation" },
  "temperature": 1,
  "top_p": 1,
  "max_output_tokens": 2048,
  "reasoning": {},
  "background": true,
  "text": { "format": { "type": "text" } },
  "input": [
    {
      "role": "user",
      "content": [
        {
          "type": "input_text",
          "text": "Your most important task is to maintain all facial features, skin tone, outfits and hairstyles from the input images, and use them per the scene descriptions. \n Adjust poses and expressions to suit the scene.\nProduce images consistent with the input images.\n\n\n\nSCENE DESCRIPTION\nMedium shot of the flour-dusted adult in a blue-striped apron placing star-shaped cookies on a cooling rack. The child detective watches nearby. Warm kitchen with mixing bowls and wooden spoon. Eye-level view. Late morning light, soft and bright.\n"
        },
        { "type": "input_image", "image_url": "data:image/jpeg;" },
        { "type": "input_image", "image_url": "data:image/jpeg;" }
      ]
    }
  ]
}
```
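For what it’s worth, the truncated `data:image/jpeg;` values in that logged request (no `base64,` marker, no payload) would explain the low token count: the model never receives any pixels. A small guard (my own helper, not part of any SDK) can catch this before the request is sent:

```typescript
// Hypothetical validation helper: reject data URIs that are missing the
// base64 marker or carry no meaningful payload, so a truncated image is
// caught before the API call instead of silently producing 0 image tokens.

const DATA_URI_RE = /^data:image\/(png|jpeg|webp);base64,[A-Za-z0-9+/]+={0,2}$/;

export function isUsableImageDataUri(uri: string): boolean {
  // Require a media type, the "base64," marker, and a non-trivial payload;
  // a real image is far longer than 100 characters once encoded.
  return DATA_URI_RE.test(uri) && uri.length > 100;
}
```

Running every `image_url` through a check like this (and throwing on failure) would have surfaced the truncation immediately.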

My b64 images are being generated with:

```typescript
// Deno: encodeBase64 comes from the standard library; adjust the
// specifier to match your import map if needed.
import { encodeBase64 } from "jsr:@std/encoding/base64";

export async function downloadImageAsBase64(imageUrl: string): Promise<string> {
  const response = await fetch(imageUrl);

  if (!response.ok) {
    throw new Error(`Failed to download image: ${response.status} ${response.statusText}`);
  }

  const imageBuffer = await response.arrayBuffer();
  const imageBytes = new Uint8Array(imageBuffer);

  // Convert to base64 using Deno standard library
  const base64 = encodeBase64(imageBytes);
  // Get content type from response headers to determine format
  const contentType = response.headers.get("content-type") || "image/jpeg";
  const dataURI = `data:${contentType};base64,${base64}`;
  return dataURI;
}
```

In the meantime, I’ll give the edit endpoint a crack.

Ended up having to park this, but just came back to it.

I remember my issue with the image APIs - they don’t support async/background requests.
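Since the Images endpoints don’t accept `"background": true` the way the Responses API does, one workaround is to run the call inside your own job wrapper and poll it from the client. A minimal in-memory sketch (hypothetical naming; a real service would persist jobs in a durable store):

```typescript
// Minimal fire-and-forget job wrapper to fake "background" behaviour for
// endpoints that only support synchronous requests.

type Job<T> = {
  status: "running" | "done" | "error";
  result?: T;
  error?: string;
};

const jobs = new Map<string, Job<unknown>>();

// Start the work and record its outcome. The returned promise can be
// ignored by the caller (fire-and-forget) or awaited in tests.
export function startJob<T>(id: string, work: () => Promise<T>): Promise<void> {
  jobs.set(id, { status: "running" });
  return work()
    .then((result) => {
      jobs.set(id, { status: "done", result });
    })
    .catch((e) => {
      jobs.set(id, { status: "error", error: String(e) });
    });
}

export function getJob(id: string): Job<unknown> | undefined {
  return jobs.get(id);
}
```

A page-generation handler would then call `startJob(pageId, () => openai.images.edit(...))`, return the job id immediately, and let the client poll `getJob` until the status flips to `done`.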