Bug/Request - image edits API: Support interleaved prompt and images, parity with Responses

Desired feature:

  • the ability to label individual images, when multiple images are sent, using multiple prompts
  • Solution: either allow multiple form-data fields of “prompt” and “image”, interleaved freely and ingested into model context in the order sent, or a 1:1 mapping of a prompt array to the image array.

Reasoning:

  • The Responses API can already do this: multimodal content parts of a user message are passed as context beyond a tool prompt
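
For reference, a minimal sketch of that interleaved structure as the Responses API accepts it today (field names as I understand the current Responses docs; the model choice and base64 data are placeholders):

# Sketch: interleaved text and image content parts in one Responses API user message.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1",  # placeholder: any vision-capable model
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "image 1 - background setting:"},
            {"type": "input_image", "image_url": "data:image/png;base64,<...>"},
            {"type": "input_text", "text": "image 2 - place this woman on the left:"},
            {"type": "input_image", "image_url": "data:image/png;base64,<...>"},
        ],
    }],
)
print(response.output_text)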

Reasoning why the Responses API is not a suitable workaround

  • inputs into chat: you pay again for “vision” on the same images, multiple times, for no purpose
  • tool call pattern: requires the AI to decide to use the tool and then write its own prompt
  • tool call pattern: the AI model can easily write its own unwarranted refusals or chat responses without ever touching the images
  • tool call pattern: internal iterations are possible, paying again and again for vision-based chat context across multiple tool calls and loops
  • tool call pattern: the AI can emit malformed tool calls, write loops, iterate on its own errors for more unwanted context billing and distraction, or fail when tool counts are limited
  • input as user: can be interpreted as wanting “chat” or ideas instead of a single, absolute function
  • output to user: useless chatting
  • complete lack of transparency: no control over what the image tool actually receives as input, no documentation of image_gen context scraping, injected tool instructions that are not the developer’s
  • completely outside the scope of making images
  • designed as a zero-imagination “do no better than free ChatGPT, at extreme expense” pathway

Bug report + feature request

Images Edits API: the duplicate-prompt error returned when repeating the multipart prompt field suggests using prompt[] array syntax instead, but prompt[] is also rejected; request support for multi-prompt labeling aligned with image[]

Summary

When sending multiple prompt fields in a multipart request to POST /v1/images/edits, the API returns an error instructing developers to use array syntax (prompt[]=...). However, when using prompt[], the API rejects the request with “expected a string, got an array”. This guidance is misleading and leaves it unclear how to provide per-image labels/instructions. Additionally, I request a supported structured mechanism for associating prompts/labels with specific images (a 1:1 mapping with image[]), similar in spirit to messages-based vision inputs.

Endpoint / Model

  • Endpoint: POST https://api.openai.com/v1/images/edits
  • Model tested: gpt-image-1.5

Expected Behavior (bug)

If the API error message suggests prompt[], then one of the following should be true:

  • prompt[] should be accepted by the endpoint, OR
  • The error message should not suggest prompt[] for endpoints that require a single prompt string.

Actual Behavior (bug)

  • Repeating prompt returns a duplicate-parameter error that suggests prompt[].
  • Using prompt[] results in invalid_type (array rejected).

Steps to Reproduce (minimal)

  1. Send a normal edits request with a single prompt: works.
  2. Send the same request but include multiple prompts: rejected with duplicate_parameter and hint to use prompt[].
  3. Send prompts using prompt[]: rejected with invalid_type.
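
For completeness, a minimal reproduction sketch in Python with requests (the API key, image file, and prompt texts are placeholders; the multipart field names follow the documented /v1/images/edits parameters as I understand them):

# Reproduction sketch for the three cases above.
import os
import requests

URL = "https://api.openai.com/v1/images/edits"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
IMG = ("img.png", open("img.png", "rb").read(), "image/png")

def post(fields):
    # Text fields are passed as (None, value) tuples so every part travels
    # in one ordered multipart body alongside the image.
    return requests.post(URL, headers=HEADERS, files=[("image[]", IMG)] + fields)

# 1. Single prompt: HTTP 200.
print(post([("model", (None, "gpt-image-1.5")),
            ("prompt", (None, "Extend the background"))]).status_code)

# 2. Repeated 'prompt' field: HTTP 400, code "duplicate_parameter",
#    with the hint to use 'prompt[]=<value>'.
print(post([("model", (None, "gpt-image-1.5")),
            ("prompt", (None, "image 1 - background")),
            ("prompt", (None, "image 2 - subject"))]).json())

# 3. The suggested array syntax: HTTP 400, code "invalid_type"
#    ("expected a string, but got an array instead").
print(post([("model", (None, "gpt-image-1.5")),
            ("prompt[]", (None, "image 1 - background")),
            ("prompt[]", (None, "image 2 - subject"))]).json())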

Evidence / Responses

Control request (single prompt) succeeds:

  • HTTP 200
  • Request ID: req_b4be2e3ea83f423c9c2d1623d27abf85
  • Response includes data[0].b64_json and usage.

Duplicate prompt rejected with misleading hint:

  • HTTP 400
  • Request ID: req_5d9f922c3a144d0196c5d2080aee0abd
  • Error:
{
  "error": {
    "message": "Duplicate parameter: 'prompt'. You provided multiple values for this parameter, whereas only one is allowed. If you are trying to provide a list of values, use the array syntax instead e.g. 'prompt[]=<value>'.",
    "type": "invalid_request_error",
    "param": "prompt",
    "code": "duplicate_parameter"
  }
}

Array syntax rejected:

  • HTTP 400
  • Request ID: req_22e1c6ba565f4863bf62299cde1a6159
  • Error:
{
  "error": {
    "message": "Invalid type for 'prompt': expected a string, but got an array instead.",
    "type": "invalid_request_error",
    "param": "prompt",
    "code": "invalid_type"
  }
}

Indexed syntax also rejected as array:

  • HTTP 400
  • Request ID: req_42cf5096889b46569f872e350b78e4cc
  • Error (same invalid_type).

Impact

  • Developers are given incorrect guidance by the error message and waste time trying an approach the endpoint cannot accept.
  • There is no structured way to attach per-image labels/captions/instructions to the multiple images that can be supplied for edits, which is a common need for reliable composition (“use image A as the logo reference, image B as the character reference, image C as the background style reference”, etc.).
  • This gap is especially noticeable given that other OpenAI endpoints support interleaved/structured multimodal inputs.

Requested Fix (bug)

Update the duplicate-parameter error message for /v1/images/edits (and likely /v1/images/generations) to be endpoint/schema-aware, e.g.:

  • “Duplicate parameter: prompt. Only one prompt string is supported for Images endpoints. Please provide a single prompt value.”

Or at minimum remove the prompt[] suggestion in contexts where the schema defines prompt as a string.

Requested feature

Give “edits” users parity with, and beyond, what ChatGPT and its “chatting” image tool already do. The API is for developers and original ideas, not for roadblocks and for gating features to ChatGPT subscribers only.

Solution: either allow multiple form-data fields of “prompt” and “image”, interleaved freely and ingested into model context in the order sent

Interesting - good idea. I like that better than the array solution - just me.

Edit: What would a prompt for each image achieve exactly?

Thank you for the question. It is easy not to see the application when you have only been exposed to ChatGPT, the Responses API playground, or the OpenAI “images” tool on the platform site. Those are uninspired: images are simply appended without respect to the surrounding language, added ad hoc, unordered, with no rearrangement and no way to attach metadata or labeling.

Here is the type of request one might make: a combination of ordering, metadata, and instructions of the kind that already works for vision analysis with AI models, transferable to image output from the same AI understanding.

Edits: gpt-image-1.5, if it supported free-form interleaved fields

prompt: "Create new image using this background setting of a Japanese garden. Outfill and extend the canvas into transparent areas. Remove the mask-highlighted element. image 1 -… "
image: ["base64"]
mask: "base64"
prompt: “—start of additional images, for placement into the setting—\n\n”
prompt: “image 2 - img_2244.jpg - place this woman standing on the left, overlooking a picnic scene”
image: ["base64"]
prompt: “image 3 - img_2553.jpg - place this woman seated on a checkerboard picnic blanket, on the left side, facing the camera.”
image: ["base64"]
prompt: “image 4 - mariko_beach.jpg - place this woman seated on a checkerboard picnic blanket, on the right side, facing towards the other seated woman.”
image: ["base64"]
prompt: “image 5 - kimono pattern.jpg - the standing woman is dressed instead in this apparel.”
image: ["base64"]
prompt: “image 6 - design.jpg; image 7 - new_design.jpg; - the left and right women are instead wearing kimonos in these fabric patterns, respectively.”
image: ["base64", "base64"]
prompt: “App note-Important: preserve facial features to represent the same individuals.”
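
None of the above is accepted today - it is the request. To show that no new wire format is needed, here is a hypothetical client sketch: repeated prompt/image fields in one ordered multipart body (Python requests emits the fields in list order, so the interleaving written above would survive on the wire exactly as authored):

# Hypothetical client call for the interleaved request above. The endpoint
# rejects repeated 'prompt' fields today; this only demonstrates that the
# ordering is preserved naturally by ordinary multipart form data.
import os
import requests

parts = [
    ("model",   (None, "gpt-image-1.5")),
    ("prompt",  (None, "Create new image using this background setting of a Japanese garden. ...")),
    ("image[]", ("background.png", open("background.png", "rb"), "image/png")),
    ("mask",    ("mask.png", open("mask.png", "rb"), "image/png")),
    ("prompt",  (None, "—start of additional images, for placement into the setting—")),
    ("prompt",  (None, "image 2 - img_2244.jpg - place this woman standing on the left, overlooking a picnic scene")),
    ("image[]", ("img_2244.jpg", open("img_2244.jpg", "rb"), "image/jpeg")),
    # ... remaining prompt/image pairs, in the order they should enter model context
]

resp = requests.post(
    "https://api.openai.com/v1/images/edits",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    files=parts,  # multipart fields are emitted in exactly this order
)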

Additional example: programmatic developer app

Imagine a developer application that is not primarily powered by UI prompting. Instead, you have a drag-and-drop layering interface, with callouts where you can set the mode of each image (synthesize, extract subject, change aspect) and give direct positional instruction.

The API call could look like the example below, which a gpt-5.1 AI can already improve and realize for me:

prompt: “Create new image using this background setting”
prompt: “PNG INPUT: 1024x1536 // base canvas”
image: [“<base64_background>”]

prompt: “—start of additional images, for placement into the setting—\n\n”

prompt: “PNG INPUT: 1024x1024.
Place the extracted subject into TargetBox[
id: ‘main_subject’,
canvas_size: { width: 1024, height: 1536 },
x: 128, // top-left of hitbox on base canvas
y: 512,
width: 768, // hitbox size, like a layer frame
height: 768,
fit: ‘contain’, // maintain aspect ratio, fully inside box
padding: 0.05, // 5% inner margin
anchor: ‘bottom-center’, // align subject within the box
z_index: 10
]”
image: [“<base64_subject_1>”]

prompt: “PNG INPUT: 1024x1024.
Place the extracted subject into TargetBox[
id: ‘secondary_subject’,
canvas_size: { width: 1024, height: 1536 },
x: 64,
y: 128,
width: 896,
height: 320,
fit: ‘contain’,
padding: 0.08,
anchor: ‘center’,
z_index: 5
]”
image: [“<base64_subject_2>”]

prompt: “… additional PNG INPUT blocks may follow, each with its own TargetBox[…] definition”

prompt: "–composition rules–
For each PNG INPUT:

  1. Perform subject extraction (remove background, keep main foreground object or person).
  2. Compute a uniform scale so the subject fits inside its TargetBox (width × height) while
    maintaining aspect ratio and leaving the specified padding.
  3. Position the scaled subject inside the TargetBox according to ‘anchor’
    (e.g. ‘bottom-center’, ‘center’, ‘top-left’).
  4. Composite onto the 1024x1536 base canvas using z_index for layer order
    (lower z_index behind, higher in front)."
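
None of those blocks need to be hand-written; a layering UI can emit them mechanically. A rough sketch of that serialization (the TargetBox helper below is my own illustration, not an existing API):

# Hypothetical helper: serialize one UI layer frame into a per-image prompt block.
from dataclasses import dataclass

@dataclass
class TargetBox:
    id: str
    canvas_w: int
    canvas_h: int
    x: int
    y: int
    width: int
    height: int
    fit: str = "contain"
    padding: float = 0.05
    anchor: str = "center"
    z_index: int = 0

    def to_prompt(self, source_size: str) -> str:
        return (
            f"PNG INPUT: {source_size}.\n"
            "Place the extracted subject into TargetBox[\n"
            f"  id: '{self.id}',\n"
            f"  canvas_size: {{ width: {self.canvas_w}, height: {self.canvas_h} }},\n"
            f"  x: {self.x}, y: {self.y},\n"
            f"  width: {self.width}, height: {self.height},\n"
            f"  fit: '{self.fit}', padding: {self.padding},\n"
            f"  anchor: '{self.anchor}', z_index: {self.z_index}\n"
            "]"
        )

# The 'main_subject' layer from the example above.
box = TargetBox("main_subject", 1024, 1536, 128, 512, 768, 768,
                anchor="bottom-center", z_index=10)
print(box.to_prompt("1024x1024"))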

On DALL-E 2, because of its ability to preserve pixels and only modify where a pixel-perfect mask was placed, one could do piecemeal $0.02 turns of prompt-based infill on a much larger image, constructing and blending imagery together. This model is not only deprecated, but what’s running on the API now is extremely damaged and useless.
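
For reference, that old workflow looked roughly like this on the client side (a sketch assuming the documented dall-e-2 edits parameters; transparent pixels in the mask mark the region to repaint, and everything else comes back untouched):

# Sketch of piecemeal infill on a larger canvas via DALL-E 2 image edits.
import base64
import io
from PIL import Image, ImageDraw
from openai import OpenAI

client = OpenAI()

canvas = Image.open("large_canvas.png").convert("RGBA")  # e.g. a 3072x2048 working image
x, y = 1024, 512                                         # top-left of the 1024x1024 tile to edit

tile = canvas.crop((x, y, x + 1024, y + 1024))
mask = tile.copy()
ImageDraw.Draw(mask).rectangle((256, 256, 768, 768), fill=(0, 0, 0, 0))  # transparent = repaint here

def to_png(img):
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()

result = client.images.edit(
    model="dall-e-2",
    image=("tile.png", to_png(tile)),
    mask=("mask.png", to_png(mask)),
    prompt="a stone lantern beside the garden path",
    size="1024x1024",
    response_format="b64_json",
)

patch = Image.open(io.BytesIO(base64.b64decode(result.data[0].b64_json))).convert("RGBA")
canvas.paste(patch, (x, y))  # untouched pixels line up exactly with the original tile
canvas.save("large_canvas.png")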

gpt-image can’t stitch onto your existing canvas; it can only generate all-new, unaligned pictures. But it can compose a new image - if we can pass it clear communication, and if OpenAI weren’t so limited in imagination and feature presentation.

For a while now, I have tried to envision an approach similar to what you are proposing, and it only resulted in me scratching my head.

The only problem I see here is the requirement for a potentially extreme and complex UI/UX - or am I missing something?

A simple UI design that one might give to a “playground” today, to construct even “vision” input properly for chat models and surface what the API can do:

  • A sidebar with thumbnails of the additional images you are including.
  • A prompt input field into which you can drag the images as emoji-sized placement icons, marking where they shall appear in the AI context.

The prompt does not expose the extreme and complex; it just carries the needed communication:

“Compose: Here is a background picture :framed_picture:① where this lady :framed_picture:② is wearing this dress :framed_picture:③ and standing alongside this other lady :framed_picture:④.”
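
The translation behind that UI is mechanical. A hypothetical sketch, using plain digits in place of the circled icon numbers and targeting the documented Chat Completions vision content parts (the expand() helper is my own, not a library function):

# Hypothetical UI-to-context translation: expand ":framed_picture:N" markers in the
# prompt into interleaved text/image content parts (Chat Completions vision format).
import re

def expand(prompt: str, images: dict[int, str]) -> list[dict]:
    """images maps marker number -> data URL or https URL of the thumbnail's source."""
    parts, last = [], 0
    for m in re.finditer(r":framed_picture:(\d+)", prompt):
        if m.start() > last:
            parts.append({"type": "text", "text": prompt[last:m.start()]})
        parts.append({"type": "image_url",
                      "image_url": {"url": images[int(m.group(1))]}})
        last = m.end()
    if last < len(prompt):
        parts.append({"type": "text", "text": prompt[last:]})
    return parts

content = expand(
    "Compose: Here is a background picture :framed_picture:1 where this lady "
    ":framed_picture:2 is wearing this dress :framed_picture:3 and standing "
    "alongside this other lady :framed_picture:4.",
    {1: "data:image/png;base64,<...>", 2: "data:image/png;base64,<...>",
     3: "data:image/png;base64,<...>", 4: "data:image/png;base64,<...>"},
)
# content is passed as the user message's "content" array, preserving the exact order.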
