GPT-4V: the order of multiple image inputs

GPT-4V can process multiple image inputs, but can it tell which order the images came in? Take the following messages as an example.

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
  model="gpt-4-vision-preview",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What's in the first image? What's in the second image?",
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "imageA.jpg",
          },
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "imageB.jpg",
          },
        },
      ],
    }
  ],
  max_tokens=300,
)
print(response.choices[0])

Will imageA be considered as the first image because it appears above imageB in the messages?

The AI doesn’t really have a reference for which image “came first”.

Others have gone as far as putting text into the image so it can be referred to.

Not tried, but something that could be an idea: Multiple images sent as multiple user messages when obtaining a reply. You could insert synthetic text “here’s my first image…” within the messages, and see if the AI is then able to answer based on position.
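A minimal sketch of that idea, untested against the model: each image goes in its own user message with a short synthetic label, and the question comes last. The helper name build_messages and the label wording are made up for illustration, and whether the model actually answers by position is exactly what would need checking.

```python
def build_messages(image_urls, question):
    """Build one user message per image, each with a position label,
    followed by a final user message carrying the question."""
    messages = []
    for index, url in enumerate(image_urls, start=1):
        messages.append({
            "role": "user",
            "content": [
                # Synthetic label so the model has text to anchor the position to.
                {"type": "text", "text": f"Here is image #{index}."},
                {"type": "image_url", "image_url": {"url": url}},
            ],
        })
    messages.append({
        "role": "user",
        "content": [{"type": "text", "text": question}],
    })
    return messages


if __name__ == "__main__":
    # Placeholder URLs, assumed for the example only.
    msgs = build_messages(
        ["https://example.com/imageA.jpg", "https://example.com/imageB.jpg"],
        "What's in image #1? What's in image #2?",
    )
    print(len(msgs))
```

The resulting list can be passed as messages= to client.chat.completions.create in place of the single-message version above.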

I’ve been facing this same problem. I thought maybe we could interleave image inputs with text but the API doesn’t seem to like that.

My content was set up as follows:

PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Here are a few images I have on hand. I'd like you to pick the most appropriate one for a Christmas greeting card I'm sending out on behalf of my family."
            },

            {
                "type": "text",
                "text": "This is image #1"
            },
            {
                "type": "image_url",
                "image_url": image_to_base64(img1)
            },

            {
                "type": "text",
                "text": "This is image #2"
            },
            {
                "type": "image_url",
                "image_url": image_to_base64(img2)
            },
        ],
    },
]

To which I received the “I’m sorry, I cannot assist with these requests.” response that others in the forum have gotten for different reasons.
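The image_to_base64 helper isn't shown in the post above, so its return value is unknown. As a rough sketch only, assuming local image files and the data-URL form described in the vision guide, such a helper might look like this. The names image_to_data_url and image_part are hypothetical, and this is not claimed to explain the refusal.

```python
import base64
import mimetypes


def image_to_data_url(path):
    """Read a local image file and return it as a base64 data URL."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        # Assumption: fall back to JPEG when the extension is unrecognized.
        mime_type = "image/jpeg"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part(path):
    """Wrap a local image as an image_url content part with a "url" field."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

Each image_part(...) result would take the place of one {"type": "image_url", ...} entry in the content list.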

Facing the same issue, any luck finding the proper solution?

The documentation doesn't say anything about how image order is handled when a request contains multiple images:
https://platform.openai.com/docs/guides/vision/multiple-image-inputs

Sending a text message before each image seems to be working fine:

Main text message:

Analyze the attached images and select the best one for a finance site.

Return the results in JSON format using the following interface:
{
  images: {
    // The id of the analyzed image.
    "id": number;
    // Set this to true if it's the best image.
    "best_image": boolean;
    // One sentence feedback on why you chose the image.
    "feedback": string;
  }[];
}

Here is how I build the rest of the content in PHP:

foreach ($images as $id => $url) {
  $content[] = ['type' => 'text', 'text' => "ID for the next image: $id"];
  $content[] = ['type' => 'image_url', 'image_url' => ['url' => $url]];
}
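For anyone using the Python client from the first post, here is a sketch of the same loop, plus a small reader for the JSON reply. The names build_content and pick_best are made up for illustration, and pick_best assumes the model returns bare JSON matching the interface above, which isn't guaranteed.

```python
import json


def build_content(main_text, images):
    """Build the content list: the main prompt, then an ID label before each image.

    images maps an image id to its URL, mirroring the PHP $images array.
    """
    content = [{"type": "text", "text": main_text}]
    for image_id, url in images.items():
        content.append({"type": "text", "text": f"ID for the next image: {image_id}"})
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content


def pick_best(reply_text):
    """Return the id flagged as best_image in the reply, or None if none is flagged.

    Assumes reply_text is plain JSON of the form {"images": [...]}.
    """
    data = json.loads(reply_text)
    for image in data["images"]:
        if image.get("best_image"):
            return image["id"]
    return None
```

The content list goes into a single user message, e.g. messages=[{"role": "user", "content": content}].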