How to identify photos when batching for GPT-4 Vision

I am using batching to send multiple images to gpt-4-vision.
In my prompt, I ask it to rank those images according to some criteria, but I can’t tell which image a given rank refers to.

Asking it to include the URL of each image alongside its rank yields nothing; it seems the model does not have access to the URLs when generating the response.

I am not sure how I can provide some sort of unique identifier for each image that the model can return when responding.
The images are dynamic (user uploaded), so it’s not possible to add a human-readable identifier (like a description).

Any ideas?

Hi and welcome to the Dev Community!

I’ve had a bit of luck by separating the example images from the images I want it to focus on.
You could probably do the same and enumerate each image you send. I’m not sure whether that would work, but it’s worth trying!

Here’s the payload I send:

payload = {
    "model": "gpt-4-vision-preview",
    "messages": [
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": preprompt},
                *[
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image}"}}
                    for image in base64_images
                ],
                {"type": "text", "text": prompt},
                *[
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image}"}}
                    for image in base64_user_images
                ]
            ]
        }
    ],
    "max_tokens": 4000,
    "temperature": 0.3
}

Thanks for the reply and welcome!

I’m not sure exactly what this code does. From what I learned recently, I cannot use anything in the “type: image_url” part of the content to identify the images.

My code is this one:

photo_contents = [
    {
        "type": "image_url",
        "image_url": {
            "url": photo.url,
        },
    }
    for photo in photos
]
json_response = chat.invoke(
    [
        HumanMessage(
            content=[
                {"type": "text", "text": pick_photos_prompt(user_description)},
                *photo_contents
            ]
        )
    ]
)

Now, each URL here is unique. If the model were able to tell me “I ranked X for the URL Y”, I would be able to work with it, but it seems the model doesn’t have access to the actual URLs, so it can’t include them in its responses.
It definitely receives the images and is able to see them, because if I change the prompt to “what’s in each image?” I get an answer. It just cannot return the link that each answer refers to.

So what I’m saying is that the positions of the images and the text prompts within the payload do matter, and you can do something like this:

"role": "user",
"content": [
    # Flatten the (image, label) pairs so each image is followed by its number
    *[
        part
        for index, photo in enumerate(photos)
        for part in (
            {"type": "image_url", "image_url": {"url": photo}},
            {"type": "text", "text": f"Image Number: {index + 1}"},
        )
    ],
    {"type": "text", "text": prompt}
]

This way each photo has an image number associated with it automatically.
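
To get from the image numbers back to the original URLs, you could keep a lookup table on your side. Here is a minimal sketch of that idea, not a tested recipe: the helper names `build_numbered_content` and `resolve_ranking` are made up for illustration, and it assumes your prompt asks the model to reply with only a JSON array of image numbers, best first (the model may not always comply, so real code would need error handling around the parsing).

```python
import json


def build_numbered_content(photo_urls, prompt):
    """Follow each image with an "Image Number: N" label; return (content, id_to_url)."""
    content = []
    id_to_url = {}
    for number, url in enumerate(photo_urls, start=1):
        id_to_url[number] = url
        content.append({"type": "image_url", "image_url": {"url": url}})
        content.append({"type": "text", "text": f"Image Number: {number}"})
    # The actual instructions go last, after all labelled images.
    content.append({"type": "text", "text": prompt})
    return content, id_to_url


def resolve_ranking(model_reply, id_to_url):
    """Map a reply such as "[2, 1]" back to URLs (assumed reply format)."""
    ranked_numbers = json.loads(model_reply)
    # Silently skip any number the model invented that we never sent.
    return [id_to_url[n] for n in ranked_numbers if n in id_to_url]


if __name__ == "__main__":
    # Placeholder URLs, purely for illustration.
    urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
    content, id_to_url = build_numbered_content(
        urls, "Rank the images. Reply with only a JSON array of image numbers, best first."
    )
    # `content` would go into the user message; "[2, 1]" stands in for the model's reply.
    print(resolve_ranking("[2, 1]", id_to_url))
```

Since the model only ever sees the numbers, the URLs never need to appear in the prompt at all; the mapping stays entirely in your own code.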