How to identify photos when batching for gpt 4 vision

ranst91 · March 18, 2024, 12:19pm

I am using batching to send multiple images to gpt-4-vision.
In my prompt, I am requesting it to rank those images according to some criteria, however, I can’t tell which image a given rank is referring to.

Asking it to include the URL of each image alongside its rank yields nothing, as it seems the model does not have access to the URLs when generating the response.

I am not sure how I can provide some sort of unique identifier for each image for the model to return when responding.
The images are dynamic (user uploaded), so it’s not possible to add a human-readable identifier (like a description).

Any ideas?

trenton.dambrowitz · March 18, 2024, 12:45pm

Hi and welcome to the Dev Community!

I’ve had a bit of luck by splitting up the example images from the images I want it to focus on.
You could probably do the same and enumerate each image you send. I’m not sure it will work, but it’s worth trying!

Here’s the payload I send:

payload = {
    "model": "gpt-4-vision-preview",
    "messages": [
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": [
                # Text introducing the example images, followed by those images
                {"type": "text", "text": preprompt},
                *[
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image}"}}
                    for image in base64_images
                ],
                # The actual instruction, followed by the images to focus on
                {"type": "text", "text": prompt},
                *[
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image}"}}
                    for image in base64_user_images
                ]
            ]
        }
    ],
    "max_tokens": 4000,
    "temperature": 0.3
}

ranst91 · March 18, 2024, 2:17pm

Thanks for the reply and welcome!

I’m not sure exactly what’s done in this code. I mean, from what I learned recently, I cannot use anything that’s in the content “type: image_url” part to identify the images.

My code is this one:

photo_contents = [
    {
        "type": "image_url",
        "image_url": {
            "url": photo.url,
        },
    }
    for photo in photos
]
json_response = chat.invoke(
    [
        HumanMessage(
            content=[
                {"type": "text", "text": pick_photos_prompt(user_description)},
                *photo_contents
            ]
        )
    ]
)

Now, each URL here is unique. If the model were able to tell me “I ranked X for the URL Y”, I would be able to work with it, but it seems that the model doesn’t have access to the actual URLs when writing its response.
It definitely can see the images, because if I change the prompt to “what’s in each image?” I get an answer, but it still cannot return the link for each image it refers to.

trenton.dambrowitz · March 18, 2024, 2:34pm

So what I’m saying is that the position of the images and text parts within the payload does matter, and you can do something like this:

"role": "user",
"content": [
    # Flatten (image, label) pairs so every image is followed by its number
    *[
        part
        for index, photo in enumerate(photos)
        for part in (
            {"type": "image_url", "image_url": {"url": photo}},
            {"type": "text", "text": f"Image Number: {index + 1}"},
        )
    ],
    {"type": "text", "text": prompt}
]

This way each photo has an image number associated with it automatically.
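
To get from those image numbers back to the original URLs, something along these lines might work. It is only a sketch: the helper names `build_labeled_content` and `resolve_ranking` are made up for illustration, and it assumes the prompt asks the model to answer with a JSON list such as `[{"image_number": 2, "rank": 1}]`, which the model is not guaranteed to follow.

```python
import json
from typing import Any


def build_labeled_content(
    photo_urls: list[str], prompt: str
) -> tuple[list[dict[str, Any]], dict[int, str]]:
    """Return the user-message content plus a lookup from image number to URL."""
    content: list[dict[str, Any]] = []
    number_to_url: dict[int, str] = {}
    for number, url in enumerate(photo_urls, start=1):
        number_to_url[number] = url
        # Each image is followed by a text part carrying its number.
        content.append({"type": "image_url", "image_url": {"url": url}})
        content.append({"type": "text", "text": f"Image Number: {number}"})
    content.append({"type": "text", "text": prompt})
    return content, number_to_url


def resolve_ranking(
    model_reply: str, number_to_url: dict[int, str]
) -> list[tuple[int, str]]:
    """Translate the (assumed) JSON reply into (rank, url) pairs sorted by rank."""
    items = json.loads(model_reply)
    pairs = [
        (item["rank"], number_to_url[item["image_number"]])
        for item in items
        # Skip numbers the model may have invented.
        if item["image_number"] in number_to_url
    ]
    return sorted(pairs)
```

The `content` list would go into the user message (or a `HumanMessage`), and the lookup stays on your side, so the model never needs to see or repeat the URLs.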
