I give 5 images to GPT-4 Vision and need it to identify 2 similar images. How?

I have a set of images.
After the identification from Vision, I want to separate out the images that belong together.

How can I do that? As far as I know, the model does not accept any metadata or names for the images, so how can I achieve this?

Welcome to the dev forum, @zainsheikh!

You can try prompting the model to return a JSON object with the indices of the matches.

Also, be aware that there’s a built-in safety mechanism which prevents abuse of the model for bypassing CAPTCHAs.
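For illustration, here is a minimal sketch of that approach with the openai Python SDK (v1.x). The image URLs and the "matches" key are made-up placeholders, and the prompt itself asks for a JSON-only reply rather than relying on JSON mode:

```python
# Minimal sketch, assuming the openai Python SDK v1.x. The image URLs and the "matches"
# key are placeholders; the indices refer to the order the images appear in the request.
from openai import OpenAI

client = OpenAI()

image_urls = [
    "https://example.com/shirt_front.jpg",
    "https://example.com/shirt_back.jpg",
    "https://example.com/mug.jpg",
    "https://example.com/hat.jpg",
    "https://example.com/poster.jpg",
]

content = [{
    "type": "text",
    "text": (
        "You are given 5 images, numbered 0 to 4 in the order they appear. "
        "Identify which images show the same product and reply with only a "
        'JSON object like {"matches": [[0, 1]]} using those indices.'
    ),
}]
content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{"role": "user", "content": content}],
    max_tokens=300,
)
print(response.choices[0].message.content)
```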


I will try that. But how would I know which image relates to which entry in the JSON object?

Similar in what way? Could you not just create embeddings of the images and then compare them that way?

E.g., there are 5 images in total, and 2 of them relate to the same product, say the front and back of the same piece of clothing. The model can reply that the first two images are similar, but how would I know which images it is referring to? Are you getting my point?

Yes. You can accomplish the same effect by simply embedding the images yourself.
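For example, a rough sketch of that route with a local CLIP model via sentence-transformers (the OpenAI embeddings endpoint is text-only, so the model choice and file names here are assumptions):

```python
# Embed each image with CLIP and compare every pair by cosine similarity.
# File names are hypothetical placeholders.
from itertools import combinations

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")
paths = ["front.jpg", "back.jpg", "mug.jpg", "hat.jpg", "poster.jpg"]

embeddings = model.encode([Image.open(p) for p in paths], convert_to_tensor=True)

for i, j in combinations(range(len(paths)), 2):
    score = util.cos_sim(embeddings[i], embeddings[j]).item()
    print(f"{paths[i]} <-> {paths[j]}: {score:.3f}")
```

The pair whose similarity stands out from the rest is your candidate for “front and back of the same product”.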

But if you must use GPT-4V, you could fit each product image into a 512x512 px tile and then apply a label to each one for GPT to refer to.
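Roughly like this, using PIL to pad each image to 512x512 and stamp its index in a corner (file names are placeholders):

```python
# Rough sketch of the labelling step: fit each product image into a 512x512 tile and
# stamp its index in the corner so GPT-4V can refer to "image 2" unambiguously.
from PIL import Image, ImageDraw, ImageOps

paths = ["front.jpg", "back.jpg", "mug.jpg", "hat.jpg", "poster.jpg"]  # placeholders

for idx, path in enumerate(paths):
    img = ImageOps.pad(Image.open(path).convert("RGB"), (512, 512), color="white")
    draw = ImageDraw.Draw(img)
    draw.rectangle([0, 0, 64, 40], fill="white")   # clear a corner for the label
    draw.text((12, 10), str(idx), fill="black")    # default PIL bitmap font
    img.save(f"labeled_{idx}.jpg")
```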


Just one thing: what do you mean by applying a label? Giving metadata for vector storage?

You can get the AI to identify the images by their index in the request you provided. In my experience, this works well for up to 5-6 images, but beyond that it bugs out and messes up the order or skips some indices. Just ask in the prompt for it to tell you which images are similar, by their indices. There is no way to provide an ID with each image as of today, which is a shame in my opinion, and I hope that feature gets added soon.

You already have the answer from the response:
"the first two images are similar"

But if you want to extract the specific indices, assuming you have access to the array of images you sent to GPT-4V, run the result through another Chat Completions API call with the response format set to JSON mode, using a system prompt like this:

You are a helpful assistant designed to output JSON.  
You will help extract the indices of items in an array based on the ordinal numbers mentioned in a text. 
# example
If you provide me with a text like this: 
"There are 5 images in total. The first two images are similar.", 
you will understand that it refers to the indices 0 and 1 in an array (since we start counting from zero).
# output JSON format
{ "images": [0, 1] }

Sample GPT-4V:

Running the post processing…
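Something along these lines, assuming the openai Python SDK v1.x and a JSON-mode-capable model such as gpt-3.5-turbo-1106; the GPT-4V answer below is just an illustrative stand-in:

```python
# Post-processing sketch: feed the GPT-4V answer to a second, JSON-mode chat completion
# that maps the ordinals ("second", "fifth") back to array indices.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a helpful assistant designed to output JSON.\n"
    "You will help extract the indices of items in an array based on the ordinal "
    "numbers mentioned in a text.\n"
    "# example\n"
    'If you provide me with a text like this: "There are 5 images in total. The first '
    'two images are similar.", you will understand that it refers to the indices 0 and 1 '
    "in an array (since we start counting from zero).\n"
    "# output JSON format\n"
    '{ "images": [0, 1] }'
)

# Hypothetical GPT-4V output for the five product images
vision_answer = "There are 5 images in total. The second and fifth images are similar."

completion = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": vision_answer},
    ],
)

indices = json.loads(completion.choices[0].message.content)["images"]
print(indices)  # e.g. [1, 4]
```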

You’ll end up with:

{ "images": [1, 4] }

This may work well for 4-5 images, or some other small number. But what if you want to pass 400 images? That wouldn't be reliable at all and would be prone to hallucinations, no?

I tested your question to see if it can indeed handle such a case. Let's say I got the following response from GPT-4V:

A blue dot is found in the following images, from 102nd to 128th, 146th, 138th and from 251st to 286th images out of 400 images submitted.

Using the conversion prompt, we get:

{ "images": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 145, 137, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286] }

Please note that the ordinal numbers are one greater than the array indices, because array indices start at 0.
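If you would rather not trust the model with the arithmetic at that scale, a small parser can expand the ordinals and ranges deterministically. A rough sketch, covering only the "Nth" and "from Nth to Mth" phrasings used in the response above:

```python
# Deterministically convert ordinals and ordinal ranges from the GPT-4V answer to 0-based indices.
import re

text = ("A blue dot is found in the following images, from 102nd to 128th, 146th, 138th "
        "and from 251st to 286th images out of 400 images submitted.")

indices = []

# Ranges like "from 102nd to 128th" -> indices 101..127 inclusive
for start, end in re.findall(r"from (\d+)(?:st|nd|rd|th) to (\d+)(?:st|nd|rd|th)", text):
    indices.extend(range(int(start) - 1, int(end)))

# Standalone ordinals ("146th", "138th") that are not part of a range
remainder = re.sub(r"from \d+(?:st|nd|rd|th) to \d+(?:st|nd|rd|th)", "", text)
indices.extend(int(n) - 1 for n in re.findall(r"(\d+)(?:st|nd|rd|th)", remainder))

print(sorted(set(indices)))
```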


Very interesting!

I’m curious whether you verified that the returned indices were correct and not hallucinated.
I don’t know the specific images you’re testing with, but even if it were ~80% accurate, that would still be quite useful in my particular case.
