Frame unique identification

Hi fellow explorers,

I’m doing the vision examples from here https://platform.openai.com/docs/guides/vision

I’m interested in getting back the exact frame that matches my prompt (just for my own interest, to check how OpenAI is doing against visual inspection). I originally asked for the frame number (it hallucinated the number), then I asked for the array index (hallucination), then I asked for the image_url itself to be returned, and it hallucinated an imgur url (404 error, nothing there, obvs) even though the image source was a base64 data url.

I’m kinda out of ideas. Does anyone know a way to match the frame that OpenAI decided fits your prompt and link it back to a frame locally? As well as requesting the index (above), I asked OpenAI for a description of the frame, and it was certainly choosing the right frame to match the prompt, just not returning the correct frame number or url.

Cheers for the help
Jenko

Edit: I just hosted the video frames online to see if that would help (a real url rather than a data: url). Even with real urls for each frame, it still hallucinated a random imgur link (even though the frames were hosted on our company website).

Are you using the video sample from the cookbook? I’m curious how you are formatting the frames in the messages. The cookbook just adds the image data as is, whereas the doc suggests using a specific format for text and images.

Yeah, I’m using the cookbook. I’ve tried it now with both a url and a data: base64 link… it correctly describes the person but hallucinates the url. I can’t post a link here, but it’s an imgur url with random characters and a 404 response if you navigate to it.

I also tried putting the frame number in the EXIF data for the frame but got hallucinations back for that too. I’m scratching my head about how ChatGPT can communicate to me which frame it found the person in.

messages = [
    {"role": "system", "content": """Please return the exact value of the image_url field (do not hallucinate), and a description of the frame, of the first image that clearly shows each new conversation participant's face (so, the first time someone enters the conversation and is clearly visible). Do this for each new person identified."""},
    {"role": "user", "content": [
        # Content parts in the list form must be typed objects, not bare strings.
        {"type": "text", "text": "These are the frames from the video."},
        *({"type": "image_url",
           "image_url": {"url": url, "detail": "low"}} for url in urls),
        # Alternative when sending base64-encoded frames instead of hosted urls:
        # *({"type": "image_url",
        #    "image_url": {"url": f"data:image/jpg;base64,{frame}", "detail": "low"}}
        #   for frame in base64Frames),
        {"type": "text", "text": f"The audio transcription is: {transcription.text}"},
    ]},
]
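
For completeness, a minimal sketch of sending that message list with the current OpenAI Python SDK (the model name and max_tokens value are placeholders, not from the post above):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",     # any vision-capable model
    messages=messages,  # the list built above
    max_tokens=500,
)
print(response.choices[0].message.content)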

FYI, I added “(if you do not know the url that was used to load the image do not hallucinate a url, say “unknown”)” to the prompt, and it replaced the hallucinations with “unknown”.

I suspect it does not have any memory of where the image was loaded from.

I don’t think ChatGPT can read EXIF data (can someone else confirm?)

OK, for anyone else who hits this issue: you can solve the problem by writing any metadata directly onto the image (I chose the top-right corner of the frame) using cv2.putText. I just tested it, and GPT-4o quite successfully read the metadata from the corner of the identified frame.
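
Roughly, the idea looks like this (a minimal sketch — the input file name, label format, and text styling are illustrative placeholders, not the exact values from the post):

import base64

import cv2

video = cv2.VideoCapture("meeting.mp4")  # placeholder input video
base64Frames = []
index = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    # Burn a human-readable frame label into the top-right corner, so the
    # model can simply read it back from the image itself.
    label = f"frame:{index}"
    x = frame.shape[1] - 220  # rough offset from the right edge
    cv2.putText(frame, label, (x, 40), cv2.FONT_HERSHEY_SIMPLEX,
                1.0, (0, 255, 0), 2, cv2.LINE_AA)
    ok, buffer = cv2.imencode(".jpg", frame)
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
    index += 1
video.release()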

Interspersing text context and image context in the user message may be another technique for labeling and numbering. How images are tokenized and received is not documented well enough to judge whether this would be effective.
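
A rough sketch of that interleaving, reusing the urls list from the snippet above (whether the model reliably ties each label to the image that follows it is, as noted, undocumented):

content = [{"type": "text", "text": "These are the frames from the video."}]
for i, url in enumerate(urls):
    # A text part naming the frame, immediately followed by the image itself.
    content.append({"type": "text", "text": f"Frame {i}:"})
    content.append({"type": "image_url",
                    "image_url": {"url": url, "detail": "low"}})

messages = [{"role": "user", "content": content}]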

It’s more about picturing the attention and understanding the AI has: if you ask it to produce a number token for the best-looking cat, you have a token predictor that is not strongly grounded in the contents unless that numbering is actually present in the input.

You can have the AI describe all the images in a numbered list first, and then that pretext will allow more language focus on the answering.
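
A sketch of that describe-first pattern (the prompt wording here is illustrative only):

messages = [
    {"role": "system", "content": (
        "First, describe every image in a numbered list (Frame 1, Frame 2, "
        "...), in the order received. Then, using only those frame numbers, "
        "state which frame first clearly shows each new conversation "
        "participant's face."
    )},
    {"role": "user", "content": content},  # e.g. the interleaved parts from the sketch above
]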