Hello, excellent humans!
I am passing multimodal input (text and images) for inference, and I need to correlate them with timestamps: they represent spoken dialog and keyframes from a vocational training video that I am having the AI analyze and summarize, with summary fields pointing back to the applicable timestamps as back-correlation tags.
I found that if I extended the "image_url" structure (which typically has just "url" and "detail" members) with a "timestamp" member, the data was not made available to the inference engine, and it was left to "guess" what timestamp was attached to each image. My guess is that the image processing API strips this structure out entirely and replaces it with a vectorized version of the image. I do not see any explicit support in the vision API for passing image metadata to the inference engine.
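For concreteness, here is the shape of the extended block I was sending. The "timestamp" member is my own addition, not part of the documented image_url schema, which is presumably why it never reaches the model (the URL below is a placeholder):

```python
# The documented image_url block has only "url" and "detail" members.
# "timestamp" here is my own (unsupported) extension, which the API
# appears to strip before the image reaches the model.
extended_image_block = {
    "type": "image_url",
    "image_url": {
        "url": "https://example.com/keyframe_001.png",  # placeholder URL
        "detail": "high",
        "timestamp": "00:00:16,000",  # SRT-style timestamp; silently ignored
    },
}
```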
I tried replicating the (extended) image_url structures in the text input context (renaming the type to "image_url_metadata"), but I get the following error:
openai.BadRequestError: Error code: 400 - {'error': {'message': "You uploaded an unsupported image. Please make sure your image has of one the following formats: ['png', 'jpeg', 'gif', 'webp'].", 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_image_format'}}
I know for sure the image format is correct because the image inference was working fine without the timestamps added to the equation.
Do you have any bright ideas on how I can convey this image metadata to GPT-4o?
Thanks again to all you wonderful humans out there as you are much better at answering these questions than the standard AI based support bot ever will be.
Sending love and respect.
system_prompt = """Describe the provided images in considerable detail. Respond in JSON format with a detailed description of each image, including any objects, actions, or context that you observe.
Image timestamps are provided as metadata in a text block and must be correlated against all images sent via image_url blocks for inference. The metadata is supplied in the user text block as the following JSON structure, which is identical to the typical image_url input format but adds a timestamp member:
[{"type": "image_url_metadata", "image_url": {"url": "# URL of image to analyze", "detail": "# low, high or auto", "timestamp": "# SRT-formatted timestamp"}}]
Use the following output format: [{\"type\": \"text\", \"text\": \"Your detailed description here.\", \"timestamp\": \"# SRT format of the described timestamp\"}]"""
user_prompt = [
    {"type": "text", "text": json.dumps([
        {"type": "image_url_metadata", "image_url": {
            "url": "https://www.dropbox.com/scl/fi/ef549ugcuo58yifo4c1h0/keyframe_1RCTCOP-A-BRIEF-DISCUSSION-WHY-WE-CHOOSE-THIS-BELT-DESIGN.-FOR-RACHEL-2023-02-11-001_00-3A00-3A16-2C000.png?rlkey=mxn0f2kk6hty2w73wnxn7gst6&st=s8ewl5qd&dl=0",
            "detail": "high",
            "timestamp": "00:00:16,000"}},
        {"type": "image_url_metadata", "image_url": {
            "url": "https://www.dropbox.com/scl/fi/v0xi2cdi3ivy2ceu4cq6g/keyframe_1RCTCOP-A-BRIEF-DISCUSSION-WHY-WE-CHOOSE-THIS-BELT-DESIGN.-FOR-RACHEL-2023-02-11-001_00-3A01-3A47-2C000.png?rlkey=q4ukmxzuywdky8rdi2wjp7lwj&st=hgmed6gc&dl=0",
            "detail": "high",
            "timestamp": "00:01:47,000"}}])},
    {"type": "image_url", "image_url": {
        "url": "https://www.dropbox.com/scl/fi/ef549ugcuo58yifo4c1h0/keyframe_1RCTCOP-A-BRIEF-DISCUSSION-WHY-WE-CHOOSE-THIS-BELT-DESIGN.-FOR-RACHEL-2023-02-11-001_00-3A00-3A16-2C000.png?rlkey=mxn0f2kk6hty2w73wnxn7gst6&st=s8ewl5qd&dl=0",
        "detail": "high"}},
    {"type": "image_url", "image_url": {
        "url": "https://www.dropbox.com/scl/fi/v0xi2cdi3ivy2ceu4cq6g/keyframe_1RCTCOP-A-BRIEF-DISCUSSION-WHY-WE-CHOOSE-THIS-BELT-DESIGN.-FOR-RACHEL-2023-02-11-001_00-3A01-3A47-2C000.png?rlkey=q4ukmxzuywdky8rdi2wjp7lwj&raw=1",
        "detail": "high"}}]
inference_result = call_gpt4(system_prompt, user_prompt)
# The requested output format is a JSON list of blocks, so iterate
# rather than indexing the result directly.
for block in inference_result:
    print(block['text'])
    print(block['timestamp'])
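In case it helps anyone suggest a fix: the fallback I am considering is to interleave a plain text block carrying each timestamp directly before its image_url block in the same content array, so the metadata travels through the text channel the model definitely sees. Pairing each timestamp with the image that follows it by adjacency is my assumption, not documented behavior, and the URLs below are placeholders:

```python
def interleave_frames(frames):
    """Build a content array that pairs each SRT timestamp (as a text
    block) with the image_url block that immediately follows it."""
    content = []
    for frame in frames:
        content.append({
            "type": "text",
            "text": f"The next image is the keyframe at timestamp {frame['timestamp']} (SRT format).",
        })
        content.append({
            "type": "image_url",
            "image_url": {"url": frame["url"], "detail": "high"},
        })
    return content

# Example with placeholder URLs standing in for the real keyframes:
frames = [
    {"url": "https://example.com/keyframe_016.png", "timestamp": "00:00:16,000"},
    {"url": "https://example.com/keyframe_107.png", "timestamp": "00:01:47,000"},
]
user_prompt = interleave_frames(frames)
```

Because the timestamp rides in an ordinary text block, nothing gets stripped by the image-processing path, and the system prompt can simply tell the model that each text block describes the image that follows it.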