How to pass image metadata to GPT-4o when doing multimodal inference?

Hello, excellent humans!

I am passing in multimodal input (text and images) for inference, and I need to correlate the two with timestamps: the inputs represent spoken dialog and keyframes from a vocational training video that I am having the AI analyze and summarize, with summary fields pointing back to the applicable timestamps as back-correlation tags.

I found that if I extended the “image_url” structure (which normally has just “url” and “detail” members) with a “timestamp” member, the extra data was not made available to the inference engine, and it was left to “guess” which timestamp was attached to each image. My guess is that the image-processing pipeline strips this structure out entirely and replaces it with a vectorized version of the image. I do not see any explicit support in the vision API for passing image metadata to the inference engine.
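For clarity, here is a minimal sketch of the two shapes involved (the URL is a placeholder, and the "timestamp" member is my own unsupported addition):

# Standard image_url content block, as documented (url + detail only):
standard_block = {
    "type": "image_url",
    "image_url": {"url": "https://example.com/keyframe.png", "detail": "high"},
}

# My extended block -- the extra "timestamp" member is silently ignored
# rather than being surfaced to the model:
extended_block = {
    "type": "image_url",
    "image_url": {
        "url": "https://example.com/keyframe.png",
        "detail": "high",
        "timestamp": "00:00:16,000",  # not part of the documented schema
    },
}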

I then tried replicating the (extended) image_url structures inside the text input context (renaming the type to “image_url_metadata”), but I get the following error:

openai.BadRequestError: Error code: 400 - {'error': {'message': "You uploaded an unsupported image. Please make sure your image has of one the following formats: ['png', 'jpeg', 'gif', 'webp'].", 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_image_format'}}

I know the image format is correct, because image inference was working fine before I added the timestamps to the equation.

Do you have any bright ideas on how I can convey this image metadata to GPT-4o?

Thanks again to all you wonderful humans out there, as you are much better at answering these questions than the standard AI-based support bot ever will be.

Sending love and respect.

    system_prompt = """Describe the provided images in considerable detail.  Respond in JSON format with a detailed description of the image, including any objects, actions, or context that you observe.
    
    Image timestamps are provided as metadata in a text block which must be correlated against all images sent via image_url blocks for inference.  The following JSON structured form which extends the typical image_url input message schema and thus is provided in the text user context block and is identical to the typical image_url format but adds the timestamp image metadata:

    [{"type": "image_url_metadata", "image_url": {"url": "# URL of image to analyze" , "detail": "# low, high or auto ", "timestamp": "# srt formatted timestamp"}}]

    
    Use the following output format: [{\"type\": \"text\", \"text\": \"Your detailed description here.\", \"timestamp\: \"# SRT format of the described timestamp\"}]"""


    user_prompt = [
        {"type": "text", "text": json.dumps([
            {"type": "image_url_metadata", "image_url": {
                "url": "https://www.dropbox.com/scl/fi/ef549ugcuo58yifo4c1h0/keyframe_1RCTCOP-A-BRIEF-DISCUSSION-WHY-WE-CHOOSE-THIS-BELT-DESIGN.-FOR-RACHEL-2023-02-11-001_00-3A00-3A16-2C000.png?rlkey=mxn0f2kk6hty2w73wnxn7gst6&st=s8ewl5qd&dl=0",
                "detail": "high",
                "timestamp": "00:00:16,000"}},
            {"type": "image_url_metadata", "image_url": {
                "url": "https://www.dropbox.com/scl/fi/v0xi2cdi3ivy2ceu4cq6g/keyframe_1RCTCOP-A-BRIEF-DISCUSSION-WHY-WE-CHOOSE-THIS-BELT-DESIGN.-FOR-RACHEL-2023-02-11-001_00-3A01-3A47-2C000.png?rlkey=q4ukmxzuywdky8rdi2wjp7lwj&st=hgmed6gc&dl=0",
                "detail": "high",
                "timestamp": "00:01:47,000"}}])},
        {"type": "image_url", "image_url": {
            "url": "https://www.dropbox.com/scl/fi/ef549ugcuo58yifo4c1h0/keyframe_1RCTCOP-A-BRIEF-DISCUSSION-WHY-WE-CHOOSE-THIS-BELT-DESIGN.-FOR-RACHEL-2023-02-11-001_00-3A00-3A16-2C000.png?rlkey=mxn0f2kk6hty2w73wnxn7gst6&st=s8ewl5qd&dl=0",
            "detail": "high"}},
        {"type": "image_url", "image_url": {
            "url": "https://www.dropbox.com/scl/fi/v0xi2cdi3ivy2ceu4cq6g/keyframe_1RCTCOP-A-BRIEF-DISCUSSION-WHY-WE-CHOOSE-THIS-BELT-DESIGN.-FOR-RACHEL-2023-02-11-001_00-3A01-3A47-2C000.png?rlkey=q4ukmxzuywdky8rdi2wjp7lwj&raw=1",
            "detail": "high"}}]

    inference_result = call_gpt4(system_prompt, user_prompt)
    # The requested output format is a JSON list, so iterate over its items
    for item in inference_result:
        print(item['text'])
        print(item['timestamp'])

I’ll also enclose my call_gpt4 code here (work in progress) for completeness:

import json
import logging

from openai import APIConnectionError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Assumes client, MODEL_NAME, EXPECTED_OUTPUT_TOKENS, count_tokens and
# is_valid_json are defined elsewhere in the module.
@retry(stop=stop_after_attempt(5), wait=wait_exponential(min=1, max=10), retry=retry_if_exception_type(APIConnectionError))
def call_gpt4(system_prompt, user_prompt, max_tokens=EXPECTED_OUTPUT_TOKENS):
    """Call the GPT-4 API with system and user prompts, handling token limits by breaking output into chunks."""
    logging.info(f"Calling {MODEL_NAME} with prompt containing {count_tokens(system_prompt) + count_tokens(user_prompt)} tokens.")
    accumulated_response = []
    prompt_history = []
    processing_complete = False

    def get_current_prompt():
        """Prepare the current prompt with system and user messages, including previous responses."""
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
        if prompt_history:
            messages.extend(prompt_history)
        return messages

    def add_to_prompt_history(agent_response):
        """Add the assistant's response to the prompt history."""
        prompt_history.append({"role": "assistant", "content": agent_response})
        prompt_history.append({"role": "user", "content": "Continue processing the request from where you last left off."})

    # Start with the initial call to GPT-4
    while not processing_complete:
        # Construct the request payload
        request_data = {
            "model": MODEL_NAME,
            "messages": get_current_prompt(),
            "max_tokens": max_tokens,
            "temperature": 0.2,
            "frequency_penalty": 0,
            "presence_penalty": 0
        }

        # Make the API call
        response = client.chat.completions.create(**request_data)

        # Extract the assistant's response
        agent_response = response.choices[0].message.content

        # Isolate the json object assuming the following return form:
        # ```json
        # {json_object}```

        # Remove all occurrences of "```"
        agent_response = agent_response.replace("```", "")

        # Remove the first line if it contains 'json'
        split_agent_response = agent_response.splitlines()
        agent_response = '\n'.join(split_agent_response[1:]) \
            if 'json' in split_agent_response[0] else agent_response

        finish_reason = response.choices[0].finish_reason

        # Handle chunked response
        accumulated_response.append(agent_response)

        # Determine if the output is complete
        if finish_reason == 'stop':
            # If the finish reason is 'stop', the model output is fully generated
            if not is_valid_json(''.join(accumulated_response)):
                raise ValueError(f"Accumulated video summary inference response is not a valid JSON object : "
                                 f"{''.join(accumulated_response)}")
            processing_complete = True
        elif finish_reason == 'length':
            # If output length exceeded max tokens, continue requesting more
            add_to_prompt_history(agent_response)
        else:
            # Handle other finish reasons such as 'tool_calls', etc. if needed
            raise ValueError(f"Unexpected finish reason: {finish_reason}")

    # Join and return the accumulated output
    return json.loads(''.join(accumulated_response))

In your user message "content", you can simply alternate between "type": "text" and "type": "image_url" blocks, giving the timestamp information consistently before or after each image.

You cannot make up new fields.
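For example, something along these lines (the URLs here are placeholders and the timestamps are the SRT values you already have), with a text block immediately before each image; putting the text immediately after each image works just as well:

user_prompt = [
    # Timestamp metadata as a plain text block, paired with the image that follows it
    {"type": "text", "text": "Keyframe timestamp: 00:00:16,000"},
    {"type": "image_url", "image_url": {"url": "https://example.com/keyframe_016.png", "detail": "high"}},

    {"type": "text", "text": "Keyframe timestamp: 00:01:47,000"},
    {"type": "image_url", "image_url": {"url": "https://example.com/keyframe_147.png", "detail": "high"}},
]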


Beautiful, that works! Many thanks! I alternated image_url blocks followed by text blocks, provided a correlation field in both, and the AI reported the correct metadata for the correct image.