Returning an image as the result of a function call to gpt-4-turbo

I’m trying to implement a function that allows the AI to request the current image from the camera, so it has more context to answer the user’s request.
Previously, to perform that function, I had to send an extra request to a vision model to generate text describing that frame, and then return that text as the function result to gpt-4 to generate the final response.
That worked as expected; the AI never described something that didn’t look like the object captured by the camera.
With gpt-4-turbo I’m trying to avoid that extra request and just return the image instead of a text description.
So, I have a “look” function described as “Look at what user sees”; it is called after the user’s request “What do you see here?”. When I return a text description, the user gets a valid response based on that description.
But when I try to return the image, the answer looks hallucinated. It doesn’t even come close to describing what’s actually in the picture.

Here’s a simplified version of the request my app sends:

{"temperature":1,
"messages":[
    {"content":"What do you see here?","role":"user"},
    {"content":[],"role":"assistant","tool_calls":[{"id":"call_8SeUX71SbopVdaH0VkHXMN9C","type":"function","function":{"name":"look","arguments":""}}]},
    {"content":[{"type":"image_url","image_url":{"url":<base64 encoded image url>,"detail":"low"}}],"tool_call_id":"call_8SeUX71SbopVdaH0VkHXMN9C","role":"tool","name":"look"}
],"model":"gpt-4-turbo",
"tools":[{"type":"function","function":{"name":"look","parameters":{"type":"object","properties":{},"required":[]},"description":"Look at what user sees"}}]}

Here’s one of the answers I got:

{
  "id": <id>,
  "object": "chat.completion",
  "created": 1712838282,
  "model": "gpt-4-turbo-2024-04-09",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The image displays a green-colored parrot sitting on a wooden perch. The parrot has a predominantly green plumage with hints of red on its wings and tail. It appears to be in a cage or an enclosed area with a blurred background that includes other similar perches and possibly more birds."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 73,
    "completion_tokens": 60,
    "total_tokens": 133
  },
  "system_fingerprint": <system_fingerprint>
}

Here’s the unescaped version of the image_url from that request:

https://paste.mozilla.org/6sz6fJ4B#L1

And here’s the image itself:


Hi @dmitry.d

I’d recommend not sending the function’s JSON schema with the tools param when you are sending the tool_call result back to the model. This will also help you save tokens.

Make sure that the image has a supported format.

You can either send the image as base64 encoded or send the URL where the image is hosted.
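
For reference, the two forms of an image_url content block would look roughly like this (the hosted URL and the local file path are placeholders, not values from this thread):

import base64

# Hosted image: pass the public URL directly (placeholder URL)
hosted_block = {
    "type": "image_url",
    "image_url": {"url": "https://example.com/camera-frame.jpg", "detail": "low"},
}

# Base64-encoded image: wrap the raw bytes in a data URL (placeholder path)
with open("frame.jpg", "rb") as f:
    b64_frame = base64.b64encode(f.read()).decode("utf-8")

base64_block = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{b64_frame}", "detail": "low"},
}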

The image format is definitely OK.

When I send it in the user’s message, not as a function result, I get a valid response.
For this request:

{"messages":[
    {"content":[
        {"type":"text","text":"What do you see here?"},
        {"type":"image_url","image_url":{"url"<base64 encoded image url>:,"detail":"low"}}
    ],
    "role":"user"}
],
"temperature":1,"model":"gpt-4-turbo",
"tools":[{"type":"function","function":{"name":"look","parameters":{"type":"object","properties":{},"required":[]},"description":"Look at what user sees"}}]}

I’m getting

{
    "id": <id>,
    "object": "chat.completion",
    "created": 1712901959,
    "model": "gpt-4-turbo-2024-04-09",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "I see a man with a playful expression leaning out from the driver's side window of a vehicle. He has a beard and is wearing a cap. The vehicle looks a bit old and worn, as evidenced by the rust and peeling paint on the frame of the window. His expression and pose suggest a casual, friendly moment, possibly during a break or a lighthearted interaction."
        },
        "logprobs": null,
        "finish_reason": "stop"
      }
    ],
    "usage": {
      "prompt_tokens": 124,
      "completion_tokens": 80,
      "total_tokens": 204
    },
    "system_fingerprint": <system_fingerprint>
  }

It only shows problems when the image is the result of a function call.

And about the tools param: shouldn’t GPT have a description of the function it’s getting the result for, in order to use it properly?

It already has the instructions from previous messages and the context from the tool call it made.

You may also want to include a text content block before the image that describes the image’s source. E.g. “Here’s the current image from the camera”
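
In the tool message, that could look roughly like this (the tool_call_id is copied from the request above; the data URL is a placeholder):

# Hypothetical tool-result message: a text block naming the image's source,
# followed by the image itself. The data URL is a placeholder.
tool_message = {
    "role": "tool",
    "tool_call_id": "call_8SeUX71SbopVdaH0VkHXMN9C",
    "name": "look",
    "content": [
        {"type": "text", "text": "Here's the current image from the camera:"},
        {"type": "image_url",
         "image_url": {"url": "data:image/jpeg;base64,<base64 data>", "detail": "low"}},
    ],
}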


Is that documented somewhere? From what I read, I had the impression that completion API requests are stateless.

Chat completion requests are definitely stateless, but the messages list sent back to the API with the tool call response carries all the context the model needs to write a response.

I wrote some code to test this on my end and found that the model doesn’t hallucinate anymore when I include the image source in a text content block in the tool call response.

When it doesn’t hallucinate, it responds every single time with how it cannot “see” the image supplied in the tool call response.

ChatCompletionMessage(content="It seems there was an issue, and I can't see the image you're referring to. Please provide the image again or describe it so I can help you further.", role='assistant', function_call=None, tool_calls=None)

My best guess is that the model can only “see” images that are sent in the user role as of now, and not in the tool call or the system role.

The lack of vision in the first system message is documented; however, the tool-call case isn’t, given that the function-calling capability on the vision model was only just released. I’m not sure if it’s by design or just a gap in the implementation, and it’s something only OpenAI staff can answer.

Here's the code you can run to test on your end
import base64
from openai import OpenAI

client = OpenAI()

def get_base64_encoded_image(image_path):
    with open(image_path, 'rb') as image_file:
        # Read the file
        image_data = image_file.read()
        # Encode the binary data to base64
        base64_encoded_data = base64.b64encode(image_data)
        # Convert to string
        base64_message = base64_encoded_data.decode('utf-8')
        return base64_message

def look():
    """Get the image with user's view"""
    base_64_image = get_base64_encoded_image("PATH TO IMAGE")
    look_content = [
        {
            "type": "text",
            "text": "Here's the image from user's view:",
        },
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{base_64_image}"
            },
        },
    ]
    return look_content
    
def run_conversation():
    # Step 1: send the conversation and available functions to the model
    tools = [{"type":"function","function":{"name":"look","parameters":{"type":"object","properties":{},"required":[]},"description":"Look at what user sees"}}]

    conversation = [
              {
                "role": "user",
                "content": [
                  {
                    "type": "text",
                    "text": "What do you see here?",
                  }
                ]
              },
    ]

    response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=conversation,
            tools=tools,
            tool_choice="auto",  # auto is default, but we'll be explicit
        )
    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    # Step 2: check if the model wanted to call a function
    if tool_calls:
        # Step 3: call the function
        # Note: the JSON response may not always be valid; be sure to handle errors
        available_functions = {
            "look": look,
        }  # only one function in this example, but you can have multiple
        conversation.append(response_message)  # extend conversation with assistant's reply
        # Step 4: send the info for each function call and function response to the model
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_to_call = available_functions[function_name]
            function_response = function_to_call()
            conversation.append(
                {
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": function_response,
                }
            )  # extend conversation with function response
        second_response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=conversation,
        )  # get a new response from the model where it can see the function response
        return second_response
print(run_conversation())

That is a token savings, but the function definition absolutely tells the AI where that function return came from and what its purpose is, letting it then understand the assistant message that sent the tool call request and the return value you give back to the AI model.


And that is the crux of what is being discussed here: the return value.

And the simple fact, overlooked, is that an image input is only allowed in a user message, and that the AI is trained to answer user questions using functions in a [user, assistant->function, function return] message order, where it is the information coming after the user, but not from the user, that better informs the next action of the assistant.

Although it is hard to imagine the exact app where an image is the return from a function, I can propose a way to place the messages so that the image gets in there, in the user message where it must be, and the AI still answers the original question. The sequence of new messages you place would be like this metacode:


user: I have this ominous feeling I’m not going to be able to get yard work done before it rains. What do you think?
assistant functions.multitool({“weather_api”: {“location”: “Walla Walla, WA”, “source”: “radar”, “time”:-1}}, {“weather_api”: {“location”: “Walla Walla, WA”, “source”: “radar”, “time”:now}})
tool: [“success: Doppler radar image returned”, “success: Doppler radar image returned”]
user name:weather_api: “Hi, it’s the weather_api here, giving you the images you requested from radar”, image1, image2

to get from a genius AI: “Your apprehensions regarding the completion of your yard work today are likely justified. I have scrutinized the trajectory of precipitation fronts on radar imagery near Walla Walla, where you reside, and it appears that a swath of rain is approaching your location. By my calculations, it should arrive within two hours. It would be prudent to conclude your outdoor activities promptly!”

What do you think for your own app?
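
Applied to the “look” function from this thread, a rough sketch of that message layout might be as follows (the tool_call_id is reused from the earlier request; the “camera” name and the data URL are made-up placeholders for illustration):

# Sketch of the workaround: the tool returns only a status string, and the
# image itself rides in a follow-up user message attributed to the camera.
messages = [
    {"role": "user", "content": "What do you see here?"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_8SeUX71SbopVdaH0VkHXMN9C", "type": "function",
         "function": {"name": "look", "arguments": "{}"}},
    ]},
    {"role": "tool", "tool_call_id": "call_8SeUX71SbopVdaH0VkHXMN9C",
     "name": "look", "content": "success: camera frame captured"},
    {"role": "user", "name": "camera", "content": [
        {"type": "text", "text": "Hi, it's the camera here, with the frame you requested:"},
        {"type": "image_url",
         "image_url": {"url": "data:image/jpeg;base64,<base64 data>", "detail": "low"}},
    ]},
]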


Images are allowed in user, system and assistant messages.

Here’s the vision quick start guide:

Images can be passed in the user, system and assistant messages. Currently we don’t support images in the first system message but this may change in the future.


Images in other roles is new information to me, and new information to the openai.yaml API specification upon which the API Reference docs are built.

So let’s check. Here is [system, then system+image], sent via the requests library (since the openai Python library validates against the schema):
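
A minimal sketch of that kind of request, assuming the Fortune image URL from the spoiler below and made-up prompt text:

import os
import requests

# Sketch only: a plain system message, then a second system message carrying
# the image. Prompt wording and model choice are assumptions for illustration.
payload = {
    "model": "gpt-4-turbo",
    "messages": [
        {"role": "system",
         "content": "Describe the person in the supplied image and estimate their age."},
        {"role": "system",
         "content": [
             {"type": "text", "text": "Image provided for analysis:"},
             {"type": "image_url",
              "image_url": {"url": "https://content.fortune.com/wp-content/uploads/2023/11/GettyImages-1258459705-e1700340943429.jpg?w=512&q=75",
                            "detail": "low"}},
         ]},
    ],
}
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json=payload,
)
print(resp.json()["choices"][0]["message"]["content"])

The model’s reply: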

The person in the image is a bald male with a slight stubble. He is wearing a light grey suit with a white shirt and no tie. He appears to be speaking or gesturing during a discussion, indicating he might be engaged in a formal or professional setting.

Based on visual cues, his estimated age is 40 years old.

It seems to be working on gpt-4-vision-preview now as well.

Spoiler: who's being looked at?

https://content.fortune.com/wp-content/uploads/2023/11/GettyImages-1258459705-e1700340943429.jpg?w=512&q=75

Funny thing: with an image in just the first system message, which OpenAI says is unsupported, the AI doesn’t say “hey, no picture”, nor is there an API error; we get a complete fabrication, and the image tokens go uncounted in usage:

The image shows a young woman with long, straight brown hair. She is wearing a light blue denim jacket and a white top. Her makeup is natural, featuring subtle eye makeup and a soft pink lip color. She is smiling gently at the camera, and her overall appearance suggests a youthful and casual style.

Visual age estimation: 23 years old.

So it looks like you do indeed have two more options for how you might frame an image from a function, in a role that is not the function’s.
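
For instance, after the tool message returns a plain status string, the image could ride in a later (non-first) system message instead of a user message. A rough sketch, reusing the conversation list from the test code above (the text and data URL are placeholders):

# Sketch: the tool returns only text; a follow-up system message carries the
# image. An assistant message could carry it the same way.
image_followup = {
    "role": "system",
    "content": [
        {"type": "text", "text": "Frame captured by the look function:"},
        {"type": "image_url",
         "image_url": {"url": "data:image/jpeg;base64,<base64 data>", "detail": "low"}},
    ],
}
conversation.append(image_followup)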


I guess your proposal will work fine for my app. I don’t see any other way for now.

I wonder if OpenAI has any plans to allow tools that return an image, multiple images, audio…
