I’m trying to implement a function that lets the AI request the current image from the camera, so it has more context for answering the user’s request.
Previously, to do this I had to send an extra request to a vision model to generate a text description of the frame, and then return that description as the function result to gpt-4 to generate the final response.
That worked as expected: the AI never described something that didn’t match the object captured by the camera.
With gpt-4-turbo I’m trying to avoid that extra request and return the image itself instead of a text description.
So, I have a “look” function described as “Look at what user sees”. It is called after the user’s request “What do you see here?”. When I return a text description, the user gets a valid response based on that description.
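For reference, the text-only variant described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the helper name `text_tool_result` and the sample description string are hypothetical, and only the message fields (`role`, `tool_call_id`, `name`, `content`) come from the request in this post.

```python
import json


def text_tool_result(tool_call_id: str, name: str, description: str) -> dict:
    """Build a tool message whose content is a plain-text scene description.

    Hypothetical helper: the description is assumed to come from a separate
    vision-model request, as in the older two-step flow.
    """
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "name": name,
        "content": description,  # plain string, no image parts
    }


# Placeholder description, purely for illustration.
msg = text_tool_result(
    "call_8SeUX71SbopVdaH0VkHXMN9C",
    "look",
    "Placeholder: text description of the camera frame.",
)
print(json.dumps(msg, indent=2))
```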
But when I try to return an image, the answer looks hallucinated. It doesn’t even come close to describing what’s actually in the picture.
Here’s a simplified version of the request my app sends:
{"temperature":1,
"messages":[
{"content":"What do you see here?","role":"user"},
{"content":[],"role":"assistant","tool_calls":[{"id":"call_8SeUX71SbopVdaH0VkHXMN9C","type":"function","function":{"name":"look","arguments":""}}]},
{"content":[{"type":"image_url","image_url":{"url":<base64 encoded image url>,"detail":"low"}}],"tool_call_id":"call_8SeUX71SbopVdaH0VkHXMN9C","role":"tool","name":"look"}
],"model":"gpt-4-turbo",
"tools":[{"type":"function","function":{"name":"look","parameters":{"type":"object","properties":{},"required":[]},"description":"Look at what user sees"}}]}
Here’s one of the answers I got:
{
"id": <id>,
"object": "chat.completion",
"created": 1712838282,
"model": "gpt-4-turbo-2024-04-09",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The image displays a green-colored parrot sitting on a wooden perch. The parrot has a predominantly green plumage with hints of red on its wings and tail. It appears to be in a cage or an enclosed area with a blurred background that includes other similar perches and possibly more birds."
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 73,
"completion_tokens": 60,
"total_tokens": 133
},
"system_fingerprint": <system_fingerprint>
}
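A quick sanity check on the `usage` block may be useful here. The sketch below assumes that a low-detail image is billed at a flat 85 prompt tokens; that figure is recalled from OpenAI's vision documentation and is not verified in this post. Under that assumption, a `prompt_tokens` value below 85 would suggest the image was never counted as input at all. The constant and function names are hypothetical.

```python
# Assumption: a "detail": "low" image costs a flat 85 prompt tokens.
LOW_DETAIL_IMAGE_TOKENS = 85


def image_probably_counted(usage: dict, image_count: int = 1) -> bool:
    """Return True if prompt_tokens is at least the assumed cost of the images.

    This is only a lower-bound heuristic: it can show that an image was
    likely dropped, but not prove that it was processed.
    """
    return usage["prompt_tokens"] >= image_count * LOW_DETAIL_IMAGE_TOKENS


# Values copied from the response above.
usage = {"prompt_tokens": 73, "completion_tokens": 60, "total_tokens": 133}
print(image_probably_counted(usage))  # prints False
```

If the assumption holds, 73 prompt tokens would cover only the text and tool definition, which would be consistent with the model describing an image it never received.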
Here’s the unescaped version of image_url from that request:
https://paste.mozilla.org/6sz6fJ4B#L1
And here’s the image itself:
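One variation that may be worth trying, assuming the API ignores image parts inside `tool` messages (this post does not confirm that), is to return a short text result from the tool and attach the image in a following `user` message. The sketch below only builds the `messages` list and makes no network call; the helper names `to_data_url` and `build_messages`, the placeholder tool text, and the `"{}"` arguments string are illustrative assumptions.

```python
import base64


def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_messages(question: str, tool_call_id: str, image_bytes: bytes) -> list:
    """Build a conversation where the image travels in a user message."""
    return [
        {"role": "user", "content": question},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": tool_call_id,
                "type": "function",
                # Assumption: "{}" as the JSON-encoded empty argument object.
                "function": {"name": "look", "arguments": "{}"},
            }],
        },
        {
            "role": "tool",
            "tool_call_id": tool_call_id,
            "name": "look",
            # Text-only tool result; the wording is a placeholder.
            "content": "Image captured; it is attached in the next message.",
        },
        {
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": {"url": to_data_url(image_bytes), "detail": "low"},
            }],
        },
    ]


# Three bytes of a JPEG header stand in for a real camera frame.
messages = build_messages(
    "What do you see here?", "call_8SeUX71SbopVdaH0VkHXMN9C", b"\xff\xd8\xff"
)
print(messages[-1]["content"][0]["image_url"]["url"])
```

This keeps the tool call and its result paired by `tool_call_id`, while the image itself is sent in the message role where image parts are known to be accepted.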