I’m trying to figure out if there’s a way to return an image as a tool output for Assistants. I can think of several use cases where that would be useful (I’m making a small app that can access a “take screenshot” function to see what’s happening on my computer automatically), but the documentation doesn’t seem to cover that possibility.
Is it possible in one way or another, or planned? A workaround I can think of is to simply send the image in a separate message after the assistant has called the function, but I’m guessing that would probably waste tokens? From my testing, the assistant can’t help but post a message after calling the function instead of waiting for the image to be sent in the next one, and most of the time it just hallucinates a link to a screenshot that doesn’t exist.
EDIT (accidentally sent the message before I finished typing it, oops)
The only way I see would be to prompt the Assistant in such a way that it outputs a response you can feed directly into an image-generation model (DALL·E) to generate an image.
Yes, you are correct: once an Assistant invokes a tool_call, it can only sit there and wait for your return value (the “tool output”), and it will then produce a reply based on that return value — which can’t be a binary file or an image for the AI to inspect with vision.
That means you don’t have the versatility of Chat Completions, where you are in complete control of the messages and functions on every call you make (less so with tools, where you are forced to return a result for the same ID that was called, or ignore the tool call), and where you could place another user message in the context before or after the function result for vision to have a look at.
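To make that Chat Completions flexibility concrete, here is a minimal sketch of building the message list after a tool call: the tool result must echo the model’s `tool_call_id`, and you can then append a user message carrying the actual image as a base64 data URL (the helper name and the tool’s text content are illustrative, not from any SDK):

```python
import base64


def messages_after_tool_call(history: list, tool_call_id: str,
                             png_bytes: bytes) -> list:
    """Extend a Chat Completions message list after a tool call.

    In Chat Completions you own the message list, so after answering a
    tool call (echoing the same tool_call_id the model issued) you can
    append a user message carrying the screenshot for vision to inspect.
    """
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return history + [
        # The tool result must reference the id the model generated.
        {"role": "tool", "tool_call_id": tool_call_id,
         "content": "Screenshot captured; see the attached image."},
        # A follow-up user message carrying the actual pixels.
        {"role": "user", "content": [
            {"type": "text", "text": "Here is the screenshot:"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]},
    ]
```

You would pass the resulting list back into `chat.completions.create` with a vision-capable model; the Assistants API gives you no equivalent hook between the tool output and the model’s next reply.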
About the only way in Assistants is to have the function return a result that has already performed image-to-text: all the information the AI would answer about anyway, written in natural language.
Your function could take a parameter like get_user_screenshot(query="what the AI would like to know about the screenshot") and then make a separate Chat Completions vision API call with that query as the user message, alongside the image.
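A minimal sketch of that pattern with the OpenAI Python SDK’s Chat Completions vision input (the `take_screenshot` helper is hypothetical, and `gpt-4o` stands in for any vision-capable model; passing the image as a base64 data URL is the documented approach for local images):

```python
import base64


def build_vision_messages(query: str, png_bytes: bytes) -> list:
    """Package the Assistant's query plus a screenshot as a single
    Chat Completions vision message (image as a base64 data URL)."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": query},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }]


def get_user_screenshot(query: str) -> str:
    """Tool implementation: describe the current screen in natural
    language, so the result fits in a plain-text tool output."""
    from openai import OpenAI  # deferred so the helper above stays importable

    png_bytes = take_screenshot()  # hypothetical: grabs the screen as PNG bytes
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=build_vision_messages(query, png_bytes),
    )
    return response.choices[0].message.content
```

The returned string is what you submit back to the Assistants run via submit_tool_outputs — the Assistant never sees the pixels, only the vision model’s natural-language description.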