Returning image as tool output in Assistants API?

Hello!

I’m trying to figure out if there’s a way to return an image as a tool output for Assistants. I can think of several use cases where that would be useful (I’m making a small app that can call a “take screenshot” function so it can automatically see what’s happening on my computer), but the documentation doesn’t seem to cover that possibility.

Is it possible in one way or another, or planned? A workaround I can think of is to simply send the image in a separate message after the assistant calls the function, but I’m guessing that would waste tokens? From my testing, the assistant can’t help but post a message right after calling the function instead of waiting for the image to arrive in the next one, most of the time hallucinating a link to a screenshot that doesn’t exist.

EDIT (accidentally sent the message before I finished typing it, oops)


The only way I can see would be to prompt the Assistant in such a way that it outputs a response you can feed directly into an image generation model (DALL·E) to generate an image.

Yes, you are correct: an Assistant, once it invokes a tool_call, can only sit there and wait for your return value (“tool output”), and will then produce a reply based on that return value, which can’t be a binary file or an image for further vision understanding by the AI.

That means you don’t have the versatility of Chat Completions, where you are in complete control of the messages and functions on every call you make (less so with tools, where you are forced to return an output for the same ID that was called, or ignore the tool call entirely), and where you could place another user message in the context before or after the function result for vision to have a look at.
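
For illustration, here is a minimal Chat Completions sketch of that pattern, assuming the Python SDK and a vision-capable model; the ids, the assistant tool-call message, and the data URL are placeholders standing in for values from the earlier turn, not anything prescribed by the API docs:

from openai import OpenAI

client = OpenAI()

# Placeholders (assumptions, not real values): the assistant message that issued
# the tool call, its call id, and a data URL for the captured screenshot.
tool_call_id = "call_abc123"
assistant_tool_call_message = {
    "role": "assistant",
    "tool_calls": [
        {
            "id": tool_call_id,
            "type": "function",
            "function": {"name": "take_screenshot", "arguments": "{}"},
        }
    ],
}
screenshot_data_url = "data:image/png;base64,iVBORw0KGgo..."  # placeholder

followup = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "What's on my screen right now?"},
        assistant_tool_call_message,
        # Close out the tool call with a plain-text result...
        {"role": "tool", "tool_call_id": tool_call_id, "content": "Screenshot captured."},
        # ...then append a user message carrying the actual image for vision,
        # which Assistants threads won't let you do while a tool call is open.
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Here is the screenshot:"},
                {"type": "image_url", "image_url": {"url": screenshot_data_url}},
            ],
        },
    ],
)
print(followup.choices[0].message.content)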

About the only way in Assistants is to have a function return that has already performed image-to-text, with all the information the AI would answer about anyway, written in natural language.

Your function could take a parameter, e.g. get_user_screenshot(query=“what the AI would like to know about the screenshot”), and then make a separate Chat Completions vision API call with that query and the image.
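
A minimal sketch of that handler, assuming the Python SDK, Pillow for grabbing the screen, and gpt-4o as the vision model (all of those are my own choices for illustration):

import base64
import io

from openai import OpenAI
from PIL import ImageGrab  # Pillow; one way to capture the screen on Windows/macOS

client = OpenAI()

def get_user_screenshot(query: str) -> str:
    # Grab the screen and encode it as a base64 PNG data URL
    shot = ImageGrab.grab()
    buf = io.BytesIO()
    shot.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")

    # Ask a vision-capable chat model the assistant's query about the image
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": query},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }
        ],
    )
    # Plain text, ready to be submitted back to the Assistant as the tool output
    return response.choices[0].message.content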

Hi! I’ll be short.

  1. Upload the file using the Files API (it’s used by Assistants, Batch and other OpenAI APIs to reference files).
  2. Use the id from the POST /files response and submit the tool output with that file id. For example:
{"screenshot_file_id": "file-abc123"}

Perhaps too short - what use can a language AI in Assistants make of an internal file ID from the storage endpoint?

  • can’t see the image (the desire of this topic)
  • can’t provide the ID to an end-user (they’d need a hosted URL)

Only “user” role messages can include images. You cannot push a user message onto a thread while a tool call is still open, and submitting the tool output makes the AI write language about the results.