As everyone is aware, gpt-4-vision-preview does not have function calling capabilities yet. Therefore, there’s no way to provide external context to the GPT-4V model outside of what the “System”, “Assistant”, or “User” messages supply.
I’m curious if anyone has figured out a workaround for injecting external context reliably?
A different form of this question would be:
Any idea if there is a way to use gpt-4-vision-preview as the base model for a ReAct agent?
gpt-4-vision doesn’t accept functions, logprobs, or logit_bias. They pretty much keep the utility of ChatGPT to themselves.
You’d likely have to give a regular model an “analyze image” function with a “prompt” parameter describing the text you want back, then inject a post-prompt along the lines of “the user has made banana-tree.jpg available to the analyze image function”. A sketch of that pattern is below.
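Here’s a rough sketch of that delegation pattern using the openai Python SDK (v1.x). The planner model, the `analyze_image` tool, its parameters, and the example URL are all illustrative assumptions, not anything OpenAI ships:

```python
# Sketch: give a function-calling model (e.g. gpt-4-1106-preview) an "analyze_image"
# tool that delegates to gpt-4-vision-preview. Tool and parameter names are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "analyze_image",
        "description": "Look at an image the user has made available and answer a question about it.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_url": {"type": "string", "description": "URL of the image to analyze."},
                "prompt": {"type": "string", "description": "What to look for or describe in the image."},
            },
            "required": ["image_url", "prompt"],
        },
    },
}]

def analyze_image(image_url: str, prompt: str) -> str:
    """Forward the prompt and image to the vision model and return its answer."""
    vision = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return vision.choices[0].message.content

messages = [
    {"role": "system", "content": "You can call analyze_image when the user refers to an image."},
    # Post-prompt telling the planner model which image is available.
    {"role": "user", "content": "The user has made https://example.com/banana-tree.jpg "
                                "available to the analyze_image tool. Is the tree healthy?"},
]

response = client.chat.completions.create(
    model="gpt-4-1106-preview", messages=messages, tools=TOOLS)
msg = response.choices[0].message

if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = analyze_image(**args)
    # Feed the vision model's answer back to the planner as the tool result.
    messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": result}]
    final = client.chat.completions.create(model="gpt-4-1106-preview", messages=messages)
    print(final.choices[0].message.content)
```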
Function calling is just a way to structure the input and output for tool calling. You can use GPT-4-vision the same way, just without the structured output: tell it that if it needs to use tool X, it should output JSON with the tool name and parameters, then parse that output as you would a function call and add the tool’s result back as a regular user message. A sketch of this approach follows.
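For example, a minimal sketch of that DIY approach, where the `get_weather` tool, the JSON convention, and the image URL are made up for illustration:

```python
# Sketch: roll your own "function calling" with gpt-4-vision-preview by asking it
# to emit a JSON tool call and parsing it yourself. This is a prompting convention,
# not an API feature of the vision model.
import json
from openai import OpenAI

client = OpenAI()

def get_weather(location: str) -> str:
    # Stub tool implementation; replace with a real weather lookup.
    return f"Sunny and 22°C in {location}"

SYSTEM = (
    "You can use the following tool:\n"
    "  get_weather(location: string) -> current weather for a location\n"
    'If you need the tool, reply with ONLY a JSON object like '
    '{"tool": "get_weather", "arguments": {"location": "..."}}. '
    "Otherwise answer the user directly."
)

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": [
        {"type": "text", "text": "What's the weather like where this photo was taken?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/eiffel-tower.jpg"}},
    ]},
]

reply = client.chat.completions.create(
    model="gpt-4-vision-preview", messages=messages, max_tokens=300
).choices[0].message.content

try:
    call = json.loads(reply)                        # model chose to call the tool
    weather = get_weather(**call["arguments"])      # run the tool yourself
    # Hand the tool output back as a plain user message and ask again.
    messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": f"Tool result: {weather}"})
    final = client.chat.completions.create(
        model="gpt-4-vision-preview", messages=messages, max_tokens=300)
    print(final.choices[0].message.content)
except (json.JSONDecodeError, KeyError, TypeError):
    print(reply)                                    # ordinary answer, no tool call
```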
You can use AlphaWave with GPT-4V to reliably return a JSON object specifying the name of a function to call and the parameters to pass that function.
There’s nothing magical that OpenAI is doing to support function calling. They’re just appending text to your prompt describing the available functions and asking the model to return some JSON. You can do that just as easily yourself.
AlphaWave actually enables more reliable function calling because it not only ensures the model returns valid JSON, it also schema-validates everything. That makes it impossible for the model to call an invalid function or return invalid parameters; OpenAI makes no such guarantee. A rough sketch of that validation step is below.
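If you want to roll that validation step yourself, here’s roughly what it looks like with the jsonschema package. This is the general idea, not AlphaWave’s actual implementation, and the tool names in the schema are placeholders:

```python
# Sketch: validate the model's JSON against a tool-call schema before executing it,
# and return None so the caller can ask the model to retry on failure.
import json

import jsonschema
from jsonschema import ValidationError

TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "tool": {"type": "string", "enum": ["analyze_image", "get_weather"]},
        "arguments": {"type": "object"},
    },
    "required": ["tool", "arguments"],
}

def parse_tool_call(model_output: str) -> dict | None:
    """Return a validated tool call, or None if the output is malformed."""
    try:
        call = json.loads(model_output)
        jsonschema.validate(call, TOOL_SCHEMA)   # rejects unknown tools / bad shapes
        return call
    except (json.JSONDecodeError, ValidationError):
        return None
```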