Does the model `gpt-4-vision-preview` have function calling?

I am trying to have the vision model evaluate an image and then, based on its conclusion, call a function. But I am getting this error:

openai.BadRequestError: Error code: 400 - {'error': {'message': '2 validation errors for Request\nbody -> function_call\n  extra fields not permitted (type=value_error.extra)\nbody -> functions\n  extra fields not permitted (type=value_error.extra)', 'type': 'invalid_request_error', 'param': None, 'code': None}}

This leads me to think that it does not support function calling.
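For context, here is a minimal sketch of the kind of request that triggers the error (the function definition and image URL are placeholders, not my exact code):

from openai import OpenAI

client = OpenAI()

# Hypothetical function schema, for illustration only
functions = [{
    "name": "record_verdict",
    "description": "Store the model's conclusion about the image",
    "parameters": {
        "type": "object",
        "properties": {"verdict": {"type": "string"}},
        "required": ["verdict"],
    },
}]

# Passing the legacy functions/function_call parameters to gpt-4-vision-preview
# raises the BadRequestError shown above: the endpoint rejects those fields.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
    functions=functions,
    function_call="auto",
)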

4 Likes

Upon further research, the model gpt-4-vision-preview does not support function calling.

3 Likes


This is a required feature. Please add function calling to the vision model.


Since GPT-4V does not do object segmentation or detection, and therefore provides no bounding boxes for object locations, function calling could augment the LLM with object locations returned by a segmentation or detection/localization function call.

4 Likes

I find this a frustrating limitation and am surprised it has not been added.

As it stands, you have to handle it as a special case with a separate API call.

3 Likes

Yes, this is a very desirable feature. I would find the Vision functionality much more useful with function calling. Without it, it is less useful and I find myself not using it. Was it left out intentionally for safety reasons, or is this an upcoming feature in progress? It has been a while.

2 Likes

Yes, the vision model needs to support function calling, so that you can call a plug-in directly after it reads the image. Otherwise, image recognition is basically useless.

It isn’t entirely useless: you can write a local function that makes a separate call with the vision model, and have a regular (non-vision) model handle the chat and the function calling.

Unfortunately, that means sending the result of the vision call back to the other model, which could then change the result, and that would be annoying.

As things stand, you are forced into this potentially redundant, less efficient, and less effective workaround.

There’s no magic in how they implement function calling support. They just take the list of functions you pass in, append the schema to the end of the prompt, and ask the model to return a JSON object that names the function to invoke. You can do the exact same thing yourself.
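For illustration, the emulation amounts to something like this; the schema, prompt wording, and JSON parsing here are a rough sketch, not OpenAI’s exact format:

import json
from openai import OpenAI

client = OpenAI()

# Hypothetical function schema to expose to the model
schema = {
    "name": "report_objects",
    "parameters": {
        "type": "object",
        "properties": {"objects": {"type": "array", "items": {"type": "string"}}},
        "required": ["objects"],
    },
}

# Append the schema to the prompt and ask for a JSON "function call" back
system = (
    "You can call one function by replying with JSON only, in the form "
    '{"name": "<function name>", "arguments": {...}} and nothing else.\n'
    "Available functions:\n" + json.dumps(schema)
)

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=300,
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": [
            {"type": "text", "text": "List the objects in this image as a function call."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ]},
    ],
)

# Validate before executing; in practice you may also need to strip code fences
call = json.loads(response.choices[0].message.content)
print(call["name"], call["arguments"])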

The magic they are doing is pretraining the AI to recognize specific special tokens that invoke backend API methods. That training is disabled if a tool or function isn’t passed “officially”.


Example trained "function calling" forced into the AI:
system = """
You are ImageCreate.
You employ your python ability to perform one task:
- Create and display 1280x720 images per user specification.

# Tools

## python

// You have a stateless Python Jupyter notebook without file storage.
// Additional libraries installed include:
// PIL (pillow), numpy, io, IPython
// Important: PIL's Image.show() method is broken, \
so use IPython.display for GUI user display of Image object.

## functions

namespace functions {

// do not use functions, only python tool is available
type disabled_function = (_: {
// This null property also will never be used, only use python
null_value: any,
}) => any;

} // namespace functions

""".strip()

user = "Show an image with thousands of colorful circles by writing and executing python code."
messages = [{"role": "system", "content": system},
            {"role": "user", "content": user}]

Output from gpt-4-vision-preview or gpt-4-1106-preview:

Certainly! I will create an image with a multitude of colorful circles using Python. Let’s get started.

That’s all: there is an introduction, and then nothing happens except a “stop” finish reason.

However, for the next step, all I will do is include a do-nothing placeholder tool…everything else is the same system prompt language, simulating what the OpenAI backend creates.
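A placeholder along these lines is enough, mirroring the disabled_function namespace in the prompt above (the exact schema is my guess):

# A do-nothing placeholder tool; its only purpose is to make the API
# inject the real tool/function machinery behind the scenes.
tools = [{
    "type": "function",
    "function": {
        "name": "disabled_function",
        "description": "do not use functions, only python tool is available",
        "parameters": {
            "type": "object",
            "properties": {"null_value": {"type": "null"}},
        },
    },
}]

# Re-run the same request as before (client and messages unchanged),
# now with the tools parameter added.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=500,
    messages=messages,
    tools=tools,
)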

Output with dummy function namespace as tools API parameter:

– output as a tool_calls= argument –

from PIL import Image, ImageDraw
import numpy as np
from IPython.display import display

# Create an image with a white background
width, height = 1280, 720
image = Image.new("RGB", (width, height), "white")
draw = ImageDraw.Draw(image)

# Generate random circles
num_circles = 1000
for _ in range(num_circles):
    # Random position
    x, y = np.random.randint(0, width), np.random.randint(0, height)
    # Random radius
    r = np.random.randint(5, 50)
    # Random color
    color = (np.random.randint(0, 255), np.random.randint(0, 255), np.random.randint(0, 255))
    # Draw the circle
    draw.ellipse((x-r, y-r, x+r, y+r), fill=color, outline=None)

# Display the image
display(image)

Conclusion

You cannot activate the pretraining on functions or tools unless they decide to let you. For gpt-4-vision, they do not let you.

1 Like

Can you think of a technical reason why they don’t?

In the Microsoft Teams AI Library I have a feature called “augmentations” which mimics the enhancements that OpenAI makes to your prompt when you pass in functions/tools. My “sequence” augmentation is nearly verbatim the augmentation they apply when you pass in a list of functions/tools, only mine works better because I add a validation loop that makes it impossible for the model to call an invalid function or violate the function’s schema.

The Teams AI Library also supports other augmentation types like “monologue” which automatically adds an AutoGPT style monologue to your prompt using the same exact list of functions.

Trust me, there’s nothing they’re doing that you can’t do just as easily, and probably better. The proof is that in the Teams AI Library, all augmentations work with all models, including gpt-4-vision-preview.
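For illustration, the validation loop amounts to something like this; a simplified sketch using the jsonschema package, not the Teams AI Library code:

import json
from jsonschema import ValidationError, validate
from openai import OpenAI

def get_validated_call(client: OpenAI, model: str, messages: list,
                       schemas: dict, max_repairs: int = 3) -> dict:
    """Ask the model for a JSON function call and retry until it names a
    known function and its arguments satisfy that function's schema."""
    for _ in range(max_repairs):
        response = client.chat.completions.create(model=model, messages=messages)
        text = response.choices[0].message.content
        try:
            call = json.loads(text)
            validate(call["arguments"], schemas[call["name"]])
            return call
        except (json.JSONDecodeError, KeyError, ValidationError) as err:
            # Feed the error back and ask the model to repair its answer
            messages = messages + [
                {"role": "assistant", "content": text},
                {"role": "user", "content": f"Invalid function call ({err}). "
                                            "Reply again with valid JSON only."},
            ]
    raise RuntimeError("model never produced a valid function call")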

2 Likes

The point is that I had the AI emit Python to a function without any hint of which special-token sequence and function wrapper to produce for a tool recipient. This is all fine-tuned and pre-existing.

To demonstrate how fine-tuned and trained, here is code to run:

from openai import OpenAI

client = OpenAI()

# A dummy tool whose only purpose is to activate the function-calling machinery
tools = [{"type": "function", "function": {"name": "disabled",
  "description": "additional functions are disabled",
  "parameters": {"type": "object", "properties": {"none": {"type": "null"}}}}}]

# No mention of python, tools, or how to call them: just a user request
messages = [{"role": "system", "content": "You are ChatAPI."},
  {"role": "user", "content":
   "Use your python sandbox to add together 100 fractional digits of pi."}]

out = client.chat.completions.create(
  model="gpt-3.5-turbo-0613", top_p=0.01,
  messages=messages, tools=tools,
)

# Print whichever the model produced: user-facing content or a tool call
if out.choices[0].message.content:
  print("content:\n" + out.choices[0].message.content)
if out.choices[0].message.tool_calls:
  print("tool_calls:\n" + out.choices[0].message.tool_calls[0].function.arguments)

Not a whiff of instruction telling the AI how it can run code in its Jupyter notebook, by emitting two special tokens, a to= message, a recipient, a code designation, etc. No system programming and no actual function. Just the user giving a command.

What does it produce? Not content for a user, but a tool call for name='python':

tool_calls:
import math

pi = math.pi
sum_digits = sum(int(digit) for digit in str(pi)[2:102])
sum_digits

Without the inclusion of an API function or tool, the model doesn’t have this behavior, as if there were two different models to select from. That means that -vision is a clean slate for whatever else you might want to try…

1 Like

That’s actually a hallucination… if you look back through some of the posts on here right after they released function support, the model had a bad habit of trying to invoke a hallucinated function named python. They’ve tuned a lot of those hallucinations out but the model can still hallucinate functions that don’t exist. That’s why the docs for functions/tools specifically say you need to validate both the function name and any arguments.

It’s not really a “hallucination” if it works exactly the same as Code Interpreter.

We could say it is safety-related: developer functions combined with vision could amplify the capabilities of the AI, calling functions to fetch imagery and automatically ingesting and classifying it, and so on.

I think it is just that OpenAI (and Microsoft, which gets undiminished AI products) doesn’t want feature-parity competition with their ChatGPT, which runs a vision-enabled model natively.

1 Like

So that doesn’t bode well.

But that wouldn’t make sense, imho, as there is a workaround using a local function; it’s just not as efficient or effective. (And I’m not confident the image prompt wouldn’t be shortened for the query.)

So I gave up and wrote a function.

With hindsight this has one huge advantage: you can use a cheaper LLM to hold the general conversation and restrict calls to GPT-4 Vision to image analysis only.
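Something along these lines, as a simplified sketch of the idea rather than my exact code (the tool name, schema, and example URL are illustrative):

import json
from openai import OpenAI

client = OpenAI()

def analyze_image(image_url: str, question: str) -> str:
    """Local tool: call the vision model for image analysis only."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=300,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
    )
    return response.choices[0].message.content

# Expose the local tool to a cheaper chat model that handles the conversation
tools = [{"type": "function", "function": {
    "name": "analyze_image",
    "description": "Answer a question about an image at a URL",
    "parameters": {"type": "object", "properties": {
        "image_url": {"type": "string"},
        "question": {"type": "string"},
    }, "required": ["image_url", "question"]},
}}]

messages = [{"role": "user", "content":
             "What's in the photo at https://example.com/photo.jpg?"}]
first = client.chat.completions.create(
    model="gpt-3.5-turbo", messages=messages, tools=tools)

# Assumes the model chose to call the tool; validate this in real code
call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
messages += [first.choices[0].message,
             {"role": "tool", "tool_call_id": call.id,
              "content": analyze_image(**args)}]

final = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(final.choices[0].message.content)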

UPDATE: it looks like OpenAI not only released an updated preview model with vision and function calling, but also created a new alias for gpt-4-turbo, which has vision!

gpt-4-turbo

More here: https://platform.openai.com/docs/models/continuous-model-upgrades

My solution is still a good one because it keeps the costs down.

2 Likes