OpenAI vision with structured output when uploading local files

Is it possible to use structured outputs when using the vision model?

I have pictures stored locally that I want to extract information from. I need the output in a structured JSON format that I specify myself. However, the vision tutorial uses URL requests to upload locally stored files.

https://platform.openai.com/docs/guides/vision

Structured outputs, on the other hand, require you to use chat completions.

https://platform.openai.com/docs/guides/structured-outputs

Hi,

It may not be possible to use Structured Outputs and Vision together. In fact, I think an Assistant or Completion with structured output can only have Functions turned on.

Anyway, I don’t know what the info in the pictures is, but you could use the multimodal GPT-4o model to extract the information, then have another Assistant structure that output properly.
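A rough sketch of that two-step idea, using two plain chat completions calls rather than Assistants (that substitution is an assumption on my part, as are the function name `extract_then_structure`, the prompts, and the `schema` parameter, which would be a pydantic model you define yourself):

```python
def extract_then_structure(client, image_url, schema, model="gpt-4o-2024-08-06"):
    """Two-step pipeline: free-text vision pass, then a structured text pass.

    `client` is expected to be an openai.OpenAI instance and `schema` a
    pydantic BaseModel subclass; both are supplied by the caller.
    """
    # Step 1: ask the vision model for an unstructured description.
    raw = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe everything you can see or read in this image."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        max_tokens=500,
    )
    extracted_text = raw.choices[0].message.content

    # Step 2: a text-only call that forces the reply into the schema.
    structured = client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": "Convert the user's text into the requested schema."},
            {"role": "user", "content": extracted_text},
        ],
        response_format=schema,
    )
    return structured.choices[0].message.parsed
```

This costs two requests per image, so it is mainly worth considering when the vision step and the structuring step cannot share one call.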

You can use vision with structured output using chat completions. However, as of now, you cannot use vision directly with structured output in the Assistants API. There is a workaround: you can expose the vision step as a tool and pass its output to the main thread, which has structured output enabled.

I don’t know if that’s true. I have this working as an example:

from pprint import pprint

from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()  # reads OPENAI_API_KEY from the environment


# Schema the model's reply is parsed into
class Image(BaseModel):
    description: str
    topic: str = Field(description='the single topic of the image')


# Vision input and structured output in a single chat completions call
response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    response_format=Image,
    max_tokens=300,
)

# The parsed reply is an instance of Image
pprint(response.choices[0].message.parsed)
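For the locally stored pictures in the original question, the same call should work if the remote URL is swapped for a base64 data URL. Below is a minimal sketch under that assumption; the helper names `image_to_data_url`, `build_messages` and `describe_local_image` are made up for illustration, and the schema is the one from the example above.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Read a local image file and return it as a base64 data URL."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_messages(prompt: str, image_path: str) -> list:
    """Pair a text prompt with a local image in a single user message."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
            ],
        }
    ]


def describe_local_image(image_path: str):
    """Send a local image to the model and parse the reply into a schema."""
    # Imported here so the helpers above can be used without the SDK installed.
    from openai import OpenAI
    from pydantic import BaseModel, Field

    class Image(BaseModel):
        description: str
        topic: str = Field(description="the single topic of the image")

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=build_messages("What's in this image?", image_path),
        response_format=Image,
        max_tokens=300,
    )
    return response.choices[0].message.parsed
```

Calling `describe_local_image("photo.jpg")` (with a file path of your own) would return an `Image` instance if the request succeeds. Large files inflate the request body, so resizing before encoding may be worthwhile.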