Is it possible to use structured outputs when using the vision model?
I have pictures locally stored which I want to extract information from. I need my outputs in a structured .json format, which I want to specify myself. however the vision tutorial uses URL requests to upload locally stored files.
It may not be possible to use Structured Format and Vision together—in fact, I think a Structured Format Assistant or Completion can only have Functions turned on.
Anyway, I don’t know what the info in the pictures is, but you could use the multimodal 4o to extract the information, then take another Assistant and properly structure that output.
you can use vision with structured output using chat completions. however, as of now, you cannot use vision directly with structured output in assistant api. but there is a workaround, you can delegate vision function as a tool and just pass the output to the main thread which has structured output.