Using Pydantic structured outputs in batch mode

Hello everyone,

I’m having some trouble using Pydantic structured outputs in batch mode. The following is a toy example outlining my problem.

import json
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()
fruits = ["apple", "banana", "orange", "strawberry"]

class Response(BaseModel):
    description: str = Field(description="A short description of the fruit")
    colour: str = Field(description="The fruit's colour")

tasks = []
for fruit in fruits:
    task = {
        "custom_id": f"task-{fruit}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "temperature": 0.1,
            "messages": [
                {
                    "role": "system",
                    "content": "I will give you a fruit, you will provide the information outlined in the structured output"
                },
                {
                    "role": "user",
                    "content": fruit
                }
            ],
            "response_format": Response
        }
    }

    tasks.append(task)

# Creating and uploading the file
file_name = "test/batch_tasks_fruit.jsonl"
with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')
batch_file = client.files.create(
    file=open(file_name, "rb"),
    purpose="batch"
)
print(batch_file)
# Creating the batch job
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

This code throws a TypeError: Object of type ModelMetaclass is not JSON serializable. After looking around for a bit I found a helpful answer in a post on this forum, which suggests converting the model to a JSON schema first using Pydantic’s model_json_schema() method.

The batches created with:

....
                }
            ],
            "response_format": Response.model_json_schema()
        }
....

…fail, however. The error for each task is: Invalid value: 'object'. Supported values are: 'json_object', 'json_schema', and 'text'.

My question then is: how can I use Pydantic classes for structured outputs in batch mode? Is it even supported right now? Furthermore, I would rather avoid manually converting the Pydantic classes to a JSON schema, because my actual project uses nested classes, and I have noticed that converting nested Pydantic classes to a JSON schema has an odd effect: nested classes are referenced instead of being inserted directly at the point where they are used.
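For example, plain Pydantic on a minimal nested model (Inner/Outer are made-up names) produces references rather than inlined definitions (output abridged):

import json
from pydantic import BaseModel

class Inner(BaseModel):
    value: str

class Outer(BaseModel):
    inner: Inner

print(json.dumps(Outer.model_json_schema(), indent=2))
# {
#   "$defs": {
#     "Inner": {
#       "properties": {"value": {"title": "Value", "type": "string"}},
#       "required": ["value"],
#       "title": "Inner",
#       "type": "object"
#     }
#   },
#   "properties": {"inner": {"$ref": "#/$defs/Inner"}},
#   "required": ["inner"],
#   "title": "Outer",
#   "type": "object"
# }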

You’ll need to use the schema format specified by OpenAI for structured outputs:

{
    "type": "json_schema",
    "json_schema": {
      "name": "math_response",
      "strict": true,
...

The easiest way to achieve this is to use a drop-in Pydantic wrapper that automatically outputs this exact format when you call the BaseModel.model_json_schema() method. You could try tooldantic like so:

# pip install -U git+https://github.com/nicholishen/tooldantic.git

from tooldantic import OpenAiResponseFormatBaseModel as BaseModel
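For instance (a sketch, assuming tooldantic behaves as described and its model_json_schema() returns the full response_format dict), your original model would only need its base class swapped:

# Sketch: the original Response model rebuilt on the drop-in base.
# Assumes, per the above, that tooldantic's model_json_schema()
# emits the OpenAI response_format structure shown earlier.
from pydantic import Field
from tooldantic import OpenAiResponseFormatBaseModel as BaseModel

class Response(BaseModel):
    description: str = Field(description="A short description of the fruit")
    colour: str = Field(description="The fruit's colour")

# In the batch task body, the schema is now a plain, JSON-serializable dict:
#     "response_format": Response.model_json_schema()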

tooldantic also handles the nested-class issue: you can bind a new schema generator to the BaseModel in a couple of different ways to inline the schema (i.e. resolve the $refs and $defs).

Note: for very complex schemas, especially those that use recursion, I’ve found it best to leave the $defs alone and not inline them (with OpenAI). For most tasks, however, you can save roughly 30% in tokens by resolving them.

Putting it all together, your code might look something like this:

import json
from tooldantic import ToolBaseModel, OpenAiResponseFormatGenerator

class CustomSchemaGenerator(OpenAiResponseFormatGenerator):
    is_inlined_refs = True

class BaseModel(ToolBaseModel):
    _schema_generator = CustomSchemaGenerator

class Test(BaseModel):
    """This is a test."""
    class Inner(BaseModel):
        inner_test: str
    outer_test: Inner
    
print(json.dumps(Test.model_json_schema(), indent=2))

# {
#   "type": "json_schema",
#   "json_schema": {
#     "name": "Test",
#     "description": "This is a test.",
#     "strict": true,
#     "schema": {
#       "type": "object",
#       "properties": {
#         "outer_test": {
#           "type": "object",
#           "properties": {
#             "inner_test": {
#               "type": "string"
#             }
#           },
#           "required": [
#             "inner_test"
#           ],
#           "additionalProperties": false
#         }
#       },
#       "required": [
#         "outer_test"
#       ],
#       "additionalProperties": false
#     }
#   }
# }

Hi @pietroro and welcome to the community!

Just an addendum to what @nicholishen said. If you are indeed doing BaseModel.model_json_schema(), you will most likely get lots of errors anyway, because OpenAI's interpretation of the JSON Schema spec is a bit different (e.g. according to OpenAI, you MUST supply additionalProperties: false for all objects, which is something that Pydantic serialization will not give you!).

So you may need to do some additional “wrapping” of your JSON schema, or use tooldantic that Nicholas suggested.

If anyone else is facing issues using structured outputs with the Batch API, I have found this works the easiest.

Define your schema using Pydantic, then convert it to an OpenAI-compliant strict JSON schema using the to_strict_json_schema function available in the Python client. (Though it is a private method and a bit ugly, this was much easier and more reliable than third-party libraries that have not been tested as extensively.)

from openai.lib._pydantic import to_strict_json_schema
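For example, here is a minimal sketch; the json_schema envelope, the fruit_response name, and the model are my own, following the format shown earlier in the thread:

from openai.lib._pydantic import to_strict_json_schema
from pydantic import BaseModel

class Response(BaseModel):
    description: str
    colour: str

# to_strict_json_schema returns just the bare schema dict (it adds
# additionalProperties: false for you), so wrap it in the
# response_format envelope yourself:
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "fruit_response",  # any name you like
        "strict": True,
        "schema": to_strict_json_schema(Response),
    },
}
# response_format is a plain dict, so it serializes cleanly into the
# batch JSONL body.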

@karthik.shivaram how have you managed to recreate instances of your Pydantic models from the Batches API response?

@henry.wilde Just to make sure I have understood your question, do you mean to ask how you would parse the JSON string output returned by the batch job back into a Pydantic model instance?

If yes, then you can do it as follows (just a small example).

If you have a Pydantic schema defined as below:

from pydantic import BaseModel
from typing import Optional

class ProductAttributes(BaseModel):
    color: Optional[str]
    battery_life: Optional[str]
    connectivity: Optional[str]
    noise_cancellation: Optional[bool]

class Product(BaseModel):
    product_id: str
    name: str
    category: str
    price: float
    stock: int
    attributes: Optional[ProductAttributes]

And the JSON string obtained from the batch job looks like:

json_string = '''
{
  "product_id": "12345",
  "name": "Wireless Headphones",
  "category": "Electronics",
  "price": 199.99,
  "stock": 25,
  "attributes": {
    "color": "Black",
    "battery_life": "20 hours",
    "connectivity": "Bluetooth",
    "noise_cancellation": true
  }
}
'''

You can directly load and create an instance of your Pydantic model as follows:

import json

# Parse the JSON string
data = json.loads(json_string)

# Recreate the Pydantic model
product = Product(**data)

# Print the parsed model
print(product)

# Access specific attributes
print(product.attributes.color)  # Output: Black
print(product.price)             # Output: 199.99
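And if you are starting from the raw batch output file, the JSON string above is what sits at response.body.choices[0].message.content in each output line. A small sketch (the file name is hypothetical; model_validate_json is the Pydantic v2 shortcut that skips the separate json.loads step):

import json

with open("batch_output.jsonl") as f:  # hypothetical output file name
    for line in f:
        record = json.loads(line)
        # Each output line wraps a normal chat completion under response.body
        content = record["response"]["body"]["choices"][0]["message"]["content"]
        product = Product.model_validate_json(content)
        print(record["custom_id"], product.name)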

What works for me is type_to_response_format_param from openai.lib._parsing._completions:

from openai.lib._parsing._completions import type_to_response_format_param
from pydantic import BaseModel
from openai import OpenAI


class Step(BaseModel):
    explanation: str
    output: str


class MathResponse(BaseModel):
    steps: list[Step]
    final_answer: str


client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful math tutor."},
        {"role": "user", "content": "solve 8x + 31 = 2"},
    ],
    response_format=type_to_response_format_param(MathResponse),
)

message = completion.choices[0].message
print(message)
# ChatCompletionMessage(content='{"steps":[{"explanation":"Start with the equation: 8x + 31 = 2. The goal is to isolate x.","output":"8x + 31 = 2"},{"explanation":"Subtract 31 from both sides to move the constant term away from the left side: 8x = 2 - 31.","output":"8x = -29"},{"explanation":"Now, divide both sides of the equation by 8 to solve for x: x = -29 / 8.","output":"x = -29/8"}],"final_answer":"x = -3.625"}', refusal=None, role='assistant', function_call=None, tool_calls=None)
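Since type_to_response_format_param returns a plain dict, it should also drop straight into the batch task body from the original question. A sketch:

from openai.lib._parsing._completions import type_to_response_format_param

task = {
    "custom_id": "task-math",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": "You are a helpful math tutor."},
            {"role": "user", "content": "solve 8x + 31 = 2"},
        ],
        # Plain dict -> json.dumps(task) works for the JSONL file:
        "response_format": type_to_response_format_param(MathResponse),
    },
}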