Using Pydantic structured outputs in batch mode

Hello everyone,

I’m having some trouble using Pydantic structured outputs in batch mode. The following is a toy example outlining my problem.

import json
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()
fruits = ["apple", "banana", "orange", "strawberry"]

class Response(BaseModel):
    description: str = Field(description="A short description of the fruit")
    colour: str = Field(description="The fruit's colour")

tasks = []
for fruit in fruits:
    task = {
        "custom_id": f"task-{fruit}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "temperature": 0.1,
            "messages": [
                {
                    "role": "system",
                    "content": "I will give you a fruit, you will provide the information outlined in the structured output"
                },
                {
                    "role": "user",
                    "content": fruit
                }
            ],
            "response_format": Response
        }
    }

    tasks.append(task)

# Creating and uploading the file
file_name = "test/batch_tasks_fruit.jsonl"
with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')
batch_file = client.files.create(
  file=open(file_name, "rb"),
  purpose="batch"
)
print(batch_file)
# Creating the batch job
batch_job = client.batches.create(
  input_file_id=batch_file.id,
  endpoint="/v1/chat/completions",
  completion_window="24h"
)

This code throws a TypeError: Object of type ModelMetaclass is not JSON serializable. After looking around for a bit I found a helpful answer in a post on this forum, which suggests first converting the model to a JSON schema using Pydantic’s model_json_schema() method.

The batches created with:

....
                }
            ],
            "response_format": Response.model_json_schema()
        }
....

…fail, however. The error for each task is: Invalid value: 'object'. Supported values are: 'json_object', 'json_schema', and 'text'.

My question then is: how can I use Pydantic classes for structured outputs in batch mode? Is it even supported right now? I would also rather avoid converting the Pydantic classes to a JSON schema by hand, because my actual project uses nested classes, and converting nested Pydantic models to a JSON schema has an awkward side effect: nested classes are referenced via $ref entries instead of being inlined at the point where they are used.
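For concreteness, here is roughly what that referencing looks like with plain Pydantic (a minimal sketch; Inner and Outer are throwaway names):

from pydantic import BaseModel

class Inner(BaseModel):
    x: str

class Outer(BaseModel):
    inner: Inner

# model_json_schema() emits the nested class under $defs and points to it
# with a $ref instead of inlining it at the usage site:
print(Outer.model_json_schema())
# {'$defs': {'Inner': {...}},
#  'properties': {'inner': {'$ref': '#/$defs/Inner'}},
#  'required': ['inner'], 'title': 'Outer', 'type': 'object'}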


You’ll need to use the schema format specified by OpenAI for structured outputs:

{
    "type": "json_schema",
    "json_schema": {
      "name": "math_response",
      "strict": true,
...
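Applied to the fruit example from the question, the full response_format value for each task body would look something like this (fruit_info is an arbitrary name; note that strict mode requires every property to appear in required and additionalProperties to be false):

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "fruit_info",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "description": {"type": "string", "description": "A short description of the fruit"},
                "colour": {"type": "string", "description": "The fruit's colour"},
            },
            "required": ["description", "colour"],
            "additionalProperties": False,
        },
    },
}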

The easiest way to achieve this is to use a drop-in Pydantic wrapper that automatically outputs this specific format when you call the BaseModel.model_json_schema() method. You could try tooldantic like so:

# pip install -U git+https://github.com/nicholishen/tooldantic.git

from tooldantic import OpenAiResponseFormatBaseModel as BaseModel

tooldantic also handles the nested-class issue: you can bind a new schema generator to the BaseModel in a couple of different ways to inline the schema (i.e. resolve the $refs and $defs).

Note: for very complex schemas, especially those that utilize recursion, I’ve found it best to leave the $defs alone and not inline them (with OpenAI). For most tasks, however, you can save ~30% tokens by resolving them.

Putting it all together, your code might look something like this:

import json
from tooldantic import ToolBaseModel, OpenAiResponseFormatGenerator

class CustomSchemaGenerator(OpenAiResponseFormatGenerator):
    is_inlined_refs = True
    
class BaseModel(ToolBaseModel):
    _schema_generator = CustomSchemaGenerator
    

class Test(BaseModel):
    """This is a test."""
    class Inner(BaseModel):
        inner_test: str
    outer_test: Inner
    
print(json.dumps(Test.model_json_schema(), indent=2))

# {
#   "type": "json_schema",
#   "json_schema": {
#     "name": "Test",
#     "description": "This is a test.",
#     "strict": true,
#     "schema": {
#       "type": "object",
#       "properties": {
#         "outer_test": {
#           "type": "object",
#           "properties": {
#             "inner_test": {
#               "type": "string"
#             }
#           },
#           "required": [
#             "inner_test"
#           ],
#           "additionalProperties": false
#         }
#       },
#       "required": [
#         "outer_test"
#       ],
#       "additionalProperties": false
#     }
#   }
# }

Hi @pietroro and welcome to the community!

Just an addendum to what @nicholishen said. If you are indeed doing BaseModel.model_json_schema(), you will most likely get lots of errors anyway, because OpenAI's interpretation of the JSON schema spec is a bit different (e.g. according to OpenAI, you MUST supply additionalProperties: false for all objects, which is something that Pydantic serialization will not give you!).

So you may need to do some additional "wrapping" of your JSON schema, or use the tooldantic wrapper that Nicholas suggested.
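If you want to see what that wrapping might involve, here is a minimal sketch (enforce_strict is a hypothetical helper, not part of any library) that walks a schema produced by model_json_schema() and forces additionalProperties: false onto every object node:

def enforce_strict(schema: dict) -> dict:
    # Hypothetical helper: recursively set additionalProperties on every
    # object node. Note that strict mode also requires every property to
    # appear in "required", which this sketch does not handle.
    if schema.get("type") == "object":
        schema["additionalProperties"] = False
    for value in schema.values():
        if isinstance(value, dict):
            enforce_strict(value)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    enforce_strict(item)
    return schema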


If anyone else is facing issues using structured outputs with the Batch API, I have found this works the easiest.

Define your schema using Pydantic, then convert it to an OpenAI-compliant strict JSON schema using the to_strict_json_schema function available in the Python client. (Though it is a private method and a bit ugly, this was much easier and more reliable than third-party libraries that have not been tested as extensively.)

from openai.lib._pydantic import to_strict_json_schema
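For instance, plugged into the batch task body from the original question, that might look roughly like this (fruit_info is an arbitrary name, and since to_strict_json_schema is private API, its location may change between client versions):

from openai.lib._pydantic import to_strict_json_schema
from pydantic import BaseModel, Field

class Response(BaseModel):
    description: str = Field(description="A short description of the fruit")
    colour: str = Field(description="The fruit's colour")

# Build the response_format dict that each batch task body expects.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "fruit_info",
        "strict": True,
        "schema": to_strict_json_schema(Response),
    },
}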

@karthik.shivaram how have you managed to recreate instances of your Pydantic models from the Batches API response?

@henry.wilde Just to make sure I have understood your question: do you mean to ask how to parse the JSON string output returned by the batch job back into a Pydantic model instance?

If yes, then you can do it as follows (just a small example).

If you have a pydantic schema defined as below

from pydantic import BaseModel
from typing import Optional, Dict, Union

class ProductAttributes(BaseModel):
    color: Optional[str]
    battery_life: Optional[str]
    connectivity: Optional[str]
    noise_cancellation: Optional[bool]

class Product(BaseModel):
    product_id: str
    name: str
    category: str
    price: float
    stock: int
    attributes: Optional[ProductAttributes]

And the JSON string obtained from the batch job looks like

json_string = '''
{
  "product_id": "12345",
  "name": "Wireless Headphones",
  "category": "Electronics",
  "price": 199.99,
  "stock": 25,
  "attributes": {
    "color": "Black",
    "battery_life": "20 hours",
    "connectivity": "Bluetooth",
    "noise_cancellation": true
  }
}
'''

You can directly load and create an instance of your Pydantic model as follows:

import json

# Parse the JSON string into a dict
data = json.loads(json_string)

# Recreate the Pydantic model
product = Product(**data)

# Print the parsed model
print(product)

# Access specific attributes
print(product.attributes.color)  # Output: Black
print(product.price)             # Output: 199.99
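And in case it helps anyone getting that JSON string out of the batch results in the first place: each line of the batch output file nests a full chat completion under response.body, so a sketch like the following should work (batch_output.jsonl is a placeholder filename):

import json

with open("batch_output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # The model's structured-output string sits inside the nested
        # chat completion body of each result line.
        content = record["response"]["body"]["choices"][0]["message"]["content"]
        product = Product.model_validate_json(content)
        print(record["custom_id"], product)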

What works for me is this: use type_to_response_format_param from the client's parsing helpers to convert the Pydantic model into the response_format parameter.

from openai.lib._parsing._completions import type_to_response_format_param
from openai import OpenAI
from pydantic import BaseModel


class Step(BaseModel):
    explanation: str
    output: str


class MathResponse(BaseModel):
    steps: list[Step]
    final_answer: str


client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful math tutor."},
        {"role": "user", "content": "solve 8x + 31 = 2"},
    ],
    response_format=type_to_response_format_param(MathResponse),
)

message = completion.choices[0].message
print(message)
# ChatCompletionMessage(content='{"steps":[{"explanation":"Start with the equation: 8x + 31 = 2. The goal is to isolate x.","output":"8x + 31 = 2"},{"explanation":"Subtract 31 from both sides to move the constant term away from the left side: 8x = 2 - 31.","output":"8x = -29"},{"explanation":"Now, divide both sides of the equation by 8 to solve for x: x = -29 / 8.","output":"x = -29/8"}],"final_answer":"x = -3.625"}', refusal=None, role='assistant', function_call=None, tool_calls=None)