How to define pydantic/JSON schema

Hello,
I am working on a text translation project. I extract text from a PDF, pass it to gpt-4o-mini, and instruct it to translate the text into a target language. The input text is in the following format:
{'0': 'text1', '1': 'text2', '2': 'text3', …}
I require the output in the following format:
{'0': 'translated_text1', '1': 'translated_text2', '2': 'translated_text3', …}

To ensure reliable output, I am using the structured output approach with Pydantic. The class and the LLM call are defined as follows:

class OptFormat(BaseModel):
    index: str
    text: str

client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    temperature=0,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "output_schema",
            "schema": OptFormat.model_json_schema(),
        },
    },
)

This works in most cases and gives the output in the desired format, as follows:
{'0': 'translated_text1', '1': 'translated_text2', '2': 'translated_text3', …}

In a few cases it deviates from the format and returns the output as follows:
{'text': 'translated_text1', 'text': 'translated_text2', 'text': 'translated_text3', …}
or
{'text': {'0': 'translated_text1', '1': 'translated_text2', '2': 'translated_text3', …}, 'index': ''}
I am not sure whether the Pydantic definition is incorrect, or how to define the required output format as a JSON schema in the response_format argument of the API call.
Can someone suggest how to approach this issue?


I think your output structure is fine. If the model is giving you different responses I would focus on the prompt and make it clearer what you want, and try giving it some examples of the expected output. Hope this helps!


Hello.

After analyzing the API call in your code snippet, I can identify the fault.

Your Concern

The issue arises from sending a schema generated directly from a Pydantic model with model_json_schema(). That request does not set the API's strict parameter, which is required to activate structured outputs, and it does not take advantage of the OpenAI library's newer methods, which accept a Pydantic BaseModel directly and validate the response against it.

The structured output documentation shows the new beta method, where you pass the Pydantic class itself as response_format and then read the validated result from the parsed message:

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "AI learning seminar, October 20."},
    ],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed

By integrating these methods into your API call, the model's output is constrained to the specified schema, and the response is also validated; if validation raises an exception, you can catch it and retry or iteratively correct the model's response.
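If you would rather keep the original client.chat.completions.create call, the missing piece is the strict flag inside json_schema. The following is a minimal sketch under some assumptions: the helper name strict_response_format is hypothetical, the schema dict is a hand-written stand-in for OptFormat.model_json_schema(), and only the top-level object is adjusted (nested objects would need the same treatment).

```python
import copy
import json


def strict_response_format(name: str, schema: dict) -> dict:
    """Wrap a JSON schema in a response_format dict with strict mode on.

    Hypothetical helper: it only fixes up the top-level object.
    """
    schema = copy.deepcopy(schema)  # leave the caller's dict untouched
    # Strict mode expects every property to be listed as required
    # and additional keys to be forbidden.
    schema["required"] = list(schema.get("properties", {}))
    schema["additionalProperties"] = False
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": schema},
    }


# Hand-written stand-in for OptFormat.model_json_schema() (assumption).
opt_format_schema = {
    "type": "object",
    "properties": {
        "index": {"type": "string"},
        "text": {"type": "string"},
    },
}

response_format = strict_response_format("output_schema", opt_format_schema)
print(json.dumps(response_format, indent=2))
```

The resulting dict would be passed as response_format= to create(). Unlike parse(), this path does not validate the reply for you, so you would still run json.loads (or a Pydantic validation) on message.content yourself.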

Summary

I hope this better illustrates how to use the native methods that OpenAI's library offers for client-side handling of structured output specifications for the API.

Thanks for responding, @arata. I have a few questions:

  1. Does the new beta method enforce the strict parameter automatically, or do I need to specify it explicitly (in both the beta and the create versions)?
  2. Although the model returns output in the required format, I am not sure how the Pydantic class definition and the output format relate. I need the output in this format:
    {'0': 'translated_text1', '1': 'translated_text2', '2': 'translated_text3', …}
    which does not contain any index key; rather, each entry is an index: text pair.
    I want to understand whether this is the correct way to define the class, or whether it could cause issues later.
    Thanks

I’ll attempt to answer your questions about the .parse() version of chat completions, found in ‘beta’.

When you pass a Pydantic BaseModel as the response_format, the new beta method always builds and sends a schema with strict: true, whether or not you want that.

For example, let's say that you didn't want to use strict: true, but instead wanted some fields to simply be left out of the schema's required list (whereas, for strict: true to be used, all properties must be listed as required). You want the AI to choose whether they are necessary, and omit them if they are not. You'd use Pydantic with optional fields, right? Wrong.

From the documentation, you might assume that strict is a choice and that leaving it false would be respected:

A new option for the response_format parameter: developers can now supply a JSON Schema via json_schema, a new option for the response_format parameter. This is useful when the model is not calling a tool, but rather, responding to the user in a structured way. This feature works with our newest GPT-4o models: gpt-4o-2024-08-06, released today, and gpt-4o-mini-2024-07-18.

When a response_format is supplied with strict: true, model outputs will match the supplied schema.

So we make ourselves a response format in Python:

from typing import Optional

from pydantic import BaseModel, Field

class Vegetables(BaseModel):
    vegetable_name: str = Field(..., description="extract all vegetables from text")
    is_green_color: Optional[bool] = Field(..., description="only use if green vegetable color is true")

class AIResponse(BaseModel):
    response_to_user: str = Field(..., description="What the user reads")
    mentioned_vegetables: Optional[list[Vegetables]] = Field(..., description="Produces UI info links")
    produce_error: Optional[bool] = Field(..., description="DO NOT USE: any use will crash the program!!! ")

Deliberate use of Optional is shown.

  • A normal-thinking AI might think an optional parameter that promises to crash a program would be bad to emit in name.
  • A normal-thinking SDK would note the deliberate choice of “Optional” from typing.

Because parse() only ever produces a strict schema, what would have been simple non-strict JSON turns into a mess of billed tokens needed to make the optional fields work within strict: true:

{
   "type": "json_schema",
   "json_schema": {
      "schema": {
         "$defs": {
            "Vegetables": {
               "properties": {
                  "vegetable_name": {
                     "description": "extract all vegetables from text",
                     "title": "Vegetable Name",
                     "type": "string"
                  },
                  "is_green_color": {
                     "anyOf": [
                        {
                           "type": "boolean"
                        },
                        {
                           "type": "null"
                        }
                     ],
                     "description": "only use if green vegetable color is true",
                     "title": "Is Green Color"
                  }
               },
               "required": [
                  "vegetable_name",
                  "is_green_color"
               ],
               "title": "Vegetables",
               "type": "object",
               "additionalProperties": false
            }
         },
         "properties": {
            "response_to_user": {
               "description": "What the user reads",
               "title": "Response To User",
               "type": "string"
            },
            "mentioned_vegetables": {
               "anyOf": [
                  {
                     "items": {
                        "$ref": "#/$defs/Vegetables"
                     },
                     "type": "array"
                  },
                  {
                     "type": "null"
                  }
               ],
               "description": "Produces UI info links",
               "title": "Mentioned Vegetables"
            },
            "produce_error": {
               "anyOf": [
                  {
                     "type": "boolean"
                  },
                  {
                     "type": "null"
                  }
               ],
               "description": "DO NOT USE: any use will crash the program!!! ",
               "title": "Produce Error"
            }
         },
         "required": [
            "response_to_user",
            "mentioned_vegetables",
            "produce_error"
         ],
         "title": "AIResponse",
         "type": "object",
         "additionalProperties": false
      },
      "name": "AIResponse",
      "strict": true
   }
}

You get redundant "title" entries, plus multiple sub-schemas, definitions, and references, all of which give the AI more to be confused by when this text is passed to the model on your behalf.

When run, the response in message.content includes the program-crashing key, because under strict: true the model cannot omit any required key:

{
  "response_to_user":"My specialty is providing gardening advice, specifically focused on vegetable gardening.",
  "mentioned_vegetables":null,
  "produce_error":null
}

Therefore, use Pydantic and strict when you really want strict.
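When you do not want strict, one option is to skip the Pydantic-to-schema conversion and hand-write the schema for create(). The following is only a sketch of that idea: the schema is my own simplified rewrite of the AIResponse example, optional_keys is a hypothetical helper, and without strict the model is not guaranteed to follow the schema, so the reply still needs checking.

```python
# Hand-written, non-strict version of the AIResponse schema (a simplified
# rewrite for illustration). Only response_to_user is required, so the
# model is free to omit the other keys.
ai_response_schema = {
    "type": "object",
    "properties": {
        "response_to_user": {"type": "string", "description": "What the user reads"},
        "mentioned_vegetables": {
            "type": "array",
            "description": "Produces UI info links",
            "items": {
                "type": "object",
                "properties": {
                    "vegetable_name": {"type": "string"},
                    "is_green_color": {"type": "boolean"},
                },
                "required": ["vegetable_name"],
            },
        },
        "produce_error": {"type": "boolean", "description": "DO NOT USE"},
    },
    "required": ["response_to_user"],
}

# Same envelope as the strict case, but with strict switched off.
non_strict_format = {
    "type": "json_schema",
    "json_schema": {"name": "AIResponse", "strict": False, "schema": ai_response_schema},
}


def optional_keys(schema: dict) -> list[str]:
    """Return the property names the schema allows the model to leave out."""
    required = set(schema.get("required", []))
    return [key for key in schema.get("properties", {}) if key not in required]


print(optional_keys(ai_response_schema))
```

Here non_strict_format would be passed as response_format= to client.chat.completions.create, not to parse().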



For your particular example of wanting an output response like:

{
  "0": "translated_text1",
  "1": "translated_text2",  
  "2": "translated_text3",
  ...
}

(note the double-quotes of JSON)

This is pretty much impossible in structured outputs, unless you know the exact number of items in advance.

That is because every key must be 'required'. The same is true one level up, if this were an object placed within {"translations": {"0": …}}.

You'll likely just want a list, which you can enumerate afterward, or a list of objects with multiple fields to be filled in by the AI if non-consecutive numbering is important.

from typing import List

from pydantic import BaseModel, Field

class VegList(BaseModel):
    list_of_vegetables: List[str] = Field(
        ...,
        description="A list of extracted vegetable names from the user's input."
    )
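Turning either shape back into the {"0": ..., "1": ...} mapping you wanted is a one-liner on the client side. A small sketch, with hypothetical helper names, and assuming the list-of-objects variant uses "index" and "text" fields like your OptFormat class:

```python
def list_to_indexed_dict(translations: list[str]) -> dict[str, str]:
    """Number a plain list of translations consecutively from "0"."""
    return {str(i): text for i, text in enumerate(translations)}


def items_to_indexed_dict(items: list[dict]) -> dict[str, str]:
    """Use the index carried by each object, for non-consecutive numbering."""
    return {str(item["index"]): item["text"] for item in items}


consecutive = list_to_indexed_dict(
    ["translated_text1", "translated_text2", "translated_text3"]
)
sparse = items_to_indexed_dict(
    [{"index": "0", "text": "translated_text1"}, {"index": "5", "text": "translated_text2"}]
)
print(consecutive)
print(sparse)
```

The plain list is the cheaper option in tokens; the list of objects is only worth it when the model has to report which source index each translation belongs to.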

I hope this information has been helpful.

Summary

  • SDK with Pydantic is always strict;
  • All fields must be required when translated to JSON object keys for API;
  • SDK+Pydantic can be innovative in producing complex schemas for otherwise unsupported patterns.
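
The "all fields must be required" rule can be checked before sending a schema. This is a rough sketch, not an official validator: strict_violations is a hypothetical helper that only looks at the two rules discussed here (every property required, additionalProperties set to false) and ignores the rest of the strict-mode restrictions.

```python
def strict_violations(schema: dict, path: str = "$") -> list[str]:
    """List places where a schema breaks the two strict-mode rules above."""
    problems = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        missing = [key for key in props if key not in schema.get("required", [])]
        if missing:
            problems.append(f"{path}: not required: {missing}")
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        for key, sub in props.items():
            problems += strict_violations(sub, f"{path}.{key}")
    if schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems += strict_violations(schema["items"], f"{path}[]")
    for branch in schema.get("anyOf", []):
        problems += strict_violations(branch, path)
    for name, sub in schema.get("$defs", {}).items():
        problems += strict_violations(sub, f"$defs.{name}")
    return problems


# An open-ended {"0": ..., "1": ...} mapping needs free-form keys.
open_mapping = {"type": "object", "additionalProperties": {"type": "string"}}

# Wrapping a list in a single required key satisfies both rules.
list_wrapper = {
    "type": "object",
    "properties": {"translations": {"type": "array", "items": {"type": "string"}}},
    "required": ["translations"],
    "additionalProperties": False,
}

print(strict_violations(open_mapping))
print(strict_violations(list_wrapper))
```

The open mapping is flagged because its keys are not known in advance, which is the reason the index: text dictionary cannot be expressed directly, while the list wrapper passes.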

Thanks a lot, @arata. This is a very detailed explanation, and it helps me understand how the SDK works with Pydantic much better.
