Structured output - JSON result differs from time to time

Hello - I am trying to use an assistant with structured output, using the following code parts:

class ArticleSummary(BaseModel):
    kundenname: str
    artikelname: str
    pznNr: str
    anzProdukte: str
    bestellnummer: str

class Messages(BaseModel):
    messages: List[ArticleSummary]

and

assistant = client.beta.assistants.create(
    name="Document Analyse Assistant",
    instructions="You are a machine learning researcher, answer questions about the provided pdf-file",
    model="gpt-4o-mini",
    tools=[{"type": "file_search"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "test_schema",
            "schema": Messages.model_json_schema()
        }
    }
)

Generally it works as intended, but sometimes the output looks like this
(only one dictionary in the result)

{'messages': [{'kundenname': 'Stern Apotheke', 'artikelname': 'Simplee mit Gemüse', 'pznNr': 'nicht angegeben', 'anzProdukte': '8x12x500ml', 'bestellnummer': 'nicht angegeben'}]}

and sometimes like that
(every field in a separate dictionary)

{'messages': [{'kundenname': 'Simply Real Nutrition GmbH'}, {'artikelname': 'Simplee mit Gemuese'}, {'pznNr': '18795690'}, {'anzProdukte': '1'}, {'bestellnummer': '13190'}]}

How can I influence that in the result?

Here’s the deal: if you supply a schema that is not marked “strict” and is not constructed to comply with the strict specification, the only thing enforcing the AI’s output is the model’s own understanding of the text it received as the “schema”.

Additionally, Pydantic’s default schema output only marks fields without a default value as required, doesn’t block additionalProperties, and makes liberal use of references and definitions ($ref and $defs). When you use the SDK’s chat completions parse() method, all of that is taken care of for you and the schema is forced to “strict”, but here we have to build the schema ourselves, so that it is easy for the model to understand and also improved in quality.
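Until the schema is actually enforced, a client-side guard can catch the split form shown in the question. The following is only a sketch: normalize_messages is a hypothetical helper, not part of any SDK, and it assumes the fragments of one article arrive in order, with a new article starting whenever a field name repeats.

```python
# Field names taken from the ArticleSummary model in the question.
EXPECTED_KEYS = ("kundenname", "artikelname", "pznNr", "anzProdukte", "bestellnummer")


def normalize_messages(payload: dict) -> list[dict]:
    """Hypothetical helper: return article dicts, merging split fragments.

    Fragments (dicts that carry only some of the expected keys) are merged
    in order; a new article starts whenever a key would otherwise repeat.
    """
    articles: list[dict] = []
    current: dict = {}
    for item in payload.get("messages", []):
        if any(key in current for key in item):
            # A repeated key means the previous article ended incomplete.
            articles.append(current)
            current = {}
        current.update(item)
        if all(key in current for key in EXPECTED_KEYS):
            articles.append(current)
            current = {}
    if current:
        articles.append(current)  # incomplete leftover, kept for inspection
    return articles


# The split output from the question.
split = {"messages": [
    {"kundenname": "Simply Real Nutrition GmbH"},
    {"artikelname": "Simplee mit Gemuese"},
    {"pznNr": "18795690"},
    {"anzProdukte": "1"},
    {"bestellnummer": "13190"},
]}
merged = normalize_messages(split)
print(len(merged), sorted(merged[0]))
```

This only repairs the symptom after the fact; the schema improvements are still what make the model produce the right shape in the first place.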

First, we want a flat schema, probably even when items in $defs are reused. A helper function does that with the JSON schema output of the supported Pydantic version:

import copy

from openai import OpenAI
from pydantic import BaseModel, Field
# from typing import List  # only if not using built-in generics like list[...]

def dereference_schema(schema: dict) -> dict:
    """Return a copy of a Pydantic JSON schema with every $ref inlined from $defs."""
    schema = copy.deepcopy(schema)  # do not mutate the caller's schema
    defs = schema.pop("$defs", {})

    def _resolve_refs(obj):
        if isinstance(obj, dict):
            if "$ref" in obj:
                ref_path = obj.pop("$ref")
                ref_name = ref_path.split("/")[-1]
                ref_schema = defs.get(ref_name)
                if ref_schema is None:
                    raise ValueError(f"Reference {ref_name} not found in definitions.")
                # Deep-copy so each use gets its own dict, then resolve nested refs.
                # Note: a self-referencing (recursive) model would recurse forever here.
                resolved_schema = _resolve_refs(copy.deepcopy(ref_schema))
                obj.update(resolved_schema)
            else:
                for key, value in obj.items():
                    obj[key] = _resolve_refs(value)
        elif isinstance(obj, list):
            obj = [_resolve_refs(item) for item in obj]
        return obj

    return _resolve_refs(schema)
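To see what the flattening does without installing anything, here is a compact, non-mutating variant of the same idea, run on a hand-written schema. The dict below is only a stand-in that imitates the shape Pydantic emits for a nested model, and inline_refs is an illustrative name, not a library function.

```python
def inline_refs(node, defs):
    """Return a copy of node with each {"$ref": "#/$defs/Name"} replaced by its definition."""
    if isinstance(node, list):
        return [inline_refs(item, defs) for item in node]
    if not isinstance(node, dict):
        return node
    if "$ref" in node:
        name = node["$ref"].split("/")[-1]
        # Keys written next to the $ref (e.g. a description) override the definition's.
        siblings = {k: v for k, v in node.items() if k != "$ref"}
        return {**inline_refs(defs[name], defs), **inline_refs(siblings, defs)}
    return {key: inline_refs(value, defs) for key, value in node.items()}


# Hand-written stand-in for the output of model_json_schema() on a nested model.
nested = {
    "$defs": {
        "ArticleSummary": {
            "type": "object",
            "properties": {"kundenname": {"type": "string"}},
            "required": ["kundenname"],
        }
    },
    "type": "object",
    "properties": {
        "messages": {"type": "array", "items": {"$ref": "#/$defs/ArticleSummary"}}
    },
    "required": ["messages"],
}

defs = nested["$defs"]
flat = inline_refs({k: v for k, v in nested.items() if k != "$defs"}, defs)
print(flat["properties"]["messages"]["items"])
```

After flattening, the "items" entry holds the full object definition instead of a pointer, which is the form the model reads most easily.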

Then let’s really enhance that class schema for AI understanding:

  • A useful name (when placed after “# Responses” in internal AI context)
  • A useful title (the main class name)
  • A useful description on each field for the AI to read
  • All fields listed in “required”
  • No extra fields allowed, by setting “additionalProperties” to false

class ArticleSummary(BaseModel):
    kundenname: str = Field(..., description="Name of the customer")
    artikelname: str = Field(..., description="Name of the article")
    pznNr: str = Field(..., description="PZN number of the article")
    anzProdukte: str = Field(..., description="Number of products")
    bestellnummer: str = Field(..., description="Order number")

    # Pydantic v2 style config; ensures additionalProperties: false
    model_config = {"extra": "forbid"}

class JSONListOfEachArticle(BaseModel):
    messages: list[ArticleSummary] = Field(
        ...,
        description="List of article summary objects, one for every item",
    )

    # Pydantic v2 style config; ensures additionalProperties: false
    model_config = {"extra": "forbid"}

You’ll see I even talked about “objects” and what goes in them. You can enhance the descriptions even further.

Now build the metadata for the response format. Everything is now also prepared for this to be strict, and a strict schema will be accepted even when creating an assistant along with file_search, but I expect that with a vector store attached you’ll get a big 500 error.

response_schema = dereference_schema(JSONListOfEachArticle.model_json_schema())
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "JSON_article_output",
        # "strict": True,  # was possible only when not using internal tools
        "schema": response_schema,
    },
}
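Before relying on the schema, it can help to confirm it really follows the two rules from the list above (every field in “required”, no additional properties). The sketch below checks only those two rules, on example dicts; strict_problems is a hypothetical helper, and the real strict specification has further constraints it does not cover.

```python
def strict_problems(schema: dict, path: str = "$") -> list[str]:
    """Hypothetical helper: list where a schema misses the two rules named above."""
    problems = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties is not false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not in required: {missing}")
        for name, sub in props.items():
            problems += strict_problems(sub, f"{path}.{name}")
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems += strict_problems(schema["items"], f"{path}[]")
    return problems


# A loose schema, similar to what a plain model without extra="forbid" yields.
loose = {
    "type": "object",
    "properties": {
        "messages": {
            "type": "array",
            "items": {"type": "object", "properties": {"kundenname": {"type": "string"}}},
        }
    },
    "required": ["messages"],
}
for line in strict_problems(loose):
    print(line)
```

Passing the flattened response_schema to the same function should return an empty list once both rules are met.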

Now you are ready to ask about your “articles” (items for sale?). Here is the schema that was sent, shown as it is actually returned in the assistant object:

  "object": "assistant",
  "tools": [
    {
      "type": "file_search",
      "file_search": {
        "max_num_results": null,
        "ranking_options": {
          "score_threshold": 0.0,
          "ranker": "default_2024_08_21"
        }
      }
    }
  ],
  "response_format": {
    "json_schema": {
      "name": "JSON_article_output",
      "description": null,
      "schema_": {
        "additionalProperties": false,
        "properties": {
          "messages": {
            "description": "List of article summary objects, one for every item",
            "items": {
              "additionalProperties": false,
              "properties": {
                "kundenname": {
                  "description": "Name of the customer",
                  "title": "Kundenname",
                  "type": "string"
                },
                "artikelname": {
                  "description": "Name of the article",
                  "title": "Artikelname",
                  "type": "string"
                },
                "pznNr": {
                  "description": "PZN number of the article",
                  "title": "Pznnr",
                  "type": "string"
                },
                "anzProdukte": {
                  "description": "Number of products",
                  "title": "Anzprodukte",
                  "type": "string"
                },
                "bestellnummer": {
                  "description": "Order number",
                  "title": "Bestellnummer",
                  "type": "string"
                }
              },
              "required": [
                "kundenname",
                "artikelname",
                "pznNr",
                "anzProdukte",
                "bestellnummer"
              ],
              "title": "ArticleSummary",
              "type": "object"
            },
            "title": "Messages",
            "type": "array"
          }
        },
        "required": [
          "messages"
        ],
        "title": "JSONListOfEachArticle",
        "type": "object"
      },
      "strict": false
    },
    "type": "json_schema"
  },