Structured Outputs not reliable with GPT-4o-mini and GPT-4o

The problem here is that the description field is being used incorrectly.

"description": "JSON array, conforming to the type\n\nArray<{\n carrier: string; // the name of the carrier\n available_on: string | null; // a date formatted as "MM/DD/YYYY"\n\n // A city name, such as "New York, NY, USA". If there are multiple possible origins, separate them by "/", such as "New York, NY, USA / San Francisco, CA, USA / Chicago, IL, USA".\n origin: string; \n\n // if the carrier describes the origin as within a certain distance from a city, put the radius here\n // for example if they say "within 50 miles of Chicago, IL" put "50mi" \n origin_radius: string | null;\n\n // A city name, such as "New York, NY, USA". If there are multiple possible destinations, separate them by "/", such as "New York, NY, USA / San Francisco, CA, USA / Chicago, IL, USA".\n destination: string | null;\n\n // if the carrier describes the destination as within a certain distance from a city, put the radius here\n // for example if they say "within 50 miles of Chicago, IL" put "50mi" \n destination_radius: string | null;\n\n truck_type: string; // The type of truck. If none is provided, assume\n}>"

This is NOT what it’s for, and it makes sense why the model would be so confused. You should instead use nested items and then set a reference to them.

          "my_nested_items": {
            "type": "array",
            "items": {
              "$ref": "#/definitions/NestedItem"
            }
          },
"definitions": {
          "NestedItem": {
            "type": "object",
...
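Applied to the routes example above, a compliant shape might look like this (a sketch only: the field names are lifted from the original description string, and strict mode additionally requires additionalProperties: false, every key listed in required, and nullability expressed as a type array):

# Hypothetical rewrite of the "routes" field as a real nested array
# instead of JSON-in-a-string. Untested sketch.
routes_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "carrier": {"type": "string", "description": "the name of the carrier"},
            "available_on": {"type": ["string", "null"], "description": "a date formatted as MM/DD/YYYY"},
            "origin": {"type": "string"},
            "origin_radius": {"type": ["string", "null"], "description": "e.g. 50mi"},
            "destination": {"type": ["string", "null"]},
            "destination_radius": {"type": ["string", "null"]},
            "truck_type": {"type": "string"},
        },
        "required": [
            "carrier", "available_on", "origin", "origin_radius",
            "destination", "destination_radius", "truck_type",
        ],
        "additionalProperties": False,
    },
}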

I am wondering about @jim. Are you also using the description fields in an… unorthodox way?

I would stay away from over-depending on descriptions.
Instead, I would refine, refine, and REFINE the structure to fit the model’s intuition. THEN, once I’ve found a comfortable little spot, I would use the descriptions to “nudge” it in the correct direction if necessary.

It’s important to remember that this structure is made for the model, not for you, and the model NEEDS to understand it. Even if it deviates from your expected schema, you can ALWAYS perform some computation to translate it.

3 Likes

That’s a really good point about writing it FOR the model… I’ll see what I can do. It’s just strange that the exact same schema worked for weeks, since day one of Structured Outputs.

My description for a tool is pretty generic; it just says what the tool is used for, something like:

"description": "Used to check the presence of items related to genre with the goal of finding a complete genre. This function marks genre_complete once it has a value for each item. The response MUST be in JSON."

I mean, maybe I need to be really concrete about “marks genre_complete as TRUE once it has a value for every item,” but I certainly didn’t have to do that during the first half of August.

(The idea being that one error I keep experiencing is the model dropping the genre_complete key entirely, even though it’s present and required in the schema.)
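A defensive client-side workaround I could try looks like this (a sketch; the derivation rule is hypothetical, since it depends on what “complete” means in my schema):

def backfill_genre_complete(args: dict) -> dict:
    # Hypothetical guard: if the model drops the required key, derive it
    # from whether every other property already has a non-null value.
    if "genre_complete" not in args:
        args["genre_complete"] = all(v is not None for v in args.values())
    return args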

There are two strategies you can try:

  1. Explicitly use multiple inference passes:
    a. first, a regular chat message to have it categorize just the genre;
    b. then, including the previous context in the window, do the extraction.

  2. Validate your LLM outputs using pydantic and then use the errors as feedback to the model so it can fix them and return the correct data structure. Here is an example using tooldantic:

import openai, pydantic
from tooldantic import ModelBuilder, OpenAiBaseModel, validation_error_to_llm_feedback

client = openai.OpenAI()

model_builder = ModelBuilder(base_model=OpenAiBaseModel)

MovieGenreModel = model_builder.model_from_json_schema(
    {
        "name": "edit_movie_genre",
        "description": "Used to edit a Genre Component within any kind of Movie. Please respond in JSON.",
        "parameters": {
            "type": "object",
            "properties": {
                "genre_component": {
                    "type": "object",
                    "description": "The component of the Genre or Movie being reviewed, which contains key characteristics.",
                    "properties": {
                        "characteristic": {
                            "type": "string",
                            "description": "The characteristic of narrative structure within the project.",
                            "enum": [
                                "Protagonist's Journey",
                                "Antagonist's Motivation",
                                "Climactic Conflict",
                            ],
                        }
                    },
                    "required": ["characteristic"],
                },
                "template": {
                    "type": "string",
                    "description": "The type of project.",
                    "enum": ["Book", "Novel"],
                },
            },
            "required": ["genre_component", "template"],
        },
    }
)


def structured_chat(
    messages: list,
    model: type[pydantic.BaseModel] | None = None,
    max_recursion_depth: int = 3,
    _recursion_depth: int = 0,
    **kwargs
):

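    # Attach the tool schema and force exactly one tool call whenever a model is provided.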
    if model:
        if _recursion_depth >= max_recursion_depth:
            raise ValueError("Recursion depth exceeded.")
        kwargs["tools"] = [model.model_json_schema()]
        kwargs["tool_choice"] = "required"
        kwargs["parallel_tool_calls"] = False

    r = client.chat.completions.create(model="gpt-4o-mini", messages=messages, **kwargs)
    message = r.choices[0].message
    tc = message.tool_calls[0]
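    # Validate the arguments; on failure, feed the error back as a tool message and retry.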
    try:
        return model.model_validate_json(tc.function.arguments).model_dump()
    except pydantic.ValidationError as e:
        feedback = validation_error_to_llm_feedback(e)
        new_messages = [
            *messages,
            message,
            {"role": "tool", "tool_call_id": tc.id, "content": feedback},
        ]
        return structured_chat(
            new_messages, model, _recursion_depth=_recursion_depth + 1, **kwargs
        )
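
A hypothetical call (the message content here is invented for illustration):

result = structured_chat(
    messages=[{"role": "user", "content": "The hero finally faces the dragon at the story's peak."}],
    model=MovieGenreModel,
)
print(result)
# e.g. {'genre_component': {'characteristic': 'Climactic Conflict'}, 'template': 'Book'}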

@anon10827405 I agree 100% that my prompt is confusing to the LLM and is unlikely to get useful results without different prompting.

This is a minimal repro of a Structured Outputs bug, not a request for help with better prompting.

Structured Outputs makes guarantees about output even if the prompt is confusing to the LLM. The confusing prompt coaxes the LLM to produce output which the Structured Outputs layer is supposed to suppress, but doesn’t.

{"routes": "argah blargh", offered_on_date: "narglebarf"}

would be valid according to the docs. There’s no expectation that "routes" contains valid JSON just because it was requested in the "description" field.

That’s not the issue. The issue is seeing

{"routes": "[]", available_on: null, origin: "Springfield", ...}

which shouldn’t be possible with Structured Outputs, because it contains additional keys.

According to the docs at https://platform.openai.com/docs/guides/structured-outputs/additionalproperties-false-must-always-be-set-in-objects:

Structured Outputs only supports generating specified keys / values, so we require developers to set additionalProperties: false to opt into Structured Outputs.

tl;dr:

The expectation is that Structured Outputs prevents additional top-level members ("origin", "offered_on_date", etc.), even if the prompt is confusing to the LLM.
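
In the meantime, the only defense is client-side. A minimal pydantic sketch that rejects exactly these extra members:

import pydantic

class Availability(pydantic.BaseModel):
    # extra="forbid" is the client-side mirror of additionalProperties: false
    model_config = pydantic.ConfigDict(extra="forbid")
    routes: str
    offered_on_date: str

# Raises pydantic.ValidationError, because "origin" is an extra member:
Availability.model_validate(
    {"routes": "[]", "offered_on_date": "08/15/2024", "origin": "Springfield"}
)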

2 Likes

You discovered a way to use prompt injection to modify the schema. This isn’t the “gotcha” that you think it is.

  1. Explore these emergent capabilities and discover the power of the LLM prompt-engineering patterns that have recently emerged.
  2. ALWAYS validate your LLM outputs. Use pydantic/zod to validate and create feedback loops.

EDIT:

After looking at your playground example, I realized that the issue you’re facing is two-fold:

  1. You are not using structured outputs in the playground and instead you are using the old function calling.
  2. Your schema is not compliant with the new structured outputs.

After fixing the schema and attaching it to the structured outputs json_schema input section of the playground, I was unable to reproduce the reported bugs. Cheers!

[screenshot of the playground run with the fixed schema]

https://platform.openai.com/playground/p/iKkZiaAVduxm11HGl1KaiFZv?model=undefined&mode=chat

1 Like

@nicholishen

  1. You are not using structured outputs in the playground and instead you are using the old function calling.

Structured Outputs is also available in function calling by setting "strict": true, according to https://platform.openai.com/docs/guides/function-calling/introduction

  2. Your schema is not compliant with the new structured outputs.

How so? I’m eager to use Structured Outputs with function calling and would appreciate knowing what to change!

You discovered a way to use prompt injection to modify the schema.

Yep, prompt injection is a fun way to mess with GPT apps and has been for years.

That’s why I’m so excited about Structured Outputs’ promise to provide some limited guarantees about outputs!

Because it promises to be the first time we can get some output guarantees, it’s extra important to know that they can be relied on. This example of them not working, if real, would mean they can’t be relied on, which would defeat the purpose of the feature, so we should care about it!

1 Like

It is, and you have pointed out a valid issue with the playground’s features. I assume that “functions” is still wired up to their deprecated “function_call” endpoint.

If you use a notebook and call the API directly, you won’t be able to reproduce the schema-injection bugs. If you want to test structured outputs in the playground, you must use the response_format instead (at least until they fix the playground).

You can copy and paste your schema into the json_schema input window and the linter will throw the warnings for you.
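
For reference, the direct response_format call looks roughly like this (a sketch, with the schema abbreviated to the two top-level fields from the repro):

import openai

client = openai.OpenAI()

r = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "record_availability",
            "strict": True,  # opts into Structured Outputs
            "schema": {
                "type": "object",
                "properties": {
                    "routes": {"type": "string"},
                    "offered_on_date": {"type": "string"},
                },
                "required": ["routes", "offered_on_date"],
                "additionalProperties": False,
            },
        },
    },
)
print(r.choices[0].message.content)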

Here is the test I ran in the notebook with different seeds for each call; no validation issues and no extra args:

import openai, random, pydantic
from tooldantic import ModelBuilder, ToolBaseModel, OpenAiStrictSchemaGenerator

class NoExtraArgsStrictModel(ToolBaseModel):
    _schema_generator = OpenAiStrictSchemaGenerator
    model_config = {'extra': 'forbid'}

model_builder = ModelBuilder(base_model=NoExtraArgsStrictModel)

Model = model_builder.model_from_json_schema(
    {
        "name": "record_availability",
        "description": "Record excess capacity a carrier has available",
        "parameters": {
            "type": "object",
            "required": ["routes", "offered_on_date"],
            "properties": {
                "routes": {
                    "type": "string",
                    "description": 'JSON array, conforming to the type\n\nArray<{\n  carrier: string; // the name of the carrier\n  available_on: string | null; // a date formatted as "MM/DD/YYYY"\n\n  // A city name, such as "New York, NY, USA". If there are multiple possible origins, separate them by "/", such as "New York, NY, USA / San Francisco, CA, USA / Chicago, IL, USA".\n  origin: string; \n\n  // if the carrier describes the origin as within a certain distance from a city, put the radius here\n  // for example if they say "within 50 miles of Chicago, IL" put "50mi" \n  origin_radius: string | null;\n\n  // A city name, such as "New York, NY, USA".  If there are multiple possible destinations, separate them by "/", such as "New York, NY, USA / San Francisco, CA, USA / Chicago, IL, USA".\n  destination: string | null;\n\n  // if the carrier describes the destination as within a certain distance from a city, put the radius here\n  // for example if they say "within 50 miles of Chicago, IL" put "50mi" \n  destination_radius: string | null;\n\n  truck_type: string; // The type of truck. If none is provided, assume\n}>',
                },
                "offered_on_date": {
                    "type": "string",
                    "description": 'the date the carrier sent the email with the availabilities. Format as "MM/DD/YYYY"',
                },
            },
        },
    }
)

sys = """
We are a freight forwarding company. From time to time we receive emails from carriers, describing excess capacity they have available. Following is an email thread we have received. If it has information about route availability, call the record_availability function with appropriate arguments.

Today is Tuesday, August 20, 2024 and the time is currently 5:38:58 PM EDT.

Use the tools available to you to resolve the email thread below:
"""

user = """ 
Subject: Availability Notification
Date: Thu, Aug 15, 2024 at 5:08 PM
Hello,
I am sharing the availability for today and tomorrow:
- 1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico.
- 1 unit available now in Mexico City, Querétaro, Guanajuato, or nearby areas, heading to Torreón, Laredo, or directly within the USA.

If you have any loads in these areas, we would be happy to review them.
Best regards,
John Smith
"""


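# Run the same extraction with ten random seeds and validate each tool call against the strict model.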
def call_llm(seed):
    r = openai.OpenAI().chat.completions.create(
        model='gpt-4o-mini',
        messages=[
            {"role": "system", "content": sys},
            {"role": "user", "content": user}
        ],
        tools=[Model.model_json_schema()],
        tool_choice='required',
        parallel_tool_calls=False,
        seed=seed
    )
    message = r.choices[0].message
    tc = message.tool_calls[0]
    try:
        validated_data = Model.model_validate_json(tc.function.arguments)
        print(validated_data)
    except pydantic.ValidationError as e:
        print(e.errors())

for i in random.sample(range(100), 10):
    call_llm(i)

# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico. 1 unit available now in Mexico City, Querétaro, Guanajuato, or nearby areas, heading to Torreón, Laredo, or directly within the USA.' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico.' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico.' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico. / 1 unit available now in Mexico City, Querétaro, Guanajuato, or nearby areas, heading to Torreón, Laredo, or directly within the USA.' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico. / 1 unit available now in Mexico City, Querétaro, Guanajuato, or nearby areas, heading to Torreón, Laredo, or directly within the USA.' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico.' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='Springfield/Oakland, Chicago to Arizona/Mexico' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico.' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='Springfield/Oakland, Chicago, or nearby areas to Arizona / Mexico' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='Springfield/Oakland, Chicago, or nearby areas to Arizona or directly within Mexico' offered_on_date='08/15/2024'

If it’s any consolation, I find the 4o has gotten weirdly dumber lately.

2 Likes

TOTALLY agree! There was this brief window of time, the first 2.5–3 weeks of August, where I felt like all my dreams had come true with Structured Outputs, and just in the last week or so it’s really fallen short. Even basic “conversation starters” like “Absolutely!” don’t fit within the context of the system instructions anymore.

Alright… this is super weird now. I just checked the logs for my app, and there are tool requests being made with parameters from an OLDER version of the tool.

In other words, I had a tool, get_character, with a certain set of properties. In an effort to combat hallucinated keys/values from Structured Outputs, I’ve been adjusting and dumbing down the tools to somehow make them work better, and in doing so have removed options that aren’t in the instructions, etc.

And yet, somehow the model (4o-08-06) is now hallucinating properties from an older version of the tool. Probably some cached version which for some reason isn’t getting updated… I’m wondering whether coming up with a completely different tool name and saving it again would somehow fix it.

(I note there is a bug with the Playground where sometimes you have to rename the tool before it will save; then you can rename it back to what you want.)

1 Like

Two months later, this is still a huge problem for GPT-4o-mini and Structured Outputs. The failure rate is now 60% with mini; GPT-4o, on the same codebase with the same requests, is 100% success. Not a big fan of waking up to this every morning:

[screenshot: failing GPT-4o-mini tool calls]

when seeing this from GPT-4o:

[screenshot: the same requests succeeding on GPT-4o]

4o-mini makes up keys, hallucinates enum’d values, and calls msearch all the time. Customers are starting to bail because of it, sooooo… anyone have any ideas?

I have the same problem, and my business model does not allow me to use gpt-4o yet. We have no answers about this or what to do.

Oddly, the point when I started getting finish_reason === "stop" (indicating strict tool use) and well-formed tool calls seemed to be after I added some type annotations to the data I was passing into the completion function in TypeScript. Before the type annotations, OpenAI’s API accepted the calls but returned a finish_reason of "tool_calls" and didn’t follow the schema.

If anyone is very interested, I might be able to narrow this down a bit and provide some code examples. I could also be wrong about the reason, because I might have changed something else and forgotten.

Performance is definitely fluctuating a lot. The docs are very incomplete, and how data schemas are ingested is just a black box: for example, the handling of Optional (I’m sending Pydantic schemas in, though).

I see structured outputs as one of the key features of LLM APIs for generating outputs. Unfortunately, all we’ve gotten here is a lack of support.

If anyone is writing big Pydantic schemas, let’s collaborate. I’ve found that big prompts drag performance down.

Pydantic, though, is observable.

from pydantic import BaseModel
from openai import pydantic_function_tool

class Hello(BaseModel):
    say_hello: str
 
print(Hello.model_json_schema())
print(pydantic_function_tool(Hello))

{'properties': {'say_hello': {'title': 'Say Hello', 'type': 'string'}}, 'required': ['say_hello'], 'title': 'Hello', 'type': 'object'}

{'type': 'function', 'function': {'name': 'Hello', 'strict': True, 'parameters': {'properties': {'say_hello': {'title': 'Say Hello', 'type': 'string'}}, 'required': ['say_hello'], 'title': 'Hello', 'type': 'object', 'additionalProperties': False}}}


The library methods for constructing the JSON that gets sent live in src/openai/lib/_pydantic.py.


Optional is hairy; it works counter to the idea of "strict": true.

from pydantic import BaseModel, Field
from typing import Optional
import json

class Weather(BaseModel):
    city: str = Field(..., description="City name.")
    temp_unit: Optional[str] = Field(None, description="Temperature unit.")
    days: Optional[int] = Field(None, description="Days for forecast.")
    
    class Config:
        title = "Weather"
        extra = "forbid"  # adds additionalProperties: false

# schema output
print(json.dumps(Weather.model_json_schema(), indent=2))

{
  "additionalProperties": false,
  "properties": {
    "city": {
      "description": "City name.",
      "title": "City",
      "type": "string"
    },
    "temp_unit": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Temperature unit.",
      "title": "Temp Unit"
    },
    "days": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Days for forecast.",
      "title": "Days"
    }
  },
  "required": [
    "city"
  ],
  "title": "Weather",
  "type": "object"
}

Not seen are the helper tune-ups and the wrapping of that schema in a container:


{
  "name": "weather",
  "strict": true,
  "schema": {..

Sending that class as input for the AI to understand is indeed pedantic. The "strict": false version, just omitting temp_unit and days from the JSON required list and writing the JSON by hand, is more straightforward if you do want optional fields instead of a sub-schema for every optional case.
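
That is, something like this (a sketch of the hand-written variant):

weather_tool = {
    "type": "function",
    "function": {
        "name": "weather",
        # no "strict": true here; optional really means optional,
        # with no anyOf-with-null sub-schema per field
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name."},
                "temp_unit": {"type": "string", "description": "Temperature unit."},
                "days": {"type": "integer", "description": "Days for forecast."},
            },
            "required": ["city"],  # temp_unit and days simply left out
        },
    },
}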

1 Like

Thanks. While it is indeed observable, what happens at ingestion is still a black box. I see people reporting that descriptions work better specified in the prompt than in the field.

I’ll tinker around. Cheers again!

What happens is you get the schema appended to the system message.

{system}

# Responses

## {your schema name}

{schema is without linefeeds and appears here}

You just have to ask the cooperative box.

What’s odder is the AI seems to write the schema name as output first, like it’s sending to a JSON recipient that could accept multiple types.

… which could be handled by a combination of regex and schema validation on the match…
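
A rough sketch of that combination (assuming the stray schema name precedes a single JSON object):

import json
import re

import pydantic

class Weather(pydantic.BaseModel):
    city: str

def parse_named_json(raw: str, model: type[pydantic.BaseModel]) -> pydantic.BaseModel:
    # Regex out the first {...} span, skipping any schema-name token the
    # model wrote in front of it, then validate the match against the schema.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return model.model_validate(json.loads(match.group()))

print(parse_named_json('weather\n{"city": "Springfield"}', Weather))  # city='Springfield'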

This comment was flagged and suppressed by either a bot or an actual member of the community. The comment:
“If it’s any consolation, I find the 4o has gotten weirdly dumber lately.”
The comment was apropos of an issue with 4o. It is a data point that, for someone like me, would trigger the thought: “Oh yeah, I notice that the system sometimes becomes fatigued, or attempts to downscale its use of resources, and begins to make mistakes it normally would not make, or to cut corners it should not take. I’m not alone.”
As a software developer, my aim is to make software behave as it should, according to the principles of ‘least surprise’, ‘least effort’, and ‘never fail if possible; fail only gracefully, with some indication of what went wrong, why the system can’t fix itself, and what to do about it’.
The data point would be meaningful to someone like me, with more than forty years of programming experience. Perhaps it is just burdensome or extraneous somehow here. For me it would be meaningful, but I won’t be crushed if it remains suppressed. :)

1 Like

Update, as I occasionally get “likes” and replies to some of my earlier comments here.

I eventually gave up and just switched everything over to GPT-4o, and I’ve had 100% success since. The other thing I did was offload a lot of the logic I had been relying on the model to sort out with enums and the like, when really I could do a lot of that on the backend.

Once I started treating it like someone who knows how to make structured-output API calls but isn’t really up to date on my application’s terminology and concepts, I managed to avoid the headache of bad output calls altogether.

3 Likes