Structured Outputs not reliable with GPT-4o-mini and GPT-4o

The problem here is the description field being used incorrectly.

"description": "JSON array, conforming to the type\n\nArray<{\n carrier: string; // the name of the carrier\n available_on: string | null; // a date formatted as "MM/DD/YYYY"\n\n // A city name, such as "New York, NY, USA". If there are multiple possible origins, separate them by "/", such as "New York, NY, USA / San Francisco, CA, USA / Chicago, IL, USA".\n origin: string; \n\n // if the carrier describes the origin as within a certain distance from a city, put the radius here\n // for example if they say "within 50 miles of Chicago, IL" put "50mi" \n origin_radius: string | null;\n\n // A city name, such as "New York, NY, USA". If there are multiple possible destinations, separate them by "/", such as "New York, NY, USA / San Francisco, CA, USA / Chicago, IL, USA".\n destination: string | null;\n\n // if the carrier describes the destination as within a certain distance from a city, put the radius here\n // for example if they say "within 50 miles of Chicago, IL" put "50mi" \n destination_radius: string | null;\n\n truck_type: string; // The type of truck. If none is provided, assume\n}>"

This is NOT what it’s for, and it makes sense why the model would be so confused. You should instead use nested items and then set a reference to them.

          "my_nested_items": {
            "type": "array",
            "items": {
              "$ref": "#/definitions/NestedItem"
            }
          },
"definitions": {
          "NestedItem": {
            "type": "object",
...
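
Filled in, a complete strict-mode version of that shape might look like the following (a sketch with assumed field names from the thread; note that I'm using `$defs`, the newer spelling of `definitions`, and adding the `required` / `additionalProperties: false` constraints that Structured Outputs mandates):

```python
# Sketch: nested array items via $ref in a strict-mode schema, so each route
# comes back as a real JSON object instead of JSON stuffed into a string field.
schema = {
    "type": "object",
    "properties": {
        "my_nested_items": {
            "type": "array",
            "items": {"$ref": "#/$defs/NestedItem"},
        },
    },
    "required": ["my_nested_items"],
    "additionalProperties": False,
    "$defs": {
        "NestedItem": {
            "type": "object",
            "properties": {
                "carrier": {"type": "string"},
                "origin": {"type": "string"},
            },
            "required": ["carrier", "origin"],
            "additionalProperties": False,
        }
    },
}
```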

I am wondering, @jim, are you also using the description fields in an… unorthodox way?

I would stay away from over-depending on descriptions.
Instead, I would refine refine and REFINE the structure to fit the model’s intuition. THEN once I’ve found a comfortable little spot I would use the descriptions to “nudge” it in the correct direction if necessary.

It’s important to remember that this structure is made for the model, not for you, and the model NEEDS to understand it. Even if it deviates from your expected schema, you can ALWAYS perform some computations to translate it.
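For instance (a hypothetical sketch): if the model is more reliable emitting a plain list of origin cities, let it, and translate to the "/"-separated convention from the original schema afterwards:

```python
# Hypothetical translation step: the model emits the shape it handles reliably
# (a list of origins), and we convert it into the schema the app expects.
def to_app_schema(model_output: dict) -> dict:
    return {
        "origin": " / ".join(model_output["origins"]),
        "destination": model_output.get("destination"),
    }
```

e.g. `to_app_schema({"origins": ["New York, NY, USA", "Chicago, IL, USA"], "destination": None})` yields `{"origin": "New York, NY, USA / Chicago, IL, USA", "destination": None}`.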


That’s a really good point about writing it FOR the model…I’ll see what I can do - it’s just strange that the same exact schema worked for weeks since day one of structured output.

My description for a tool is pretty generic, just says what the tool is used for, something like:

"description": "Used to check the presence of items related to genre with the goal of finding a complete genre. This function marks genre_complete once it has a value for each item. The response MUST be in JSON."

I mean, maybe I need to be really concrete about “marks genre_complete as TRUE” once it has a value for every item, but I certainly didn’t have to do that the first half of August.

(The idea being that one error I keep experiencing is the model dropping and leaving out the genre_complete key even though it’s present and required in the schema.)

There are two strategies you can try:

  1. Explicitly use multiple inference passes.
    a. a regular chat message to have it categorize just the genre
    b. including the previous context in the window, then do the extraction

  2. Validate your LLM outputs using Pydantic, then use the errors as feedback to the model so it can fix them and return the correct data structure. Here is an example using tooldantic:

import openai, pydantic
from tooldantic import ModelBuilder, OpenAiBaseModel, validation_error_to_llm_feedback

client = openai.OpenAI()

model_builder = ModelBuilder(base_model=OpenAiBaseModel)

MovieGenreModel = model_builder.model_from_json_schema(
    {
        "name": "edit_movie_genre",
        "description": "Used to edit a Genre Component within any kind of Movie. Please respond in JSON.",
        "parameters": {
            "type": "object",
            "properties": {
                "genre_component": {
                    "type": "object",
                    "description": "The component of the Genre or Movie being reviewed, which contains key characteristics.",
                    "properties": {
                        "characteristic": {
                            "type": "string",
                            "description": "The characteristic of narrative structure within the project.",
                            "enum": [
                                "Protagonist's Journey",
                                "Antagonist's Motivation",
                                "Climactic Conflict",
                            ],
                        }
                    },
                    "required": ["characteristic"],
                },
                "template": {
                    "type": "string",
                    "description": "The type of project.",
                    "enum": ["Book", "Novel"],
                },
            },
            "required": ["genre_component", "template"],
        },
    }
)


def structured_chat(
    messages: list,
    model: type[pydantic.BaseModel],
    max_recursion_depth: int = 3,
    _recursion_depth: int = 0,
    **kwargs
):
    if _recursion_depth >= max_recursion_depth:
        raise ValueError("Recursion depth exceeded.")
    # Force the model to call our single tool so we always get arguments back.
    kwargs["tools"] = [model.model_json_schema()]
    kwargs["tool_choice"] = "required"
    kwargs["parallel_tool_calls"] = False

    r = client.chat.completions.create(model="gpt-4o-mini", messages=messages, **kwargs)
    message = r.choices[0].message
    tc = message.tool_calls[0]
    try:
        return model.model_validate_json(tc.function.arguments).model_dump()
    except pydantic.ValidationError as e:
        # Feed the validation errors back as a tool response and retry, so the
        # model can correct its own output.
        feedback = validation_error_to_llm_feedback(e)
        new_messages = [
            *messages,
            message,
            {"role": "tool", "tool_call_id": tc.id, "content": feedback},
        ]
        return structured_chat(
            new_messages, model, _recursion_depth=_recursion_depth + 1, **kwargs
        )
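
For completeness, strategy 1 (multiple inference passes) could be sketched like this, with the chat-completion call injected as a plain callable (`complete` is a placeholder you would wrap around `client.chat.completions.create`):

```python
# Sketch of strategy 1: two inference passes. `complete` is a stand-in for a
# function that sends `messages` to the chat completions API and returns the
# assistant's text; injecting it keeps the flow testable without network calls.
def two_pass_extract(text: str, complete) -> str:
    categorize = {"role": "user", "content": f"Categorize just the genre of:\n{text}"}
    # Pass 1: a regular chat message that only categorizes the genre.
    category = complete([categorize])
    # Pass 2: keep the previous exchange in the context window, then extract.
    return complete([
        categorize,
        {"role": "assistant", "content": category},
        {"role": "user", "content": "Now extract the structured genre fields as JSON."},
    ])
```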

@RonaldGRuckus agree 100% that my prompt is confusing to the LLM and is unlikely to get useful results without different prompting.

This is a minimal repro of a Structured Outputs bug, not a request for help with better prompting.

Structured Outputs makes guarantees about output even if the prompt is confusing to the LLM. The confusing prompt coaxes the LLM to produce output which the Structured Outputs layer is supposed to suppress, but isn’t.

{"routes": "argah blargh", "offered_on_date": "narglebarf"}

would be valid according to the docs. There’s no expectation that "routes" contains valid JSON just because it was requested in the "description" field.

That’s not the issue. The issue is seeing

{"routes": "[]", "available_on": null, "origin": "Springfield", ...}

which shouldn’t be possible with Structured Outputs, because it contains additional keys.

According to the docs at https://platform.openai.com/docs/guides/structured-outputs/additionalproperties-false-must-always-be-set-in-objects:

Structured Outputs only supports generating specified keys / values, so we require developers to set additionalProperties: false to opt into Structured Outputs.
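
For concreteness, a minimal `json_schema` response_format payload that satisfies this (sketched with the thread's field names) looks like:

```python
# Sketch: opting into strict Structured Outputs via response_format. With
# strict=True, every property must be listed in `required` and
# additionalProperties must be False.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "record_availability",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "routes": {"type": "string"},
                "offered_on_date": {"type": "string"},
            },
            "required": ["routes", "offered_on_date"],
            "additionalProperties": False,
        },
    },
}
```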

tl;dr:

The expectation is that Structured Outputs prevents additional top-level members ("origin", "offered_on_date", etc.), even if the prompt is confusing to the LLM.


You discovered a way to use prompt injection to modify the schema. This isn’t the “gotcha” that you think it is.

  1. Explore these emergent capabilities and the new LLM prompt engineering patterns built around them.
  2. ALWAYS validate your LLM outputs. Use pydantic/zod to validate and create feedback loops.

EDIT:

After looking at your playground example, I realized that the issue you’re facing is two-fold:

  1. You are not using structured outputs in the playground and instead you are using the old function calling.
  2. Your schema is not compliant with the new structured outputs.

After fixing the schema and attaching it to the structured outputs json_schema input section of the playground, I was unable to reproduce the reported bugs. Cheers!


https://platform.openai.com/playground/p/iKkZiaAVduxm11HGl1KaiFZv?model=undefined&mode=chat


@nicholishen

  1. You are not using structured outputs in the playground and instead you are using the old function calling.

Structured Outputs is also available in function calling by setting "strict: true" according to https://platform.openai.com/docs/guides/function-calling/introduction
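
A strict tool definition per that guide would look something like this (a sketch with the thread's field names):

```python
# Sketch: Structured Outputs in function calling is opted into per-tool via
# "strict": true; the same required/additionalProperties rules then apply to
# the parameters schema.
tool = {
    "type": "function",
    "function": {
        "name": "record_availability",
        "description": "Record excess capacity a carrier has available",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "routes": {"type": "string"},
                "offered_on_date": {"type": "string"},
            },
            "required": ["routes", "offered_on_date"],
            "additionalProperties": False,
        },
    },
}
```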

  1. Your schema is not compliant with the new structured outputs.

How so? I’m eager to use Structured Outputs with function calling and would appreciate knowing what to change!

You discovered a way to use prompt injection to modify the schema.

Yep, prompt injection is a fun way to mess with GPT apps and has been for years.

That’s why I’m so excited about Structured Output’s promise to provide some, limited guarantees about outputs!

Because it promises to be the first time we can get some output guarantees, it’s extra important to know that they can be relied on. This example of them not working, if real, would mean they can’t be relied on. This would defeat the purpose of the feature, so we should care about it!

It is, and you have pointed out a valid issue with playground’s features. I assume that “functions” is still wired up to their deprecated “function_call” endpoint.

If you use a notebook and call the API directly, you won’t be able to reproduce the schema injection bugs. If you want to test structured outputs in the playground, you must use response_format instead (at least until they fix the playground).

You can copy and paste your schema into the json_schema input window and the linter will throw the warnings for you.

Here is the test I ran in the notebook with different seeds for each call; no validation issues and no extra args:

import openai, random, pydantic
from tooldantic import ModelBuilder, ToolBaseModel, OpenAiStrictSchemaGenerator

class NoExtraArgsStrictModel(ToolBaseModel):
    _schema_generator = OpenAiStrictSchemaGenerator
    model_config = {'extra': 'forbid'}

model_builder = ModelBuilder(base_model=NoExtraArgsStrictModel)

Model = model_builder.model_from_json_schema(
    {
        "name": "record_availability",
        "description": "Record excess capacity a carrier has available",
        "parameters": {
            "type": "object",
            "required": ["routes", "offered_on_date"],
            "properties": {
                "routes": {
                    "type": "string",
                    "description": 'JSON array, conforming to the type\n\nArray<{\n  carrier: string; // the name of the carrier\n  available_on: string | null; // a date formatted as "MM/DD/YYYY"\n\n  // A city name, such as "New York, NY, USA". If there are multiple possible origins, separate them by "/", such as "New York, NY, USA / San Francisco, CA, USA / Chicago, IL, USA".\n  origin: string; \n\n  // if the carrier describes the origin as within a certain distance from a city, put the radius here\n  // for example if they say "within 50 miles of Chicago, IL" put "50mi" \n  origin_radius: string | null;\n\n  // A city name, such as "New York, NY, USA".  If there are multiple possible destinations, separate them by "/", such as "New York, NY, USA / San Francisco, CA, USA / Chicago, IL, USA".\n  destination: string | null;\n\n  // if the carrier describes the destination as within a certain distance from a city, put the radius here\n  // for example if they say "within 50 miles of Chicago, IL" put "50mi" \n  destination_radius: string | null;\n\n  truck_type: string; // The type of truck. If none is provided, assume\n}>',
                },
                "offered_on_date": {
                    "type": "string",
                    "description": 'the date the carrier sent the email with the availabilities. Format as "MM/DD/YYYY"',
                },
            },
        },
    }
)

sys = """
We are a freight forwarding company. From time to time we receive emails from carriers, describing excess capacity they have available. Following is an email thread we have received. If it has information about route availability, call the record_availability function with appropriate arguments.

Today is Tuesday, August 20, 2024 and the time is currently 5:38:58 PM EDT.

Use the tools available to you to resolve the email thread below:
"""

user = """ 
Subject: Availability Notification
Date: Thu, Aug 15, 2024 at 5:08 PM
Hello,
I am sharing the availability for today and tomorrow:
- 1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico.
- 1 unit available now in Mexico City, Querétaro, Guanajuato, or nearby areas, heading to Torreón, Laredo, or directly within the USA.

If you have any loads in these areas, we would be happy to review them.
Best regards,
John Smith
"""


def call_llm(seed):
    r = openai.OpenAI().chat.completions.create(
        model='gpt-4o-mini',
        messages=[
            {"role": "system", "content": sys},
            {"role": "user", "content": user}
        ],
        tools=[Model.model_json_schema()],
        tool_choice='required',
        parallel_tool_calls=False,
        seed=seed
    )
    message = r.choices[0].message
    tc = message.tool_calls[0]
    try:
        validated_data = Model.model_validate_json(tc.function.arguments)
        print(validated_data)
    except pydantic.ValidationError as e:
        print(e.errors())

for i in random.sample(range(100), 10):
    call_llm(i)

# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico. 1 unit available now in Mexico City, Querétaro, Guanajuato, or nearby areas, heading to Torreón, Laredo, or directly within the USA.' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico.' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico.' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico. / 1 unit available now in Mexico City, Querétaro, Guanajuato, or nearby areas, heading to Torreón, Laredo, or directly within the USA.' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico. / 1 unit available now in Mexico City, Querétaro, Guanajuato, or nearby areas, heading to Torreón, Laredo, or directly within the USA.' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico.' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='Springfield/Oakland, Chicago to Arizona/Mexico' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='1 unit available now in Springfield/Oakland, Chicago, or nearby areas, heading to Arizona or directly within Mexico.' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='Springfield/Oakland, Chicago, or nearby areas to Arizona / Mexico' offered_on_date='08/15/2024'
# INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
# routes='Springfield/Oakland, Chicago, or nearby areas to Arizona or directly within Mexico' offered_on_date='08/15/2024'

If it’s any consolation, I find the 4o has gotten weirdly dumber lately.


TOTALLY agree! There was this brief window of time, the first 2.5-3 weeks of August, where I felt like all my dreams had come true with Structured Outputs, and just in the last week or so it’s really fallen short. Even basic “conversation starters” like “Absolutely!” don’t fit within the context of the system instructions anymore.

Alright… this is super weird now: I just checked the logs for my app, and there are tool requests being made with parameters from an OLDER version of the tool.

In other words, I had a tool, get_character, with a certain set of properties. In an effort to combat hallucinated keys/values from Structured Outputs, I’ve been adjusting and dumbing down the tools to somehow make them work better, and in doing so have removed options that aren’t in the instructions anymore.

And yet, somehow the model (4o-08-06) is now hallucinating properties from an older version of the tool. Probably some cached version that for some reason isn’t getting updated… I’m wondering whether coming up with a completely different tool name and saving it again would somehow fix it.

(I note there is a bug in the Playground where sometimes you have to rename the tool before it will save; then you can rename it back to what you want.)