Structured Outputs not reliable with GPT-4o-mini and GPT-4o

Have had pretty phenomenal success with Structured Outputs and GPT-4o; all tools are following the specified schemas, particularly ENUMs.

Starting to notice that GPT-4o-mini is less reliable: more errors, missing keys, inaccurate values (with a list of 3 ENUM strings, it comes back with something completely made up).

Sent a support request, but Fin from Intercom just told me to provide them information I already gave them. :smile:

Is this a known issue with the mini model? Having to put all kinds of error handling back in…

8 Likes

Mini is, well, mini.

Faster, cheaper, but also fewer neurons.

I wouldn't go for a weaker model if reliability is important. Depending on your exact use case, it might be possible to tweak the prompt to get a similar result, but I think what you discovered is generally true:

You can trade inference cost for engineering cost; decreasing one will probably increase the other.

7 Likes

Can you share the code to reproduce this @jim ?

2 Likes

Yeah, definitely, I'll find some time later today - it's definitely the mini model. It's leaving out keys now that are listed as required - only my runs on gpt-4o-mini fail like this - all the gpt-4os are 100% success (which is awesome and so helpful).

I just wanted others to be aware, as it's advertised as a 100% success rate (and hopefully they'll figure out a way to get it there!)

2 Likes

Did you provide a json schema + example json in the prompt?

Not in the prompt but definitely in the tools. Like I said, as long as I have gpt-4o set as the model, structured output is 100% reliable. It's only on gpt-4o-mini that keys get dropped.

Actually, scratch that, there were some instances tonight where gpt-4o-08-06 was not following the schema: top-level items were somehow being included as second-level props. For instance, if my schema was:

  • fruit
  • vegetables
  • spices

It was coming back as:

  • fruit
    • vegetables
    • spices

I'm hoping it was only a temporary thing tonight because that would be a huge bummer if somehow all of this is degrading.

GPT-3.5 works very well on that when given the right prompt.
I got at most 0.25% errors even with bigger JSON formats.

Maybe you could also try giving it more context, like:

extract the hierarchy of food for my knowledge graph. Find the first level, use that as the category, and nest everything below it.

e.g.

fruits

  • apple
  • cherry

give me the result as json.

1 Like

I could do that… but Structured Outputs are sold as 100% reliable responses without additional prompting. I believe that's the idea behind the slow initial request that processes and caches the schema in the background.

If I go back to prompt engineering, then I'm back where I was a year ago with all sorts of sanitizing and error catching.

What I LOVED about SO when it first came out was that every single response was 100% accurate. As I'm rolling it out to more and more users I'm starting to see less than 100% when using gpt-4o-mini.

GPT-4o-mini hasn't even been out for a year. Have you ever seen software that works 100% reliably?
Also, this is not the place to rant about it, I guess. It is a place to find a solution.

Oh I think you might be misunderstanding. Not ranting at all and I really appreciate the help. But 3.5 doesn't even work with Structured Outputs, so I'm not sure if you understand what I'm referring to specifically.

The new strict mode for JSON is guaranteed 100% reliability. It's there in the docs and all throughout the cookbook examples. For the first two weeks I did experience this, but now I'm getting failures and my code has not changed.

I'm only explaining it in detail here because sometimes it's nice to have a place like this where you can arrive and instantly find out, "oh, someone else is having the same issue." And then once a solution is provided, I can post it here for future users.

I did send in bug reports along with IDs on the runs and threads, but since they're relying on Fin and Intercom to respond I'm not 100% sure anyone's addressing it.

2 Likes

Isn't that impossible? The structured output is forced to match the structure 100%.

2 Likes

I believe so. Although I have only been using it myself recently, I do have a faint memory of the model hallucinating a property.

I would recommend that anyone first try using structured outputs in the Playground.

It'd be nice if OP shared their code and structure.

1 Like

I will this weekend. Some of it's proprietary, so it's not just a clear copy and paste.

You can feed it through GPT and instantly strip anything proprietary.

How bad are the outputs, and how often? I find this difficult to believe, since I have had excellent structured JSON results (yes, text outputs actually generated as functional JSON, as instructed and in the format instructed) in definitely more than 99 out of 100 generations using Llama 3.1 8B on a consumer GPU.

So I would expect GPT-4o to be able to do better than Llama 3.1 8b!

I can get better analytics, but it's essentially somewhere in the 97-98% range. This is only in the last week, as my first two weeks of experience with Structured Outputs with JSON were 100% for both gpt-4o and gpt-4o-mini.

For instance, over the past 90 requests for tool_calls there was ONE request that hallucinated a property not set in the schema. This was with gpt-4o-08-06 NOT gpt-4o-mini.

An example of the schema (edited) is something like this:

{
  "name": "edit_movie_genre",
  "description": "Used to edit a Genre Component within any kind of Movie. Please respond in JSON.",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "genre_component": {
        "type": "object",
        "description": "The component of the Genre or Movie being reviewed, which contains key characteristics.",
        "properties": {
          "characteristic": {
            "type": "string",
            "description": "The characteristic of narrative structure within the project.",
            "enum": [ "Main Genre Element 1", "Main Genre Element 2", ... ]
          }
        },
        "required": [
          "characteristic"
        ],
        "additionalProperties": false
      },
      "template": {
        "type": "string",
        "description": "The type of project.",
        "enum": [
          "Book",
          "Novel"
        ]
      }
    },
    "additionalProperties": false,
    "required": [
      "genre_component",
      "template"
    ]
  }
}

The characteristic enum property is an array of 100 items (cut for space here).

In the past 24 hours the model hallucinated a value that was not listed in the array of enums. Over the past couple of days there were a handful of tool calls that shifted parent properties to be child properties of another parent.
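Until that stops happening, one stopgap is validating each tool call locally before acting on it. A minimal dependency-free sketch - `validate_tool_args` is a hypothetical helper name, and the tiny enum below stands in for the real 100-item list:

```python
# Stopgap validation of tool-call arguments before acting on them.
# Catches the three failure modes described above: missing required keys,
# extra keys (mis-nested props), and values outside the enum.
REQUIRED = {"characteristic", "template"}
ENUMS = {
    "characteristic": {"Element 1", "Element 2"},  # stands in for the 100-item list
    "template": {"Book", "Novel"},
}

def validate_tool_args(args: dict) -> list[str]:
    """Return human-readable problems; an empty list means the call is valid."""
    problems = [f"missing required key: {k}" for k in REQUIRED - args.keys()]
    problems += [f"unexpected key: {k}" for k in args.keys() - REQUIRED]
    for key, allowed in ENUMS.items():
        if key in args and args[key] not in allowed:
            problems.append(f"{key}: value {args[key]!r} is not in the enum")
    return problems

# A hallucinated enum value like the one described above fails validation:
problems = validate_tool_args({"characteristic": "Made Up", "template": "Book"})
```

The same check can feed an error message back to the model instead of letting a bad call reach application code.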

It's not really about the schema; as I mentioned, it works 89 times out of 90. The issue is that it is advertised as 100% reliable (from the blog post):

it still did not meet the reliability that developers need to build robust applications. So we also took a deterministic, engineering-based approach to constrain the model's outputs to achieve 100% reliability.

All in all, it's working great and I'm extremely pleased, as all the error catching and sanitizing over the past year or so of function calls has been less than enjoyable. When they released Structured Outputs the first week of August, I quickly converted over and was super impressed - it was 100% with the same exact tooling above.

It's only in the last week that I've started to see errors with the model not following the schema 100%. Yes, I can put back in all the error catching, etc., but there was something quite magical about the 100% results that I would love to recapture if at all possible.

2 Likes

Same issue, have a repro for you:

https://platform.openai.com/playground/chat?models=gpt-4o-mini&preset=preset-i3LiIoW9aaFGVgAp7TGBRXqG

The function definition for record_availability says it should take 2 parameters:

  • routes: string // a stringified JSON of {"carrier": string, "available_on": string|null, "origin": string, …}
  • available_on: string // date formatted as MM/DD/YYYY

(I'm aware I'm "holding it wrong" by asking for "routes" to contain stringified JSON. I coded myself into a corner and discovered this by accident.)

What I get

record_availability({
  "routes": "Springfield/Oakland, Chicago, or nearby areas to Arizona or Mexico",
  "offered_on_date": "08/15/2024",
  "carrier": "John Smith",
  "available_on": null,
  "origin": "Springfield/Oakland, Chicago, or nearby areas",
  ...
})

What's wrong

The top-level record_availability() call has additional parameters that are supposed to go into the "routes" parameter's JSON.stringified content.

"additionalProperties": false is set, so these extra parameters are invalid.

Looking closely, "routes" and "offered_on_date" are members of the top-level params. So technically this meets the function call schema requirements if we ignore the additional members. (The "routes" string isn't the stringified JSON requested, but we expect no guarantees from structured outputs about it.)

Model

This misbehavior happens with gpt-4o-mini. At least for this sample, gpt-4o conforms to the schema correctly. (In fact, gpt-4o fills "routes" with correct stringified JSON, though again we don't expect it.)

Structured Outputs are supposed to work with gpt-4o-mini as well as gpt-4o, and not rely on model quality.
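For what it's worth, the stringified-JSON-inside-a-string pattern sits outside what strict mode can constrain - the grammar only sees an opaque string. If reworking the schema is an option, declaring routes as a typed array of objects keeps every field under the constrained grammar. A sketch of what that might look like (field names guessed from the repro above, not the actual preset):

```python
import json

# Hypothetical rework: "routes" as a typed array instead of a stringified blob,
# so strict mode can enforce each inner field. Strict mode requires every
# property to appear in "required" and additionalProperties to be false.
record_availability = {
    "name": "record_availability",
    "strict": True,
    "parameters": {
        "type": "object",
        "properties": {
            "routes": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "carrier": {"type": "string"},
                        "available_on": {"type": ["string", "null"]},
                        "origin": {"type": "string"},
                    },
                    "required": ["carrier", "available_on", "origin"],
                    "additionalProperties": False,
                },
            },
            "available_on": {"type": "string"},
        },
        "required": ["routes", "available_on"],
        "additionalProperties": False,
    },
}

print(json.dumps(record_availability, indent=2))
```

With this shape there is nothing left for the model to flatten into the top level, since the extra fields now have real slots in the schema.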

1 Like

Always good to see that I'm not the only one!

Unfortunately, it's getting worse - even with gpt-4o-08-06. 75% failure rate now, where 3 out of 4 tool calls to the same tool on separate runs dropped an important key specified in the schema.

Spent all day setting up error handling so that the tool responds with a "Hey, you forgot such-and-such key…", but would love to get that advertised 100% success rate back!

I'm experiencing the same issue. When using structured output with the Assistant API, the success rate drops significantly whenever user instructions contradict the required format. While I understand this complicates things, it's disappointing that the structured output isn't consistently enforced as promised in the documentation.

Furthermore, I've also noticed a sharp increase in error rates since last week.

I received a response from OpenAI support last night that said they reviewed my case and decided it was "an issue with my code" and that I should check here or Discord for help. :smile:

To remove any chance it could be on my end, I did what @jaredp did and ran it through the Playground. Using gpt-4o-mini it failed to respond with the correct schema not once, but twice, before finally responding with all keys.