Structured Outputs not reliable with GPT-4o-mini and GPT-4o

Have had pretty phenomenal success with Structured Outputs and GPT-4o; all tools are following the specified schemas, particularly ENUMs.

Starting to notice that GPT-4o-mini is less reliable: more errors, missing keys, inaccurate values (with a list of 3 ENUM strings, it comes back with something completely made up).

Sent a support request, but Fin from Intercom just told me to provide them information I already gave them. :smile:

Is this a known issue with the mini model? Having to put all kinds of error handling back in…

8 Likes

Mini is, well, mini.

Faster, cheaper, but also fewer neurons.

I wouldn't go for a weaker model if reliability is important. Depending on your exact use case, it might be possible to tweak the prompt to get a similar result, but I think what you discovered is generally true:

You can trade inference cost for engineering cost; decreasing one will probably increase the other.

7 Likes

Can you share the code to reproduce this @jim ?

2 Likes

Yeah, definitely, I'll find some time later today - it's definitely the mini model. It's leaving out keys now that are listed as required - only my runs on gpt-4o-mini fail like this - all the gpt-4os are 100% success (which is awesome and so helpful).

I just wanted others to be aware, as it's advertised as a 100% success rate (and hopefully they'll figure out a way to get it there!)

2 Likes

Did you provide a json schema + example json in the prompt?

Not in the prompt but definitely in the tools. Like I said, as long as I have gpt-4o set as the model, structured output is 100% reliable. It's only on gpt-4o-mini that keys get dropped.

Actually, scratch that, there were some instances tonight where gpt-4o-08-06 was not following the schema: top-level items were somehow being included as second-level props. For instance, if my schema was:

  • fruit
  • vegetables
  • spices

It was coming back as:

  • fruit
    • vegetables
    • spices

I'm hoping it was only a temporary thing tonight because that would be a huge bummer if somehow all of this is degrading.

GPT-3.5 works very well on that when given the right prompt.
I got at most 0.25% errors even with bigger JSON formats.

Maybe you could also try giving it more context, like:

extract the hierarchy of food for my knowledge graph. Find the first level, use that as the category, and nest everything below it.

e.g.

fruits

  • apple
  • cherry

give me the result as json.

1 Like

I could do that… but Structured Outputs are sold as 100% reliable responses without additional prompting. I believe that's the idea behind the slow initial request that processes and caches the schema in the background.

If I go back to prompt engineering, then I'm back where I was a year ago with all sorts of sanitizing and error catching.

What I LOVED about SO when it first came out was that every single response was 100% accurate. As I'm rolling it out to more and more users I'm starting to see less than 100% when using gpt-4o-mini.

GPT-4o-mini hasn't even been out for a year. Have you ever seen software that works 100% reliably?
Also, this is not the place to rant about it, I guess. It is a place to find a solution.

Oh I think you might be misunderstanding. Not ranting at all and I really appreciate the help. But 3.5 doesn't even work with Structured Outputs, so I'm not sure if you understand what I'm referring to specifically.

The new strict mode for JSON is guaranteed 100% reliability. It's there in the docs and all throughout the cookbook examples. For the first two weeks I did experience this, but now I'm getting failures and my code has not changed.

I'm only explaining it in detail here because sometimes it's nice to have a place like this where you can arrive and instantly find out, "oh, someone else is having the same issue." And then once a solution is provided, I can post it here for future users.

I did send in bug reports along with IDs on the runs and threads, but since they're relying on Fin and Intercom to respond I'm not 100% sure anyone's addressing it.

2 Likes

Isn't that impossible? The structured output is forced to match the structure 100%.

2 Likes

I believe so. Although I have only been using it myself recently, I do have a faint memory of the model hallucinating a property.

I would recommend that anyone first try using structured outputs in the Playground.

It'd be nice if OP shared their code and structure.

1 Like

I will this weekend. Some of it's proprietary, so it's not just a clear copy and paste.

You can feed it through GPT and instantly strip anything proprietary.

How bad are the outputs, and how often? I find this difficult to believe, since I have had excellent structured JSON results (yes, text outputs actually generated as functional JSON, as instructed and in the format instructed) in definitely more than 99 out of 100 generations using Llama 3.1 8B on a consumer GPU.

So I would expect GPT-4o to be able to do better than Llama 3.1 8b!

I can get better analytics, but it's essentially somewhere in the 97-98% range. This is only in the last week, as my first two weeks of experience with Structured Outputs with JSON were 100% for both gpt-4o and gpt-4o-mini.

For instance, over the past 90 requests for tool_calls there was ONE request that hallucinated a property not set in the schema. This was with gpt-4o-08-06 NOT gpt-4o-mini.

An example of the schema (edited) is something like this:

{
  "name": "edit_movie_genre",
  "description": "Used to edit a Genre Component within any kind of Movie. Please respond in JSON.",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "genre_component": {
        "type": "object",
        "description": "The component of the Genre or Movie being reviewed, which contains key characteristics.",
        "properties": {
          "characteristic": {
            "type": "string",
            "description": "The characteristic of narrative structure within the project.",
            "enum": [ "Main Genre Element 1", "Main Genre Element 2", ... ]
          }
        },
        "required": [
          "characteristic"
        ],
        "additionalProperties": false
      },
      "template": {
        "type": "string",
        "description": "The type of project.",
        "enum": [
          "Book",
          "Novel"
        ]
      }
    },
    "additionalProperties": false,
    "required": [
      "genre_component",
      "template"
    ]
  }
}

The characteristic enum property is an array of 100 items (cut for space here).

In the past 24 hours the model hallucinated a value that was not listed in the array of enums. Over the past couple of days there were a handful of tool calls that shifted parent properties to be child properties of another parent.
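Until that stops happening, one stopgap is validating each tool call locally before acting on it. A minimal dependency-free sketch - `validate_tool_args` is a hypothetical helper name, and the tiny enum below stands in for the real 100-item list:

```python
# Stopgap validation of tool-call arguments before acting on them.
# Catches the three failure modes described above: missing required keys,
# extra keys (mis-nested props), and values outside the enum.
REQUIRED = {"characteristic", "template"}
ENUMS = {
    "characteristic": {"Element 1", "Element 2"},  # stands in for the 100-item list
    "template": {"Book", "Novel"},
}

def validate_tool_args(args: dict) -> list[str]:
    """Return human-readable problems; an empty list means the call is valid."""
    problems = [f"missing required key: {k}" for k in REQUIRED - args.keys()]
    problems += [f"unexpected key: {k}" for k in args.keys() - REQUIRED]
    for key, allowed in ENUMS.items():
        if key in args and args[key] not in allowed:
            problems.append(f"{key}: value {args[key]!r} is not in the enum")
    return problems

# A hallucinated enum value like the one described above fails validation:
problems = validate_tool_args({"characteristic": "Made Up", "template": "Book"})
```

The same check can feed an error message back to the model instead of letting a bad call reach application code.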

It's not really about the schema; as I mentioned, it works 89 times out of 90. The issue is that it is advertised as 100% reliable (from the blog post):

it still did not meet the reliability that developers need to build robust applications. So we also took a deterministic, engineering-based approach to constrain the model's outputs to achieve 100% reliability.

All in all, it's working great and I'm extremely pleased, as all the error catching and sanitizing over the past year or so of function calls has been less than enjoyable. When they released Structured Outputs the first week of August, I quickly converted over and was super impressed - it was 100% with the same exact tooling above.

It's only in the last week that I've started to see errors with the model not following the schema 100%. Yes, I can put back in all the error catching, etc., but there was something quite magical about the 100% results that I would love to recapture if at all possible.

2 Likes

Same issue, have a repro for you:

https://platform.openai.com/playground/chat?models=gpt-4o-mini&preset=preset-i3LiIoW9aaFGVgAp7TGBRXqG

The function definition for record_availability says it should take 2 parameters:

  • routes: string // a stringified JSON of {"carrier": string, "available_on": string|null, "origin": string, …}
  • available_on: string // date formatted as MM/DD/YYYY

(I'm aware I'm "holding it wrong" by asking for "routes" to contain stringified JSON. I coded myself into a corner and discovered this by accident.)

What I get

record_availability({
  "routes": "Springfield/Oakland, Chicago, or nearby areas to Arizona or Mexico",
  "offered_on_date": "08/15/2024",
  "carrier": "John Smith",
  "available_on": null,
  "origin": "Springfield/Oakland, Chicago, or nearby areas",
  ...
})

What's wrong

The top-level record_availability() call has additional parameters that are supposed to go into the "routes" parameter's JSON.stringified content.

"additionalProperties": false is set, so these extra parameters are invalid.

Looking closely, "routes" and "offered_on_date" are members of the top-level params. So technically this meets the function call schema requirements if we ignore the additional members. (The "routes" string isn't the stringified JSON requested, but we expect no guarantees from structured outputs about it.)

Model

This misbehavior happens with gpt-4o-mini. At least for this sample, gpt-4o conforms to the schema correctly. (In fact, gpt-4o fills "routes" with correct stringified JSON, though again we don't expect it.)

Structured Outputs are supposed to work with gpt-4o-mini as well as gpt-4o, and not rely on model quality.
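For what it's worth, the stringified-JSON-inside-a-string pattern sits outside what strict mode can constrain - the grammar only sees an opaque string. If reworking the schema is an option, declaring routes as a typed array of objects keeps every field under the constrained grammar. A sketch of what that might look like (field names guessed from the repro above, not the actual preset):

```python
import json

# Hypothetical rework: "routes" as a typed array instead of a stringified blob,
# so strict mode can enforce each inner field. Strict mode requires every
# property to appear in "required" and additionalProperties to be false.
record_availability = {
    "name": "record_availability",
    "strict": True,
    "parameters": {
        "type": "object",
        "properties": {
            "routes": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "carrier": {"type": "string"},
                        "available_on": {"type": ["string", "null"]},
                        "origin": {"type": "string"},
                    },
                    "required": ["carrier", "available_on", "origin"],
                    "additionalProperties": False,
                },
            },
            "available_on": {"type": "string"},
        },
        "required": ["routes", "available_on"],
        "additionalProperties": False,
    },
}

print(json.dumps(record_availability, indent=2))
```

With this shape there is nothing left for the model to flatten into the top level, since the extra fields now have real slots in the schema.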

1 Like

Always good to see that I'm not the only one!

Unfortunately, it's getting worse - even with gpt-4o-08-06. 75% failure rate now, where 3 out of 4 tool calls to the same tool on separate runs dropped an important key specified in the schema.

Spent all day setting up error handling so that the tool responds with a "Hey, you forgot such-and-such key…", but would love to get that advertised 100% success rate back!

I'm experiencing the same issue. When using structured output with the Assistant API, the success rate drops significantly whenever user instructions contradict the required format. While I understand this complicates things, it's disappointing that the structured output isn't consistently enforced as promised in the documentation.

Furthermore, I've also noticed a sharp increase in error rates since last week.

I received a response from OpenAI support last night that said they reviewed my case and decided it was "an issue with my code" and that I should check here or Discord for help. :smile:

To remove any chance it could be on my end, I did what @jaredp did and ran it through the Playground. Using gpt-4o-mini it failed to respond with the correct schema not once, but twice, before finally responding with all keys.