I can get better analytics, but it's essentially somewhere in the 97-98% range. This is only in the last week; in my first two weeks with Structured Outputs with JSON, the results were 100% for both gpt-4o and gpt-4o-mini.
For instance, over the past 90 requests for tool_calls there was ONE request that hallucinated a property not set in the schema. This was with gpt-4o-08-06, NOT gpt-4o-mini.
An example of the schema (edited) is something like this:
{
  "name": "edit_movie_genre",
  "description": "Used to edit a Genre Component within any kind of Movie. Please respond in JSON.",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "genre_component": {
        "type": "object",
        "description": "The component of the Genre or Movie being reviewed, which contains key characteristics.",
        "properties": {
          "characteristic": {
            "type": "string",
            "description": "The characteristic of narrative structure within the project.",
            "enum": [ "Main Genre Element 1", "Main Genre Element 2", ... ]
          }
        },
        "required": [
          "characteristic"
        ],
        "additionalProperties": false
      },
      "template": {
        "type": "string",
        "description": "The type of project.",
        "enum": [
          "Book",
          "Novel"
        ]
      }
    },
    "additionalProperties": false,
    "required": [
      "genre_component",
      "template"
    ]
  }
}
The characteristic enum property is an array of 100 items (cut for space here).
In the past 24 hours the model hallucinated a value that was not listed in the enum array. Over the past couple of days there were also a handful of tool calls that moved a parent property so that it became a child property of another parent.
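For anyone who wants to log exactly these failure modes, here is a minimal sketch of a checker, assuming the arguments arrive as a JSON string and the schema only uses object, string, enum, required, and additionalProperties as in the edited example above. The function name find_schema_violations and the sample payloads are hypothetical, and the enum is cut down to the two placeholder values shown.

```python
import json

# Cut-down copy of the "parameters" part of the schema above (descriptions omitted,
# enum shortened to the two placeholder values).
PARAMETERS = {
    "type": "object",
    "properties": {
        "genre_component": {
            "type": "object",
            "properties": {
                "characteristic": {
                    "type": "string",
                    "enum": ["Main Genre Element 1", "Main Genre Element 2"],
                }
            },
            "required": ["characteristic"],
            "additionalProperties": False,
        },
        "template": {"type": "string", "enum": ["Book", "Novel"]},
    },
    "required": ["genre_component", "template"],
    "additionalProperties": False,
}


def find_schema_violations(value, schema, path="$"):
    """Return a list of violation messages; an empty list means the value conforms.

    Hypothetical helper: it covers only the keywords used in PARAMETERS,
    not the full JSON Schema specification.
    """
    problems = []
    expected = schema.get("type")
    if expected == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object, got {type(value).__name__}"]
        props = schema.get("properties", {})
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}: missing required property")
        for key, child in value.items():
            if key in props:
                problems.extend(find_schema_violations(child, props[key], f"{path}.{key}"))
            elif schema.get("additionalProperties") is False:
                problems.append(f"{path}.{key}: property not in schema")
    elif expected == "string" and not isinstance(value, str):
        return [f"{path}: expected string, got {type(value).__name__}"]
    if "enum" in schema and value not in schema["enum"]:
        problems.append(f"{path}: {value!r} not in enum")
    return problems


# Made-up payloads imitating the two failures described above.
bad_enum = '{"genre_component": {"characteristic": "Made Up Element"}, "template": "Book"}'
shifted = '{"genre_component": {"characteristic": "Main Genre Element 2", "template": "Novel"}}'

for raw in (bad_enum, shifted):
    print(find_schema_violations(json.loads(raw), PARAMETERS))
```

The first payload is flagged for the enum value, and the second for both the missing top-level template and the unexpected template nested under genre_component.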
It's not really about the schema; as I mentioned, it works 89 times out of 90. The issue is that it is advertised as 100% reliability (from the blog post):
it still did not meet the reliability that developers need to build robust applications. So we also took a deterministic, engineering-based approach to constrain the model's outputs to achieve 100% reliability.
All in all, it's working great and I'm extremely pleased as all the error catching and sanitizing over the past year or so of function calls has been less than enjoyable. When they released Structured Outputs the first week of August, I quickly converted over and was super impressed - it was 100% with the same exact tooling above.
It's only in the last week that I've started to see errors with the model not following the schema 100%. Yes, I can put back in all the error catching, etc. but there was something quite magical about the 100% results that I would love to recapture if at all possible.
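In case it helps anyone else who ends up putting the error catching back, this is roughly the shape it can take: a small retry wrapper, sketched under the assumption that the actual API request is hidden behind a callable that returns the tool call's arguments as a JSON string. The names call_until_valid, fake_request, and check_template are all hypothetical, and the canned responses stand in for real model output.

```python
import json


def call_until_valid(request_fn, validate_fn, max_attempts=3):
    """Call request_fn until validate_fn reports no problems, or give up.

    request_fn: hypothetical callable returning the tool-call arguments as a JSON string.
    validate_fn: callable taking the parsed arguments and returning a list of problems.
    Returns (parsed_arguments, attempts_used).
    """
    last_problems = []
    for attempt in range(1, max_attempts + 1):
        raw = request_fn()
        try:
            args = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_problems = [f"invalid JSON: {exc}"]
            continue
        last_problems = validate_fn(args)
        if not last_problems:
            return args, attempt
    raise ValueError(f"no valid tool call after {max_attempts} attempts: {last_problems}")


# Stand-in for the real request: the first canned reply violates the enum, the second is fine.
canned = iter(['{"template": "Film"}', '{"template": "Book"}'])


def fake_request():
    return next(canned)


def check_template(args):
    # Simplified check covering only the "template" enum from the schema above.
    return [] if args.get("template") in ("Book", "Novel") else ["template not in enum"]


args, attempts = call_until_valid(fake_request, check_template)
print(args, attempts)
```

Returning the attempt count alongside the arguments makes it easy to keep tracking how often the schema is actually violated.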