Structured Outputs Deep-dive

Hi! I wrote an article that provides a concise deep-dive into structured outputs and their usage through OpenAI’s Chat Completions API. It is based on my own usage and various threads I’ve been involved with on these forums. The article is available here: Diving Deeper with Structured Outputs | by Armin Catovic | Sep, 2024 | Towards Data Science

Approximate outline of the article:

  • What structured outputs are and how they work (context-free-grammar + constrained decoding)
  • Introduction to JSON Schema Spec and Pydantic
  • Limitations associated with structured outputs, such as still being susceptible to hallucinations, the 4096 max output token limit, and limited JSON Schema support (not all features are available)
  • A couple of example walk-throughs, showcasing response schema flexibility using optional fields, and using enums to reduce hallucinations

Happy reading!


Can you share some of the details with us here on the forum? Summary or something maybe?

Thanks!


@PaulBellow no worries, I modified my initial post - hope that sheds some light on things.


Nice article… one tip I’d add is that it’s almost always helpful to include an explanation field on your structured object. This gives the model a place to explain why it returned the object it did. I’ve found that this consistently results in better responses from the model.
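For instance, a minimal Pydantic sketch (the fields other than explanation are just illustrative):

from pydantic import BaseModel, Field


class SentimentResult(BaseModel):
    # Free-text field the model fills in, giving it room to justify
    # the values it returns in the other fields.
    explanation: str = Field(description="Why this result was chosen")
    sentiment: str
    confidence: float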


Yes, that’s great @stevenic, and it also saves you from adding explanations in the prompt.


The AI model has no internal storage where it remembers its reasoning for producing {"is_ugly_person": "False", "explanation": "…

In a boolean case like that, the choice can even be pretty random.

Thus, if you want actual reasoning to affect and improve the answer, it has to be output first. Then the AI can inspect its own language when it comes to producing the end result.

I put the explanation at the beginning of the structure so it’s the first thing generated. From what I’ve seen over the last few days, that’s enough to get better answers back.


Sounds great! Just wanted to clarify for those playing along at home.

You might even consider what else a structured and ordered response could produce in terms of chain-of-thought, even giving the schema some optional fields. Off the top of my head…

  • "preliminary_answer"
  • "reasoning_justification"
  • "answer_truthfulness_rating"
  • "answer_needs_improvement"
  • "discuss_areas_for_improvement"
  • "revised_answer_output"
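One possible Pydantic rendering of that ordered schema (only the field names come from the list above; the types and optionality are my guesses):

from typing import Optional

from pydantic import BaseModel


class ReasonedAnswer(BaseModel):
    # Field order matters: earlier fields are generated first, so the
    # reasoning already exists in-context when the final answer is produced.
    preliminary_answer: str
    reasoning_justification: str
    answer_truthfulness_rating: int
    answer_needs_improvement: bool
    # Strict structured outputs renders Optional fields as required-but-nullable.
    discuss_areas_for_improvement: Optional[str] = None
    revised_answer_output: str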

I was originally including a field called "reasoning" but switched to "explanation" because that’s what OpenAI had in their example. I assumed they might be fine-tuning using that field, so it makes sense to follow their lead naming-wise.

I’ve seen this with Claude… I was unintentionally using the same prompt layout format they used, but my content was different, and I’d occasionally send Claude down some weird rabbit hole where it started spewing its guts and would go into output loops that were clearly triggered by my prompt format.

This is a great article, Armin. TIL

BTW, here is a simple workaround for arbitrary key-value pairs.

from typing import List

from pydantic import BaseModel


class ArbitraryObjectsList(BaseModel):
    class KeyValuePair(BaseModel):
        key: str
        value: str

    # A fixed key/value shape keeps the schema strict-mode friendly
    # while still letting the model emit arbitrary pairs.
    items: List[KeyValuePair]

    def to_dict_list(self):
        return [{kvp.key: kvp.value} for kvp in self.items]
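
For example (values are illustrative), it round-trips back to plain dicts:

parsed = ArbitraryObjectsList(
    items=[
        {"key": "color", "value": "red"},
        {"key": "size", "value": "large"},
    ]
)
print(parsed.to_dict_list())  # [{'color': 'red'}, {'size': 'large'}]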


Another tip: a sufficiently descriptive field name can stand in for a field description, especially for throw-away args:

chain_of_thoughts
a_descriptive_name_can_stand_in_for_field_descriptions_especially_for_throw_away_args


That’s awesome @nicholishen, thank you for the tip!

It’s probably not the same in Python with Pydantic, but the Node samples for structured outputs all use a library called Zod, and the code you have to write to define your schemas is just a nightmare. Especially given that the only thing you’re likely using that code for is to generate the schema.

I find that what works better is to define a TypeScript interface, and then when I go to write the JSON Schema, Copilot just writes it all for me. No errors, no typing, it’s great.

Like I said, this may only be relevant for TypeScript, but I’m mostly wondering if anyone else is using Copilot to write their schemas manually. For me it works perfectly.

@stevenic I don’t know much about Node.js and Zod, but Python+Pydantic IMHO is super easy. Also, I tend to use Pydantic throughout the entire codebase - similar to how I used structs in C - it’s basically a way for me to uniformly represent data and objects throughout the code. So in that sense, its use with structured outputs is a natural extension.

PS. I haven’t coded JS since like 2014, and all this talk is making me want to revisit JS (are jQuery and Angular even a thing these days?)


I think it’s all mainly React on the web side of things… I’ve been designing chatbot SDKs for the last 8 years, so I’m almost exclusively Node.js and TypeScript.

Here’s a structured output from a JSON schema produced by a playground preset, pasting a bit from above:

The playground preset, used for making tool schemas, just needed to be multi-shot a bit with more documentation for success.

No use for Node.js, but Python is a skill.

The modifications required within a Pydantic straitjacket are made.

Work done by GPT-4; don’t accept impostors. The drawback is GPT-4’s extensive knowledge of the full set of JSON Schema elements: the ones OpenAI’s structured outputs don’t support, unless simply ignored, just become part of the description instead.

Is it possible to use Pydantic schema definitions with the Batch API?
Since you have to provide the input as JSONL, it seems to be impossible to properly encode the Pydantic object into JSON. I’ve seen posts where the schema was a plain dict, but it’s a pain in the *** to define those.


Yes, it is possible. BTW, I’m open-sourcing a Pydantic wrapper I’m developing that extends it for use with LLMs. It’s not ready for PyPI, but you can check it out here: GitHub - nicholishen/tooldantic

! pip install -U git+https://github.com/nicholishen/tooldantic.git

Here is a snippet:

import asyncio
import json

import httpx
import openai
from bs4 import BeautifulSoup
from tooldantic import OpenAiResponseFormatBaseModel as BaseModel

# Async client, for later uploading/running the batch (unused in this snippet)
client = openai.AsyncOpenAI()

class ArticleExtractor(BaseModel):
    """Use this tool to extract information from the user's articles"""

    headline: str
    summary: str


urls = [
    "https://www.cnn.com/2019/08/29/us/new-hampshire-vanity-license-plate-trnd/index.html",
    "https://www.cnn.com/2024/08/02/tech/google-olympics-ai-ad-artificial-intelligence/index.html",
]



async def get_url_content(url: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        important_tags = ["h1", "h2", "h3", "p"]
        content = []
        for tag in important_tags:
            elements = soup.find_all(tag)
            for element in elements:
                content.append(element.get_text())
        return " ".join(content)



async def prepare_jsonl():
    tasks = [get_url_content(url) for url in urls]
    articles = await asyncio.gather(*tasks)
    jsonl = []
    for i, article in enumerate(articles, start=1):
        jsonl.append(
            {
                "custom_id": f"request-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "max_tokens": 1000,
                    "messages": [{"role": "user", "content": article}],
                    "response_format": ArticleExtractor.model_json_schema(),
                },
            }
        )
    with open("requests.jsonl", "w") as f:
        for line in jsonl:
            f.write(json.dumps(line) + "\n")

# Top-level await works in a notebook; in a plain script use
# asyncio.run(prepare_jsonl()) instead.
await prepare_jsonl()

EDIT:

tooldantic can also create dynamic pydantic models from arbitrary data sources.

import json

import tooldantic

some_existing_data = {
    "headline": "The headline of the article",
    "summary": "The summary of the article",
}

MyDataModel = tooldantic.ModelBuilder(
    base_model=tooldantic.OpenAiResponseFormatBaseModel
).model_from_dict(
    some_existing_data,
    model_name="MyDataModel",
    model_description="This is a custom data model from arbitrary data",
)

print(json.dumps(MyDataModel.model_json_schema(), indent=2))
assert MyDataModel(**some_existing_data).model_dump() == some_existing_data

# {
#   "type": "json_schema",
#   "json_schema": {
#     "name": "MyDataModel",
#     "description": "This is a custom data model from arbitrary data",
#     "strict": true,
#     "schema": {
#       "type": "object",
#       "properties": {
#         "headline": {
#           "type": "string"
#         },
#         "summary": {
#           "type": "string"
#         }
#       },
#       "required": [
#         "headline",
#         "summary"
#       ],
#       "additionalProperties": false
#     }
#   }
# }

Really appreciate the article, very helpful, particularly the bit about making the schema as flat as possible. I’ve had issues recently with gpt-4o-mini providing the required keys, but not following the schema at all (a child key moves to the parent level, etc.)

With enums I wonder: have you had issues where, even with enums set, it would still hallucinate a value (something not provided in the list)? Again, I see this more with mini, but wondered, given your experience, if you’ve seen this as well.

1 Like

Hi @jim and glad that there is something useful in there!

To tell you the truth, I haven’t used mini that much. On paper at least, enums should not allow hallucinations, since for that particular field the only possible values out of the entire vocabulary are the set of values specified in the enum; every other token should be completely masked out (probability of 0). However, nothing would surprise me! It’s difficult to know what happens “upstream” - I discovered recently that even after taking sampling out of the equation and setting the seed value, you still get huge token variations.
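For example, with a field constrained like this (a minimal sketch; the labels are illustrative), the grammar should leave no room for values outside the list:

from typing import Literal

from pydantic import BaseModel


class TicketTriage(BaseModel):
    # Constrained decoding masks every token that cannot continue
    # one of these exact strings.
    category: Literal["billing", "bug", "feature_request", "other"]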

But there was some research done recently which found that when using structured outputs, you actually get more hallucinations, especially on smaller models.

So one workaround could be to make two calls to mini: the first call is the more complex one, requiring some extraction or reasoning, but its output is just a flat string; the second call then takes the output of the first, and its sole purpose is simply to structure that output. I’ve seen people report nearly 0% hallucinations with that method.

This is, however, very dependent on your task, and on whether you can afford it from a latency PoV. But cost-wise, it’s still much cheaper than a single GPT-4o call.
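A minimal sketch of that two-call pattern, assuming the openai Python SDK’s parse helper (the prompts, model choice, and schema are illustrative):

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()
document = "Alice met Bob in Paris to discuss the merger."


class Extraction(BaseModel):
    people: list[str]
    summary: str


# Call 1: the harder extraction/reasoning step; output is a flat string.
draft = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": f"List the people mentioned and summarize:\n{document}"}
    ],
).choices[0].message.content

# Call 2: the sole purpose is to structure the already-extracted text.
structured = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Put this into the schema:\n{draft}"}],
    response_format=Extraction,
).choices[0].message.parsed

print(structured)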


Ohhhhh, that’s actually a great idea. I hadn’t even considered that, because I was so obsessed with Structured Outputs being advertised as 100% reliable - spent the morning “flattening” my schemas LOL. If that doesn’t work, the “clean-up” pass is a pretty decent one given the low latency of mini.

Thanks!