Structured Outputs Deep-dive

platypus · September 4, 2024, 9:32pm

Great @jim ! Let me know how the results look like with this two-phase approach!

peter.edmonds · September 5, 2024, 7:38pm

Thanks for the write-up @platypus. Can you provide a source for the 4096 output token limit? According to the release blog post, structured outputs with response formats is only available on gpt-4o-2024-08-06 and gpt-4o-mini, both of which support 16,384 tokens.

platypus · September 5, 2024, 7:52pm

Great catch @peter.edmonds ! I will make a correction! My intention was to make sure users understand that there is no “safety” mechanism - if you went over max output token limit count (whatever that is, in this case 16384), you will get incomplete JSON back. Thanks again!

platypus · September 5, 2024, 7:56pm

Done @peter.edmonds ! I made an attribution to you there, thanks again!

peter.edmonds · September 5, 2024, 8:11pm

Thanks for the update @platypus. Regarding the lack of safety mechanism, I’m not sure if that is the case. I’ve always been returned an error rather than malformed JSON when my response exceeds the context limit. This could be coming from the Zod + NodeJS SDK I’ve been using though, curious if you have seen this in the wild.

platypus · September 5, 2024, 8:23pm

@peter.edmonds in my case I have received incomplete JSON responses - I used Python + Batch API.

platypus · September 5, 2024, 8:27pm

@peter.edmonds when I say “safety mechanism”, I mean from the API side. Yes if I load those into a dict in Python I will get an error (because the JSON is incomplete). Similarly if I use Pydantic. But this is from the SDK, not the API.

kavitatipnis · September 5, 2024, 10:19pm

Please feel free to redirect to the appropriate forum : Has anyone experienced regression in the quality of the answers because of Structured Outputs ? Answers that were once elaborate and detailed are now short and succinct which may not be desired in every use case?

jim · September 5, 2024, 10:39pm

Yes, absolutely. Quality of SO has declined in the last week including weaker, shorter responses and hallucinated ENUM’d values. I see this across all models, with an uptick on gpt-4o-mini.

kavitatipnis · September 5, 2024, 10:57pm

wow, I am glad I am not hallucinating. We are on gpt-4o, didn’t want to switch to mini prematurely. Structured Outputs is a great feature , however, I wish there is a feature to retain the original chatCompletions responses and only use Structured Outputs sparingly for parts that need to fit into a schema for post processing in function/tools calling.

oldtimehacker · September 7, 2024, 3:41pm

(Thank you so much @platypus for this great and timely article!)

I have a question I hope I can ask here that might be relevant to this thread:

My task involves summarization, and I need the output to conform to an existing PDF file. I used playground to read the PDF and output the schema… but I want to implement a data-driven way to provide the output schema to the model at run-time so I can change it on the fly.

I thought first to specify the schema as JSON for file storage, then load it into pydantic. But would it be smarter to ask gpt to provide python/pydantic, then use pydantic to serialize to JSON for file storage? I thought that might eliminate any schema-driven de-serialization problems.

In short, are there any best practices for getting a model to output a structured output schema that you plan to load at runtime from a file?

platypus · September 7, 2024, 4:50pm

Glad you liked the article @oldtimehacker and welcome to the community!

So if I understood correctly, you want to have a dynamic representation of a document, and your question is regarding the best practices for schema storage?

So regarding schema storage, it may be best actually to define a JSON schema as a text file, i.e. document_summary_schema.json since you have it nice and neat in one place and you can version control it and update it accordingly. Then in your code you would just have schema validation code (e.g. you can load it using Pydantic and perform validation steps).

Regarding dynamic schema generation - it leaves you a bit exposed, because you never know what you may get and you will end up doing lots of validation code. Maybe that’s fine. But another approach is to have optional fields. So presumably in a given document summary schema you have some base information: title, authors, date_published, keywords, executive_summary, etc. Then you basically create an optional field, which would be an array of objects. And each object would have something along the lines of section_heading and section_summary. So maybe you have some PDF that has Appendix A Blah blah blah and that end up being populated as one of these optional fields.

Hope that makes sense

oldtimehacker · September 8, 2024, 8:07pm

You got it. And I should mention that I’m dynamically generating the schema, but mostly once for each form as a time savings / automation device - not that the input form content will be always be unkown/different at runtime. So while I may be a little exposed to what I’ll call ‘LLM quality control’ on that step, it happens once and I can fix it, then finalize the data representation for the file.

What you are saying makes sense and it was my first thought, so this gives me some confidence! One little detail is that when I look at JSON output from pydantic it has a certain structure/rules that I hope GPT can ‘fulfill’ when it defines the JSON schema, so that it loads and passes validation nicely in pydantic.

This was one reason I considered having GPT output the schema using python/pydantic - so that when I used pydantic to generate the JSON schema it will always be ‘pydantic friendly’.

Appreciate your help thanks!

owenffs · September 10, 2024, 4:01pm

Hi, your article is rly insightful. especially on the part of reducing hallucinations.

One question - I have observed that the structured output functionality does not natively support outputting multiple records in one query. If I define one pydantic model and query once, it will output single pydantic model even when the input message might suggest few records.

Wondering if this is expected behavior and is there any workaround?

nicholishen · September 10, 2024, 5:24pm

I created an extension of pydantic (tooldantic) that can handle this use case. The tooldantic.ModelBuilder and it can dynamically create wrapped pydantic models (at runtime) from:

Raw data (python dict[str, Any])
JSON schema
function signatures

You could ask GPT for the schema, but it could also return something the response_format or tools doesn’t like. Alternatively, you could ask it to give you an example object which you then pass to ModelBuilder to get a pydantic model and appropriate schema.

Example:

# %%
import json

import openai
import tooldantic as td

client = openai.OpenAI()

model_builder = td.ModelBuilder(base_model=td.OpenAiResponseFormatBaseModel)

mock_doc = """
On July 5th, 2023, John Doe, a 35-year-old software engineer from San Francisco, California, joined ABC Tech as a Senior Developer. \
He will be responsible for overseeing the development of the company's flagship product, XYZ App. In his previous role at DEF Corp, \
he led a team of 10 engineers and successfully launched two major projects. John holds a Bachelor’s degree in Computer Science from \
Stanford University, obtained in 2010, and has over 12 years of industry experience. He is excited to bring his expertise in \
full-stack development and project management to ABC Tech.
"""

data_obj_recommended_by_gpt = {
    "date": "2023-07-05",
    "person": {
        "name": "John Doe",
        "age": 35,
        "occupation": "Software Engineer",
        "location": {"city": "San Francisco", "state": "California"},
        "education": {
            "degree": "Bachelor’s",
            "field": "Computer Science",
            "university": "Stanford University",
            "graduation_year": 2010,
        },
        "experience_years": 12,
        "previous_employment": {
            "company": "DEF Corp",
            "role": "Team Lead",
            "team_size": 10,
            "projects_launched": 2,
        },
    },
    "new_employment": {
        "company": "ABC Tech",
        "role": "Senior Developer",
        "responsibilities": "Overseeing development of XYZ App",
    },
    "skills": ["Full-Stack Development", "Project Management"],
}

DynamicDocModel = model_builder.model_from_dict(
    data_obj_recommended_by_gpt,
    model_name="extract_data_from_text",
    model_description="Use this tool to extract data from text",
)

with open("persistent_model.json", "w") as f:
    f.write(json.dumps(DynamicDocModel.model_json_schema(), indent=2))


# {'type': 'json_schema',
#  'json_schema': {'name': 'extract_data_from_text',
#   'description': 'Use this tool to extract data from text',
#   'strict': True,
#   'schema': {'type': 'object',
#    'properties': {'date': {'type': 'string'},
#     'person': {'type': 'object',
# ...

# %%
with open("persistent_model.json", "r") as f:
    schema = json.load(f)
    
DynamicDocModel = model_builder.model_from_json_schema(
    schema=schema,
)

r = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[{"role": "user", "content": mock_doc}],
    response_format=DynamicDocModel.model_json_schema(),
)

validated_llm_output = DynamicDocModel.model_validate_json(r.choices[0].message.content)



# %%
print(json.dumps(validated_llm_output.model_dump(), indent=2))
# {
#   "date": "July 5th, 2023",
#   "person": {
#     "name": "John Doe",
#     "age": 35,
#     "occupation": "software engineer",
#     "location": {
#       "city": "San Francisco",
#       "state": "California"
#     },

nicholishen · September 10, 2024, 5:42pm

You’ll need to nest your model inside a list of another model like so:

class Person(pydantic.BaseModel):
    name: str
    age: int

class People(pydantic.BaseModel):
    people: list[Person]

_j · September 10, 2024, 8:01pm

It seems this article’s reference to CFG, context-free grammar, and the flowchart illustration that starts it, is highly specific yet highly speculative. OpenAI doesn’t talk about how their JSON mode works.

In fact, the JSON response can be seen to be backwards-looking, or token-run based. Generate and extend your json-mode response one token at a time, by increasing max_token values, and you’ll see previous logits replaced with new ones. This may be switch transformer or mixture-of-experts architecture, highly suspected, at work.

If it was clever, the algorithm wouldn’t consider 500 tab characters or newlines to be a valid production of this mode, which is why first JSON had to be mentioned or the request would be rejected, and then now an output described to the AI by schema, which was a good practice before this.

Also, Pydantic is a mere client implementation possibility - it doesn’t have anything to do with what goes over the wire.

nicholishen · September 10, 2024, 9:54pm

In OpenAI’s article titled “Introducing Structured Outputs in the API” there’s an entire section on how they’re using “Constrained decoding”.

Our approach is based on a technique known as constrained sampling or constrained decoding… To do this, we convert the supplied JSON Schema into a context-free grammar (CFG)…

A lot of the verbiage used in the OpenAI release give vibes from this paper: Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation (arxiv.org)

platypus · September 11, 2024, 8:54am

@_j @nicholishen correctly pointed out their statements on CFG - so as I am using that as the ground truth, that aspect of the article is not speculative. I’ve also been tracking packages such as Outlines, and they also use CFG.

Regarding Pydantic - it does have an impact what is sent “down the wire” since in OpenAI’s code, they do some validation and house-keeping on Pydantic models behind the scenes (like setting additionalProperties to False), which impacts what is sent to the API. When you send raw JSON, not formulating it correctly or forgetting a field will result with 400 error from the API. I made all of this explicit in the article, and also pointed out to OpenAI statements that they recommend the use of Pydantic (or similar JSON-schema-spec tracking libraries in other languages).

kavitatipnis · September 16, 2024, 5:40pm

@platypus @jim - One more option to address the hallucinations with structured outputs. If you leave the “description” field blank and rely on the completeness of the system prompt, you get responses that reflect reasoning
Depending on your use case, if this works you save some costs and improve latency by incurring only one API call.

For example :

`class ResponseAnswer(BaseModel):
“”“Answer the user question based only on the given sources”“”

answer: str = Field(
    ...,
    description="",
)
products: List[str] = Field(
    ...,
    description="List of products specified in the answer",
)`

Example of the system prompt : ( note we use Langchain, so this feeds into a Prompt Template)

("system", "Provide a detailed and clear answer to the user question. Ensure all important details about the products are explained thoroughly. "
     "Use the following pieces of context to answer the user's question. "
     "If you don't know the answer, just say that you don't know, don't try to make up an answer. "
     "\\n\\n{context}"),
    ("user", "{input}")

Topic		Replies	Views
Introducing Structured Outputs Announcements api	57	7415	November 8, 2024
Structured Outputs not reliable with GPT-4o-mini and GPT-4o API structured-output	31	3559	November 8, 2024
How to prevent ChatGPT from answering questions that are outside the scope of the provided context in the SYSTEM role message? API	53	166613	December 2, 2023
Feature Request: Deterministic Answer Option for Unit Testing API	21	3795	April 27, 2024
Structured Outputs & Functions - Schema-Writer Playground AI Preset to make them API playground	13	444	November 1, 2024

Structured Outputs Deep-dive

Related topics