Streaming using Structured Outputs

Is it possible to stream using structured outputs?

Say I define, using Pydantic or an equivalent, a structure that looks like this:

Person = {
    "name": <string>,
    "age": <number>,
    "profession": <string>
}
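
For concreteness, the Pydantic equivalent would be roughly:

from pydantic import BaseModel

# Rough Pydantic equivalent of the structure sketched above
class Person(BaseModel):
    name: str
    age: int
    profession: str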

And I give GPT a prompt like “The user input will be a story. Read the story and identify all of the characters in the story.”

And I want GPT to return an array of Person objects.

Rather than return them all at once as an array, is it possible to have GPT send me each person as it identifies them, one by one? This would be very helpful as it would allow me to reduce the time to get the first piece of information back.

It is possible, but you need code to parse the chunks. I have not used the helper libraries, so I don't know whether they can give you values before the result is complete, but basically that is what you need to do on your end.
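
To give a sense of what that chunk parsing could look like, here is a rough, untested sketch that accumulates the streamed text and yields each complete object nested inside the top-level array as soon as its closing brace arrives (the function name and the wrapper shape are just assumptions):

import json

def iter_complete_objects(text_deltas):
    """Yield each complete JSON object nested one level deep, e.g. the items
    of {"people": [ {...}, {...} ]}, as soon as its closing brace arrives."""
    buffer = ""
    depth = 0
    start = None
    in_string = False
    escaped = False
    for delta in text_deltas:
        for ch in delta:
            buffer += ch
            if escaped:
                escaped = False
                continue
            if in_string:
                if ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                if depth == 1:  # depth 0 is the outer wrapper object
                    start = len(buffer) - 1
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 1 and start is not None:
                    yield json.loads(buffer[start:])
                    start = None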


Thanks @supershaneski

Please @expertise.ai.chat, check out @katiagg's tutorial at Introduction to Structured Outputs | OpenAI Cookbook

strict: true rocks


Hey @allyssonallan , I’ve just gone through the tutorial you linked. There were some good examples in there, but nothing about streaming. Did I miss something in there?


Hi @expertise.ai.chat , I managed to find a workaround by creating a wrapper for the Pydantic base class and processing the JSON schema the same way the streaming beta API does.

from pydantic.json_schema import (
    DEFAULT_REF_TEMPLATE,
    GenerateJsonSchema,
    JsonSchemaMode,
    model_json_schema
)
from typing import Any
from pydantic import BaseModel, Field
from openai.lib._pydantic import _ensure_strict_json_schema


class BaseModelOpenAI(BaseModel):
    @classmethod
    def model_json_schema(
        cls,
        by_alias: bool = True,
        ref_template: str = DEFAULT_REF_TEMPLATE,
        schema_generator: type[GenerateJsonSchema] = GenerateJsonSchema,
        mode: JsonSchemaMode = 'serialization'
    ) -> dict[str, Any]:
        json_schema = model_json_schema(
            cls,
            by_alias=by_alias,
            ref_template=ref_template,
            schema_generator=schema_generator,
            mode=mode
        )
        # Apply the same strict-schema transformation the beta parse/stream helpers use
        return _ensure_strict_json_schema(json_schema, path=(), root=json_schema)

Your classes should inherit from BaseModelOpenAI and then you need to pass the response format as follows:

{
    "type": "json_schema",
    "json_schema": {
        "name": response_class.__name__,
        "schema": response_class.model_json_schema(),
        "strict": True
    }
}

Then you can use the standard client.chat.completions.create to send your request and get a streaming response.
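
Putting it together, a minimal sketch of the full call might look like this (untested; the Person/People models and the story prompt are placeholders, and the classes inherit from the BaseModelOpenAI wrapper above):

from openai import OpenAI

class Person(BaseModelOpenAI):
    name: str
    age: int
    profession: str

class People(BaseModelOpenAI):
    people: list[Person]

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Read the story and list every character as a Person."},
        {"role": "user", "content": "Once upon a time..."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": People.__name__,
            "schema": People.model_json_schema(),
            "strict": True,
        },
    },
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(delta, end="")  # raw JSON fragments; parse them incrementally on your side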


Wow @andreasantoro.pvt! This is really cool. It is streaming the json response.

Do you happen to know, though, whether there is a way to get it to stream one key and value at a time rather than word by word?

So for example, if I should get back

{
    key1: val1,
    key2: val2,
    key3: val3
}

I’ll get back

{key1: val1}
{key2: val2}
{key3: val3}

or something similar instead of

{
key1
:
val1
,
key2
:
val2
,
key3
:
val3
}

Thanks


@expertise.ai.chat
I don’t think so, since the generation happens token by token. If it’s just a matter of displaying the result to your users, you could accumulate the chunk contents until the next “:” (or “}”) has been reached.
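
That buffering could be as simple as something like this (a rough sketch, assuming you already have an iterator of raw text deltas):

def emit_on_delimiters(text_deltas, delimiters=(":", ",", "}")):
    """Buffer streamed text and flush it whenever a delimiter arrives,
    so the display updates roughly once per key or value instead of per token."""
    buffer = ""
    for delta in text_deltas:
        buffer += delta
        if any(d in delta for d in delimiters):
            yield buffer
            buffer = ""
    if buffer:
        yield buffer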


Maybe I’m missing something, but I had no issues streaming output from the structured outputs API in a simple way. The client.beta.chat.completions object has a .stream method seemingly tailor-made for this.

e.g. this function I created works perfectly fine, and extracts the cumulative streamed response as well as the final token usage:

from openai import OpenAI

# Generator
def openai_structured_outputs_stream(**kwargs):
    client = OpenAI()

    with client.beta.chat.completions.stream(**kwargs, stream_options={"include_usage": True}) as stream:
        for chunk in stream:
            if chunk.type == 'chunk':
                latest_snapshot = chunk.to_dict()['snapshot']
                # The first chunk doesn't have the 'parsed' key, so using .get to prevent raising an exception
                latest_parsed = latest_snapshot['choices'][0]['message'].get('parsed', {})
                # Note that usage is not available until the final chunk
                latest_usage  = latest_snapshot.get('usage', {})
                latest_json   = latest_snapshot['choices'][0]['message']['content']

                yield latest_parsed, latest_usage, latest_json

Usage:
So you can stream the output e.g. as a pandas dataframe as below (though it looks ugly, since this example refreshes the entire dataframe every chunk - purely done for illustrative purposes):

import pandas as pd
from IPython.display import display, clear_output

for parsed_completion, completion_usage, completion_json in openai_structured_outputs_stream(
    model=model_name,
    temperature=temperature,
    messages=messages,
    response_format=YourPydanticModel
):
    clear_output()
    display(pd.DataFrame(parsed_completion))

Notes:
There are three chunk types: chunk.type == 'chunk', chunk.type == 'content.delta', and chunk.type == 'content.done', hence the need for the if statement to use only one of them (they share a lot of data). I believe the content.delta type contains the changes between consecutive chunks.
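
If you only want the incremental text rather than the whole snapshot each time, the content.delta events should work too. A minimal variant of the generator above (untested, assuming the event exposes a .delta attribute):

def openai_structured_outputs_delta_stream(**kwargs):
    client = OpenAI()
    with client.beta.chat.completions.stream(**kwargs) as stream:
        for event in stream:
            if event.type == "content.delta":
                # .delta holds only the new JSON text since the previous event
                yield event.delta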


@sebastian.chejniak
Last time I checked, the parse API did not accept the stream parameter, so we weren't able to use the stream function you provided.

Anyway, I'm glad they added it. Thanks for pointing it out.


Sorry, but would you mind spelling it out for me?

Say I have the following code.

stream = self.client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful math tutor. Guide the user through the solution step by step.",
        },
        {"role": "user", "content": "how can I solve 8x + 7 = -23"},
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

This code streams the response to the provided math question, but it does so in plain text. Say I still want to have that response be streamed, but now in JSON that conforms to MathReasoning defined below. How would I modify this code to achieve that?

from pydantic import BaseModel


class Step(BaseModel):
    explanation: str
    output: str


class MathReasoning(BaseModel):
    steps: list[Step]
    final_answer: str

Thanks

@expertise.ai.chat you would just need to use the model_validate_json method of the Pydantic class on the accumulated JSON text. In your case you would need to do

MathReasoning.model_validate_json(plain_text_response)

And your result will be parsed as a MathReasoning object
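
Putting that together with your streaming loop, and assuming the request is made with a JSON-schema response_format (e.g. one of the options below), something like this sketch should work:

buffer = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        buffer += delta
        print(delta, end="")  # stream the raw JSON as it arrives

# Once the stream is finished, the accumulated text is complete JSON
result = MathReasoning.model_validate_json(buffer)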

Option 1:

# pip install git+https://github.com/nicholishen/tooldantic.git

import tooldantic as td

class Step(td.OpenAiResponseFormatBaseModel):
    explanation: str
    output: str

class MathReasoning(td.OpenAiResponseFormatBaseModel):
    steps: list[Step]
    final_answer: str

Option 2:

import pydantic


class OpenAiResponseFormatGenerator(pydantic.json_schema.GenerateJsonSchema):
    # https://docs.pydantic.dev/latest/concepts/json_schema/#customizing-the-json-schema-generation-process
    def generate(self, schema, mode="validation"):
        json_schema = super().generate(schema, mode=mode)
        json_schema = {
            "type": "json_schema",
            "json_schema": {
                "name": json_schema.pop("title"),
                "schema": json_schema,
            }
        }
        return json_schema


class StrictBaseModel(pydantic.BaseModel):
    model_config = {"extra": "forbid"}

    @classmethod
    def model_json_schema(cls, **kwargs):
        return super().model_json_schema(
            schema_generator=OpenAiResponseFormatGenerator, **kwargs
        )


class Step(StrictBaseModel):
    explanation: str
    output: str


class MathReasoning(StrictBaseModel):
    steps: list[Step]
    final_answer: str

Calling the LLM:

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful math tutor. Guide the user through the solution step by step.",
        },
        {"role": "user", "content": "how can I solve 8x + 7 = -23"},
    ],
    stream=True,
    response_format=MathReasoning.model_json_schema(),
)

The below code works, using my function openai_structured_outputs_stream as defined above. Like I said, it's not pretty due to the constant refreshing, but it works. You could write code that prints consecutive chunks of text instead of reprinting the entire object, but this is unnecessary for my use case, due to React's reconciliation behaviour (i.e. React only re-renders the changes in the text, not the entire text every time).

from pydantic import BaseModel
from pydantic.fields import Field
from IPython.display import display, clear_output
import pandas as pd

class Step(BaseModel):
    explanation: str
    output: str


class MathReasoning(BaseModel):
    steps: list[Step]
    final_answer: str

messages=[
    {
        "role": "system",
        "content": "You are a helpful math tutor. Guide the user through the solution step by step.",
    },
    {"role": "user", "content": "how can I solve 8x + 7 = -23"},
]

for parsed_completion, *_ in openai_structured_outputs_stream(
    model='gpt-4o-mini',
    temperature=0,
    messages=messages,
    response_format=MathReasoning
):
    clear_output()
    display(parsed_completion)

Does this second approach then need the mandatory extra keys added, i.e. additionalProperties: false and every property listed in the required array?

Yes. That applies to the new structured output mode. Tool calling is a bit more flexible, and does not require those additional properties, but extra validation is required on your end.
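
For reference, a hand-written strict-mode response format for the earlier Person example would look roughly like this (illustrative only; the schema generators above produce this shape for you):

person_response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "Person",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "number"},
                "profession": {"type": "string"},
            },
            # Strict mode requires every property to be listed here
            "required": ["name", "age", "profession"],
            # ...and additionalProperties must be explicitly set to false
            "additionalProperties": False,
        },
    },
}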

Here’s a post with a printer class (now obsolete to me) that prints words or chunks from a generator to the console, with word wrapping.