Why aren’t we talking more about attribute-specific prompting in Structured Outputs?

This week, OpenAI released Structured Outputs, a significant upgrade to JSON mode that lets us define response formats much more strictly. If you’ve wrestled with unpredictable JSON responses, you know how much time this will save us!

But here’s the real gem: Structured Outputs let you define prompts for each specific output attribute—an incredible feature that’s not getting enough attention. Let me show you why this is a game-changer.

Take a look at this slightly modified math reasoning example from the OpenAI Cookbook:

import json
from openai import OpenAI
client = OpenAI()

from pydantic import BaseModel, Field

MODEL = "gpt-4o-2024-08-06"
math_tutor_prompt = '''
    You are a helpful math tutor. You will be provided with a math problem,
    and your goal will be to output a step by step solution, along with a final answer.
    For each step, just provide the output as an equation and use the explanation field to detail the reasoning.
'''

class MathReasoning(BaseModel):
    class Step(BaseModel):
        explanation: str
        output: str

    steps: list[Step]
    final_answer: str

def get_math_solution(question: str):
    completion = client.beta.chat.completions.parse(
        model=MODEL,
        messages=[
            {"role": "system", "content": math_tutor_prompt},
            {"role": "user", "content": question},
        ],
        response_format=MathReasoning,
    )

    return completion.choices[0].message

question = "how can I solve 8x + 7 = -23"
result = get_math_solution(question).parsed

print(result.steps[0].explanation + "\n" + result.steps[0].output)

This example shows how Structured Outputs can enforce a specific step-by-step format. The output?

Start by isolating the term with the variable by subtracting 7 from both sides of the equation.
8x + 7 - 7 = -23 - 7

Cool, right? But it gets better. With the ability to add descriptions for each output attribute, you can tailor responses even more finely. For example, let’s make the explanation sound like a pirate and put the output in Polish notation:

class MathReasoning(BaseModel):
    class Step(BaseModel):
        explanation: str = Field(..., description="The step's explanation, but said like a pirate would say it")
        output: str = Field(..., description="The step's output, but in Polish notation")

    steps: list[Step]
    final_answer: str

Run the code, and you’ll get:

First, matey, we be movin’ that 7 to the other side by subtractin’ it from both sides. Aye!
- = (+ (-23) 7) (+ 7 (* 8 x))

Previously, you could add this kind of information to your system prompt, but attribute-specific prompting gives you much finer control over each part of the output. Try changing the explanation to Spanish and the output to reverse Polish notation:

class MathReasoning(BaseModel):
    class Step(BaseModel):
        explanation: str = Field(..., description="The step's explanation, but said in Spanish")
        output: str = Field(..., description="The step's output, but in reverse Polish notation")

    steps: list[Step]
    final_answer: str

And you’ll see:

Primero, restamos 7 de ambos lados de la ecuación para aislar el término con x.
8x - 23 7 -
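
For reference, the pre-Structured-Outputs version of this would fold the same instructions into the system prompt and leave the schema undecorated. A rough sketch of that older approach is below; the per-attribute descriptions above keep each instruction attached to exactly the field it applies to.

# Sketch of the older approach: per-field instructions crammed into the system prompt,
# with the response schema left undecorated.
math_tutor_prompt = '''
    You are a helpful math tutor. You will be provided with a math problem,
    and your goal will be to output a step by step solution, along with a final answer.
    Write each step's explanation in Spanish, and write each step's output
    in reverse Polish notation.
'''

class MathReasoning(BaseModel):
    class Step(BaseModel):
        explanation: str
        output: str

    steps: list[Step]
    final_answer: str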

That’s all for today! Hope you find this feature as powerful and fun as I do!

5 Likes

This is indeed very cool!

One thing I’m struggling with: I’m currently experimenting with Structured Outputs in order to iterate over chunks of a podcast episode transcript and ask it to output a JSON response finding the specific, quantifiable, and falsifiable predictions made in that section (if any). Each chunk can have anywhere from 0 to multiple predictions. (This is about Premier League soccer, so I’m looking for stuff like “Mo Salah will score at least 20 goals,” not vague/unquantifiable/unfalsifiable ones like “Crystal Palace’s defense will be better this season”).

I’ve tried a bunch of things but keep getting back a ton of “predictions” that shouldn’t qualify; in effect, it’s being overly inclusive. This happens in a variety of scenarios: mischaracterizing declarative statements as predictions, treating questions about whether something could happen as if the person predicted it would, including actual predictions that are nevertheless too vague/unfalsifiable, etc.

To try to correct for this, I’ve added more and more attributes to the Structured Outputs JSON schema, using each property’s key name and description fields to nudge it to rate how quantifiable each prediction is, how falsifiable, etc. In effect, I’m trying to combine chain-of-thought or step-by-step reasoning with Structured Outputs.
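
To make that concrete, here’s roughly the kind of schema I mean (the field names and the 1–5 scale are just illustrative placeholders, not a recommendation):

class PredictionCandidate(BaseModel):
    quote: str = Field(..., description="The speaker's statement, quoted verbatim from the transcript")
    is_prediction: bool = Field(..., description="True only if the statement asserts something about a future outcome, not a question or a comment about the past")
    quantifiability: int = Field(..., description="1-5: how quantifiable the prediction is (5 = includes a concrete number or threshold, e.g. 'at least 20 goals')")
    falsifiability: int = Field(..., description="1-5: how falsifiable the prediction is (5 = clearly verifiable as true or false by the end of the season)")

class ChunkPredictions(BaseModel):
    candidates: list[PredictionCandidate]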

This helps somewhat, but it got me thinking: how do the JSON property names and description fields in the response_format schema interact with the prompt itself when it comes to chain-of-thought? Which one is “read” first? I’m unclear on how the actual prompt and the response_format fit together, and where to put what type of context to improve my chances of getting a good response.

Perhaps stated more generally: when doing chain-of-thought alongside Structured Outputs, what is the best way to structure the output? As in, does it make sense to add attributes to each item in a response list (as I described above) to “force” the model to evaluate whether its responses actually satisfy my criteria? Or is some other method better, e.g. having it evaluate each component (e.g. quantifiability, falsifiability, etc.) of every prediction candidate first, and then only after it’s gone through all of them, have it filter down to the final list of predictions, etc.?
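
For illustration, the second option might look something like this (again just a sketch, reusing the PredictionCandidate placeholder from above): score every candidate first, and only then commit to a filtered list.

class ChunkAnalysis(BaseModel):
    # Step 1: evaluate every candidate statement, including ones that will be rejected.
    candidates: list[PredictionCandidate]
    # Step 2: only after scoring everything, repeat the quotes that pass all the criteria.
    confirmed_predictions: list[str] = Field(..., description="Only the quotes from candidates that are specific, quantifiable, and falsifiable")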

1 Like

Hey Jay, welcome to the forum and congrats on your first post! Here are some detailed thoughts on your questions, but the TLDR is:

I’m not sure how OpenAI’s model weighs the system prompt against the structured output descriptions. However, in my experience the best approach for information extraction tasks is to first get the prompts working for basic tasks and then build up to more complex ones. Generally, it’s better to specify “what” you want in the prompt and “how” in the structured output descriptions.

I’ve been working on a similar task, extracting proposed times from emails in a structured format. Initially, I assumed the model could handle the output format, time criteria, and chunking logic simultaneously. Unfortunately, the results were inconsistent.

So, I broke down the task into steps, starting with ensuring the model consistently returns information in the expected JSON format. Despite various attempts, consistency was elusive until OpenAI introduced structured outputs, which solved the issue.

Next, I focused on chunking logic, guiding the model to prioritize information from the latest email in a thread while using previous emails as background. Now, I’m working on refining time criteria for ambiguous start and end times. It’s a work in progress!
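
To give a feel for that “what”/“how” split, here’s a stripped-down sketch using pydantic as in the original post (the field names are simplified stand-ins, not my actual schema): the prompt says what to extract, and the per-field descriptions say how each value should be formatted.

# Simplified stand-in: the "what" lives in the system prompt,
# the "how" lives in the per-field descriptions.
extraction_prompt = '''
    You will be given an email thread. Extract every proposed meeting time,
    prioritizing the most recent email and using earlier emails only as background.
'''

class ProposedTime(BaseModel):
    start: str = Field(..., description="Proposed start time in ISO 8601 format, e.g. 2024-08-12T15:00")
    end: str = Field(..., description="Proposed end time in ISO 8601 format; empty string if no end time was given")
    source_quote: str = Field(..., description="The sentence from the email that proposes this time, quoted verbatim")

class ProposedTimes(BaseModel):
    times: list[ProposedTime]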

This process is iterative and somewhat manual. I test prompts on emails with incorrect results, tweak them until they work, and then move on to the next email.

In your case, I suggest starting with one podcast chunk and ensuring the prompt can extract all predictions. If the model struggles with identifying predictions, it likely won’t differentiate between specific, quantifiable, and falsifiable ones. Once it’s reliably identifying predictions, you can then refine it further.
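
Something as bare-bones as this (hypothetical names) is a perfectly good first iteration; only once it reliably finds every prediction would I start layering on the quantifiable/falsifiable criteria:

# First pass: just prove the model can find the predictions at all.
class Predictions(BaseModel):
    predictions: list[str] = Field(..., description="Every prediction made in this chunk, quoted verbatim")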

Hope this helps!

1 Like

You could do that before as well.

from typing import Annotated

# TOOLS_FUNCTIONS is a module-level dict that collects the generated tool schemas.
@tools_function(TOOLS_FUNCTIONS)
def reword_next_step(reworded_next_step: Annotated[str, "The reworded_next_step should describe the message IN TEXT as a request in second person; rephrased as a question"]):
    """
    The message describes a part of the entire sentence where the main subject is cut out.
    The task of this function is to reword that part of the sentence as if referring to
    the main subject in the abstract.

    e.g. "should carry on" would become "could you please carry on?"
    """
    return reworded_next_step

where the tools_function decorator is defined below:


def tools_function(tools_functions):
    def wrapper(func):
        # Build an OpenAI function-calling schema from the function's signature:
        # the docstring becomes the description, and each Annotated parameter's
        # metadata string becomes that parameter's description.
        function = dict()
        function['function'] = func
        function['name'] = func.__name__
        function['description'] = func.__doc__
        function['parameters'] = {}
        function['parameters']['type'] = "object"
        function['parameters']['properties'] = {}

        input_arg_names = list(func.__code__.co_varnames[:func.__code__.co_argcount])

        for input_arg_name in input_arg_names:
            function['parameters']['properties'][input_arg_name] = {}
            raw_annotation = func.__annotations__[input_arg_name]

            # FUNCTIONS_TYPE_MAP maps Python type names to JSON Schema type names.
            if raw_annotation.__origin__.__name__ in FUNCTIONS_TYPE_MAP:
                ip_type = FUNCTIONS_TYPE_MAP[raw_annotation.__origin__.__name__]
                if ip_type == 'array':
                    function['parameters']['properties'][input_arg_name]['items'] = {'type': 'string'}
            else:
                ip_type = raw_annotation.__origin__.__name__

            function['parameters']['properties'][input_arg_name]['type'] = ip_type
            function['parameters']['properties'][input_arg_name]['description'] = raw_annotation.__metadata__[0]

        # Register the schema under the function's name so it can be passed to the API later.
        tools_functions[func.__name__] = function

        return func
    return wrapper
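
The registry can then be wired into a chat completions call; a hypothetical sketch (wrapping the generated schemas in the current tools format and dropping the stored callable):

from openai import OpenAI

client = OpenAI()

# Strip the stored callable and wrap each generated schema in the "tools" format
# expected by the chat completions endpoint.
tools = [
    {"type": "function", "function": {k: v for k, v in schema.items() if k != "function"}}
    for schema in TOOLS_FUNCTIONS.values()
]

completion = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "should carry on"}],
    tools=tools,
)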

1 Like

Well look at that! You could do this previously for function calls. Thank you for the detailed description.

Was there something similar for the chat and completion endpoints? If so, do you think that method is better than the new Structured Outputs?

I don’t know how long your transcripts typically are, or whether they include speaker markers, but have you considered extracting those predictions from the full transcript?

1 Like

I did originally but it seemed to perform even worse than when I do it in chunks. (I’m assuming it’s some version of the “lost in the middle” problem that seems to crop up a lot with very long prompts.)

I’d actually prefer to send the entire transcript, rather than specific chunks, for a few reasons:

A) It’s simpler than looping over a bunch of chunks, and it also prevents issues where a prediction happens to occur over multiple speaker turns that I’ve arbitrarily split into separate chunks.

B) These are auto-transcriptions, so the speakers are demarcated by “Speaker 1”, “Speaker 2”, etc. But if early on in the episode, “Speaker 1” says “Hey, this is Jesse” then ChatGPT should be able to infer that Speaker 1 == Jesse for the entire episode. This isn’t possible (at least not as easily) with chunked portions.

Per @sdelgado’s reply, I think I’ll keep working on the chunking approach for now, getting very specific and step-by-step until I can get consistently good results on a very basic task, before exploring the full-transcript approach again.

Thanks for the detailed and thoughtful reply, @sdelgado. This is great advice and I think you’re right: I’m running into a generalized version of the exact same type of problem you described with extracting times from emails.

I think on some level my mental model is still to cram as much context as possible into a single prompt, because I started playing with these OpenAI APIs back in the GPT-3.5 days, when each prompt/response was significantly more expensive than it is now. With gpt-4o-mini, iterating over multiple steps (even if each step takes its own sequential prompt) is so cheap that it’s worth a shot. Will see how it goes!

1 Like

How about a ‘two shot’ version? I do a lot of work with transcripts, and I often start with a specific, tuned summarizer. In your case I’d imagine something along the lines of: ‘You are a podcast data analyst; your job is to help determine people’s predictive abilities by looking at their past predictions. Summarize this transcript, focusing only on statements that are predictions. Some examples are …’

Your current prompt could then work on that summarized transcript. For that prompt I would probably use a rating system: rate each prediction on elements such as quantifiability (can this prediction be easily verified after a certain amount of time has passed? Again, possibly using examples), plus the other elements of the rating.

‘For each prediction, output JSON with the prediction, your argument for the score of this prediction, and the score.’ (You’d need to detail this more, or use Structured Outputs to specify it.)
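
Roughly, the two calls could look like this (a sketch only: prompts abbreviated, schema names made up, and the client and MODEL reused from the first post):

class RatedPrediction(BaseModel):
    prediction: str
    argument: str = Field(..., description="Your argument for the score of this prediction")
    score: int = Field(..., description="1-5: how quantifiable and falsifiable the prediction is")

class RatedPredictions(BaseModel):
    predictions: list[RatedPrediction]

def two_shot(transcript: str) -> RatedPredictions:
    # Shot 1: a tuned summarizer that keeps only the predictive statements.
    summary = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "You are a podcast data analyst. Summarize this transcript, keeping only statements that are predictions."},
            {"role": "user", "content": transcript},
        ],
    ).choices[0].message.content

    # Shot 2: rate each prediction and return it as structured output.
    rated = client.beta.chat.completions.parse(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Rate each prediction for how quantifiable and falsifiable it is."},
            {"role": "user", "content": summary},
        ],
        response_format=RatedPredictions,
    )
    return rated.choices[0].message.parsed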

If you want to provide an example transcript (DM me if you want), I’d be happy to play with it 🙂

1 Like

Thanks @jlvanhulst ! I’ll give this a shot and report back.

I’ve been doing this for over a year, even before the original “add JSON to your system prompt” approach. Early on, I was able to get reasonable suggestions for the various fields of complex objects by carefully explaining what was being sent and what was expected.

In the form that sends the prompt, you can choose which fields of the object to ask for suggestions on, and give specific additional instructions for each.

I had the response JSON not only contain the properly shaped object, but also a mirror object where it explains why it made each suggestion.
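
Conceptually, the response shape is something like this (the names here are illustrative, not my actual schema, using pydantic as in the earlier examples):

# Illustrative only: the suggested object plus a mirror object holding one
# justification per suggested field.
class CharacterSuggestion(BaseModel):
    name: str
    backstory: str

class CharacterJustifications(BaseModel):
    name: str = Field(..., description="Why this name was suggested")
    backstory: str = Field(..., description="Why this backstory was suggested")

class SuggestionResponse(BaseModel):
    suggestion: CharacterSuggestion
    justifications: CharacterJustifications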

Then when displaying the response, for every field containing a suggestion returned from the AI, I show two buttons. One lets you hover to see the justification for that suggestion, and the other lets you dismiss the suggestion.

This way you can iterate on the object, getting suggestions, accepting some, dismissing others, until you’re happy with it.

Here is an example:

[Screenshots: Character Ideation Params, Character Form, Character Ideation Prompt]

1 Like