Can you specify partial responses in the Structured Outputs API?

I would like to specify partial responses when using the Structured Outputs API.

To illustrate what I mean, I created a mock example of my use case. I suspect that it might not be possible to do exactly what I am looking for, but I hope there is a way to improve my solution.

Basically, let’s say you have a prompt which involves answering a series of Yes/No questions on a document. You might code something like this:

from typing import List, Literal

from pydantic import BaseModel
from openai import OpenAI

TEMPERATURE = 0
MODEL = 'gpt-4o-mini-2024-07-18'

class DocumentQuestion(BaseModel):
    """
    A question about a given company's job description, accompanied by the corresponding answer.
    """
    question_label: int
    question_text: str
    answer: Literal['Yes', 'No', 'N/A']

class Analysis(BaseModel):
    questions: List[DocumentQuestion]


fake_job_description = """Job Title: Marketing Analyst
Company: Brightwave Solutions
Location: Remote

Brightwave Solutions is seeking a Marketing Analyst to join our growing team. The ideal candidate will analyze market trends, manage digital marketing campaigns, and optimize customer acquisition strategies. Responsibilities include data-driven decision-making, preparing reports, and collaborating with cross-functional teams. Applicants should have experience with Google Analytics, SEO, and CRM platforms.

Requirements:

Bachelor's degree in Marketing, Business, or related field
2+ years of experience
Strong analytical and communication skills
Salary: $55,000 - $75,000 annually
Apply by: October 31, 2024"""

questions = [
    {"question_label" : 1, "question_text": "Does the job description specify a salary expectation?"},
    {"question_label" : 2, "question_text": "If the job description specifies a salary expectation: Is this a salary range?"},
    {"question_label" : 3, "question_text": "If the job description specifies a salary expectation: Is this an exact salary?"},
    {"question_label" : 4, "question_text": "Does the job offer a hybrid working arrangement?"},
]

user_input_template = (
    "Your task is to read through the document provided below, and answer the provided questions.\n"
    "<document>{document}</document>\n"
    "<questions>{questions}</questions>"
)

# Create a user message
user_input = user_input_template.format(
    document = fake_job_description,
    questions = questions
)

messages = [{"role": "user", "content": user_input}]

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model=MODEL,
    temperature=TEMPERATURE,
    messages=messages,
    response_format=Analysis
)
parsed_completion = completion.to_dict()['choices'][0]['message']['parsed']['questions']

This approach seems to have a few issues, in my estimation:

  • It relies on the model picking out all the questions from the questions list, which is not guaranteed - hallucination risk
  • It relies on the model quoting these questions correctly, which is not guaranteed - hallucination risk
  • It wastes output tokens (which, of course, are more expensive than input tokens), since these questions are identical for every single document, meaning the model re-copies them in every single API call.

To further illustrate this, let’s say I continue using the same approach and build a second model - an evaluation model which checks whether the output of the above model is correct (which you could include in the production roll-out of the above model to monitor its performance in an automated way):

class DocumentQuestionEvaluation(BaseModel):
    """
    A question about a given company's job description, accompanied by the corresponding answer and an evaluation thereof.
    """
    question_label: int
    question_text: str
    answer: Literal['Yes', 'No', 'N/A']
    evaluation_explanation: str
    evaluation: Literal['Consistent', 'Inconsistent']

class AnalysisEvaluation(BaseModel):
    questions: List[DocumentQuestionEvaluation]

user_input_template = (
    "Your task is to evaluate the analysis provided below, and by checking whether it is consistent with the document provided below.\n"
    "<analysis>\n{analysis}\n</analysis>\n"
    "<document>\n{document}\n</document>"
)

# Create a user message
user_input = user_input_template.format(
    analysis = parsed_completion,
    document = fake_job_description
)

evaluation_messages = [{"role": "user", "content": user_input}]

completion = client.beta.chat.completions.parse(
    model=MODEL,
    temperature=TEMPERATURE,
    messages=evaluation_messages,
    response_format=AnalysisEvaluation
)

parsed_completion_evaluation = completion.to_dict()['choices'][0]['message']['parsed']['questions']

This amplifies the previously mentioned issues, as you now rely on the model to correctly recall the list of questions, as well as their responses. This creates a hallucination risk which, in theory, structured outputs could completely alleviate.

So, to take the above example, is there a way to hard-code the list of questions and answers, in such a way that the evaluation model only has to generate the evaluation_explanation and evaluation fields?

I imagine that instead of having to create a “partial response” to a given schema, you could reframe this partial response as another response schema or pydantic model, but I’m having trouble imagining what this would look like.

Hi,

There’s no way to get a “partial response” from a completion; the response is, by definition, complete. But you can do multiple things in stages using Assistants.

  1. Multiple Assistants can work on a Single Thread.
  2. An Assistant with Function Calling can read the first message and route each question to the proper question-answering bot.
  3. Said proper bot can read the question and answer it, using response_format to put the message in the thread in the desired output.
  4. A Critique Assistant can correct the final output, and use response_format (structured outputs) again to ensure it conforms to your schema. A rough sketch of this flow follows below.
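For concreteness, here is a rough sketch of what stages like these could look like with the Assistants beta endpoints in the openai Python SDK. The assistant names, instructions, and the json_object response format are illustrative assumptions rather than a recommendation, and the sketch reuses the user_input built earlier in the thread:

from openai import OpenAI

client = OpenAI()

# Two assistants that will share one thread: an answerer and a critique assistant.
answerer = client.beta.assistants.create(
    name="Question Answerer",
    model="gpt-4o-mini",
    instructions="Answer each question about the document. Reply in JSON.",
    response_format={"type": "json_object"},
)
critic = client.beta.assistants.create(
    name="Critique Assistant",
    model="gpt-4o-mini",
    instructions="Check the previous JSON answers against the document. Reply in JSON.",
    response_format={"type": "json_object"},
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content=user_input  # document + questions
)

# The answerer posts its answers into the shared thread...
client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=answerer.id)

# ...then the critique assistant reads the same thread and evaluates those answers.
client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=critic.id)
latest_message = client.beta.threads.messages.list(thread_id=thread.id).data[0]

A function-calling router (step 2) would sit in front of this and decide which answering assistant to run.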

@sebastian.chejniak this is super interesting, I would suggest a few tips:

  • For the questions, I would actually just create an enum of question numbers, either [1, 2, 3, ... N], or question labels, i.e. ["Q1", "Q2", ..., "Q<N>"]. So instead of outputting the full question, the model would just output a question number or a label - this reduces hallucinations and saves you $$ (see the sketch after this list).
  • Maybe I didn’t quite understand the “partial responses” bit, but if you are after a scenario where a question may not be relevant and therefore should not be asked, you can look at using “optional” fields.
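A minimal sketch of the first tip, assuming the four fixed questions from the original post stay in the prompt and only the label and answer are generated:

from typing import List, Literal

from pydantic import BaseModel

# Labels are fixed up front, one per question listed in the prompt.
QuestionLabel = Literal["Q1", "Q2", "Q3", "Q4"]

class DocumentAnswer(BaseModel):
    """An answer keyed by its question label; the question text lives only in the prompt."""
    question_label: QuestionLabel
    answer: Literal["Yes", "No", "N/A"]

class Analysis(BaseModel):
    questions: List[DocumentAnswer]

The prompt would still list the full questions (as in user_input_template above); the schema simply stops asking the model to reproduce them.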

The Assistants approach does seem like it would reduce hallucination, but it also seems like overkill for my use case, and it doesn’t completely eliminate the hallucination risk. It also increases token usage, which I do not want.

The first tip (the enum of question labels) is a nice idea, and I will probably implement it. To be clear, I always want all of the questions answered; I do not want the model to drop any of them.

I am now thinking that having question_label and question_text as separate fields is unnecessary too. It might be better to have a single question field of the form “Q1: Does the job description specify a salary expectation?”, and use an enum for that. I see no downsides to it, only upsides.

That being said, it doesn’t address the problem of allowing the evaluation model to hallucinate the answers of the first model.

But you could take a similar approach to tackle this problem. Instead of the evaluation response schema being the same as the original, but with the evaluation_explanation and evaluation fields added, I might modify it so that I convert every question-response pair to a string of the format “Q1: Does the job description specify a salary expectation? A: Yes”, and create an enum field (call it question_and_answer) out of that. So the model would have the fields question_and_answer, evaluation_explanation, and evaluation. It seems messy to have to redefine a pydantic model for every individual prompt, but the solution feels like an improvement overall.
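A sketch of what that could look like, assuming the first model keeps its question_label/question_text/answer fields so that parsed_completion is the list of dicts from the first call: the question_and_answer enum is built dynamically with pydantic’s create_model, so the evaluator can only select among the hard-coded pairs and only generates the two evaluation fields freely.

from typing import List, Literal

from pydantic import BaseModel, create_model

# Hard-code the question/answer pairs from the first model's output as an enum of strings.
qa_strings = tuple(
    f"Q{q['question_label']}: {q['question_text']} A: {q['answer']}"
    for q in parsed_completion
)

DocumentQuestionEvaluation = create_model(
    "DocumentQuestionEvaluation",
    question_and_answer=(Literal[qa_strings], ...),  # constrained to the known pairs
    evaluation_explanation=(str, ...),               # free text
    evaluation=(Literal["Consistent", "Inconsistent"], ...),
)

class AnalysisEvaluation(BaseModel):
    questions: List[DocumentQuestionEvaluation]

The enum values then live in the request’s JSON schema rather than being generated as output on every call; whether and how that schema counts toward input tokens is worth checking against the usage field of a real response.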

So the state of the solution now:

  • No hallucination risk involved with quoting an individual question correctly.
  • Hallucination risk: it is still possible that a question will be left out by the model, though that is probably even less likely now that I use the enum approach.
  • I do not know how structured outputs accounts for the schema itself, the literals it emits, or other deterministic parts of the output (such as when it effectively autocompletes “Q1: …”, since the first few characters determine the rest of the question in my case). Any insights into how this approach might or might not affect input/output token usage?

In the upcoming OSS framework, the fundamental abstraction is a goal:

gc = GoalComposer(provider="OpenAI", model="gpt-4o-mini")
gc = gc(global_context="global_context")

gc\
    .goal("read text file", with_goal_args={'file_name': 'jd.txt'})\
    .goal("answer the question from document", with_goal_args=
          {'param': 'jd.txt',  'question': 'If the job description specifies a salary expectation: Is this an exact salary?'})

...
gc.run()
  • You can also express all questions in a loop (start_loop, end_loop)
  • You can verify by rewording the question/answer into a fact and verifying the fact
Original Question   : Does the job description specify a salary expectation?
Original Answer     : Yes, the job description specifies a salary expectation of $55,000 - $75,000 annually.
Converted Fact      : The job description specifies a salary expectation of $55,000 - $75,000 annually.
Verification of Fact: True. The job description specifies a salary expectation of $55,000 - $75,000 annually.


Original Question   : If the job description specifies a salary expectation: Is this a salary range?
Original Answer     : YES
Converted Fact      : A specified salary expectation in a job description typically indicates a salary range.
Verification of Fact: True. The job description specifies a salary range of $55,000 - $75,000 annually, which indicates a salary expectation.


Original Question   : If the job description specifies a salary expectation: Is this an exact salary?
Original Answer     : No, the salary specified is a range of $55,000 - $75,000 annually, not an exact salary.
Converted Fact      : The salary specified in the job description is a range of $55,000 - $75,000 annually, rather than an exact amount.
Verification of Fact: Yes, that is correct. The salary specified in the job description is a range of $55,000 - $75,000 annually, not an exact amount.


Original Question   : Does the job offer a hybrid working arrangement?
Original Answer     : Unable to deduce answer from the provided document.
Converted Fact      : The document does not provide information regarding whether the job offers a hybrid working arrangement.
Verification of Fact: Correct, the document specifies that the position is remote but does not provide information about a hybrid working arrangement.
  • So we (a) expose the JD for each question, (b) convert the question/answer into a fact WITHOUT the JD, and (c) verify the fact against the JD

  • The “answer the question from document” goal automagically turns into the following function

@manage_function(TOOLS_FUNCTIONS, "document_functions")
def answer_question(
    global_context: Annotated[Any, "This is the global context"],
    param: Annotated[str, "This is the parameter to be read from the global context"],
    question: Annotated[str, "This is the question whose answer is required from the document."],
) -> Annotated[dict,
               """
               :return: Returns a dictionary with keys
               - extracted_text (str): the answer from the document to the question asked.
               """]:
    """This function answers the specific question asked of the document."""

    # Stage (a): answer the question with the document in context.
    messages = []
    messages.append({"role": "system", "content": SYSTEM_CONTEXT_QA})
    doc = global_context[param]
    messages.append({"role": "user", "content": f"START_DOCUMENT {doc} END_DOCUMENT \n"})
    messages.append({"role": "user", "content": question})

    chat_completion = openai_chat.chat.completions.create(
        messages=messages,
        model=MODEL_OPENAI_GPT4_MINI,
        temperature=0.4,
    )
    extracted_answer = chat_completion.choices[0].message.content

    # Stage (b): convert the question/answer pair into a standalone fact, without the document.
    messages = []
    messages.append({"role": "system", "content": SYSTEM_CONTEXT_QA_CONVERT_TO_FACT})
    messages.append({"role": "user", "content": question})
    messages.append({"role": "user", "content": extracted_answer})

    chat_completion = openai_chat.chat.completions.create(
        messages=messages,
        model=MODEL_OPENAI_GPT4_MINI,
        temperature=0.4,
    )
    extracted_fact = chat_completion.choices[0].message.content

    # Stage (c): verify the fact against the document.
    messages = []
    messages.append({"role": "system", "content": SYSTEM_CONTEXT_QA_CONVERT_TO_FACT_VERIFIER})
    messages.append({"role": "user", "content": f"START_DOCUMENT {doc} END_DOCUMENT \n"})
    messages.append({"role": "user", "content": extracted_fact})

    chat_completion = openai_chat.chat.completions.create(
        messages=messages,
        model=MODEL_OPENAI_GPT4_MINI,
        temperature=0.4,
    )
    extracted_verification = chat_completion.choices[0].message.content

    print(f"Original Question   : {question}")
    print(f"Original Answer     : {extracted_answer}")
    print(f"Converted Fact      : {extracted_fact}")
    print(f"Verification of Fact: {extracted_verification}")

    return {'extracted_text': extracted_answer}

hth