Hi everyone! I’m building a simple text classifier using OpenAI API, and I was wondering if there is a way to explicitly define an input data structure?
For example, my input contains a list of texts to classify (texts
) and a list of available labels (labels
), and I want the API to match each text with one of the labels from the list. I’m defining the output format through the following data structure:
from pydantic import BaseModel
class LabeledText(BaseModel):
text: str
label: str
class LabeledTexts(BaseModel):
texts: list[LabeledText]
And then I pass this structure through the response_format
parameter:
response = await async_client.beta.chat.completions.parse(messages=messages, model=model, response_format=LabeledTexts)
But I’m not sure how can I define an input structure texts
and labels
in the same fashion (so I can explicitly separate those variables from the prompt instructions). My current solution is to pass the lists without any formatting:
prompt = f"""
Act like a text classifier. You will be given a list of texts and a list of labels. Your task is to match each text with a label. Return the results in a JSON format where each item contains the original text and the corresponding label.
Texts: {texts}
Labels: {labels}
"""
But the classification quality is not that great, and I feel like the input structure might be the main bottleneck here. Does anyone know some best practices to structure the inputs for API in similar scenarios?
Thanks!