I am fairly new to working with the API, and I am facing an issue while using GPT-4o-mini for a classification task.
My problem is as follows: I want to classify scraped text into categories so that I do not have to do it manually.
There are many categories—around 1500.
The scraped text varies from 1 to 50 words. Sometimes, it directly matches a category, but other times the information is more implicit and needs to be inferred. For example:
The main problem is that the model invents new categories even when I set the temperature to 0, even when the correct match seems very straightforward…
Also, the output is sometimes incorrectly formatted: the model uses quotation marks or writes the answer in a full sentence, even though I specified not to do so.
This is the prompt for each request, maybe it could help you understand why the model does not obey:
messages = [
    {"role": "system", "content": "You are an assistant that determines which reference from this list (1589 references): <list of the 1589 references> matches a given reference. Answer with the corresponding element in the list only."},
    {"role": "user", "content": "What is the corresponding reference for this given reference?: "},
]
Do you have any recommendations for solving this problem? Fine-tuning did not help either… Thank you!
The model does not obey because it is simply too dumb to evaluate 1600 choices equally against its pretrained knowledge and the rest of the instructions you give. Mini means less quality and less attention.
If you want to constrain the output so the AI cannot write anything else, you could send four anyOf sub-schemas in a response_format. Each sub-schema gets a different name (perhaps a category) that the AI has to write correctly first, followed by an enum string with about 400 options (500 max per enum). Check whether the total object stays within the schema size limit.
Remember: structured outputs with an enum will force the AI to pick one of those values regardless of the match or its quality, so it should also have a "get out" schema for no matches.
Otherwise, you’ll have to make multiple requests with sub-lists, then have a final AI decide which of those is the best.
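To make the anyOf idea concrete, here is a sketch that builds such a response_format from a placeholder list of 1,589 strings, using chunks of 400 per enum as suggested above plus the "get out" sub-schema. Note that strict mode requires the schema root to be an object, so the anyOf is nested under a `result` property here; the `group_N` names and the `reference_match` schema name are my own placeholders, and you should verify the enum and schema size limits against the current docs:

```python
# Placeholder for the real list of 1,589 reference strings.
references = [f"reference {i}" for i in range(1589)]

# Split the enum into chunks of 400 so no single enum gets too large.
chunks = [references[i:i + 400] for i in range(0, len(references), 400)]

sub_schemas = [
    {
        "type": "object",
        "properties": {
            # A constant name the model has to write correctly first.
            "group": {"type": "string", "const": f"group_{n}"},
            # The enum constrains the answer to one of these 400 strings.
            "reference": {"type": "string", "enum": chunk},
        },
        "required": ["group", "reference"],
        "additionalProperties": False,
    }
    for n, chunk in enumerate(chunks)
]

# "Get out" sub-schema so the model can report that nothing matches.
sub_schemas.append({
    "type": "object",
    "properties": {"no_match": {"type": "string", "const": "no_match"}},
    "required": ["no_match"],
    "additionalProperties": False,
})

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "reference_match",
        "strict": True,
        # Strict mode wants an object at the root, so the anyOf is nested.
        "schema": {
            "type": "object",
            "properties": {"result": {"anyOf": sub_schemas}},
            "required": ["result"],
            "additionalProperties": False,
        },
    },
}
```

The `response_format` dict would then be passed as-is in the chat completion request.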
Agree with OP. I have a similar use case with fewer categories. The answer has been to use structured output + pydantic, with 100% compliance from GPT even though I use temperature=0.3. You can check the documentation for structured outputs. This is the option I use in the request:
Your label space is too large for the model to handle effectively. One good approach here is to set this up as a hierarchical classification task and recursively call the model across the hierarchy levels.
This way you reduce the number of label options you send to the model in each call, which tends to significantly improve performance, reduce the hallucination rate, and make validation easier. You could also create a dynamic JSON schema for each level in your hierarchy and use it with the API's structured-output functionality, further reducing hallucinations.
One drawback is that you need to create a label hierarchy (very similar to a taxonomy), but I believe these models are quite adept at helping you create one. (Or you could try a mix of clustering and using LLMs to guide/correct the clustering process.)
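The recursive idea above can be sketched with a stub in place of the real model call; `ask_model`, the toy taxonomy, and its labels are all hypothetical, and a real implementation would replace the stub with a chat-completion request whose structured-output enum contains only the children of the current node:

```python
# Hypothetical two-level taxonomy: top-level category -> leaf labels.
TAXONOMY = {
    "electronics": ["phone", "laptop", "camera"],
    "clothing": ["shirt", "shoes", "jacket"],
}


def ask_model(text: str, options: list[str]) -> str:
    # Stub for the real API call; a real implementation would send
    # `options` as a structured-output enum so only one of them can
    # come back. Here we just pick the first option mentioned in the text.
    for option in options:
        if option in text:
            return option
    return options[0]


def classify(text: str, tree) -> str:
    # Leaf level: a plain list of labels, one call decides.
    if isinstance(tree, list):
        return ask_model(text, tree)
    # Internal node: pick a branch among the keys, then recurse, so each
    # call only ever sees a small set of options.
    branch = ask_model(text, list(tree))
    return classify(text, tree[branch])


print(classify("a review of a new laptop from the electronics section", TAXONOMY))
# → laptop
```

Each level is a separate, small request, so the enum per call stays tiny even when the full label space has 1,500+ entries.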