Any Enum Limit when Classifying?

I am using the Chat Completions API and gpt-4x to act as a classifier. I am requesting structured output with a schema that consists of one enum field that holds the classification choices.

My question: I assume classification degrades as the number of choices in the enum increases. A two-choice enum would obviously pick the correct choice much more reliably than a 1000-choice enum. Does anyone have a feel for when an enum choice count is too large for effective classification? Thanks!!


Give a better example of what exactly you are doing, and what instructions you are passing, and we could help evaluate the likelihood of your successful classification. Also, are you validating the output within your application, saving the response as it is processed, or what?

Sure, thanks for helping. I am trying to classify the transportation mode of a shipment from plain text.

Example text input: Can I get a quote for 10 pallets of bananas, 5000 lbs, pickup Chicago tomorrow, delivery Houston on Friday? PO is 11223344. BOL is 4984449. Load in 40 foot container.

Example JSON structured output passed in full request packet:

```json
{
  "model": "gpt-4.1-2025-04-14",
  "seed": 1,
  "temperature": 0,
  "logprobs": true,
  "n": 2,
  "messages": [
    {
      "role": "developer",
      "content": "You are provided with unstructured text for one or more loads. Classify the mode of the loads as LTL or Truckload."
    },
    {
      "role": "user",
      "content": "Can I get a quote for 12 pallets of bananas, 5000 lbs, pickup Chicago tomorrow, delivery Houston on Friday? PO# is 6617766776."
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "classified_mode",
      "schema": {
        "type": "object",
        "properties": {
          "mode": {
            "type": "string",
            "enum": ["LTL", "Truckload"],
            "description": "The mode of the loads."
          }
        },
        "required": ["mode"],
        "additionalProperties": false
      },
      "strict": true
    }
  }
}
```

The modes in this case being “LTL” or “Truckload”. If we have 20 instead of 2 modes should we expect poor classification? Or 2 versus 50?

We then perform specific activities based on the mode selected.
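For reference, a request like the one above can be built programmatically. This is a minimal sketch using the official `openai` Python SDK; the helper name `build_request` is my own, and the prompt wording is paraphrased from the example rather than the poster's exact code.

```python
# Sketch: build a Chat Completions payload with a strict enum schema.
# build_request is an invented helper name for illustration.

MODES = ["LTL", "Truckload"]

def build_request(text: str, modes: list[str]) -> dict:
    """Assemble a Chat Completions request that forces the model to
    answer with exactly one value from `modes`."""
    return {
        "model": "gpt-4.1-2025-04-14",
        "temperature": 0,
        "messages": [
            {
                "role": "developer",
                "content": (
                    "You are provided with unstructured text for one or more "
                    "loads. Classify the mode of the loads."
                ),
            },
            {"role": "user", "content": text},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "classified_mode",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {
                        "mode": {
                            "type": "string",
                            "enum": modes,
                            "description": "The mode of the loads.",
                        }
                    },
                    "required": ["mode"],
                    "additionalProperties": False,
                },
            },
        },
    }

# Usage (requires OPENAI_API_KEY to be set):
# import json
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(**build_request("12 pallets ...", MODES))
# mode = json.loads(resp.choices[0].message.content)["mode"]
```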

Having 1000 enum choices is going to end very poorly. You’re going to get charged massively for your inputs, and the LLM will struggle as the context window grows longer. More likely than not, there’s some kind of pattern to these choices, so your LLM doesn’t need an exhaustive list.

In other words, if I asked a LLM what its favorite word is, I wouldn’t include the entire dictionary as enum values.

50 enum choices are better than 1000. But I’m not really sure how you can find 48 more classifications for your example.

I guess I was hoping there is some hand-grenade rule-of-thumb for enum-based classification – a point at which the AI degrades. But perhaps there is not; it is specific to each situation.

Gotcha!

  1. Why use the JSON input?

Seems totally unnecessary for such a simple natural language question. I understand you want the response in an extractable/processable format, but actually JSON is often the most fragile and least-effective choice.

Even if you want JSON output, or some other kind of parse-able structured output, JSON is by far the most dense and least-natural language of all outputs. I know that they have pushed this option and tried to make it functional for developers but I think it’s kind of silly.

Instead you can just say, reply in exactly this way:

[LOAD_TYPE] = [LTL]

And then give it a list of options and also a better tabulated list of relevant factors to consider, like:

| LTL | FREIGHT | DROP_SHIP | LIFT_GATE |
| --- | --- | --- | --- |
| if load is greater than X | if load is greater than Y | if mention of Z | if mention of pallet |

So you then:

  • Simply provide your prompt (or list of prompts for classification!)
  • Provide your list of “enum” values (actually you’re not enumerating anything; you’re providing a list of selections based on query factors)
  • Provide the response output structure that your middleware can handle

Is your question, how many queries can you make at once, 2 vs. 50? Or is your question actually “can I give 50 different options for the LLM to choose from when classifying”?

If the former, the workable query count depends mostly on the complexity and diversity of the queries. If it’s 1000 of the “same type of question”, and each is as simple as your example, I bet the LLM could actually handle that. If there’s greater diversity in the queries or choices than what you provided, then I’d guess more like 100–200, and of course the more complex the queries, the fewer you can batch at once.

You’ll of course get best results doing single-query per shot.

In terms of the “classifications”, it’s really all about how you provide the data. If it’s tabular, well-defined data options/selections, and you’re only providing a single query with, say, 2 or 3 “parameters” that the LLM has to “fill in from the available choices”, then so long as the query is clear and the available-choices-data-set-and-choice-parameters matrix is clear, the LLM could in my opinion likely accurately choose between hundreds of options.

HOWEVER - the use cases you are describing and providing are NOT NEEDING TO USE THE LLM!

You’d be much better off in the examples you provided simply getting a basic application going that takes as input drop-down or text-area selections from the user and then runs them through the parameter-filtration/choice-matrix based on the necessary parameters.

Thus, the backend parameters providing your choices are also directly modifiable through a UI (what if shipping rates/load weights/types change slightly? You can modify your parameters table easily and still use the same system), and the user simply inputs their data.

No need for complex use of LLM, expensive calls, or strange JSON structuring. Just a regular old very simple python application with a desktop or webapp UI will easily do the trick. Plus you can integrate with existing databases/applications if desired to pipeline things.
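To make the suggestion concrete: the mode decision in the original example could be a few lines of ordinary code. This is a toy sketch; the function name and the thresholds are invented placeholders, not real LTL/Truckload cutoffs.

```python
def classify_mode(weight_lbs: float, pallet_count: int) -> str:
    """Toy rule-based mode classifier. The 10,000 lb and 12-pallet
    thresholds are made-up placeholders; a real system would read
    them from a configurable parameters table."""
    if weight_lbs >= 10000 or pallet_count > 12:
        return "Truckload"
    return "LTL"
```

Because the thresholds live in plain code (or better, a config table), changing a cutoff is a one-line edit rather than a prompt-engineering exercise.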

And the LLM can help you program it!

Thanks, this is great input. Unfortunately we have use cases – like an inbound email – where we cannot be interacting with the user via UI widgets. But your points are very interesting and useful. Thanks!!

Right, but your dispatch would handle the inbound email. Thus the “user” is the dispatch/agent, not the client.

The max enum possible is 500, btw.

It would take a good AI model to weight them all fairly against the input.

You might even go to extreme lengths - ensure each without leading space is one token and that each has similar semantic quality.

While not an exact equation, you would expect your logits to lose 6 dB of noise margin for every doubling of your possible categories.

So “more buckets” = “more noise” = “more error”.

You can thank Claude Shannon for this.
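The qualitative point – more buckets means more chances for a noisy distractor to beat the correct label – is easy to demonstrate with a Monte-Carlo sketch. The 6 dB figure above is the poster’s heuristic; this simulation only illustrates the direction of the effect, with an invented Gaussian-noise model on the logits.

```python
import random

def error_rate(n_classes: int, margin: float, noise: float,
               trials: int = 20000, seed: int = 0) -> float:
    """Estimate misclassification probability when the correct class's
    logit leads every distractor by `margin`, but all logits carry
    independent Gaussian noise with standard deviation `noise`."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        correct = margin + rng.gauss(0, noise)
        # An error occurs if any of the n_classes - 1 distractors
        # ends up with a higher noisy logit than the correct class.
        if any(rng.gauss(0, noise) > correct for _ in range(n_classes - 1)):
            errors += 1
    return errors / trials
```

Running this with a fixed margin and noise level, the error rate climbs steadily as `n_classes` goes from 2 to 64 – the “more buckets = more noise = more error” intuition in miniature.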

I think information theory breaks down a bit when your classifier is writing to itself, “My task is to use JSON output to produce a classification that will be used for a web site, powering its searching and tagging features. The user message, which is an automated input, has provided an article about computer science, particularly in the field of digital signals, focusing on the work of Claude Shannon. Let me get to work on this…”

The classifier with 6 dB of loss per doubling is not open-ended.

But if you run open-ended, expect tons of output variation – which begs for a follow on close-ended classifier to lock it down to static buckets.

At some point you need to form equivalence classes on whatever output.
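One cheap way to form those equivalence classes, before reaching for a second LLM call, is a normalization map from free-form outputs to canonical buckets. A sketch – the synonym table here is invented for illustration:

```python
# Map free-form model outputs onto a small set of canonical buckets.
# The synonyms below are illustrative guesses, not an exhaustive list.
CANONICAL = {
    "ltl": "LTL",
    "less than truckload": "LTL",
    "truckload": "Truckload",
    "ftl": "Truckload",
    "full truckload": "Truckload",
}

def normalize(label: str, default: str = "UNKNOWN") -> str:
    """Collapse an open-ended label into a static bucket, falling
    back to `default` so unrecognized outputs can be routed to a
    follow-on classifier or a human."""
    return CANONICAL.get(label.strip().lower(), default)
```

Anything that falls through to `default` is exactly the residue you’d hand to a close-ended follow-up classifier.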

I appreciate all this very informative input. Thank you, all!