Why are completion tokens so high?

I am using the gpt-5-nano model for a step in a chatbot that determines the user’s intention from their latest comment. There are about 15 categories I’m trying to identify for now, and I’ll trim the list as I see which have the most value.

So I have a reasonably large number of input tokens, which is fine, but what I don’t understand is the very high output token count I’m getting, usually 800-1000. For example, this response was 1057 tokens. It’s nearly as high when I ask for only the code without the reason. If I set max_completion_tokens any lower, the request just fails with a length error. Any idea what’s up?

{"code": "USER-COMPLIMENTS-ASSISTANT", "reason": "User said the assistant's joke was very funny."}
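To show what I mean, here’s the rough arithmetic (the visible-token estimate is mine; only the 1057 total is from the actual response):

```python
# On reasoning models the completion_tokens count includes hidden reasoning
# tokens, so a tiny visible JSON reply can still bill ~1000 output tokens.
# The 25-token figure for the visible JSON is an estimate.

def hidden_reasoning_tokens(completion_tokens: int, visible_tokens: int) -> int:
    """Tokens billed as output that never appear in message.content."""
    return completion_tokens - visible_tokens

hidden_reasoning_tokens(1057, 25)  # -> 1032 tokens spent on internal thinking
```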

Thanks!

Steve

  1. gpt-5 models are reasoning models: you pay for their internal thinking as output tokens.
  2. gpt-5-nano tends to think excessively long for poor results; you’re better off with gpt-5-mini.
  3. Use the API parameter “reasoning_effort” and set it to “low”. That tells the model how much to think (this is the Chat Completions parameter).
  4. Or simply use gpt-4.1, which goes straight to producing output without first deliberating over which “code” to generate. Add “top_p”: 0.01 if you want consistent answers instead of random ones.
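To make point 3 concrete, a minimal sketch of the request for your classifier with reasoning dialed down. “reasoning_effort” and “messages” are the real Chat Completions parameter names; the prompt wording is a placeholder:

```python
# Sketch: build the Chat Completions request for the intent classifier with
# reasoning effort turned down. The system prompt here is illustrative.

def build_classifier_request(comment: str) -> dict:
    return {
        "model": "gpt-5-nano",
        "reasoning_effort": "low",  # less internal thinking; reasoning bills as output tokens
        "messages": [
            {
                "role": "system",
                "content": 'Classify the user\'s intent into one category code. '
                           'Reply only with JSON: {"code": "...", "reason": "..."}',
            },
            {"role": "user", "content": comment},
        ],
    }

# Usage (requires the openai SDK and an API key):
# from openai import OpenAI
# resp = OpenAI().chat.completions.create(**build_classifier_request("That joke was hilarious!"))
# print(resp.choices[0].message.content)
# print(resp.usage.completion_tokens_details.reasoning_tokens)  # the hidden thinking
```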

Why use reasoning at all for this? It sounds like a classic categorisation task that a basic LLM could handle without any internal “loops”.

I second @_j’s suggestion of using 4.1.
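If it helps, a sketch of what the non-reasoning version might look like. gpt-4.1-mini, top_p=0.01, and the 60-token budget are illustrative picks, not tested recommendations:

```python
import json

# Sketch: the same classifier on a non-reasoning model, plus a helper to
# pull the label back out of the JSON reply.

def build_request(comment: str) -> dict:
    return {
        "model": "gpt-4.1-mini",
        "top_p": 0.01,  # near-greedy sampling, so repeated calls give the same label
        "max_completion_tokens": 60,  # safe to cap now: no hidden reasoning to budget for
        "messages": [
            {
                "role": "system",
                "content": 'Classify the user\'s intent into one category code. '
                           'Reply only with JSON: {"code": "...", "reason": "..."}',
            },
            {"role": "user", "content": comment},
        ],
    }

def parse_code(raw: str) -> str:
    """Pull the category code out of the model's JSON reply."""
    return json.loads(raw)["code"]

# With a reply shaped like the one in the question:
parse_code('{"code": "USER-COMPLIMENTS-ASSISTANT", "reason": "Joke was funny."}')
# -> "USER-COMPLIMENTS-ASSISTANT"
```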

Interesting answer. Which of the models do you think is best: 4.1, 4.1-mini, or 4.1-nano? I’ll try the top_p thing too.