Issues with setting max tokens for output

I'm sometimes getting really long responses, so I added this prompt on the backend:

“You are an extremely helpful assistant, be concise and relevant. All responses should use 250 completion_tokens or less.”
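For reference, the backend call looks roughly like this (a minimal sketch assuming the OpenAI Python SDK; the model name and query are just placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

user_query = "Explain how photosynthesis works."  # placeholder query

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": (
                "You are an extremely helpful assistant, be concise and relevant. "
                "All responses should use 250 completion_tokens or less."
            ),
        },
        {"role": "user", "content": user_query},
    ],
)
print(response.choices[0].message.content)
```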

Although it’s following this rule, it’s now cutting off mid-sentence. Sometimes it finishes the answer correctly in under 250 tokens; other times it cuts off mid-sentence while answering the exact same query.

Any solution to this?

Also, any solution for structuring the format of streamed text?

Hi there - rather than being specific about the tokens, I would just try some variations of the prompt. By being explicit that the response should be concise, you should be able to achieve this goal. You could, for example, tell it to respond with only one concise sentence.

Alternatively, you may try to instruct the model to only return complete sentences.

I can’t comment on the streaming question.

Why not just actually set max_tokens = 250?

Because the setting does not inform the model in any way what type of response to construct.

You instead get text that is truncated, which is the symptom seen here.
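To illustrate (a sketch using the OpenAI Python SDK; the model name and query are illustrative): max_tokens only caps the generation, and a response that hit the cap comes back with finish_reason == "length".

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative
    messages=[{"role": "user", "content": "Explain how transformers work."}],
    max_tokens=250,  # hard cap: generation simply stops once 250 tokens are produced
)

choice = response.choices[0]
print(choice.message.content)
print(choice.finish_reason)               # "length" => the response was cut off mid-sentence
print(response.usage.completion_tokens)   # how many tokens were actually generated
```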

The AI doesn’t perceive tokens, words, etc. the way we see them. It has no mechanism that keeps a running total of the tokens in its response as it generates and uses that count to decide how the rest should be phrased.

If you want a particular length, the best way to break down the task is an instruction like “three paragraphs, averaging ten words each.”
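Something along these lines (again a sketch; the instruction wording is the point, the rest is boilerplate and the model name is illustrative):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative
    messages=[
        {
            "role": "system",
            "content": "Answer in three paragraphs, averaging ten words each.",
        },
        {"role": "user", "content": "What causes seasons on Earth?"},
    ],
    max_tokens=400,  # generous ceiling as a safety net, not as the length control
)
print(response.choices[0].message.content)
```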


Fair enough.

This is solid advice for when targeting specific lengths. Personally, for short responses I usually just go with something along the lines of “your response must be sharp, concise, and terse.”