max_tokens seems to do nothing for me with 3.5 Turbo

Max_tokens seems to do nothing for me whatsoever. I want to limit the response length I’m getting to about 150 tokens.

If I set max_tokens to the prompt length + 150, it doesn’t keep to it at all. If I test something extreme, like setting max_tokens to 400 when my prompt alone is 850 tokens, the API just continues as usual: it accepts the 850-token prompt and outputs a 250-token response.

It’s like max_tokens has no effect on anything, really.

Here is my prompt:

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    max_tokens=1000,
    messages=[{
        "role": "system",
        "content": my_prompt
    }, {
        "role": "user",
        "content": val
    }])

Are you using a stop sequence? It can likely help you…

https://help.openai.com/en/articles/5072263-how-do-i-use-stop-sequences

I’m not quite sure how to implement this in the context of the API. For example, if it’s responding with 250 tokens in 5 paragraphs, and I wanted 100 tokens in 2-3 paragraphs, what would my stop sequence look like?

Depends on your entire prompt. What are you trying to get it to output? Can you show it an example with a stop sequence added?

I could show it an example, say one with only 100 tokens. However, how would the stop sequence be integrated so that it does the same thing when the API response comes through?

Prompt:

Give me some output about XYZ…

Sure! Here’s your information about XYZ. ###

Give me some output about XYZ…

Then set your stop sequence to "###" and it should follow the one-shot example.
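In code, that looks something like the sketch below (assuming the same my_prompt and val variables from your snippet; the stop parameter accepts a string or a list of up to four sequences):

# Minimal sketch: the one-shot example in the prompt ends with "###",
# and the same string is passed as a stop sequence so generation halts there.
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    max_tokens=1000,
    stop="###",  # generation stops before this sequence would be emitted
    messages=[{
        "role": "system",
        "content": my_prompt  # contains the one-shot example terminated by "###"
    }, {
        "role": "user",
        "content": val
    }])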

It might be easier to help if you could share the prompt or what you’re trying to achieve.

Counting words/tokens is hard for the LLM…

You seem a bit confused. The max_tokens parameter is only a reservation of context length for the response. If you set it to 250, the response will be cut off at 250 tokens. If you set it to 850, then 850 tokens is simply the amount that can be used for the reply.

The setting doesn’t inform the AI what type of response it should craft at all; the AI never sees it. You’ll have to instruct the AI with language like "two brief paragraphs" to shape the length of the output.
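For example, something along these lines (a sketch with hypothetical instruction wording; max_tokens here is only a safety cap set a bit above the target so normal responses aren’t cut off):

# Sketch: ask for the length in the prompt; max_tokens only truncates, it doesn't guide.
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    max_tokens=200,  # safety cap above the ~150-token target
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant. Answer in two brief paragraphs, "
                   "no more than about 100 words total."
    }, {
        "role": "user",
        "content": val
    }])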


OK, thanks, this worked. I thought I had tried it like that, but evidently I had not. There’s a lot of incorrect info out there about this parameter.

With regard to putting a descriptor of the output structure directly in the prompt, that seems to be ineffectual; I tried it quite a few times and it didn’t make a difference.

So it would look something like this:

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    max_tokens=1000,
    messages=[{
        "role": "system",
        # x is knowledge pulled from a DB, variable depending on which group is accessing it
        "content": "You are a bot that knows x. Tell me about x."
    }, {
        "role": "user",
        "content": val
    }])

I simply want the output here to be 150 tokens or less, and preferably without it being truncated awkwardly by forcing max_tokens to 150.

Again, no: you set max_tokens only to the length of the response you want.

And then you use prompt language that curtails even a physics textbook topic to the length desired (here 150 tokens would result in truncation):
[screenshot: example of a prompt and its length-limited response]


Yes, this. It’s not always easy to get it to stop at X words or X tokens, though, which is why I suggested a stop sequence if one could be added.

This is where prompt engineering becomes more art than science. I would recommend providing two clearly delineated examples of the desired length when asking the question, as in the sketch below. That should help you get the right amount of text back.
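A minimal sketch of that idea, using hypothetical example questions and short example answers passed as prior conversation turns:

# Sketch: two short example exchanges (few-shot) demonstrate the desired answer length.
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    max_tokens=200,
    messages=[
        {"role": "system", "content": "You are a bot that knows x. Keep answers to roughly 100 words."},
        {"role": "user", "content": "Tell me about topic A."},
        {"role": "assistant", "content": "Topic A is ... (a short, ~100-word example answer)"},
        {"role": "user", "content": "Tell me about topic B."},
        {"role": "assistant", "content": "Topic B is ... (another short, ~100-word example answer)"},
        {"role": "user", "content": val},
    ])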

Additionally, your max_tokens setting will not affect the generation process itself. As mentioned by _j, the output is simply truncated.

It’s not recommended to set this parameter, since it doesn’t impact the generation or even the billing. It only determines how much of the output you receive, through truncation.
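If you do set it, you can at least detect when a reply was cut off: the API reports finish_reason as "length" when max_tokens truncated the output (a small sketch, reusing the completion object from the earlier call):

# Sketch: detect truncation caused by max_tokens.
choice = completion.choices[0]
if choice.finish_reason == "length":
    # The reply was cut off by max_tokens rather than ending naturally ("stop").
    print("Response was truncated; consider raising max_tokens or shortening the prompt.")
print(choice.message["content"])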