Max_tokens seems to do nothing for me whatsoever. I want to limit the response length I’m getting to about 150 tokens.
If I set max_tokens to prompt + 150, the response doesn't keep to it at all. If I test something extreme, like setting max_tokens to 400 when my prompt alone is 850 tokens, the API just continues as usual: it accepts the 850-token prompt and outputs a 250-token response.
It’s like max_tokens has no effect on anything, really.
I'm not quite sure how to implement this with the API. For example, if it's responding with 250 tokens in 5 paragraphs and I wanted 100 tokens in 2-3 paragraphs, what would my stop sequence look like?
I could show it an example, say one with only 100 tokens. But how would the stop sequence be integrated so that the API response actually ends the same way?
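One way to sketch this (assuming the legacy `openai` Python SDK; the `###END###` marker is a made-up delimiter, not anything the API defines) is to instruct the model to emit a marker after its answer and pass that same marker as a `stop` sequence. The snippet below only builds the request; the actual network call is left commented out:

```python
# Sketch only: ask the model to emit a marker, then stop generation on it.
# "###END###" is a hypothetical delimiter chosen for this example.
messages = [
    {"role": "system",
     "content": "Answer in 2-3 short paragraphs (about 100 tokens), "
                "then write ###END### on its own line."},
    {"role": "user", "content": "Tell me about x."},
]

request = {
    "model": "gpt-3.5-turbo",
    "messages": messages,
    "stop": ["###END###"],   # generation halts if the model emits the marker
    "max_tokens": 200,       # hard safety cap, a bit above the target length
}
# response = openai.ChatCompletion.create(**request)
```

Note the stop sequence only helps if the model actually emits the marker, which is why the instruction to write it goes into the prompt.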
You seem a bit confused. The max_tokens parameter is only a reservation of context length for the response. If you set it to 250, any response longer than that will be truncated at 250 tokens. If you set it to 850, then up to 850 tokens can be used for the reply.
The setting doesn't inform the AI what kind of response to craft at all; the model never sees it. You'll have to instruct the AI with language like "two brief paragraphs" to shape the length of the output.
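A minimal sketch of that approach, assuming the legacy `openai` SDK (the exact wording of the length hint is just one plausible phrasing, not anything special the API recognizes):

```python
# Sketch: put the length constraint in the prompt itself, since the model
# never sees max_tokens. The hint wording here is an assumption.
length_hint = "Reply in two brief paragraphs, no more than 100 words total."

messages = [
    {"role": "system", "content": "You are a helpful assistant. " + length_hint},
    {"role": "user", "content": "Tell me about x."},
]
# completion = openai.ChatCompletion.create(
#     model="gpt-3.5-turbo", messages=messages, max_tokens=200)
```

Here max_tokens stays well above the target so it only acts as a safety cap rather than truncating mid-sentence.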
Ok, thanks, this worked. I thought I had tried it like that, but evidently I had not. There's a lot of incorrect info out there about this parameter.
As for putting a descriptor of the output structure directly in the prompt, that seems to be ineffective for me; I tried it quite a few times and it made no difference.
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    max_tokens=1000,
    messages=[{
        "role": "system",
        # x is some knowledge I'm pulling from a DB, variable depending on which group is accessing it
        "content": "You are a bot that knows x. Tell me about x."
    }, {
        "role": "user",
        "content": val
    }])
I simply want the output here to be 150 tokens or less, and preferably without it being truncated awkwardly by forcing max_tokens down to 150.
This is where prompt engineering becomes more art than science. I would recommend providing two clearly delineated examples of the desired length when asking the question. That should help you get the right amount of text back.
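A few-shot sketch of that idea, again assuming the legacy `openai` SDK (the `build_messages` helper and the example texts are placeholders I made up; you'd substitute real ~100-token answers):

```python
# Sketch: prepend two short example exchanges of the desired length so the
# model mimics their brevity. Example contents are placeholders.
def build_messages(system_prompt, question, examples):
    """Interleave (question, short_answer) pairs before the real question."""
    messages = [{"role": "system", "content": system_prompt}]
    for q, a in examples:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": question})
    return messages

examples = [
    ("Tell me about topic A.", "Topic A is ... (a ~100-token sample answer)"),
    ("Tell me about topic B.", "Topic B is ... (a ~100-token sample answer)"),
]
messages = build_messages("You are a bot that knows x.",
                          "Tell me about x.", examples)
# completion = openai.ChatCompletion.create(
#     model="gpt-3.5-turbo", messages=messages, max_tokens=200)
```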
It's not necessary to set this parameter tightly, since it doesn't influence how the text is generated or billed; it only determines how much of the output you receive, via truncation.
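If you do leave max_tokens set, you can at least detect when it has truncated a reply by checking finish_reason on the response: "length" means the cap cut the reply off, while "stop" means the model ended on its own (or hit a stop sequence). A small sketch using a fake response shaped like the legacy ChatCompletion payload:

```python
# Sketch: detect whether a completion was cut off by max_tokens.
def was_truncated(response):
    # "length" = truncated by max_tokens; "stop" = natural end or stop sequence
    return response["choices"][0]["finish_reason"] == "length"

# A minimal fake response, shaped like the legacy ChatCompletion payload:
fake = {"choices": [{"message": {"role": "assistant", "content": "..."},
                     "finish_reason": "length"}]}
print(was_truncated(fake))  # True
```

When truncation is detected you could, for example, retry with a larger cap or ask the model to continue.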