Hi there,

the documentation says:

max_tokens - integer or null
Optional
Defaults to inf

The maximum number of tokens to generate in the chat completion.

The total length of input tokens and generated tokens is limited by the model's context length (see "How to count tokens with tiktoken" in the OpenAI Cookbook for counting tokens).

When using GPT-4 Turbo, with a maximum context of 128k tokens and a maximum response of 4k tokens, what value should be used in the REST call?

Should max_tokens be
128k - {tokens used for the prompt},
or 4k (just the max_tokens for the response)?

There is usually no need to set max_tokens. Unless you are doing something very specific, evaluating some aspect of the model, or perhaps using the instruct model with its legacy 200-token default, omitting the max_tokens parameter is the way to go.

In our application, max_tokens is meant to serve as a safeguard so that we can set a clear cost limit. We would therefore like to give users the option of setting the value.


Then you will need to perform some dynamic token counting with tiktoken to calculate an appropriate value.

The value should be:

(model's max token context) - (tiktoken token count of the current prompt) - (a small margin for system tokens, say 15) = your limiting amount
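
For example, a minimal sketch with tiktoken, assuming an illustrative 128k context window, a gpt-4-family tokenizer, and a made-up per-request cap for the cost-limit use case mentioned above:

```python
# Sketch: derive a max_tokens value from the prompt length.
# Context size, margin, and the cost cap are illustrative assumptions.
import tiktoken

CONTEXT_WINDOW = 128_000   # model's total context length in tokens
SYSTEM_MARGIN = 15         # small allowance for hidden/system tokens
COST_CAP_TOKENS = 2_000    # optional hard per-request cost limit

def compute_max_tokens(messages: list[dict]) -> int:
    enc = tiktoken.encoding_for_model("gpt-4")  # tokenizer family; adjust to your model
    prompt_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    headroom = CONTEXT_WINDOW - prompt_tokens - SYSTEM_MARGIN
    return max(1, min(headroom, COST_CAP_TOKENS))

messages = [{"role": "user", "content": "Write a haiku about token limits."}]
print(compute_max_tokens(messages))  # value to pass as max_tokens
```

The min() against the cap keeps the hard spend limit, while the headroom term keeps the request within the context window.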

Okay… got it:

“message” → “max_tokens is too large: 126099. This model supports at most 4096 completion tokens, whereas you provided 126099.”

So in the newer models max_tokens seems to apply to the response only, while in the older ones it covers prompt and response…

It’s been so long since I used it. These days I stream everything, and if I need to implement some kind of limit I just close the connection when I reach my token count. The model will rattle off a few more tokens, usually 7-15, while it detects that the connection is closed, and that’s it.
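
For instance, a minimal streaming sketch, assuming the openai-python v1 client, an illustrative model name, and a made-up token budget (counting chunks approximates counting tokens):

```python
# Sketch: cut off a streamed completion once a token budget is reached.
# Model name and budget are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
TOKEN_BUDGET = 500  # hypothetical per-request completion limit

collected = []
tokens_seen = 0
stream = client.chat.completions.create(
    model="gpt-4-turbo",  # assumed model name
    messages=[{"role": "user", "content": "List the planets, one sentence each."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        collected.append(delta)
        tokens_seen += 1  # each streamed chunk carries roughly one token
    if tokens_seen >= TOKEN_BUDGET:
        break  # stop reading; recent client versions also let you call stream.close()

print("".join(collected))
```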


@Foxabilo - A question on another topic.

We generate output in JSON format and have the problem that the output can be longer than the token limit allows.

The JSON contains n generated elements.

Is it possible to tell GPT that n elements should be generated, but only as many as the token limit allows, so that valid JSON can still be generated and sent as a response?

Not reliably. The GPT series of models use a feed-forward network; they are not aware of what they have generated until they have generated it.

When I’m faced with a requirement like this, I look for a way to split the request into sections, each one well within the model’s input and output limits, and then use traditional code to concatenate the outputs or otherwise process the results into a larger whole once finished.
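
As an illustration of that merge step, a minimal sketch that stitches partial JSON-array responses into one valid array (the partial strings are made-up placeholders):

```python
# Sketch: merge several partial JSON-array responses into one valid JSON document.
# Assumes each section the model returned is itself a valid JSON array of objects.
import json

partial_responses = [
    '[{"name": "Artist A"}, {"name": "Artist B"}]',  # illustrative model outputs
    '[{"name": "Artist C"}]',
]

merged = []
for raw in partial_responses:
    merged.extend(json.loads(raw))  # parse each section and append its elements

print(json.dumps(merged, indent=2))  # one valid JSON array containing all elements
```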

Then I run into the following problem:
Example:
I want 100 short biographies of the most important musicians of the 90s.
If I split this up and ask for 10 at a time, I always get (partial) short biographies of the same musicians. How do you solve this problem?

Use the large 128K input context to show the model which entries have already been processed and instruct the model to avoid using the listed entries for new generation.
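
A minimal sketch of that loop, assuming the chat completions API in JSON mode, an illustrative batch size of 10, and made-up prompt wording and field names:

```python
# Sketch: build the list in batches, feeding already covered names back into the
# prompt so the model avoids repeats. Model name and JSON shape are assumptions.
import json
from openai import OpenAI

client = OpenAI()
completed: list[dict] = []  # accumulated {"name": ..., "bio": ...} entries

while len(completed) < 100:
    done_names = ", ".join(e["name"] for e in completed) or "none yet"
    prompt = (
        'Return JSON of the form {"musicians": [{"name": ..., "bio": ...}]} with '
        "10 short biographies of important musicians of the 90s. Do not include "
        f"any of these already covered musicians: {done_names}."
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",                      # assumed model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # keeps each batch parseable
    )
    batch = json.loads(response.choices[0].message.content)["musicians"]
    known = {e["name"] for e in completed}
    # The instruction alone is not a hard guarantee, so drop repeats in code too.
    completed.extend(m for m in batch if m["name"] not in known)

print(json.dumps(completed[:100], indent=2, ensure_ascii=False))
```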