Confused about max_tokens parameter with GPT-4-turbo (128k - tokens used for prompt, or 4k?)

Hi there,

the documentation says:

max_tokens - integer or null
Optional
Defaults to inf

The maximum number of tokens to generate in the chat completion.

The total length of input tokens and generated tokens is limited by the model’s context length. See How to count tokens with tiktoken | OpenAI Cookbook for counting tokens.

When using GPT-4-turbo, with a maximum context of 128k tokens and a maximum response of 4k tokens, what value should I use in the REST call?

Should max_tokens be
128k - {tokens used for prompt},
or 4k (just the max_tokens for the response)?

There is usually no need to set max_tokens. Unless you are doing something very specific, evaluating some aspect of the model, or using the instruct model (which has a legacy 200-token default), omitting the max_tokens parameter is the way to go.

In our application, max_tokens is to be used as a “security mechanism” so that we can set a clear cost limit. We would therefore like to give users the option of setting the value.

1 Like

Then you will need to perform some dynamic token counting with tiktoken to calculate an appropriate value.

The value should be:

model’s max token context - tiktoken prompt token count - a small margin for system tokens (say 15), capped at your own limiting amount.
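
For illustration, a rough sketch of that calculation, assuming the cl100k_base encoding used by the GPT-4 family; the 15-token margin and the cost ceiling are values you would pick yourself:

```python
import tiktoken

MODEL_CONTEXT = 128_000     # gpt-4-turbo context window
COST_LIMIT_TOKENS = 2_000   # your own cost ceiling for the response (assumption)
SYSTEM_MARGIN = 15          # small allowance for chat-format overhead tokens

def compute_max_tokens(prompt: str) -> int:
    """Return a max_tokens value that fits the remaining context,
    capped at your own cost limit."""
    enc = tiktoken.get_encoding("cl100k_base")
    prompt_tokens = len(enc.encode(prompt))
    remaining = MODEL_CONTEXT - prompt_tokens - SYSTEM_MARGIN
    return max(0, min(remaining, COST_LIMIT_TOKENS))

print(compute_max_tokens("Summarise the attached report in three bullet points."))
```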

Okay… got it:

“message” → “max_tokens is too large: 126099. This model supports at most 4096 completion tokens, whereas you provided 126099.”

So in the new models max_tokens seems to apply to the response only, while in the old ones it covered prompt and response…

It’s been so long since I used it. These days I stream everything, and if I need to implement some kind of limit I can just close the connection when I reach my token count. The model will rattle off a few more tokens… usually 7-15 while it detects the connection is closed, and that's it.
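
As a rough sketch of that pattern, assuming the v1 Python SDK; the model name, the one-token-per-chunk approximation, and the 500-token limit are all placeholders:

```python
from openai import OpenAI

client = OpenAI()            # assumes OPENAI_API_KEY is set in the environment
TOKEN_LIMIT = 500            # hypothetical per-request cost ceiling

stream = client.chat.completions.create(
    model="gpt-4-turbo-preview",   # placeholder model name
    messages=[{"role": "user", "content": "Write a long essay on tokenizers."}],
    stream=True,
)

chunks, count = [], 0
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        chunks.append(delta)
        count += 1               # each streamed chunk is roughly one token
    if count >= TOKEN_LIMIT:
        stream.close()           # drop the connection; generation stops shortly after
        break

print("".join(chunks))
```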

3 Likes

@Foxalabs - A question on another topic.

We generate output in a JSON format and have the problem that the output can be longer than the token limit allows.

In that JSON format, n elements are generated.

Is it possible to tell GPT that n elements should be generated, but only as many as the token limit allows, so that a valid JSON document can still be generated and sent as a response?

Not reliably. The GPT series of models use a feed-forward network; they are not aware of what they have generated until they have generated it.

When I’m faced with a requirement like this, I look for a way to split the request into sections, each one well within the model's input and output limits, and then use traditional code to concatenate the outputs or otherwise process the results into a larger whole once finished.
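
For example, a minimal sketch of the stitching step, with placeholder strings standing in for the model's partial responses:

```python
import json

# Each element stands in for the model's response to one small sub-request.
partial_responses = [
    '[{"name": "Entry A", "bio": "…"}, {"name": "Entry B", "bio": "…"}]',
    '[{"name": "Entry C", "bio": "…"}]',
]

# Traditional code stitches the partial JSON arrays into one larger whole.
merged = []
for raw in partial_responses:
    merged.extend(json.loads(raw))

print(json.dumps(merged, ensure_ascii=False, indent=2))
```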

Then I run into the following problem:
Example:
I want 100 short biographies of the most important musicians of the 90s.
If I split this up and ask for 10 at a time, I always get (partial) short biographies of the same musicians. How do you solve this problem?

Use the large 128k input context to show the model which entries have already been processed, and instruct the model to avoid the listed entries when generating new ones.
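
A sketch of that idea, assuming batches of 10, a hypothetical ask_model wrapper around the chat completions call, and a model that returns a bare JSON array (a real pipeline would need error handling for malformed output):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_model(prompt: str) -> str:
    # Hypothetical wrapper; model name and max_tokens are placeholders.
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=2000,
    )
    return response.choices[0].message.content

biographies: list[dict] = []
already_done: list[str] = []   # names covered in earlier batches

for _ in range(10):            # 10 batches of 10 = 100 biographies
    prompt = (
        "Return a JSON array of 10 objects with 'name' and 'bio' fields, "
        "covering important musicians of the 90s.\n"
        "Do NOT include any of these already-covered musicians:\n"
        + "\n".join(f"- {name}" for name in already_done)
    )
    batch = json.loads(ask_model(prompt))   # assumes a clean JSON array comes back
    biographies.extend(batch)
    already_done.extend(item["name"] for item in batch)

print(f"Collected {len(biographies)} biographies")
```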

1 Like

I had access to GPT-4 and turbo but now I don’t. Can I have it back?

Is this for the API or for ChatGPT? If it’s for the API, your account will need credit applied to it. If it’s ChatGPT, you will have to check whether you have a Plus membership; if not, you will have to wait for Plus memberships to start accepting new users again.

1 Like

Ta - I have a ChatGPT Plus membership under the only subscription option offered to me so far. I did have both GPT-4 and turbo again this morning, which was great, but it only lasted an hour or so - so, like everyone who gets a privilege, it hurts all the more when it’s taken away :see_no_evil:. What would this user have to do to obtain it permanently, or to get a response from Enterprise, as I want to commission an AI as a developer?

GPT-4 API access is not taken away once you have made a $5 API credit payment; it will still be there if you look on the Playground under chat mode and show all models: https://platform.openai.com/playground?mode=chat

Thank you Spencer. Appreciated your advice - super useful, Ta!

These days I stream everything, and if I need to implement some kind of limit I can just close the connection when I reach my token count. The model will rattle off a few more tokens… usually 7-15 while it detects the connection is closed, and that's it.

This is very impolite; they could implement an auto-ban for unnecessary server resource use. It is called “resource leakage”, and if I were the AI I’d ban you for a few minutes after finding out it was not accidental :wink:

AI doesn’t care if you are rude to it and close the connection while it is responding.

Twice in the last day I’ve been more rude back: “I stopped your response because you were being a dummy.” The matrix math forgets you the second a token dictionary is generated.

OpenAI made the decision, and the implementation, to also stop the AI model’s generation instead of letting it complete and giving you the full bill (as would happen if you closed the connection on a non-streaming API call).

1 Like