Can anyone explain why I cannot set max_tokens to 32k?

Hi all,

this payload
{
  "model": "gpt-4",
  "messages": [
    {"role": "user", "content": "Générer les titres de chapitres que pourrait inclure un livre sur chatgpt."}
  ],
  "max_tokens": 20000,
  "temperature": 0.3,
  "n": 1,
  "presence_penalty": 2.0,
  "frequency_penalty": 0.1
}

returns this:
{
  "error": {
    "message": "This model's maximum context length is 8192 tokens. However, you requested 20029 tokens (29 in the messages, 20000 in the completion). Please reduce the length of the messages or completion.",
    "type": "invalid_request_error",
    "param": "messages",
    "code": "context_length_exceeded"
  }
}

I'm Tier 5.
How can I request a 32k max_tokens?

It seems to be possible.

Thanks to all for your support

The context length of GPT-4 is 8,192 tokens.

This is the total of both input and output (completion), so if the completion tokens exceed 8,192 tokens, an error will occur.

After the release of GPT-4, a 32K-token version of GPT-4 was also released, but it never became mainstream, and apart from the very few who currently have access, most people cannot use it.

Additionally, the 32K model of GPT-4 will be discontinued for existing users in June next year and will no longer be available.

Instead, it is recommended to use models such as gpt-4o with a 128K context length.
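For example, here is a minimal sketch (assuming the current openai Python SDK and access to gpt-4o; the 4000-token budget is just an illustrative choice) of the same request with a completion limit the model can actually honor:

```python
# Sketch only: the same request pointed at gpt-4o (128K context),
# with a max_tokens value that fits inside the model's limits.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Générer les titres de chapitres que pourrait inclure un livre sur chatgpt.",
        }
    ],
    max_tokens=4000,       # completion budget well inside the context window
    temperature=0.3,
    n=1,
    presence_penalty=2.0,
    frequency_penalty=0.1,
)

print(response.choices[0].message.content)
```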


max_tokens is a parameter for the maximum response length you will receive back. Perhaps you are misinterpreting the intention of this parameter.

Is it because you are using the GPT-4 model, which has a context length of 8192 tokens, and attempting to reserve a 20000-token portion of that context solely for the response, which is impossible? Yes.

The setting does have value, though: it can cut off the response if the AI model goes into a loop of nonsense, producing garbage far beyond the length your application needs.

No OpenAI model is completely copacetic with this idea of long output, though, even the ones that support it, like the limited-access "gpt-4-32k" or "gpt-4o-2024-08-06", which allows a setting of 16k for output. The model will find a reason to wrap up the response before it approaches anything like that response length.

A max_tokens of 3000 is reasonable for anything the AI would write without special prompt engineering to coax it beyond its training.
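For instance, a quick sketch (assuming the openai Python SDK; the prompt is just an example) of sending a modest budget and checking whether the response actually got cut off:

```python
# Sketch only: request a modest completion budget and detect truncation.
# finish_reason == "length" means the reply was cut off by max_tokens.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Write a detailed chapter outline for a book about ChatGPT."}
    ],
    max_tokens=3000,
)

choice = response.choices[0]
if choice.finish_reason == "length":
    print("Response hit the max_tokens cap and was truncated.")
else:
    print(choice.message.content)
```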


Hey, thank you for that great answer.
I'm creating a book generator. When I ask it to provide detailed information with a short conclusion, the last paragraph usually has incomplete sentences, with missing words like "the", "of", "and", etc.

If I set it to 8000 it's worse.
If I set it to 14k it's still present, but a little worse.

Can I ask you if you have a suggestion for this issue?

I can provide examples; however, they are in French.

thanks

The AI language model never actually receives this parameter to base its response on. It only serves to terminate the output.

Therefore, you can simply set it to a point where no responses of yours are actually cut off.

If you aren't looking at the "usage" statistics the API returns to show token consumption, you can also paste text into this site to see how much the AI is writing and how much room a particular setting would leave you: https://platform.openai.com/tokenizer
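If you prefer counting locally instead of pasting into that page, something like the tiktoken package (my suggestion, not part of the post above) gives the same counts:

```python
# Sketch only: count tokens locally with the tiktoken package
# instead of pasting text into the web tokenizer.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")

text = "Générer les titres de chapitres que pourrait inclure un livre sur chatgpt."
prompt_tokens = len(encoding.encode(text))

context_window = 8192  # gpt-4
print(f"prompt tokens: {prompt_tokens}")
print(f"room left for the completion: {context_window - prompt_tokens}")
```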

You may have more consistent language use if you were to send lowered API parameters of sampling:

{
  "model": "gpt-4-turbo",
  "top_p": 0.5,        # default and max is 1.0
  "temperature": 0.5,  # default is 1.0
  "messages": [...

Then, once that refinement is in place, the models themselves will differ in how prone they are to errors in different world languages, and in how much those sampling settings just shown need to be constrained even lower.

You can omit the max_tokens parameter entirely; you just won’t have the safety of a cutoff if the output goes into a repeating loop (which is a less common symptom these days).
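Putting those pieces together, a sketch of such a request with the lowered sampling parameters, max_tokens omitted, and the usage statistics printed (the model choice and prompt are just illustrative, assuming the openai Python SDK):

```python
# Sketch only: lowered sampling parameters, max_tokens omitted
# (no artificial cutoff), then inspect the usage statistics returned.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",
    top_p=0.5,        # default and max is 1.0
    temperature=0.5,  # default is 1.0
    messages=[
        {"role": "user", "content": "Write a detailed section with a short conclusion about ChatGPT."}
    ],
)

print(response.choices[0].message.content)
print("prompt tokens:", response.usage.prompt_tokens)
print("completion tokens:", response.usage.completion_tokens)
```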


You mentioned that you are creating a book generator; even with the standard GPT-4, the quality of the sampled text tends to decrease as the output gets longer, possibly due to limitations of the attention mechanism.

However, recently OpenAI announced a model that supports 64K output (though it is not yet available for use), so it is possible that this model has improvements.

https://openai.com/gpt-4o-long-output/


Hi @dignity_for_all,

I read the link about the alpha program. Do you know how to request an interest to OpenAI?

thanks a lot!

It's unfortunate, but no one knows who can use this kind of alpha program, whether it will become a product in the future, and so on.

Even if you want to find a way to show interest to OpenAI, it can be difficult to contact them directly.

It is likely that only a select few, such as large corporations or researchers at academic institutions, have access.
If the 64K output model becomes a product in the future, it might become available as a beta version.

However, even if you can’t use the 64K output model, you can achieve your goal with creative approaches, such as generating text in segments and combining them into one book.

In fact, there are people who have published books using AI output.
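For instance, here is a rough sketch of that segment-and-combine approach (the helper function, chapter titles, and prompts are made up for illustration, assuming the openai Python SDK):

```python
# Sketch only: generate a book chapter by chapter and stitch the pieces
# together, instead of asking for one oversized completion.
from openai import OpenAI

client = OpenAI()

def generate_chapter(title: str) -> str:
    # Hypothetical helper: one chapter per request keeps each completion small.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are writing one chapter of a book about ChatGPT."},
            {"role": "user", "content": f"Write the chapter titled: {title}"},
        ],
        max_tokens=3000,
        temperature=0.5,
    )
    return response.choices[0].message.content

chapter_titles = ["Introduction", "How ChatGPT Works", "Prompting Techniques"]
book = "\n\n".join(generate_chapter(title) for title in chapter_titles)

with open("book.txt", "w", encoding="utf-8") as f:
    f.write(book)
```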

How about an AI model that at least has a published path to gaining access (having paid a bunch of money to OpenAI and grown to Tier-5)?

o1-mini has an advertised (but not delivered in any attempt I’ve made) 64k output:

| MODEL | DESCRIPTION | CONTEXT WINDOW | MAX OUTPUT TOKENS | TRAINING DATA |
|---|---|---|---|---|
| o1-preview | Points to the most recent snapshot of the o1 model: o1-preview-2024-09-12 | 128,000 | 32,768 | Up to Oct 2023 |
| o1-mini | Points to the most recent o1-mini snapshot: o1-mini-2024-09-12 | 128,000 | 65,536 | Up to Oct 2023 |

The figure instead aligns more closely with "billable output tokens you don't receive" per internal iteration, because this AI will still find a way to stop producing. If you have it repeat something back exactly from a source, you might be able to go over the less-than-12k I have gotten by prompting for justified, non-refused, non-stop creative generation.
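For reference, a sketch of how you would request that advertised budget from o1-mini (assuming the openai Python SDK; whether anywhere near 64k actually comes back is another matter, as described above):

```python
# Sketch only: ask o1-mini for its advertised 65,536-token output budget.
# Note the o1 models take max_completion_tokens (which also covers hidden
# reasoning tokens), not max_tokens.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-mini",
    messages=[
        {"role": "user", "content": "Write as long and detailed an essay as you can about ChatGPT."}
    ],
    max_completion_tokens=65536,
)

print("completion tokens billed:", response.usage.completion_tokens)
print(response.choices[0].message.content)
```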