Gpt-3.5-turbo-16k Maximum Response Length

Could someone provide a prompt which will generate the longest gpt-3.5-turbo-16k response possible? The responses I’m receiving are only approx 1500 tokens no matter what I use as a prompt.

Even in the playground when I select ‘Chat beta’ and ‘gpt-3.5-turbo-16k’, the Maximum Length slider only goes up to 2048.


Have you tried adding a direction on the actual word count? I found that beefed up my outputs sometimes, though it depends. Giving it an explicit target in characters, words, or tokens could be worth a try. If it helps, please let us know, because any input on this would help me too!


Thanks SteveOk. Yes, I have tried word counts, but the model ignores them and still returns around 1500 tokens.


$messages = array(
    array("role" => "user", "content" => "Create a complete ebook called 'How to enhance your business using AI'. We are using a 16k OpenAI token model so please write the full ebook using the maximum number of words possible.")
);

$data = array(
    "messages" => $messages,
    "model" => 'gpt-3.5-turbo-16k',
    "temperature" => 0.25,
    "max_tokens" => 15500,
    "top_p" => 1,
    "frequency_penalty" => 0,
    "presence_penalty" => 0,
    "stop" => ["STOP"]
);

That returns a table of contents and chapters which are approx 50 words each.


So in the context of 16k tokens, the real capability here is being able to continuously send more context through the chat with each subsequent query.

Think of it more as the memory having improved rather than the output having increased. Before, after six or so messages (depending on their length), ChatGPT would begin to forget the context from chat message #1: if you gave it certain instructions at the beginning, it wouldn't follow them anymore the further you went into the conversation. This was because of the token limit. Each time you send a new query, the previous chat history is sent along as context for developing its next answer.

And with the limit in place, only the last X words that fit within the token budget get sent. So with the increased token length, you are able to have longer conversations with improved memory, rather than longer individual responses. It is true that it can provide more length, but you will have to continually seed it for more of what you want, and as the token limit increases, it will be able to sustain more coherent, longer conversations.
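The trimming JustinC describes can be sketched in a few lines. This is a rough illustration in Python (the thread's examples are PHP, but the idea is language-agnostic), using a crude ~4-characters-per-token estimate in place of a real tokenizer; the function names and numbers here are my own, not anything from the API:

```python
# Sketch: keep only the most recent messages that fit the context budget.
# Token counts are rough estimates (~4 characters per token); a real
# implementation would use an actual tokenizer.

def estimate_tokens(text):
    return max(1, len(text) // 4)

def trim_history(messages, context_limit, reserve_for_completion):
    budget = context_limit - reserve_for_completion
    kept = []
    used = 0
    # Walk backwards so the newest messages survive truncation.
    for msg in reversed(messages):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = [
    {"role": "user", "content": "Always answer in formal English." * 10},
    {"role": "assistant", "content": "Understood." * 5},
    {"role": "user", "content": "Summarise our discussion so far."},
]
# With a tight budget, the oldest (instruction-bearing) message is dropped
# first, which is exactly the "forgetting" behaviour described above.
trimmed = trim_history(history, context_limit=16385, reserve_for_completion=16300)
```

With a bigger context window, the budget grows and older instructions survive longer, which is why the 16k model "remembers" better rather than writing more.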


@JustinC that’s interesting. Does it mean that the engine’s working memory is like a trailing window up to its max token threshold, based on the sum of all of its outputs, or of its outputs plus the user input prompts?

Big thanks for your explanation, JustinC. I assume this is the same for the larger GPT-4 models. Is there talk of enabling longer responses/output, rather than just a benefit for memory/input?

Does anyone know of a way to force longer responses from the 16k model? Or, as you said JustinC, is it simply not designed to provide longer responses, and there is nothing you can do to force a 10k-token response, for example?


After doing some tests with this, it seems that GPT is just wired to provide short responses, no matter how long you tell it to output.

One workaround was something like this, where I’d ask it to split something into several parts and then elaborate on those parts: AI Impact: Challenges & Opportunities


Requesting multiple paragraphs seems to work quite well. For example…

$messages = array(
    array("role" => "user", "content" => "Write 120 paragraphs for an article called 'How to make money using AI'. Each paragraph should be 2-4 sentences in length. This should be a how-to guide with practical information.")
);

$data = array(
    "messages" => $messages,
    "model" => 'gpt-3.5-turbo-16k',
    "temperature" => 0.25,
    "max_tokens" => 15500,
    "top_p" => 1,
    "frequency_penalty" => 0,
    "presence_penalty" => 0,
    "stop" => ["STOP"]
);

I just ran it asking for 120 paragraphs and it returned a 5748 token response!


My guess is that the input and output layers are fixed at 1024 tokens even when there is a context window of 16k or 32k. For a transformer to have bigger layers than before requires new training, which is too expensive at the overall size of these models.

OpenAI might use some kind of batching technique to support bigger input windows by chunking the inputs/memory into pieces, and maybe some kind of per-conversation embedding technique. This might be why the GPT-4 models are more expensive than the smaller ones: with a context of 16k, I think they need to make several internal requests before returning the response.


Thanks @Jeffers for this thread, @JustinC for a very helpful explanation and @smuzani for sharing a worked example which I’ve copied - great to see this work with 3.5 (you didn’t even need the more expensive 16k version!) . Does it work as well using the API?

I’m trying to summarise ‘long read’ articles e.g. approx 10k words. Ideally I’m looking for ~ 1000 words (10% of original length) as the summary. I’m also asking for the key topics to be extracted. I’ve been getting some strange results using 16k API.

Any advice on pros & cons of using 16k API versus chunking with original 3.5 API? The ‘mini-summaries’ from 3.5 are not always cohesive when stitched back together, but using overlapping input, or tagging mini-summary#1 on to input#2, increases costs and may not improve results much.
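The chunk-and-stitch approach being weighed here can be sketched as follows. This is a rough illustration only, with a placeholder `summarise` function standing in for the actual API call, and the chunk size and 200-word overlap are illustrative choices, not recommendations from the thread:

```python
# Sketch: split a long article into overlapping chunks, summarise each
# chunk, then re-summarise the stitched mini-summaries. Overlap gives each
# mini-summary some context from the previous chunk, which helps cohesion
# at the cost of extra input tokens.

def chunk_text(words, chunk_size=3000, overlap=200):
    """Split a list of words into chunks overlapping by `overlap` words."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

def summarise(text):
    # Placeholder: a real implementation would send `text` to the API with
    # a "summarise this in ~N words" prompt.
    return text[:60] + "..."

article_words = ("word " * 10000).split()     # stand-in for a ~10k-word article
chunks = chunk_text(article_words)
mini_summaries = [summarise(c) for c in chunks]
final_summary = summarise(" ".join(mini_summaries))
```

The trade-off mentioned above shows up directly in this sketch: larger overlaps improve cohesion between mini-summaries but increase the total input tokens billed.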

Any suggestions greatly appreciated.


Hi! I am having the very same behaviour. I will try to use the new “plugins”/“functions” feature to maybe create an approach similar to the new feature in the official ChatGPT chat, where a “generate more” button pops up. Maybe this helps, and someone of you can try it too.


While this makes sense in terms of chat - they have now released 16k on the API. The API does not have any chat history. I’m prepared to pay the extra costs to upload a large article, but I need to be able to produce decent output to make it worth it.


Can somebody please explain this math(s)?
Using API: Tokens: 16384 Model: gpt-3.5-turbo-16k

Prompt: Write 30 paragraphs that summarise the key takeaways of the article below. Each paragraph should be 2-4 sentences in length.
Response: This model’s maximum context length is 16385 tokens. However, you requested 21864 tokens (5480 in the messages, 16384 in the completion). Please reduce the length of the messages or completion.

New Prompt: Write 3 paragraphs…

This model’s maximum context length is 16385 tokens. However, you requested 21864 tokens (5480 in the messages, 16384 in the completion). Please reduce the length of the messages or completion.

By my reckoning, a paragraph of 2-4 sentences is around 100 tokens.
The response should easily be able to contain 30 short paragraphs.
What is the point of a large model that can’t even create some summary points for an article?

Maybe subtract the 5480 tokens of your message from the max_tokens value?

16k means messages + completion combined, so you won’t get a 16k answer.
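The error above follows directly from that: requesting max_tokens = 16384 on top of a 5480-token prompt asks for 21,864 tokens in a 16,385-token window. max_tokens has to be computed per request. A minimal sketch (the margin value is an arbitrary safety buffer I've added for message-formatting overhead, not anything mandated by the API):

```python
# The context window covers prompt + completion combined, so the completion
# budget is whatever is left after the prompt.

CONTEXT_LIMIT = 16385  # gpt-3.5-turbo-16k

def safe_max_tokens(prompt_tokens, context_limit=CONTEXT_LIMIT, margin=50):
    """Largest completion that still fits alongside the prompt.
    `margin` leaves a little room for message-formatting overhead."""
    return context_limit - prompt_tokens - margin

# For the 5480-token article above, this leaves 10855 tokens for the
# completion instead of triggering the context-length error.
max_completion = safe_max_tokens(5480)
```

With ~100 tokens per short paragraph, that budget comfortably fits the 30 paragraphs requested.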

Use a framing prompt to solve it, like this: “Write a paper about climate change and generate an outline first, which includes 13 chapters, with each chapter consisting of about 1000 words.” If you use the API, make repeated calls in the process.

Have you found a way to deal with it? I have tried so much, and nothing works as of now.

Yeah, did you click the link? It goes to a prompt that works:

It appears to be trained to stop writing at around 1500 words. It’ll write a conclusion and close out a paragraph, poem, or whatever at around that point, no matter how you instruct it.

So the trick is to ask it to write several parts of around 1500 words in length and then just combine it into something bigger.
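That trick can be sketched as a simple loop: one call per part, each staying under the model's natural ceiling, then concatenate. `ask_model` below is a stub standing in for the real API call, and all the names and the section count are illustrative:

```python
# Sketch of the "write it in parts" trick: instead of one huge request,
# expand each section in a separate call and join the results.

def ask_model(prompt):
    # Placeholder: a real implementation would call the chat API here
    # and return the assistant's reply.
    return f"[~1500-word draft for: {prompt}]"

def write_long_document(topic, num_sections=5):
    # In practice the outline itself could come from a first model call.
    outline = [f"Part {i + 1} of '{topic}'" for i in range(num_sections)]
    sections = []
    for heading in outline:
        # Each call stays under the model's natural ~1500-word ceiling.
        sections.append(ask_model(f"Write about 1500 words for {heading}."))
    return "\n\n".join(sections)

document = write_long_document("How AI impacts business")
```

Five ~1500-word parts yield roughly a 7500-word piece, which is far beyond what a single request returns.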


The whole point is to summarise a large article - the 5480 is uploading it. If a paragraph is about 100 tokens - there should be space to generate about 100 of them within the token limit.

I couldn’t agree more with @JustinC

The larger context, even for GPT-4-32k, is more about input context, not output length. To get more output, you need to coax it out of the model by hitting the model many times.

But the larger window is great when you need more input context without resorting to excessive truncation of the input.

But I still don’t have any hard data on whether they increased the number of attention heads for the larger context, or if they are diluting the attention over the larger context. Anyone?
