After doing some tests with this, it seems that GPT is just wired to provide short responses, no matter how long an output you ask for.

One workaround was something like this, where I’d ask it to split something into several parts and then elaborate on those parts: AI Impact: Challenges & Opportunities


Requesting multiple paragraphs seems to work quite well. For example…

$messages = array(
    array("role" => "user", "content" => "Write 120 paragraphs for an article called 'How to make money using AI'. Each paragraph should be 2-4 sentences in length. This should be a how to guide with practical information.")
);

$data = array(
    "messages" => $messages,
    "model" => 'gpt-3.5-turbo-16k',
    "temperature" => 0.25,
    "max_tokens" => 15500,
    "top_p" => 1,
    "frequency_penalty" => 0,
    "presence_penalty" => 0,
    "stop" => ["STOP"],
);
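In case it's useful, here is a minimal sketch of sending that $data payload to the chat completions endpoint with PHP cURL (assuming an $apiKey variable holds your key; error handling omitted):

$ch = curl_init("https://api.openai.com/v1/chat/completions");
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => array(
        "Content-Type: application/json",
        "Authorization: Bearer " . $apiKey,
    ),
    CURLOPT_POSTFIELDS => json_encode($data),
));

$response = json_decode(curl_exec($ch), true);
curl_close($ch);

// The generated article text is in the first choice of the response.
echo $response["choices"][0]["message"]["content"];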

I just ran it asking for 120 paragraphs and it returned a 5748 token response!


I guess that the input layers and the output layer are fixed at 1024 tokens, even if there is a context window of 16k or 32k. For a transformer to have bigger layers than before, it would need new training, which is too expensive given the overall size of the models.

OpenAI might use some kind of batching technique to get bigger input windows, chunking the inputs/memory into pieces, and maybe some kind of embedding technique per conversation. That could be why the GPT-4 models are more expensive than the smaller models. With a context of 16k, I think they need to do several requests internally before submitting the response.


Thanks @Jeffers for this thread, @JustinC for a very helpful explanation, and @smuzani for sharing a worked example which I've copied - great to see this work with 3.5 (you didn't even need the more expensive 16k version!). Does it work as well using the API?

I'm trying to summarise 'long read' articles, e.g. approx. 10k words. Ideally I'm looking for ~1000 words (10% of the original length) as the summary. I'm also asking for the key topics to be extracted. I've been getting some strange results using the 16k API.

Any advice on the pros & cons of using the 16k API versus chunking with the original 3.5 API? The 'mini-summaries' from 3.5 are not always cohesive when stitched back together, but using overlapping input, or tagging mini-summary #1 onto input #2, increases costs and may not improve results much.

Any suggestions greatly appreciated.
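For reference, a rough sketch of the chunk-and-stitch approach described above, assuming a hypothetical chat() helper that wraps the chat completions call (e.g. like the cURL example earlier in the thread); the chunk size, overlap, and prompts are only illustrative:

// chat($messages): hypothetical helper that POSTs the messages to
// /v1/chat/completions and returns the reply text.

$words     = preg_split('/\s+/', $article);
$chunkSize = 2500;  // illustrative: ~2500 words per chunk
$overlap   = 200;   // illustrative: carry 200 words into the next chunk for cohesion

$miniSummaries = array();
for ($i = 0; $i < count($words); $i += $chunkSize - $overlap) {
    $chunk = implode(' ', array_slice($words, $i, $chunkSize));
    $miniSummaries[] = chat(array(array(
        "role" => "user",
        "content" => "Summarise this section of a longer article in about 250 words:\n\n" . $chunk,
    )));
}

// Final pass: stitch the mini-summaries into one ~1000-word summary plus key topics.
$summary = chat(array(array(
    "role" => "user",
    "content" => "Combine these section summaries into a single cohesive summary of about 1000 words, "
        . "then list the key topics:\n\n" . implode("\n\n", $miniSummaries),
)));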


Hi! I am having the very same behaviour. I will try to use the new "plugins" / "function" feature to maybe create an approach similar to the new feature in the official ChatGPT chat, where a "generate more" button pops up… maybe this helps and someone of you can try it, too.


While this makes sense in terms of chat - they have now released 16k on the API. The API does not have any chat history. I’m prepared to pay the extra costs to upload a large article, but I need to be able to produce decent output to make it worth it.


Can somebody please explain this math(s)?
Using the API: max_tokens: 16384, Model: gpt-3.5-turbo-16k

Prompt: Write 30 paragraphs that summarise the key takeaways of the article below. Each paragraph should be 2-4 sentences in length.
Article: …
Response: This model's maximum context length is 16385 tokens. However, you requested 21864 tokens (5480 in the messages, 16384 in the completion). Please reduce the length of the messages or completion.

New Prompt: Write 3 paragraphs…

Response: This model's maximum context length is 16385 tokens. However, you requested 21864 tokens (5480 in the messages, 16384 in the completion). Please reduce the length of the messages or completion.

By my reckoning, a paragraph of 2-4 sentences is around 100 tokens.
The response should easily be able to contain 30 short paragraphs.
What is the point of a large model that can’t even create some summary points for an article?

Maybe subtract the 5480 tokens of your messages from the max_tokens value?

16k means messages + completion combined - you won't get a 16k answer.
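In other words, the completion has to fit in whatever the prompt leaves over. A quick sketch of the arithmetic, using the numbers from the error message above:

$contextLimit  = 16385;  // maximum context of gpt-3.5-turbo-16k (per the error message)
$promptTokens  = 5480;   // tokens already taken up by the article + instructions

// The completion can only use what the prompt leaves over:
$maxCompletion = $contextLimit - $promptTokens;  // 10905

// So "max_tokens" => 10905 (or less) is fine, while 16384 triggers the error above.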

Use a framing approach to solve it, like this: "Write a paper about climate change and generate an outline first, which includes 13 chapters, with each chapter consisting of about 1000 words." If you use the API, make repeated calls in the process.
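A rough sketch of that repeat-call pattern against the API, again assuming a hypothetical chat() helper around the chat completions endpoint; the prompts are only illustrative:

// chat($messages): hypothetical helper that calls /v1/chat/completions and returns the reply text.

// Call 1: generate the outline.
$outline = chat(array(array(
    "role" => "user",
    "content" => "Write an outline for a paper about climate change with 13 chapters. List only the chapter titles, one per line.",
)));

// Calls 2-14: one request per chapter (~1000 words each), then concatenate the parts.
$paper = "";
foreach (explode("\n", trim($outline)) as $chapterTitle) {
    $paper .= chat(array(array(
        "role" => "user",
        "content" => "Outline:\n" . $outline . "\n\nWrite the chapter \"" . $chapterTitle . "\" in about 1000 words.",
    ))) . "\n\n";
}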

Have you found a way to deal with it? I have tried so much and nothing works so far.

Yeah, did you click the link? It goes to a prompt that works:

It appears to be trained to stop writing at around 1500 words. It’ll write a conclusion, close a paragraph, poem, whatever at around that point, no matter how you instruct it.

So the trick is to ask it to write several parts of around 1500 words in length and then just combine it into something bigger.


The whole point is to summarise a large article - the 5480 tokens are the uploaded article itself. If a paragraph is about 100 tokens, there should be space to generate about 100 of them within the token limit.

I couldn’t agree more with @JustinC

The larger context, even for GPT-4-32k, is more about input context, not output length. To get more output, you need to coax it out of the model by hitting the model many times.

But the larger window is great when you need more input context without resorting to excessive truncation of the input.

But I still don’t have any hard data on whether they increased the number of attention heads for the larger context, or if they are diluting the attention over the larger context. Anyone?


I really need more input length to keep context in novel creation.

The output could even be 2-4k.

I made an app called PromptMinutes (Prompt Minutes from meetings etc.), available on the Apple App Store. Cut and paste your API key into the 'action' view, and you are good to go. You can record, transcribe, and summarize. My tests show an hour's worth of audio, transcription, and summary costs approx. USD 0.75. There is also a field to enter your custom prompt.


This is correct. Check out this working example:

[{"role":"system","content":"Step 1 - List 10 popular questions about generative AI.\nStep 2 - take the 1st question from the list from Step 1 and write a 1000 word article using markdown formatting, lists and tables where applicable.\nStep 3 - take the 2nd question from the list from Step 1 and write a 1000 word article using markdown formatting, lists and tables where applicable.\nStep 4 - take the 3rd question from the list from Step 1 and write a 1000 word article using markdown formatting, lists and tables where applicable.\nStep 5 - take the 4th question from the list from Step 1 and write a 1000 word article using markdown formatting.\nStep 6 - take the 5th question from the list from Step 1 and write a 1000 word article using markdown formatting, lists and tables where applicable.\nStep 7 - take the 6th question from the list from Step 1 and write a 1000 word article using markdown formatting, lists and tables where applicable.\nStep 8 - take the 7th question from the list from Step 1 and write a 1000 word article using markdown formatting, lists and tables where applicable.\nStep 9 - take the 8th question from the list from Step 1 and write a 1000 word article using markdown formatting, lists and tables where applicable.\nStep 10 - take the 9th question from the list from Step 1 and write a 1000 word article using markdown formatting, lists and tables where applicable.\nStep 11 - take the 10th question from the list from Step 1 and write a 1000 word article using markdown formatting, lists and tables where applicable."},
{"role":"user","content":"Execute steps 1-11"}]


The GPT-3.5-turbo-16k model has a maximum response length of around 1500 tokens, regardless of the prompt used. Even in the Playground, the Maximum Length slider only goes up to 2048 tokens. The model’s increased token limit primarily benefits the input context rather than the output length. To generate longer responses, you can try sending multiple queries in a chat-like format, providing additional context with each subsequent message. However, the model is not designed to produce excessively long responses, and there is currently no way to force a specific response length, such as a 10k response.
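For what it's worth, the "multiple queries in a chat-like format" idea can be sketched roughly like this, again assuming a hypothetical chat() helper around the API; the fixed five-call loop, topic placeholder, and continuation prompt are just illustrative:

// chat($messages): hypothetical helper that calls /v1/chat/completions and returns the reply text.

$messages = array(
    array("role" => "user", "content" => "Write a detailed, 5000-word guide on <your topic>."),
);

$article = "";
for ($i = 0; $i < 5; $i++) {
    $part = chat($messages);   // each call typically returns on the order of 1-2k tokens
    $article .= $part . "\n\n";

    // Feed the model its own output back and ask it to keep going.
    $messages[] = array("role" => "assistant", "content" => $part);
    $messages[] = array("role" => "user", "content" => "Continue exactly where you left off.");
}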

I'm not sure that is completely accurate, mosssmo - I can generate a 5000-token response with the 16k model.


Thanks. However, increased contextual memory and increased output are what most would infer from the new 16k token length.