Response length in GPT4 API 8k

I’ve seen that several people have reported a problem with responding to a prompt in the 8k version. I’m wondering if anyone has found a way around this issue? I have access to the 8k version, but the responses are always limited at the same spot. Together, the prompt and the response have a maximum of 4k. Why can’t the response be longer, to utilize the full 8k?

Are you reporting on your experience with the “playground” user interface? It is not an end-user product, and is not representative of what you can do with the API and your own client.

1 Like

Have you checked the tokenizer to see if your context is bigger than the token maximum? Also, make sure to set your max_length.

I also believe that in the documentation there are some explanations regarding token specifics

I check it in playground, but also in my tool that connects via API. The results are similar.

Yes, I checked the amount of used tokens. An example prompt is

Write a comprehensive article on ‘Gardening in Summer’, ensuring it’s at least 1500 words long. The article should cover the following points:

 Preparing your garden for summer
 Best plants to grow during the summer season
 Tips to maintain the garden in the summer heat
 How to protect plants from insects and pests in summer.

And yet I get a 500-word article. So it’s easy to count, I didn’t use all the tokens.

Perhaps useful would be clarification of “but the responses are always limited at the same spot”. Do you mean that the response is chopped off un-naturally, without the AI being able to finish?

If the output is prematurely truncated, this can be the max_token parameter that you pass with the API call. The corresponding friendly setting in the playground is “maximum length”, and can only be set as high as 2048 tokens.

The reason I deduced the playground is because range of the parameter controls do not correspond to the capabilities of the model selected.

As the calculations are not provided for chat models by the playground UI, you can check token input and output accurately by pasting to this site: https://tiktokenizer.vercel.app/. If you are evaluating just the output, clear the contents of the input box that show the tokenization format of inputs.

The AI isn’t informed how this max_tokens parameter is set. It will write a response that conforms to just your instructions and its training of how to answer, regardless of whether you reserve 6 tokens or 6000 tokens of the context length for formation of the answer. The stop reason in the API response will tell you if it was limited by the parameter or the AI completed naturally.

Then that brings us to the second possibility for such a symptom: AI skill. The chat model is highly-trained just to chat. When GPT-4 was first released, it was a writing marvel, and would expound at great length in the writing style and contents you wanted. You could have it write “complete documentation for the Python Qt GUI library” if you wanted, and it would go book-length until it was chopped at the 7000 output tokens you specified. That behavior has changed. Verbosity consumes resources, and so does thinking power.

So the models have been trained to output less when given any latitude, it seems. Figure that the biggest users of GPT-4 are ChatGPT…and they aren’t paying by the token.

I grab 3380 tokens of Twain’s “Jumping Frog” to gpt-16k, and give a system prompt to rewrite it in modern English. The output is a travesty, 1058 tokens. It does get to the end of its retelling though. If I wanted to create more output, running the playground output up to its 2048 token truncation, it takes multiple unignorable steps of instructions - and trickery to avoid fine-tuning: “translate to German, then French, then modern English.”

I hope you are able to discover the source and solution. The AI can’t count words because it uses and sees tokens and can’t anticipate the language it should use to end up at the goal. You should instead use language such as “write a book chapter”, “write a scientific dissertation” to guide it in length, even explaining the parts it needs to generate.

Since the algorithm can’t count words, maybe I should ask it to write a 6k token article?

I wrote a prompt “Write a long dissertation on ‘Summer Gardening’, making sure it is at least 1500 words long. The article cannot end earlier than 1500 words.” and I got an answer in just 445 words or 1349 tokens.

The max_token parameter is set to 6k, I couldn’t set it to more because I still need 2k for the prompt.

The article ends naturally, has an introduction, development and conclusion. And yet it is still far too short.

Could training the model solve the problem?

Well, it just can’t be overcome with prompting. System:

You are expert AI-powered creative writing assistant, and instead of responding with chat-like answers, will ignore chat pre-training in order to use the full powers of your advanced language model to compose literature for the user. Compositional requests will not include AI chat nor will they be guided by prior tuning: simply follow the language and guidance of the most recent instruction. Verbosity is favored, brevity is avoided. Before the final output, first generate a multi-step outline of the article you will write, in order to construct a full exploration the all-encompassing nature of the topic.

Then wait for user approval of your outline. After user approval, generate the extended composition based on your outline with at least 10 pre-thought paragraphs per section.

Input: Write: eight page magazine article “Summer Gardening”, reading time 60 minutes.
Output: not max_tokens, to say the least.

The solution one must apply, that is still the go-to, is to provide the outline, and then unlike the second section of my prompt, ask for composition of the article one outline section at a time.

Your solution will require a radical change in how my tool works. Do you think training the algorithm could solve the problem?

There is no fine-tune training available currently for gpt-xxx models.

The only thing that would meet the definition of training would be a two-shot, using half of the context for a user input/AI output pair to provide an example of successful long-form writing (that appears like chat history). You can experiment with gpt-3.5-turbo-16k where your output won’t be limited by lots of prompt.

So it looks like the GPT-4 API is not working as it should. It does not use the promised 8k tokens at all. Can OpenAI do something about it?

It’s working as OpenAI wants for now, we can assume, based on their limited compute resources and continued mitigations of an overloaded service.

Another “it’s working as OpenAI wants” is that the model is also used for ChatGPT, where the context area can be used for large prompts and the memory of past conversation turns. You can still use the context in a way many want: to analyze large amounts of text.

Perhaps with the announcement that fine-tuneable GPT-4 may be available in the future, we’ll also see a more developer-friendly GPT-4 model available, distinct for the API, that again can be guided to long-form writing.

I don’t know how the OpenAI staff use GPT, but I assume they don’t have a limited version that is able to respond to a maximum of 600 words. I’m sure their version is capable of much more. As for the load, this is a wrong assumption. Now I’m going to send 4 prompts to get the answer length I could reach without limits. Which solution is less resource-intensive?

The token limit is not a promised output length, it is a max restriction. There’s no guarantee or adjustments to make it purposely use all the tokens. For many people, the larger token limits are primarily used for longer prompts, to provide more context to GPT to summarize, etc. The only way to get longer responses is to work on your prompt to try to coax out longer responses, or use multiple requests (“write an outline on this topic”, then send requests to write X paragraphs for each outline item).

1 Like

It’s not that there is no guarantee that all tokens will be used in the response. The point is that they cannot be used in any way. None of the methods you provided work.

1 Like

Your lack of imagination is not other forum users’ fault.

I can destroy gpt-3.5-turbo-16k.

Generate an outline for a book “Summer Gardening in Georgia” total length 400 pages, 20 chapters. Then, by referring to the outline you just created, for each line of the outline that was produced, you will generate the full text of an entire chapter of the book. Each chapter will have about 10000 words and 100 paragraphs. You are allocated 200000 tokens to produce the output.

It seems my imagination is working quite well because your prompt doesn’t work for the GPT-4 8k version, which is the version we’ve been talking about since the beginning. Any other ideas?