Trying to understand why I'm hitting token limit with API

Hi all - I’ve been reading past posts but I can’t quite grasp the answer.

Right now I’m sending 22,891 tokens via the API and asking for around 1,900 to 3,000 tokens of content back (roughly 1,500+ words). I can’t get it to give me more than around 600 to 800 words, and sometimes I just get an error. I assume all of this is because of the large prompts I’m using.

I’m experimenting with the gpt-4-turbo-preview model and thought I’d be able to use its 128k-token context window, but I also know the docs mention a 4,096-token output limit…

I am on Tier 2 currently.

What am I missing? Is there anything I can do to make this work with the number of tokens I’m sending? I know it’s a lot, but surely it’s nowhere near what enterprise-type companies are sending?

I’d appreciate any help thank you!

Hi @mattrosine

It’d really help us help you if you could share the finish_reason from the response and the API call you’re making, with all its hyperparameters.

Also, the usage object in the response tells you the exact input and output token counts.
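
For reference, here’s roughly how you’d read both from the response using the Python SDK — a minimal sketch, with placeholder model and prompts:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",  # placeholder
    messages=[
        {"role": "system", "content": "<system_prompt>"},
        {"role": "user", "content": "<prompt>"},
    ],
)

choice = response.choices[0]
print(choice.finish_reason)  # "stop" = model chose to end, "length" = cut off by max_tokens
print(response.usage.prompt_tokens, response.usage.completion_tokens)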

Is it possible the returned answer is just that long…?

The limitation is there because the model has intense training, or perhaps even some injected governor, that compels it to stop writing and wrap up its output. The cut-off also seems planned in advance: ask for 40 descriptions and each will be half the length of what you’d get asking for 20. Write an otherwise infallible prompt that turns lines of input into processed lines of output, and you will still be cut off arbitrarily.

22k tokens of input can be processed (and billed) almost instantly into a hidden state thanks to attention-masking techniques, but producing the following tokens takes computation that, apparently, they don’t even want you to pay for. You don’t get a different model from the one now extensively trained to make ChatGPT less expensive for OpenAI.

Hey, thanks for the reply. Here is the error I get:

Workflow error - The service Master OpenAI API - API Call just returned an error (HTTP 400). Please consult their documentation to ensure your call is setup properly. Raw error: { "error": { "message": "This model's maximum context length is 8192 tokens. However, your messages resulted in 22561 tokens. Please reduce the length of the messages.", "type": "invalid_request_error", "param": "messages", "code": "context_length_exceeded" } }

I think the max context length of 8192 tokens is from when I was experimenting with one of the other models.
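
A rough way to double-check the prompt size locally is tiktoken — a sketch only; the per-message overhead constants are approximations, so the count may differ slightly from what the API reports:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4-turbo-preview")  # resolves to the cl100k_base encoding

def count_message_tokens(messages):
    total = 0
    for message in messages:
        total += 4  # approximate per-message overhead (role, separators)
        total += len(enc.encode(message["content"]))
    return total + 2  # approximate priming for the assistant reply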

Here is the API call:

{
  "model": "<model>",
  "messages": [
      {"role": "system", "content": "<system_prompt>"},
   	  {"role": "user", "content": "<user_prompt1>"},
      {"role": "assistant", "content": "<assistant_prompt1>"},
      {"role": "user", "content": "<user_prompt1a>"},
      {"role": "user", "content": "<user_prompt2>"},
      {"role": "assistant", "content": "<assistant_prompt2>"},
      {"role": "user", "content": "<user_prompt2a>"},
      {"role": "user", "content": "<user_prompt3>"},
      {"role": "assistant", "content": "<assistant_prompt3>"},
      {"role": "user", "content": "<user_prompt3a>"},
      {"role": "user", "content": "<user_prompt4>"},
      {"role": "assistant", "content": "<assistant_prompt4>"},
      {"role": "user", "content": "<user_prompt5>"},
      {"role": "assistant", "content": "<assistant_prompt5>"},
      {"role": "user", "content": "<user_prompt6>"},
      {"role": "assistant", "content": "<assistant_prompt6>"},
      {"role": "user", "content": "<prompt>"}
  ],
  "temperature": 1.7,
  "max_tokens": 4096,
  "top_p": 0.5,
  "frequency_penalty": 1,
  "presence_penalty": 0.7
}

As I said, the API is working; I’m just not able to get more than about 800 words out of it.

Does the above help at all?

Oh, that’s interesting. I hadn’t ever considered that the limit was optional… I can remove the line in the API call and I’ll be free of that restriction?

When I’m using the Playground though, am I right in thinking the limit (the slider bar on the right) is not something you can remove…?

I think so.

If I remember correctly, you can top up this value to 128,000 in the Playground.

It may also be because this model was trained to wrap up its output after roughly 1,000 tokens. This is not related to its maximum output limit, but to its internal training (this is what _j mentioned).

You don’t need to set max_tokens on chat models; if you omit it, the model can use all of the context length left over after the input tokens (up to the model’s own output cap).
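
If you do want a cap, or your tooling requires the field to be present, one approach is to only send it when needed and clamp it to the model’s output limit — a sketch assuming the Python SDK; the helper name and the 4,096 cap for gpt-4-turbo-preview are my own choices:

from openai import OpenAI

client = OpenAI()
OUTPUT_CAP = 4096  # gpt-4-turbo-preview's hard output limit

def create_completion(messages, desired_max=None):
    kwargs = {"model": "gpt-4-turbo-preview", "messages": messages}
    if desired_max is not None:
        # only send max_tokens when a cap is actually wanted, never above the model's limit
        kwargs["max_tokens"] = min(desired_max, OUTPUT_CAP)
    return client.chat.completions.create(**kwargs)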

We can’t really ask the model to reply in a specific number of tokens, especially if that number is large.

I also noticed that you’re using an unusually high value for temperature.

Apart from that, the number of messages seems quite high given that this is the first API call. There are also successive messages with the same role, which, although not wrong, likely isn’t the pattern the model was trained on.

I’d recommend reducing the number of messages: send the system prompt followed by a user-assistant grounding pair (if you have one), which is just a one-shot example consisting of a user message and the expected model response. Then simply add the user message you want the model to process.
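
Concretely, something along these lines — a sketch; the placeholder names just mirror the ones in your call:

messages = [
    {"role": "system", "content": "<system_prompt>"},
    # one-shot grounding pair: an example brief and the kind of output you expect
    {"role": "user", "content": "<user_prompt1>"},
    {"role": "assistant", "content": "<assistant_prompt1>"},
    # the actual request to process
    {"role": "user", "content": "<prompt>"},
]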

So if I don’t need to set the ‘max_tokens’, what do I do with that line in the API call? If I remove it, the API breaks. If I set it to be higher than what the max is supposed to be (4096 in this example), it doesn’t work.

The tool I’m building is a copywriting/text-generation platform, and the high temperature value is what I’ve found in my experiments to give the best results for my use case.

This wouldn’t allow me to achieve the same results I’ve been able to achieve in the Playground. My whole strategy relies on few-shot prompting with a long system instruction that includes several long content examples, followed by a series of assistant/user prompts which I have written.

I think what I struggle to understand, as a relative beginner who is desperately trying to learn, is why the Playground would allow me to mix the different assistant/user prompts if that won’t work in practical use?
