Question about token limit differences in API vs Chat

Why is it that when I copy and paste a prompt into the website it produces an output just fine, but when I take that same prompt to the API it says I'm over the limit? Is there any way to get the API to work like the chat does and give me the output it produces without bumping up against a max?

Large language models have a context window, so there is a limit on how long the prompt can be. That's why the API tells you that you are over the limit. However, there are ways to deal with that problem, and the website implements them, which is why you don't run into it there.

The well-known methods to deal with that are as follows:

  1. Sliding context window - if the chat grows beyond 4000 tokens, you send only the last 4000 tokens to the API, so you stay under the limit. The disadvantage is that the model will not remember anything from before those 4000 tokens.
  2. Embeddings - you go through the text of the previous conversation, divide it into parts, and use the embeddings endpoint to find the parts that are semantically similar to the last part of the conversation. You include those similar parts in the prompt. That way, the AI assistant has some “long-term memory,” because it remembers the parts of the conversation that are relevant (or semantically similar) to the most recent part.
  3. Summarization - you summarize the previous parts of the conversation (with an additional request), optionally summarize those summaries recursively, and include the summary in the prompt. That way, the AI assistant gets some “long-term memory” as well.
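A minimal sketch of the first approach (sliding context window), assuming the chat history is a list of message strings. The whitespace token count here is a rough stand-in; a real implementation would count tokens with the model's tokenizer (e.g. tiktoken):

```python
def truncate_history(messages, max_tokens=4000):
    """Keep only the most recent messages whose combined (approximate)
    token count fits within max_tokens. Older messages are dropped."""
    def approx_tokens(text):
        # Rough stand-in for a real tokenizer such as tiktoken.
        return len(text.split())

    kept = []
    total = 0
    # Walk backwards from the newest message, keeping as many as fit.
    for msg in reversed(messages):
        cost = approx_tokens(msg)
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))
```

The window always keeps the newest messages, so anything older than the cutoff is simply forgotten, which is exactly the trade-off described in point 1.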

You can use the LangChain library to implement this faster.


Can you share the exact error?


I see. Thank you for your detailed response.

What I’m trying to do is take about 1000 responses from a survey and have the model generate categories of responses. Do you think the embeddings approach works well for this?


The message I’m given by the API is:

“This model’s maximum context length is 4097 tokens. However, you requested 5046 tokens (2046 in the messages, 3000 in the completion). Please reduce the length of the messages or completion.”
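The error is plain arithmetic: the prompt tokens plus the requested completion budget (`max_tokens`) must fit inside the model's 4097-token context. A quick sketch using the numbers from the error above:

```python
CONTEXT_LIMIT = 4097  # the model's maximum context length, per the error message

def max_completion_tokens(prompt_tokens, context_limit=CONTEXT_LIMIT):
    """Largest max_tokens value you can request without exceeding the context."""
    return max(context_limit - prompt_tokens, 0)

# From the error: 2046 prompt tokens + 3000 requested completion = 5046 > 4097.
# Lowering max_tokens to at most 4097 - 2046 = 2051 makes the request fit.
budget = max_completion_tokens(2046)
```

So the immediate fix is to either shorten the messages or drop the requested `max_tokens` to the remaining budget.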

So you want to classify the survey responses into categories? Yes, I think it would be possible to solve that problem with embeddings.

I suggest having a look at this:

Especially the parts that cover classification and clustering. If the categories of responses are known in advance, it is a classification problem; if they are not, it is a clustering problem. Also check out the links under “zero-shot classification” and “clustering” to see examples of how it can be done.
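For the known-categories case, a minimal sketch of zero-shot classification with embeddings: embed each category label and each response, then assign the response to the most similar category. The vectors below are toy values for illustration; in practice both would come from the embeddings endpoint:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(response_vec, category_vecs):
    """Assign the response to the category whose embedding is most similar."""
    return max(category_vecs,
               key=lambda name: cosine_similarity(response_vec, category_vecs[name]))

# Toy 2-d "embeddings" standing in for real embedding vectors.
categories = {"price": [1.0, 0.0], "support": [0.0, 1.0]}
label = classify([0.9, 0.1], categories)  # closest to "price"
```

With about 1000 responses, you would embed each response once, then run this comparison (or a clustering algorithm, if the categories are unknown) over the cached vectors.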
