Question about token limit differences in API vs Chat

Large language models have a fixed context window, so the prompt can only be so long. That's why the API tells you that you are over the limit. You don't hit that problem on the website because it implements workarounds for it behind the scenes.

The well-known methods to deal with that are as follows (rough code sketches for each one follow the list):

  1. Sliding context window - if the chat grows beyond 4000 tokens, you send only the last 4000 tokens to the API, so you are never over the limit. The disadvantage is that the model will not remember anything from before those 4000 tokens.
  2. Embeddings - you take the text of the previous conversation, split it into chunks, and use the embeddings endpoint to find the chunks that are semantically similar to the latest message. You include those chunks in the prompt. That way, the AI assistant has some “long-term memory”, because it remembers the parts of the conversation that are relevant (semantically similar) to the latest message.
  3. Summarization - you summarize the earlier parts of the conversation (with an additional request), optionally summarize those summaries recursively, and include the result in the prompt. That way, the AI assistant has some “long-term” memory as well.
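
Here is a minimal sketch of the sliding-window idea, assuming the chat is stored as a list of `{"role", "content"}` dicts and using `tiktoken` for counting; the 4000-token budget and the per-message overhead are just placeholder numbers:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by the gpt-3.5/gpt-4 chat models

def message_tokens(message):
    # Rough per-message count; the real chat format adds a few extra tokens per message.
    return len(enc.encode(message["content"])) + 4

def sliding_window(messages, budget=4000):
    """Keep only the most recent messages that fit into the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):        # walk backwards from the newest message
        cost = message_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))           # restore chronological order
```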
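
And a rough sketch of the embeddings approach, assuming the current `openai` Python client and `numpy`; the model name, chunking, and `top_k` value are arbitrary choices, not anything the API requires:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def relevant_chunks(history_chunks, latest_message, top_k=3):
    """Return the chunks of earlier conversation most similar to the latest message."""
    chunk_vecs = embed(history_chunks)
    query_vec = embed([latest_message])[0]
    # Cosine similarity between the latest message and every earlier chunk.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    best = np.argsort(sims)[::-1][:top_k]
    return [history_chunks[i] for i in best]
```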
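
The summarization approach could look like this, again assuming the current `openai` Python client; the prompt wording, `gpt-3.5-turbo`, and the `recent_count` cutoff are just examples:

```python
from openai import OpenAI

client = OpenAI()

def summarize(messages):
    """Ask the model to compress earlier conversation into a short summary."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Summarize this conversation in a few sentences, keeping any facts the assistant may need later."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content

def build_prompt(messages, recent_count=6):
    """Prepend a summary of the older messages, then keep the most recent ones verbatim."""
    older, recent = messages[:-recent_count], messages[-recent_count:]
    summary = summarize(older) if older else ""
    system = {"role": "system", "content": f"Summary of the earlier conversation: {summary}"}
    return [system] + recent
```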

You can use the LangChain library to help you implement these approaches faster (see the sketch below).
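
For example, the summarization idea with LangChain's built-in conversation memory might look like this, assuming a legacy `langchain` 0.0.x install (newer releases have moved these classes into separate packages):

```python
from langchain.chains import ConversationChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationSummaryMemory

llm = ChatOpenAI(model_name="gpt-3.5-turbo")

# The memory object summarizes older turns automatically before each new request.
chain = ConversationChain(llm=llm, memory=ConversationSummaryMemory(llm=llm))

chain.predict(input="Hi, my name is Ada and I'm building a chatbot.")
chain.predict(input="What did I say my name was?")  # answered from the summary memory
```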
