Question about token limit differences in API vs Chat

Large language models have a fixed context window, so the prompt has a limit on how long it can be. That's why the API tells you that you are over the limit. The website doesn't show this error because it implements workarounds for the problem behind the scenes.

The well-known methods to deal with this are as follows:

  1. Sliding context window - if the chat exceeds 4000 tokens, you send only the last 4000 tokens to the API, so you stay under the limit. The disadvantage is that the model will not remember anything from before those 4000 tokens.
  2. Embeddings - you go through the text of the previous conversation, divide it into parts, and use the embeddings endpoint to find the parts that are semantically similar to the latest message. You include those similar parts in the prompt. That way, the AI assistant has some "long-term memory", because it remembers the parts of the conversation that are relevant (i.e. semantically similar) to the latest message.
  3. Summarization - you summarize the previous parts of the conversation (with an additional API request), optionally recursively summarize those summaries, and include the resulting summary in the prompt. That way, the AI assistant also gets some "long-term" memory.
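The sliding-window idea (method 1) can be sketched in a few lines. This is a minimal illustration, not production code: `count_tokens` here is a crude word-count stand-in, where a real implementation would use the model's actual tokenizer (e.g. the `tiktoken` package for OpenAI models), and the 4000-token budget is just the example limit from above.

```python
def count_tokens(text: str) -> int:
    # Rough proxy for illustration only; a real implementation would use
    # the model's tokenizer (e.g. tiktoken) to count tokens accurately.
    return len(text.split())

def sliding_window(messages: list[str], limit: int = 4000) -> list[str]:
    """Keep only the most recent messages that fit within the token budget."""
    kept: list[str] = []
    total = 0
    # Walk backwards from the newest message, stopping once the budget is full.
    for message in reversed(messages):
        tokens = count_tokens(message)
        if total + tokens > limit:
            break
        total += tokens
        kept.append(message)
    # Restore chronological order before sending to the API.
    return list(reversed(kept))
```

Older messages are simply dropped, which is exactly the trade-off described above: anything outside the window is forgotten.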

You can use the LangChain library to implement these approaches faster.
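The summarization approach (method 3) can be sketched as a recursive compression loop. The `summarize` function below is a deterministic placeholder that just truncates; in a real implementation it would be an extra request to the chat API with a "summarize this conversation" instruction, and the word budget is an arbitrary example.

```python
def summarize(text: str, max_words: int = 20) -> str:
    # Placeholder: a real implementation would send `text` to the chat API
    # with a summarization instruction. Here we just truncate for illustration.
    return " ".join(text.split()[:max_words])

def compress_history(turns: list[str], max_words: int = 20) -> str:
    """Summarize each turn, then recursively summarize the combined
    summaries until the result fits within the word budget."""
    summary = " ".join(summarize(turn, max_words) for turn in turns)
    while len(summary.split()) > max_words:
        summary = summarize(summary, max_words)
    return summary
```

The final summary is included in the prompt in place of the full history, so the conversation always fits the context window at the cost of losing detail.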