How To Handle Token Limit Error

Hi all,

I’m facing a persistent issue with OpenAI’s 8192-token limit (using gpt-4o-mini), and I want to proactively handle the token limit error before it happens — not just catch it after it throws. Here’s the specific error:

BadRequestError: 400 This model’s maximum context length is 8192 tokens, however you requested 8342 tokens (8342 in your prompt; 0 for the completion).


:gear: My Use Case

I’m building a custom AI QA feature using LangChain and MongoDB vector search. The call flow looks like this:

createQACompletion → _invokeAgent → _callDefaultAgent

The issue arises specifically in _callDefaultAgent, inside the similaritySearch() method:

const similaritySearch = await vectorStore.similaritySearch(queryObj.text, 5, {
  preFilter: {
    businessId: {"$eq": self.businessId},
    type: {"$in": ["businessQuestions", "plan"]}
  }
});

After similarity search, I join the results + some context:

const context = [
  await self._getBusinessDetailContext(vectorStore, queryObj.text),
  ...similaritySearch.map(doc => doc.pageContent)
].filter(Boolean).join("\n\n");

Then I build the RunnableSequence chain and call chain.invoke(queryObj.text).


:red_exclamation_mark: Problem

  • Sometimes the similarity search + context exceeds the 8192 token limit.
  • The error kills the flow right at the similaritySearch or chain.invoke() stage.
  • I already tried counting tokens of queryObj.text, but that’s insufficient: I don’t know the final token count until after the context is built, and by then it’s too late.

:white_check_mark: What I Want

I don’t want to “ignore” the error or “check for specific strings” in the error message. I want a solid, proactive way to either:

  • Calculate the token count before calling chain.invoke(), including context + prompt + question (see the rough sketch below), or
  • Handle the error itself cleanly. I don’t want the other suggested workarounds like truncation; I just want to handle the error.
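
Roughly what I have in mind for the first option is something like this (just a sketch, assuming js-tiktoken for counting, that promptTemplate.format() gives me the final prompt string, and an illustrative MAX_PROMPT_TOKENS constant):

import { getEncoding } from "js-tiktoken";

const enc = getEncoding("o200k_base");   // encoding family used by gpt-4o models
const MAX_PROMPT_TOKENS = 8192;          // illustrative limit for this sketch

function countTokens(text) {
  return enc.encode(text).length;
}

// ...after context is built, before chain.invoke()...
const assembledPrompt = await promptTemplate.format({
  context,
  question: queryObj.text,
});

if (countTokens(assembledPrompt) > MAX_PROMPT_TOKENS) {
  // bail out with a friendly message instead of letting the API call fail
  throw new Error("Your input is too long — please shorten it.");
}

const answer = await chain.invoke(queryObj.text);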

:magnifying_glass_tilted_left: What I’ve Tried

  • Counting tokens of queryObj.text :white_check_mark:
  • Logging token usage in handleLLMEnd() :white_check_mark:
  • Using LangChain callbacks :white_check_mark:
  • But the error still occurs during the similarity search, before those callbacks are ever hit.

:brain: Relevant Code Summary

const vectorStore = new MongoDBAtlasVectorSearch(...);
const similaritySearch = await vectorStore.similaritySearch(...); // 🔥 Error likely starts here

const context = [
  await self._getBusinessDetailContext(...),
  ...similaritySearch.map(doc => doc.pageContent)
].filter(Boolean).join("\n\n");

const chain = RunnableSequence.from([
  { context: () => context, question: new RunnablePassthrough() },
  promptTemplate,
  chatModel
]);

const answer = await chain.invoke(queryObj.text); // 🔥 Throws 8192+ token error

Would really appreciate guidance or code suggestions. Thank you!

It is not gpt-4o-mini that has the 8k token limitation; it is the input limit of the embeddings model (or of the search feature powered by it). gpt-4o-mini itself has a context window of 128,000 tokens.

“Handle the error” while “don’t want to truncate the input”? If you send more than the model can accept, you will get an API error, so “handle” can only mean “don’t completely crash”: if you do not modify your technique, the request cannot succeed.

You will need to come up with a strategy for handling this. What I would suggest is a token counter: if the input would approach or exceed the embeddings model’s input limit (but not the limit of a language model such as gpt-4o-mini itself), make a summarizing API call first.
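
A minimal sketch of that flow in Node, assuming js-tiktoken for counting, ChatOpenAI from @langchain/openai for the summarizing call, and an illustrative 8,192-token embeddings input limit (verify the real limit for your embeddings model):

import { getEncoding } from "js-tiktoken";
import { ChatOpenAI } from "@langchain/openai";

const enc = getEncoding("cl100k_base");  // encoding used by OpenAI embeddings models
const EMBEDDING_INPUT_LIMIT = 8192;      // illustrative; check your model's actual limit

async function toEmbeddableText(text) {
  if (enc.encode(text).length <= EMBEDDING_INPUT_LIMIT) {
    return text; // short enough to embed directly
  }
  // Too long for the embeddings model: condense it with a cheap chat call first.
  const summarizer = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
  const summary = await summarizer.invoke(
    "Condense the following into a short passage that preserves key facts and " +
    "terminology, suitable as a vector-search query:\n\n" + text
  );
  return summary.content;
}

// const queryText = await toEmbeddableText(queryObj.text);
// const results = await vectorStore.similaritySearch(queryText, 5, { preFilter });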

This doesn’t have to be a plain “summarize this text” call; you can use hypothetical answering, where you prompt the AI to write a new text that looks like the kind of answer, and the kind of document, that would contain such an answer. This improves your matches further.
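
A hypothetical-answering step could look roughly like this (a sketch only; the prompt wording and gpt-4o-mini as the generator are my assumptions):

import { ChatOpenAI } from "@langchain/openai";

// Ask the model to write the kind of document that would contain the answer,
// then run the similarity search against that text instead of the raw question.
async function hypotheticalDocument(question) {
  const model = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0.3 });
  const response = await model.invoke(
    "Write a short, factual-sounding passage that could appear in a business " +
    "knowledge base and that would fully answer this question:\n\n" + question
  );
  return response.content;
}

// const hypo = await hypotheticalDocument(queryObj.text);
// const results = await vectorStore.similaritySearch(hypo, 5, { preFilter });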

If you are not using OpenAI’s product but your own embeddings database, you can split texts and make multiple embeddings calls, then either sum and renormalize those vectors into a single query vector, or combine the multiple result sets into a new weighted ranking.
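
For the split-and-combine idea, here is a sketch of building a single query vector (assuming OpenAIEmbeddings from @langchain/openai, a naive character-based split, and a simple sum-then-renormalize combination):

import { OpenAIEmbeddings } from "@langchain/openai";

// Split an over-long query, embed each chunk, then sum and renormalize the
// vectors into one unit-length query vector.
async function combinedQueryVector(text, chunkSize = 6000) {
  const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
  const chunks = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  const vectors = await embeddings.embedDocuments(chunks);
  const dim = vectors[0].length;
  const sum = new Array(dim).fill(0);
  for (const v of vectors) {
    for (let d = 0; d < dim; d++) sum[d] += v[d];
  }
  const norm = Math.sqrt(sum.reduce((acc, x) => acc + x * x, 0));
  return sum.map(x => x / norm);
}

// const qVec = await combinedQueryVector(longText);
// const results = await vectorStore.similaritySearchVectorWithScore(qVec, 5);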

Token counting is done with tiktoken, OpenAI’s Python library: encode a string and the length of the resulting token array is the count. If you are coding in a different language or on a platform without Python, you can set up a token-counting API worker yourself.

Thanks a lot for the detailed explanation — you’re right that GPT-4o-mini supports up to 128k tokens and that this is most likely a limitation of the embedding model, not the language model itself.

To clarify my situation a bit more:

  • The error is coming up during similaritySearch(), which internally embeds the queryObj.text.
  • My queryObj.text is well under 8192 tokens based on manual counting using tiktoken, but I still hit the 8192 token limit error.
  • The issue is that I can’t calculate the final token count of the prompt at the point when similaritySearch() runs, because the full context (which includes similarity results) gets constructed after that.
  • So I’m trying to find a clean and reliable way to handle this error proactively, without relying on fragile checks like err.message.includes("maximum context length is").

I’m not against the error happening — I fully understand that if the input is too long, it should fail.

What I want is to catch and handle that error gracefully, so I can show a friendly UI message like “Your input is too long — please shorten it.” instead of letting the app crash or showing a generic failure.
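
Something along these lines is what I’m aiming for (just a sketch; I’m assuming LangChain propagates the OpenAI Node SDK error and that the API sets a machine-readable code like context_length_exceeded, which I still need to verify for my SDK version):

import OpenAI from "openai";

async function answerQuestion(queryObj, chain) {
  try {
    return await chain.invoke(queryObj.text);
  } catch (err) {
    // Check structured fields instead of matching on the message text.
    if (err instanceof OpenAI.BadRequestError && err.code === "context_length_exceeded") {
      return { userMessage: "Your input is too long — please shorten it." };
    }
    throw err; // anything else is a genuine failure
  }
}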

Appreciate the ideas around hypothetical answering and splitting documents — will definitely keep those in mind as I evaluate the options.

Thanks again!