Rate limit reached with large documents


Sorry, this isn’t in fact a bug report, more like an inquiry.

I have a question about the rate limit for embeddings. I’m trying to process a text with a substantial amount of content, around 95,000 words or so, but I got the following error:

‘Rate limit reached for default-text-embedding-ada-002 in {organization} on tokens per min. Limit: 150,000 / min. Current: 0 / min. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit OpenAI Platform to add a payment method.’

It works fine with documents with fewer words, but I’d like to be able to process longer texts.

I understand this is because of the quota, and I’m not sure if there’s an alternative solution. However, if I need to increase the quota, I assume that the money associated with the rate limit ($2500) and the quota money can’t be shared, right? In other words, my question is, do I need to pay two separate amounts, one for the rate limit and another for the quota?

It’s likely that this won’t resolve the issue with very long documents anyway. In that case, is there an alternative to not having to add a sleep to wait for the minute to pass if it exceeds the token limit?

You cannot send that much text to text-embedding-ada-002: it has a context length of 8192 tokens. The tokens-per-minute rate limit blocks the job outright, because the request size is far beyond even what the AI model itself can accept.

You’ll need a weighting algorithm to combine the embeddings of small pieces of the document into one vector. Or, more simply, embed the chunks separately and return the same parent document whenever any of its chunks is similar to the query.
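As a rough illustration of the weighting idea, here is a minimal sketch (pure Python, no real API calls) that combines per-chunk vectors into one document vector, weighting each chunk by its character length. The length-based weighting is an assumption for the example, not a formula from this thread:

```python
# Sketch: combine chunk embeddings into one document vector by
# length-weighted averaging, then re-normalize to unit length
# (ada-002 vectors are unit-normalized).
import math

def combine_embeddings(chunks, vectors):
    """Weight each chunk's vector by its character length and average."""
    dim = len(vectors[0])
    weights = [len(c) for c in chunks]
    total = sum(weights)
    combined = [0.0] * dim
    for w, vec in zip(weights, vectors):
        for i, x in enumerate(vec):
            combined[i] += (w / total) * x
    norm = math.sqrt(sum(x * x for x in combined))
    return [x / norm for x in combined]
```

The chunk vectors here would come from separate embeddings calls; the toy 2-dimensional vectors below are only to show the arithmetic.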

Hi, the current token limit for embeddings is way more than enough for every use case I can imagine them being used for…

So I’m really curious, what is the goal you’re trying to achieve with embeddings of that size?

Really curious.

@sergeliatko Mmm, well, I don’t have a specific goal, I simply want to process a variety of texts, both short and long, and obtain their embeddings since I’m trying to develop my own chatbot using documents as a database. I’ve gotten issues when trying to process longer texts and wanted to better understand the limits and how to work with them.

Yeah, the thing is I’m using langchain as a framework to develop my own chatbot. I obtain embeddings from documents and store them in chromadb. I’ve used methods to split the text, resulting in chunks of only 1000 tokens with an overlap of 20 tokens. However, I’ve never found an error specifically related to context length.

The error I’m getting is about tokens per minute; I’ve been splitting the text into chunks from the get-go, and I’ve never hit an error because of context length.


If you don’t have a specific goal and pretty much just want to play around with the results then try to chunk the text into different sizes like 100 words or 200 or maybe even 500 words and account for some overlapping between the chunks. Do this for one document and test it (extensively). From there you can look into the more advanced techniques to handle large documents and especially several of them.
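A minimal sketch of that kind of word-based chunking with overlap might look like this; the sizes are just the examples mentioned above, not a recommendation:

```python
# Sketch: split text into word-based chunks with a fixed overlap,
# e.g. 200 words per chunk with 20 words shared between neighbors.
def chunk_words(text, size=200, overlap=20):
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```

From there you can vary `size` and `overlap` per document type and compare retrieval quality.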

That gave me the impression that you could be naively sending huge texts directly to the embeddings engine. The endpoint estimates the token count and denies a single request that exceeds the rate limit before the tokens are ever actually counted, or accepted or rejected by the AI model.

The first odd thing is that “Limit: 150,000” on embeddings.


You could actually be hitting the limit if you are letting software batch a whole document at once.

Because the rate limiter doesn’t rely on actual token counting, you don’t have to be elaborate and count real tokens either.

Just put in your own character-based rate limit, holding back chunks until the next minute when you are approaching the cap, using a rough formula like 3 characters = 1 token. It’s possible that the string of the returned vector is also being counted.
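A character-based throttle along those lines could be sketched like this; the 3-characters-per-token ratio and the 150,000 tokens-per-minute budget are taken from this thread, not measured:

```python
# Sketch: hold back requests when the estimated per-minute token
# budget would be exceeded, estimating tokens as characters / 3.
import time

class CharRateLimiter:
    def __init__(self, tokens_per_minute=150_000, chars_per_token=3):
        self.budget = tokens_per_minute
        self.ratio = chars_per_token
        self.window_start = time.monotonic()
        self.used = 0

    def wait_for(self, text):
        """Block until `text` fits in the current minute's budget.
        Returns the estimated token count."""
        est_tokens = len(text) // self.ratio + 1
        now = time.monotonic()
        if now - self.window_start >= 60:
            self.window_start, self.used = now, 0
        if self.used + est_tokens > self.budget:
            time.sleep(60 - (now - self.window_start))  # wait out the minute
            self.window_start, self.used = time.monotonic(), 0
        self.used += est_tokens
        return est_tokens
```

Call `limiter.wait_for(chunk)` before each embeddings request; when the estimate would overflow the budget, it sleeps until the next minute window.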

@info227 I see.

If I get your point right, you need embeddings to be able to find the context from your database of documents related to the user query to be able to construct the prompt to answer the query.

Technically you will be comparing two vectors, the query and a text from the database, to get the distance between them. The shorter the distance, the closer the relation between your query and the text from the database.

Usually, a query contains one “idea” at a time, mapped to a single vector, while a long document contains multiple ideas mapped to one vector as a whole, undivided item…

That creates a “semantic conflict”: you end up measuring the distance between a “single idea” vector and a “multiple ideas” vector. The conflict shows up as longer distances (precision loss) on the vectors found in the database. That might be OK while your database is small, but with a huge database your searches will return many long documents at more or less the same distance from your query, which makes it hard to find the exact text you need.

There is a solution to that. The approach is to isolate the ideas in the documents and embed them separately, so that each vector represents one idea at a time. If you do that, your search results will return a set of vectors with a wider range of distances between them, and sorting by distance/similarity will give you a more precise set of results.
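The one-idea-per-vector search can be sketched like this; `rank_ideas` is a hypothetical helper, and cosine similarity stands in for whatever distance metric your vector store actually uses:

```python
# Sketch: embed each "idea" separately, then rank ideas by cosine
# similarity against the query vector. Toy vectors stand in for
# real embeddings to keep the example self-contained.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_ideas(query_vec, idea_vecs):
    """Return idea indices sorted from most to least similar."""
    scores = [(cosine(query_vec, v), i) for i, v in enumerate(idea_vecs)]
    return [i for _, i in sorted(scores, reverse=True)]
```

With one idea per vector, the score spread between hits widens, so the sort actually discriminates between results.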

But your results will be only the parts of documents closely related to the query, so you need to think through the database structure so you can find the document a found section belongs to, and grab additional context from the whole document if needed.

I find it cheaper to process the documents once before embedding (splitting on ideas) and use a simple string search in the document to match a vector-found section to its context, than to use an additional model to extract context from a large document every time you run a query…
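The string-search step could be as simple as this sketch; `expand_context` is a hypothetical helper name, and the fixed character margin is an assumption:

```python
# Sketch: locate a retrieved chunk in its parent document with a plain
# string search and return it with surrounding context on each side.
def expand_context(document, chunk, margin=200):
    pos = document.find(chunk)
    if pos == -1:
        return chunk  # fall back to the chunk alone
    start = max(0, pos - margin)
    end = min(len(document), pos + len(chunk) + margin)
    return document[start:end]
```

In practice you would widen the margin to sentence or paragraph boundaries rather than a raw character count.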

But that really requires understanding the document types, their structure, etc., to be able to split them in the most optimal way.


That one is because your script does not have a back-off strategy to slow down requests when approaching the limit.
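A minimal back-off sketch, assuming your client raises some rate-limit exception (the `RateLimitError` class here is a stand-in, not a specific library's type):

```python
# Sketch: retry with exponential delay when a rate-limit error is
# raised, instead of failing or hammering the endpoint.
import time

class RateLimitError(Exception):
    """Stand-in for whatever rate-limit exception your client raises."""

def with_backoff(func, max_retries=5, base_delay=1.0):
    """Call func(); on RateLimitError wait 1s, 2s, 4s, ... and retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Wrapping each embeddings call in `with_backoff` smooths out bursts that would otherwise trip the per-minute limit.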


@sergeliatko the issue was that even though I implemented a backoff mechanism to slow down processing when it exceeded the limit, there seemed to be a problem with using LangChain as an intermediate layer: instead of processing the text, it returned the same kind of error as before, but this time saying ‘3 requests per minute’, which is different from the original ‘150,000 tokens per minute’. So now all my documents are affected by this new limit.

OpenAI support told me about requesting a rate limit increase, which I did recently, so I’m waiting for their response and will try some other open-source solutions in the meantime.

3 requests per minute? Free trial limitation.

Put money into their system and the rate limit gets lifted.


So, I just need to add a payment method? My quota was only increased a few days ago, and I spend very little compared to the credit they granted me. So I imagine that in order to get the 300,000 tokens per minute I only have to add a payment method, is that correct?

If you have a card payment method added to your account in the billing->payment methods section, that should allow you to use the increased rate limits without you having to pay for them while your grant is still in effect.


@_j @sergeliatko Thank you all for the help. I can’t mention the other two, but thanks to them as well.