Token counting in the Batch API / text embeddings

I’m trying to embed around 120,000 small documents/pieces of text; each is under 8,192 tokens, so no chunking is required.
I’m trying to use the Batch API for this. However, I hit the enqueued-token limit, which in my case is 20,000,000 tokens.
The problem is that I count the tokens in the texts and only send as much to the Batch API as keeps me under the limit. My count says I’m below 20,000,000 tokens with around 29,000 pieces of text, yet I still get a token-limit error. Even if I cap my count at, say, 15,000,000 tokens, I still get the error.
What might be the cause, other than possible bugs in the token-counting library?
Note that I’m not using Python’s tiktoken; I’m using Microsoft.ML.Tokenizers, which is C#. I’d assume it counts correctly, especially since it supports the correct token encoding and can create token encoders by model name.
Any ideas?
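
For reference, my counting/packing logic is roughly equivalent to the following sketch (shown with Python’s tiktoken for illustration; my actual code does the same with Microsoft.ML.Tokenizers, and the names here are made up):

```python
# Rough sketch of the counting/packing logic, using tiktoken for illustration.
# (My real code uses Microsoft.ML.Tokenizers in C#; names here are illustrative.)
import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-3-small")  # should resolve to cl100k_base

TOKEN_BUDGET = 20_000_000  # the enqueued-token limit I am trying to stay under

def pack_under_budget(texts, budget=TOKEN_BUDGET):
    """Greedily take texts until adding the next one would exceed the token budget."""
    batch, total = [], 0
    for text in texts:
        n = len(enc.encode(text))
        if total + n > budget:
            break
        batch.append(text)
        total += n
    return batch, total
```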

  1. Ensure you are using the correct tokenizer dictionary.
     - Ensure that you indeed measure the same token counts on your texts with the OpenAI tokenizer site.

  2. Get the usage object from one of the embeddings API calls you make yourself, and see if you have measured it correctly (see the sketch after this list).

  3. Understand that OpenAI’s rate limiters are rough estimators and do not perform an accurate count on preliminary jobs, which may be the case here for batch inputs.
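
For point 2, a minimal sketch with the Python client (assuming the official openai package and an API key in the environment; the same check works from C# by reading the response’s usage field):

```python
# Compare a local token count against what the API itself reports in `usage`.
import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
enc = tiktoken.get_encoding("cl100k_base")

text = "How does artificial intelligence affect society?"
local_count = len(enc.encode(text))

resp = client.embeddings.create(model="text-embedding-3-small", input=text)
print("local count:", local_count)
print("API usage:  ", resp.usage.total_tokens)  # what the API actually measured
```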

The rate limit is measured against the total of all pending jobs. You can submit smaller batch jobs, such as 1/10th of the enqueued limit, with backoff and polling for job completion. That way you start getting results back and freeing queue space, instead of the job processor munching away on one big job that blocks everything or can’t even be submitted.
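
A rough sketch of that pattern with the Python client (the file names, chunk split, and polling interval below are illustrative, and error handling is omitted):

```python
# Sketch: submit smaller batch jobs and poll until each completes before sending more.
import time
from openai import OpenAI

client = OpenAI()

def run_batch(jsonl_path, poll_seconds=60):
    # Upload one JSONL file of /v1/embeddings requests and create a batch job for it.
    batch_file = client.files.create(file=open(jsonl_path, "rb"), purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/embeddings",
        completion_window="24h",
    )
    # Poll until the job reaches a terminal state, freeing queue space for the next chunk.
    while batch.status not in ("completed", "failed", "expired", "cancelled"):
        time.sleep(poll_seconds)
        batch = client.batches.retrieve(batch.id)
    return batch

# e.g. split the texts into several JSONL files, each well under the enqueued-token limit,
# and run them one after another:
# for path in ["chunk_01.jsonl", "chunk_02.jsonl"]:
#     done = run_batch(path)
```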


Is it the correct one? I’m using the text-embedding-3-small model, which uses, I think, cl100k_base, or whatever it’s called?
Also, I used the model name, not the encoding name, when instantiating the tokenizer, so I wasn’t the one selecting it, and it seemed to select what I said above.

This site probably doesn’t have text embedding models on the list…?

Well, I haven’t tried that, but I probably should.

But how big can the difference be? I’d think it should still work when I’m 5 million tokens below the limit. Initially I was only 10k tokens below the limit. The estimator may be inaccurate, but can it be so inaccurate that even 75% of the limit is too much?

This might be a nice approach, but it’s also not a problem here, because my program is a test program and I won’t run anything else in parallel, so I didn’t account for that. It’s a nice hint for real/prod usage, though. It could also help with this problem, but it’s probably worth researching the issue as-is to understand better how things work. I’m experimenting more than doing real work at the moment.

Here’s information from the docs that corrects my off-the-top-of-my-head info:

For third-generation embedding models like text-embedding-3-small, use the cl100k_base encoding.
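
If you want to confirm that mapping programmatically, resolving the encoding by model name with tiktoken should report the same thing:

```python
# Quick check: resolving the encoding by model name should give cl100k_base.
import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-3-small")
print(enc.name)  # expected: cl100k_base
```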

Not content to take anything on faith, I crafted an input that is 1000 cl100k_base tokens (and is actually 1003 in o200k, so not a difference of a magnitude that would affect your batch):

tiktoken with model name

Total Tokens by API usage: 1007
Model used: text-embedding-3-small
Query text (7): ‘How does artificial intelligence affect society?’
Paragraph text (1000): 'Microsoft.ML.Tokenizers\n\n

tiktoken with o200k_base

Total Tokens by API usage: 1007
Model used: text-embedding-3-small
Query text (7): ‘How does artificial intelligence affect society?’
Paragraph text (1003): 'Microsoft.ML.Tokenizers\n

So the API usage and the locally measured token array lengths match for the cl100k encoding.
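
The cross-encoding comparison itself is just a local count with each encoding, along these lines (the paragraph variable stands in for whatever ~1000-token input you use):

```python
# Count the same text with both encodings to see how far apart they land.
import tiktoken

cl100k = tiktoken.get_encoding("cl100k_base")
o200k = tiktoken.get_encoding("o200k_base")

paragraph = "..."  # your ~1000-token test input goes here
print("cl100k_base:", len(cl100k.encode(paragraph)))
print("o200k_base: ", len(o200k.encode(paragraph)))
```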

Limiter token impact

They put an estimator in front of API calls. Your rate limit for single calls is charged based on the estimate. Let’s see how much my remaining rate goes down for 1007 tokens:

‘x-ratelimit-remaining-tokens’: ‘9998873’

1127 tokens were “estimated” and actually counted against my remaining bucket.
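
You can watch that estimate yourself by reading the rate-limit headers on a raw response (a sketch assuming the v1 Python client’s with_raw_response helper):

```python
# Read the rate-limit headers to see how the estimator charges your token bucket.
from openai import OpenAI

client = OpenAI()

raw = client.embeddings.with_raw_response.create(
    model="text-embedding-3-small",
    input="How does artificial intelligence affect society?",
)
print(raw.headers.get("x-ratelimit-remaining-tokens"))  # tokens left in the per-minute bucket
print(raw.parse().usage.total_tokens)                   # tokens the API actually counted
```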


Batches can also finish quickly, or take nearly the full 24 hours, or be expired after 24 hours without ever being run, so smaller batch jobs have several benefits when you’re the one doing the monitoring.

So, in theory, could the estimator differ so much from the real token count that even 75% of the limit wouldn’t be enough to make it happy?