Is there documentation for the maximum number of tokens for all samples sent in a single embeddings API request?
We are seeing a limit of about 300K tokens (36 inputs of 8K each) but are unable to find any official documentation of this.
It would be nice to know whether this varies by model, or whether it might change in the future.
Edit to add more information:
The error looks like this when we send a batch of 244 inputs of 8191 tokens each (just under 2M) to text-embedding-3-small on tier 5:
Error code: 400 - {'error': {'message': 'Requested 499651 tokens, max 300000 tokens per request', 'type': 'max_tokens_per_request', 'param': None, 'code': 'max_tokens_per_request'}}
Similar error with 850K tokens:
Error code: 400 - {'error': {'message': 'Requested 319449 tokens, max 300000 tokens per request', 'type': 'max_tokens_per_request', 'param': None, 'code': 'max_tokens_per_request'}}
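For reference, a minimal sketch of the kind of call that reproduces this; the openai Python client and the placeholder chunks below are illustrative assumptions, not our exact code:

```python
from openai import OpenAI
import tiktoken

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
enc = tiktoken.get_encoding("cl100k_base")  # encoding used by text-embedding-3-small

# Hypothetical inputs: 244 chunks, each trimmed to 8191 tokens.
chunk = enc.decode(enc.encode("lorem ipsum dolor sit amet " * 4000)[:8191])
batch = [chunk] * 244

# Sending the whole batch in a single request fails with the 400 error shown above.
resp = client.embeddings.create(model="text-embedding-3-small", input=batch)
```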
I ran into this as well, and it looks like OpenAI's API does have a limit on how many tokens you can send in each request. Not only are you limited to 2048 chunks to be embedded at the same time, the sum of those chunks also can't be larger than 300k "tokens".
However, that 300k limit is not calculated with the actual tiktoken tokenizer; instead it is an estimate of 0.25 tokens per UTF-8 byte. So you can actually embed 1 million tokens in one request, as long as they use few enough bytes.
Seems like a very odd limit to me, considering the request itself already has a limit of 2048 chunks. But hey, now we know.
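A quick way to see the gap between the two counts; the bytes/4 heuristic below is an assumption inferred from the observed behaviour, not anything documented:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by the text-embedding-3 models

def byte_estimate(texts):
    # What the endpoint appears to check against the 300k limit:
    # roughly 0.25 "tokens" per UTF-8 byte, i.e. total bytes divided by 4.
    return sum(len(t.encode("utf-8")) for t in texts) / 4

def tiktoken_count(texts):
    # What the limit sounds like it should mean: real tokenizer counts.
    return sum(len(enc.encode(t)) for t in texts)

texts = ["An example chunk of text that will be embedded."] * 1000
print("byte-based estimate:", byte_estimate(texts))
print("tiktoken count:     ", tiktoken_count(texts))
```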
edit: I think this is actually a bug in the embedding endpoints. According to their API spec, the limit is 300k tokens per request:
"All embedding models enforce a maximum of 300,000 tokens summed across all inputs in a single request."
https://platform.openai.com/docs/api-reference/embeddings/create#embeddings-create-input
However, you can send either pre-tokenized input (arrays of token ids) or raw strings.
If the payload is interpreted as a plain int array (an array of tokens, one 4-byte integer per token), it makes sense to simply divide its length in bytes by 4 and report that as the number of tokens.
However, when the user sends UTF-8 encoded text, this leads to the weird behaviour where you can send way more (or way fewer) actual tokens than 300k, since the text is only tokenized after the size check.
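Until that's fixed, a practical workaround is to batch by the same metric the endpoint seems to apply. A minimal sketch, assuming the bytes/4 estimate above and the official openai Python client:

```python
from openai import OpenAI

client = OpenAI()

MAX_INPUTS_PER_REQUEST = 2048      # documented cap on the number of inputs
MAX_TOKENS_PER_REQUEST = 300_000   # documented cap on tokens summed across inputs

def batched(texts):
    """Yield batches that stay under both per-request limits, using the same
    bytes/4 estimate the endpoint appears to apply to string inputs."""
    batch, batch_budget = [], 0.0
    for text in texts:
        estimate = len(text.encode("utf-8")) / 4
        if batch and (len(batch) >= MAX_INPUTS_PER_REQUEST
                      or batch_budget + estimate > MAX_TOKENS_PER_REQUEST):
            yield batch
            batch, batch_budget = [], 0.0
        batch.append(text)
        batch_budget += estimate
    if batch:
        yield batch

texts = ["An example chunk of text that will be embedded."] * 10_000
embeddings = []
for batch in batched(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=batch)
    embeddings.extend(d.embedding for d in resp.data)
```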