That gave me the impression that you could be naively sending huge texts directly to the embeddings endpoint. The endpoint estimates the token count of a request and denies any single request over the rate limit before the tokens are actually counted, or accepted or denied, by the AI model.
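To make that concrete, here is a minimal sketch of what such a pre-flight check might look like. The `preflight_ok` helper, the 150,000 cap, and the 3-characters-per-token ratio are assumptions from this discussion, not a documented server-side formula:

```python
# Hypothetical mirror of the endpoint's pre-flight check: estimate
# tokens from raw character count and reject an oversized single
# request before it ever reaches the model.
TPM_LIMIT = 150_000       # assumed tokens-per-minute cap
CHARS_PER_TOKEN = 3       # assumed rough ratio, not a real tokenizer

def preflight_ok(text: str) -> bool:
    """Return True if the estimated token count fits under the limit."""
    estimated_tokens = len(text) // CHARS_PER_TOKEN + 1
    return estimated_tokens <= TPM_LIMIT
```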
The first odd thing is that "limit 150,000" on embeddings. My advice:
You could indeed be hitting that limit if you let your software batch an entire document into a single request.
And because the rate limiter doesn't rely on exact token counting, you don't need to be elaborate and count real tokens either.
Just put in your own character-based rate limit: hold back chunks until the next minute whenever a formula like 3 characters ≈ 1 token says you are approaching the cap. It's also possible that the string of the returned vector is being counted against the limit.
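Here is a minimal sketch of that client-side limiter in Python, assuming the 150,000 tokens-per-minute cap and the 3-characters-per-token estimate from above. The class and function names are mine; if the returned vector string really does count against the limit, you could simply pad the per-chunk estimate to compensate:

```python
import time

TPM_LIMIT = 150_000       # assumed tokens-per-minute cap
CHARS_PER_TOKEN = 3       # assumed rough ratio: 3 characters ~= 1 token

def estimate_tokens(text: str) -> int:
    """Character-based token estimate; deliberately not a real tokenizer."""
    return len(text) // CHARS_PER_TOKEN + 1

class CharBasedRateLimiter:
    """Holds back chunks until the next minute once the estimated
    token budget for the current one-minute window is spent."""

    def __init__(self, tpm_limit: int = TPM_LIMIT):
        self.tpm_limit = tpm_limit
        self.window_start = time.monotonic()
        self.tokens_used = 0

    def wait_for_budget(self, text: str) -> None:
        needed = estimate_tokens(text)
        elapsed = time.monotonic() - self.window_start
        if elapsed >= 60:
            # A full minute has passed; start a fresh budget window.
            self.window_start = time.monotonic()
            self.tokens_used = 0
        elif self.tokens_used + needed > self.tpm_limit:
            # Approaching the cap: sleep out the rest of the minute.
            time.sleep(60 - elapsed)
            self.window_start = time.monotonic()
            self.tokens_used = 0
        self.tokens_used += needed

if __name__ == "__main__":
    limiter = CharBasedRateLimiter()
    chunks = ["some chunk of document text ..."] * 10  # stand-in chunks
    for chunk in chunks:
        limiter.wait_for_budget(chunk)
        # call the embeddings endpoint with `chunk` here
```

Since the server appears to screen by estimate rather than by real token count, an estimate on your side is all the precision this needs; the only design choice is how conservative to make the ratio.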