Rate limiting strategies?

The most common algorithm that I am aware of is “exponential backoff”. Basically, you watch for some trigger condition and double (or triple) the wait time between attempts until the trigger condition clears.

One such condition could be error codes returned by the API, such as 429 “Too Many Requests”.
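For example, a minimal backoff loop might look like this in Python. Note that `send_request` is just a stand-in for whatever API call you make; I'm assuming it returns a `requests`-style response with a `status_code` attribute:

```python
import time

def with_backoff(send_request, max_retries=6, base_delay=1.0):
    """Retry send_request, doubling the wait after each 429.

    send_request is a placeholder for your actual API call; it is
    assumed to return a response with a .status_code attribute.
    """
    delay = base_delay
    for _ in range(max_retries):
        response = send_request()
        if response.status_code != 429:   # trigger condition cleared
            return response
        time.sleep(delay)
        delay *= 2                        # or *= 3 for a tripling policy
    raise RuntimeError(f"still rate limited after {max_retries} retries")
```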

Another condition could be your own rate limits. Here’s how I might implement it, since there are multiple sources generating requests (a rough sketch in code follows the list):

  1. Use a single broker to handle all communication with the OpenAI API
  2. This broker handles all transactions from multiple sources
  3. The broker keeps track of all requests (including a local timestamp for each)
  4. Use some global benchmark (like a 20-requests-per-minute max)
  5. Track a rolling request rate (n requests over the last 60 seconds)
  6. As you approach that limit, increase the delay until the next request
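Here’s a minimal sketch of such a broker in Python. The `Broker` class, the 20-per-minute figure, and the throttling formula are all my own illustration of the list above, not anything official from OpenAI:

```python
import time
import threading
from collections import deque

class Broker:
    """Single point of contact for the API: tracks a rolling request
    rate and throttles as the global limit is approached.
    """
    def __init__(self, max_per_minute=20, window=60.0):
        self.max_per_minute = max_per_minute
        self.window = window
        self.timestamps = deque()      # local timestamps of recent requests
        self.lock = threading.Lock()   # multiple sources share one broker

    def _rolling_count(self, now):
        # Drop timestamps older than the window, then count the rest.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps)

    def acquire(self):
        """Block until a request may be sent, then record it."""
        while True:
            with self.lock:
                now = time.monotonic()
                count = self._rolling_count(now)
                if count < self.max_per_minute:
                    self.timestamps.append(now)
                    # Delay grows as the rolling rate nears the limit:
                    # 0 s when idle, almost one full slot (3 s here) when
                    # the window is nearly full.
                    delay = (count / self.max_per_minute) * (self.window / self.max_per_minute)
                    break
            time.sleep(0.1)  # window full: wait for old timestamps to expire
        if delay:
            time.sleep(delay)
```

Every source would call `broker.acquire()` immediately before sending its request, so the rolling window sees all traffic in one place.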

Since you have multiple sources generating requests, you can also track the same information as above for each individual requestor. Say each individual service is allowed 5 requests per minute; the broker holds a queue per source and spaces out the requests according to the queue depth (if the queue depth is >= 5, the spacing becomes 12 seconds, for example).
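As a sketch, that spacing rule might be computed like this (the 5-per-minute limit and the resulting 12-second spacing are just the example numbers from above):

```python
import time
from collections import deque

def request_spacing(queue_depth, per_source_limit=5, window=60.0):
    """Seconds between one source's requests, given its queue depth.

    With the example numbers (5 requests per minute), a queue depth
    of 5 or more yields 60 / 5 = 12 seconds of spacing.
    """
    if queue_depth >= per_source_limit:
        return window / per_source_limit
    return 0.0

def drain(queue, send):
    """Send one source's queued requests with the computed spacing."""
    while queue:
        time.sleep(request_spacing(len(queue)))
        send(queue.popleft())
```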

All of that being said, I think it’s overkill. Some of my experiments run several hundred Curie completions per minute without issue, although these occur in bursts, not as a sustained rate. Maybe OpenAI hates users like me :slight_smile: Then again, we pay for tokens, so they want to serve us as fast as possible.
