Can anyone recommend any libraries to assist with rate limiting?
OpenAI has use-case-specific rate limits: generations per minute, per hour, and per action, per end-user. Most rate-limiting libraries impose limits per request. Since a single request can contain multiple generations, things become trickier.
I’m familiar with token bucket rate limiting, which would allow for more granular tracking by generation, and have found some libraries that support this (e.g. node-rate-limiter-flexible). Limiting by end-user as well means we’ll also need a data store, such as Redis.
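For example, here’s the kind of thing I have in mind with rate-limiter-flexible backed by Redis — a rough sketch, not production code. The idea is to consume one “point” per generation rather than per request; the limit values and the `userId`/`generationCount` names are just placeholders:

```typescript
// Sketch: per-end-user token bucket with rate-limiter-flexible backed by Redis.
// The limit values and the userId/generationCount parameters are illustrative.
import Redis from "ioredis";
import { RateLimiterRedis, RateLimiterRes } from "rate-limiter-flexible";

const redisClient = new Redis({ enableOfflineQueue: false });

// e.g. allow 30 generations per end-user per minute
const generationLimiter = new RateLimiterRedis({
  storeClient: redisClient,
  keyPrefix: "gen-per-min",
  points: 30,   // generations allowed...
  duration: 60, // ...per 60-second window
});

async function checkGenerationQuota(userId: string, generationCount: number): Promise<boolean> {
  try {
    // Consume one point per requested generation, not per HTTP request
    await generationLimiter.consume(userId, generationCount);
    return true;
  } catch (rejection) {
    // rejection is a RateLimiterRes when the bucket is exhausted
    const retryAfterMs = (rejection as RateLimiterRes).msBeforeNext;
    console.warn(`User ${userId} over limit, retry in ${retryAfterMs} ms`);
    return false;
  }
}
```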
Taking a step back, I can’t help but wonder whether I’m overcomplicating this. I seldom see this topic discussed, and with so many GPT-3 apps out there nowadays, I wonder if I’m overlooking something.
Can anyone share any insight? How are others handling this? Thanks so much!
The most common algorithm that I am aware of is “exponential backoff”: you watch for some kind of trigger condition and double (or triple) the wait time between attempts until the condition clears.
One such condition could be error codes returned by the API, such as 429 “too many requests”.
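A minimal sketch of that idea in TypeScript might look like the following. `callOpenAI` stands in for whatever client call you actually make, and the status check assumes the thrown error exposes the HTTP status code:

```typescript
// Sketch of exponential backoff: retry a call, doubling the wait after each
// 429 response, up to a maximum number of attempts.
async function withExponentialBackoff<T>(
  callOpenAI: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await callOpenAI();
    } catch (err: any) {
      const isRateLimited = err?.status === 429; // assumes the error carries an HTTP status
      if (!isRateLimited || attempt === maxAttempts - 1) throw err;
      const delayMs = baseDelayMs * 2 ** attempt; // 1s, 2s, 4s, 8s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error("unreachable");
}
```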
Another condition could be your own rate limits. Here’s how I might implement it, since there are multiple sources generating requests (a rough sketch follows the list below):
Use a single broker to handle all communication with OpenAI API
This broker handles all transactions from multiple sources
The broker keeps track of all requests (including local timestamp)
Use some global benchmarks (like 20 requests per minute max)
Track a rolling request rate (n requests over last 60 seconds)
As you approach that limit, increase delay until next request
Since you have multiple sources generating requests, you can also track the same information as above for each individual requestor. Say each service is allowed 5 requests per minute: the broker holds a queue per requestor and spaces out requests according to queue depth (if the queue depth is >= 5, the spacing would be 12 seconds, for example).
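A very rough sketch of that broker, assuming Node/TypeScript. The caps (20/min global, 5/min per requestor) and the 12-second spacing mirror the example numbers above, and the load-to-delay mapping is purely illustrative:

```typescript
// Rough sketch: a single queue in front of the API, a rolling 60-second window
// of request timestamps, and an increasing delay as either cap is approached.
type Job = { requestorId: string; run: () => Promise<void> };

class ApiBroker {
  private queue: Job[] = [];
  private globalTimestamps: number[] = [];
  private perRequestor = new Map<string, number[]>();
  private running = false;

  constructor(private globalPerMinute = 20, private perRequestorPerMinute = 5) {}

  submit(job: Job): void {
    this.queue.push(job);
    if (!this.running) void this.drain();
  }

  // Keep only timestamps from the last 60 seconds (rolling window)
  private prune(timestamps: number[], now: number): number[] {
    return timestamps.filter((t) => now - t < 60_000);
  }

  private async drain(): Promise<void> {
    this.running = true;
    while (this.queue.length > 0) {
      const job = this.queue.shift()!;
      const now = Date.now();
      this.globalTimestamps = this.prune(this.globalTimestamps, now);
      const mine = this.prune(this.perRequestor.get(job.requestorId) ?? [], now);

      // Space requests out more aggressively as we approach either cap
      const globalLoad = this.globalTimestamps.length / this.globalPerMinute;
      const requestorLoad = mine.length / this.perRequestorPerMinute;
      const load = Math.max(globalLoad, requestorLoad);
      const delayMs = load >= 1 ? 12_000 : Math.floor(load * 3_000);
      if (delayMs > 0) await new Promise((r) => setTimeout(r, delayMs));

      this.globalTimestamps.push(Date.now());
      mine.push(Date.now());
      this.perRequestor.set(job.requestorId, mine);
      await job.run();
    }
    this.running = false;
  }
}
```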
All of that being said, I think it’s overkill. Some of my experiments run several hundred CURIE completions per minute without issue, although these occur in bursts, not as a sustained rate. Maybe OpenAI doesn’t mind users like me after all; we pay for tokens, so they want to serve us as fast as possible.
I log all requests and responses in a database, with timestamps. It’s then just a few lines of SQL to check how many requests that user sent in the last minute/hour, and throttle if needed.
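For example, against a Postgres log table the check could look roughly like this. The `request_log` table and its columns are made-up names for illustration, not an actual schema:

```typescript
// Sketch of the log-and-count approach: sum the generations a user has logged
// in the last minute and throttle if they are over the limit.
import { Pool } from "pg";

// Connection settings come from the usual PG* environment variables
const pool = new Pool();

async function generationsInLastMinute(userId: string): Promise<number> {
  const { rows } = await pool.query(
    `SELECT COALESCE(SUM(generation_count), 0) AS used
       FROM request_log
      WHERE user_id = $1
        AND created_at > NOW() - INTERVAL '1 minute'`,
    [userId]
  );
  return Number(rows[0].used);
}

async function shouldThrottle(userId: string, limitPerMinute = 30): Promise<boolean> {
  return (await generationsInLastMinute(userId)) >= limitPerMinute;
}
```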
A more difficult question would be how to properly inform the user about this happening, but since in my use case this should be a rare condition, I’m OK for now with my application just feeling slower when it happens.
@m-a.schenk Thank you for the welcome, and for your thought-provoking questions. I don’t have a specific use case for supporting those upper limits, aside from ensuring my system plays by the rules. I can, however, imagine scenarios where an end-user stumbles into the upper limits, such as during a brainstorming session when generating many completions in short bursts.
@daveshapautomator Wow, thanks for that detailed explanation. I can see how exponential backoff could be useful for rate control once a trigger condition fires. I also like the idea of a single broker. As you said, the strategy as a whole may be overkill, but I think the pieces could be practical.
@d.hoeffer It’s so simple, I wonder what the catch is? lol. Do your requests map 1:1 with generations? In my case, an end-user can opt to generate 3 completions, for example, with a single request. Simply parsing the request and including a generation count in the log would handle that. I love this. Thanks for sharing!