Hi everyone!
I’ve been working on a Python library called concurrent-openai that aims to make working with the OpenAI API more efficient, especially for projects that require high concurrency and tend to hit rate limits.
How it works:
The library uses a simple system with two buckets:
A request bucket for tracking the number of API calls.
A token bucket for estimating token consumption.
Both buckets replenish automatically over time (every second), in line with OpenAI’s rate limits, which you set manually. A rough sketch of the mechanism is shown below.
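To make this concrete, here is a minimal sketch of what such a replenishing bucket could look like. This is illustrative only, not concurrent-openai’s actual implementation: the class name, the per-minute capacity convention, and the continuous (rather than strictly per-second) refill are all my assumptions.

```python
import asyncio
import time

class ReplenishingBucket:
    """Illustrative capacity bucket that refills over time (not the library's real code)."""

    def __init__(self, capacity_per_minute: float):
        self.capacity = capacity_per_minute   # e.g. your OpenAI RPM or TPM limit
        self.available = capacity_per_minute  # start full
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        # Top up proportionally to elapsed time, never above capacity.
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.available = min(self.capacity,
                             self.available + elapsed * self.capacity / 60)
        self.last_refill = now

    async def acquire(self, amount: float = 1.0) -> None:
        # Block until enough capacity is available, then deduct it.
        while True:
            self._refill()
            if self.available >= amount:
                self.available -= amount
                return
            await asyncio.sleep(0.1)

    def credit(self, amount: float) -> None:
        # Return unused capacity (e.g. when a token estimate was too high).
        self._refill()
        self.available = min(self.capacity, self.available + amount)
```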
Before making a request, the library estimates how many tokens the call will consume and deducts that estimate from the token bucket, along with one unit from the request bucket. Once the response arrives, it reconciles the deduction with the actual token usage (to correct any estimation error). This way you don’t flood OpenAI’s servers while still maximising throughput.
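Here is how that estimate → deduct → correct cycle could look in practice. Again, this is a hedged sketch that reuses the ReplenishingBucket above, not the library’s API; the model name, the RPM/TPM numbers, and the tiktoken-based estimate are placeholder assumptions.

```python
import tiktoken

# Placeholder limits; set these to your account's actual RPM/TPM.
request_bucket = ReplenishingBucket(capacity_per_minute=500)    # requests/min
token_bucket = ReplenishingBucket(capacity_per_minute=60_000)   # tokens/min

async def rate_limited_completion(client, messages, max_tokens=256):
    # Estimate cost up front: prompt tokens plus the response budget.
    enc = tiktoken.get_encoding("cl100k_base")
    estimated = sum(len(enc.encode(m["content"])) for m in messages) + max_tokens

    # Reserve one request and the estimated tokens before calling the API.
    await request_bucket.acquire(1)
    await token_bucket.acquire(estimated)

    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=messages,
        max_tokens=max_tokens,
    )

    # Reconcile with the actual usage to correct the estimation error.
    actual = response.usage.total_tokens
    if actual < estimated:
        token_bucket.credit(estimated - actual)         # give back the surplus
    else:
        await token_bucket.acquire(actual - estimated)  # pay the shortfall
    return response
```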
Why give it a try:
The replenishing buckets let you handle concurrency gracefully and manage high workloads without (hopefully) hitting rate limits. You might still hit the token limit if you underestimate the length of the assistant’s answers.
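For a sense of what this buys you, here is a toy driver that fans out many requests at once through the helpers sketched above; the buckets pace the calls instead of letting them all hit the API simultaneously. (Again illustrative, not the library’s own interface.)

```python
import asyncio
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI()  # picks up OPENAI_API_KEY from the environment
    prompts = [f"Summarise topic {i} in one sentence." for i in range(100)]
    tasks = [
        rate_limited_completion(client, [{"role": "user", "content": p}])
        for p in prompts
    ]
    # All 100 coroutines start immediately; the buckets throttle the actual calls.
    responses = await asyncio.gather(*tasks)
    print(f"Completed {len(responses)} requests.")

asyncio.run(main())
```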
Caveats aside, I’ve been using this in a couple of projects, and it does the job (for me, at least).
I’d really appreciate your feedback, thoughts, or ideas on how to improve it!