I am a product manager for a social application, and I am planning to introduce ChatGPT Enterprise Edition as a virtual chatbot in our app. Our app has around 300,000-400,000 active users who use it extensively on a daily basis, and we anticipate generating around 10,000-15,000 GPT API requests per minute during peak usage.
However, I have some concerns about the current limit of 200 requests per minute for ChatGPT 4.0 Enterprise Edition, so I am seeking assistance and advice from the community administrators on the best approach to increasing the number of requests per minute. I would like to understand the process for applying to OpenAI for a rate-limit increase, and whether there are any recommended, timely solutions for handling the expected peak request volume.
If anyone has suggestions, experience, or knowledge about this, I sincerely request your guidance. Our goal is to optimize the performance of the virtual chatbot and provide an excellent user experience to our large user base.
Thank you very much for your support and help! I look forward to discussing this with all of you.
There’s no such thing as ChatGPT Enterprise Edition.
What you’re describing would likely be on the order of at least $200 / minute in API use: 10,000 requests per minute at 400 tokens per request, at roughly $0.05 per 1,000 tokens (if it’s 15,000 requests of 800 tokens each, that’s $600 / minute).
The standard usage limit is $120 / month; that goes up to $500 / month with a good billing history, a submitted request, and a wait of several weeks. The largest usage limit I’ve seen is on the order of a few thousand dollars per month.
My point here is that even if you got them to lift your API rate limit by 50x–75x its current value, you would blow through the standard monthly billing limit in about 36 seconds (or as little as 12 seconds at the higher estimate).
By my rough estimation, you’re looking at a minimum of $400,000–$500,000 / month in usage (assuming 60 minutes / day at peak), but easily up to $1.5M (and possibly much, much more once off-peak use is included).
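If you want to sanity-check that arithmetic, here’s a rough sketch (the per-1,000-token price and the 60-minutes-of-peak-per-day figure are the assumptions above, not official pricing):

```python
# Back-of-the-envelope cost estimate for the traffic described above.
PRICE_PER_1K_TOKENS = 0.05   # assumed blended price, not official pricing
PEAK_MINUTES_PER_DAY = 60    # assumption from the estimate above
DAYS_PER_MONTH = 30

def cost_per_minute(requests_per_minute: int, tokens_per_request: int) -> float:
    """Dollar cost of one minute of traffic at the assumed token price."""
    tokens = requests_per_minute * tokens_per_request
    return tokens / 1000 * PRICE_PER_1K_TOKENS

for rpm, tokens in [(10_000, 400), (15_000, 800)]:
    per_min = cost_per_minute(rpm, tokens)
    per_month = per_min * PEAK_MINUTES_PER_DAY * DAYS_PER_MONTH
    print(f"{rpm:,} req/min x {tokens} tokens: ${per_min:,.0f}/min, ${per_month:,.0f}/month")

# 10,000 req/min x 400 tokens: $200/min, $360,000/month
# 15,000 req/min x 800 tokens: $600/min, $1,080,000/month
```

The exact monthly figures move with the traffic assumptions, but they land in the same range as the estimate above.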
Microsoft Azure is where you need to go if you want to deploy essentially unlimited GPT-4.
With that much volume, check into the dedicated instances:
Dedicated instances
We are also now offering dedicated instances for users who want deeper control over the specific model version and system performance. By default, requests are run on compute infrastructure shared with other users, who pay per request. Our API runs on Azure, and with dedicated instances, developers will pay by time period for an allocation of compute infrastructure that’s reserved for serving their requests.
Developers get full control over the instance’s load (higher load improves throughput but makes each request slower), the option to enable features such as longer context limits, and the ability to pin the model snapshot.
Dedicated instances can make economic sense for developers running beyond ~450M tokens per day. Additionally, it enables directly optimizing a developer’s workload against hardware performance, which can dramatically reduce costs relative to shared infrastructure. For dedicated instance inquiries, contact us.
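For scale: at the load described in the original post, that ~450M-token threshold is crossed quickly. A rough check (the hours-at-peak figure is an assumption):

```python
# Rough check of daily token volume against the ~450M tokens/day
# threshold quoted above for dedicated instances.
DEDICATED_THRESHOLD = 450_000_000  # tokens/day, from the quote above

def daily_tokens(requests_per_minute: int, tokens_per_request: int,
                 peak_hours_per_day: float) -> int:
    return int(requests_per_minute * tokens_per_request * 60 * peak_hours_per_day)

# Even the low end of the stated load exceeds the threshold after
# roughly two hours of peak traffic per day:
volume = daily_tokens(10_000, 400, peak_hours_per_day=2)
print(f"{volume:,} tokens/day, above threshold: {volume > DEDICATED_THRESHOLD}")
# 480,000,000 tokens/day, above threshold: True
```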
May I suggest starting conversations with GPT-4 and then switching to GPT-3.5?
That may actually work. Assuming most of the requests are not new conversations, you may be able to leverage the conversation history for more accurate output.
You should be able to leverage prompt engineering as well.
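Something like this, for instance (a minimal sketch using the pre-v1 openai Python SDK; is_new_conversation and the message list are stand-ins for however you track sessions in your app):

```python
# Route the first turn of a conversation to GPT-4, then serve
# follow-up turns with the cheaper GPT-3.5, relying on the
# accumulated conversation history for context.
import openai

# openai.api_key = "sk-..."  # set via environment or config

def chat_reply(messages: list[dict], is_new_conversation: bool) -> str:
    model = "gpt-4" if is_new_conversation else "gpt-3.5-turbo"
    response = openai.ChatCompletion.create(model=model, messages=messages)
    return response.choices[0].message.content
```

How much this saves depends on what fraction of your requests are first turns.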
On top of this, have you considered something like Stable Beluga 2?
Thank you for your assistance.
Will the cost come down after deploying on Microsoft Azure, as in the solution you described? I have recalculated carefully: we expect around 300,000 users per day, and with load balancing, peak requests per second can be kept under 15. Is there a better plan for controlling costs? Thank you very much!