Is it normal to wait ~3-4 minutes for “total_tokens”: 4148 with gpt-3.5-turbo-16k?
I can get about 110,000 gpt-3.5-turbo-16k tokens within 5 minutes … when asking for the same question to be answered 50 times.
The normal production rate of -16k has tended to be around 30-40 tokens per second, which is roughly 2,000 per minute. If your output seems to be chopped to half that or less, you can check your account’s “rate limits” page, where you can also see which trust-level tier you are in. Tier 1 also gets the slow models.
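If you want to measure your own rate, here is a minimal sketch, assuming the current openai Python client with OPENAI_API_KEY set in the environment (the prompt is just a placeholder):

```python
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.monotonic()
response = client.chat.completions.create(
    model="gpt-3.5-turbo-16k",  # the model discussed above
    messages=[{"role": "user", "content": "Write a 500-word story."}],
)
elapsed = time.monotonic() - start

tokens = response.usage.completion_tokens
print(f"{tokens} completion tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tokens/s")
```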
Thx for the answer. Now I see why it’s slower than before: I’m on Tier 1…
The only way to get a faster speed is by “Tier 2 $50 paid and 7+ days since first successful payment”
Waiting 7 days?
Others have reported getting their speed back much faster.
The waiting period is likely there so you can be trusted to deposit and spend the larger amounts of the next tier.
The claim that getting onto Tier 2 is the answer to huge delays makes common sense, but I am on Tier 1 and I’ve reviewed the rate limits for that tier and GPT-3.5-Turbo. It is totally impossible that my chatbot is getting anywhere near those limits, yet I’m getting delays of 30 seconds and up, occasionally more than 100 seconds. If needing Tier 2 is the answer, then it certainly isn’t explained in OpenAI’s documentation!
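If you want to confirm per request how far you are from the limits, the API returns rate-limit headers with every response. A sketch, assuming the openai v1 Python client, whose with_raw_response wrapper exposes the HTTP headers:

```python
from openai import OpenAI

client = OpenAI()

# .with_raw_response gives access to the raw HTTP response, including headers
raw = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "ping"}],
)
for name in (
    "x-ratelimit-limit-requests",
    "x-ratelimit-remaining-requests",
    "x-ratelimit-limit-tokens",
    "x-ratelimit-remaining-tokens",
):
    print(name, "=", raw.headers.get(name))

completion = raw.parse()  # the usual ChatCompletion object
```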
They aren’t going to directly write “We took a whole bunch of accounts that were low-value and put them into an API filter buffer that simulates slow output to decrease their satisfaction. Goal: get them to pay more to return to normal.”
OpenAI rewrote the text on the “rate limits” page to:
“Organizations in higher tiers also get access to lower latency models.”
Previously: “As your usage tier increases, we may also move your account onto lower latency models behind the scenes.”
Lower latency “models” makes no sense. Why leave a generation on an overloaded, time-sliced server when it is more efficient to generate 100 tokens a second and free that processor for another user? It would only make sense if there were no way for them to serve the current customer load without spinning up slow, energy-inefficient GPU instances of older technology.
I guess that’s the way it is. I, for one, would prefer they be more transparent about it. I can handle the truth.
Let me share an interesting observation with you.
Until I opened my API account one week ago, my ChatGPT account was very fast. But from the moment I opened my API account, ChatGPT became painfully slow. The API is also painfully slow. The funny thing is that my wife’s ChatGPT account has been fast the whole time.
Conclusion:
It’s part of their business plan to slow it down for you to make you pay more money. They will fail. I’m going to cancel my API account.
Hi and welcome to the Developer Forum!
You get an API account automatically when you create an OpenAI account. It’s likely that you are currently being assigned to a busy server while your wife is not; this is not something that is carefully calculated in the background, just the luck of the draw, and it could easily be reversed tomorrow. The system is under heavy load after DevDay, please bear with it.
Let me share an interesting observation with you.
You had an API account the whole time.
Also, let me know if you find where to “cancel” an API account. The only way I know of to do that is to write sex fantasies about bombs and guns.
API has a tier system, where, as the rest of this topic describes, they trickle out the tokens to degrade the experience until you pay more in credits.
Thanks Foxabilo.
I introduced the API service in my web application a week ago. And it has been a total disaster. Response times of up to 3 minutes for simple questions. So right now I have removed that service from my application again. I can’t rely on that technology. Just imagine if my application were 100% dependent on the API service and suddenly response times exploded for a week or more. That would be a complete disaster.
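For anyone keeping the service in place, one way to contain the risk (not a fix for the latency itself) is a hard client-side timeout with an application-level fallback; a sketch, assuming the openai v1 Python client:

```python
from openai import OpenAI, APITimeoutError

# Hard per-attempt timeout plus one retry, so the app can fail over
# instead of hanging for minutes on a slow response.
client = OpenAI(timeout=15.0, max_retries=1)

try:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "A simple question."}],
    )
    answer = response.choices[0].message.content
except APITimeoutError:
    answer = "The assistant is slow right now; please try again."  # fallback path
print(answer)
```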
The greatest laugh of all is the fear of AI cancelling millions of jobs. AI is run by incompetent leaders destroying their own business.
I have filed a support ticket for cancellation of my API account. The AI supporter told me that a real human being would contact me.
I started using it yesterday. API responses are a disaster. Don’t trust what they tell you at the dev conference about “Faster API calls”.
I’m following a Generative AI for Beginners course and am using the API on a Jupyter Notebook.
It took 38s to get a poem response…
Absolute disaster.
I think using an older API version, rather than the latest, helps with the times, but then the results become inconsistent even when using seed 0.
Yes! I rolled back to 0.28.0 and the response is much better with both OpenAI and Azure OpenAI. I was just about to mention this. The fact that I was getting this EVEN WITH A SUPPOSEDLY PRIVATE DEPLOYMENT OF GPT is eye-opening.
What the hell is going on? This new version is like a messed up Windows update
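For anyone trying the rollback mentioned above, a minimal sketch of the legacy pre-1.0 interface, assuming openai==0.28.0 (Azure OpenAI additionally needs openai.api_type, api_base, and api_version set):

```python
# pip install openai==0.28.0
import openai  # legacy, pre-1.0 module-level interface

openai.api_key = "sk-..."  # or rely on the OPENAI_API_KEY environment variable

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a short poem."}],
)
print(response["choices"][0]["message"]["content"])
print(response["usage"])  # prompt_tokens, completion_tokens, total_tokens
```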
Summary created by AI.
Participants in this thread are expressing concerns about the slow response times from the OpenAI API, with issues persisting across different types of API calls and accounts. Initial posts by jayben71, dlflannery, and rob.wheatley indicate experiences of delay, with responses taking up to a minute. This issue is also observed by users on free accounts and does not seem tied to rate limits, as clarified by N2U and rob.wheatley.
Many users attempt to address the problem with different solutions. rob.wheatley explores the possibility of using the stream:true property to return results to users faster (see the streaming sketch after this summary). He later reports improved user experiences after implementing streaming (post 9), while other users express interest in this approach, such as in post 10 as well as in posts from aayush_shah. SomebodySysop suggests handling streamed responses with PHP’s cURL library.
However, not everyone has found a consistent solution. sorn.denis and npozega1 describe an ongoing issue of slow response times and time-outs. rob.wheatley and AliGreer43 note that the problem may be linked to capacity issues at OpenAI. They suggest that OpenAI should consider prioritizing capacity for paying API users over free accounts.
_j offers potential solutions to these problems, suggesting the exploration of different platforms and accounts. They propose running Python locally or switching datacenters, and even creating a second account to test if accounts are being discriminated against (post 39). They later point to new rate limit documentation, and suggest prepay users may be receiving a higher priority (post 62).
jvandenaardweg points out that not all organizations are experiencing the slowdown and shares video evidence demonstrating the issue. They express frustration that OpenAI’s support has been unhelpful, despite demonstrating the problem with the demo video.
Summarized with AI on Nov 30 2023
AI used: gpt-4-32k
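Since streaming is the fix several people in this thread landed on, here is a minimal stream:true sketch, assuming the openai v1 Python client; tokens print as they arrive instead of after the whole completion:

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a short poem."}],
    stream=True,  # tokens are sent as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()
```

Streaming does not make the total generation any faster, but the time to first token is far shorter, which is usually what users perceive as responsiveness.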
I’m also facing a (user-perceived, latency-related) slowness of the Chat Completion API and would like to contribute to this topic and the overall OpenAI developer experience.
- The Improving latencies guide is a very useful place to start troubleshooting slow responses.
- SUGGESTION: it would really be nice to have more transparency on the completion_tokens/second ratio of the API. The articles about Rate Limits and Usage tiers are very useful, but they don’t mention response times or the completion-tokens-per-second ratio. Would other developers be interested if this information were made public? Maybe add a section to that document listing the expected performance range per tier, per zone.
- GOING FURTHER, I couldn’t find a way to vote for improvements to the performance (latency) of the Chat Completion API; this thread was closed, and I kindly asked for it to be reopened so I could contribute. Are other developers also interested in voting on feature/docs requests?
Near Real-time Web App Exploration
I ran some experiments with different versions of GPT in December 2023, across Tiers 1 to 3, and the first conclusion is that only gpt-3.5-turbo-1106 is production-ready for near real-time apps, irrespective of the tier. Even then, it requires using streaming or adding some UX animation for the user’s wait.
GPT 3.5 Turbo (starting on 12/11/23)
I got responses in 4-9s (6s on average), with rates of 150-400 total_tokens/s (~200 on avg) on Tier 3. If we only consider the completion tokens, the item that matters most in terms of latency, my ratio was ~90 tokens per second on an average of 520 chat completion tokens.
→ Tier 3 performance for gpt-3.5-turbo-1106 is slightly better:
- tier 1 avg response was 10s
- tier 2 avg response was 9s
→ Performance for gpt-3.5-turbo-0613 changes significantly between tiers 1 to 3.
GPT-4 and GPT-4-Turbo
I got responses in 1-2 min, which is not acceptable for production.
- gpt-4-0613: I got on average a 1 min response time and a ratio of 36 total_tokens/s
- gpt-4-1106-preview: I got on average a 1.3 min response time and a ratio of 22 total_tokens/s
→ No noticeable difference between tiers 1 to 3.
Notes
I collected about 100 data points, using on average 1.2k tokens per request (0.7k prompt and 0.5k completion); I was way below my RPM, RPD, TPM and TPD limits. I can provide more data upon request.
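For anyone who wants to reproduce this, a sketch of one way to collect such data points (a sketch under assumptions, not the exact script used; openai v1 client, with placeholder model name and prompt):

```python
import statistics
import time

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo-1106"  # swap in each model under test
PROMPT = "..."                # the same ~0.7k-token prompt for every request

latencies, rates = [], []
for _ in range(30):           # data points per model/tier
    start = time.monotonic()
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=500,       # keep completion sizes comparable across runs
    )
    elapsed = time.monotonic() - start
    latencies.append(elapsed)
    rates.append(response.usage.completion_tokens / elapsed)

print(f"avg response: {statistics.mean(latencies):.1f}s")
print(f"avg completion tokens/s: {statistics.mean(rates):.1f}")
```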
Update - Aug/2024
My original experiment was around Dec/2023; I tried a smaller experiment in Jun/2024 and GPT-3.5-turbo was still faster (compared to gpt-4-turbo-2024-04-09, or even to 4o).
Now, in Aug/2024, I ran another small experiment (~30 data points), comparing only gpt-3.5-turbo-0125 with gpt-4o-2024-05-13, with around 1k completion_tokens and 3k total_tokens, and GPT-4o was slightly better: an average response of 13.3s vs 13.8s with GPT-3.5 Turbo.
Wait times seem to have improved since this post, for me at least. I am on Tier 4 and using gpt-4-1106. The response times for this model seem reasonable, though still a bit sluggish.
Tier 3 here, 300,000 TPM.
A 4,097-token prompt takes 1 minute 38 seconds with gpt-4-1106-preview.