_j
62
The new rate limit documentation may indirectly and evasively describe what has been happening to so many accounts.
Answer if your gpt-3.5-turbo is slow:
- are you currently in a prepay plan?
- have you paid OpenAI over $50 in prepaid credits, over a week ago?
Those seem to be the criteria for reaching the new “tiers” as a prepay credit user, and it seems that quality of service comes along with giving OpenAI non-refundable money. Or rather, they “may” move you to lower latency.
Then answer:
- Go to your account rate limits page,
- See if you have been assigned one of the new unique and distinct rate limits of 20000 or 40000 tokens per minute, meaning you are in a lower “tier”.
Also, OpenAI has said they wouldn’t consider any rate limit increase request for the GPT-4 TPM limit of 10000 that new users have recently been given? Well, there’s your answer on that page: PAY UP.
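If you’d rather check this programmatically than eyeball the page, the API also echoes your current limits back in the response headers of every request (x-ratelimit-limit-tokens and x-ratelimit-limit-requests). A minimal sketch using the requests library; the model and prompt are just placeholders:

```python
import os
import requests

# Make a tiny chat completion request and read the rate limit headers
# that come back with every API response.
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    },
    timeout=30,
)
resp.raise_for_status()

# A value of 20000 or 40000 here suggests one of the lower prepay tiers.
print("TPM limit:", resp.headers.get("x-ratelimit-limit-tokens"))
print("RPM limit:", resp.headers.get("x-ratelimit-limit-requests"))
```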
6 Likes
Interesting. But that’s still about hard token limits. Clearly, some users/organizations are being throttled well within all those limits, and that’s not mentioned anywhere. The only mention of throttling in the docs is about throttling yourself to stay within the provided limits.
I have 2 organizations, one of which is experiencing the slowness; the other is as fast as we are used to with GPT-3.5 Turbo. However, the organization that did not have any traffic in the last couple of months is faster than the one I’m using in a production setting with not a lot of users, so very well within all the limits and quotas.
Demo of the issue:
Another topic about the slowness, which mentions it is only happening for certain organizations: GPT-3.5 API is 30x slower than ChatGPT equivalent prompt
The slowness just doesn’t make any sense. It’s not documented anywhere. There’s no status update. And any support request to OpenAI is closed with a note that it is probably your own fault, EVEN when I send them the demo video you see above.
For now I have worked around it by using a different organization and by using Azure’s OpenAI APIs. But that’s not really a fix. Thinking about migrating every service I have to Azure OpenAI.
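For anyone who wants to reproduce the comparison between organizations, here is a rough sketch (assuming the v1 openai Python SDK; the organization IDs are placeholders) that times an identical request against each org:

```python
import time
from openai import OpenAI

# Placeholder organization IDs; substitute your own two orgs.
ORGS = {
    "idle org": "org-AAAAAAAAAAAA",
    "production org": "org-BBBBBBBBBBBB",
}

for label, org_id in ORGS.items():
    client = OpenAI(organization=org_id)  # same API key, different org
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Write one sentence about cats."}],
        max_tokens=50,
    )
    print(f"{label}: {time.perf_counter() - start:.2f}s")
```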
6 Likes
_j
64
I’ll make it more explicit:
As your usage tier increases, we may also move your account onto lower latency models behind the scenes.
Translation: You’ve been moved to higher latency models behind the scenes.
(the correct term is not “latency” like ping time; they surely mean low output rates.)
2 Likes
Still doesn’t make sense, because the organization with the highest (paid) usage was seeing decreased performance, and still has decreased performance even after sitting idle for a few days now. The other organization that was sitting idle for months (still) has the fast turbo performance.
edit: just confirmed both organizations have the same rate limits
1 Like
senols
66
This is a perfect explanation of what has been happening lately, where some accounts are generating content faster than others. Thank you for this insight!
3 Likes
dan24
67
The question of “did I break something or did the models just change” has driven me bonkers in the past.
I decided to put together a latency benchmark article and newsletter, for which I burned a bunch of tokens to get average latency marks for different models and providers (Azure, Anthropic, OAI). I’ll re-run the experiment and update it monthly.
Here’s a link to the article and there is a place to drop your email if you want to get the new numbers every month.
1 Like
I have noticed outages across both an enterprise environment and an individual environment. I don’t think the problem is what everyone is currently thinking. There are certain parts of the day when it is better and others when it’s not. I’m in Australia, so it works well during the day, but at night when the US market comes online, I can see times of up to 4-5 minutes.
In the enterprise environment, the regular ChatGPT model has been playing up without any incident being reported on the status page, but with a test pool of 5000+ users it’s easily identified. The errors we see are something along the lines of “currently unavailable, please try again in a couple of minutes”. The issue is intermittent and does not hit all users at once.
With the OpenAI models, yes, if you sacrifice quality in exchange for older tech, then you will see very short response times. The higher models will do as previously stated, 4-5 minutes. It has gotten gradually worse over the last couple of months with both ChatGPT and API responses. I received emails this past week from Google Play Console and Apple, and both are stating they have seen a huge spike in API calls and applications using AI. They are also advising of stricter requirements and a 30-day window to respond in order to comply.
The reason I have given the above information is that, with all of the above considered, one would think that OpenAI is struggling with the huge boom they are seeing, and I think in the new year we will see massive improvements to latency, given the spike in new applications hitting the market and requiring quicker speeds. I have found that fine-tuning your own model on top of gpt-3.5 can help with response times, but we are still subject to the rapidly growing demand for this service. Also, I am making direct POST requests, bypassing intermediate servers.
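For anyone who hasn’t tried that, starting such a fine-tune is only a couple of API calls. A minimal sketch with the v1 openai Python SDK, assuming you already have a JSONL file of chat-formatted training examples (the filename is a placeholder); whether the resulting model actually responds faster will still depend on overall load:

```python
from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of chat-formatted training examples
# ("training_data.jsonl" is a placeholder path), then start the job.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print("Fine-tuning job started:", job.id)
```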
Same situation for me. I’m developing a web app that doesn’t use more than 1000 tokens per API call, and I randomly get one of these 3 outcomes:
- instant response
- very long waiting times, 30-40 seconds
- sometimes the app hangs (I’m guessing there’s a network error, like I’m often seeing on ChatGPT lately, but I haven’t verified it yet)
Outcomes 1 and 2 are the most frequent, and like @mattholdsme I have the feeling that there’s a correlation with the time of day I’m calling the API (I have only tested from Italy so far).
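To tell an actual hang apart from a merely slow response, one option is to put an explicit client-side timeout on the call and retry. A sketch with the v1 openai Python SDK; the timeout and retry counts are arbitrary examples:

```python
import time
from openai import OpenAI, APITimeoutError

# 60-second client-side timeout, no built-in retries; values are arbitrary.
client = OpenAI(timeout=60.0, max_retries=0)

for attempt in range(1, 4):
    try:
        start = time.perf_counter()
        client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hello"}],
            max_tokens=100,
        )
        print(f"attempt {attempt}: responded in {time.perf_counter() - start:.1f}s")
        break
    except APITimeoutError:
        # The request hung past the timeout instead of coming back slowly.
        print(f"attempt {attempt}: timed out after 60s, retrying")
```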
I was having the same slowdown problem, which was only solved by adding $50 in credit. After that, the response time returned to normal.
1 Like
reyemtm
71
First test of the API vs chat.openai.com: this is pretty much not usable for anything due to the response times. My limits are 90,000 TPM / 3,500 RPM.
Example 1: 425 prompt tokens, 100 completion tokens, 525 total: 8.9 seconds.
Example 2: 432 prompt tokens, 100 completion tokens, 532 total: 24.9 seconds.
It needs to be in the < 1s range to be usable for my use case - so pretty much an instant response. Is this possible?
FWIW, here is my GPT-4 token generation graph for the last 20 days.
I am at Tier 4.
About 10-11 days ago it started slowing down.
1 Like
As an aside, and I know this doesn’t apply to many people, but the 32k context model for GPT-4 doesn’t appear to be impacted. However, the standard deviation of the 32k model is so large that it’s hard to gauge any real trend without smoothing the data.
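If it helps, a simple rolling mean over per-request tokens-per-second samples is usually enough to make a trend visible. A sketch with pandas; the CSV file and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical per-request log: a timestamp and the measured tokens/second.
df = pd.read_csv("gpt4_32k_timings.csv", parse_dates=["timestamp"])
df = df.sort_values("timestamp").set_index("timestamp")

# A one-day rolling mean smooths out the large per-request variance
# enough to see whether throughput is actually trending down.
df["tokens_per_sec_smoothed"] = df["tokens_per_sec"].rolling("1D").mean()
print(df[["tokens_per_sec", "tokens_per_sec_smoothed"]].tail())
```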
I would imagine the OAI 32K model used for alpha testing is on its own node somewhere and is only affected by those with access, so small numbers relative to everything else.
Yeah, 32k has always been a bit erratic. Here is a plot going back 160 days to the present.
Something happened 125 days ago which improved the speed. Then there is some dip 75 days ago that has since been rectified.
No idea why the performance is so variable, but overall it’s faster than the 8k GPT-4 version.
_j
76
You need 100 tokens of text returned to you in under a second?
The only way you’d be able to reliably do that is on your own A100/H100 cards and a tiny model.
reyemtm
77
What would an expected response time for 100 tokens be?
The 100 tokens would take about 5 seconds. See the rough 20 tokens per second in my graph above. But it could take as long as 20 seconds just due to model performance variations (5 tokens per second).
For instant response going out, maybe see if streaming is a fit for you.
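For illustration, a minimal streaming sketch with the v1 openai Python SDK: the first tokens typically arrive well under a second even when the full 100-token completion takes several seconds, so the perceived response is close to instant:

```python
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Describe a cat in about 100 tokens."}],
    max_tokens=100,
    stream=True,
)

first_token_at = None
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        print(delta, end="", flush=True)

print(f"\nfirst token after {first_token_at:.2f}s, "
      f"full response after {time.perf_counter() - start:.2f}s")
```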
_j
79
We can measure 100 tokens. An average of 2.3 seconds is a good score.
For 3 trials of gpt-3.5-turbo-0613
| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | 0.375 | 0.782 | 0.534 |
| total response (s) | 2.0236 | 2.482 | 2.226 |
| total rate (tokens/s) | 40.29 | 49.417 | 45.198 |
| stream rate (tokens/s) | 57.7 | 60.4 | 58.550 |
| response tokens | 100 | 100 | 100.000 |
For 3 trials of ft:gpt-3.5-turbo-0613:xxxx::yyyyy
| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | 0.376 | 0.416 | 0.396 |
| total response (s) | 1.531 | 1.667 | 1.597 |
| total rate (tokens/s) | 59.988 | 65.317 | 62.680 |
| stream rate (tokens/s) | 77.8 | 85.7 | 82.500 |
| response tokens | 100 | 100 | 100.000 |
You can see the fine-tuned model is fastest (due to lower concurrency or different hardware), at 82 tokens per second of raw capability, but that is only after the network delay and the model context loading of 10 prompt tokens, at about 0.4 seconds.
However, with the maximum wait found in a mere three trials of the normal model being 0.8 seconds just to receive the first token (at top_p = 0.01 for minimum sampler input), you don’t get many more tokens before 1.0 second has elapsed, even on a good day that doesn’t stretch into multiple seconds. Let’s get responses as fast as we can: 3 tokens in, 5 out:
For 50 trials of gpt-3.5-turbo-0613:
| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | 0.195 | 2.091 | 0.595 |
| total response (s) | 0.253 | 2.101 | 0.647 |
| total rate (tokens/s) | 2.38 | 19.763 | 9.751 |
| stream rate (tokens/s) | 21.4 | 666.7 | 141.630 |
| response tokens | 5 | 7 | 5.040 |
0.65 seconds to barely say “hi”. (The extreme “stream” rate of tokens 2-5 is likely a minuscule delay in opening the API network response to you while the model is already generating.)
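The exact harness isn’t posted here, but a rough sketch of how “latency” (time to the first streamed token) and “stream rate” (tokens per second after the first token) could be measured looks something like this (v1 openai Python SDK; the prompt and trial count are placeholders):

```python
import time
from openai import OpenAI

client = OpenAI()

def one_trial(model: str, max_tokens: int = 100):
    """Return (seconds to first streamed token, tokens/sec after the first)."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write about cats."}],
        max_tokens=max_tokens,
        top_p=0.01,   # near-deterministic sampling, as in the trials above
        stream=True,
    )
    first, count = None, 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            count += 1  # each content chunk is roughly one token
            if first is None:
                first = time.perf_counter() - start
    total = time.perf_counter() - start
    rate = (count - 1) / (total - first) if count > 1 else 0.0
    return first, rate

results = [one_trial("gpt-3.5-turbo-0613") for _ in range(3)]
latencies, rates = zip(*results)
print(f"latency (s):  min {min(latencies):.3f}  max {max(latencies):.3f}")
print(f"stream rate:  min {min(rates):.1f}  max {max(rates):.1f}")
```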
5-token cat chat at temperature = 1.9
Cats, miracle piecesCats are fascinating animalsCats are mesmerizingCats,
those mysteriousCats, the curiousCats are itdeCats are captivating creaturesCats
are majestic animalsCats are incredibly fascinatingCats have the abilityCats
are fluffy andCats wander the worldCats have occupied aVAMarketing-winder’sCats
are perhaps theCats, beloved creaturesCats are charismatic andCats, though
muchCats are fascinating creaturesCats, known allCats, with theirCats, also
knownCats are fascinating creaturesCats are beloved creaturesCats, notorious
forCats: The UltimateCats, scientifically knownThere is something inherently
mysticalCats, also knownCats, one ofCats. MysteriousCats have drawn peopleCats:
Understanding OurOur beloved feline companionsCats, also knownCats are
amazing creaturesCats, wandering mystCats, often regardedCats are beloved
companionsCats are small mammalsCats, one ofCats are small,Cats have
attained aCats, longtime popularCats are fascinating creaturesCats, also
knownCats are enigmaticCats, domesticsCats are prevalent inCats, little
homem
My numbers are for GPT-4. GPT-3.5-Turbo will be much faster.
Thanks for this. I’m building a very simple app that generates a personal value statement with gpt-3.5-turbo (the prompt plus reply don’t take more than 700 tokens), and I couldn’t figure out why everything worked smoothly with my API key, but with my client’s API key the script takes very long to execute (10-20 seconds) and often times out.
I’ll post the video I made as proof anyway, but I guess telling my client to buy at least $50 of credits should fix the problem.
EDIT
TLDR: raise your rate limits by buying credits