The new rate limit documentation may, indirectly and evasively, describe what is going on with so many accounts.
Answer these if your gpt-3.5-turbo is slow:
- Are you currently on a prepay plan?
- Have you paid OpenAI over $50 in prepaid credits, more than a week ago?
Those seem to be the criteria for reaching the new “tiers” for prepaid-credit users, and it seems that quality of service comes along with giving OpenAI non-refundable money. Or rather, they “may” move you to lower latency.
Check whether you have been assigned one of the new distinct rate limits of 20,000 or 40,000 tokens per minute, which would mean you are in a lower “tier”.
Also, remember that OpenAI said they wouldn’t consider any rate-increase request form for the 10,000 GPT-4 TPM limit recently given to new users? Well, there’s your answer on that page: PAY UP.
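If you’d rather check from code than from the limits page, the API echoes your current limits in its response headers. A minimal sketch with `requests` follows; the `x-ratelimit-*` names are the documented response headers, and the one-token prompt just keeps the probe cheap:

```python
import os
import requests

# Minimal chat request; we only care about the rate-limit response headers.
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-3.5-turbo",
        "max_tokens": 1,
        "messages": [{"role": "user", "content": "hi"}],
    },
)
resp.raise_for_status()

# A 20000 or 40000 value here suggests you are in one of the lower "tiers".
for h in ("x-ratelimit-limit-requests", "x-ratelimit-limit-tokens"):
    print(h, "=", resp.headers.get(h))
```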
Interesting. But that’s still about hard token limits. Clearly, some users/organizations are being throttled well within all those limits, and that’s not mentioned anywhere. The only mention of throttling in the docs is about throttling yourself to stay within the provided limits.
I have two organizations: one is experiencing the slowness, the other is as fast as we are used to with GPT-3.5 Turbo. Oddly, the organization that has had no traffic in the last couple of months is the faster one, while the slow one is used in a production setting with only a few users, so well within all the limits and quotas.
The slowness just doesn’t make any sense. It’s not documented anywhere. There’s no status update. And any support request to OpenAI is closed with a note that it is probably your own fault, even when I send them the demo video you see above.
For now I have worked around it by using a different organization and by using Azure’s OpenAI APIs. But that’s not really a fix. I’m thinking about migrating every service I have to Azure OpenAI.
It still doesn’t make sense: the organization with the highest (paid) usage was seeing decreased performance, and still does even after sitting idle for a few days now, while the other organization that sat idle for months still has the fast Turbo performance.
edit: just confirmed both organizations have the same rate limits
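If anyone wants to reproduce this comparison, here’s a rough sketch that times an identical request against each organization with the `openai` v1 Python client; the org IDs are placeholders for your own:

```python
import time
from openai import OpenAI

# Placeholder org IDs -- substitute your own from the account settings page.
ORGS = ["org-PRODUCTION-PLACEHOLDER", "org-IDLE-PLACEHOLDER"]

for org in ORGS:
    client = OpenAI(organization=org)  # reads OPENAI_API_KEY from the environment
    t0 = time.time()
    client.chat.completions.create(
        model="gpt-3.5-turbo",
        max_tokens=100,
        messages=[{"role": "user", "content": "Write one sentence about cats."}],
    )
    print(f"{org}: {time.time() - t0:.2f}s")
```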
This is a perfect explanation of what has been happening lately, where some accounts are generating content faster than others. Thank you for this insight!
I have noticed outages across both an enterprise environment and an individual environment. I don’t think the problem is what everyone currently thinks it is. At certain parts of the day it is better, at others it’s not. I’m in Australia, so it works well during the day, but at night, when the US market comes online, I can see response times of up to 4-5 minutes.
In the enterprise environment, the regular ChatGPT model has been playing up without any incident being reported on the status page, but with a test pool of 5,000+ users it’s easily identified. The errors we see are something along the lines of “currently unavailable, please try again in a couple of minutes”. The issue is intermittent and does not hit all users at once.
With the OpenAI models: yes, if you sacrifice quality for speed rather than using the newer tech with better responses, you will see very short response times. The higher models will take, as previously stated, 4-5 minutes. It’s gotten gradually worse over the last couple of months, with both ChatGPT and API responses. I received emails this past week from Google Play Console and Apple, both stating they have seen a huge spike in API calls and applications using AI. They are also advising of stricter requirements and a 30-day window to comply.
The reason I have given the above information is that, with all of it considered, one would think OpenAI is struggling with the huge boom they are seeing, and I think in the new year we will see massive improvements to the latency, given the huge spike in new applications hitting the market and requiring quicker speeds. I have found that fine-tuning your own model on top of gpt-3.5 can help with response times, but we are still subject to the rapidly growing demand for this service. Also, I am doing a direct POST to the API, bypassing any intermediary servers.
Same situation for me. I’m developing a web app that doesn’t use more than 1,000 tokens per API call, and I randomly get one of these three outcomes:
1. instant response
2. very long waiting times (30-40 seconds)
3. sometimes the app hangs (I’m guessing there’s a network error, like I’m often seeing on ChatGPT lately, but I haven’t verified it yet)
Outcomes 1 and 2 are the most frequent, and like @mattholdsme I have the feeling that there’s a correlation with the time of day I’m calling the API (I’ve only tested it from Italy so far).
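In case it helps others narrow this down, here’s a quick sketch that buckets each call into one of the three outcomes, using a client-side timeout to catch the hangs; the 5-second and 60-second thresholds are arbitrary choices of mine:

```python
import os
import time
import requests

URL = "https://api.openai.com/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
PAYLOAD = {
    "model": "gpt-3.5-turbo",
    "max_tokens": 50,
    "messages": [{"role": "user", "content": "Say hello."}],
}

for trial in range(20):
    t0 = time.time()
    try:
        # Anything past 60 s we count as outcome 3 (a hang).
        requests.post(URL, headers=HEADERS, json=PAYLOAD, timeout=60).raise_for_status()
        elapsed = time.time() - t0
        outcome = "1 (instant)" if elapsed < 5 else "2 (slow)"
        print(f"trial {trial}: outcome {outcome}, {elapsed:.1f}s")
    except requests.exceptions.Timeout:
        print(f"trial {trial}: outcome 3 (hang, >60s)")
```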
First test of the API vs chat.openai.com: the API is pretty much unusable for anything due to the response times. My limits are 90,000 TPM / 3,500 RPM.
- Example 1: prompt tokens 425, completion tokens 100, total 525 - 8.9 seconds
- Example 2: prompt tokens 432, completion tokens 100, total 532 - 24.9 seconds
It needs to be in the < 1s range to be usable for my use case - so pretty much an instant response. Is this possible?
As an aside, and I know this doesn’t apply to many people, but the 32k context model for GPT-4 doesn’t appear to be impacted. However, the standard deviation of the 32k model is so large that it’s hard to gauge any real trend without smoothing the data.
I would imagine the OAI 32k model used for alpha testing is on its own node somewhere and is only affected by those with access, so small numbers relative to everything else.
The 100 tokens would take about 5 seconds; see the roughly 20 tokens per second in my graph above. But it could take as long as 20 seconds just due to model performance variations (5 tokens per second).
For instant response going out, maybe see if streaming is a fit for you.
We can measure 100-token responses; an average total of 2.3 seconds is a good score.
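For anyone curious how numbers like the ones below can be collected, here is a simplified reconstruction of such a measurement harness (not the exact script used; it assumes the `openai` v1 Python client, and computes “stream rate” as tokens after the first divided by the time elapsed after the first chunk):

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment

def measure(model: str, n_tokens: int = 100) -> dict:
    """Stream a completion and separate queue/network latency from generation rate."""
    t0 = time.time()
    first = None
    tokens = 0
    stream = client.chat.completions.create(
        model=model,
        max_tokens=n_tokens,
        top_p=0.01,  # near-deterministic sampling, as in the trials below
        stream=True,
        messages=[{"role": "user", "content": "Write about cats."}],
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            tokens += 1
            if first is None:
                first = time.time() - t0  # "latency": time to first content token
    total = time.time() - t0
    return {
        "latency (s)": round(first, 3),
        "total response (s)": round(total, 3),
        "total rate": round(tokens / total, 3),                   # tokens/s overall
        "stream rate": round((tokens - 1) / (total - first), 1),  # tokens/s after first token
        "response tokens": tokens,
    }

print(measure("gpt-3.5-turbo-0613"))
```

Streaming here also doubles as the user-facing fix mentioned above: tokens can be shown as they arrive instead of waiting for the full completion.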
For 3 trials of gpt-3.5-turbo-0613:

| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | 0.375 | 0.782 | 0.534 |
| total response (s) | 2.0236 | 2.482 | 2.226 |
| total rate (tokens/s) | 40.29 | 49.417 | 45.198 |
| stream rate (tokens/s) | 57.7 | 60.4 | 58.550 |
| response tokens | 100 | 100 | 100.000 |
For 3 trials of ft:gpt-3.5-turbo-0613:xxxx::yyyyy

| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | 0.376 | 0.416 | 0.396 |
| total response (s) | 1.531 | 1.667 | 1.597 |
| total rate (tokens/s) | 59.988 | 65.317 | 62.680 |
| stream rate (tokens/s) | 77.8 | 85.7 | 82.500 |
| response tokens | 100 | 100 | 100.000 |
You can see the fine-tuned model is fastest (due to concurrency or hardware), at 82 tokens per second as a capability, but only after the network delay and the model context loading of the 10 prompt tokens, at 0.4 seconds.
However, with the maximum wait found in a mere three trials of the base model being 0.8 seconds just to receive the first token (at top_p = 0.01 for minimum sampler input), you don’t get many more tokens before 1.0 second has elapsed: at the ~58 tokens per second stream rate, only about a dozen, and that’s on a good day that doesn’t stretch into multiple seconds. Let’s get responses as fast as we can, 3 tokens in, 5 out:
For 50 trials of gpt-3.5-turbo-0613:
| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | 0.195 | 2.091 | 0.595 |
| total response (s) | 0.253 | 2.101 | 0.647 |
| total rate (tokens/s) | 2.38 | 19.763 | 9.751 |
| stream rate (tokens/s) | 21.4 | 666.7 | 141.630 |
| response tokens | 5 | 7 | 5.040 |
0.65 seconds on average to barely say “hi”. (The extreme “stream” rate of tokens 2-5 is likely a minuscule delay in opening the API network response to you while the model is already generating.)
5-token cat chat at temperature = 1.9:

```
Cats, miracle piecesCats are fascinating animalsCats are mesmerizingCats,
those mysteriousCats, the curiousCats are itdeCats are captivating creaturesCats
are majestic animalsCats are incredibly fascinatingCats have the abilityCats
are fluffy andCats wander the worldCats have occupied aVAMarketing-winder’sCats
are perhaps theCats, beloved creaturesCats are charismatic andCats, though
muchCats are fascinating creaturesCats, known allCats, with theirCats, also
knownCats are fascinating creaturesCats are beloved creaturesCats, notorious
forCats: The UltimateCats, scientifically knownThere is something inherently
mysticalCats, also knownCats, one ofCats. MysteriousCats have drawn peopleCats:
Understanding OurOur beloved feline companionsCats, also knownCats are
amazing creaturesCats, wandering mystCats, often regardedCats are beloved
companionsCats are small mammalsCats, one ofCats are small,Cats have
attained aCats, longtime popularCats are fascinating creaturesCats, also
knownCats are enigmaticCats, domesticsCats are prevalent inCats, little
homem
```
Thanks for this. I’m building a very simple app that generates a personal value statement with gpt-3.5-turbo (prompt + reply don’t take more than 700 tokens), and I couldn’t figure out why everything worked smoothly with my API key, but with my client’s API key the script takes very long to execute (10-20 seconds) and often times out.
I’ll post the video I made as proof anyway, but I guess telling my client to buy at least $50 of credits should fix the problem.
EDIT TLDR: raise your rate limits by buying credits
Thanks for the responses. In my use case (a user asks a question, the AI generates SQL from it, and the SQL then queries the DB), running this from a search bar is not feasible with these response times. It sounds like ~10 seconds would be about right for this flow, which is not a reasonable user experience.