OpenAI Why Are The API Calls So Slow? When will it be fixed?

New rate limit documentation may indirectly and evasively describe what is going on with so many accounts.

Answer if your gpt-3.5-turbo is slow:

  • are you currently in a prepay plan?
  • have you paid OpenAI over $50 in prepaid credits, over a week ago?

Those seem to be the criteria for getting into the new “tiers” for prepay credit users, and it seems that quality of service comes along with giving OpenAI non-refundable money. Or rather, they “may” move you to lower latency.

Then answer:

  • Go to your account rate limits page,
  • See if you have been assigned one of the new unique and distinct rate limits of 20000 or 40000 tokens per minute, meaning you are in a lower “tier” (you can also check this from code, as in the sketch below).
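
If you’d rather check from code than from the dashboard, the API also echoes your current limits back in response headers. A minimal sketch, assuming the openai Python SDK v1 and its raw-response helper (the model and prompt here are just placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask for the raw HTTP response so the rate-limit headers are visible
raw = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=1,
)

# These x-ratelimit-* headers report the RPM/TPM currently assigned to your key
for name in (
    "x-ratelimit-limit-requests",
    "x-ratelimit-limit-tokens",
    "x-ratelimit-remaining-requests",
    "x-ratelimit-remaining-tokens",
):
    print(name, "=", raw.headers.get(name))

completion = raw.parse()  # the usual ChatCompletion object, if you want the text too
```

If the tokens-per-minute limit comes back as 20000 or 40000, that matches the lower “tiers” described above.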

Also, remember how OpenAI said they wouldn’t consider any rate-increase request form for the GPT-4 TPM limit of 10000 recently given to new users? Well, there’s your answer on that page: PAY UP.

7 Likes

Interesting. But that’s still about hard token limits. Clearly, some users/organizations are being throttled well within all those limits, and that’s not mentioned anywhere. The only mention of throttling in the docs is about throttling it yourself to stay within the provided limits.

I have 2 organizations; one of them is experiencing the slowness, the other is as fast as we are used to with GPT-3.5 Turbo. However, the organization that did not have any traffic in the last couple of months is faster than the one I’m using in a production setting with not a lot of users, so well within all the limits and quotas.

Demo of the issue:

Another topic about the slowness, which mentions it is only happening for certain organizations: GPT-3.5 API is 30x slower than ChatGPT equivalent prompt

The slowness just doesn’t make any sense. It’s not documented anywhere. There’s no status update. And any support request to OpenAI is closed with the suggestion that it is probably your own fault, EVEN when I send them the demo video you see above.

For now I have fixed it by using a different organization and by using Azure’s OpenAI APIs. But that’s not really a fix. Thinking about migrating every service I have to Azure OpenAI.
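
In case it helps anyone doing the same comparison: the organization a request runs under can be set per client, so it’s easy to A/B the slow and fast orgs from one script. A minimal sketch, assuming the openai Python SDK v1 (the org IDs are placeholders for your own):

```python
import time

from openai import OpenAI

# Placeholder org IDs - substitute the ones from your account page
org_a = OpenAI(organization="org-AAAAAAAA")  # the "slow" production org
org_b = OpenAI(organization="org-BBBBBBBB")  # the idle org

def timed_ask(client: OpenAI, text: str) -> float:
    """Return the wall-clock seconds for one small completion."""
    start = time.monotonic()
    client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": text}],
        max_tokens=100,
    )
    return time.monotonic() - start

prompt = "Write two sentences about anything."
print("org A:", round(timed_ask(org_a, prompt), 2), "s")
print("org B:", round(timed_ask(org_b, prompt), 2), "s")
```

Same prompt, same model, two organizations; only the organization the request is attributed to differs.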

6 Likes

I’ll make it more explicit:

As your usage tier increases, we may also move your account onto lower latency models behind the scenes.

Translation: You’ve been moved to higher latency models behind the scenes.

(The correct term is not “latency” as in ping time; what they surely mean is low token output rates.)

3 Likes

Still doesn’t make sense, because the organization that has the highest (paid) usage was seeing decreased performance, and still had decreased performance even when sitting idle now for a few days. The other organization that was sitting idle for months (still) has the fast turbo performance.

edit: just confirmed both organizations have the same rate limits

1 Like

This is a perfect explanation of what has been happening lately, where some accounts are generating content faster than others. Thank you for this insight!

3 Likes

I have noticed outages across an enterprise environment and an individual environment. I don’t think the problem is what everyone is currently thinking. There are certain parts of the day when it is better and others when it’s not. I’m in Australia, so it works well during the day, but at night when the US market comes online, I see times of up to 4-5 minutes.

In the enterprise environment, the regular ChatGPT model has been playing up without any incident being reported on the status page, but with a test pool of 5000+ users it’s easily identified. The errors we see are something along the lines of “currently unavailable, please try again in a couple of minutes”. The issue is intermittent and does not hit all users at once.

With the OpenAI models: yes, if you sacrifice quality rather than going for the newer tech and better responses, then you will see very short response times. The higher models will, as previously stated, take 4-5 minutes. It’s gotten gradually worse over the last couple of months with both ChatGPT and API responses. I received emails this last week from Google Play Console and Apple, and both state they have seen a huge spike in API calls and applications using AI. They are also advising of stricter requirements and a 30-day response window in order to comply.

The reason I have given the above information is that, with all the above considered, one would think OpenAI is struggling with the huge boom they are seeing, and I think in the new year we will see massive improvements to the latency, given the spike in new applications hitting the market and requiring quicker speeds. I have found that fine-tuning your own model off the GPT-3.5 model can help with response times, but we are still subject to the rapidly growing demand for this service. Also, I am doing a direct POST, bypassing intermediate servers (roughly as in the sketch below).
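
For anyone wondering what a “direct POST” looks like in practice: calling the chat completions endpoint with plain requests rather than going through your own backend or a wrapper library. A rough sketch, with placeholder model and prompt:

```python
import os
import time

import requests

# Direct POST to the API endpoint - no SDK, no intermediate server
url = "https://api.openai.com/v1/chat/completions"
headers = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "Content-Type": "application/json",
}
payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 50,
}

start = time.monotonic()
resp = requests.post(url, headers=headers, json=payload, timeout=60)
elapsed = time.monotonic() - start

resp.raise_for_status()
print(f"{elapsed:.2f}s -", resp.json()["choices"][0]["message"]["content"])
```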

Same situation for me: I’m developing a web app that doesn’t use more than 1000 tokens per API call, and I randomly get one of these 3 outcomes:

  1. instant response
  2. very long waiting times, 30-40 seconds
  3. sometimes the app hangs (I’m guessing there’s a network error, like I’m often seeing on ChatGPT lately, but I haven’t verified it yet)

With outcomes 1 and 2 being the most frequent and like @mattholdsme I have the feeling that there’s a correlation with the time of the day I’m calling the API (I only tested it from Italy so far).
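
For outcome 3 (the hangs), a client-side timeout plus retries at least turns an indefinite hang into a bounded failure you can handle. A minimal sketch, assuming the openai Python SDK v1; the timeout and retry values are just examples, not recommendations:

```python
from openai import OpenAI, APITimeoutError

# Give up on any single attempt after 30 s and let the SDK retry twice
client = OpenAI(timeout=30.0, max_retries=2)

try:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Draft a short personal value statement."}],
        max_tokens=700,
    )
    print(resp.choices[0].message.content)
except APITimeoutError:
    # The request hung past the deadline on every attempt; fail visibly instead of freezing the app
    print("Request timed out after retries.")
```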

I was having the same slowdown problem, which was only solved by adding $50 in credit. After that, the response time returned to normal.

1 Like

First test of the API vs chat.openai.com - this is pretty much not usable for anything due to the response times. My limits are 90,000 TPM / 3,500 RPM.

ex. 1 - prompt tokens: 425, completion tokens: 100, total: 525 - 8.9 s
ex. 2 - prompt tokens: 432, completion tokens: 100, total: 532 - 24.9 s

It needs to be in the < 1s range to be usable for my use case - so pretty much an instant response. Is this possible?

FWIW, here is my GPT-4 token generation graph for the last 20 days.

I am at Tier 4.

About 10-11 days ago it started slowing down.

2 Likes

As an aside, and I know this doesn’t apply to many people, but the 32k context model for GPT-4 doesn’t appear to be impacted. However, the standard deviation of the 32k model is so large that it’s hard to gauge any real trend without smoothing the data.

1 Like

I would imagine the OAI 32K model used for alpha testing is on its own node somewhere and is only affected by those with access, so small numbers relative to everything else.

Yeah, 32k has always been a bit erratic. Here is a plot going from 160 days ago to the present.

Something happened 125 days ago which improved the speed. Then there is some dip 75 days ago that has since been rectified.

No idea why the performance is so variable, but overall it’s faster than the 8k GPT-4 version.

You need 100 tokens of text returned to you in under a second?

The only way you’d be able to reliably do that is on your own A100/H100 cards and a tiny model.

What would an expected response time for 100 tokens be?

The 100 tokens would take about 5 seconds. See the rough 20 tokens per second in my graph above. But it could take as long as 20 seconds just due to model performance variations (5 tokens per second).

For an instant response going out, maybe see if streaming is a fit for you.
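
If streaming is new to you, here’s roughly what it looks like with the openai Python SDK v1: the first words arrive as soon as the model emits them, even if the full completion still takes several seconds (model and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Ask for a streamed response instead of waiting for the whole completion
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write one sentence about cats."}],
    stream=True,
)

# Print tokens to the user the moment each chunk arrives
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```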

We can measure 100 tokens. An average of 2.3 seconds is a good score.
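
For context on the tables below: “latency” is the time to the first streamed token, “total response” is the time to the last, and the rates are tokens divided by those times (tokens per second). A rough sketch of a harness that produces numbers like these - the function is my own illustration, not an official tool:

```python
import time

from openai import OpenAI

client = OpenAI()

def time_completion(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Measure time-to-first-token and total generation time for one call."""
    start = time.monotonic()
    first_token_at = None
    tokens = 0  # each streamed chunk is roughly one token

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        top_p=0.01,  # minimal sampler input, as in the trials below
        stream=True,
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            tokens += 1
            if first_token_at is None:
                first_token_at = time.monotonic()

    total = time.monotonic() - start
    latency = (first_token_at or time.monotonic()) - start
    return {
        "latency (s)": round(latency, 3),
        "total response (s)": round(total, 3),
        "total rate (tok/s)": round(tokens / total, 3),
        "stream rate (tok/s)": round((tokens - 1) / max(total - latency, 1e-9), 1),
        "response tokens": tokens,
    }

print(time_completion("gpt-3.5-turbo-0613", "Write about cats."))
```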

For 3 trials of gpt-3.5-turbo-0613

Stat                  Minimum   Maximum   Average
latency (s)           0.375     0.782     0.534
total response (s)    2.0236    2.482     2.226
total rate (tok/s)    40.29     49.417    45.198
stream rate (tok/s)   57.7      60.4      58.550
response tokens       100       100       100.000

For 3 trials of ft:gpt-3.5-turbo-0613:xxxx::yyyyy

Stat                  Minimum   Maximum   Average
latency (s)           0.376     0.416     0.396
total response (s)    1.531     1.667     1.597
total rate (tok/s)    59.988    65.317    62.680
stream rate (tok/s)   77.8      85.7      82.500
response tokens       100       100       100.000

You can see the fine-tuned model is fastest (due to concurrency or hardware), at 82 tokens per second as a capability, but that is only after the network delay and model context loading of 10 prompt tokens, at 0.4 seconds.

However, with the maximum wait found in a mere three trials of the normal model being 0.8 seconds just to receive the first token (at top_p = 0.01 for minimal sampler input), you don’t get many more tokens before 1.0 second has elapsed, even on a good day that doesn’t go into multiple seconds. Let’s get responses as fast as we can: 3 tokens in, 5 out:

For 50 trials of gpt-3.5-turbo-0613:

Stat                  Minimum   Maximum   Average
latency (s)           0.195     2.091     0.595
total response (s)    0.253     2.101     0.647
total rate (tok/s)    2.38      19.763    9.751
stream rate (tok/s)   21.4      666.7     141.630
response tokens       5         7         5.040

0.65 seconds to barely say “hi”. (The extreme “stream” rate of tokens 2-5 is likely due to a minuscule delay in opening the API network response to you while the model is already generating.)

5-token cat chat at temperature = 1.9

Cats, miracle piecesCats are fascinating animalsCats are mesmerizingCats,
those mysteriousCats, the curiousCats are itdeCats are captivating creaturesCats
are majestic animalsCats are incredibly fascinatingCats have the abilityCats
are fluffy andCats wander the worldCats have occupied aVAMarketing-winder’sCats
are perhaps theCats, beloved creaturesCats are charismatic andCats, though
muchCats are fascinating creaturesCats, known allCats, with theirCats, also
knownCats are fascinating creaturesCats are beloved creaturesCats, notorious
forCats: The UltimateCats, scientifically knownThere is something inherently
mysticalCats, also knownCats, one ofCats. MysteriousCats have drawn peopleCats:
Understanding OurOur beloved feline companionsCats, also knownCats are
amazing creaturesCats, wandering mystCats, often regardedCats are beloved
companionsCats are small mammalsCats, one ofCats are small,Cats have
attained aCats, longtime popularCats are fascinating creaturesCats, also
knownCats are enigmaticCats, domesticsCats are prevalent inCats, little
homem

:point_up_2: My numbers are for GPT-4. GPT-3.5-Turbo will be much faster.

Thanks for this, I’m building a very simple app that generates a personal value statement with gpt-3.5-turbo (prompt + reply don’t take more than 700 tokens), and I couldn’t figure out why everything worked smoothly with my API key, but with my client’s API key the script takes very long to execute (10-20 seconds) and often times out.

I’ll post the video I made as proof anyway, but I guess telling my client to buy at least $50 of credits should fix the problem.

EDIT
TLDR: raise your rate limits by buying credits

Thanks for the responses. In my use case - a user asks a question, the AI generates SQL from that question, which then queries the db - putting this behind a search bar is not feasible. Sounds like 10 s would be about right for this flow, which is not a reasonable user experience.