OpenAI Why Are The API Calls So Slow? When will it be fixed?

New rate limit documentation may indirectly and evasively describe what is going on with so many accounts.

Answer if your gpt-3.5-turbo is slow:

  • are you currently in a prepay plan?
  • have you paid OpenAI over $50 in prepaid credits, over a week ago?

Those seem to be the criteria for getting into the new “tiers” for prepay credit users, and it seems that quality of service comes along with giving OpenAI non-refundable money. Or rather, they “may” move you to lower latency.

Then answer:

  • Go to your account rate limits page,
  • See if you have been assigned one of the new unique and distinct rate limits of 20000 or 40000 tokens per minute, meaning you are in a lower “tier” (you can also check this from code, as in the sketch below).
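
If you’d rather check from code than from the dashboard, the API also echoes your current limits back in response headers. A minimal sketch, assuming the openai Python SDK v1 and its raw-response helper (the model and prompt here are just placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask for the raw HTTP response so the rate-limit headers are visible
raw = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=1,
)

# These x-ratelimit-* headers report the RPM/TPM currently assigned to your key
for name in (
    "x-ratelimit-limit-requests",
    "x-ratelimit-limit-tokens",
    "x-ratelimit-remaining-requests",
    "x-ratelimit-remaining-tokens",
):
    print(name, "=", raw.headers.get(name))

completion = raw.parse()  # the usual ChatCompletion object, if you want the text too
```

If the tokens-per-minute limit comes back as 20000 or 40000, that matches the lower “tiers” described above.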

Also, remember how OpenAI said they wouldn’t consider any rate-increase request form for the GPT-4 TPM limit of 10000 recently given to new users? Well, there’s your answer on that page: PAY UP.

7 Likes

Interesting. But that’s still about hard token limits. Clearly, some users/organizations are being throttled well within all those limits, and that’s not mentioned anywhere. The only mention of throttling in the docs is about throttling it yourself to stay within the provided limits.

I have 2 organizations; one of them is experiencing the slowness, the other is as fast as we are used to with GPT-3.5 Turbo. However, the organization that did not have any traffic in the last couple of months is faster than the one I’m using in a production setting with not a lot of users, so well within all the limits and quotas.

Demo of the issue:

Another topic about the slowness, which mentions it is only happening for certain organizations: GPT-3.5 API is 30x slower than ChatGPT equivalent prompt

The slowness just doesn’t make any sense. It’s not documented anywhere. There’s no status update. And any support request to OpenAI is closed with the suggestion that it is probably your own fault, EVEN when I send them the demo video you see above.

For now I have fixed it by using a different organization and by using Azure’s OpenAI APIs. But that’s not really a fix. Thinking about migrating every service I have to Azure OpenAI.
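
In case it helps anyone doing the same comparison: the organization a request runs under can be set per client, so it’s easy to A/B the slow and fast orgs from one script. A minimal sketch, assuming the openai Python SDK v1 (the org IDs are placeholders for your own):

```python
import time

from openai import OpenAI

# Placeholder org IDs - substitute the ones from your account page
org_a = OpenAI(organization="org-AAAAAAAA")  # the "slow" production org
org_b = OpenAI(organization="org-BBBBBBBB")  # the idle org

def timed_ask(client: OpenAI, text: str) -> float:
    """Return the wall-clock seconds for one small completion."""
    start = time.monotonic()
    client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": text}],
        max_tokens=100,
    )
    return time.monotonic() - start

prompt = "Write two sentences about anything."
print("org A:", round(timed_ask(org_a, prompt), 2), "s")
print("org B:", round(timed_ask(org_b, prompt), 2), "s")
```

Same prompt, same model, two organizations; only the organization the request is attributed to differs.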

6 Likes

I’ll make it more explicit:

As your usage tier increases, we may also move your account onto lower latency models behind the scenes.

Translation: You’ve been moved to higher latency models behind the scenes.

(The correct term is not “latency” as in ping time; what they surely mean is low token output rates.)

3 Likes

Still doesn’t make sense, because the organization that has the highest (paid) usage was seeing decreased performance, and still had decreased performance even when sitting idle now for a few days. The other organization that was sitting idle for months (still) has the fast turbo performance.

edit: just confirmed both organizations have the same rate limits

1 Like

This is a perfect explanation of what has been happening lately, where some accounts are generating content faster than others. Thank you for this insight!

3 Likes

I have noticed outages across an enterprise environment and an individual environment. I don’t think the problem is what everyone is currently thinking. There are certain parts of the day when it is better and others when it’s not. I’m in Australia, so it works well during the day, but at night when the US market comes online, I see times of up to 4-5 minutes.

In the enterprise environment, the regular ChatGPT model has been playing up without any incident being reported on the status page, but with a test pool of 5000+ users it’s easily identified. The errors we see are something along the lines of “currently unavailable, please try again in a couple of minutes”. The issue is intermittent and does not hit all users at once.

With the OpenAI models: yes, if you sacrifice quality rather than going for the newer tech and better responses, then you will see very short response times. The higher models will, as previously stated, take 4-5 minutes. It’s gotten gradually worse over the last couple of months with both ChatGPT and API responses. I received emails this last week from Google Play Console and Apple, and both state they have seen a huge spike in API calls and applications using AI. They are also advising of stricter requirements and a 30-day response window in order to comply.

The reason I have given the above information is that, with all the above considered, one would think OpenAI is struggling with the huge boom they are seeing, and I think in the new year we will see massive improvements to the latency, given the spike in new applications hitting the market and requiring quicker speeds. I have found that fine-tuning your own model off the GPT-3.5 model can help with response times, but we are still subject to the rapidly growing demand for this service. Also, I am doing a direct POST, bypassing intermediate servers (roughly as in the sketch below).
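
For anyone wondering what a “direct POST” looks like in practice: calling the chat completions endpoint with plain requests rather than going through your own backend or a wrapper library. A rough sketch, with placeholder model and prompt:

```python
import os
import time

import requests

# Direct POST to the API endpoint - no SDK, no intermediate server
url = "https://api.openai.com/v1/chat/completions"
headers = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "Content-Type": "application/json",
}
payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 50,
}

start = time.monotonic()
resp = requests.post(url, headers=headers, json=payload, timeout=60)
elapsed = time.monotonic() - start

resp.raise_for_status()
print(f"{elapsed:.2f}s -", resp.json()["choices"][0]["message"]["content"])
```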

Same situation for me: I’m developing a web app that doesn’t use more than 1000 tokens per API call, and I randomly get one of these 3 outcomes:

  1. instant response
  2. very long waiting times, 30-40 seconds
  3. sometimes the app hangs (I’m guessing there’s a network error, like I’m often seeing on ChatGPT lately, but I haven’t verified it yet)

With outcomes 1 and 2 being the most frequent and like @mattholdsme I have the feeling that there’s a correlation with the time of the day I’m calling the API (I only tested it from Italy so far).
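
For outcome 3 (the hangs), a client-side timeout plus retries at least turns an indefinite hang into a bounded failure you can handle. A minimal sketch, assuming the openai Python SDK v1; the timeout and retry values are just examples, not recommendations:

```python
from openai import OpenAI, APITimeoutError

# Give up on any single attempt after 30 s and let the SDK retry twice
client = OpenAI(timeout=30.0, max_retries=2)

try:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Draft a short personal value statement."}],
        max_tokens=700,
    )
    print(resp.choices[0].message.content)
except APITimeoutError:
    # The request hung past the deadline on every attempt; fail visibly instead of freezing the app
    print("Request timed out after retries.")
```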

I was having the same slowdown problem, which was only solved by adding $50 in credit. After that, the response time returned to normal.

1 Like

First test of the API vs chat.openai.com - this is pretty much not usable for anything due to the response times. My limits are 90,000 TPM / 3,500 RPM.

ex. 1 - prompt tokens: 425, completion tokens: 100, total: 525 - 8.9 s
ex. 2 - prompt tokens: 432, completion tokens: 100, total: 532 - 24.9 s

It needs to be in the < 1s range to be usable for my use case - so pretty much an instant response. Is this possible?

FWIW, here is my GPT-4 token generation graph for the last 20 days.

I am at Tier 4.

About 10-11 days ago it started slowing down.

2 Likes

As an aside, and I know this doesn’t apply to many people, but the 32k context model for GPT-4 doesn’t appear to be impacted. However, the standard deviation of the 32k model is so large that it’s hard to gauge any real trend without smoothing the data.

1 Like

I would imagine the OAI 32K model used for alpha testing is on its own node somewhere and is only affected by those with access, so small numbers relative to everything else.

Yeah, 32k has always been a bit erratic. Here is a plot going from 160 days ago to the present.

Something happened 125 days ago which improved the speed. Then there is some dip 75 days ago that has since been rectified.

No idea why the performance is so variable, but overall it’s faster than the 8k GPT-4 version.

You need 100 tokens of text returned to you in under a second?

The only way you’d be able to reliably do that is on your own A100/H100 cards and a tiny model.

What would an expected response time for 100 tokens be?

The 100 tokens would take about 5 seconds. See the rough 20 tokens per second in my graph above. But it could take as long as 20 seconds just due to model performance variations (5 tokens per second).

For an instant response going out, maybe see if streaming is a fit for you.
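
If streaming is new to you, here’s roughly what it looks like with the openai Python SDK v1: the first words arrive as soon as the model emits them, even if the full completion still takes several seconds (model and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Ask for a streamed response instead of waiting for the whole completion
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write one sentence about cats."}],
    stream=True,
)

# Print tokens to the user the moment each chunk arrives
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```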

We can measure 100 tokens. An average of 2.3 seconds is a good score.
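
For context on the tables below: “latency” is the time to the first streamed token, “total response” is the time to the last, and the rates are tokens divided by those times (tokens per second). A rough sketch of a harness that produces numbers like these - the function is my own illustration, not an official tool:

```python
import time

from openai import OpenAI

client = OpenAI()

def time_completion(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Measure time-to-first-token and total generation time for one call."""
    start = time.monotonic()
    first_token_at = None
    tokens = 0  # each streamed chunk is roughly one token

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        top_p=0.01,  # minimal sampler input, as in the trials below
        stream=True,
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            tokens += 1
            if first_token_at is None:
                first_token_at = time.monotonic()

    total = time.monotonic() - start
    latency = (first_token_at or time.monotonic()) - start
    return {
        "latency (s)": round(latency, 3),
        "total response (s)": round(total, 3),
        "total rate (tok/s)": round(tokens / total, 3),
        "stream rate (tok/s)": round((tokens - 1) / max(total - latency, 1e-9), 1),
        "response tokens": tokens,
    }

print(time_completion("gpt-3.5-turbo-0613", "Write about cats."))
```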

For 3 trials of gpt-3.5-turbo-0613

Stat                  Minimum   Maximum   Average
latency (s)           0.375     0.782     0.534
total response (s)    2.0236    2.482     2.226
total rate (tok/s)    40.29     49.417    45.198
stream rate (tok/s)   57.7      60.4      58.550
response tokens       100       100       100.000

For 3 trials of ft:gpt-3.5-turbo-0613:xxxx::yyyyy

Stat                  Minimum   Maximum   Average
latency (s)           0.376     0.416     0.396
total response (s)    1.531     1.667     1.597
total rate (tok/s)    59.988    65.317    62.680
stream rate (tok/s)   77.8      85.7      82.500
response tokens       100       100       100.000

You can see the fine-tuned model is fastest (due to concurrency or hardware), at 82 tokens per second as a capability, but that is only after the network delay and model context loading of 10 prompt tokens, at 0.4 seconds.

However, with the maximum wait found in a mere three trials of the normal model being 0.8 seconds just to receive the first token (at top_p = 0.01 for minimal sampler input), you don’t get many more tokens before 1.0 second has elapsed, even on a good day that doesn’t go into multiple seconds. Let’s get responses as fast as we can: 3 tokens in, 5 out:

For 50 trials of gpt-3.5-turbo-0613:

Stat                  Minimum   Maximum   Average
latency (s)           0.195     2.091     0.595
total response (s)    0.253     2.101     0.647
total rate (tok/s)    2.38      19.763    9.751
stream rate (tok/s)   21.4      666.7     141.630
response tokens       5         7         5.040

0.65 seconds to barely say “hi”. (The extreme “stream” rate of tokens 2-5 is likely due to a minuscule delay in opening the API network response to you while the model is already generating.)

5-token cat chat at temperature = 1.9

Cats, miracle piecesCats are fascinating animalsCats are mesmerizingCats,
those mysteriousCats, the curiousCats are itdeCats are captivating creaturesCats
are majestic animalsCats are incredibly fascinatingCats have the abilityCats
are fluffy andCats wander the worldCats have occupied aVAMarketing-winder’sCats
are perhaps theCats, beloved creaturesCats are charismatic andCats, though
muchCats are fascinating creaturesCats, known allCats, with theirCats, also
knownCats are fascinating creaturesCats are beloved creaturesCats, notorious
forCats: The UltimateCats, scientifically knownThere is something inherently
mysticalCats, also knownCats, one ofCats. MysteriousCats have drawn peopleCats:
Understanding OurOur beloved feline companionsCats, also knownCats are
amazing creaturesCats, wandering mystCats, often regardedCats are beloved
companionsCats are small mammalsCats, one ofCats are small,Cats have
attained aCats, longtime popularCats are fascinating creaturesCats, also
knownCats are enigmaticCats, domesticsCats are prevalent inCats, little
homem

:point_up_2: My numbers are for GPT-4. GPT-3.5-Turbo will be much faster.

Thanks for this, I’m building a very simple app that generates a personal value statement with gpt-3.5-turbo (prompt + reply don’t take more than 700 tokens), and I couldn’t figure out why everything worked smoothly with my API key, but with my client’s API key the script takes very long to execute (10-20 seconds) and often times out.

I’ll post the video I made as proof anyway, but I guess telling my client to buy at least $50 of credits should fix the problem.

EDIT
TLDR: raise your rate limits by buying credits

Thanks for the responses. In my use case - a user asks a question, the AI generates SQL from that question, which then queries the db - putting this behind a search bar is not feasible. Sounds like 10 s would be about right for this flow, which is not a reasonable user experience.