OpenAI Why Are The API Calls So Slow? When will it be fixed?

Have you tried streaming instead?


You need to decide if your priorities lie with retaining your current server provider or making use of AI technology.

I wish there were a better way to do this, but until clients put pressure on server hosts to cater for AI requirements, they will continue with practices like short timeouts and not supporting SSE.


Yep, this is our end goal as it will fix the issue with timeouts.

Regardless of the timeout issue, the noticeable degradation in response times is still happening, and it only started in the past couple of days. We really hope this is a short-term issue that will be fixed soon.


Yes, a 30-second timeout for a server that makes use of AI technologies is not good. However, up until now, 99% of requests for my client app were succeeding so it has been good enough. Moving my server over to a provider with longer timeouts has been on my to-do list for a good while. It has just moved to the top of the list after this issue with increased response times :smile:


I have also experienced the same issue lately. I recommend switching to the cloud; e.g., on AWS Lambda it’s super easy to adjust the timeout. You can then relay the results to your own server or somewhere else.
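As a sketch of that adjustment (the function name is a placeholder, and this assumes AWS credentials are already configured; Lambda caps timeouts at 900 seconds):

```python
def clamp_lambda_timeout(seconds: int) -> int:
    """Clamp a requested timeout to Lambda's allowed 1-900 second range."""
    return max(1, min(900, seconds))

def raise_timeout(function_name: str, seconds: int) -> None:
    """Raise a Lambda function's timeout via the AWS SDK (boto3)."""
    import boto3  # third-party AWS SDK; imported lazily so the helper above runs anywhere
    boto3.client("lambda").update_function_configuration(
        FunctionName=function_name,
        Timeout=clamp_lambda_timeout(seconds),
    )

# raise_timeout("my-openai-relay", 300)  # e.g. allow 5 minutes
```

The relay function then makes the OpenAI call within that generous window and forwards the result wherever you need it.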


I implore you to reconsider.

Having a high timeout threshold can lead to a lot of unintended, negative side effects. There is a massive difference between an unresponsive server, and a server that is currently processing your request.

This is a band-aid solution. Longer timeouts have nothing to do with AI solutions.

I assisted a forum member here (who was unexpectedly perma-banned for advertising their service… which was a free Spanish website to help teachers build curriculums for students… they were asking for help) who ran into the same issue, and whose incompetent dev said, “Durr, we can’t do streaming, you need a better provider.”

Lo and behold. They added streaming and everything worked great. Their dev was still writing their websites using jQuery and Bootstrap (backend was PHP but that’s not bad).

This is an ideal option if you cannot use streaming.


Streaming is absolutely the best solution: short, controlled blocks. There are, however, still a large number of users sending complex prompts that will take more than 30 seconds to respond to… until streaming becomes the norm, I’m not sure what else you can do.


Agreed. If it is a must (and what a terrible must it is), then that’s just the way it is. It’s just that, in my experience, raising/removing safeguards should be considered the absolute LAST option.

I mean, who’s to say that even a 1-minute timeout is sufficient? And what happens when the server is actually down or unresponsive while the host is receiving numerous requests per second?

That is a brutal bottleneck / a disaster waiting to happen.


I have the same issue with my own chatbot, which I wrote in Python. It forced me to learn how to write better code, because at first the calls took so long that they froze my chatbot.

Adding the moderation API has made it even worse. It’s so slow now, even on GPT-3.5 Turbo, that I find it kind of useless.

Recently, OpenAI launched their new higher-priority, higher-context-limit, privacy-focused ChatGPT for corporations that have special deals with them.

They have finite resources; giving one group a better service means we get worse service. OpenAI has consistently been prioritizing their relationships with big corporations, and their original mission of leveling access to AI has gone down the toilet.

Unfortunately, those corporations have the big $$, so while I expect OpenAI to maintain access for us smaller customers, I expect it to be selectively provisioned based on how much money we are paying them. I don’t expect a universally equal experience, and I think the discrepancies will only increase over time.


With respect, that is pure speculation. Nobody here knows what the server infrastructure situation is like, and we have no idea if the corporate ChatGPT is even running on the same hardware. It is very unlikely to be the case, as then you would not have any separation; therefore it should have no impact on existing services.

New hardware is brought online every day, scaling datacentres for AI is an absolutely mammoth task and is happening at breakneck speed globally. There are also software refinements and optimisations being done all the time. There have been slowdowns before and there will be again, it is the nature of beta testing bleeding edge technology.


It absolutely is speculation; however, it is not random speculation made up out of nowhere. There’s some pretty common sense involved.

OpenAI would not need to rate limit the way they do if resources were simply not an issue. The limits were added due to the demand for GPT-4, and at first they only got tighter and tighter.

  1. There are no magical servers out there that appeared to offer a better version of the exact same service that is experiencing degradation. Just as they had to use rate limits to ensure they had resources to go around for users, they have to allocate resources to guarantee that users under certain circumstances get things like faster response times, a priority queue, higher token limits, etc.

  2. We don’t need to speculate about whether such a structure for prioritization exists; they are already implementing it and have been for a very long time. What do you think ChatGPT Plus is? We, who pay for access, receive higher priority on our accounts than free users do. Do you, in any capacity, actually believe that they will not enact the exact same prioritization for corporations who are paying a lot more money per user than our $20/month? I know companies who are paying over $1k/month on average per user. I have a friend who works for a big insurance company and was the one responsible for negotiating their AI integration and their agreement with OpenAI. I can tell you factually, I know what they are getting, and it is VASTLY different from what we get. It is, however, still powered by the same ChatGPT MANY of us have noticed degradation in.

  3. Anyone who was around before ChatGPT Plus noticed things like increased response times, longer waits, and even difficulty logging in, because of the way they managed the flow of usage. There aren’t enough spare resources in free usage to offer huge companies a product vastly superior to ChatGPT Plus, in an effort to get them to integrate and use ChatGPT, without taking those resources from the free models. Yes, they are growing and expanding, and they’ve also lost a lot of users, especially in ChatGPT Plus. That is not enough to compensate for companies that are going to heavily use ChatGPT in the ways described in their new corporate plan. As you’ve said, AI is extremely resource-heavy. So do you imagine that a company which already struggles with rate limits in this market can fulfill all of Microsoft’s ambitious expansion and integration of AI, AND offer a faster, higher-priority, more accurate, larger-context, privacy-focused 32k+ version of GPT-4 with training to companies who will massively integrate it into their businesses and put extremely heavy load on OpenAI’s servers (paying a premium for it), and that this will not impact us and our tiny $20-a-month model?

So yes, it is speculation, but with all due respect, I think expecting otherwise is idealistic, foolish, and absolutely ridiculous. We may have no choice but to speculate, because OpenAI has become horribly non-transparent as a company. If the answers were things we would want to hear, they would have no issue addressing them. Instead, they are deafeningly silent on many issues with their public models, something I assure you is not the case with their large corporate clients.

I prefer that the speculation I must make be based on obvious evidence combined with logic and reasoning. Sure, it’s not perfect, and sometimes it might be wrong, but idealistically advocating for a company that gives zero reason to do so is not how I form the speculations I must make when making business decisions. You can’t operate a business that way if you wish to stay in business long, nor can you run a business under the premise of “well, we don’t know…”. I leave such an approach to those who play political sportsball, representing officials instead of the officials having to represent them.

OpenAI is horribly tight-lipped and not remotely transparent to customers using their products at our level. That alone is enough for me to question their actions and to speculate without idealism. If people didn’t foolishly advocate on their behalf (with no knowledge to back doing so and no more information than those making the speculations), but instead devoted their time to questioning them, then the company would likely feel more pressure to address public concerns, and we wouldn’t need to speculate about things they should be transparent about.


Sorry, I’m responding on my phone and have bad coverage. I just wanted to point out one last issue with your reply.

This, as in the ChatGPT and API the people on this forum use, is NOT bleeding-edge technology.

The company, OpenAI, is in fact a bleeding-edge AI developer. However, as has largely been the case for a LONG time, OpenAI has a pretty consistent and obvious track record of limiting the “bleeding edge” parts to hand-selected companies and partners they work closely with, who do not communicate through forums such as this.

What we are using is, at best, the leftovers of the bleeding-edge technology they sell to much wealthier sponsors, and a public face used to bring awareness to the technology, generating the hype and momentum that fuel sales to clients who could actually pay enough money for them to profit off the technology.


I’ve noticed that API calls to gpt-3.5-turbo-16k can have very long response times, and that those response times are much longer when executed from within AWS Lambda functions (and no, not within a VPC) than when executed locally, outside of AWS. On average, I’m finding that API calls take 11% longer to respond when made from AWS (us-west-2) than locally. Response times are typically 30-40 seconds, though one reached over 12 minutes. I recognize that I use a large number of tokens, both in my calls and in the responses. I also recognize that I could turn on streaming so that I can see the response as it is being generated. However, on both fronts, my use case needs the full length of the tokens in and out, and needs the complete answer before I can use it (it is not a chat bot where a user can ‘feel’ like they’re getting a better experience by reading the output as it arrives).

I assume that OpenAI will eventually be adding capacity to alleviate these bottlenecks. In the meantime, however, I just wish I had more transparency about what causes the disparities in latency, and a targeted timeline to resolution. I’m curious whether OpenAI could publish their system-wide API response times, maybe split by model and with some tiering by token count. That way I could at least benchmark my code against the overall average and decide, for example, whether it’s worth my time looking into further debottlenecking, such as trying different AWS regions or other cloud providers. If I knew for sure that my wait times were no different than the average, then I’d know whether this additional planned effort was likely to pay off. I’ve already done the obvious, like async calls, where applicable.
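In the absence of published numbers, one way to gather your own is a small timing harness you can run both locally and from Lambda, then compare. A minimal sketch (`benchmark` is an illustrative helper, not an OpenAI API; the `call` argument stands in for whichever API request you are measuring):

```python
import time
import statistics
from typing import Callable, List

def benchmark(call: Callable[[], object], runs: int) -> dict:
    """Time `call` repeatedly and summarize its latency in seconds."""
    latencies: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {
        "mean": statistics.mean(latencies),
        "p50": latencies[len(latencies) // 2],  # median-ish: middle of sorted list
        "max": latencies[-1],
    }

# Run the same prompt through this in each environment (local, us-west-2, etc.)
# and diff the summaries to see where the extra latency lives.
```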

I’m on a paid API service, by the way, so I don’t expect that this is an issue of being pushed to a lower class tier.


You can (and should) still stream responses regardless of use case; keeping sockets open for extended periods of time is always undesirable.

For every hardware and software speedup that gets introduced, there will be a corresponding demand for more complex and longer replies, so it’s worth taking a leaf from modern APIs and apps that deal with data sources that are either slow to respond or large. You will find almost all of those package their data in a stream format to allow safe and efficient port and packet handling.
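For what it’s worth, with the 2023-era Python library streaming is a small change: pass `stream=True` and accumulate the partial deltas as they arrive. A minimal sketch, where `stream_reply` and `join_chat_chunks` are illustrative helpers and the model name is just an example:

```python
def join_chat_chunks(chunks) -> str:
    """Assemble streamed chat-completion chunks into the full reply text."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))  # role-only/empty deltas contribute nothing
    return "".join(parts)

def stream_reply(prompt: str) -> str:
    import openai  # imported lazily so the pure helper above runs anywhere
    # stream=True returns an iterator of partial chunks instead of one
    # blocking response, so the socket is never idle for 30+ seconds.
    chunks = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return join_chat_chunks(chunks)
```

Even if you need the complete answer before acting on it, accumulating a stream like this keeps traffic flowing through short-timeout hosts.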


Streaming is understandable for long prompts/responses, but a mere 11-token prompt took 1.5 minutes!!
It has become miserably slow over the last few days.


Yes, same here. I have been facing this same issue. Can someone suggest another method to overcome this problem?


Same problem here. Nowadays it’s taking longer than your times: more than 12 seconds on a Firebase function, and for larger content it takes more than 40 seconds, which is really bad.


So it seems that the way to go nowadays is through the pure completions endpoint (no chat), via the instruct model… which in my case is precisely what I need. But then I see no way for the legacy completions endpoint to be deprecated… I hope you don’t!


The language about the completions endpoint being turned off outright has been changed in the blog announcement that announced the deprecation of older models and their January 4 shut-off:

Deprecation of older models in the Completions API

As part of our increased investment in the Chat Completions API and our efforts to optimize our compute capacity, in 6 months we will be retiring some of our older models using the Completions API. While this API will remain accessible, we will label it as “legacy” in our developer documentation starting today. We plan for future model and product improvements to focus on the Chat Completions API, and do not have plans to publicly release new models using the Completions API.

Starting January 4, 2024, older completion models will no longer be available, and will be replaced with the following models:

There are three replacement completion models, now functioning since the announcement was made: the single gpt-3.5-turbo-instruct, which can understand your tasks, and two fine-tunable completion engines that continue writing where you left off.

However, the instruct model doesn’t “chat” with users in the same manner, even when using a similar container format to send messages. For many, that’s a good thing: you can have the AI process texts or generate batch output without it wanting to say “Sure, I can do that”.

As the replacement completion models continue using the same endpoint, there is no reason for them to go away until it no longer makes economic sense. Completions (and the quality of GPT-3) are vital for AI/ML research. The replacements have been adequately neutered so you don’t get the unfettered abilities of the GPT-3 models.

You should not have to use a different model because of slowdowns. OpenAI should be forthcoming and make a statement why a subset of accounts have been directed to slower output gpt-3.5-turbo models.

“We see your users are just chatting; you don’t need words faster than you can read them, so we moved you to our expanded, slower datacenter instances so we can use the fast inference machines for our own main product, ChatGPT Plus.”


I am facing the same as rob described :point_down:t2:

:information_source: A little background

I implemented an app that produces a paragraph of about ~500 tokens; I developed it about a month ago. It was working fine at that time, taking around 4-5 seconds.

But now, the same prompt, the same configuration, and the same model are taking around 40 seconds.

:man_technologist:t2: The model and the code

The API call is something like:

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=paragraph_prompts, # it is a list of 3 prompts
)

Yeah, it is a list of 3 prompts, so basically I am generating 3 paragraphs in parallel, using the batching method.

This takes three prompts and gives all the results at once instead of making 3 sequential API calls. It saves time.
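As a sketch of how such a batched call maps back to its prompts: with a list prompt, the response carries one choice per prompt, and each choice’s `index` says which prompt it answers, so it is safest to reorder by index rather than trust the list order. The response below is a plain dict standing in for the API’s return value, and `texts_by_prompt` is an illustrative helper:

```python
def texts_by_prompt(response: dict, n_prompts: int) -> list:
    """Return completion texts ordered to match the original prompt list."""
    texts = [None] * n_prompts
    for choice in response["choices"]:
        # choice["index"] identifies which prompt in the batch this answers
        texts[choice["index"]] = choice["text"]
    return texts
```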

As said, the same call was taking about 4-5 seconds before, but now it is over 40 seconds.

:thinking: Should I…

  1. Change the model from text-davinci-003 to gpt-3.5-turbo?
  2. Or is the blame on OpenAI?

I am not getting any rate-limit errors, because otherwise I couldn’t even generate the paragraphs in the first place, right?

Please guide,
Thanks :pray:t2: