OpenAI GPT-4 API is absurdly slow

Dear OpenAI:

We were extremely excited to have access to the GPT-4 API. We have been waiting with marked anticipation to use it to help build our platform. However, within minutes of using the API it became readily apparent that it is egregiously and prohibitively slow. It is literally taking anywhere from 5 to 10 minutes for about 10,000 to 20,000 tokens. In its present state the GPT-4 API is basically unusable for us.

We used GPT-3.5-turbo but the fidelity of the answers provided was drastically worse than GPT-4, thereby precluding us from using GPT-3.5-turbo.

We are paid users of the API and have no issues spending money on it, but it will literally take us over a year to do what we intend with the API at its current speed. In contrast, the same work would only take days using GPT-3.5-turbo at its current speed.

Interestingly, the GPT-4 API queries are taking far longer than those on the ChatGPT site. We would greatly appreciate it if the GPT-4 API were much faster.

Thank you for your kind consideration…

Basem Goueli MD, PhD, MBA
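
For scale, the figures quoted above can be turned into a rough throughput range. This is only back-of-the-envelope arithmetic on the numbers in the post; it assumes the token counts and durations pair up as worst case and best case, and the post does not say whether the tokens are prompt tokens, output tokens, or both.

```python
# Rough throughput implied by "5 to 10 minutes for 10,000 to 20,000 tokens".
# The pairing of extremes below is an assumption, not something the post states.

def tokens_per_second(tokens: int, minutes: float) -> float:
    """Convert a token count and a duration in minutes to tokens per second."""
    return tokens / (minutes * 60)

slowest = tokens_per_second(10_000, 10)  # fewest tokens over the longest time
fastest = tokens_per_second(20_000, 5)   # most tokens over the shortest time

print(f"{slowest:.1f} to {fastest:.1f} tokens/s")  # prints "16.7 to 66.7 tokens/s"
```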


Yes, in the same boat here. Via the web client, it’s not super-fast, but my API calls are not passing a lot of context and are still extremely slow. I’m working on a real-time experience for users, so it’s not possible for me at the current speeds.

I wonder if it’s a hardware limitation that is keeping the generations so slow in the API. I can understand why it’s faster in ChatGPT, because they are still limiting you there. Or perhaps it’s to prevent an onslaught of users hoping to train local models with it? Either way, I’m sure they will eventually open the floodgates…

I am in the same situation. Every call I make to the GPT-4 API takes more than 1 minute, even with short responses.
Is there any restriction by type of user? In my case, access to the GPT-4 API belongs to my personal account; I am not associated with a business account.
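
When comparing numbers like "more than 1 minute per call" across accounts, it helps to time the calls the same way. Below is a minimal sketch of a wall-clock timing helper. `fake_call` is a hypothetical stand-in that just sleeps; swap in whatever function actually sends your request. Nothing here is specific to any particular client library.

```python
import statistics
import time
from typing import Callable, List

def measure_latency(call: Callable[[], object], runs: int = 5) -> List[float]:
    """Return the wall-clock duration in seconds of each invocation of `call`."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        timings.append(time.perf_counter() - start)
    return timings

def fake_call() -> str:
    # Hypothetical placeholder: replace the body with your real API request.
    time.sleep(0.01)
    return "ok"

timings = measure_latency(fake_call, runs=3)
print(f"median {statistics.median(timings):.3f}s, max {max(timings):.3f}s")
```

Reporting the median and the maximum over several runs, along with the prompt and response sizes, makes it easier to tell a consistently slow endpoint from occasional outliers.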

Same situation here! Extremely slow. It was better when I got access a few months ago. I would even be happy with the same speed as the Playground, but the API response is really slow.

I switched to PaLM for certain R&D projects.

Welcome to SOTA (state of the art). Try multiplying X billion numbers over and over for each output token. :upside_down_face:

I’m not trying to excuse OpenAI for the model latency, just trying to set expectations.

Good news though, algorithms improve, speeds go up, and BOOM, things improve.

Just curious, but are we talking lower latency as the driver here?

If so, how much faster is it? 2x? 1.2x? What about quality?

There are many parts to this question, but performance is one part. I think it’s widely agreed that PaLM is not as intellectually capable as other models. But it is much faster (1.5 to 3.5 times quicker) and we don’t see timeouts at all.

In certain use cases, especially those based largely on embeddings, there is no perceptible difference in inference quality. Vectors, however, come back in 400 milliseconds instead of 1.5 seconds. These are experiences from my anecdotal tests.

I think Google will ultimately pull a few rabbits out of the hat, so this AGI battle seems beneficial for us all.


I can totally see this. Really, all we need is the LLM to “speak English” and be able to repackage data from the prompt that the embeddings provided.

So, essentially, “lower tier” LLMs + embeddings are all that is required to make the magic happen (if you’ve got embeddings lying around).
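
A minimal sketch of that pattern, assuming you already have embeddings for your documents and for the query: rank passages by cosine similarity, then hand the best ones to whichever LLM you use as context to repackage. The three-dimensional vectors and the sample passages below are made up for illustration; real embeddings come from an embedding model and have far more dimensions. `top_k` and `build_prompt` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: Sequence[float],
          docs: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the texts of the k documents most similar to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question: str, passages: List[str]) -> str:
    """Assemble a prompt that asks the model to answer from the passages only."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Toy data: invented passages with invented 3-dimensional "embeddings".
docs = [
    ("Refunds are processed within 5 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on Sundays.", [0.0, 0.2, 0.9]),
    ("Refund requests need an order number.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of the question

passages = top_k(query_vec, docs, k=2)
prompt = build_prompt("How do refunds work?", passages)
print(prompt)
```

The resulting `prompt` string is what you would send to the model. Since the facts arrive in the context, the model mostly has to rephrase them, which is why a faster, less capable model can be good enough here.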

These are significant too. My only concern is that, at least in my case, my largest requirement is quality, not latency. So if quality starts to suffer due to trying to improve latency, I wouldn’t want that. This is why I rarely use GPT-3.5-Turbo, because I need quality over latency.

Here my use case is not using embeddings, but influencing the model to respond back to one thing with different personalities and perspectives. This is one thing I don’t think PaLM is good at, but I have an open mind here, and probably just need to experiment with PaLM directly to find out for myself.

Yep - I have a baby-agent project that is also not using embeddings, and PaLM struggles compared to GPT-4. However, I have noticed PaLM seems to be a little better at certain math computations, and this article seems to indicate this may be the case.

But who knows, really? We’re all just one crappy prompt away from disaster. :wink:


Same here, but strangely it works fine in the Playground…