Gpt-4-0125-preview INCREDIBLY slower than 3.5 turbo

Using the new model, I'm finding my response times have made my application impossible to use. My average generation is around 4,000 tokens. It wasn't the fastest before either (between 1 and 1.5 minute response times), but now it's taking almost 5-6 minutes, even if I go back down to gpt-4-1106-preview.

Downgrading to 3.5 turbo is significantly faster, but the response quality is way worse.

Is anyone else experiencing this issue? Any ideas on speeding up the responses?

Just look at these! The first one is gpt-4-0125, the third is gpt-4-1106, and the other two are 3.5 turbo.

[Screenshot 2024-02-19 at 8.15.48 AM]

Hey, this is somewhat to be expected; the GPT-4 series models will always be slower than the 3.5T series models. Which model were you using before? If I'm understanding you right, you're saying the token generation time went from 1 min to 5 min?


Yup, my average generation time on gpt-4-1106 before was 50 seconds to 1.5 minutes. I understand 4 might be slower, but the difference before was much closer. Now it's unbearable. The 4.7 mins in the screenshot is the fastest I've seen all day.

How many tokens did you use as input and/or what’s your area of application? I’ve been using GPT-4 models a few times today and completion time was normal.

{ prompt_tokens: 2635, completion_tokens: 614, total_tokens: 3249 } (this one took 4.9 minutes on gpt-4-1106-preview)

Usually somewhere between this and 4k total; the completion tokens are often a bit higher. The app takes in some documents and rewrites certain aspects of them.
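For reference, the effective throughput implied by those numbers can be computed directly. This is a rough sketch; it divides completion tokens by total wall-clock time and ignores time-to-first-token:

```python
# Effective completion throughput from the usage stats quoted above.
completion_tokens = 614
elapsed_seconds = 4.9 * 60  # the reported 4.9-minute call

tokens_per_second = completion_tokens / elapsed_seconds
print(f"{tokens_per_second:.1f} tokens/s")  # roughly 2.1 tokens/s
```

That is an order of magnitude below the 13-50 tokens/s figures reported later in this thread, which is consistent with something being off for this account or region.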

This seems odd. For similar token levels, my completion times are usually within 30-60 seconds per API call for GPT-4 turbo models. Just to rule this out, are there any steps prior to ingestion by the model that could be causing this?

Nope, the front end hits the route directly. It's just one prompt on that route; nothing else is happening besides the OpenAI API call. And internet isn't an issue either: I'm getting about 700 Mbps down and 100 Mbps up.
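One cheap way to confirm the time really is spent inside the API call, rather than elsewhere in the route, is to wrap just that call with a timer. A minimal sketch (the `client.chat.completions.create` usage in the comment assumes the official OpenAI Python SDK; the `timed` helper itself is hypothetical):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds), so the API call
    can be measured in isolation from the rest of the route handler."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical usage inside the route handler:
# response, elapsed = timed(
#     client.chat.completions.create, model="gpt-4-0125-preview", messages=messages
# )
# print(f"OpenAI call took {elapsed:.1f}s")
```

If the elapsed time of the call alone matches the 5-6 minute totals, the latency is on the API side rather than in the application.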

I can confirm slowness. (ChatGPT's vision model, which is based on 1106, was also very slow at producing output when tested earlier.)

The trend shows a consistently low token production rate, although the 1106 model mentioned above isn't evaluated here:

My own single requests:
[32 tokens in 3.3s. 9.6 tps]
[600 tokens in 32.0s. 18.8 tps]
[32 tokens in 3.0s. 10.6 tps]
[600 tokens in 45.2s. 13.3 tps]
[32 tokens in 0.9s. 33.7 tps]
[600 tokens in 11.7s. 51.4 tps]

This is with a small input context with a writing request.

Also note that, in the past, the token production rate has been slower for accounts at tier 1 of API payment history (those affected noticed a distinct switch on their account when the slowdown was implemented).

That might be it; I'm on tier 1. But I still don't understand the sudden decrease in speed at my current tier, which is putting a big wrench in my application.

By geography, it is possible that you may be routed to different datacenters, considering those on Azure with commercial Microsoft OpenAI services can pick from many deployment locations. You thus could get different performance than others. There’s no clarity about where OpenAI API requests are serviced from.

It would be nice to think that one is no longer discriminated against because of how much they prepaid.

The rate limits and tier documentation has had this prior text eradicated: "As your usage tier increases, we may also move your account onto lower latency models behind the scenes."

The alternative is that service availability continues to be prioritized directly by payment trust tier. OpenAI may not be willing to go on the record about their service-management policies.


Update: it might just have been the timing of my API usage (which still worries me; what if it spikes again when I go to production?), but now my response times are ~2 mins. Still not the best, but better. I'm hoping to jump up in tiers and get this down even more.

