GPT-3.5 and GPT-4 API response time measurements - FYI

Hi all,

Since API slowness is a consistent issue, I made some experiments to test the response times of GPT-3.5 and GPT-4, comparing both OpenAI and Azure.

As a reminder, response time depends mostly on the number of output tokens generated by the model.

Here’s a summary of the results, in three numbers:

  • OpenAI gpt-3.5-turbo: 73ms per generated token
  • Azure gpt-3.5-turbo: 34ms per generated token
  • OpenAI gpt-4: 196ms per generated token

You can use these values to approximate the response time. E.g. for a request to Azure gpt-3.5-turbo with 600 output tokens, the latency will be roughly 34ms × 600 = 20,400ms ≈ 20.4 seconds.
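The arithmetic above can be wrapped into a tiny helper. A minimal sketch (the per-token rates are this thread’s measurements and will drift as the platforms improve; the dictionary keys are made up for illustration):

```python
# Per-token latency in ms, as measured in this thread (these numbers age fast).
MS_PER_TOKEN = {
    "openai/gpt-3.5-turbo": 73,
    "azure/gpt-3.5-turbo": 34,
    "openai/gpt-4": 196,
}

def estimate_latency_s(deployment: str, output_tokens: int) -> float:
    """Approximate response time as (ms per token) x (output tokens)."""
    return MS_PER_TOKEN[deployment] * output_tokens / 1000

print(estimate_latency_s("azure/gpt-3.5-turbo", 600))  # 20.4
```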

I’ll spare you the full details of the experiments; you can find them in my blog post about GPT response times.


I haven’t caught up on the LLM wars, but when we say gpt-3.5 Azure, is it the same model we are talking to, just hosted on Azure?

Correct. As far as I understand, the models should be identical. (I am not sure about the safety layer or other things on top, which may be different, though.)

Cool, not only is Azure faster, it is cheaper. Anyone seriously using GPT-3.5 should be using it.



Cheaper? How much cheaper? I thought they were the same price.


Did you test Azure GPT-4 by chance?

I don’t have access to Azure GPT-4 yet but have requested – when I do get access I’ll update the results!

AFAICT Azure and OpenAI GPTs indeed have the same price.

Is there a way to program the GPT-4 API to respond in a streaming manner, token by token, the way it is done on the ChatGPT platform?

Hi and welcome to the developer forum!

Yes there is: pass stream=True and the API returns the response chunk by chunk. Here is an example in Python:

import openai

def stream_openai_response(prompt):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0.5,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # yield the response token by token
    )
    for event in response:
        yield event

Obviously you then link this with the web-page side of things and have some JS code receiving the events.
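For completeness, here is a minimal sketch of the receiving side in Python (hypothetical helper name; it assumes the legacy streaming format, where each chunk carries choices[0].delta with an optional content piece):

```python
def collect_stream(events):
    """Accumulate a streamed ChatCompletion response, printing as it arrives."""
    parts = []
    for event in events:
        # The first chunk usually carries only the role, and the final chunk
        # is empty, so read `content` defensively.
        piece = event["choices"][0]["delta"].get("content", "")
        print(piece, end="", flush=True)
        parts.append(piece)
    return "".join(parts)
```

In a web app you would forward each piece over SSE or a WebSocket instead of printing it.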


Do you have the numbers for the correlation with input token length and total token length?

I suspect input has a small effect, at least with GPT-4, since I have observed it in my own usage.

I’m not sure what you are asking here. If the input tokens say “please say the word ‘hello’”, then the response back will likely be a single token that points to the word “hello”.

The output is not “linked” to the input just by the number of tokens used; it depends on the contents of those tokens.

From Azure docs:

Azure OpenAI requires registration and is currently only available to approved enterprise customers and partners.



I did run the experiments, and it is basically constant.



In another link from GPT response times that you shared, I saw “OpenAI GPT-4: 94ms per generated token”, while it is around 196ms in this post?

Also, have you ever tried to measure the response time of GPT-4 Turbo?

Thank you!

The numbers get out of date pretty quickly as everyone makes improvements to their platform.

The most recent I measured was this (from my blog post on Nov 7):

  • gpt-4-1106-preview (“gpt-4-turbo”) runs in 18ms/token
  • gpt-3.5-turbo-1106 (“the newest version of gpt-3.5”) runs in just 6.5ms/token
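For anyone who wants to reproduce such numbers, the measurement boils down to timing a token stream and dividing by the token count. A minimal sketch (my own hypothetical helper, not the author’s actual harness; it accepts any iterable of chunks, e.g. a streaming ChatCompletion response where each content chunk is roughly one token):

```python
import time

def measure_ms_per_token(stream, tokens_per_chunk=lambda chunk: 1):
    """Consume a stream of chunks and return the average ms per token."""
    start = time.perf_counter()
    n_tokens = 0
    for chunk in stream:
        n_tokens += tokens_per_chunk(chunk)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return elapsed_ms / n_tokens
```

As the thread notes, a long output (1,000+ tokens) makes the per-token figure robust even from a single run.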


Thank you for your reply!

Can you please explain more details about your testing scenarios?

  • Is the response time you measured based on many runs, with the average value computed?

  • How many input tokens did you feed to GPT-4 Turbo? How does the number of input tokens affect the response time?

  • I ran it just once; the per-token number is very robust because the output contains 1,000+ tokens. I’ve run many repetitions as well, but the results don’t change.

  • IIRC about 10 input tokens, but latency does not depend on input token count - see the last paragraph in this blog post.

  1. I agree with your timing computation.

  2. In my application, for each query I need to feed around 3,000-5,000 tokens to GPT-4 Turbo, and getting the result takes around 9.1s for around 100 output tokens.

Actually, I read your blog post. So I think my latency mostly comes from the “const.” component in your formula, because the number of input tokens is large in my case.
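That reading can be checked with quick arithmetic: subtract the generation part (using the ~18ms/token gpt-4-turbo rate quoted earlier in this thread, so this is only a rough estimate) from the total, and what remains is the constant overhead:

```python
def constant_overhead_s(total_s: float, output_tokens: int, ms_per_token: float) -> float:
    """Split total latency into generation time and a constant remainder."""
    return total_s - output_tokens * ms_per_token / 1000

# ~9.1s total for ~100 output tokens at ~18ms/token:
print(round(constant_overhead_s(9.1, 100, 18), 1))  # 7.3
```

So only about 1.8s of the 9.1s would be generation; the rest sits in the constant component, consistent with the large-input explanation above.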