GPT-3.5 and GPT-4 API response time measurements - FYI

Hi all,

Since API slowness is a consistent issue, I made some experiments to test the response times of GPT-3.5 and GPT-4, comparing both OpenAI and Azure.

As a reminder, response time depends mostly on the number of output tokens generated by the model.

Here’s a summary of the results, in three numbers:

  • OpenAI gpt-3.5-turbo: 73ms per generated token
  • Azure gpt-3.5-turbo: 34ms per generated token
  • OpenAI gpt-4: 196ms per generated token

You can use these values to approximate the response time. E.g. for a request to Azure gpt-3.5-turbo with 600 output tokens, the latency will be roughly 34ms × 600 = 20,400ms ≈ 20.4 seconds.
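The arithmetic above can be wrapped into a tiny helper. A minimal sketch (the per-token rates are this thread’s measurements and will drift as the platforms improve; the dictionary keys are made up for illustration):

```python
# Per-token latency in ms, as measured in this thread (these numbers age fast).
MS_PER_TOKEN = {
    "openai/gpt-3.5-turbo": 73,
    "azure/gpt-3.5-turbo": 34,
    "openai/gpt-4": 196,
}

def estimate_latency_s(deployment: str, output_tokens: int) -> float:
    """Approximate response time as (ms per token) x (output tokens)."""
    return MS_PER_TOKEN[deployment] * output_tokens / 1000

print(estimate_latency_s("azure/gpt-3.5-turbo", 600))  # 20.4
```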

I’ll spare you the full details of the experiments; you can find them in my blog post about GPT response times.


I haven’t caught up on the LLM wars, but when we say gpt-3.5 Azure, is it the same model we are talking to, just hosted on Azure?

Correct. As far as I understand, the models should be identical. (I am not sure about the safety layer or other things on top, which may be different, though.)

Cool, not only is Azure faster, it is cheaper. Anyone seriously using GPT-3.5 should be using it.



Cheaper? How much cheaper? I thought they were the same price.


Did you test Azure GPT-4 by chance?

I don’t have access to Azure GPT-4 yet but have requested – when I do get access I’ll update the results!

AFAICT Azure and OpenAI GPTs indeed have the same price.

Is there a way to program the GPT-4 API to respond in a streaming manner, token by token, the way it is done on the ChatGPT platform?

Hi and welcome to the developer forum!

Yes there is: pass stream=True and the API returns the response chunk by chunk. Here is an example in Python:

import openai

def stream_openai_response(prompt):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0.5,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # yield the response token by token
    )
    for event in response:
        yield event

Obviously you then link this with the web-page side of things and have some JS code receiving the events.
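For completeness, here is a minimal sketch of the receiving side in Python (hypothetical helper name; it assumes the legacy streaming format, where each chunk carries choices[0].delta with an optional content piece):

```python
def collect_stream(events):
    """Accumulate a streamed ChatCompletion response, printing as it arrives."""
    parts = []
    for event in events:
        # The first chunk usually carries only the role, and the final chunk
        # is empty, so read `content` defensively.
        piece = event["choices"][0]["delta"].get("content", "")
        print(piece, end="", flush=True)
        parts.append(piece)
    return "".join(parts)
```

In a web app you would forward each piece over SSE or a WebSocket instead of printing it.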


Do you have the numbers for the correlation with input token length and total token length?

I suspect input has a small effect, at least with GPT-4, since I have observed it in my own usage.

I’m not sure what you are asking here. If the input tokens say “please say the word ‘hello’”, then the response back will likely be a single token that points to the word “hello”.

The output is not “linked” to the input just by the number of tokens used; it depends on the contents of those tokens.

From Azure docs:

Azure OpenAI requires registration and is currently only available to approved enterprise customers and partners.



I did run the experiments, and it is basically constant.



In another link from GPT response times that you shared, I saw “OpenAI GPT-4: 94ms per generated token”, while it is around 196ms in this post?

Also, have you ever tried to measure the response time of GPT-4 Turbo?

Thank you!

The numbers get out of date pretty quickly as everyone makes improvements to their platform.

The most recent I measured was this (from my blog post on Nov 7):

  • gpt-4-1106-preview (“gpt-4-turbo”) runs in 18ms/token
  • gpt-3.5-turbo-1106 (“the newest version of gpt-3.5”) runs in just 6.5ms/token
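For anyone who wants to reproduce such numbers, the measurement boils down to timing a token stream and dividing by the token count. A minimal sketch (my own hypothetical helper, not the author’s actual harness; it accepts any iterable of chunks, e.g. a streaming ChatCompletion response where each content chunk is roughly one token):

```python
import time

def measure_ms_per_token(stream, tokens_per_chunk=lambda chunk: 1):
    """Consume a stream of chunks and return the average ms per token."""
    start = time.perf_counter()
    n_tokens = 0
    for chunk in stream:
        n_tokens += tokens_per_chunk(chunk)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return elapsed_ms / n_tokens
```

As the thread notes, a long output (1,000+ tokens) makes the per-token figure robust even from a single run.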


Thank you for your reply!

Can you please explain more details about your testing scenarios?

  • Is the response time you measured based on many runs, with the average value computed?

  • How many input tokens did you feed to GPT-4 Turbo? How does the number of input tokens affect the response time?

  • I ran it just once; the per-token number is very robust because the output contains 1,000+ tokens. I’ve run many repetitions as well, but the results don’t change.

  • IIRC about 10 input tokens, but latency does not depend on input token count - see the last paragraph in this blog post.

  1. I agree with your timing computation.

  2. In my application, for each query I need to feed around 3,000-5,000 tokens to GPT-4 Turbo, and getting the result takes around 9.1s for around 100 output tokens.

Actually, I read your blog post. So I think my latency mostly comes from the “const.” component in your formula, because the number of input tokens is large in my case.
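That reading can be checked with quick arithmetic: subtract the generation part (using the ~18ms/token gpt-4-turbo rate quoted earlier in this thread, so this is only a rough estimate) from the total, and what remains is the constant overhead:

```python
def constant_overhead_s(total_s: float, output_tokens: int, ms_per_token: float) -> float:
    """Split total latency into generation time and a constant remainder."""
    return total_s - output_tokens * ms_per_token / 1000

# ~9.1s total for ~100 output tokens at ~18ms/token:
print(round(constant_overhead_s(9.1, 100, 18), 1))  # 7.3
```

So only about 1.8s of the 9.1s would be generation; the rest sits in the constant component, consistent with the large-input explanation above.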