GPT-3.5 and GPT-4 API response time measurements - FYI

Hi all,

Since API slowness is a consistent issue, I ran some experiments to measure the response times of GPT-3.5 and GPT-4, comparing both OpenAI and Azure.

As a reminder, the response time depends mostly on the number of output tokens generated by the model.

Here’s a summary of the results, in three numbers:

  • OpenAI gpt-3.5-turbo: 73ms per generated token
  • Azure gpt-3.5-turbo: 34ms per generated token
  • OpenAI gpt-4: 196ms per generated token

You can use these values to approximate the response time. E.g. for a request to Azure gpt-3.5-turbo generating 600 output tokens, the latency will be roughly 34 ms × 600 = 20.4 seconds.
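The approximation above can be sketched as a small helper. The per-token values come from the measurements in this post; treat them as ballpark figures measured at one point in time, not guarantees.

    # Rough latency estimate based on the measured per-token generation times.
    MS_PER_TOKEN = {
        ("openai", "gpt-3.5-turbo"): 73,
        ("azure", "gpt-3.5-turbo"): 34,
        ("openai", "gpt-4"): 196,
    }

    def estimate_latency_seconds(provider: str, model: str, output_tokens: int) -> float:
        """Approximate response time in seconds for a given number of output tokens."""
        return MS_PER_TOKEN[(provider, model)] * output_tokens / 1000

    print(estimate_latency_seconds("azure", "gpt-3.5-turbo", 600))  # → 20.4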

I’ll spare you the full details of the experiments; you can find them in my blog post about GPT response times.


I haven’t caught up on the LLM wars, but when we say gpt-3.5 Azure, is it the same model we are talking to, just hosted on Azure?

Correct. As far as I understand, the models should be identical. (I’m not sure about the safety layer or other things on top, which may differ.)

Cool, so not only is Azure faster, it’s also cheaper. Anyone seriously using GPT-3.5 should be using it.



Cheaper? How much cheaper? I thought they were the same price.


Did you test Azure GPT-4 by chance?

I don’t have access to Azure GPT-4 yet but have requested it – when I do get access I’ll update the results!

AFAICT Azure and OpenAI GPT-s indeed have the same price.

Is there a way to program the GPT-4 API to respond in a streaming manner, token by token, the way it is done on the ChatGPT platform?

Hi and welcome to the developer forum!

Yes there is, here is an example in Python. The key is passing stream=True, which makes the API return the completion incrementally as a stream of chunks instead of one final response.

import openai

def stream_openai_response(prompt):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0.5,
        stream=True,
        messages=[{"role": "user", "content": prompt}],
    )
    # With stream=True, `response` is an iterator of chunks,
    # roughly one per generated token.
    for event in response:
        yield event

Obviously you’d link this up with the web page side of things and have some JS code receiving the events.
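On the Python side, each streamed chunk carries a delta with just the newly generated content. A minimal sketch of pulling the text out of a chunk, assuming the event shape returned by the pre-1.0 openai Python library (the first and last chunks may have no "content" key, so default to an empty string):

def extract_token(event) -> str:
    # Streaming chunks put new text under choices[0]["delta"]["content"];
    # role-only and final chunks omit the "content" key entirely.
    return event["choices"][0]["delta"].get("content", "")

# Example chunk in the shape the streaming API returns:
chunk = {"choices": [{"delta": {"content": "Hello"}}]}
print(extract_token(chunk))  # → Hello

In practice you would call extract_token on each event yielded by stream_openai_response and forward the pieces to the browser as they arrive.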