Inference speed of different models

I’m a bit puzzled by how OpenAI rates the inference speed of their models, and how the API speed compares to something like Gemini. I ran a quick speed test and these were the results:

$ uvx tacho gpt-4o gpt-4o-mini o4-mini o3 gpt-4.1-nano gpt-4.1-mini gpt-4.1
┏━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━┓
┃ Model        ┃ Avg t/s ┃ Min t/s ┃ Max t/s ┃  Time ┃ Tokens ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━┩
│ o4-mini      │   179.4 │   165.8 │   190.9 │  5.6s │   1000 │
│ o3           │   115.4 │    96.6 │   134.6 │  8.8s │   1000 │
│ gpt-4.1-nano │    95.9 │    75.7 │   106.6 │  5.3s │    500 │
│ gpt-4.1      │    67.6 │    56.9 │    80.7 │  7.5s │    500 │
│ gpt-4.1-mini │    61.7 │    51.5 │    68.4 │  8.2s │    500 │
│ gpt-4o-mini  │    59.0 │    46.1 │    69.9 │  8.7s │    500 │
│ gpt-4o       │    31.7 │    29.1 │    35.9 │ 15.9s │    500 │
└──────────────┴─────────┴─────────┴─────────┴───────┴────────┘

For example, gpt-4.1-nano, which has the highest speed rating of all models (5 stars), is about 17% slower than o3 (95.9 vs 115.4 t/s), even though o3 is rated as the slowest model with 1 star. Also, gpt-4.1, which is rated at the same speed as gpt-4o, is actually about twice as fast (67.6 vs 31.7 t/s).
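
In case anyone wants to sanity-check numbers like these without the tool, here is a minimal sketch of how a tokens-per-second figure can be measured (this is not tacho’s actual implementation; it assumes the OpenAI Python SDK v1+ and OPENAI_API_KEY in the environment, and the prompt is just a placeholder):

import time
from openai import OpenAI

client = OpenAI()

def measure_tps(model: str, max_tokens: int = 500) -> float:
    # Time one chat completion and divide completion tokens by wall-clock time.
    # Request/queue latency is included, so this is a conservative figure.
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write a 500-word story about a robot."}],
        max_completion_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    return response.usage.completion_tokens / elapsed

for model in ("gpt-4.1-nano", "gpt-4.1", "o4-mini"):
    print(f"{model}: {measure_tps(model):.1f} t/s")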

Compared to Gemini models, the OpenAI API is also quite slow:

$ uvx tacho gemini/gemini-2.5-flash gemini/gemini-2.5-pro gemini/gemini-2.5-flash-lite-preview-06-17 openai/gpt-4.1-mini openai/gpt-4.1 openai/gpt-4.1-nano
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━┓
┃ Model                                      ┃ Avg t/s ┃ Min t/s ┃ Max t/s ┃  Time ┃ Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━┩
│ gemini/gemini-2.5-flash-lite-preview-06-17 │   291.0 │   258.1 │   326.7 │  1.7s │    500 │
│ gemini/gemini-2.5-flash                    │   281.4 │   271.5 │   287.3 │  3.5s │    998 │
│ gemini/gemini-2.5-pro                      │   145.7 │   137.0 │   155.5 │  6.9s │    998 │
│ openai/gpt-4.1-nano                        │    86.0 │    58.2 │    97.8 │  6.0s │    500 │
│ openai/gpt-4.1-mini                        │    57.6 │    49.9 │    66.0 │  8.8s │    500 │
│ openai/gpt-4.1                             │    39.7 │    25.3 │    55.1 │ 13.8s │    500 │
└────────────────────────────────────────────┴─────────┴─────────┴─────────┴───────┴────────┘
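
The provider/model prefixes above are LiteLLM-style names, so a similar cross-provider check can be scripted with LiteLLM directly, including the min/avg/max spread over a few trials like the columns above. A rough sketch, assuming GEMINI_API_KEY and OPENAI_API_KEY are set (model picks and trial count are just examples):

import statistics
import time
from litellm import completion

def trial_tps(model: str, max_tokens: int = 500) -> float:
    # One request; completion tokens divided by wall-clock time (latency included).
    start = time.perf_counter()
    response = completion(
        model=model,
        messages=[{"role": "user", "content": "Write a 500-word story about a robot."}],
        max_tokens=max_tokens,
    )
    return response.usage.completion_tokens / (time.perf_counter() - start)

for model in ("gemini/gemini-2.5-flash", "openai/gpt-4.1-mini"):
    runs = [trial_tps(model) for _ in range(3)]
    print(f"{model}: avg {statistics.mean(runs):.1f}, "
          f"min {min(runs):.1f}, max {max(runs):.1f} t/s")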

Is there any chance this will improve in the near future?

On launch day, this was the speed obtained for the gpt-4.1 sub-models, over 100 trials of chat completions to 256 tokens:

So even with your larger token count diluting the impact of startup latency on the score, the models and the servers running them have slowed from their launch-day capability.
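
For reference on how that startup cost can be separated out: with streaming, the time to the first token (queueing plus prefill) can be measured apart from the per-token generation rate, which is why a larger completion budget makes the fixed startup latency matter less in an overall t/s score. A rough sketch, assuming the OpenAI Python SDK; the helper name and prompt are mine:

import time
from openai import OpenAI

client = OpenAI()

def split_latency(model: str, max_tokens: int = 256) -> tuple[float, float]:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write a 200-word story about a robot."}],
        max_completion_tokens=max_tokens,
        stream=True,
        stream_options={"include_usage": True},  # final chunk reports token usage
    )
    first_token_at = None
    usage = None
    for chunk in stream:
        if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
            first_token_at = time.perf_counter() - start  # startup latency (TTFT)
        if chunk.usage is not None:
            usage = chunk.usage
    total = time.perf_counter() - start
    generation_tps = usage.completion_tokens / (total - first_token_at)
    return first_token_at, generation_tps

ttft, tps = split_latency("gpt-4.1")
print(f"time to first token: {ttft:.2f}s, generation: {tps:.1f} t/s")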

This isn’t a case of “we dedicated more total computation per token for higher quality” … it never seems to go in that direction. It is fully loaded infrastructure, and you’re not paying hundreds of thousands for “scale tier” dedicated instances. Plus, I suspect the reason that -mini can fall behind even full gpt-4.1 is that the hardware running it can be older, with a smaller footprint.