Inference speed of different models

I’m a bit puzzled by how OpenAI rates the inference speed of their models, and how the API speed compares to something like Gemini. I ran a quick speed test and these were the results:

$ uvx tacho gpt-4o gpt-4o-mini o4-mini o3 gpt-4.1-nano gpt-4.1-mini gpt-4.1
┏━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━┓
┃ Model        ┃ Avg t/s ┃ Min t/s ┃ Max t/s ┃  Time ┃ Tokens ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━┩
│ o4-mini      │   179.4 │   165.8 │   190.9 │  5.6s │   1000 │
│ o3           │   115.4 │    96.6 │   134.6 │  8.8s │   1000 │
│ gpt-4.1-nano │    95.9 │    75.7 │   106.6 │  5.3s │    500 │
│ gpt-4.1      │    67.6 │    56.9 │    80.7 │  7.5s │    500 │
│ gpt-4.1-mini │    61.7 │    51.5 │    68.4 │  8.2s │    500 │
│ gpt-4o-mini  │    59.0 │    46.1 │    69.9 │  8.7s │    500 │
│ gpt-4o       │    31.7 │    29.1 │    35.9 │ 15.9s │    500 │
└──────────────┴─────────┴─────────┴─────────┴───────┴────────┘

For example, gpt-4.1-nano, which has the highest speed rating of all models (5 stars), is about 17% slower than o3 (95.9 vs 115.4 t/s), even though o3 is rated as the slowest model with 1 star. Also, gpt-4.1, which is rated at the same speed as gpt-4o, is actually about twice as fast (67.6 vs 31.7 t/s).
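
In case anyone wants to sanity-check numbers like these without the tool, here is a minimal sketch of how a tokens-per-second figure can be measured (this is not tacho’s actual implementation; it assumes the OpenAI Python SDK v1+ and OPENAI_API_KEY in the environment, and the prompt is just a placeholder):

import time
from openai import OpenAI

client = OpenAI()

def measure_tps(model: str, max_tokens: int = 500) -> float:
    # Time one chat completion and divide completion tokens by wall-clock time.
    # Request/queue latency is included, so this is a conservative figure.
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write a 500-word story about a robot."}],
        max_completion_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    return response.usage.completion_tokens / elapsed

for model in ("gpt-4.1-nano", "gpt-4.1", "o4-mini"):
    print(f"{model}: {measure_tps(model):.1f} t/s")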

Compared to Gemini models, the OpenAI API is also quite slow:

$ uvx tacho gemini/gemini-2.5-flash gemini/gemini-2.5-pro gemini/gemini-2.5-flash-lite-preview-06-17 openai/gpt-4.1-mini openai/gpt-4.1 openai/gpt-4.1-nano
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━┓
┃ Model                                      ┃ Avg t/s ┃ Min t/s ┃ Max t/s ┃  Time ┃ Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━┩
│ gemini/gemini-2.5-flash-lite-preview-06-17 │   291.0 │   258.1 │   326.7 │  1.7s │    500 │
│ gemini/gemini-2.5-flash                    │   281.4 │   271.5 │   287.3 │  3.5s │    998 │
│ gemini/gemini-2.5-pro                      │   145.7 │   137.0 │   155.5 │  6.9s │    998 │
│ openai/gpt-4.1-nano                        │    86.0 │    58.2 │    97.8 │  6.0s │    500 │
│ openai/gpt-4.1-mini                        │    57.6 │    49.9 │    66.0 │  8.8s │    500 │
│ openai/gpt-4.1                             │    39.7 │    25.3 │    55.1 │ 13.8s │    500 │
└────────────────────────────────────────────┴─────────┴─────────┴─────────┴───────┴────────┘
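
The provider/model prefixes above are LiteLLM-style names, so a similar cross-provider check can be scripted with LiteLLM directly, including the min/avg/max spread over a few trials like the columns above. A rough sketch, assuming GEMINI_API_KEY and OPENAI_API_KEY are set (model picks and trial count are just examples):

import statistics
import time
from litellm import completion

def trial_tps(model: str, max_tokens: int = 500) -> float:
    # One request; completion tokens divided by wall-clock time (latency included).
    start = time.perf_counter()
    response = completion(
        model=model,
        messages=[{"role": "user", "content": "Write a 500-word story about a robot."}],
        max_tokens=max_tokens,
    )
    return response.usage.completion_tokens / (time.perf_counter() - start)

for model in ("gemini/gemini-2.5-flash", "openai/gpt-4.1-mini"):
    runs = [trial_tps(model) for _ in range(3)]
    print(f"{model}: avg {statistics.mean(runs):.1f}, "
          f"min {min(runs):.1f}, max {max(runs):.1f} t/s")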

Is there any chance this will improve in the near future?

On launch day, this was the speed obtained for the gpt-4.1 sub-models, over 100 trials of chat completions to 256 tokens:

So even with your larger token count diluting the impact of startup latency on the score, the models and the servers running them have slowed from their launch-day capability.
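
For reference on how that startup cost can be separated out: with streaming, the time to the first token (queueing plus prefill) can be measured apart from the per-token generation rate, which is why a larger completion budget makes the fixed startup latency matter less in an overall t/s score. A rough sketch, assuming the OpenAI Python SDK; the helper name and prompt are mine:

import time
from openai import OpenAI

client = OpenAI()

def split_latency(model: str, max_tokens: int = 256) -> tuple[float, float]:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write a 200-word story about a robot."}],
        max_completion_tokens=max_tokens,
        stream=True,
        stream_options={"include_usage": True},  # final chunk reports token usage
    )
    first_token_at = None
    usage = None
    for chunk in stream:
        if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
            first_token_at = time.perf_counter() - start  # startup latency (TTFT)
        if chunk.usage is not None:
            usage = chunk.usage
    total = time.perf_counter() - start
    generation_tps = usage.completion_tokens / (total - first_token_at)
    return first_token_at, generation_tps

ttft, tps = split_latency("gpt-4.1")
print(f"time to first token: {ttft:.2f}s, generation: {tps:.1f} t/s")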

This isn’t a case of “we dedicated more total computation per token for higher quality” … it never seems to go in that direction. It is fully loaded infrastructure, and you’re not paying hundreds of thousands for “scale tier” dedicated instances. Plus, I suspect the reason that -mini can fall behind even full gpt-4.1 is that the hardware running it can be older, with a smaller footprint.