Hello,
We noticed that the new version of GPT-4o, GPT-4o-2024-08-06, is significantly slower (50-80%) than the previous GPT-4o-2024-05-13. Is this expected?
Is there any latency benchmark to compare between versions of the same model?
I can give you a benchmark.
This particular run uses 1700+ tokens of input: a prior chat turn and a long output before the question being posed. When repeated, this should invoke the prompt caching feature of the latest gpt-4o — less AI computation and less expense, but perhaps not a guarantee of better performance. The client's max_retries = 0, so any errors would simply be dropped from the report. The calls are run synchronously, alternating between the two models.
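For anyone who wants to reproduce something similar, here is a minimal sketch of that kind of harness, assuming the official openai Python SDK. The placeholder prompt, the 512-token cap, and the chunk-based rate calculation are my own assumptions, not necessarily identical to what produced the tables below.

```python
import time
from openai import OpenAI

# Retries disabled so a failed call is dropped instead of being silently retried
client = OpenAI(max_retries=0)

MODELS = ["gpt-4o-2024-08-06", "gpt-4o-2024-05-13"]

# Placeholder messages; the real benchmark used 1700+ tokens of input
# (a prior chat turn plus a long output before the question being posed)
MESSAGES = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "(long prior turn, ~1700 tokens, goes here)"},
    {"role": "assistant", "content": "(long prior answer goes here)"},
    {"role": "user", "content": "Now answer the actual question at length."},
]

def timed_stream(model: str) -> dict:
    start = time.perf_counter()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=MESSAGES,
        max_tokens=512,   # matches the 512 response tokens in the tables
        stream=True,
    )
    for chunk in stream:
        if first is None:
            first = time.perf_counter()  # latency = time to first chunk
        chunks += 1
    end = time.perf_counter()
    return {
        "latency_s": round(first - start, 4),
        "total_s": round(end - start, 4),
        "stream_rate": round(chunks / max(end - first, 1e-9), 1),  # chunks/s after first chunk
        "total_rate": round(chunks / (end - start), 1),
    }

# Run synchronously, alternating between the two models
for trial in range(5):
    for model in MODELS:
        print(model, timed_stream(model))
```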
I can already see from the progress of the streaming chunk indicators that gpt-4o-2024-08-06 is slow and pauses during output. Then the results:
For 5 trials of gpt-4o-2024-08-06 @ 2024-10-15 12:17AM:
| Stat | Average | Cold | Minimum | Maximum |
|---|---|---|---|---|
| stream rate (tokens/s) | 44.960 | 36.3 | 36.3 | 48.8 |
| latency (s) | 0.811 | 1.5379 | 0.459 | 1.5379 |
| total response (s) | 12.312 | 15.6141 | 11.0963 | 15.6141 |
| total rate (tokens/s) | 42.244 | 32.791 | 32.791 | 46.142 |
| response tokens | 512.000 | 512 | 512 | 512 |
For 5 trials of gpt-4o-2024-05-13 @ 2024-10-15 12:17AM:
| Stat | Average | Cold | Minimum | Maximum |
|---|---|---|---|---|
| stream rate (tokens/s) | 92.340 | 98.1 | 68.1 | 102.3 |
| latency (s) | 0.467 | 0.5039 | 0.4 | 0.57 |
| total response (s) | 6.128 | 5.7147 | 5.4467 | 7.9079 |
| total rate (tokens/s) | 85.083 | 89.594 | 64.745 | 94.002 |
| response tokens | 512.000 | 512 | 512 | 512 |
“Cold” is the stats of the first call made to the model.
gpt-4o-2024-05-13 at its slowest is 40.3% faster than gpt-4o-2024-08-06 at its fastest.
One caveat: the prompt caching of 2024-08-06 also means that repeated calls may be pinned to one server instance (here, every call returned the same system fingerprint), so there may not be a large sampling of all the possible speeds across the different types and loads of servers in operation.
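If you want to check for that pinning yourself, the fingerprint is exposed on the response object; a minimal sketch, again assuming the openai Python SDK (the tiny prompt here is just for illustration):

```python
from collections import Counter
from openai import OpenAI

client = OpenAI(max_retries=0)

# Count how many distinct backend configurations served the repeated calls;
# a single fingerprint across all trials suggests the calls were pinned.
fingerprints = Counter()
for _ in range(5):
    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=16,
    )
    fingerprints[resp.system_fingerprint] += 1

print(fingerprints)
```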
Thank you for the benchmark. So it's expected with the new version?
I was wondering if there was an official announcement about the latency increase, or whether it's expected with this kind of model refresh?
It's not expected. Cheaper cost should relate to less computation, and less computation means less time spent per token.
It is more likely about the apportionment of compute units to the pool outside scale tier, and about keeping servers maxed out and running batches in idle time, while the older model that gpt-4o previously pointed to might have less usage currently and be sitting there waiting for you, giving fast rates similar to launch day.
It could also be "we shrank this model so much that it now runs on a five-year-old languishing GPU", layers on top of the new model screening its outputs as they are produced, or other architecture differences.
Basically - it is what it is, and hopefully OpenAI is being fair to all customers.
In case it is helpful to anyone else, here is a chart of the response times for all of our recent assistant requests. This spans lots of different assistants. We have a timeout of 60s, so you’ll see some that are being limited at that value. This chart spans about the last month.
Here’s a more useful chart, where you can see the actual dates. Note that the most recent is on the left. We are seeing that durations are definitely longer since the beginning of the year.
What you have to look at currently is the minimum response time on the left in the current graph. Counter-intuitive.
Perhaps reverse the data so that the origin of the chart is the oldest point and not the newest? Then you might not need 100 little dates as an axis legend.
Then add a trend line, and then a separate chart of the percentage of timeouts after they've been filtered out of this one.
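Something like this, assuming the request log can be loaded into a DataFrame with a timestamp and a duration column (the column names, CSV source, and the 60 s cutoff are my assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed layout: one row per request, with 'timestamp' and 'duration_s' columns
df = pd.read_csv("assistant_requests.csv", parse_dates=["timestamp"])
df = df.sort_values("timestamp")            # oldest on the left

TIMEOUT_S = 60
ok = df[df["duration_s"] < TIMEOUT_S]       # drop requests that hit the timeout

# Durations over time plus a simple linear trend line
x = ok["timestamp"].map(pd.Timestamp.toordinal)
slope, intercept = np.polyfit(x, ok["duration_s"], 1)
plt.figure()
plt.plot(ok["timestamp"], ok["duration_s"], ".", markersize=2, label="duration")
plt.plot(ok["timestamp"], slope * x + intercept, label="trend")
plt.ylabel("response time (s)")
plt.legend()

# Separate chart: daily percentage of requests that timed out
timed_out = df["duration_s"] >= TIMEOUT_S
daily_pct = timed_out.groupby(df["timestamp"].dt.date).mean() * 100
plt.figure()
plt.plot(daily_pct.index, daily_pct.values)
plt.ylabel("% of requests timing out")
plt.show()
```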
The OpenAI libraries will also retry silently upon failure; you'd need to shut this off (max_retries=0 in the Python SDK) to get truthful single-call metrics.
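A quick way to do that in the Python SDK, either for the whole client or just for the calls you are timing (with_options is part of that SDK):

```python
from openai import OpenAI

# Disable automatic retries for the whole client...
client = OpenAI(max_retries=0)

# ...or override it (and the timeout) only for the calls being measured
resp = client.with_options(max_retries=0, timeout=60.0).chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
```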
OpenAI did say that, due to the lower use expected over the Christmas/New Year's/Hanukkah holiday, they were even giving out more Sora credits. Yesterday was the first "back at work" day for the white-collar holiday business crowd that might shut down manufacturing and engineering for two weeks.
Sorry. But you got about 30 seconds of my time to generate that chart for you. Your suggestions are perfectly valid, but I’ll leave that to others. I’m too busy getting our solutions into customer hands!
I should mention that RAG is an important part of our application, so some of that delay is the time required to do the search, but it's not easy to disentangle the sources of delay.
One more point: we're using the Assistants "run" API along with file_search, and we're using gpt-4o-mini as the model.
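One way to partially disentangle that delay (a sketch, assuming the openai Python SDK's beta Assistants helpers and an existing file_search-enabled assistant; the assistant ID and prompt are placeholders) is to time the run and then look at the timestamps on its individual steps, which separate the tool-call (file_search) steps from the message-creation steps:

```python
import time
from openai import OpenAI

client = OpenAI(max_retries=0)
ASSISTANT_ID = "asst_..."   # placeholder: your file_search-enabled assistant

thread = client.beta.threads.create(
    messages=[{"role": "user", "content": "A question that requires file_search"}]
)

start = time.perf_counter()
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=ASSISTANT_ID,
)
print(f"wall-clock run time: {time.perf_counter() - start:.2f}s, status: {run.status}")

# Each step carries created_at/completed_at (Unix seconds), so you can see
# roughly how long the tool-call steps took vs. generating the reply.
steps = client.beta.threads.runs.steps.list(thread_id=thread.id, run_id=run.id)
for step in steps.data:
    if step.completed_at and step.created_at:
        print(step.step_details.type, f"{step.completed_at - step.created_at}s")
```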
The response headers returned include a servicing time reported by the API. If you're using httpx (Python), there's also a completion time to be obtained there.
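A sketch of both, assuming the Python SDK's with_raw_response helper (openai-processing-ms being the API's processing-time header, and the underlying httpx response recording its own elapsed time):

```python
from openai import OpenAI

client = OpenAI()

# with_raw_response gives access to the HTTP response as well as the parsed object
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)

print("API processing time (ms):", raw.headers.get("openai-processing-ms"))
print("request id:", raw.headers.get("x-request-id"))

# The underlying httpx.Response also records how long the exchange took
print("httpx elapsed:", raw.http_response.elapsed)

completion = raw.parse()   # the usual ChatCompletion object
print(completion.choices[0].message.content)
```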