GPT-4o-2024-08-06 slower than previous version

I can give you a benchmark.

This particular one uses 1700+ tokens of input: a prior chat turn and a long output preceding the question being posed. When repeated, this should invoke the prompt caching feature of the latest gpt-4o, meaning less computation and less expense, but not necessarily a guarantee of better performance. The client's max_retries = 0, and any errors would be dropped from the report. The calls run synchronously, alternating between models.
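
Roughly, the measurement loop looks like this. This is a minimal sketch, not the actual test script: the prompt contents, stat names, and rounding are placeholders, while the model names, max_tokens of 512, max_retries = 0, and the synchronous model alternation match the setup described above.

```python
import time

from openai import OpenAI

# max_retries=0 as described: a failed call raises immediately
# and is simply dropped from the report.
client = OpenAI(max_retries=0)

MODELS = ["gpt-4o-2024-08-06", "gpt-4o-2024-05-13"]
TRIALS = 5

# Placeholder conversation; the real test used 1700+ tokens of input,
# including a prior chat turn and a long output before the question.
messages = [{"role": "user", "content": "..."}]

def timed_stream(model: str) -> dict:
    """Stream one completion; return latency, total time, and token rates."""
    start = time.perf_counter()
    first = None
    tokens = 0
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=512,  # matches the 512 response tokens in the tables below
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()  # latency = time to first token
            tokens += 1  # assumes roughly one token per content chunk
    total = time.perf_counter() - start
    latency = first - start
    return {
        "latency (s)": round(latency, 4),
        "total response (s)": round(total, 4),
        "stream rate": round(tokens / (total - latency), 3),  # tokens/s after first token
        "total rate": round(tokens / total, 3),               # tokens/s including latency
    }

# Alternate models within each trial, synchronously, so both models face
# similar server conditions and repeats of 08-06 can hit the prompt cache.
for trial in range(TRIALS):
    for model in MODELS:
        print(model, timed_stream(model))
```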

Already, from the progress of the streaming chunk indicators, I can see slowness and pauses in the output of gpt-4o-2024-08-06. Then the results:

For 5 trials of gpt-4o-2024-08-06 @ 2024-10-15 12:17AM:

| Stat | Average | Cold | Minimum | Maximum |
|---|---|---|---|---|
| stream rate (tokens/s) | 44.960 | 36.3 | 36.3 | 48.8 |
| latency (s) | 0.811 | 1.5379 | 0.459 | 1.5379 |
| total response (s) | 12.312 | 15.6141 | 11.0963 | 15.6141 |
| total rate (tokens/s) | 42.244 | 32.791 | 32.791 | 46.142 |
| response tokens | 512.000 | 512 | 512 | 512 |

For 5 trials of gpt-4o-2024-05-13 @ 2024-10-15 12:17AM:

| Stat | Average | Cold | Minimum | Maximum |
|---|---|---|---|---|
| stream rate (tokens/s) | 92.340 | 98.1 | 68.1 | 102.3 |
| latency (s) | 0.467 | 0.5039 | 0.4 | 0.57 |
| total response (s) | 6.128 | 5.7147 | 5.4467 | 7.9079 |
| total rate (tokens/s) | 85.083 | 89.594 | 64.745 | 94.002 |
| response tokens | 512.000 | 512 | 512 | 512 |

"Cold" is the stats from the first call made to the model.

gpt-4o-2024-05-13 at its slowest is still 40.3% faster than gpt-4o-2024-08-06 at its fastest: the older model's minimum total rate of 64.745 tokens/s is 1.403× the newer model's maximum of 46.142 tokens/s.

One caveat is that the caching of 2024-08-06 may also mean that repeated calls get pinned to one server instance (here, every call returned the same system fingerprint), so the trials may not sample the full range of speeds across the different server types and loads in operation.
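
If you want to check this yourself, you can collect the system_fingerprint reported on each streamed chunk (this snippet reuses the client, messages, and TRIALS from the sketch above); a single repeated value across all trials is what suggests the pinning:

```python
# Collect the fingerprints seen across repeated streamed calls; a set
# containing only one value suggests calls are pinned to one server config.
fingerprints = set()
for _ in range(TRIALS):
    for chunk in client.chat.completions.create(
        model="gpt-4o-2024-08-06", messages=messages, max_tokens=512, stream=True
    ):
        if chunk.system_fingerprint:
            fingerprints.add(chunk.system_fingerprint)
print(fingerprints)
```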
