I can give you a benchmark.
This particular benchmark is run with 1700+ tokens of input: a prior chat turn and its long output, followed by the question being posed. When repeated, this should invoke the context caching feature of the latest gpt-4o, meaning less AI computation and less expense, but perhaps not a guarantee of better performance. The client’s max_retries = 0, and any errored calls are dropped from the report. The calls are run synchronously, alternating between models.
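For context, here is roughly the shape of the harness being described (a minimal sketch, not the actual script; it assumes the OpenAI Python SDK v1 client, placeholder prompt strings standing in for the real 1700+ token context, and max_tokens=512 to match the capped response length in the tables below):

```python
import time
from openai import OpenAI

client = OpenAI(max_retries=0)  # no retries, so errored calls are simply dropped

MODELS = ["gpt-4o-2024-08-06", "gpt-4o-2024-05-13"]
TRIALS = 5

# Placeholder content: in practice the prior turn plus its long answer total
# 1700+ input tokens, so repeated calls can be served from the prompt cache.
messages = [
    {"role": "user", "content": "…your earlier question…"},
    {"role": "assistant", "content": "…the model's long earlier answer…"},
    {"role": "user", "content": "…the new question being posed…"},
]

def run_trial(model: str) -> dict:
    start = time.perf_counter()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=512,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()  # time of first content chunk = latency
            chunks += 1
            print(".", end="", flush=True)   # streaming chunk progress indicator
    end = time.perf_counter()
    return {
        "latency": first - start,   # seconds to first content chunk
        "total": end - start,       # total response time in seconds
        "chunks": chunks,           # rough proxy for response tokens
    }

results = {m: [] for m in MODELS}
for _ in range(TRIALS):
    for model in MODELS:            # synchronous calls, alternating between models
        try:
            results[model].append(run_trial(model))
        except Exception:
            pass                    # errors are dropped from the report
```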
From the progress of the streaming chunk indicators alone, I can already see slowness and pauses in the output of gpt-4o-2024-08-06. Then the results:
For 5 trials of gpt-4o-2024-08-06 @ 2024-10-15 12:17AM:
Stat | Average | Cold | Minimum | Maximum |
---|---|---|---|---|
stream rate (tokens/s) | 44.960 | 36.3 | 36.3 | 48.8 |
latency (s) | 0.811 | 1.5379 | 0.459 | 1.5379 |
total response (s) | 12.312 | 15.6141 | 11.0963 | 15.6141 |
total rate (tokens/s) | 42.244 | 32.791 | 32.791 | 46.142 |
response tokens | 512.000 | 512 | 512 | 512 |
For 5 trials of gpt-4o-2024-05-13 @ 2024-10-15 12:17AM:
Stat | Average | Cold | Minimum | Maximum |
---|---|---|---|---|
stream rate (tokens/s) | 92.340 | 98.1 | 68.1 | 102.3 |
latency (s) | 0.467 | 0.5039 | 0.4 | 0.57 |
total response (s) | 6.128 | 5.7147 | 5.4467 | 7.9079 |
total rate (tokens/s) | 85.083 | 89.594 | 64.745 | 94.002 |
response tokens | 512.000 | 512 | 512 | 512 |
“cold” is the stats of the first call made to each model.
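For anyone reproducing this, the two rate stats are consistent with being computed per trial and then averaged (which is why the averages are not simple ratios of the averaged tokens and times). A sketch of one plausible derivation, my assumption rather than the script's exact code:

```python
def rates(tokens: int, latency_s: float, total_s: float) -> tuple[float, float]:
    # Assumed definitions: total rate spreads all tokens over the whole call,
    # while stream rate excludes the first token and the latency before it,
    # measuring pure generation throughput.
    total_rate = tokens / total_s                       # e.g. 512 / 12.312 ≈ 41.6 tokens/s
    stream_rate = (tokens - 1) / (total_s - latency_s)  # e.g. 511 / 11.501 ≈ 44.4 tokens/s
    return stream_rate, total_rate
```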
Comparing total rates, gpt-4o-2024-05-13 at its slowest (64.745 tokens/s) is 40.3% faster than gpt-4o-2024-08-06 at its fastest (46.142 tokens/s): 64.745 / 46.142 ≈ 1.403.
One caveat is that the caching in 2024-08-06 may also pin repeated calls to a single server instance (every call here returned the same system fingerprint), so these trials may not broadly sample the range of speeds across the different server types and loads in operation.