# GPT-4o tokens per second comparable to GPT-3.5 Turbo: data and analysis

## Performance Comparison of GPT Models: An Informal Analysis

TL;DR: Here’s the graph.

### Introduction

In this analysis, I compare the performance of three different GPT models: `gpt-35-turbo-0125`, `gpt-4o-2024-05-13`, and `gpt-4-turbo-2024-04-09`. My focus is on understanding the tokens per second each model can produce, which serves as a metric for their efficiency and speed. By examining the descriptive statistics and visualizing the data, I aim to determine which model is the fastest and whether the `gpt-4o-2024-05-13` offers significant improvements over its predecessor, `gpt-4-turbo-2024-04-09`.

### Findings

The data collected includes latency in milliseconds and tokens generated for various essay prompts. From this data, I calculated the tokens per second for each model. Here are the descriptive statistics for each model:

#### Comparative Statistics Table

| Statistic | gpt-35-turbo-0125 | gpt-4o-2024-05-13 | gpt-4-turbo-2024-04-09 |
|---|---|---|---|
| Count | 8 | 8 | 8 |
| Mean (tokens/sec) | 67.83 | 63.32 | 35.68 |
| Standard deviation | 11.61 | 14.49 | 3.31 |
| Minimum | 42.87 | 35.67 | 31.69 |
| 25th percentile | 64.16 | 56.87 | 33.25 |
| Median | 71.73 | 65.54 | 35.13 |
| 75th percentile | 75.90 | 72.45 | 37.44 |
| Maximum | 77.05 | 79.87 | 40.94 |
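The statistics above can be reproduced from the raw latency and token counts in the appendix. A minimal sketch using only the standard library, here for the `gpt-35-turbo-0125` rows (tokens per second = tokens ÷ latency in seconds):

```python
import statistics

# (latency_ms, tokens) pairs for gpt-35-turbo-0125, taken from the appendix
runs = [
    (21015, 901), (15473, 1004), (18126, 1123), (14535, 1120),
    (15355, 1161), (14744, 1132), (24725, 1692), (11863, 890),
]

# tokens per second = tokens / (latency in seconds)
tps = [tokens / (ms / 1000) for ms, tokens in runs]

print(f"count:  {len(tps)}")
print(f"mean:   {statistics.mean(tps):.2f}")
print(f"stdev:  {statistics.stdev(tps):.2f}")  # sample (n-1) standard deviation
print(f"min:    {min(tps):.2f}")
print(f"median: {statistics.median(tps):.2f}")
print(f"max:    {max(tps):.2f}")
```

This reproduces the mean (67.83), median (71.73), and standard deviation (11.61) in the table, which also confirms the table reports the sample (n-1) standard deviation rather than the population one.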

### Analysis

#### Performance

The `gpt-35-turbo-0125` model has the highest mean tokens per second (67.83), indicating it is the fastest among the three models tested. This is followed by `gpt-4o-2024-05-13` with a mean of 63.32 tokens per second. The `gpt-4-turbo-2024-04-09` model lags behind with a significantly lower mean of 35.68 tokens per second.

#### Consistency

The standard deviation of tokens per second captures the variability in performance. The `gpt-4-turbo-2024-04-09` model has the lowest standard deviation (3.31), suggesting consistent performance but at a slower rate. The `gpt-35-turbo-0125` model has a moderate standard deviation (11.61), indicating relatively consistent performance at high speed. The `gpt-4o-2024-05-13` model, while faster than its predecessor, has the highest standard deviation (14.49), indicating more variability in performance.

### Effective Speed Comparison

While `gpt-4o-2024-05-13` is not as fast as `gpt-35-turbo-0125`, it demonstrates a significant improvement over `gpt-4-turbo-2024-04-09`. The mean tokens per second of `gpt-4o-2024-05-13` (63.32) is almost double that of `gpt-4-turbo-2024-04-09` (35.68), a speedup of roughly 1.8x, confirming that `gpt-4o-2024-05-13` is effectively faster and more efficient than its predecessor. Most importantly, `gpt-4o-2024-05-13` offers a performance level close to that of `gpt-35-turbo-0125` despite being a more capable model. This makes `gpt-4o-2024-05-13` a significant upgrade, effectively matching the speed of `gpt-35-turbo-0125` while providing advanced capabilities.

### Conclusion

My analysis shows that `gpt-35-turbo-0125` is the fastest model in terms of tokens per second, making it the most efficient for generating large volumes of text quickly. However, `gpt-4o-2024-05-13` is a significant improvement over `gpt-4-turbo-2024-04-09`, offering nearly the same level of performance as `gpt-35-turbo-0125` with more variability. This suggests that `gpt-4o-2024-05-13` is a valuable upgrade, effectively bridging the gap between the older and newer models by matching the speed of `gpt-35-turbo-0125` while providing enhanced intelligence.

### Appendix: Original Data

| Prompt Description | Model | Latency (ms) | Tokens | Tokens per Second |
|---|---|---|---|---|
| Write a 20-paragraph essay on cars | gpt-35-turbo-0125 | 21015 | 901 | 42.87 |
| | gpt-4o-2024-05-13 | 20919 | 1476 | 70.56 |
| Write a 20-paragraph essay on birds | gpt-35-turbo-0125 | 15473 | 1004 | 64.89 |
| | gpt-4o-2024-05-13 | 43814 | 1563 | 35.67 |
| Write a 20-paragraph essay on birds | gpt-35-turbo-0125 | 18126 | 1123 | 61.96 |
| | gpt-4o-2024-05-13 | 28712 | 1668 | 58.09 |
| Write a 20-paragraph essay on dogs | gpt-35-turbo-0125 | 14535 | 1120 | 77.05 |
| | gpt-4o-2024-05-13 | 21612 | 1688 | 78.10 |
| Write a 20-paragraph essay on cats | gpt-35-turbo-0125 | 15355 | 1161 | 75.61 |
| | gpt-4o-2024-05-13 | 20971 | 1444 | 68.86 |
| Write a 2000-word essay on planes | gpt-35-turbo-0125 | 14744 | 1132 | 76.78 |
| | gpt-4o-2024-05-13 | 32842 | 1748 | 53.22 |
| Write a 2000-word essay on trucks | gpt-35-turbo-0125 | 24725 | 1692 | 68.43 |
| | gpt-4o-2024-05-13 | 16565 | 1323 | 79.87 |
| Write a 2000-word essay on roads | gpt-35-turbo-0125 | 11863 | 890 | 75.02 |
| | gpt-4o-2024-05-13 | 28821 | 1793 | 62.21 |
| Write a 2000-word essay on roads | gpt-4-turbo-2024-04-09 | 26558 | 890 | 33.54 |
| | gpt-4-turbo-2024-04-09 | 23486 | 925 | 39.39 |
| Write a 20-paragraph essay on cars | gpt-4-turbo-2024-04-09 | 34428 | 1138 | 33.05 |
| | gpt-4-turbo-2024-04-09 | 35809 | 1193 | 33.41 |
| Write a 20-paragraph essay on dogs | gpt-4-turbo-2024-04-09 | 33506 | 1231 | 36.73 |
| | gpt-4-turbo-2024-04-09 | 31524 | 999 | 31.69 |
| Write a 20-paragraph essay on cats | gpt-4-turbo-2024-04-09 | 24770 | 1014 | 40.94 |
| | gpt-4-turbo-2024-04-09 | 30743 | 1131 | 36.78 |

Nice work.

Latency should be the network latency plus context-setup time, i.e. the time to the first token out, not OpenAI’s printout at the bottom of the playground. Otherwise it is less meaningful, especially when the runs produce different token lengths. Such a measure also gives a user-satisfaction index: how long they sit without seeing anything.

Latency is a time delay, like the time it takes for a signal to bounce off a satellite and be returned, or the time it takes for your cell phone to encode 4G packets.

The time the total job takes is also valuable, but here it is likewise obscured by varying token counts; you could set max_tokens to equalize the runs and save yourself some money as well.

I have a few times I’ve shown benchmarks, also asking for articles about kittens, and I delineate first-token latency from the streaming rate of the rest.
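Separating first-token latency from the streaming rate just takes timestamped stream events. A minimal sketch (the event format and the `split_latency` helper are assumptions for illustration; in practice you would record `time.monotonic()` as each chunk arrives from a streamed response, e.g. with `stream=True` in the OpenAI client):

```python
def split_latency(events):
    """events: list of (seconds_since_request, tokens_in_chunk) pairs in
    arrival order. Returns (time_to_first_token, streaming_tokens_per_sec)."""
    ttft = events[0][0]                       # time to first token
    total_tokens = sum(n for _, n in events)
    stream_window = events[-1][0] - ttft      # time spent streaming after the first chunk
    # tokens delivered after the first chunk, over the post-first-chunk window
    if stream_window > 0:
        rate = (total_tokens - events[0][1]) / stream_window
    else:
        rate = float("inf")
    return ttft, rate

# synthetic example: first chunk after 0.8 s, then steady streaming
events = [(0.8, 1), (1.0, 20), (1.2, 20), (1.4, 20), (1.6, 20), (1.8, 20)]
ttft, rate = split_latency(events)
print(ttft, rate)  # 0.8 s to first token; 100 tokens over 1.0 s of streaming
```

Reporting these two numbers separately keeps network/setup delay from polluting the per-token generation rate, which is the point of the reply above.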

Time of day also matters. Check out this site where jobs are continuously run:

One thing I thought to investigate, just out of curiosity, is how “bursty” models are: for example, the statistics of tokens being released ~15 at a time by the content filter or similar.
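Burstiness could be quantified from the same per-chunk records by tallying how many tokens arrive in each stream event. A sketch with made-up chunk sizes (a smooth stream would show sizes near 1; a filter buffering output would show clumps around a fixed size):

```python
from collections import Counter
import statistics

# hypothetical per-chunk token counts recorded from one streamed response
chunk_sizes = [1, 15, 15, 15, 14, 16, 15, 1, 15, 15]

dist = Counter(chunk_sizes)
print("chunk-size distribution:", dict(sorted(dist.items())))
print(f"mean chunk size: {statistics.mean(chunk_sizes):.1f}")
print(f"stdev:           {statistics.stdev(chunk_sizes):.1f}")
```

A distribution concentrated around ~15 tokens per event, as in this made-up sample, would be consistent with a content filter releasing output in fixed-size batches.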


These are great numbers. I wonder how much these results deviate when testing in different regions and at different times of day. When googling, I found reported speedups of GPT-4o over GPT-4 Turbo ranging from 1.5x to 5x.

I also wonder about the number of input tokens processed per second. I won’t be the only person for whom this is the bottleneck. I have not found any numbers about that.