# GPT-4o tokens per second comparable to GPT-3.5 Turbo: data and analysis

## Performance Comparison of GPT Models: An Informal Analysis

TL;DR: Here’s the graph.

### Introduction

In this analysis, I compare the performance of three different GPT models: `gpt-35-turbo-0125`, `gpt-4o-2024-05-13`, and `gpt-4-turbo-2024-04-09`. My focus is on understanding the tokens per second each model can produce, which serves as a metric for their efficiency and speed. By examining the descriptive statistics and visualizing the data, I aim to determine which model is the fastest and whether the `gpt-4o-2024-05-13` offers significant improvements over its predecessor, `gpt-4-turbo-2024-04-09`.

### Findings

The data collected includes latency in milliseconds and tokens generated for various essay prompts. From this data, I calculated the tokens per second for each model. Here are the descriptive statistics for each model:

#### Comparative Statistics Table

| Statistic | gpt-35-turbo-0125 | gpt-4o-2024-05-13 | gpt-4-turbo-2024-04-09 |
|---|---|---|---|
| Count | 8 | 8 | 8 |
| Mean (tokens/sec) | 67.83 | 63.32 | 35.68 |
| Standard deviation | 11.61 | 14.49 | 3.31 |
| Minimum | 42.87 | 35.67 | 31.69 |
| 25th percentile | 64.16 | 56.87 | 33.25 |
| Median | 71.73 | 65.54 | 35.13 |
| 75th percentile | 75.90 | 72.45 | 37.44 |
| Maximum | 77.05 | 79.87 | 40.94 |
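The statistics above can be reproduced from the raw latency and token counts in the appendix. A minimal sketch using only the standard library, here for the `gpt-35-turbo-0125` rows (tokens per second = tokens ÷ latency in seconds):

```python
import statistics

# (latency_ms, tokens) pairs for gpt-35-turbo-0125, taken from the appendix
runs = [
    (21015, 901), (15473, 1004), (18126, 1123), (14535, 1120),
    (15355, 1161), (14744, 1132), (24725, 1692), (11863, 890),
]

# tokens per second = tokens / (latency in seconds)
tps = [tokens / (ms / 1000) for ms, tokens in runs]

print(f"count:  {len(tps)}")
print(f"mean:   {statistics.mean(tps):.2f}")
print(f"stdev:  {statistics.stdev(tps):.2f}")  # sample (n-1) standard deviation
print(f"min:    {min(tps):.2f}")
print(f"median: {statistics.median(tps):.2f}")
print(f"max:    {max(tps):.2f}")
```

This reproduces the mean (67.83), median (71.73), and standard deviation (11.61) in the table, which also confirms the table reports the sample (n-1) standard deviation rather than the population one.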

### Analysis

#### Performance

The `gpt-35-turbo-0125` model has the highest mean tokens per second (67.83), indicating it is the fastest among the three models tested. This is followed by `gpt-4o-2024-05-13` with a mean of 63.32 tokens per second. The `gpt-4-turbo-2024-04-09` model lags behind with a significantly lower mean of 35.68 tokens per second.

#### Consistency

The standard deviation of tokens per second captures the variability in performance. The `gpt-4-turbo-2024-04-09` model has the lowest standard deviation (3.31), suggesting consistent performance but at a slower rate. The `gpt-35-turbo-0125` model has a moderate standard deviation (11.61), indicating relatively consistent performance at high speed. The `gpt-4o-2024-05-13` model, while faster than its predecessor, has the highest standard deviation (14.49), indicating more variability in performance.

### Effective Speed Comparison

While `gpt-4o-2024-05-13` is not as fast as `gpt-35-turbo-0125`, it demonstrates a significant improvement over `gpt-4-turbo-2024-04-09`. The mean tokens per second of `gpt-4o-2024-05-13` (63.32) is almost double that of `gpt-4-turbo-2024-04-09` (35.68), a speedup of roughly 1.8x, confirming that `gpt-4o-2024-05-13` is effectively faster and more efficient than its predecessor. Most importantly, `gpt-4o-2024-05-13` offers a performance level close to that of `gpt-35-turbo-0125` despite being a more capable model. This makes `gpt-4o-2024-05-13` a significant upgrade, effectively matching the speed of `gpt-35-turbo-0125` while providing advanced capabilities.

### Conclusion

My analysis shows that `gpt-35-turbo-0125` is the fastest model in terms of tokens per second, making it the most efficient for generating large volumes of text quickly. However, `gpt-4o-2024-05-13` is a significant improvement over `gpt-4-turbo-2024-04-09`, offering nearly the same level of performance as `gpt-35-turbo-0125` with more variability. This suggests that `gpt-4o-2024-05-13` is a valuable upgrade, effectively bridging the gap between the older and newer models by matching the speed of `gpt-35-turbo-0125` while providing enhanced intelligence.

### Appendix: Original Data

| Prompt Description | Model | Latency (ms) | Tokens | Tokens per Second |
|---|---|---|---|---|
| Write a 20-paragraph essay on cars | gpt-35-turbo-0125 | 21015 | 901 | 42.87 |
| | gpt-4o-2024-05-13 | 20919 | 1476 | 70.56 |
| Write a 20-paragraph essay on birds | gpt-35-turbo-0125 | 15473 | 1004 | 64.89 |
| | gpt-4o-2024-05-13 | 43814 | 1563 | 35.67 |
| Write a 20-paragraph essay on birds | gpt-35-turbo-0125 | 18126 | 1123 | 61.96 |
| | gpt-4o-2024-05-13 | 28712 | 1668 | 58.09 |
| Write a 20-paragraph essay on dogs | gpt-35-turbo-0125 | 14535 | 1120 | 77.05 |
| | gpt-4o-2024-05-13 | 21612 | 1688 | 78.10 |
| Write a 20-paragraph essay on cats | gpt-35-turbo-0125 | 15355 | 1161 | 75.61 |
| | gpt-4o-2024-05-13 | 20971 | 1444 | 68.86 |
| Write a 2000-word essay on planes | gpt-35-turbo-0125 | 14744 | 1132 | 76.78 |
| | gpt-4o-2024-05-13 | 32842 | 1748 | 53.22 |
| Write a 2000-word essay on trucks | gpt-35-turbo-0125 | 24725 | 1692 | 68.43 |
| | gpt-4o-2024-05-13 | 16565 | 1323 | 79.87 |
| Write a 2000-word essay on roads | gpt-35-turbo-0125 | 11863 | 890 | 75.02 |
| | gpt-4o-2024-05-13 | 28821 | 1793 | 62.21 |
| Write a 2000-word essay on roads | gpt-4-turbo-2024-04-09 | 26558 | 890 | 33.54 |
| | gpt-4-turbo-2024-04-09 | 23486 | 925 | 39.39 |
| Write a 20-paragraph essay on cars | gpt-4-turbo-2024-04-09 | 34428 | 1138 | 33.05 |
| | gpt-4-turbo-2024-04-09 | 35809 | 1193 | 33.41 |
| Write a 20-paragraph essay on dogs | gpt-4-turbo-2024-04-09 | 33506 | 1231 | 36.73 |
| | gpt-4-turbo-2024-04-09 | 31524 | 999 | 31.69 |
| Write a 20-paragraph essay on cats | gpt-4-turbo-2024-04-09 | 24770 | 1014 | 40.94 |
| | gpt-4-turbo-2024-04-09 | 30743 | 1131 | 36.78 |

Nice work.

Latency should be the network latency plus context-setup time, i.e. the time to the first token out, not OpenAI’s printout at the bottom of the playground. Otherwise it is less meaningful, especially when the runs produce different token lengths. Such a measure also gives a user-satisfaction index: how long they sit without seeing anything.

Latency is a time delay, like the time it takes for a signal to bounce off a satellite and be returned, or the time it takes for your cell phone to encode 4G packets.

The time the total job takes is also valuable, but here it is likewise obscured by varying token counts; you could set max_tokens to equalize the runs and save yourself some money as well.

I have a few times I’ve shown benchmarks, also asking for articles about kittens, and I delineate first-token latency from the streaming rate of the rest.
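Separating first-token latency from the streaming rate just takes timestamped stream events. A minimal sketch (the event format and the `split_latency` helper are assumptions for illustration; in practice you would record `time.monotonic()` as each chunk arrives from a streamed response, e.g. with `stream=True` in the OpenAI client):

```python
def split_latency(events):
    """events: list of (seconds_since_request, tokens_in_chunk) pairs in
    arrival order. Returns (time_to_first_token, streaming_tokens_per_sec)."""
    ttft = events[0][0]                       # time to first token
    total_tokens = sum(n for _, n in events)
    stream_window = events[-1][0] - ttft      # time spent streaming after the first chunk
    # tokens delivered after the first chunk, over the post-first-chunk window
    if stream_window > 0:
        rate = (total_tokens - events[0][1]) / stream_window
    else:
        rate = float("inf")
    return ttft, rate

# synthetic example: first chunk after 0.8 s, then steady streaming
events = [(0.8, 1), (1.0, 20), (1.2, 20), (1.4, 20), (1.6, 20), (1.8, 20)]
ttft, rate = split_latency(events)
print(ttft, rate)  # 0.8 s to first token; 100 tokens over 1.0 s of streaming
```

Reporting these two numbers separately keeps network/setup delay from polluting the per-token generation rate, which is the point of the reply above.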

Time of day also matters. Check out this site where jobs are continuously run:

One thing I thought to investigate, just out of curiosity, is how “bursty” models are: for example, the statistics of tokens being released ~15 at a time by the content filter or similar.
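Burstiness could be quantified from the same per-chunk records by tallying how many tokens arrive in each stream event. A sketch with made-up chunk sizes (a smooth stream would show sizes near 1; a filter buffering output would show clumps around a fixed size):

```python
from collections import Counter
import statistics

# hypothetical per-chunk token counts recorded from one streamed response
chunk_sizes = [1, 15, 15, 15, 14, 16, 15, 1, 15, 15]

dist = Counter(chunk_sizes)
print("chunk-size distribution:", dict(sorted(dist.items())))
print(f"mean chunk size: {statistics.mean(chunk_sizes):.1f}")
print(f"stdev:           {statistics.stdev(chunk_sizes):.1f}")
```

A distribution concentrated around ~15 tokens per event, as in this made-up sample, would be consistent with a content filter releasing output in fixed-size batches.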


These are great numbers. I wonder how much these results deviate when testing in different regions and at different times of day. When googling, I found reported speedups of GPT-4o over GPT-4 Turbo ranging from 1.5x to 5x.

I also wonder about the number of input tokens processed per second. I won’t be the only person for whom this is the bottleneck. I have not found any numbers about that.