Recently, I’ve been comparing GPT-4 and the new turbo preview model and, in a small-scale test, I’ve found that the turbo model is noticeably slower than GPT-4-0613 (~12 tokens per second for 0613 vs. ~9 for turbo). I am assuming this has to do with server load? Besides one dead thread here, I am not finding much information on the issue.
Perhaps a relevant detail is that I am querying in JSON mode.
Here’s a quick speed test: measure latency (the 1-token response time), then the 128-token and 512-token response times, and the token rate over each total time.
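For reference, here’s a minimal sketch of the kind of timing harness I mean. The names and structure are illustrative, not my exact script; it assumes the `openai` Python package (>= 1.0) and an `OPENAI_API_KEY` in the environment, and it counts streamed content chunks as a rough proxy for tokens:

```python
import time


def fmt_result(tokens: int, elapsed: float) -> str:
    """Format a result line like the ones below: [128 tokens in 9.0s. 14.2 tps]."""
    return f"[{tokens} tokens in {elapsed:.1f}s. {tokens / elapsed:.1f} tps]"


def timed_completion(model: str, prompt: str, max_tokens: int) -> str:
    """Request up to max_tokens from `model`, timing the full streamed response.

    Hypothetical harness: chunk count is only an approximation of token count.
    """
    from openai import OpenAI  # imported here so fmt_result works offline

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    start = time.monotonic()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    n_tokens = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            n_tokens += 1  # one content chunk ~ one token, roughly
    return fmt_result(n_tokens, time.monotonic() - start)
```

Run it once each with `max_tokens` of 1, 128, and 512 per model to get numbers in the format below.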
My existing speed-test prompt, a document-creation task, is now refused. Thanks, OpenAI.
—gpt-4-0613—
Sorry
[1 tokens in 0.5s. 1.9 tps]
Sorry, but I can’t assist with that.
[10 tokens in 1.1s. 8.8 tps]
I’m sorry, but I can’t assist with that.
[12 tokens in 1.3s. 9.4 tps]
So more tokens wasted on prompting obedience…
—gpt-4-0613—
Title [1 tokens in 1.3s. 0.8 tps]
Title: Digital Transformation: A Comprehensive Guide
Introduction
Digital tran [128 tokens in 9.0s. 14.2 tps]
Title: Digital Transformation: A Comprehensive Exploration
Introduction
Digita [512 tokens in 59.7s. 8.6 tps]
—gpt-4-turbo-preview—
# [1 tokens in 2.8s. 0.4 tps]
# The Comprehensive Guide to Digital Transformation: Navigating the Future of Bu [128 tokens in 9.9s. 13.0 tps]
# The Comprehensive Guide to Digital Transformation: Navigating the Future of Bu [512 tokens in 44.4s. 11.5 tps]
So the two are somewhat comparable. Speed also depends on how OpenAI balances the number of model instances against the number of users calling them (catch a beta on release day and you see what the model can really do). It would take a lot of testing to find where production throughput maxes out, as a real measure of the model’s production rate on the best machines it is deployed on.
(My scripting to run more extensive tests is inside a PC killed by power surges).
We were also seeing potential problems with streaming on 0125 versus 1106. I thought it might be because 0125 is a recent preview release, so the servers for that model could be overloaded.
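For anyone else debugging streaming, one illustrative way to tell “uniformly slow” apart from “server stalls” is to record each chunk’s arrival time (e.g. `time.monotonic()` inside the stream loop) and look at the gaps. This is a hypothetical helper of my own, not anything from the API:

```python
def gap_stats(arrival_times: list[float]) -> tuple[float, float]:
    """Given monotonic arrival timestamps of streamed chunks, return
    (mean inter-chunk gap, max inter-chunk gap) in seconds.

    A max gap that dwarfs the mean suggests server-side stalls (bursty
    delivery under load) rather than a uniformly slow model.
    """
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    if not gaps:
        return (0.0, 0.0)
    return (sum(gaps) / len(gaps), max(gaps))
```

If 0125 shows a similar mean gap to 1106 but much larger max gaps, that would point at overloaded preview servers rather than the model itself.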