Is gpt-4-0125-preview slower than gpt-4-0613?

Hi all,

Recently I’ve been comparing GPT-4 and the new preview turbo model, and in a small-scale test I’ve found that the turbo model is noticeably slower than gpt-4-0613 (~9 tokens per second for the turbo model vs. ~12 for 0613). I’m assuming this has to do with server load? Besides one dead thread here, I’m not finding much information on the issue.

Perhaps a relevant detail is that I am querying in JSON mode.
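
For reference, here is a minimal sketch of the kind of side-by-side timing I mean, assuming the openai v1 Python SDK and an OPENAI_API_KEY in the environment; the prompt and max_tokens are placeholders, and JSON mode is only set for the turbo model since gpt-4-0613 does not support response_format.

```python
# Rough tokens-per-second comparison; a sketch, not a rigorous benchmark.
import time
from openai import OpenAI

client = OpenAI()

MESSAGES = [
    # JSON mode requires the word "JSON" to appear somewhere in the messages.
    {"role": "system", "content": "Respond in JSON."},
    {"role": "user", "content": "List five benefits of digital transformation."},
]

def tokens_per_second(model: str, json_mode: bool) -> float:
    kwargs = {"response_format": {"type": "json_object"}} if json_mode else {}
    start = time.monotonic()
    resp = client.chat.completions.create(
        model=model, messages=MESSAGES, max_tokens=256, temperature=0, **kwargs
    )
    elapsed = time.monotonic() - start
    return resp.usage.completion_tokens / elapsed

print("gpt-4-0613:         %.1f tps" % tokens_per_second("gpt-4-0613", json_mode=False))
print("gpt-4-0125-preview: %.1f tps" % tokens_per_second("gpt-4-0125-preview", json_mode=True))
```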


A quick speed test to measure latency (the 1-token response time), plus 128- and 512-token response times and the token rate over each total run, roughly like the sketch below.
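
Something like this (assuming the openai v1 Python SDK; the prompt is just an example, and the reported rates include request latency):

```python
# Time a 1-token call (latency), then 128- and 512-token completions, and
# report the rate over each total request time.
import time
from openai import OpenAI

client = OpenAI()
MESSAGES = [{"role": "user", "content": "Write an article about digital transformation."}]

def timed_run(model: str, max_tokens: int) -> None:
    start = time.monotonic()
    resp = client.chat.completions.create(
        model=model, messages=MESSAGES, max_tokens=max_tokens, temperature=0
    )
    elapsed = time.monotonic() - start
    produced = resp.usage.completion_tokens
    print(f"[{produced} tokens in {elapsed:.1f}s. {produced / elapsed:.1f} tps]")

for model in ("gpt-4-0613", "gpt-4-turbo-preview"):
    print(f"---{model}---")
    for n in (1, 128, 512):
        timed_run(model, n)
```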

My existing speed-test document-creation prompt now gets refused. Thanks, OpenAI.

—gpt-4-0613—
Sorry
[1 tokens in 0.5s. 1.9 tps]
Sorry, but I can’t assist with that.
[10 tokens in 1.1s. 8.8 tps]
I’m sorry, but I can’t assist with that.
[12 tokens in 1.3s. 9.4 tps]

So more tokens wasted on prompting obedience…


—gpt-4-0613—
Title
[1 tokens in 1.3s. 0.8 tps]
Title: Digital Transformation: A Comprehensive Guide

Introduction

Digital tran
[128 tokens in 9.0s. 14.2 tps]
Title: Digital Transformation: A Comprehensive Exploration

Introduction

Digita
[512 tokens in 59.7s. 8.6 tps]


—gpt-4-turbo-preview—
#
[1 tokens in 2.8s. 0.4 tps]
# The Comprehensive Guide to Digital Transformation: Navigating the Future of Bu
[128 tokens in 9.9s. 13.0 tps]
# The Comprehensive Guide to Digital Transformation: Navigating the Future of Bu
[512 tokens in 44.4s. 11.5 tps]

So the two are somewhat comparable. Speed also depends on how the number of model instances is balanced against the number of users calling them (catch a beta on release day and you see what it can really do). It would take a lot of testing to find where production throughput tops out, as a real measure of the model's generation rate and the best hardware it is deployed on.

(My scripts for running more extensive tests are on a PC that was killed by power surges.)


We were also seeing possible problems with streaming on 0125 versus 1106. I thought it might be because it's a recent preview release, so the servers for that model could be overloaded.

Has anyone else run into this?

How long is your context?

We've observed that longer contexts sometimes take more time to first token.
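
If it helps, here's a minimal sketch for measuring time to first token with streaming (assuming the openai v1 Python SDK; the model name and the padded prompt are just placeholders):

```python
# Measure time to first streamed content token (TTFT); longer prompts tend to
# push this number up.
import time
from openai import OpenAI

client = OpenAI()

def time_to_first_token(model: str, prompt: str) -> float:
    start = time.monotonic()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=32,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.monotonic() - start  # first content token arrived
    return time.monotonic() - start

# Compare a short prompt against an artificially padded one.
print("short:", round(time_to_first_token("gpt-4-0125-preview", "Say hello."), 2), "s")
print("long: ", round(time_to_first_token("gpt-4-0125-preview", "Summarize: " + "lorem ipsum " * 2000), 2), "s")
```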

I’ve got some large-token prompts that are taking 1 to 5 minutes on 0125 … I’m thinking it’s the load… I’m Tier 5 for rate-limiting…

Recently there have been stability issues with the API, but I would assume the preview model has fewer resources allocated to it.
