Is there any benchmarking of the time it takes for GPT-4 to respond (on average) given the input and output # tokens?
For example, it would have been awesome to see a matrix of response time given pair of (#input tokens, #output tokens).
In general, it seems response time is governed by #of output tokens, but I am interested to see if this also scales linearly with #of input tokens for a given output size
It has been previously observed that response time increases with the
max_tokens parameter. However, it's hard to benchmark because it also depends on how many requests are hitting the API, meaning the same request will take less time on a day with relatively low traffic, and vice versa.
Also, given that the number of customers is always growing and that OpenAI keeps upgrading their infra, benchmarks wouldn't mean much for development, beyond tracking API performance over time.
Tokens are tokens. It doesn't matter whether they are "input" or something the AI just wrote. Having more of them in the context increases the computation required for each next token, which must attend to all previous tokens through the masked attention layers that scan them to resolve references and relevance. So by token 16383, you are getting up there in computation cost.
It would be pretty easy to benchmark, increase the knowledge augmentation in steps of 500 tokens and see how long it takes to follow up with 200 more, up to the maximum context of the model.
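A minimal sketch of that benchmark loop. The `complete` callable and the word-based filler are my stand-ins, not anything from the API: in practice you'd pass a thin wrapper around `openai.ChatCompletion.create` with `max_tokens=200`, and build the filler from real tokenized text.

```python
import time

def benchmark_context_sizes(complete, step=500, max_ctx=8000):
    """Grow the prompt in steps of roughly `step` tokens and time a
    fixed-size follow-up completion at each size.

    `complete` is a hypothetical callable (prompt -> response); swap in
    a real API call to run this against the model.
    """
    results = []
    for n in range(step, max_ctx + 1, step):
        prompt = "lorem " * n  # crude stand-in for ~n tokens of context
        start = time.monotonic()
        complete(prompt)
        results.append((n, time.monotonic() - start))
    return results
```

Plotting the resulting (context size, latency) pairs would show whether input length matters at all for a fixed output size.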
However, empirical data about something you are powerless to affect will only crystallize your frustration.
I was mostly interested in whether I can save response time by reducing the amount of context in the user/system message.
A quick benchmark I did just now showed that's not really the case: I told it to recite the Star-Spangled Banner, each time with a user/system message of 10/1000/4000/7000 tokens. There was no observable difference in response time.
On the other hand, when I told it to recite the Star-Spangled Banner X times, response time was almost exactly linear in X.
I also played around with the max_tokens parameter for the same user/system message and didn't see any observable difference in response time.
Were you streaming the tokens? @dan.raviv
Output tokens are the dominant driver of overall response latency. For metrics, I really only look at generated output tokens per second.
This is largely invariant to how many tokens are in the input. The model does all its "work" generating output.
No, I wasn’t streaming them, I was using the vanilla:
```python
result = openai.ChatCompletion.create(
    model="gpt-4", messages=messages, temperature=0, max_tokens=1000
)
```
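For comparison, with `stream=True` the call returns an iterator of chunks, which lets you separate time-to-first-token from total generation time. A minimal sketch; the helper name and return shape are my own, and it works on any chunk iterable, including the generator the streaming API returns:

```python
import time

def time_stream(chunks):
    """Consume a streamed response and measure time-to-first-chunk
    and total time. `chunks` is any iterable of response chunks,
    e.g. openai.ChatCompletion.create(..., stream=True).
    """
    start = time.monotonic()
    first = None
    n = 0
    for _ in chunks:
        if first is None:
            first = time.monotonic() - start
        n += 1
    total = time.monotonic() - start
    return {"chunks": n, "time_to_first": first, "total": total}
```

If input length affected anything, you'd expect it to show up mainly in the time-to-first-chunk figure rather than the per-token rate afterward.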