Is there any benchmarking of the time it takes for GPT-4 to respond (on average) given the input and output # tokens?
For example, it would have been awesome to see a matrix of response time given pair of (#input tokens, #output tokens).
In general, it seems response time is governed by #of output tokens, but I am interested to see if this also scales linearly with #of input tokens for a given output size
It has been previously observed that response time increases with the
max_tokens parameter. However, it's hard to benchmark because it also depends on how many requests are hitting the API, meaning the same request will take less time on a day with relatively low traffic, and vice versa.
Also, given that the number of customers is always growing and that OpenAI keeps upgrading their infra, benchmarks wouldn't mean much for development, beyond tracking API performance over time.
Tokens are tokens. It doesn't matter whether they are "input" or something the AI just wrote. Having more of them in the context increases the computation required for each next token, which must attend to all previous tokens through the masked attention layers that scan them to resolve references and relevance. So by token 16383, you are getting up there in computation cost.
It would be pretty easy to benchmark, increase the knowledge augmentation in steps of 500 tokens and see how long it takes to follow up with 200 more, up to the maximum context of the model.
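A minimal sketch of that benchmark loop. The `complete` callable and the word-based filler are my stand-ins, not anything from the API: in practice you'd pass a thin wrapper around `openai.ChatCompletion.create` with `max_tokens=200`, and build the filler from real tokenized text.

```python
import time

def benchmark_context_sizes(complete, step=500, max_ctx=8000):
    """Grow the prompt in steps of roughly `step` tokens and time a
    fixed-size follow-up completion at each size.

    `complete` is a hypothetical callable (prompt -> response); swap in
    a real API call to run this against the model.
    """
    results = []
    for n in range(step, max_ctx + 1, step):
        prompt = "lorem " * n  # crude stand-in for ~n tokens of context
        start = time.monotonic()
        complete(prompt)
        results.append((n, time.monotonic() - start))
    return results
```

Plotting the resulting (context size, latency) pairs would show whether input length matters at all for a fixed output size.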
However, empirical data about something you are powerless to affect will only crystallize your frustration.
I was mostly interested in whether I can save response time by reducing the amount of context in the user/system message.
A quick benchmark I did just now showed that's not really the case: I told it to recite the Star-Spangled Banner, each time with a user/system message of 10/1000/4000/7000 tokens. There was no observable difference in response time.
On the other hand, when I told it to recite the Star-Spangled Banner X times, response time was almost exactly linear in X.
I also played around with the max_tokens parameter for the same user/system message and didn't see any observable difference in response time.
Were you streaming the tokens? @dan.raviv
Output tokens are the dominant driver of overall response latency. For metrics, I really only look at generated output tokens per second.
This is largely invariant to how many tokens are in the input. The model does all its "work" generating output.
No, I wasn’t streaming them, I was using the vanilla:
```python
result = openai.ChatCompletion.create(
    model="gpt-4", messages=messages, temperature=0, max_tokens=1000
)
```
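For comparison, with `stream=True` the call returns an iterator of chunks, which lets you separate time-to-first-token from total generation time. A minimal sketch; the helper name and return shape are my own, and it works on any chunk iterable, including the generator the streaming API returns:

```python
import time

def time_stream(chunks):
    """Consume a streamed response and measure time-to-first-chunk
    and total time. `chunks` is any iterable of response chunks,
    e.g. openai.ChatCompletion.create(..., stream=True).
    """
    start = time.monotonic()
    first = None
    n = 0
    for _ in chunks:
        if first is None:
            first = time.monotonic() - start
        n += 1
    total = time.monotonic() - start
    return {"chunks": n, "time_to_first": first, "total": total}
```

If input length affected anything, you'd expect it to show up mainly in the time-to-first-chunk figure rather than the per-token rate afterward.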