Is there any benchmarking of how long GPT-4 takes to respond (on average) given the number of input and output tokens?
For example, it would have been awesome to see a matrix of response time for each pair of (# input tokens, # output tokens).
In general, response time seems to be governed by the number of output tokens, but I'm interested to see whether it also scales linearly with the number of input tokens for a given output size.
It has been previously observed that response time increases with the max_tokens param. However, it's hard to benchmark because it also depends on the number of requests hitting the API, meaning the same request will take less time on a day with relatively low traffic and vice versa.
Also, given that the number of customers is always increasing and that OpenAI keeps upgrading their infra, benchmarks wouldn't mean much for development beyond tracking API performance over time.
Tokens are tokens. It doesn't matter whether they are "input" or something the AI just wrote. Having more of them in the context increases the computation required for each next token, which must attend over all previous tokens through the masked attention layers that scan them to unravel cataphors and relevance. So by token 16,383 you are getting up there in computation cost.
It would be pretty easy to benchmark: increase the knowledge augmentation in steps of 500 tokens and see how long it takes to follow up with 200 more, up to the maximum context of the model.
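A minimal sketch of that sweep, assuming the openai Python package (>= 1.0 client) with OPENAI_API_KEY set; the repeated filler word is only a crude stand-in for real knowledge augmentation, and wall-clock timings will vary with API load:

```python
# Grow the "knowledge" padding in ~500-token steps and time a ~200-token
# follow-up at each size. The usage field in the response gives the
# authoritative prompt token count; the filler word count is approximate.
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment

for n_padding in range(0, 7001, 500):
    padding = "context " * n_padding  # roughly one token per repeated word
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Reference material: " + padding},
            {"role": "user", "content": "Write a short story of about 150 words."},
        ],
        max_tokens=200,
    )
    elapsed = time.perf_counter() - start
    print(f"prompt_tokens={resp.usage.prompt_tokens:>5}  "
          f"completion_tokens={resp.usage.completion_tokens:>4}  "
          f"latency={elapsed:6.2f}s")
```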
However, empirical data about something you are powerless to affect will only crystallize your frustration.
I was mostly interested in whether I could save response time by reducing the amount of context in the user/system message.
A quick benchmark I just ran suggests that's not really the case: I asked it to recite the Star-Spangled Banner, each time with a user/system message of roughly 10/1000/4000/7000 tokens, and there was no observable difference in response time.
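For anyone who wants to reproduce it, the test was roughly along these lines (assuming the openai Python package, >= 1.0 client; the filler padding is only an approximate token count):

```python
# Pad the system message to roughly 10 / 1000 / 4000 / 7000 tokens of filler
# and time the same short request each time.
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment

for size in (10, 1000, 4000, 7000):
    system = "You are a helpful assistant. " + "filler " * size  # ~1 token/word
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": "Recite the first verse of The Star-Spangled Banner."},
        ],
        max_tokens=256,
    )
    print(f"~{size:>4} filler tokens -> {time.perf_counter() - start:.2f}s "
          f"(prompt_tokens={resp.usage.prompt_tokens})")
```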