You can use these values to approximate the response time. E.g. for a request to Azure gpt-3.5-turbo with 600 output tokens, the latency will be roughly 34 ms × 600 = 20.4 seconds.
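As a back-of-envelope illustration, here's a minimal Python sketch of that calculation; the 34 ms/token figure is just the measured value quoted above for Azure gpt-3.5-turbo, so treat it as a snapshot rather than a constant:

```python
# Rough response-time estimate: per-output-token latency x number of output tokens.
MS_PER_OUTPUT_TOKEN = 34  # measured snapshot for Azure gpt-3.5-turbo (see above)

def estimate_latency_seconds(output_tokens: int,
                             ms_per_token: float = MS_PER_OUTPUT_TOKEN) -> float:
    """Approximate end-to-end response time for a completion."""
    return output_tokens * ms_per_token / 1000

print(estimate_latency_seconds(600))  # -> 20.4
```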
I’ll spare the full details of the experiments; you can see these in my blog post about GPT response times.
Correct. As far as I understand, the models should be identical. (I am not sure about the safety layer or other things on top, which may be different, though.)
I’m not sure what you are asking here. If the input tokens say ‘please say the word “hello”’, then the response back will likely be a single token that corresponds to the word “hello”.
The output is not “linked” to the input just by the number of tokens used; it depends on the contents of those tokens.
I ran it just once; the per-token number is very robust because the output contains 1,000+ tokens. I’ve also run many repetitions, but the results don’t change.
IIRC about 10 input tokens, but latency does not depend on input token count - see the last paragraph in this blog post.
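For anyone who wants to sanity-check that claim themselves, here is a minimal sketch of the kind of comparison involved (not the exact experiment from the blog post); it assumes the openai Python SDK with an OPENAI_API_KEY set in the environment:

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def timed_completion(prompt: str, model: str = "gpt-3.5-turbo") -> float:
    """Return wall-clock seconds for a single chat completion."""
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,  # pin the output length so only the input size varies
    )
    return time.perf_counter() - start

# Pad the same request to different input lengths; each repeated word is
# roughly one token. If latency depended on input token count, the padded
# versions would be noticeably slower.
for padding in (0, 1000, 4000):
    prompt = "Ignore the filler words. " + "word " * padding + "Just say hello."
    print(f"{padding:>5} padding words: {timed_completion(prompt):.2f} s")
```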
In my application, for each query I need to feed around 3,000–5,000 tokens to GPT-4 Turbo, and it takes around 9.1 s to get the result, with around 100 output tokens.
Actually, I read your blog post. So I think my latency mostly comes from the “const.” component in your formula, because the number of input tokens is large in my case.
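If it helps, here is a quick back-of-envelope split under the latency ≈ const + ms_per_token × output_tokens model from the blog post; the per-token figures below are assumed placeholders, not measured GPT-4 Turbo values:

```python
# Split a measured latency into the fixed "const." overhead and the
# per-output-token part, following: latency ~= const + ms_per_token * tokens.

def const_component(total_seconds: float, output_tokens: int,
                    ms_per_token: float) -> float:
    """Estimate the fixed overhead implied by an assumed per-token latency."""
    return total_seconds - output_tokens * ms_per_token / 1000

# Numbers from my post above: ~9.1 s total for ~100 output tokens.
for assumed_ms in (30, 50, 70):  # hypothetical per-token latencies, not measured
    overhead = const_component(9.1, 100, assumed_ms)
    print(f"{assumed_ms} ms/token -> ~{overhead:.1f} s fixed overhead")
```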