If we’re benchmarking tokens per second, we might as well do the same for humans.
So I did some digging and found the study “Oral Reading Fluency of College Graduates: Toward a Deeper Understanding of College Ready Fluency”, which suggests that college graduates read somewhere between 138 and 287 words per minute.
Converted to words per second, that’s approximately 2.3 to 4.78 words per second. Assuming one token is equivalent to 0.75 words (i.e., roughly 1.33 tokens per word), that reading rate works out to approximately 3.1 to 6.4 tokens per second.
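For concreteness, here’s that conversion as a quick back-of-the-envelope sketch (the 0.75-words-per-token figure is the usual rule of thumb, not an exact constant):

```python
# Convert a reading rate in words per minute to tokens per second,
# assuming 1 token ~= 0.75 words (i.e., ~1.33 tokens per word).
WORDS_PER_TOKEN = 0.75

def wpm_to_tokens_per_second(words_per_minute: float) -> float:
    words_per_second = words_per_minute / 60
    return words_per_second / WORDS_PER_TOKEN

print(wpm_to_tokens_per_second(138))  # ~3.07 tokens/s, slow end
print(wpm_to_tokens_per_second(287))  # ~6.38 tokens/s, fast end
```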
But the impressive thing is that this isn’t GPT-4’s reading rate, it’s its writing rate! Its reading rate (prompt ingestion) appears to be nearly instant, I’m guessing 5,000 tokens per second or so. So it crushes humans there.
But in responding, i.e., writing, it beats most humans hands down. I’m guessing my response here was written at maybe 1 or 2 tokens per second at best!
The problem is one of latency – there may be post-processing (for example, moderation!) that needs to happen before the text can be presented to the user, and that post-processing might not even be able to start until the full generation is complete.
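To make the shape of that pipeline concrete, here’s a minimal sketch assuming the 0.x-style `openai` Python package; the function name and placeholder text are my own:

```python
import openai  # assumes the 0.x-style openai Python package

def generate_then_moderate(prompt: str) -> str:
    # Post-processing (moderation here) cannot start until the full
    # completion exists, so the user waits for generation time plus
    # moderation time before seeing anything at all.
    completion = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    text = completion["choices"][0]["message"]["content"]

    moderation = openai.Moderation.create(input=text)
    if moderation["results"][0]["flagged"]:
        return "[response withheld by moderation]"
    return text
```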
I’m not saying faster isn’t better (it definitely is); I’m just saying that comparing GPT’s output tokens per second against human reading speed may be relevant when deciding when to add more server capacity.
Mixing output rate and latency is not helpful in this case: the output rate is relevant when benchmarking the model, while the latency is relevant when benchmarking the networking.
In my primary use, humans are in the loop and read/edit the AI response. The AI does 90% of the drudgery work. Here there is no stringent latency requirement, since it could take several minutes before the human even gets to the response. So, to @N2U’s point, the output rate already being faster than a human can read applies here.
However, with AI agents and machine-to-machine interactions, I can really feel the latency kick in. So far, nothing “mission critical” is being done here, but I can see that if you have an app and an impatient user on the other end, this could be disastrous for your app. I’m curious how many people here are in this situation, and what your use case is. And is it “mission critical”, or more of a hassle?
In your situation with an end user waiting, I would stream the answer back to them. The first token latency is a few seconds, just like it is in the Playground.
Above we were mostly talking about last-token latency and/or token output rates per second, not first-token streaming latency, which is probably the most interesting performance parameter for your application.
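For reference, here’s a minimal sketch of measuring first-token latency while streaming (again assuming the 0.x-style `openai` package; the prompt and timing code are just illustrative):

```python
import time
import openai

start = time.monotonic()
first_token_at = None

# stream=True yields chunks as they are generated, so the user can
# start reading long before the last token arrives.
for chunk in openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain streaming latency."}],
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        if first_token_at is None:
            first_token_at = time.monotonic()
            print(f"first token after {first_token_at - start:.2f}s")
        print(delta["content"], end="", flush=True)
```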
I already said that I must post-process the data before I can show it.
(One common case of this is moderation solutions.)
So, no, I in fact cannot stream the data, as I already said; the token generation rate directly impacts the latency-to-result for the user, as I already said; so generation significantly faster than reading speed is both useful and important, as I already said.
Another case where generation faster than human reading speed matters is when generating steps to execute, such as in plugins.
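To make that concrete: a machine consumer often can’t act on a partial response at all. A hypothetical sketch (not tied to any real plugin API):

```python
import json

partial = '{"steps": [{"tool": "search", "query": "weather'
complete = '{"steps": [{"tool": "search", "query": "weather in Oslo"}]}'

# A streaming human reader gets value from the partial text, but a
# machine executor cannot parse it until generation is finished.
try:
    json.loads(partial)
except json.JSONDecodeError:
    print("partial plan: nothing to execute yet")

plan = json.loads(complete)
print("complete plan, first tool:", plan["steps"][0]["tool"])
```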
So, I stand by the assessment that GPT-4’s generation speed is currently a significant impediment.
So I could interpret this to mean: in your case, stream to your users, then send the full result to moderation (after it’s done), and if it comes back flagged, put the user on a cooldown, or something similar, and review what happened.
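Something like this minimal sketch (the cooldown helper is hypothetical; the streaming and moderation calls follow the 0.x-style `openai` package):

```python
import openai  # assumes the 0.x-style openai Python package

def put_on_cooldown(user_id: str) -> None:
    # Hypothetical helper: rate-limit or review a user whose
    # output was flagged after the fact.
    print(f"\nputting {user_id} on cooldown for review")

def stream_then_moderate(user_id: str, prompt: str) -> None:
    chunks = []
    # Stream the response to the user immediately...
    for chunk in openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    ):
        content = chunk["choices"][0]["delta"].get("content", "")
        if content:
            chunks.append(content)
            print(content, end="", flush=True)  # deliver to the user

    # ...then moderate the full text after generation completes.
    result = openai.Moderation.create(input="".join(chunks))
    if result["results"][0]["flagged"]:
        put_on_cooldown(user_id)
```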
I think you are taking something too literally, but feel free to share where you’re required to moderate everything before it goes out. If that were the case, the streaming feature itself would be in violation of that policy, which is a contradiction.
But if you are referring to some other moderation or post-processing step, then yeah, as a batch, you aren’t getting past last-token latency. And no AI model will have great, responsive last-token latency, at least for the next 6 months, until sub-quadratic models that actually work and are good (running on smoking-hot FPGAs or ASICs) come out.
I understand the argument: end-user latency is a combination of multiple things, and I’m sure people are aware of this. Benchmarking one component separately still provides relevant information that can be used by other people.
@jwatte I’m sorry if my information about human reading speed isn’t relevant to you; it may be for other people. If you’re having issues with latency, I’d advise you to create a separate topic about it, as that will aid other people in helping you. This thread is for the (very small) early rollout of GPT-4-32K.
I’m expecting most people will be using the 32k model while streaming to the end user; running an agent on GPT-4-32k is just too expensive.