reyemtm
71
First test of the API vs chat.openai.com - this is pretty much not usable for anything due to the response times. My limits are 90,000 TPM / 3,500 RPM.
Example 1 - prompt tokens: 425, completion tokens: 100, total: 525 - 8.9 seconds.
Example 2 - prompt tokens: 432, completion tokens: 100, total: 532 - 24.9 seconds.
It needs to be in the < 1s range to be usable for my use case - so pretty much an instant response. Is this possible?
FWIW, here is my GPT-4 token generation graph for the last 20 days.
I am at Tier 4.
About 10-11 days ago it started slowing down.
As an aside, and I know this doesn’t apply to many people, but the 32k context model for GPT-4 doesn’t appear to be impacted. However, the standard deviation of the 32k model is so large that it’s hard to gauge any real trend without smoothing the data.
I would imagine the OAI 32K model used for alpha testing is on its own node somewhere and is only affected by those with access, so its traffic is small relative to everything else.
Yeah, 32k has always been a bit erratic. Here is a plot going back 160 days to the present.
Something happened 125 days ago that improved the speed. Then there was a dip around 75 days ago that has since been rectified.
No idea why the performance is so variable, but overall it’s faster than the 8k GPT-4 version.
_j
76
You need 100 tokens of text returned to you in under a second?
The only way you’d be able to reliably do that is on your own A100/H100 cards and a tiny model.
reyemtm
77
What would an expected response time for 100 tokens be?
The 100 tokens would take about 5 seconds. See the roughly 20 tokens per second in my graph above. But it could take as long as 20 seconds just due to model performance variations (5 tokens per second).
For an instant-feeling response on the way out, maybe see if streaming is a fit for you.
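As a rough illustration (a minimal sketch using the openai Python package with the v1-style client; the model name and prompt are placeholders, not anything from this thread), streaming lets you start showing tokens as soon as the first one arrives instead of waiting for the whole completion:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask for a streamed response so tokens arrive as they are generated
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=100,
    stream=True,
)

# Print each chunk the moment it is received
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

Total generation time is unchanged, but the perceived wait drops to the time-to-first-token rather than the time for all 100 tokens.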
_j
79
We can measure 100 tokens. An average of 2.3 seconds is a good score.
For 3 trials of gpt-3.5-turbo-0613
| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | 0.375 | 0.782 | 0.534 |
| total response (s) | 2.0236 | 2.482 | 2.226 |
| total rate (tokens/s) | 40.29 | 49.417 | 45.198 |
| stream rate (tokens/s) | 57.7 | 60.4 | 58.550 |
| response tokens | 100 | 100 | 100.000 |
For 3 trials of ft:gpt-3.5-turbo-0613:xxxx::yyyyy
| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | 0.376 | 0.416 | 0.396 |
| total response (s) | 1.531 | 1.667 | 1.597 |
| total rate (tokens/s) | 59.988 | 65.317 | 62.680 |
| stream rate (tokens/s) | 77.8 | 85.7 | 82.500 |
| response tokens | 100 | 100 | 100.000 |
You can see the fine-tuned model is fastest (likely due to different concurrency or hardware), capable of about 82 tokens per second, but only after the network delay and model context loading of 10 prompt tokens, which takes about 0.4 seconds.
However, with the maximum wait found in a mere three trials of the normal model being 0.8 seconds just to receive the first token (at top_p = 0.01 for minimal sampler input), you don’t get many more tokens before 1.0 second has elapsed, even on a good run that doesn’t stretch into multiple seconds. Let’s get responses as fast as we can: 3 tokens in, 5 out:
For 50 trials of gpt-3.5-turbo-0613:
| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | 0.195 | 2.091 | 0.595 |
| total response (s) | 0.253 | 2.101 | 0.647 |
| total rate (tokens/s) | 2.38 | 19.763 | 9.751 |
| stream rate (tokens/s) | 21.4 | 666.7 | 141.630 |
| response tokens | 5 | 7 | 5.040 |
0.65 seconds to barely say “hi”. (The extreme “stream” rate over tokens 2-5 likely reflects a minuscule delay in opening the API network response to you while the model is already generating.)
5-token cat chat at temperature = 1.9
Cats, miracle piecesCats are fascinating animalsCats are mesmerizingCats,
those mysteriousCats, the curiousCats are itdeCats are captivating creaturesCats
are majestic animalsCats are incredibly fascinatingCats have the abilityCats
are fluffy andCats wander the worldCats have occupied aVAMarketing-winder’sCats
are perhaps theCats, beloved creaturesCats are charismatic andCats, though
muchCats are fascinating creaturesCats, known allCats, with theirCats, also
knownCats are fascinating creaturesCats are beloved creaturesCats, notorious
forCats: The UltimateCats, scientifically knownThere is something inherently
mysticalCats, also knownCats, one ofCats. MysteriousCats have drawn peopleCats:
Understanding OurOur beloved feline companionsCats, also knownCats are
amazing creaturesCats, wandering mystCats, often regardedCats are beloved
companionsCats are small mammalsCats, one ofCats are small,Cats have
attained aCats, longtime popularCats are fascinating creaturesCats, also
knownCats are enigmaticCats, domesticsCats are prevalent inCats, little
homem
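For anyone who wants to reproduce numbers like the ones in the tables above, here is a rough sketch of the kind of measurement involved (my own reconstruction, not the exact script behind those tables; it assumes the openai Python package with the v1-style client): stream the response, treat the time to the first token as the latency, and divide the remaining chunks by the remaining time to get a stream rate.

```python
import time
from openai import OpenAI

client = OpenAI()

def time_completion(model: str, prompt: str, max_tokens: int = 100):
    """Stream one completion and report latency, total time, and token rates."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0  # each content chunk is roughly one token

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        top_p=0.01,  # near-greedy sampling, as in the trials above
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1
    end = time.perf_counter()

    latency = first_token_at - start                               # time to first token
    total = end - start                                            # total response time
    total_rate = chunks / total                                    # tokens/s including the wait
    stream_rate = (chunks - 1) / max(end - first_token_at, 1e-9)   # tokens 2..n per second
    return latency, total, total_rate, stream_rate

print(time_completion("gpt-3.5-turbo", "Write about cats.", max_tokens=100))
```

Run it in a loop over many trials and take the min/max/average to get a table like the ones above.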
My numbers are for GPT-4. GPT-3.5-Turbo will be much faster.
Thanks for this. I’m building a very simple app that generates a personal value statement with gpt-3.5-turbo (prompt + reply don’t take more than 700 tokens), and I couldn’t figure out why everything worked smoothly with my API key, but with my client’s API key the script takes very long to execute (10-20 seconds) and often times out.
I’ll post the video I made as proof anyway, but I guess telling my client to buy at least $50 of credits should fix the problem.
EDIT
TL;DR: raise your rate limits by buying credits.
reyemtm
82
Thanks for the responses. My use case - a user asks a question, the AI generates SQL from the question, and that SQL then queries the db - isn’t feasible in a search bar. It sounds like 10 seconds would be about right for this flow, which is not a reasonable user experience.
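For concreteness, the flow is roughly this two-step round trip (a hypothetical sketch; the schema, prompt, and sqlite3 usage here are purely illustrative), so the model’s generation time lands directly on the search bar:

```python
import sqlite3
from openai import OpenAI

client = OpenAI()

def search(question: str, db_path: str = "app.db"):
    # Step 1: ask the model to turn the natural-language question into SQL
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Translate the user's question into a single SQLite "
                        "SELECT statement against the table products(name, price). "
                        "Return only the SQL."},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    sql = completion.choices[0].message.content

    # Step 2: run the generated SQL against the database
    # (never execute model-generated SQL against a writable production db)
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()

# The user sees nothing until both the model call and the query finish,
# so a 5-20 second completion is the entire perceived search latency.
print(search("Which products cost less than 10 dollars?"))
```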
Is it normal to wait ~3-4 minutes for "total_tokens": 4148 with gpt-3.5-turbo-16k?
_j
84
I can get about 110,000 gpt-3.5-turbo-16k tokens within 5 minutes … when asking for the same question to be answered 50 times.
The normal production rate of -16k has tended to be around 30-40 tokens per second. That’s roughly 2,000 per minute. If your output seems to be chopped to half that or less, you can check your account’s “rate limits” page, where you can also see what trust-level tier you are in. Tier 1: slow models too.
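For what it’s worth, the 110,000-tokens-in-5-minutes figure above comes from running many requests concurrently rather than from one fast stream. A rough sketch of that kind of fan-out (my own illustration, assuming the AsyncOpenAI client from the openai Python package; the question and max_tokens are placeholders):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def one_answer(question: str) -> int:
    """Run a single -16k completion and return how many tokens it produced."""
    resp = await client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": question}],
        max_tokens=2000,
    )
    return resp.usage.completion_tokens

async def main():
    # 50 copies of the same question, sent concurrently; aggregate throughput
    # is then bounded by your tier's TPM/RPM limits, not by per-stream speed.
    question = "Explain how API rate limits work."
    counts = await asyncio.gather(*(one_answer(question) for _ in range(50)))
    print(f"total completion tokens: {sum(counts)}")

asyncio.run(main())
```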
Thanks for the answer. Now I see why it is slower than before: I’m on Tier 1…
The only way to get faster speed is by reaching “Tier 2: $50 paid and 7+ days since first successful payment”.
Waiting 7 days?
_j
86
Others have reported getting their speed back much faster.
The waiting period is likely there to establish that you can be trusted to put in and spend the next tier’s larger amounts of money.
The claim that getting to Tier 2 is the answer to huge delays makes common sense, but I am on Tier 1 and I’ve reviewed the rate limits for that tier and GPT-3.5-Turbo. It is totally impossible that my chatbot is getting anywhere near those limits, yet I’m seeing delays of 30 seconds and up, occasionally more than 100 seconds. If needing Tier 2 is the answer, then it certainly isn’t explained by the OpenAI documentation!
_j
88
They aren’t going to directly write “We took a whole bunch of accounts that were low-value and put them into an API filter buffer that simulates slow output to decrease their satisfaction. Goal: get them to pay more to return to normal.”
OpenAI rewrote the text on the “rate limits” page to:
“Organizations in higher tiers also get access to lower latency models.”
Previously: “As your usage tier increases, we may also move your account onto lower latency models behind the scenes.”
Lower-latency “models” makes no sense. Why leave your generation on an overloaded, time-sliced server when it is more efficient to generate 100 tokens a second and then free the processing unit for another user? The only reason would be that there is no way for them to serve the current customer load without spinning up slow, energy-inefficient GPU instances of older technology.
I guess that’s the way it is. I, for one, would prefer they be more transparent about it. I can handle the truth.
Let me share an interesting observation with you.
Until I opened my API account one week ago, my ChatGPT account was very fast. But from the moment I opened my API account, ChatGPT became painfully slow. The API is also painfully slow. The funny thing is that my wife’s ChatGPT account has been fast the whole time.
Conclusion:
It’s part of their business plan to slow it down for you to make you pay more money. They will fail. I’m going to cancel my API account.