reyemtm
77
What would an expected response time for 100 tokens be?
100 tokens would take about 5 seconds - see the rough 20 tokens per second in my graph above. But it could take as long as 20 seconds due to model performance variations (down to 5 tokens per second).
If you need output to start arriving immediately, see if streaming is a fit for you.
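Something like this, assuming the openai v1.x Python library (the model name and prompt are placeholders, not recommendations):

```python
# Minimal streaming sketch, assuming the openai v1.x Python library;
# model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write about 100 tokens on cats."}],
    max_tokens=100,
    stream=True,  # tokens arrive as they are generated
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```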
_j
79
We can measure a 100-token response. An average total time of 2.3 seconds is a good score.
For 3 trials of gpt-3.5-turbo-0613:

| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | 0.375 | 0.782 | 0.534 |
| total response (s) | 2.0236 | 2.482 | 2.226 |
| total rate (tokens/s) | 40.29 | 49.417 | 45.198 |
| stream rate (tokens/s) | 57.7 | 60.4 | 58.550 |
| response tokens | 100 | 100 | 100.000 |
For 3 trials of ft:gpt-3.5-turbo-0613:xxxx::yyyyy:

| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | 0.376 | 0.416 | 0.396 |
| total response (s) | 1.531 | 1.667 | 1.597 |
| total rate (tokens/s) | 59.988 | 65.317 | 62.680 |
| stream rate (tokens/s) | 77.8 | 85.7 | 82.500 |
| response tokens | 100 | 100 | 100.000 |
You can see the fine-tuned model is fastest (likely due to lower concurrency or different hardware), with a generation capability of 82 tokens per second - but only after the network delay and context loading of the 10 prompt tokens, at about 0.4 seconds.
However, with the max wait in a mere three trials of the normal model being 0.8 seconds just to receive the first token (at top_p = 0.01 for minimal sampling overhead), you don’t get many more tokens before 1.0 second has elapsed, even on a good day that doesn’t stretch into multiple seconds. Let’s get responses as fast as we can: 3 tokens in, 5 out:
For 50 trials of gpt-3.5-turbo-0613:
| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | 0.195 | 2.091 | 0.595 |
| total response (s) | 0.253 | 2.101 | 0.647 |
| total rate (tokens/s) | 2.38 | 19.763 | 9.751 |
| stream rate (tokens/s) | 21.4 | 666.7 | 141.630 |
| response tokens | 5 | 7 | 5.040 |
0.65 seconds to barely say “hi”. (The extreme “stream” rate for tokens 2-5 is likely just a minuscule delay in opening the API network response to you while the model is already generating.)
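For anyone wanting to reproduce numbers like these, here’s a rough sketch of one timing trial, assuming the openai v1.x library; the model, prompt, and the chunk-per-token approximation are all assumptions, not an exact script:

```python
# One timing trial: latency = time to first streamed chunk; stream rate =
# chunks after the first, divided by the remaining time. Treating one
# chunk as roughly one token is an approximation.
import time
from openai import OpenAI

client = OpenAI()

def trial(model: str = "gpt-3.5-turbo", max_tokens: int = 100) -> dict:
    start = time.perf_counter()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write about cats."}],
        max_tokens=max_tokens,
        top_p=0.01,  # near-deterministic sampling, as in the trials above
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            chunks += 1
            if first is None:
                first = time.perf_counter()
    total = time.perf_counter() - start
    latency = (first - start) if first else total
    return {
        "latency (s)": round(latency, 3),
        "total response (s)": round(total, 3),
        "total rate": round(chunks / total, 3),
        "stream rate": round((chunks - 1) / max(total - latency, 1e-9), 1),
        "response tokens": chunks,
    }

print(trial())
```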
5-token cat chat at temperature = 1.9
Cats, miracle piecesCats are fascinating animalsCats are mesmerizingCats,
those mysteriousCats, the curiousCats are itdeCats are captivating creaturesCats
are majestic animalsCats are incredibly fascinatingCats have the abilityCats
are fluffy andCats wander the worldCats have occupied aVAMarketing-winder’sCats
are perhaps theCats, beloved creaturesCats are charismatic andCats, though
muchCats are fascinating creaturesCats, known allCats, with theirCats, also
knownCats are fascinating creaturesCats are beloved creaturesCats, notorious
forCats: The UltimateCats, scientifically knownThere is something inherently
mysticalCats, also knownCats, one ofCats. MysteriousCats have drawn peopleCats:
Understanding OurOur beloved feline companionsCats, also knownCats are
amazing creaturesCats, wandering mystCats, often regardedCats are beloved
companionsCats are small mammalsCats, one ofCats are small,Cats have
attained aCats, longtime popularCats are fascinating creaturesCats, also
knownCats are enigmaticCats, domesticsCats are prevalent inCats, little
homem
My numbers are for GPT-4. GPT-3.5-Turbo will be much faster.
Thanks for this. I’m building a very simple app that generates a personal value statement with gpt-3.5-turbo (prompt + reply don’t take more than 700 tokens), and I couldn’t figure out why everything worked smoothly with my API key, yet with my client’s API key the script takes very long to execute (10-20 seconds) and often times out.
I’ll post the video I made as proof anyway, but I guess telling my client to buy at least $50 of credits should fix the problem.
EDIT
TLDR: raise your rate limits by buying credits
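In the meantime, a client-side timeout plus retries keeps the script from hanging forever. A sketch, assuming the openai v1.x Python client (the timeout and retry values are arbitrary examples):

```python
# Fail fast and retry instead of hanging; values are arbitrary examples.
from openai import OpenAI

client = OpenAI(
    timeout=30.0,   # abandon any single request after 30 seconds
    max_retries=2,  # the library retries transient failures on its own
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Generate a personal value statement."}],
)
print(response.choices[0].message.content)
```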
reyemtm
82
Thanks for the responses. In my use case - a user asks a question, the AI generates SQL from it, and that SQL then queries the database - putting this in a search bar is not feasible. It sounds like 10 s would be about right for this flow, which is not a reasonable user experience.
Is it normal to wait ~3-4 minutes for “total_tokens”: 4148 with gpt-3.5-turbo-16k?
_j
84
I can get about 110,000 gpt-3.5-turbo-16k tokens within 5 minutes … when asking for the same question to be answered 50 times.
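That throughput comes from firing the requests in parallel; a sketch with the async client, assuming the openai v1.x library and rate limits that tolerate this fan-out (the question text is a placeholder):

```python
# 50 parallel requests for the same question with the async client.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def one_answer() -> int:
    r = await client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": "Answer the question again."}],
    )
    return r.usage.total_tokens

async def main(n: int = 50) -> None:
    totals = await asyncio.gather(*(one_answer() for _ in range(n)))
    print(f"{sum(totals)} total tokens across {n} requests")

asyncio.run(main())
```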
The normal production rate of -16k has tended to be around 30-40 tokens per second - about 2,000 per minute. If your output seems to be chopped to half that or less, check your account’s “rate limits” page, where you can also see what trust-level tier you are in. Tier 1 gets the slow models too.
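You can also read your current limits programmatically: each API response carries x-ratelimit-* headers. A sketch, assuming the openai v1.x library’s with_raw_response wrapper:

```python
# Inspect the rate-limit headers the API sends with each response.
from openai import OpenAI

client = OpenAI()

raw = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo-16k",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
for name in (
    "x-ratelimit-limit-requests",
    "x-ratelimit-remaining-requests",
    "x-ratelimit-limit-tokens",
    "x-ratelimit-remaining-tokens",
):
    print(name, raw.headers.get(name))

completion = raw.parse()  # the normal ChatCompletion object if you need it
```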
Thanks for the answer. Now I see why it’s slower than before: I’m on tier 1…
The only way to get a faster speed is by “Tier 2 $50 paid and 7+ days since first successful payment”
Waiting 7 days?
_j
86
Others have reported getting their speed back much faster.
The waiting period is likely there to establish trust before you can deposit and spend the larger amounts of the next tier.
The claim that getting to Tier 2 is the answer to huge delays makes common sense, but I am on Tier 1 and I’ve reviewed the rate limits for that tier with GPT-3.5-Turbo. It is totally impossible that my chatbot is getting anywhere near those limits, yet I’m getting delays of 30 seconds and up, occasionally more than 100 seconds. If Tier 2 is the answer, then it certainly isn’t explained by the OpenAI documentation!
_j
88
They aren’t going to directly write “We took a whole bunch of accounts that were low-value and put them into an API filter buffer that simulates slow output to decrease their satisfaction. Goal: get them to pay more to return to normal.”
OpenAI rewrote the text on the “rate limits” page to:
“Organizations in higher tiers also get access to lower latency models.”
Previously: “As your usage tier increases, we may also move your account onto lower latency models behind the scenes.”
Lower latency “models” makes no sense. Why leave a generation on an overloaded, time-sliced server when it is more efficient to generate 100 tokens a second and then free that processing unit for another user? Only if there were no way to serve the current customer load without spinning up slow, energy-inefficient GPU instances of older technology.
I guess that’s the way it is. I, for one, would prefer they be more transparent about it. I can handle the truth.
Let me share an interesting observation with you.
Until I opened my API account one week ago, my ChatGPT account was very fast. But from the moment I opened my API account, ChatGPT became painfully slow. The API is also painfully slow. The funny thing is that my wife’s ChatGPT account has been fast the whole time.
Conclusion:
It’s part of their business plan to slow it down for you to make you pay more money. They will fail. I’m going to cancel my API account.
Hi and welcome to the Developer Forum!
You get an API account automatically when you create an OpenAI account. It’s likely that you are currently being assigned to a busy server while your wife is not; this is not something carefully calculated in the background, just the luck of the draw, and it could easily be reversed tomorrow. The system is under heavy load after DevDay, so please bear with it.
_j
92
Let me share an interesting observation with you.
You had an API account the whole time.
Also, let me know if you find where to “cancel” an API account. The only way I know how you can do that is to write sex fantasies about bombs and guns.
API has a tier system, where, as the rest of this topic describes, they trickle out the tokens to degrade the experience until you pay more in credits.
Thanks Foxabilo.
I introduced the API service in my web application a week ago, and it has been a total disaster: response times up to 3 minutes for simple questions. So right now I have removed that service from my application again. I can’t rely on that technology. Just imagine if my application were 100% dependent on the API service and response times suddenly exploded for a week or more. That would be a complete disaster.
The greatest laugh of all is the fear of AI cancelling millions of jobs. AI is run by incompetent leaders destroying their own business.
I have filed a support ticket to cancel my API account. The AI support agent told me that a real human being would contact me.
I started using it yesterday. API responses are a disaster. Don’t trust what they tell you at the dev conference about “Faster API calls”.
kerry_w
96
I’m following a Generative AI for Beginners course and am using the API on a Jupyter Notebook.
It took 38s to get a poem response…

Absolute disaster.