What is the expected inference latency of a fine-tuned gpt-4 model?

Hello, we have a fine-tuned gpt-4 model for code generation. The answer quality is pretty good, but we are seeing inconsistent inference latency numbers.

On the first day of testing, around 80% of requests generated about 30 tokens/s, and the other 20% generated about 10 tokens/s.

Since then it has been getting slower and slower: most requests are now around 10 tokens/s, and some have dropped to 5 tokens/s.

All the inputs are quite similar. Is there any latency SLA for fine-tuned gpt-4 models, and what could be the reason for the varied latency?

Thanks!

Are you streaming? Although I haven't noticed it recently (and my fine-tunes are only tests, not regularly used), there was a period where warming up a fine-tuned model could take up to 15 seconds before the first token.
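If you switch to streaming, you can measure that warm-up directly. A minimal sketch with the OpenAI Python SDK; the fine-tune model ID is a made-up placeholder, and chunk counts are only a rough proxy for tokens:

```python
import time
from openai import OpenAI

client = OpenAI()

start = time.monotonic()
stream = client.chat.completions.create(
    model="ft:gpt-4-0613:my-org::abc123",  # placeholder fine-tune ID
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    stream=True,
)

first_token_at = None
chunks = 0
for chunk in stream:
    # Count only content-bearing chunks; skip role/empty deltas.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.monotonic()
        chunks += 1
end = time.monotonic()

if first_token_at is not None:
    print(f"time to first token: {first_token_at - start:.2f}s")
    print(f"~{chunks / (end - first_token_at):.1f} chunks/s after the first token")
```

Run against your own fine-tune and compare the time-to-first-token with the per-token rate; that separates warm-up delay from slow generation.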

If you have seen nothing by the fifth second, you could quietly fire off a new parallel request and close() the one that loses the race to the first 50 tokens.
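A rough sketch of that racing idea (not an official pattern): start the primary streaming request, launch a backup after ~5 seconds of silence, keep whichever yields a token first, and close the loser. The model ID and timeout are placeholders, and it assumes the async SDK's stream object exposes a close() method:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
MODEL = "ft:gpt-4-0613:my-org::abc123"  # placeholder fine-tune ID

async def wait_first_token(stream):
    # Consume the stream until the first content-bearing chunk arrives.
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return stream, chunk
    return stream, None

async def race(messages, backup_after=5.0):
    s1 = await client.chat.completions.create(model=MODEL, messages=messages, stream=True)
    t1 = asyncio.create_task(wait_first_token(s1))

    done, _ = await asyncio.wait({t1}, timeout=backup_after)
    if done:
        return await t1  # primary produced a token in time; keep it

    # Still nothing: quietly fire a parallel backup request.
    s2 = await client.chat.completions.create(model=MODEL, messages=messages, stream=True)
    t2 = asyncio.create_task(wait_first_token(s2))

    done, pending = await asyncio.wait({t1, t2}, return_when=asyncio.FIRST_COMPLETED)
    winner = done.pop()
    for task in pending:
        task.cancel()
    loser_stream = s2 if winner is t1 else s1
    await loser_stream.close()  # drop the slower connection (assumes close() on the stream)
    return await winner
```

The caller then keeps iterating the winning stream for the rest of the completion; you pay for some duplicated tokens on the abandoned request, so it only makes sense when the stalls are long.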

The usage tier you are in also affects your token production rate (what OpenAI mistakenly called "latency" back when they even mentioned the low-tier penalty), with tier-1 affected the most.

You are in a rare club, fine-tuning gpt-4. Nothing about it is documented except that it is available and requires approval, not even the price (a price you are welcome to tell us about!)

In our case, we are not streaming. Thanks for the comments! Will double-check the time to generate the first token and our tier.