What is the expected inference latency of a fine-tuned gpt-4 model?

Hello, we have a fine-tuned gpt-4 model for code generation. The answer quality is pretty good, but we are seeing inconsistent inference latency numbers.

On the first day of testing, around 80% of requests generated about 30 tokens/s, and the other 20% generated about 10 tokens/s.

Since then it has been getting slower and slower: most requests are now around 10 tokens/s, and some have dropped to 5 tokens/s.

All the inputs are quite similar. Is there any latency SLA for fine-tuned gpt-4 models, and what could be the reason for the varied latency?

Thanks!

Are you streaming? Although I haven't noticed it recently (and my fine-tunes are only tests, not regularly used), there was a period where warming up a fine-tuned model could take up to 15 seconds before the first token.
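If you switch to streaming, you can measure that warm-up directly. A minimal sketch with the OpenAI Python SDK; the fine-tune model ID is a made-up placeholder, and chunk counts are only a rough proxy for tokens:

```python
import time
from openai import OpenAI

client = OpenAI()

start = time.monotonic()
stream = client.chat.completions.create(
    model="ft:gpt-4-0613:my-org::abc123",  # placeholder fine-tune ID
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    stream=True,
)

first_token_at = None
chunks = 0
for chunk in stream:
    # Count only content-bearing chunks; skip role/empty deltas.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.monotonic()
        chunks += 1
end = time.monotonic()

if first_token_at is not None:
    print(f"time to first token: {first_token_at - start:.2f}s")
    print(f"~{chunks / (end - first_token_at):.1f} chunks/s after the first token")
```

Run against your own fine-tune and compare the time-to-first-token with the per-token rate; that separates warm-up delay from slow generation.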

If you have seen nothing by the fifth second, you could quietly fire off a new parallel request and close() the one that loses the race to the first 50 tokens.
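A rough sketch of that racing idea (not an official pattern): start the primary streaming request, launch a backup after ~5 seconds of silence, keep whichever yields a token first, and close the loser. The model ID and timeout are placeholders, and it assumes the async SDK's stream object exposes a close() method:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
MODEL = "ft:gpt-4-0613:my-org::abc123"  # placeholder fine-tune ID

async def wait_first_token(stream):
    # Consume the stream until the first content-bearing chunk arrives.
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return stream, chunk
    return stream, None

async def race(messages, backup_after=5.0):
    s1 = await client.chat.completions.create(model=MODEL, messages=messages, stream=True)
    t1 = asyncio.create_task(wait_first_token(s1))

    done, _ = await asyncio.wait({t1}, timeout=backup_after)
    if done:
        return await t1  # primary produced a token in time; keep it

    # Still nothing: quietly fire a parallel backup request.
    s2 = await client.chat.completions.create(model=MODEL, messages=messages, stream=True)
    t2 = asyncio.create_task(wait_first_token(s2))

    done, pending = await asyncio.wait({t1, t2}, return_when=asyncio.FIRST_COMPLETED)
    winner = done.pop()
    for task in pending:
        task.cancel()
    loser_stream = s2 if winner is t1 else s1
    await loser_stream.close()  # drop the slower connection (assumes close() on the stream)
    return await winner
```

The caller then keeps iterating the winning stream for the rest of the completion; you pay for some duplicated tokens on the abandoned request, so it only makes sense when the stalls are long.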

The usage tier you are in also affects your token production rate (what OpenAI mistakenly called "latency" back when they even mentioned the low-tier penalty), with tier-1 affected the most.

You are in a rare club, fine-tuning gpt-4. Nothing about it is documented except that it is available and requires approval, not even the price (a price you are welcome to tell us about!)

In our case, we are not streaming. Thanks for the comments! Will double-check the time to generate the first token and our tier.