Fine-tuned gpt-3.5-turbo latency

Hi there,

I’m trying out a fine-tuned gpt-3.5-turbo model.

The inference speed is good in terms of tokens/s (comparable to gpt-3.5-turbo), but sometimes the API calls (including via Playground) will have a 2-10 second delay. I assume this is because the fine-tuned part of the model isn’t in some hot cache.

Is this expected for fine-tuned models long term? Is there a way to improve this latency? If I use the model for long enough, should I expect it to go away? Any information here would be helpful, thanks.

-Jeffrey


I’ve had the exact same experience, with a model trained on only 10 examples. Response times range anywhere from 2 to 10 seconds. GPT-4 probably averages 4-5 seconds, even with a larger response. I have the same questions as Jeffrey, but I’d also like to know whether latency gets even worse with larger training sets.

Good to know. I’m training on ~100 examples that are 300-500 tokens each and have fine-tuned three such models so far that all exhibit the same behavior.


If your application has 24/7 usage, I’d be interested in a log dump of the latency over time, to see whether any time-of-day (or day-of-week) trends stand out. Making a graph of latency against time with Code Interpreter might also be useful.
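In case it’s useful, here’s a rough sketch of what that kind of logging could look like, assuming the current openai Python SDK; the fine-tune ID and the prompt are placeholders, and you’d run it on whatever schedule your app already has:

```python
import csv
import time
from datetime import datetime, timezone

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "ft:gpt-3.5-turbo-0613:my-org::abc123"  # placeholder fine-tune ID


def timed_call(prompt: str) -> float:
    """Return the wall-clock seconds for one non-streaming chat completion."""
    start = time.perf_counter()
    client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=50,
    )
    return time.perf_counter() - start


# Append one timestamped measurement per run.
with open("latency_log.csv", "a", newline="") as f:
    elapsed = timed_call("Reply with the word 'ok'.")
    csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(), f"{elapsed:.3f}"])
    print(f"logged {elapsed:.3f}s")
```

The resulting CSV can then be graphed (latency against timestamp) to look for time-of-day or day-of-week patterns.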

Jailbreak the AI and make it produce obscene text, then check the new response finish reason for a content violation.

From that, you could infer that the response tokens are being held up while the generation is scanned for bad content, an undocumented change (or one somewhat documented by the “how we used GPT-4 to make a moderator” blog post).

We haven’t deployed it into production because of this issue. It’s very straightforward to reproduce: we can make ~20 identical calls to the API (same model, same prompt), and a third of them will take <200 ms, a third will take 1-2 seconds, and a third will take longer than that, up to 10 seconds as I mentioned previously. From my perspective, it’s almost certainly some sort of model loading/caching behavior.
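For what it’s worth, a minimal reproduction along those lines might look like this; it assumes the openai v1 Python SDK, and the fine-tune ID is a placeholder:

```python
import time

from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-3.5-turbo-0613:my-org::abc123"  # placeholder fine-tune ID
N = 20

timings = []
for _ in range(N):
    start = time.perf_counter()
    client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Reply with the word 'ok'."}],
        max_tokens=5,
    )
    timings.append(time.perf_counter() - start)

# Bucket the results the way the post above describes.
fast = sum(t < 0.2 for t in timings)
medium = sum(0.2 <= t < 2.0 for t in timings)
slow = sum(t >= 2.0 for t in timings)
print(f"<200 ms: {fast}   200 ms - 2 s: {medium}   >2 s: {slow}")
print(f"min {min(timings):.3f}s   max {max(timings):.3f}s")
```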


I’m observing the same behavior consistently. If we can get rid of the 1/3 that take longer than 2 seconds, we’re in good shape.

I had the same issue with my fine-tuned model, making it unusable for production, as someone mentioned. I checked again and the response time is now normalized and consistent: around 0.4 to 0.6 seconds for the first chunk when streaming. Has anyone else noticed an improvement? Maybe they fixed it.

One could guess that the custom model weights are not kept active for every client on every inference server a request might be routed to, after any period of inactivity. Some amount of data movement is then necessary before fulfillment, though it might not be needed when hitting the same server again within a quick batch of calls.

One could explore and probe, but it would not solve the ultimate problem. Keep-model-alive pings would probably not be appreciated.

Well, all of my tests are now within the range of the normal 3.5 model. In fact, the tokens per second are even faster than 3.5 normal. So, I suspect they have fixed something. The problem is no longer reproducible.

I’m also experiencing delays with the fine-tuned model. It’s inconsistent: about a third of API calls are delayed by 5-12 seconds.

This is really bad; I’m wondering if someone from OpenAI is aware of it. With this latency issue, fine-tuned gpt-3.5-turbo models aren’t usable for production use cases.

The latency to the first chunk from a fine-tune is just more inconsistent: it might arrive instantly, or it might leave you looking at a blank screen for five seconds.


Irregularity is just the worst thing that could happen in my use case. I also saw that fine-tuned models don’t support function calling, so I won’t try to train anything.
Thanks for the feedback.

I wonder if this is still an issue in 2024, and with fine-tuning of GPT-4o and GPT-4o-mini?

Yes. An inactive fine-tune takes longer to start producing.

Here’s a fine-tune from yesterday that had never been used for inference before:

Stat                     Average    Cold      Minimum   Maximum
stream rate (tokens/s)   149.133    276.4     84.6      276.4
latency (s)              2.420      5.8651    0.4704    5.8651
total response (s)       3.590      6.3354    1.975     6.3354
total rate (tokens/s)    46.747     20.677    20.677    66.329
response tokens          131.000    131       131       131

gpt-4o-mini

Contrast the “cold” latency with the “minimum” latency from the two subsequent runs. The average column can be disregarded, as it is biased by the slow first startup.

The whole picture is confusing, though: the cold run also blasts tokens at a high rate once the first streamed token arrives, but that doesn’t make up for the wait, giving a six-second completion time instead of two seconds.

Note that the input context is large enough to activate context caching on the later rounds.
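For anyone who wants to collect similar numbers, here’s a rough sketch of how stats like these could be gathered from streamed runs, assuming the openai v1 Python SDK; the fine-tune ID and prompt are placeholders, and the chunk count is only a rough proxy for response tokens:

```python
import time

from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-mini-2024-07-18:my-org::abc123"  # placeholder fine-tune ID
PROMPT = "Write a 100-word story about a robot."


def run_once() -> dict:
    """One streamed completion: time-to-first-token, total time, token rates."""
    start = time.perf_counter()
    first = None
    chunks = 0  # rough proxy for response tokens (one content delta per chunk)
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        stream=True,
        max_tokens=200,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter() - start
            chunks += 1
    total = time.perf_counter() - start
    first = first if first is not None else total
    return {
        "latency (s)": round(first, 4),
        "total response (s)": round(total, 4),
        "response tokens": chunks,
        "stream rate": round(chunks / (total - first), 1) if total > first else None,
        "total rate": round(chunks / total, 1),
    }


# The first request against an idle fine-tune is the "cold" one; later runs are warm.
for i in range(3):
    print("cold" if i == 0 else f"warm {i}", run_once())
```

The first request against the idle fine-tune gives the “cold” column; the minimum and maximum come from comparing it with the warm runs that follow.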
