The inference speed is good in terms of tokens/s (comparable to gpt-3.5-turbo), but sometimes the API calls (including via Playground) will have a 2-10 second delay. I assume this is because the fine-tuned part of the model isn’t in some hot cache.
Is this the expected behavior of fine-tuned models long term? Is there a way to improve this latency? If I use the model for long enough, should I expect the delay to go away? Any information here would be helpful, thanks.
I’ve had the exact same experience, and I’ve only used a model trained on 10 examples. Response times range anywhere from 2 to 10 seconds. GPT-4 probably averages 4-5 seconds, even with a larger response. I have the same questions as Jeffrey, but I also want to know: does latency get even worse with larger training sets?
Good to know. I’m training on ~100 examples that are 300-500 tokens each and have fine-tuned three such models so far that all exhibit the same behavior.
If your application has 24/7 usage, I’d be interested in a log dump of the latency over time, to see whether any time-of-day (or possibly day-of-week) trends stand out. Making a graph of latency against time with Code Interpreter might also be useful; a rough sketch is below.
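A minimal plotting sketch, assuming you already log each call to a CSV with a `timestamp` column (ISO 8601) and a `latency_s` column in seconds; the file name and column names are illustrative, not from this thread:

```python
# Minimal sketch: scatter of per-call latency plus an hourly median,
# to expose time-of-day trends without being dominated by outliers.
# Assumes a hypothetical "latency_log.csv" with "timestamp" and "latency_s".
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("latency_log.csv", parse_dates=["timestamp"])

hourly = df.set_index("timestamp")["latency_s"].resample("1h").median()

fig, ax = plt.subplots(figsize=(10, 4))
ax.scatter(df["timestamp"], df["latency_s"], s=8, alpha=0.3, label="per call")
ax.plot(hourly.index, hourly.values, color="red", label="hourly median")
ax.set_xlabel("time")
ax.set_ylabel("latency (s)")
ax.legend()
plt.tight_layout()
plt.show()
```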
Jailbreak the AI and make it produce obscene text or meaning; see the new response finish reason for a content violation.
Infer that the response tokens are being held up while the generation is scanned for bad content, an undocumented change (or one somewhat documented by the “how we used GPT-4 to make a moderator” blog post).
We haven’t deployed it into production because of this issue. It’s very straightforward to reproduce: we can make ~20 identical calls to the API (same model, same prompt), and 1/3 of them will take <200 ms, 1/3 will take 1-2 seconds, and 1/3 will take longer than that, up to the 10 seconds I mentioned previously. From my perspective, it’s almost certainly some sort of model loading/caching behavior.
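For anyone who wants to reproduce the distribution, here is a minimal sketch using the openai Python SDK; the fine-tuned model id and prompt below are placeholders, not the ones used here:

```python
# Minimal reproduction sketch. Assumptions: openai Python SDK >= 1.x installed,
# OPENAI_API_KEY set in the environment, and a placeholder fine-tuned model id.
import time
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-3.5-turbo-0613:your-org::abc123"  # hypothetical model id
MESSAGES = [{"role": "user", "content": "Say hello."}]

timings = []
for _ in range(20):
    start = time.perf_counter()
    client.chat.completions.create(model=MODEL, messages=MESSAGES, max_tokens=20)
    timings.append(time.perf_counter() - start)

timings.sort()
print("fastest:", round(timings[0], 3), "s")
print("median: ", round(timings[len(timings) // 2], 3), "s")
print("slowest:", round(timings[-1], 3), "s")
```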
I had the same issue with my fine-tuned model, making it unusable for production, as someone mentioned. I checked again and the response time is now normalized and consistent: around 0.4 to 0.6 seconds for the first chunk when streaming. Anyone else notice an improvement? Maybe they fixed it.
One could guess that the custom model weights are not kept active for every client on every routed inference server one might reach, after any period of inactivity. Some amount of data movement is then necessary before fulfillment, though it might not be needed if you hit the same server again within a batch of calls.
One could explore and probe, but it would not solve the ultimate problem. Keep-model-alive pings would probably not be appreciated.
Well, all of my tests are now within the range of the normal 3.5 model. In fact, the tokens per second are even higher than normal 3.5. So I suspect they have fixed something; the problem is no longer reproducible.
This is really bad; I wonder if someone from OpenAI is aware of it. With this latency issue, fine-tuned GPT-3.5-turbo models aren’t usable for production use cases.
The latency of the first chunk from a fine-tune is simply more inconsistent. It might start instantly, or it might leave you looking at a blank screen for five seconds.
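If you want to measure that first-chunk latency yourself, here is a minimal streaming sketch; it assumes the openai Python SDK and uses a placeholder fine-tuned model id:

```python
# Minimal sketch for measuring time-to-first-chunk with streaming.
# Assumptions: openai Python SDK >= 1.x, OPENAI_API_KEY set, placeholder model id.
import time
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-3.5-turbo-0613:your-org::abc123"  # hypothetical model id

start = time.perf_counter()
stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)

first_chunk_at = None
for chunk in stream:
    if first_chunk_at is None:
        first_chunk_at = time.perf_counter()

total = time.perf_counter() - start
print("time to first chunk:", round(first_chunk_at - start, 3), "s")
print("total response time:", round(total, 3), "s")
```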
Irregularity is just about the worst thing that could happen in my use case. I also saw that fine-tuned models don’t support function calling, so I won’t try training anything.
Thanks for the feedback
Yes. An inactive fine-tune takes longer to start producing.
Here’s a fine-tune from yesterday, never used for inference before:
| Stat | Average | Cold | Minimum | Maximum |
|---|---|---|---|---|
| stream rate (tokens/s) | 149.133 | 276.4 | 84.6 | 276.4 |
| latency (s) | 2.420 | 5.8651 | 0.4704 | 5.8651 |
| total response (s) | 3.590 | 6.3354 | 1.975 | 6.3354 |
| total rate (tokens/s) | 46.747 | 20.677 | 20.677 | 66.329 |
| response tokens | 131.000 | 131 | 131 | 131 |

Model: gpt-4o-mini
Contrast the “cold” latency with the “minimum” latency from the two subsequent runs. The average column can be disregarded, as it is skewed by the slow first startup.
The whole picture is confusing, though: the “cold” run also blasts tokens at a high rate once the first streamed token arrives, but that doesn’t make up for the wait, giving a six-second completion time instead of two seconds.
Note that the input context is large enough to activate context caching on the later rounds.
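For reference, a rough sketch of how stats like these could be collected: time several streaming runs, treat the first as the cold run, and derive the stream rate from the tokens delivered after the first chunk. The model id below is a placeholder, and counting one token per streamed chunk is only an approximation, not necessarily how the numbers above were produced:

```python
# Rough sketch of collecting per-run latency/rate stats for a fine-tune.
# Assumptions: openai Python SDK >= 1.x, OPENAI_API_KEY set, placeholder model id,
# and "one streamed content chunk ~ one token" as a crude token count.
import time
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-mini-2024-07-18:your-org::abc123"  # hypothetical model id
PROMPT = [{"role": "user", "content": "Write a short paragraph about latency."}]

def timed_run():
    start = time.perf_counter()
    stream = client.chat.completions.create(model=MODEL, messages=PROMPT, stream=True)
    first = None
    tokens = 0
    for chunk in stream:
        if first is None:
            first = time.perf_counter()
        if chunk.choices and chunk.choices[0].delta.content:
            tokens += 1
    total = time.perf_counter() - start
    latency = first - start
    return {
        "latency (s)": latency,
        "total response (s)": total,
        "stream rate": (tokens - 1) / (total - latency),  # tokens/s after first chunk
        "total rate": tokens / total,                      # tokens/s overall
        "response tokens": tokens,
    }

runs = [timed_run() for _ in range(3)]  # first run is the "cold" one
for key in runs[0]:
    values = [r[key] for r in runs]
    print(f"{key}: cold={values[0]:.3f} min={min(values):.3f} max={max(values):.3f}")
```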