The inference speed is good in terms of tokens/s (comparable to gpt-3.5-turbo), but sometimes the API calls (including via Playground) will have a 2-10 second delay. I assume this is because the fine-tuned part of the model isn’t in some hot cache.
Is this the expected behavior of fine-tuned models long term? Is there a way to improve this latency? If I use the model for long enough, should I expect the delay to go away? Any information here would be helpful, thanks.
I’ve had the exact same experience, and I’ve only used a model trained on 10 examples. Response times range anywhere from 2 to 10 seconds. GPT-4 probably averages 4-5 seconds, even with a larger response. I have the same questions as Jeffrey, but I also want to know: does latency get even worse with larger training sets?
Good to know. I’m training on ~100 examples that are 300-500 tokens each and have fine-tuned three such models so far that all exhibit the same behavior.
If your application has 24/7 usage, I’d be interested in a log dump of the latency over time, to see whether any time-of-day (or possibly day-of-week) trends stand out. Making a graph of latency against time with Code Interpreter might also be useful; a rough sketch is below.
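A minimal plotting sketch, assuming you already log each call to a CSV with a `timestamp` column (ISO 8601) and a `latency_s` column in seconds; the file name and column names are illustrative, not from this thread:

```python
# Minimal sketch: scatter of per-call latency plus an hourly median,
# to expose time-of-day trends without being dominated by outliers.
# Assumes a hypothetical "latency_log.csv" with "timestamp" and "latency_s".
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("latency_log.csv", parse_dates=["timestamp"])

hourly = df.set_index("timestamp")["latency_s"].resample("1h").median()

fig, ax = plt.subplots(figsize=(10, 4))
ax.scatter(df["timestamp"], df["latency_s"], s=8, alpha=0.3, label="per call")
ax.plot(hourly.index, hourly.values, color="red", label="hourly median")
ax.set_xlabel("time")
ax.set_ylabel("latency (s)")
ax.legend()
plt.tight_layout()
plt.show()
```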
Jailbreak the AI and make it produce obscene text or meaning; see the new response finish reason for a content violation.
Infer that the response tokens are being held up while the generation is scanned for bad content, an undocumented change (or one somewhat documented by the “how we used GPT-4 to make a moderator” blog post).
We haven’t deployed it into production because of this issue. It’s very straightforward to reproduce: we can make ~20 identical calls to the API (same model, same prompt), and 1/3 of them will take <200 ms, 1/3 will take 1-2 seconds, and 1/3 will take longer than that, up to the 10 seconds I mentioned previously. From my perspective, it’s almost certainly some sort of model loading/caching behavior.
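For anyone who wants to reproduce the distribution, here is a minimal sketch using the openai Python SDK; the fine-tuned model id and prompt below are placeholders, not the ones used here:

```python
# Minimal reproduction sketch. Assumptions: openai Python SDK >= 1.x installed,
# OPENAI_API_KEY set in the environment, and a placeholder fine-tuned model id.
import time
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-3.5-turbo-0613:your-org::abc123"  # hypothetical model id
MESSAGES = [{"role": "user", "content": "Say hello."}]

timings = []
for _ in range(20):
    start = time.perf_counter()
    client.chat.completions.create(model=MODEL, messages=MESSAGES, max_tokens=20)
    timings.append(time.perf_counter() - start)

timings.sort()
print("fastest:", round(timings[0], 3), "s")
print("median: ", round(timings[len(timings) // 2], 3), "s")
print("slowest:", round(timings[-1], 3), "s")
```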
I had the same issue with my fine-tuned model, making it unusable for production, as someone mentioned. I checked again and the response time is now normalized and consistent: around 0.4 to 0.6 seconds for the first chunk when streaming. Anyone else notice an improvement? Maybe they fixed it.
One could guess that the custom model weights are not kept active for every client on every routed inference server one might reach, after any period of inactivity. Some amount of data movement is then necessary before fulfillment, though it might not be needed if you hit the same server again within a batch of calls.
One could explore and probe, but it would not solve the ultimate problem. Keep-model-alive pings would probably not be appreciated.
Well, all of my tests are now within the range of the normal 3.5 model. In fact, the tokens per second are even higher than normal 3.5. So I suspect they have fixed something; the problem is no longer reproducible.
This is really bad; I wonder if someone from OpenAI is aware of it. With this latency issue, fine-tuned GPT-3.5-turbo models aren’t usable for production use cases.
The latency of the first chunk from a fine-tune is simply more inconsistent. It might start instantly, or it might leave you looking at a blank screen for five seconds.
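If you want to measure that first-chunk latency yourself, here is a minimal streaming sketch; it assumes the openai Python SDK and uses a placeholder fine-tuned model id:

```python
# Minimal sketch for measuring time-to-first-chunk with streaming.
# Assumptions: openai Python SDK >= 1.x, OPENAI_API_KEY set, placeholder model id.
import time
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-3.5-turbo-0613:your-org::abc123"  # hypothetical model id

start = time.perf_counter()
stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)

first_chunk_at = None
for chunk in stream:
    if first_chunk_at is None:
        first_chunk_at = time.perf_counter()

total = time.perf_counter() - start
print("time to first chunk:", round(first_chunk_at - start, 3), "s")
print("total response time:", round(total, 3), "s")
```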
Irregularity is just about the worst thing that could happen in my use case. I also saw that fine-tuned models don’t support function calling, so I won’t try training anything.
Thanks for the feedback
Yes. An inactive fine-tune takes longer to start producing.
Here’s a fine-tune from yesterday, never used for inference before:
| Stat | Average | Cold | Minimum | Maximum |
|---|---|---|---|---|
| stream rate (tokens/s) | 149.133 | 276.4 | 84.6 | 276.4 |
| latency (s) | 2.420 | 5.8651 | 0.4704 | 5.8651 |
| total response (s) | 3.590 | 6.3354 | 1.975 | 6.3354 |
| total rate (tokens/s) | 46.747 | 20.677 | 20.677 | 66.329 |
| response tokens | 131.000 | 131 | 131 | 131 |

Model: gpt-4o-mini
Contrast the “cold” latency with the “minimum” latency from the two subsequent runs. The average column can be disregarded, as it is skewed by the slow first startup.
The whole picture is confusing, though: the “cold” run also blasts tokens at a high rate once the first streamed token arrives, but that doesn’t make up for the wait, giving a six-second completion time instead of two seconds.
Note that the input context is large enough to activate context caching on the later rounds.
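For reference, a rough sketch of how stats like these could be collected: time several streaming runs, treat the first as the cold run, and derive the stream rate from the tokens delivered after the first chunk. The model id below is a placeholder, and counting one token per streamed chunk is only an approximation, not necessarily how the numbers above were produced:

```python
# Rough sketch of collecting per-run latency/rate stats for a fine-tune.
# Assumptions: openai Python SDK >= 1.x, OPENAI_API_KEY set, placeholder model id,
# and "one streamed content chunk ~ one token" as a crude token count.
import time
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-mini-2024-07-18:your-org::abc123"  # hypothetical model id
PROMPT = [{"role": "user", "content": "Write a short paragraph about latency."}]

def timed_run():
    start = time.perf_counter()
    stream = client.chat.completions.create(model=MODEL, messages=PROMPT, stream=True)
    first = None
    tokens = 0
    for chunk in stream:
        if first is None:
            first = time.perf_counter()
        if chunk.choices and chunk.choices[0].delta.content:
            tokens += 1
    total = time.perf_counter() - start
    latency = first - start
    return {
        "latency (s)": latency,
        "total response (s)": total,
        "stream rate": (tokens - 1) / (total - latency),  # tokens/s after first chunk
        "total rate": tokens / total,                      # tokens/s overall
        "response tokens": tokens,
    }

runs = [timed_run() for _ in range(3)]  # first run is the "cold" one
for key in runs[0]:
    values = [r[key] for r in runs]
    print(f"{key}: cold={values[0]:.3f} min={min(values):.3f} max={max(values):.3f}")
```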