The inference speed is good in terms of tokens/s (comparable to gpt-3.5-turbo), but sometimes the API calls (including via Playground) will have a 2-10 second delay. I assume this is because the fine-tuned part of the model isn’t in some hot cache.
Is this the expected behavior for fine-tuned models long term? Is there a way to improve this latency? If I use the model for long enough, should I expect it to go away? Any information here would be helpful, thanks.
I’ve had the exact same experience. I only used a model trained on 10 examples. Response times range anywhere from 2 seconds to 10 seconds. GPT-4 probably averages 4-5 seconds even with a larger response. I have the same questions as Jeffrey, but I also wanted to know whether latency gets even worse with larger training sets.
Good to know. I’m training on ~100 examples that are 300-500 tokens each and have fine-tuned three such models so far that all exhibit the same behavior.
If your application has 24/7 usage, I’d be interested in a log dump of the latency over time, to see whether any time-of-day (or possibly day-of-week) trends stand out. Making a graph of latency against time with code interpreter might also be useful.
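Just as a starting point, a minimal sketch of that kind of plot is below. It assumes you already log each request to a CSV with `timestamp` and `latency_seconds` columns; both the file name and the column names are placeholders for whatever your application actually records.

```python
# Sketch: plot request latency over time and summarize it by hour of day.
# Assumes a CSV "latency_log.csv" with columns "timestamp" (ISO 8601)
# and "latency_seconds" -- names are placeholders for your own log format.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("latency_log.csv", parse_dates=["timestamp"]).sort_values("timestamp")

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(df["timestamp"], df["latency_seconds"], marker=".", linestyle="none", alpha=0.5)
ax.set_xlabel("time")
ax.set_ylabel("latency (s)")
ax.set_title("Fine-tuned model request latency over time")
plt.tight_layout()
plt.savefig("latency_over_time.png")

# Average latency by hour of day, to surface any time-of-day trend.
print(df.groupby(df["timestamp"].dt.hour)["latency_seconds"].mean())
```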
One experiment: jailbreak the AI and get it to produce obscene text or meaning, then look at the new response finish reason for a content violation.
From that, you could infer that the response tokens are being held up while the generation is scanned for bad content by an undocumented change (or one somewhat documented by the “how we used GPT-4 to make a moderator” blog post).
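If you want to check that finish reason yourself, here is a rough sketch using the current Python SDK (openai>=1.0). The fine-tune ID and prompt are placeholders, and I’m assuming the flagged value is the documented `content_filter` string.

```python
# Sketch: call a fine-tuned model and inspect the finish_reason.
# Fine-tune ID and prompt are placeholders; "content_filter" is the
# finish_reason the API documents for flagged output.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0613:my-org::abc123",  # placeholder fine-tune ID
    messages=[{"role": "user", "content": "Hello"}],
)

choice = response.choices[0]
print("finish_reason:", choice.finish_reason)
if choice.finish_reason == "content_filter":
    print("Response was cut off for a content violation.")
```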
We haven’t deployed it into production because of this issue. It’s very straightforward to reproduce: we can make ~20 identical calls to the API (same model, same prompt), and about a third of them will take <200 ms, a third will take 1-2 seconds, and a third will take longer than that, up to the 10 seconds I mentioned previously. From my perspective, it’s almost certainly some sort of model loading/caching behavior.
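For anyone who wants to reproduce this themselves, a rough sketch of that kind of loop with the Python SDK is below; the fine-tune ID, prompt, and number of calls are placeholders, not our actual production setup.

```python
# Sketch: send ~20 identical requests to a fine-tuned model and print
# the latency spread. Fine-tune ID and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI()
latencies = []

for _ in range(20):
    start = time.perf_counter()
    client.chat.completions.create(
        model="ft:gpt-3.5-turbo-0613:my-org::abc123",
        messages=[{"role": "user", "content": "Say hello."}],
        max_tokens=16,
    )
    latencies.append(time.perf_counter() - start)

latencies.sort()
print(f"min    {latencies[0]:.2f}s")
print(f"median {latencies[len(latencies) // 2]:.2f}s")
print(f"max    {latencies[-1]:.2f}s")
```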
I had the same issue with my FT model, making it unusable for production, as someone mentioned. I checked again and the response time is now normalized and consistent: around 0.4 to 0.6 seconds for the first chunk when streaming. Anyone else notice an improvement? Maybe they fixed it.
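For reference, this is roughly how I measure the first-chunk latency with streaming; the fine-tune ID and prompt are placeholders.

```python
# Sketch: measure time to the first streamed chunk from a fine-tuned model.
# Fine-tune ID and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0613:my-org::abc123",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    # The first chunk marks the end of the queue / first-token delay.
    print(f"time to first chunk: {time.perf_counter() - start:.2f}s")
    break
```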
One could guess that the custom model weights are not kept active for every client on every inference server a request may be routed to, after any period of inactivity. Some amount of data movement would then be necessary before fulfillment, but it might not be needed if a batch of requests hits the same server again.
One could explore and probe, but it would not solve the ultimate problem. Keep-model-alive pings would probably not be appreciated.
Well, all of my tests are now within the range of the normal 3.5 model. In fact, the tokens per second are even faster than normal 3.5. So I suspect they have fixed something; the problem is no longer reproducible.
This is really bad; I’m wondering whether someone from OpenAI is aware of it. With this latency issue, fine-tuned GPT-3.5-turbo models aren’t usable for production use cases.
Hi there, any news about fine-tuned model latency? I was wondering whether using a fine-tuned model could consistently make generation faster than the non-fine-tuned one.
The first-chunk latency of a fine-tune is only more inconsistent: it might arrive instantly, or it might leave you looking at a blank screen for five seconds.
Irregularity is just the worst thing that could happen in my use case. I also saw that fine-tuned models don’t support function calling, so I won’t try to train anything.
Thanks for the feedback