High latency for fine-tuned gpt-4o-mini

Hi there!

Recently got into fine-tuning a gpt-4o-mini-2024-07-18 model. I only have about 30 elements in my training set, and in some initial testing I was pleased with the quality of the output; however, the latency is slower than expected.

It can range anywhere from 2 to 10 seconds, and one instance even went 20 seconds. Has anyone had any luck cutting down latency? I'm pretty new, but I wonder if there is any caching involved and, if so, whether anyone has had luck warming up the cache (say, before expected heavy-use periods). I'd really like to use this in prod, but it simply must be faster (said every swe ever lol). Perhaps other models are faster? I tried prompt engineering, but it wasn't up to the task.

Thank you all for your input, O’ noble OpenAI community.

Sure is lonely in here :frowning:

This is actually super good, considering!

The more context you send, the longer it takes.


@Dunc Hey! Seeing the same (or worse even, 39s?!)… did you ever find a magic solution for this?

Fine-tuned models take a while to warm up after a period of inactivity, for whatever reason on the backend.

You can make some async one-token calls to the model when you anticipate it will be used, for cases where you have some warning that it may be called upon.
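A minimal sketch of that warm-up idea, using the official openai Python SDK; the `FINE_TUNED_MODEL` value is a placeholder for your own fine-tuned model ID, and whether this actually keeps anything warm on OpenAI's side isn't documented, so treat it as a best-effort mitigation:

```python
import asyncio
from openai import AsyncOpenAI

# Placeholder: substitute your own fine-tuned model ID here.
FINE_TUNED_MODEL = "ft:gpt-4o-mini-2024-07-18:your-org::abc123"

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


async def warm_up() -> None:
    """Fire a cheap one-token request so the fine-tuned model is
    hopefully warm before real traffic arrives."""
    await client.chat.completions.create(
        model=FINE_TUNED_MODEL,
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )


if __name__ == "__main__":
    # Call this shortly before an expected heavy-use period.
    asyncio.run(warm_up())
```

You could schedule something like this a minute or two ahead of your known traffic windows, or fire it in the background when a user session starts.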