Hi there!
Recently got into fine-tuning a gpt-4o-mini-2024-07-18 model. I only have about 30 examples in my training set, and in some initial testing I was pleased with the output, but the latency is slower than expected.
Responses can take anywhere from 2 to 10 seconds, and one request even took 20 seconds. Has anyone had any luck cutting down latency? I’m pretty new, but I wonder if there is any caching involved and, if so, whether anyone has had luck warming up the cache (say, before expected heavy-use periods). I’d really like to use this in prod, but it simply must be faster (said every SWE ever, lol). Perhaps other models are faster? I tried prompt engineering, but it wasn’t up to the task.
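For reference, here’s a minimal sketch of how one could time these requests with streaming, separating time-to-first-token from total time (the model ID below is just a placeholder for your fine-tuned model):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder fine-tuned model ID -- swap in your own.
MODEL = "ft:gpt-4o-mini-2024-07-18:my-org::abc123"

start = time.perf_counter()
stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

first_chunk_at = None
for chunk in stream:
    # Record when the first streamed chunk arrives.
    if first_chunk_at is None:
        first_chunk_at = time.perf_counter()
end = time.perf_counter()

print(f"time to first token: {first_chunk_at - start:.2f}s, total: {end - start:.2f}s")
```

If time-to-first-token turns out to be much lower than the total, streaming alone might make the latency feel acceptable in prod even if the full completion still takes several seconds.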
Thank you all for your input, O’ noble OpenAI community.