My Application has integrated with open AI for API calls for the responses to the user prompts. . The API calls are taking longer time to generate a response. I need a best feasible solution to achieve faster responses within milli seconds.
One solution could be to switch to GPT 3.5 Turbo.
If this is not feasible because of quality, you could try fine-tuning a GPT 3.5 Turbo using GPT 4 to give you the fine-tuning dataset.
Another thing you can do is add a semantic caching layer between your server and OpenAI, and check if that query has already been asked and just fetch the answer from your semantic cache.
The only way you will get millisecond responses from an LLM reliably without a dedicated instance will be to host a small open source model on a high performance GPU and have only you as the client. You could take advantage of a dedicated instance and have it make commercial sense if your needs are 450M tokens per day for greater.