I’ve created a free tool here to help software developers, except every time I call the OpenAI API it takes an incredibly long time to get a response. This will most likely turn away anyone wanting to use the tool. How do other applications speed up their API calls to gpt-3.5-turbo?
Some calls take as long as 80 seconds, while others take 10 seconds.
It’s a really good question. I suspect many AI applications are not designed in ways that prioritize performance. Our initial attempts to build an FAQ system had 20+ second response times. This was untenable. We have it down to about 3.8 seconds with embeddings and GPT-3. But soon we will move it all to PaLM embeddings and completions, where we’re seeing less than 2 seconds for multiple answer-processing calls. Part of this slow march to higher performance relates to our vector data handling and other caching techniques.
That’s very impressive that you got it down to just a few seconds. I’m dealing with a similar project right now. Would you be so kind as to share what you’ve focused on to improve performance? Thank you!
Yeah, so there are many little things, but a big aha moment for me was establishing a hash index in memory representing historical responses. This makes it possible to lean solely on embeddings to craft a response. Consider that a single embedding can be generated in about 400ms. My process using this cache index involves just two steps:
1. Generate the embedding vector for the query (~400ms)
2. Perform dot-product (similarity) tests against the in-memory cache (~250ms)
High historical similarities (above a minimum threshold) represent opportunities to reuse previous inference responses. It makes it possible to generate near-instant answers to questions.
The beauty of this is that step one is required for every response in a Q&A app anyway. Step two costs only a quarter of a second, so you stand to short-circuit an entire inference request by investing in a cheap look-back query.
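The two-step lookup described above can be sketched roughly like this. This is a minimal illustration, not my production code: `embed()` is a stand-in for a real embedding API call (here it returns a deterministic toy vector so the snippet runs offline), and the 0.9 threshold is an arbitrary example value.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder for a real embedding API call (e.g., an ada-style model).
    # Returns a deterministic unit vector so the example is self-contained.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(32)
    return v / np.linalg.norm(v)

class ResponseCache:
    """In-memory index of (embedding, answer) pairs from past inferences."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.vectors = []   # unit embedding vectors of past questions
        self.answers = []   # parallel list of the responses we served

    def add(self, question: str, answer: str) -> None:
        self.vectors.append(embed(question))
        self.answers.append(answer)

    def lookup(self, question: str):
        if not self.vectors:
            return None
        qvec = embed(question)                 # step 1: embed the query
        sims = np.stack(self.vectors) @ qvec   # step 2: dot-product scan
        best = int(np.argmax(sims))
        if sims[best] >= self.threshold:
            return self.answers[best]          # reuse the prior response
        return None                            # fall through to inference
```

Anything under the threshold falls through to a normal inference call, whose answer you then `add()` back into the cache for next time.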
Many additional ideas come to mind that I have not explored. Imagine a historical cache based on question popularity and other measures that predict which questions are worth caching. The opportunities to build additional performance measures using AI itself are vast.
Lastly, imagine the historical cache is periodically tested against actual new questions to see whether the cached recommendation is as good as a fresh inference would have been. Embeddings, once again, make this possible: you can easily perform the look-back and compare that answer to what a new inference would have produced had there been no cache to lean on. Thus, the snake eats its tail to get better and better.
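That validation loop might look something like the sketch below. Here `cached_answer`, `fresh_answer`, and `embed` are hypothetical callables standing in for the cache lookup, a real inference call, and an embedding call; agreement is scored by cosine similarity between the two answers' embeddings, with the 0.85 cutoff as an example value.

```python
import numpy as np

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def audit_cache(samples, cached_answer, fresh_answer, embed, min_agreement=0.85):
    """For each sampled question, compare the cached answer with what a
    fresh inference produces, scoring agreement via embedding similarity.
    Returns the fraction of samples where the cache still agrees."""
    agree = 0
    for question in samples:
        a_cached = cached_answer(question)   # answer the cache would return
        a_fresh = fresh_answer(question)     # answer a new inference returns
        if similarity(embed(a_cached), embed(a_fresh)) >= min_agreement:
            agree += 1
    return agree / len(samples)
```

A low agreement rate on a sample of questions would be the signal to evict or refresh those cache entries.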
I think the key to building really performant AI systems is to establish a framework for measuring prompt performance.
Ah, okay. That sounds good. I think a ten-second response is about right for chat completions. This chart shows chat completion times for a basic Q&A system I built on an embedding architecture. Over the long run, it averages 6.7 seconds per query.
I don’t break out the embedding time versus the completion time. Still, each 6.7s process involves two API calls: one to the ADA model to generate an embedding vector for the query, and another to perform a completion based on the data located through the embedding vector search.
Remember, a vector similarity comparison is also performed on a relatively small data set between the first OpenAI call and the second. The 6.7s average encompasses all of these calls and processes, so it’s pretty quick. It uses text-davinci-003 for the completions; maybe that’s why my performance is acceptable.
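For what it’s worth, that two-call flow can be expressed as a small pipeline with per-stage timing, which makes it easy to see where the 6.7s goes. The `embed`, `search`, and `complete` callables below are placeholders for the actual embedding call, local vector scan, and completion call; this is a sketch of the shape of the pipeline, not any particular vendor’s API.

```python
import time
import numpy as np

def answer_query(query, embed, search, complete):
    """Two-call Q&A pipeline: embed the query, find the most relevant
    context by vector similarity, then run a completion over it.
    Returns the answer plus per-stage timings in seconds."""
    timings = {}

    t0 = time.perf_counter()
    qvec = embed(query)               # API call 1: embedding model
    timings["embed"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    context = search(qvec)            # local vector similarity scan
    timings["search"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    answer = complete(query, context) # API call 2: completion model
    timings["complete"] = time.perf_counter() - t0

    return answer, timings
```

Logging the `timings` dict per request is what lets you attribute latency to the embedding call versus the completion call rather than guessing.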
That doesn’t seem like it would cause sluggish responses, although I would suggest you share a complete sample prompt so some of the experts here can comment and understand what’s really in the API call.
Yeah, that makes sense. Have you benchmarked the process to see which part of the prompt is taking the most time? Perhaps remove the two additional tasks (what changed and why) to see whether either or both of those are slowing it down.
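A crude way to run that experiment is to time several prompt variants (full prompt, without the "what changed" task, without the "why" task) against the same completion function. In the sketch below, `fn` is whatever issues the request; the variant names are just examples, and this is a harness outline rather than a claim about any particular API.

```python
import statistics
import time

def benchmark(fn, prompts, runs=3):
    """Time fn over several prompt variants to see which parts of the
    prompt drive latency. Returns the median wall time per variant."""
    results = {}
    for name, prompt in prompts.items():
        times = []
        for _ in range(runs):
            t0 = time.perf_counter()
            fn(prompt)                # the completion request under test
            times.append(time.perf_counter() - t0)
        results[name] = statistics.median(times)  # median resists outliers
    return results
```

If the "no why" variant is consistently faster than the full prompt, you’ve found your culprit without changing anything else in the system.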
Also, I recommend you give PaLM 2 a try using the same prompts. I’ve noticed it’s roughly 2.5x faster than GPT-3.5.
I may have some ideas about using embeddings to speed this up. No time today though.