I’ve created a free tool here to help software developers, except every time I call the OpenAI API it takes an incredibly long time to get a response. This will most likely turn away anyone wanting to use the tool. How do other applications speed up their API calls to gpt-3.5-turbo?
Some calls take as long as 80 seconds, while others take 10 seconds.
It’s a really good question. I suspect many AI applications are not designed in ways that prioritize performance. Our initial attempts to build an FAQ system had 20+ second response times. This was untenable. We have it down to about 3.8 seconds with embeddings and GPT-3. But soon we will move it all to PaLM embeddings and completions, where we’re seeing less than 2 seconds for multiple answer-processing calls. Part of this slow march to higher performance relates to our vector data handling and other caching techniques.
That’s very impressive that you got it down to just a few seconds. I’m dealing with a similar project right now. Would you be so kind as to share what you’ve focused on to improve performance? Thank you!
Yeah, so there are many little things, but a big aha moment for me was establishing a hash index in memory representing historical responses. This makes it possible to lean solely on embeddings to craft a response. Consider that a single embedding can be generated in about 400ms. My process using this cache index involves just two steps:
1. Generate the embedding vector for the query (~400ms)
2. Perform dot-product (similarity) tests against the in-memory cache (~250ms)
High historical similarities (above a minimum threshold) represent opportunities to reuse previous inference responses. It makes it possible to generate near-instant answers to questions.
The beauty of this is that step one is required for every response in a Q&A app anyway. Step two costs only a quarter of a second, so you stand to short-circuit an entire inference request by investing in a cheap look-back query.
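The two-step lookup described above can be sketched roughly like this. This is a minimal illustration, not my production code: `embed()` is a stand-in for a real embedding API call (here it returns a deterministic toy vector so the snippet runs offline), and the 0.9 threshold is an arbitrary example value.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder for a real embedding API call (e.g., an ada-style model).
    # Returns a deterministic unit vector so the example is self-contained.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(32)
    return v / np.linalg.norm(v)

class ResponseCache:
    """In-memory index of (embedding, answer) pairs from past inferences."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.vectors = []   # unit embedding vectors of past questions
        self.answers = []   # parallel list of the responses we served

    def add(self, question: str, answer: str) -> None:
        self.vectors.append(embed(question))
        self.answers.append(answer)

    def lookup(self, question: str):
        if not self.vectors:
            return None
        qvec = embed(question)                 # step 1: embed the query
        sims = np.stack(self.vectors) @ qvec   # step 2: dot-product scan
        best = int(np.argmax(sims))
        if sims[best] >= self.threshold:
            return self.answers[best]          # reuse the prior response
        return None                            # fall through to inference
```

Anything under the threshold falls through to a normal inference call, whose answer you then `add()` back into the cache for next time.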
Many additional ideas come to mind that I have not explored. Imagine a historical cache based on question popularity and other measures that predict which questions are worth caching. The opportunities to build additional performance measures using AI itself are vast.
Lastly, imagine the historical cache is periodically tested against actual new questions to see whether the cached recommendation is as good as a fresh inference would have been. Embeddings, once again, make this possible: you can easily perform the look-back and compare that answer to what a new inference would have produced had there been no cache to lean on. Thus, the snake eats its tail to get better and better.
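That validation loop might look something like the sketch below. Here `cached_answer`, `fresh_answer`, and `embed` are hypothetical callables standing in for the cache lookup, a real inference call, and an embedding call; agreement is scored by cosine similarity between the two answers' embeddings, with the 0.85 cutoff as an example value.

```python
import numpy as np

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def audit_cache(samples, cached_answer, fresh_answer, embed, min_agreement=0.85):
    """For each sampled question, compare the cached answer with what a
    fresh inference produces, scoring agreement via embedding similarity.
    Returns the fraction of samples where the cache still agrees."""
    agree = 0
    for question in samples:
        a_cached = cached_answer(question)   # answer the cache would return
        a_fresh = fresh_answer(question)     # answer a new inference returns
        if similarity(embed(a_cached), embed(a_fresh)) >= min_agreement:
            agree += 1
    return agree / len(samples)
```

A low agreement rate on a sample of questions would be the signal to evict or refresh those cache entries.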
I think the key to building really performant AI systems is to establish a framework for measuring prompt performance.
Ah, okay. That sounds good. I think a ten-second response is about right for chat completions. This chart shows chat completion times for a basic Q&A system I built on an embedding architecture. Over the long run, it averages 6.7 seconds per query.
I don’t break out the embedding time versus the completion time. Still, each 6.7s process involves two API calls: one to the ADA model to generate an embedding vector for the query, and another to perform a completion based on the data located through the embedding vector search.
Remember, a vector similarity comparison is also performed on a relatively small data set between the first OpenAI call and the second. The 6.7s average encompasses all of these calls and processes, so it’s pretty quick. It uses text-davinci-003 for the completions; maybe that’s why my performance is acceptable.
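For what it’s worth, that two-call flow can be expressed as a small pipeline with per-stage timing, which makes it easy to see where the 6.7s goes. The `embed`, `search`, and `complete` callables below are placeholders for the actual embedding call, local vector scan, and completion call; this is a sketch of the shape of the pipeline, not any particular vendor’s API.

```python
import time
import numpy as np

def answer_query(query, embed, search, complete):
    """Two-call Q&A pipeline: embed the query, find the most relevant
    context by vector similarity, then run a completion over it.
    Returns the answer plus per-stage timings in seconds."""
    timings = {}

    t0 = time.perf_counter()
    qvec = embed(query)               # API call 1: embedding model
    timings["embed"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    context = search(qvec)            # local vector similarity scan
    timings["search"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    answer = complete(query, context) # API call 2: completion model
    timings["complete"] = time.perf_counter() - t0

    return answer, timings
```

Logging the `timings` dict per request is what lets you attribute latency to the embedding call versus the completion call rather than guessing.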
That doesn’t seem like it would cause sluggish responses, although I would suggest you share a complete sample prompt so some of the experts here can comment and understand what’s really in the API call.
Yeah, that makes sense. Have you benchmarked the process to see which part of the prompt is taking the most time? Perhaps remove the two additional tasks (what changed and why) to see whether either or both of those are slowing it down.
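A crude way to run that experiment is to time several prompt variants (full prompt, without the "what changed" task, without the "why" task) against the same completion function. In the sketch below, `fn` is whatever issues the request; the variant names are just examples, and this is a harness outline rather than a claim about any particular API.

```python
import statistics
import time

def benchmark(fn, prompts, runs=3):
    """Time fn over several prompt variants to see which parts of the
    prompt drive latency. Returns the median wall time per variant."""
    results = {}
    for name, prompt in prompts.items():
        times = []
        for _ in range(runs):
            t0 = time.perf_counter()
            fn(prompt)                # the completion request under test
            times.append(time.perf_counter() - t0)
        results[name] = statistics.median(times)  # median resists outliers
    return results
```

If the "no why" variant is consistently faster than the full prompt, you’ve found your culprit without changing anything else in the system.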
Also, I recommend you give PaLM 2 a try using the same prompts. I’ve noticed it’s roughly 2.5x faster than GPT-3.5.
I may have some ideas about using embeddings to speed this up. No time today though.