How to speed up OpenAI API calls

jacknoordhuizen · May 15, 2023, 4:10pm

I’ve created a free tool here to help software developers except every time I call the OpenAI API it takes incredibly long time to get a response. This will most likely turn away anyone wanting to use the tool. How do other applications speed up the API calls using gpt-3.5-turbo?

Some calls take as long as 80 seconds while others are 10 seconds

bill.french · May 15, 2023, 4:21pm

It’s a really good question. I suspect many AI applications are not designed in ways that accelerate performance. Our initial attempts to build an FAQ system had 20-second+ response times. This was untenable. We have it down to about 3.8 seconds with embeddings and GPT3. But soon, we will move it all to PaLM embeddings and completions where we’re seeing less than 2 seconds for multiple answer-processing calls. Part of this slow march to higher performance relates to our vector data handling and other caching techniques.

filipp.trigub · May 16, 2023, 2:37pm

Hi Bill,

that’s very impressive that you got it down to just a few seconds. I’m dealing with a similar project right now. Would you be so kind to share, what you’ve focused on to improve performance? Thank you!

bill.french · May 16, 2023, 3:21pm

Yeah, so there are many little things, but a big ah-ha moment for me was establishing a hash index in memory representing historical responses. This makes it possible to lean solely on embeddings to craft a response. Consider that a single embedding can be executed in about 400ms. My process using this cache index involves just two steps:

Generates the embedding vector for the query (~400ms)
Performs dot product (similarity) tests against the in-memory cache (~250ms)

High historical similarities (based on a minimum threshold) represent opportunities to regurgitate previous inference responses. It makes it possible to generate near-instant answers to questions.

The beauty of this is that step one is required for all responses (in a Q&A app). Step two is only 1/4 second, so you hope to short-circuit an inference request by investing in a look-back query.

Many additional ideas come to mind that I have not explored. Imagine the historical cache was based on question popularity and other measures that predict the types of questions that should be cached. The opportunities to build additional performance measures using AI itself are vast.

Lastly, imagine the historical cache is periodically tested against actual new questions to see if the earlier recommendation is as predictable as an actual new inference would have been. Embeddings - once again - make this possible because you can easily perform the look-back and compare that answer to what a new inference would have been had there been no cache to lean on - thus, the snake is eating its tail to get better and better.

I think the key to building really performant AI systems is to establish a framework for measuring prompt performance.

filipp.trigub · May 25, 2023, 9:58am

Thank you very much, thats indeed an insightful solution!

uriv · May 26, 2023, 10:06am

One way to speed it up if you’re expecting repeated calls is caching.

Me and a friend built a managed caching service for this:

pip install rmmbr

from rmmbr import cloud_cache

n_called = 0

@cloud_cache(
    "https://rmmbr.net",
    "your-service-token",
    "some name for the cache",
    60 * 60 * 24, # TTL is one day.
    "your-encryption-key",
)
async def f(x: int):
  nonlocal n_called
  n_called += 1
  return x

await f(3)
await f(3)
# nCalled is 1 here

gabriel_jorge · June 21, 2023, 4:33pm

Hey Bill!

The strategy using embeddings only works for simple cases as FAQ systems? If not, how can I use it more complex cases, for example an English grammar en vocabulary checker?

bill.french · June 21, 2023, 6:45pm

I don’t know.

The science of vocabulary and validation is a bit beyond my skill set. I assume you’re asking in the context of OpenAI performance, right?

gabriel_jorge · June 21, 2023, 8:42pm

Yeah, I’m talking about performance, today it’s taking about 10 seconds to get a result using gpt-3.5-turbo

bill.french · June 21, 2023, 9:37pm

And by “result”, you mean from what exactly?

gabriel_jorge · June 21, 2023, 10:31pm

Sorry, by result I want to mean the return from the open AI API

bill.french · June 21, 2023, 10:43pm

That part I get. What I don’t know is what you were asking for. A text completion? A chat completion? A vector embedding?

gabriel_jorge · June 21, 2023, 11:10pm

Oh sure, I’m using the chat completions, sending, into the messages array, 1 system role, 2 users and 1 assistant as example.

system > user > assistant (answer example) > user

Btw, I’m using to improve English grammar, it basically receive a sentence and should rewrite this sentence showing what and why changed

bill.french · June 22, 2023, 12:23am

Ah, okay. That sounds good. I think a ten-second response is about right for chat completions. This chart shows chat completions for a system I built that’s a basic Q&A system built on an embedding architecture. Over the long run, it’s averaging 6.7 seconds per query.

I don’t break out the embedding vs. the completion time. Still, for each 6.7s process, two API calls are made - one to the ADA model to generate an embedding vector for the query and another to perform a completion based on the data located through the embedding vector search.

Remember, a vector similarity comparison process is also performed on a relatively small data set between the first OpenAI call and the second one. The 6.7s average encompasses all these calls and processes, so it’s pretty quick. It uses davinci-003 for the completions. Maybe that’s why my performance is acceptable.

That doesn’t seem like it would cause sluggish responses, although I would suggest you share a complete sample prompt so some of the experts here can comment and understand what’s really in the API call.

Yeah, that makes sense. Have you benchmarked the process to see what in the prompt is taking the most time? Perhaps removing the two additional tasks (what changed and why) to see if either or both of those are causing it to slow down.

Also, I recommend you give PaLM 2 a try using the same prompts. I’ve noticed it in 2.5x faster than GPT 3.5.

I may have some ideas about using embeddings to speed this up. No time today though.

SomebodySysop · June 22, 2023, 4:28am

What tools are available for benchmarking?

My chat completion process right now:

question
moderation → openai
standalone question → openai
concepts → openai
context doc retrieval → weaviate
completion → openai

I suspect that most of my current time lag is with Weaviate, but right now I don’t have a way to measure that.

novaphil · June 22, 2023, 4:31am

How are you running those requests? Whatever code is running them could at least log duration/start/end times to a log file

SomebodySysop · June 22, 2023, 5:23am

Excellent idea. Implementing now.

SomebodySysop · June 22, 2023, 7:14am

I take that back. Turns out Weaviate, on queries at least, is typically the fastest API call:

On a sample query:

.99 openai
1.2 openai
0.6 weaviate
2.67 openai

Not lightening fast, but not nearly as slow as I thought. I’m also doing some local database processing (for permissions and access control) so that adds time also.

Anyway, thanks for the suggestion!

bill.french · June 22, 2023, 11:39am

Often, the best AI solutions are built with tools and approaches that have little to do with AI.

bill.french · June 22, 2023, 11:43am

Yep. Weaviate and all vector indices are designed to be blistering fast.

You should log all of these processes independently to rule out the possibility they are each responsible for latency.

I’m a huge fan of the accuracy of GPT models, but for giggles, I also benchmark the same processes against PaLM 2 and a Hugging Face model or two if response time is absolutely critical.

Lastly, I look for parts of the process that could benefit from a cache.

Topic		Replies	Views
API "gpt-3.5-turbo" Sucks (Slow) API	21	9816	December 16, 2023
GPT-4 API to slow when you have to work with a 46 second time out API	11	2778	July 30, 2023
Response speed with semantic searching API	2	1263	December 29, 2023
How can I improve response times from the OpenAI API while generating responses based on our knowledge base? API chatgpt , api	3	22402	November 9, 2023
Slow Chat api responses ------ API	17	6470	December 24, 2023

How to speed up OpenAI API calls

Related topics