How to speed up OpenAI API calls

I’ve created a free tool here to help software developers, but every time I call the OpenAI API it takes an incredibly long time to get a response. This will most likely turn away anyone wanting to use the tool. How do other applications speed up API calls when using gpt-3.5-turbo?

Some calls take as long as 80 seconds, while others take 10 seconds.

3 Likes

It’s a really good question. I suspect many AI applications are not designed in ways that accelerate performance. Our initial attempts to build an FAQ system had 20-second+ response times. This was untenable. We have it down to about 3.8 seconds with embeddings and GPT-3. But soon we will move it all to PaLM embeddings and completions, where we’re seeing less than 2 seconds for multiple answer-processing calls. Part of this slow march to higher performance relates to our vector data handling and other caching techniques.

3 Likes

Hi Bill,

that’s very impressive that you got it down to just a few seconds. I’m dealing with a similar project right now. Would you be so kind as to share what you’ve focused on to improve performance? Thank you!

2 Likes

Yeah, so there are many little things, but a big ah-ha moment for me was establishing a hash index in memory representing historical responses. This makes it possible to lean solely on embeddings to craft a response. Consider that a single embedding can be executed in about 400ms. My process using this cache index involves just two steps:

  1. Generate the embedding vector for the query (~400ms)
  2. Perform dot-product (similarity) tests against the in-memory cache (~250ms)

High historical similarities (based on a minimum threshold) represent opportunities to regurgitate previous inference responses. It makes it possible to generate near-instant answers to questions.

The beauty of this is that step one is required for all responses (in a Q&A app). Step two is only 1/4 second, so you hope to short-circuit an inference request by investing in a look-back query.
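
In rough Python terms, the idea looks something like this. This is a minimal sketch, assuming the pre-1.0 openai package and ada-002 embeddings; the threshold, cache layout, and helper names are placeholders rather than my production code:

import numpy as np
import openai

# Hypothetical in-memory cache of (query embedding, question, answer) triples.
cache: list[tuple[np.ndarray, str, str]] = []
SIMILARITY_THRESHOLD = 0.92  # placeholder value; tune against your own data

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

def run_inference(question: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
    )
    return resp["choices"][0]["message"]["content"]

def answer(question: str) -> str:
    q_vec = embed(question)  # step 1 (~400ms)
    for vec, _question, cached_answer in cache:  # step 2 (~250ms)
        # ada-002 vectors are unit length, so a dot product is a cosine similarity.
        if float(np.dot(q_vec, vec)) >= SIMILARITY_THRESHOLD:
            return cached_answer  # near-instant: no new inference call
    fresh = run_inference(question)  # cache miss: pay for a full completion
    cache.append((q_vec, question, fresh))
    return fresh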

Many additional ideas come to mind that I have not explored. Imagine the historical cache was based on question popularity and other measures that predict the types of questions that should be cached. The opportunities to build additional performance measures using AI itself are vast.

Lastly, imagine the historical cache is periodically tested against actual new questions to see if the earlier recommendation is as predictable as an actual new inference would have been. Embeddings - once again - make this possible because you can easily perform the look-back and compare that answer to what a new inference would have been had there been no cache to lean on - thus, the snake is eating its tail to get better and better.
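
Reusing the helpers from the sketch above (again, hypothetical names and an arbitrary threshold), that periodic self-check could be as simple as:

import random

def audit_random_cache_entry(threshold: float = 0.9) -> None:
    # Re-run inference for one cached question and check whether the cached answer
    # still closely matches what a fresh completion would say; evict it if not.
    if not cache:
        return
    idx = random.randrange(len(cache))
    _q_vec, question, cached_answer = cache[idx]
    fresh_answer = run_inference(question)
    similarity = float(np.dot(embed(cached_answer), embed(fresh_answer)))
    if similarity < threshold:
        del cache[idx]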

I think the key to building really performant AI systems is to establish a framework for measuring prompt performance.

6 Likes

Thank you very much, that’s indeed an insightful solution!

1 Like

One way to speed it up if you’re expecting repeated calls is caching.

A friend and I built a managed caching service for this:

pip install rmmbr

import asyncio

from rmmbr import cloud_cache

n_called = 0

@cloud_cache(
    "https://rmmbr.net",
    "your-service-token",
    "some name for the cache",
    60 * 60 * 24,  # TTL is one day.
    "your-encryption-key",
)
async def f(x: int):
    # Count how many times the wrapped function actually runs (i.e. cache misses).
    global n_called
    n_called += 1
    return x

async def main():
    await f(3)
    await f(3)
    # n_called is 1 here: the second call was served from the cache.

asyncio.run(main())

1 Like

Hey Bill!

Does the strategy using embeddings only work for simple cases like FAQ systems? If not, how can I use it for more complex cases, for example an English grammar and vocabulary checker?

1 Like

I don’t know.

The science of vocabulary and validation is a bit beyond my skill set. I assume you’re asking in the context of OpenAI performance, right?

Yeah, I’m talking about performance. Today it’s taking about 10 seconds to get a result using gpt-3.5-turbo.

And by “result”, you mean from what exactly?

Sorry, by result I mean the response from the OpenAI API.

That part I get. What I don’t know is what you were asking for. A text completion? A chat completion? A vector embedding?

Oh sure, I’m using chat completions, sending a messages array with 1 system message, 2 user messages, and 1 assistant message as an example.

system > user > assistant (answer example) > user

Btw, I’m using it to improve English grammar. It basically receives a sentence and should rewrite it, showing what changed and why.
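
For reference, the call looks roughly like this with the pre-1.0 openai Python package. The prompt and example sentences are only illustrations, not my real prompt:

import openai

# Message order mirrors: system > user > assistant (answer example) > user.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are an English grammar checker. Rewrite the user's sentence and explain what changed and why."},
        {"role": "user", "content": "She go to school yesterday."},
        {"role": "assistant", "content": "She went to school yesterday. ('go' -> 'went': 'yesterday' requires the past tense.)"},
        {"role": "user", "content": "He don't like apples."},
    ],
)
print(response["choices"][0]["message"]["content"])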

Ah, okay. That sounds good. I think a ten-second response is about right for chat completions. This chart shows response times for a basic Q&A system I built on an embedding architecture. Over the long run, it’s averaging 6.7 seconds per query.

I don’t break out the embedding vs. the completion time. Still, for each 6.7s process, two API calls are made - one to the ADA model to generate an embedding vector for the query and another to perform a completion based on the data located through the embedding vector search.

Remember, a vector similarity comparison process is also performed on a relatively small data set between the first OpenAI call and the second one. The 6.7s average encompasses all these calls and processes, so it’s pretty quick. It uses davinci-003 for the completions. Maybe that’s why my performance is acceptable.
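
In outline, each 6.7s query follows roughly this shape. This is a minimal sketch with the pre-1.0 openai package; the prompt wording and the in-memory doc list are assumptions on my part:

import numpy as np
import openai

def rag_answer(question: str, docs: list[tuple[np.ndarray, str]]) -> str:
    # Call 1: get an embedding vector for the query from the ada model.
    q_vec = np.array(openai.Embedding.create(
        model="text-embedding-ada-002", input=question
    )["data"][0]["embedding"])
    # Local step: vector similarity search over a relatively small in-memory doc set.
    context = max(docs, key=lambda d: float(np.dot(q_vec, d[0])))[1]
    # Call 2: a completion grounded on the data located through the vector search.
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Answer the question using only this context:\n\n{context}\n\nQ: {question}\nA:",
        max_tokens=300,
    )
    return resp["choices"][0]["text"].strip()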

That doesn’t seem like it would cause sluggish responses, although I would suggest you share a complete sample prompt so some of the experts here can comment and understand what’s really in the API call.

Yeah, that makes sense. Have you benchmarked the process to see what in the prompt is taking the most time? Perhaps remove the two additional tasks (what changed and why) to see if either or both of those are causing it to slow down.

Also, I recommend you give PaLM 2 a try using the same prompts. I’ve noticed it is about 2.5x faster than GPT-3.5.

I may have some ideas about using embeddings to speed this up. No time today though.

2 Likes

What tools are available for benchmarking?

My chat completion process right now:

question
moderation → openai
standalone question → openai
concepts → openai
context doc retrieval → weaviate
completion → openai

I suspect that most of my current time lag is with Weaviate, but right now I don’t have a way to measure that.

How are you running those requests? Whatever code is running them could at least log duration/start/end times to a log file.
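
Something along these lines would do it. This is a minimal sketch using Python's standard time and logging modules; the step and function names are placeholders for your pipeline:

import logging
import time
from contextlib import contextmanager

logging.basicConfig(filename="latency.log", level=logging.INFO)

@contextmanager
def timed(step: str):
    # Log the wall-clock duration of one pipeline step to the log file.
    start = time.perf_counter()
    try:
        yield
    finally:
        logging.info("%s took %.2fs", step, time.perf_counter() - start)

# Hypothetical usage around the steps you listed:
# with timed("moderation"):
#     flagged = call_openai_moderation(question)
# with timed("context doc retrieval"):
#     docs = query_weaviate(question)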

2 Likes

Excellent idea. Implementing now.

I take that back. Turns out Weaviate, on queries at least, is typically the fastest API call:

On a sample query:

0.99s openai
1.2s openai
0.6s weaviate
2.67s openai

Not lightning fast, but not nearly as slow as I thought. I’m also doing some local database processing (for permissions and access control), so that adds time as well.

Anyway, thanks for the suggestion!

1 Like

Often, the best AI solutions are built with tools and approaches that have little to do with AI. :wink:

Yep. Weaviate and all vector indices are designed to be blisteringly fast.

You should log all of these processes independently so you can rule each one in or out as a source of latency.

I’m a huge fan of the accuracy of GPT models, but for giggles, I also benchmark the same processes against PaLM 2 and a Hugging Face model or two if response time is absolutely critical.

Lastly, I look for parts of the process that could benefit from a cache.

2 Likes