How can I improve response times from the OpenAI API while generating responses based on our knowledge base?

Hello, I have a Node.js project where I utilize Microsoft’s Cognitive Search for indexed searching. This allows me to perform structured searches on my own knowledge base. Once I receive the query response, I make a call to the OpenAI API to generate the natural language response.

I chose this approach because I was inspired by a recent example published by Microsoft that showcased the powerful implementation achieved by combining both tools. Moreover, it simplifies the process of updating information automatically and in real-time, making my knowledge base scalable. I’m even able to extract text from images, which is truly amazing.

However, I have noticed that the response from OpenAI is somewhat slow. I’ve come across suggestions that this might be due to the lengthy input in terms of the number of tokens, resulting in longer processing times.

I would greatly appreciate any recommendations you can provide to help improve the response times. I understand that server performance and memory also play a role, but I specifically seek techniques to enhance response times.

I have considered utilizing conversation history, employing natural language libraries to identify frequently asked questions, and reducing input length. However, these ideas are currently scattered, and I would highly value your guidance in determining the best course of action.
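One way to act on the frequently-asked-questions idea is a small in-memory cache keyed on a normalized form of the query, so that repeated questions skip the API call entirely. This is a minimal sketch, assuming a hypothetical `generateAnswer(query)` function that wraps the existing Cognitive Search + OpenAI call. The normalization rule and the cache size are illustrative choices, not recommendations:

```javascript
// Normalize a query so trivially different spellings share one cache key.
function normalizeQuery(query) {
  return query
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "") // drop punctuation, keep letters and digits
    .replace(/\s+/g, " ")
    .trim();
}

// Wrap an async answer function with a bounded in-memory cache.
// `generateAnswer` is a hypothetical function: (query) => Promise<string>.
function withAnswerCache(generateAnswer, maxEntries = 500) {
  const cache = new Map(); // a Map iterates in insertion order, oldest first

  return async function cachedAnswer(query) {
    const key = normalizeQuery(query);
    if (cache.has(key)) {
      return cache.get(key); // cache hit: no API round trip
    }
    const answer = await generateAnswer(query);
    if (cache.size >= maxEntries) {
      // Evict the oldest entry to keep memory bounded.
      cache.delete(cache.keys().next().value);
    }
    cache.set(key, answer);
    return answer;
  };
}
```

A cache like this only helps with exact repeats after normalization. Matching paraphrased questions would need something more, such as embedding similarity, which is outside this sketch.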

Here is a typical example of what the API returns:

usage: {prompt_tokens: 526, completion_tokens: 175, total_tokens: 701}

The response time was 17001 ms, using the gpt-3.5-turbo model. The information retrieved from my knowledge base usually runs to 300-500 tokens per response.

const system = `You are an enthusiastic representative of (NAMEOFAPP), dedicated to helping people. You have extensive knowledge of (NAMEOFAPP) and its systems, including (NAMEOFAPP) and (NAMEOFAPP). You are asked to answer questions using only the information provided in the (NAMEOFAPP) and (NAMEOFAPP) documentation. Please avoid copying the text verbatim and try to be brief in your answers. If necessary, you can structure the text in steps and attach URLs to provide a more visual understanding of how to use the applications. For example: Step 1. Enter the link. If you are not sure of the answer or there is not enough information, indicate that you do not know and answer: "Unfortunately, that question is not related to (NAMEOFAPP)." Then provide general information about (NAMEOFAPP) and offer to help with related topics.`;

const prompt = `Please answer this query: ${query}\n\n`
    + `Use only the following information:\n\n${responseFromCognitiveSearch.value[0].formattedText}`;
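The prompt above passes the retrieved passage through unchanged and leaves the answer length open. One option is to cap both when building the request. This is a minimal sketch, assuming a rough heuristic of about four characters per token (an approximation for English text, not the model's real tokenizer). The helper names `truncateToTokenBudget` and `buildChatRequest` and the default budgets are hypothetical; `max_tokens` is an existing Chat Completions parameter:

```javascript
const CHARS_PER_TOKEN = 4; // rough assumption for English text

// Cut text down to roughly `maxTokens` tokens, preferring a sentence boundary.
function truncateToTokenBudget(text, maxTokens) {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (text.length <= maxChars) return text;
  const clipped = text.slice(0, maxChars);
  const lastStop = clipped.lastIndexOf(". ");
  // Keep whole sentences when a boundary exists in the second half.
  return lastStop > maxChars / 2 ? clipped.slice(0, lastStop + 1) : clipped;
}

// Build the request body with bounded input and bounded output.
function buildChatRequest(system, query, context, opts = {}) {
  const { contextTokens = 300, maxOutputTokens = 150 } = opts;
  const trimmed = truncateToTokenBudget(context, contextTokens);
  return {
    model: "gpt-3.5-turbo",
    max_tokens: maxOutputTokens, // hard cap on generated tokens
    messages: [
      { role: "system", content: system },
      {
        role: "user",
        content: `Please answer this query: ${query}\n\nUse only the following information:\n\n${trimmed}`,
      },
    ],
  };
}
```

Note that `max_tokens` cuts the answer off mid-sentence when the cap is reached, so it works best together with a system prompt that already asks for brief answers.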

// Structure of the JSON request body (curl)

{
        "model": "gpt-3.5-turbo",
        "messages": [
                { "role": "system", "content": system },
                { "role": "user", "content": prompt }
        ]
}

As a learner, I am seeking guidance and assistance regarding improving response times from the OpenAI API while generating responses based on our knowledge base. I would greatly appreciate any help and advice that experienced developers or community members can provide. Thank you in advance for your support.


Unfortunately, the API responses are often quite slow. Unless you can find ways to shorten your expected responses, I’m not sure there’s much you can do to reliably improve response times. In my experience, requesting outputs as short as possible has been the most effective approach, but that’s harder to do for an open-ended task like this (my use case basically involved extracting a few variables from a natural language prompt, so I was able to request very short responses).

You can also request streaming responses so the user will see the text appear word by word, which some may find a better experience than waiting for the entire response.
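Here is a minimal sketch of streaming in Node.js 18+ (which ships a global `fetch`), assuming the API key is in an `OPENAI_API_KEY` environment variable. With `"stream": true`, the Chat Completions endpoint replies with server-sent events: lines of the form `data: {json}`, ending with `data: [DONE]`. The helper names `parseSseBuffer` and `streamChat` are illustrative, and error handling is kept to a minimum:

```javascript
// Parse complete "data: ..." lines from an SSE buffer.
// Returns the text found, any incomplete trailing line, and a done flag.
function parseSseBuffer(buffer) {
  const lines = buffer.split("\n");
  const rest = lines.pop(); // the last piece may be a partial line
  let text = "";
  let done = false;
  for (const line of lines) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue; // skip blank separator lines
    const payload = trimmed.slice(5).trim();
    if (payload === "[DONE]") {
      done = true;
      break;
    }
    const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
    if (delta) text += delta;
  }
  return { text, rest, done };
}

// Stream a chat completion and hand each text fragment to `onText`.
async function streamChat(messages, onText) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({ model: "gpt-3.5-turbo", messages, stream: true }),
  });
  if (!res.ok) throw new Error(`OpenAI API returned ${res.status}`);

  const decoder = new TextDecoder();
  let buffer = "";
  for await (const chunk of res.body) {
    buffer += decoder.decode(chunk, { stream: true });
    const { text, rest, done } = parseSseBuffer(buffer);
    buffer = rest; // carry the partial line over to the next chunk
    if (text) onText(text);
    if (done) break;
  }
}
```

A call such as `streamChat(messages, (t) => process.stdout.write(t))` prints fragments as they arrive. Streaming does not shorten the total generation time; it only reduces the wait before the first words appear.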


There are things you can do! Summarizing a list I just wrote up:

  1. Reduce output token count
  2. Switch from GPT-4 to GPT-3.5
  3. Switch to Azure-hosted APIs
  4. Parallelize your calls
  5. Stream output and use stop sequences

A bit more detail on each in the original post: Making GPT API responses faster.
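Points 4 and 5 can be sketched without committing to a particular client library. The sketch below assumes a hypothetical `callModel(requestBody)` function that sends one Chat Completions request and resolves with the answer text. `stop` is an existing request parameter that ends generation as soon as the model would emit one of the given strings; the stop string `"\n\n"` is only an example and would cut off multi-paragraph answers:

```javascript
// Build one small request per question instead of one large combined prompt.
function buildRequests(system, questions, stop = ["\n\n"]) {
  return questions.map((question) => ({
    model: "gpt-3.5-turbo",
    stop, // generation halts at the first stop sequence
    messages: [
      { role: "system", content: system },
      { role: "user", content: question },
    ],
  }));
}

// Send all requests concurrently; results keep the order of `questions`.
// `callModel` is a hypothetical function: (requestBody) => Promise<string>.
async function answerInParallel(callModel, system, questions) {
  const requests = buildRequests(system, questions);
  const settled = await Promise.allSettled(requests.map((r) => callModel(r)));
  return settled.map((result) =>
    result.status === "fulfilled"
      ? { ok: true, answer: result.value }
      : { ok: false, error: String(result.reason) }
  );
}
```

With this shape, the total wall-clock time is roughly that of the slowest call rather than the sum of all calls. `Promise.allSettled` is used so that one failed request does not discard the other answers. API rate limits may still force a lower level of concurrency than "everything at once".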


OpenAI’s Chat Completions API takes about 20 s on average to return a reply for me. Poe, on the other hand, averages under 2 s for the same prompt, reply, and model (3.5 Turbo). How are they able to get a reply so fast?