GPT-3.5 Turbo API response is slow

Why is the GPT-3.5 Turbo API response becoming slower day by day, and is there any way to overcome this? I am using the GPT-3.5 Turbo API for content generation. Previously a generation took less than a minute; now it is taking 2 to 3 minutes.

1 Like

Hi,

Model performance will vary depending on the time of day and the day of the week. It is also dependent on the number of concurrent users and the amount of compute available: as more people join, more compute gets added, and sometimes more users are on the service than at other times.

The basic fact is that there is a limited compute resource shared by a variable number of users. Things will get better with time, but there will be periods where output is slower than at others, and you should build that fact into your product design.
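To give a rough idea of what I mean by building it in, here is a minimal sketch using the pre-1.0 Python SDK; the timeout value, retry policy, and model name are illustrative assumptions, not recommendations:

import time

import openai  # pre-1.0 SDK (pip install "openai<1.0")

openai.api_key = "sk-..."  # your API key


def chat_with_retry(messages, model="gpt-3.5-turbo", max_retries=3):
    """Call the chat endpoint with a hard timeout and exponential backoff.

    Illustrative sketch only: tune the timeout and retry policy to your own latency budget.
    """
    delay = 2
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(
                model=model,
                messages=messages,
                request_timeout=60,  # fail fast instead of hanging past your own upstream timeouts
            )
        except (openai.error.Timeout,
                openai.error.APIError,
                openai.error.ServiceUnavailableError,
                openai.error.RateLimitError):
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error to the caller
            time.sleep(delay)  # back off before retrying
            delay *= 2


# Streaming (stream=True) is another lever: you get the first tokens within a couple of
# seconds even when the full completion takes much longer, which helps avoid 60-second
# web-server timeouts.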

2 Likes

I've got a similar problem. My calls used to take about 10 seconds to get a response, and for the last week or so they have been taking 40-60 seconds. I get the points about how performance will vary, but this has been consistently slow. I'm averaging 50 requests a day, so I'm well within the 3,500 RPM limit. Some functions of my site have just stopped working because of timeout errors, particularly when responses go over 60 seconds.

I also understand the point that this should be built into the product design, but my question is: what changed? Why was it so efficient for the last few months and then change like this in the last week? And my most important question: how do I get it back to being quick?

1 Like

Same for me: things changed last week, the responses became slow, and right now it sometimes throws a service unavailable error.

1 Like

Hi and welcome to the Developer Forum!

Well, this is a plot of the last month's worth of test calls, made every 5 minutes for 256 tokens against both GPT-3.5 and GPT-4. I see the odd blip, as you will find on any remote API, but I do not see any slowing trend or any sudden jumps.

1 Like

Hi,

I've seen you posting this on some of the slowness threads, but I need some more context. Are these all your calls? This seems to be limited to an ever-growing subset of users.

I'm pretty upset with the whole thing. I began prototyping a feature in an app that would respond within five seconds. That is over 60 seconds now, and if anything, my prompt has become more concise, not less.

OpenAI's silence is a tough pill to swallow, especially as every third post in the last week has been about this issue.

1 Like

These are API calls made by a third party to track response times; it's a useful metric to have when checking whether there is an issue. There are millions of API users, and the forum concentrates those having issues into one narrow point, which can make an issue seem more widespread than it is. The forum also gets hundreds of posts per day and thousands of new users signing up to view posts.

I understand that it's frustrating to have an issue that is affecting your experience. I think I have seen around 20-30 posts in total, maybe a few more.

So far I've not seen any API-calling source code, and only perhaps two screenshots of poor response times. What would be super helpful is details:

  • what is your server OS
  • how much RAM does it have
  • how many CPUs does it have
  • how much drive space is there
  • what is the speed of the network connection
  • is the machine a VM or standalone
  • what version of Python is installed
  • what version of Node.js
  • what version of PHP
  • what version of the OpenAI library is installed
  • code snippets of API calls
  • code snippets of prompt setup and API setup
  • what are your monthly spend limits
  • what are your rate limits
  • which models are you calling
  • is your account prepay or pay-as-you-go

Without details it's very hard to make any kind of informed investigation into what might be the cause.

These were the same calls at the same rate on Enterprise, and the prompts and other details have not changed since it was consistently 10-15 seconds just a couple of weeks ago.

Please can you answer the questions from the list I posted above, thank you.

This is simply avoidance and blame.

The gpt-3.5-turbo-instruct model is still fast for those affected; the slowdown is in gpt-3.5-turbo itself, down to under 10 tokens per second.

Let's link to just one of multiple threads, 41 posts long.

A typical tokens-per-second rate is like what I get: 25 to 50 tokens per second, not 5-10.

For 2 trials of gpt-3.5-turbo @ 2023-10-17 05:09PM:

| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | Min: 0.501 | Max: 0.604 | Avg: 0.552 |
| total response (s) | Min: 2.8842 | Max: 2.9052 | Avg: 2.895 |
| total rate | Min: 34.421 | Max: 34.672 | Avg: 34.546 |
| stream rate | Min: 41.5 | Max: 43.0 | Avg: 42.250 |
| response tokens | Min: 100 | Max: 100 | Avg: 100.000 |

For 2 trials of gpt-3.5-turbo-instruct @ 2023-10-17 05:09PM:

| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | Min: 0.229 | Max: 0.795 | Avg: 0.512 |
| total response (s) | Min: 1.273 | Max: 1.8421 | Avg: 1.558 |
| total rate | Min: 54.286 | Max: 78.555 | Avg: 66.421 |
| stream rate | Min: 94.5 | Max: 94.8 | Avg: 94.650 |
| response tokens | Min: 100 | Max: 100 | Avg: 100.000 |
Try-it-Yourself Python code, compare chat to instruct, producing forum markdown

(You can increase the number of trial runs per model or include more models in the list if desired)

import openai  # requires pip install "openai<1.0" (this script uses the pre-1.0 SDK interface)
import tiktoken  # requires pip install tiktoken
import time
import json
openai.api_key = "sk-2156a65Y"  # replace with your own API key


class Tokenizer:
    def __init__(self, encoder="cl100k_base"):
        self.tokenizer = tiktoken.get_encoding(encoder)

    def count(self, text):
        return len(self.tokenizer.encode(text))


class BotDate:
    def __init__(self):
        self.created_time = time.time()
        self.start_time = 0

    def start(self):
        return time.strftime("%Y-%m-%d %I:%M%p", time.localtime(self.created_time))

    def now(self):
        return time.strftime("%Y-%m-%d %I:%M%p", time.localtime(time.time()))

    def set(self):
        self.start_time = time.time()

    def get(self):
        return round(time.time() - self.start_time, 4)

models = ['gpt-3.5-turbo', 'gpt-3.5-turbo-instruct']
bdate = BotDate()
tok = Tokenizer()
latency = 0
stats = {model: {"latency (s)": [],"total response (s)": [],"total rate": [],
                 "stream rate": [],"response tokens": [],} for model in models}
trials = 2
max_tokens = 100
prompt = "Write an article about kittens, 80 paragraphs"

for i in range(trials):  # number of trials per model
    for model in models:
        bdate.set()
        if model.endswith("instruct"):  # the instruct model uses the completions endpoint
            response = openai.Completion.create(
                prompt=prompt,
                model=model,
                top_p=0.01, stream=True, max_tokens=max_tokens+1)
        else:
            response = openai.ChatCompletion.create(
                messages=[
                          # {"role": "system", "content": "You are a helpful assistant"},
                          {"role": "user", "content": prompt}],
                model=model,
                top_p=0.01, stream=True, max_tokens=max_tokens)

        # capture the words emitted by the response generator
        reply = ""
        for chunk in response:
            if reply == "":
                latency_s = bdate.get()
            if not chunk['choices'][0]['finish_reason']:
                if not chunk['object'] == "chat.completion.chunk":
                    reply += chunk['choices'][0]['text']  # completions (instruct) stream chunk
                else:
                    # chat stream chunk; the first delta may carry only the role, so default to ""
                    reply += chunk['choices'][0]['delta'].get('content', '')
                print(".", end="")
        total_s = bdate.get()
        # extend model stats lists with total, latency, tokens for model
        stats[model]["latency (s)"].append(round(latency_s,4))
        stats[model]["total response (s)"].append(round(total_s,4))
        tokens = tok.count(reply)
        stats[model]["response tokens"].append(tokens)
        stats[model]["total rate"].append(round(tokens/total_s, 3))
        stats[model]["stream rate"].append(round((tokens-1)/(1 if (total_s-latency_s) == 0 else (total_s-latency_s)), 1))

print("\n")
for key in stats:
    print(f"### For {trials} trials of {key} @ {bdate.now()}:")
    print("| Stat | Minimum | Maximum | Average |")
    print("| --- | --- | --- | --- |")
    for sub_key in stats[key]:
        values = stats[key][sub_key]
        min_value = min(values)
        max_value = max(values)
        avg_value = sum(values) / len(values)
        print(f"| {sub_key} | Min: {min_value} | Max: {max_value} | Avg: {avg_value:.3f} |")
    print()

Hey @Foxalabs Please find the details and share your feedback or views.
P.S. This is a personal project

  • What is your server OS - Ubuntu 20.04

  • how much RAM does it have - 16 GB

  • how many cpuā€™s does it have - 4

  • How much drive space is there - 500GB

  • what is the speed of the network connection - 900 Mbps (download and upload)

  • is the machine a VM or standalone - Standalone

  • what version of python is installed - 3.8.16

  • what version of Node.JS - Node v18.16.0

  • what version of the OpenAI library is installed - 4.7.0

  • code snippets of prompt setup and API setup
    const template = `
    # Context
    ## Original Requirements
    ${user_input}

      # Sections list
      [
          "Purpose: Outline what is this requirement is for and purpose of this",
          "Introduction: Introduction to the document and the requirement",
          "Glossary: If you used any short forms of words add it and add the full form of that particular words",
          "Key Stakeholders: individuals, groups, or organizations that have a vested interest or influence in the project",
          "Scope: Project scope defines the boundaries, deliverables, and objectives of a project. It outlines what the project will accomplish and what it will not",
          "Project Objectives: Project objectives are specific, measurable, and achievable goals or outcomes that a project is designed to accomplish",
          "Business Requirements: Detailed description of the needs and expectations of the organization regarding the project, include functional requirements and non-functional requirements as subheadings and add content",
          "Success Metrics: also known as key performance indicators (KPIs) or performance measures, are quantifiable criteria used to evaluate the achievement of objectives",
          "Risk Strategies: proactive approaches that organizations and project managers use to identify, assess, mitigate, and respond to risks that may impact a project's success",
          "Project Constraints: limitations or restrictions that can affect the planning, execution, and completion of the project",
          "User Stories: Provided as a list, scenario-based user stories, If the requirement itself is simple, the user stories should also be less"
      ]
      -----
      Role: You are a professional product manager; the goal is to design a concise, usable, efficient product
      Requirements: According to the context/original requirements given, generate a business requirement document with elaborative manner. If the requirements are unclear, ensure minimum viability and avoid excessive design
      ATTENTION: Use '##' to SPLIT SECTIONS, not '#'. AND '## <SECTION_NAME>'. Output carefully referenced with given 'Sections list' and make it elaborative. Be RELIABLE, and make sure to generate all the section content and don't just provide data that need to added, and please generate all the content.
      `;
    
      const rePromptTemplate = `
      Based on response generated by previous conversation, Add more details to previously generated response according to following context and regenerate business requirement document completely from beginning to end.
      # Context
      ${user_input}
      `;
    

console.log('templateeeeee', template);

if (Array.isArray(this.messages) && this.messages.length >= 2) {
  this.addMessage("user", rePromptTemplate);
} else {
  this.addMessage("user", template);
}
const chat = await openai.chat.completions.create({
  messages: this.messages,
  model: "gpt-3.5-turbo-16K",
  temperature: this.temp
});
  • which models are you calling - GPT-3.5 Turbo-16K

  • is your account pre pay or pay as you go - prepaid

1 Like

Thank you for taking the time to reply. I'm not going to list all of the technical specs of the server, but essentially it's a low-end machine receiving roughly 50 calls a day. The thing that is confusing me the most is the variables: I could run my site, making API calls, on any machine and any network and get a similar response time of roughly 10 seconds. I'm still getting similar response times across various machines and networks, but the time has increased to (at times) over a minute.

At this stage I'm going for the high-tech 'turn it off and on again' approach, which for me means trying a new API key, or maybe a new account. I was hoping to find a fix for this problem on these forums, since the problem seems to be fairly consistent. If this works, I'll let you know. If it doesn't, I'll post your requested information.

If you could let us know the results of your tests, that would be awesome. I don't doubt for one moment that what you are experiencing is real; it's just hard to pin down without information about the environments and accounts involved, to look for commonalities and then isolate them.

Much appreciated. If you change the model to the non-16K one, does the issue stop?

Yes @Foxalabs, it's working, but the result is not generated completely: with the 4K model the output stops about halfway through, and I need to send "Continue" to get the full generation.

That's why I shifted from the 4K to the 16K model before this: the 16K model generated the data in one go without breaking off in between. Initially everything went well, but over the last week the response time has increased, and the 16K model sometimes throws a timeout error or a service unavailable error. What are your views or input on this?
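For reference, the "Continue" workaround on the 4K model is roughly the loop below, sketched with the pre-1.0 Python SDK to match the benchmark script earlier in the thread rather than my actual Node code; the continue prompt, round limit, and model name are illustrative:

import openai  # pre-1.0 SDK

def generate_full_document(messages, model="gpt-3.5-turbo", max_rounds=4):
    """Keep asking the model to continue until it stops for a reason other than the token limit.

    Illustrative sketch: the continue prompt, round limit, and model are assumptions.
    """
    full_text = ""
    for _ in range(max_rounds):
        response = openai.ChatCompletion.create(model=model, messages=messages)
        choice = response["choices"][0]
        full_text += choice["message"]["content"]
        if choice["finish_reason"] != "length":
            break  # the model finished on its own, nothing was truncated
        # the output was cut off by the token limit: feed it back and ask for the rest
        messages.append({"role": "assistant", "content": choice["message"]["content"]})
        messages.append({"role": "user", "content": "Continue exactly where you left off."})
    return full_text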

Just gathering information at this stage, trying to look for common elements.

I used to get responses for 5 prompts in around 30 seconds, but for the last 2 days it has taken around 2 minutes, and sometimes it just throws a 503 error in the middle of the chat completion. I tried 2 API keys and it is still the same.

1 Like

I am seeing this too. GPT-3.5-turbo and GPT-3.5-turbo-16k are taking 40-60 seconds to send a response. This wasn't the case a few weeks ago. GPT-4 is BLAZING fast, so the point about heavy use on OpenAI's servers is not quite valid.

2 Likes

I am a new user of the GPT APIs. Yesterday I tried the gpt-3.5-turbo-0613 model, and every response took a minimum of 50 seconds, except in the cases when I sent the same prompt twice in a row. I don't know what the performance of this model was weeks ago, but this is definitely not the result I expected, especially when ChatGPT is much faster for the same prompts.

As mentioned by others, it's really hard to come up with a product offering on top of such poor performance, no matter how well you build that fact into the product design. Lots of product ideas are about making users' lives easier by taking away the complexity of experimenting with and crafting the right prompts, and thereby saving them a lot of time. But because of this performance, users will eventually switch to, or keep using, ChatGPT instead. Unless that's the goal, it doesn't help the people in this thread to build a product using the GPT APIs.

Hopefully the issues will be addressed soon.

2 Likes

That's true. A month ago the response took 3 seconds; now the same prompt takes 15-35 seconds. What is going on? It's not suitable for commercial usage anymore.