GPT-3.5 Turbo API response is slow

Why is the GPT-3.5 Turbo API response becoming slower day by day, and is there any way to overcome this? I am using the GPT-3.5 Turbo API for content generation. Previously a generation took less than a minute; now it is taking 2 to 3 minutes.

1 Like

Hi,

Model performance will vary depending on the time of day and the day of the week. It is also dependent on the number of concurrent users and the amount of compute available: as more people join, more compute gets added, and sometimes more users are on the service than at other times.

The basic fact is that there is a limited compute resource shared by a variable number of users. Things will get better with time, but there will be periods where output is slower than at others, and you should build that fact into your product design.
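To give a rough idea of what I mean by building it in, here is a minimal sketch using the pre-1.0 Python SDK; the timeout value, retry policy, and model name are illustrative assumptions, not recommendations:

import time

import openai  # pre-1.0 SDK (pip install "openai<1.0")

openai.api_key = "sk-..."  # your API key


def chat_with_retry(messages, model="gpt-3.5-turbo", max_retries=3):
    """Call the chat endpoint with a hard timeout and exponential backoff.

    Illustrative sketch only: tune the timeout and retry policy to your own latency budget.
    """
    delay = 2
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(
                model=model,
                messages=messages,
                request_timeout=60,  # fail fast instead of hanging past your own upstream timeouts
            )
        except (openai.error.Timeout,
                openai.error.APIError,
                openai.error.ServiceUnavailableError,
                openai.error.RateLimitError):
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error to the caller
            time.sleep(delay)  # back off before retrying
            delay *= 2


# Streaming (stream=True) is another lever: you get the first tokens within a couple of
# seconds even when the full completion takes much longer, which helps avoid 60-second
# web-server timeouts.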

2 Likes

I've got a similar problem. My calls used to take about 10 seconds to get a response, and for the last week or so they have been taking 40-60 seconds. I get the points about how performance will vary, but this has been consistently slow. I'm averaging 50 requests a day, so I'm well within the 3,500 RPM limit. Some functions of my site have just stopped working because of timeout errors, particularly when responses go over 60 seconds.

I also understand the point that this should be built into the product design, but my question is: what changed? Why was it so efficient for the last few months and then change like this in the last week? And my most important question: how do I get it back to being quick?

1 Like

Same for me: things changed last week, the responses became slow, and right now it sometimes throws a service unavailable error.

1 Like

Hi and welcome to the Developer Forum!

Well, this is a plot of the last month's worth of test calls, made every 5 minutes for 256 tokens against both GPT-3.5 and GPT-4. I see the odd blip, as you will find on any remote API, but I do not see any slowing trend or any sudden jumps.

1 Like

Hi,

I've seen you posting this on some of the slowness threads, but I need some more context. Are these all your calls? This seems to be limited to an ever-growing subset of users.

I'm pretty upset with the whole thing. I began prototyping a feature in an app that would respond within five seconds. That is over 60 seconds now, and if anything, my prompt has become more concise, not less.

OpenAI's silence is a tough pill to swallow, especially as every third post in the last week has been about this issue.

1 Like

These are API calls made by a third party to track response times; it's a useful metric to have when checking whether there is an issue. There are millions of API users, and the forum concentrates those having issues into one narrow point, which can make an issue seem more widespread than it is. The forum also gets hundreds of posts per day and thousands of new users signing up to view posts.

I understand that it's frustrating to have an issue that is affecting your experience. I think I have seen around 20-30 posts in total, maybe a few more.

So far I've not seen any API-calling source code, and only perhaps two screenshots of poor response times. What would be super helpful is details:

  • what is your server OS
  • how much RAM does it have
  • how many CPUs does it have
  • how much drive space is there
  • what is the speed of the network connection
  • is the machine a VM or standalone
  • what version of Python is installed
  • what version of Node.js
  • what version of PHP
  • what version of the OpenAI library is installed
  • code snippets of API calls
  • code snippets of prompt setup and API setup
  • what are your monthly spend limits
  • what are your rate limits
  • which models are you calling
  • is your account prepay or pay-as-you-go

Without details it's very hard to make any kind of informed investigation into what might be the cause.

These were the same calls at the same rate on Enterprise, and the prompts and other details have not changed since it was consistently 10-15 seconds just a couple of weeks ago.

Please can you answer the questions from the list I posted above, thank you.

This is simply avoidance and blame.

The gpt-3.5-turbo-instruct model is still fast for those affected; the slowdown is in gpt-3.5-turbo itself, down to under 10 tokens per second.

Let's link to just one of multiple threads, 41 posts long.

A typical tokens-per-second rate is like what I get: 25 to 50 tokens per second, not 5-10.

For 2 trials of gpt-3.5-turbo @ 2023-10-17 05:09PM:

| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | Min: 0.501 | Max: 0.604 | Avg: 0.552 |
| total response (s) | Min: 2.8842 | Max: 2.9052 | Avg: 2.895 |
| total rate | Min: 34.421 | Max: 34.672 | Avg: 34.546 |
| stream rate | Min: 41.5 | Max: 43.0 | Avg: 42.250 |
| response tokens | Min: 100 | Max: 100 | Avg: 100.000 |

For 2 trials of gpt-3.5-turbo-instruct @ 2023-10-17 05:09PM:

| Stat | Minimum | Maximum | Average |
| --- | --- | --- | --- |
| latency (s) | Min: 0.229 | Max: 0.795 | Avg: 0.512 |
| total response (s) | Min: 1.273 | Max: 1.8421 | Avg: 1.558 |
| total rate | Min: 54.286 | Max: 78.555 | Avg: 66.421 |
| stream rate | Min: 94.5 | Max: 94.8 | Avg: 94.650 |
| response tokens | Min: 100 | Max: 100 | Avg: 100.000 |
Try-it-Yourself Python code, compare chat to instruct, producing forum markdown

(You can increase the number of trial runs per model or include more models in the list if desired)

import openai  # requires pip install "openai<1.0" (this script uses the pre-1.0 SDK interface)
import tiktoken  # requires pip install tiktoken
import time
import json
openai.api_key = "sk-2156a65Y"  # replace with your own API key


class Tokenizer:
    def __init__(self, encoder="cl100k_base"):
        self.tokenizer = tiktoken.get_encoding(encoder)

    def count(self, text):
        return len(self.tokenizer.encode(text))


class BotDate:
    def __init__(self):
        self.created_time = time.time()
        self.start_time = 0

    def start(self):
        return time.strftime("%Y-%m-%d %I:%M%p", time.localtime(self.created_time))

    def now(self):
        return time.strftime("%Y-%m-%d %I:%M%p", time.localtime(time.time()))

    def set(self):
        self.start_time = time.time()

    def get(self):
        return round(time.time() - self.start_time, 4)

models = ['gpt-3.5-turbo', 'gpt-3.5-turbo-instruct']
bdate = BotDate()
tok = Tokenizer()
latency = 0
stats = {model: {"latency (s)": [],"total response (s)": [],"total rate": [],
                 "stream rate": [],"response tokens": [],} for model in models}
trials = 2
max_tokens = 100
prompt = "Write an article about kittens, 80 paragraphs"

for i in range(trials):  # number of trials per model
    for model in models:
        bdate.set()
        if model.endswith("instruct"):  # the instruct model uses the completions endpoint
            response = openai.Completion.create(
                prompt=prompt,
                model=model,
                top_p=0.01, stream=True, max_tokens=max_tokens+1)
        else:
            response = openai.ChatCompletion.create(
                messages=[
                          # {"role": "system", "content": "You are a helpful assistant"},
                          {"role": "user", "content": prompt}],
                model=model,
                top_p=0.01, stream=True, max_tokens=max_tokens)

        # capture the words emitted by the response generator
        reply = ""
        for chunk in response:
            if reply == "":
                latency_s = bdate.get()
            if not chunk['choices'][0]['finish_reason']:
                if not chunk['object'] == "chat.completion.chunk":
                    reply += chunk['choices'][0]['text']  # completions (instruct) stream chunk
                else:
                    # chat stream chunk; the first delta may carry only the role, so default to ""
                    reply += chunk['choices'][0]['delta'].get('content', '')
                print(".", end="")
        total_s = bdate.get()
        # extend model stats lists with total, latency, tokens for model
        stats[model]["latency (s)"].append(round(latency_s,4))
        stats[model]["total response (s)"].append(round(total_s,4))
        tokens = tok.count(reply)
        stats[model]["response tokens"].append(tokens)
        stats[model]["total rate"].append(round(tokens/total_s, 3))
        stats[model]["stream rate"].append(round((tokens-1)/(1 if (total_s-latency_s) == 0 else (total_s-latency_s)), 1))

print("\n")
for key in stats:
    print(f"### For {trials} trials of {key} @ {bdate.now()}:")
    print("| Stat | Minimum | Maximum | Average |")
    print("| --- | --- | --- | --- |")
    for sub_key in stats[key]:
        values = stats[key][sub_key]
        min_value = min(values)
        max_value = max(values)
        avg_value = sum(values) / len(values)
        print(f"| {sub_key} | Min: {min_value} | Max: {max_value} | Avg: {avg_value:.3f} |")
    print()

Hey @Foxalabs Please find the details and share your feedback or views.
P.S. This is a personal project

  • What is your server OS - Ubuntu 20.04

  • how much RAM does it have - 16 GB

  • how many cpuā€™s does it have - 4

  • How much drive space is there - 500GB

  • what is the speed of the network connection - 900 Mbps (download and upload)

  • is the machine a VM or standalone - Standalone

  • what version of python is installed - 3.8.16

  • what version of Node.JS - Node v18.16.0

  • what version of the OpenAI library is installed - 4.7.0

  • code snippets of prompt setup and API setup
    const template = `
    # Context
    ## Original Requirements
    ${user_input}

      # Sections list
      [
          "Purpose: Outline what is this requirement is for and purpose of this",
          "Introduction: Introduction to the document and the requirement",
          "Glossary: If you used any short forms of words add it and add the full form of that particular words",
          "Key Stakeholders: individuals, groups, or organizations that have a vested interest or influence in the project",
          "Scope: Project scope defines the boundaries, deliverables, and objectives of a project. It outlines what the project will accomplish and what it will not",
          "Project Objectives: Project objectives are specific, measurable, and achievable goals or outcomes that a project is designed to accomplish",
          "Business Requirements: Detailed description of the needs and expectations of the organization regarding the project, include functional requirements and non-functional requirements as subheadings and add content",
          "Success Metrics: also known as key performance indicators (KPIs) or performance measures, are quantifiable criteria used to evaluate the achievement of objectives",
          "Risk Strategies: proactive approaches that organizations and project managers use to identify, assess, mitigate, and respond to risks that may impact a project's success",
          "Project Constraints: limitations or restrictions that can affect the planning, execution, and completion of the project",
          "User Stories: Provided as a list, scenario-based user stories, If the requirement itself is simple, the user stories should also be less"
      ]
      -----
      Role: You are a professional product manager; the goal is to design a concise, usable, efficient product
      Requirements: According to the context/original requirements given, generate a business requirement document with elaborative manner. If the requirements are unclear, ensure minimum viability and avoid excessive design
      ATTENTION: Use '##' to SPLIT SECTIONS, not '#'. AND '## <SECTION_NAME>'. Output carefully referenced with given 'Sections list' and make it elaborative. Be RELIABLE, and make sure to generate all the section content and don't just provide data that need to added, and please generate all the content.
      `;
    
      const rePromptTemplate = `
      Based on response generated by previous conversation, Add more details to previously generated response according to following context and regenerate business requirement document completely from beginning to end.
      # Context
      ${user_input}
      `;
    

console.log('templateeeeee', template);

if (Array.isArray(this.messages) && this.messages.length >= 2) {
  this.addMessage("user", rePromptTemplate);
} else {
  this.addMessage("user", template);
}
const chat = await openai.chat.completions.create({
  messages: this.messages,
  model: "gpt-3.5-turbo-16K",
  temperature: this.temp
});
  • which models are you calling - GPT-3.5 Turbo-16K

  • is your account pre pay or pay as you go - prepaid

1 Like

Thank you for taking the time to reply. I'm not going to list all of the technical specs of the server, but essentially it's a low-end machine receiving roughly 50 calls a day. The thing that is confusing me the most is the variables: I could run my site, making API calls, on any machine and any network and get a similar response time of roughly 10 seconds. I'm still getting similar response times across various machines and networks, but the time has increased to (at times) over a minute.

At this stage I'm going for the high-tech 'turn it off and on again' approach, which for me means trying a new API key, or maybe a new account. I was hoping to find a fix for this problem on these forums, since the problem seems to be fairly consistent. If this works, I'll let you know. If it doesn't, I'll post your requested information.

If you could let us know the results of your tests, that would be awesome. I don't doubt for one moment that what you are experiencing is real; it's just hard to pin down without information about the environments and accounts involved, to look for commonalities and then isolate them.

Much appreciated. If you change the model to the non-16K one, does the issue stop?

Yes @Foxalabs, it's working, but the result is not generated completely: with the 4K model the output stops about halfway through, and I need to send "Continue" to get the full generation.

That's why I shifted from the 4K to the 16K model before this: the 16K model generated the data in one go without breaking off in between. Initially everything went well, but over the last week the response time has increased, and the 16K model sometimes throws a timeout error or a service unavailable error. What are your views or input on this?
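For reference, the "Continue" workaround on the 4K model is roughly the loop below, sketched with the pre-1.0 Python SDK to match the benchmark script earlier in the thread rather than my actual Node code; the continue prompt, round limit, and model name are illustrative:

import openai  # pre-1.0 SDK

def generate_full_document(messages, model="gpt-3.5-turbo", max_rounds=4):
    """Keep asking the model to continue until it stops for a reason other than the token limit.

    Illustrative sketch: the continue prompt, round limit, and model are assumptions.
    """
    full_text = ""
    for _ in range(max_rounds):
        response = openai.ChatCompletion.create(model=model, messages=messages)
        choice = response["choices"][0]
        full_text += choice["message"]["content"]
        if choice["finish_reason"] != "length":
            break  # the model finished on its own, nothing was truncated
        # the output was cut off by the token limit: feed it back and ask for the rest
        messages.append({"role": "assistant", "content": choice["message"]["content"]})
        messages.append({"role": "user", "content": "Continue exactly where you left off."})
    return full_text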

Just gathering information at this stage, trying to look for common elements.

I used to get responses for 5 prompts in around 30 seconds, but for the last 2 days it has taken around 2 minutes, and sometimes it just throws a 503 error in the middle of the chat completion. I tried 2 API keys and it is still the same.

1 Like

I am seeing this too. GPT-3.5-turbo and GPT-3.5-turbo-16k are taking 40-60 seconds to send a response. This wasn't the case a few weeks ago. GPT-4 is BLAZING fast, so the point about heavy use on OpenAI's servers is not quite valid.

2 Likes

I am a new user of the GPT APIs. Yesterday I tried the gpt-3.5-turbo-0613 model, and every response took a minimum of 50 seconds, except in the cases when I sent the same prompt twice in a row. I don't know what the performance of this model was weeks ago, but this is definitely not the result I expected, especially when ChatGPT is much faster for the same prompts.

As mentioned by others, it's really hard to come up with a product offering on top of such poor performance, no matter how well you build that fact into the product design. Lots of product ideas are about making users' lives easier by taking away the complexity of experimenting with and crafting the right prompts, and thereby saving them a lot of time. But because of this performance, users will eventually switch to, or keep using, ChatGPT instead. Unless that's the goal, it doesn't help the people in this thread to build a product using the GPT APIs.

Hopefully the issues will be addressed soon.

2 Likes

That's true. A month ago the response took 3 seconds; now the same prompt takes 15-35 seconds. What is going on? It's not suitable for commercial usage anymore.