gpt-3.5-turbo-1106 is very slow

gpt-3.5-turbo-1106 has been live for 2 days now.

It launched with super fast responses, then gradually slowed down to the point of being a problem: it used to take 1–2 seconds, now it takes more than 6 seconds.

This is the newest GPT-3.5 Turbo model announced by OpenAI.

So my question is: why does this keep happening?

I have done some research, and people seem to have several theories.

  1. Servers are overloaded.

Seriously? OpenAI does not have the money to buy more servers? I don’t think so.

  2. Technical problem.

Other engineers and I are ready to help if the issue is identified. But again, it was working fast at launch, so the problem seems to be somewhere else.

  3. Are they doing it intentionally?

This is also a theory that needs attention: is OpenAI intentionally slowing things down to force us to switch to more expensive models, or pulling some trick to make more money?

My online business and sales depend on the OpenAI API and its response speed.

That is why this is not only annoying but also harmful to businesses.

Please identify the problem and solve it ASAP; if there are other issues, let us know what is happening.

Is anyone else running into this issue?

2 Likes

Hi and welcome to the Developer Forum!

There are several reasons why this could be happening. The first is your account usage tier: if you are in Tier 1, 2 or 3, you may be on higher-latency servers. This can be rectified either by using and paying for more usage month on month, or by prepaying an amount to your credit balance and then waiting the required number of days; details can be found here: OpenAI Platform

Second is excess user load: there will be periods during the day when more users than average are online using the system and performance may drop. This will get better with time and additional compute allocation.

Third is deliberate disruption: a number of abnormal data patterns indicative of a DDoS attack have been detected over the past 24 hours, and there may still be isolated groups of attackers online causing system performance to drop.

  1. I was on Tier 2; I had spent about $50.

Tier 3 requires $150 spent on API
Tier 4 requires $250

@Foxalabs I have paid $250 and moved to Tier 4 to test your assumption.

This did NOT work, the response is still slow.

  2. Excess user load.
    This one is really surprising. Why does this problem still exist? OpenAI makes billions of dollars now, and just took $250 from me :slight_smile: why not buy 3x more servers than needed?

  3. Deliberate disruption. It was slowing down gradually, day by day, and it never got faster again; if this were the cause, it would have sped back up after the attacks stopped.

I think I have met all the requirements by prepaying the $250, and the issue still remains. What are the next steps?

1 Like

OK, so you have paid $250. You will notice the second requirement of at least 30 days since your first successful payment; how long ago was that payment?

I’ve been paying for the ChatGPT API since May 2023, so this is not the case either.

And has your tier changed from 3 to 4? It can also take time to be moved to a low-latency server. It should also be noted that in the wake of DevDay there is a significant increase in activity, and there have been possible DDoS attacks, all of which contribute to a reduction in performance.

The developer forum has no ability to look at accounts or access details related to your current usage. That is best done via the support bot on help.openai.com (bottom-right corner), leaving your contact details and a description of your issue.

Report for 3 trials of gpt-3.5-turbo-0613:

- total response time.Min:006.808 Max:007.171 Avg:006.948
- latency (s).........Min:000.430 Max:000.572 Avg:000.505
- response tokens.....Min:256.000 Max:256.000 Avg:256.000
- total rate..........Min:035.699 Max:037.601 **Avg:036.866**
- stream rate.........Min:038.303 Max:040.890 Avg:039.610

Report for 3 trials of gpt-3.5-turbo-1106:

- total response time.Min:002.942 Max:003.780 Avg:003.263
- latency (s).........Min:000.374 Max:000.529 Avg:000.442
- response tokens.....Min:256.000 Max:256.000 Avg:256.000
- total rate..........Min:067.718 Max:087.006 **Avg:079.398**
- stream rate.........Min:078.428 Max:101.246 Avg:091.455

It is fast.

I’ve noticed a good number of long pauses and a few timeouts.

There is a new finish_reason: content_filter. That means you’re getting moderated and it may be that some original prompts sit there waiting on a moderator.
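
A minimal sketch of how you could watch for that finish reason while streaming (this assumes the same openai 1.x client style as the benchmark code below; the model and prompt are just placeholders):

```python
# Sketch: watch streamed chunks for finish_reason == "content_filter".
# Assumes openai 1.x and the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": "Write an article about kittens"}],
    stream=True,
    max_tokens=256,
)

reply = ""
for chunk in stream:
    choice = chunk.choices[0]
    if choice.finish_reason == "content_filter":
        print("\n[response stopped by content_filter]")
        break
    reply += choice.delta.content or ""
```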

The actual payment made could take some time to actually be charged to the card and percolate to backend systems.

@_j of course it’s faster than the old and deprecated gpt-3.5-turbo-0613 model :slight_smile:

I’m saying that when it launched it was taking even less than 1 second, and now it has slowed down to 6 seconds. So it has the capability to be extremely fast.

And still, even now, folks from OpenAI are talking about server overloads.

This is very disappointing and will lead people to search for alternatives such as Llama and others, because OpenAI is making tons of money yet cannot solve this issue. I’m sure Google has 100x more traffic, so why aren’t they overloaded?

OpenAI should buy more servers and solve this issue. It was supposed to be open source, but now they take our money and should have responsibilities.

cc @Foxalabs

1 Like

Oh noes Llama that can make 80 tokens a second?

“I was the first one on the server and it was fast!”

“So angry that this is only twice the rate that gpt-3.5-turbo has been producing for the last months.”

Here’s free benchmark code I just ran, plus bonus utilities. If you aren’t getting the same rate at a tier higher than mine, wait for your payment to process.

```python
# python, with openai == 1.2.2 and tiktoken
import openai
import jsonschema
import time
import re
import os
import tiktoken
import httpx

openai.api_key = os.environ.get("OPENAI_API_KEY")  # set your key in the environment

from openai import OpenAI
client = OpenAI(timeout=httpx.Timeout(15.0, read=5.0, write=10.0, connect=3.0))

class Printer:
    """
    A class for formatted text output, supporting word wrapping, indentation and line breaks.

    Attributes:
        max_len (int): Maximum line length.
        indent (int): Indentation size.
        breaks (str): Characters treated as line breaks.
        line_length (int): Current line length.

    Methods:
        word(word): Prints a word with the defined formatting rules.
        document(text): Splits text into words and prints them with a short delay.
        reset(): Starts a new line without printing anything.
    """

    def __init__(self, max_len=80, indent=0, breaks=[" ", "-"]):
        self.max_len = max_len
        self.indent = indent
        self.breaks = breaks
        self.line_length = -1

    def reset(self):
        self.line_length = 0

    def document(self, text):
        # Define a regular expression pattern to split text into words
        word_pattern = re.compile(r"[\w']+|[.,!?;]")
        # Split the text into words including ending punctuation
        words = word_pattern.findall(text)
        for chunk in words:
            self.word(chunk)
            time.sleep(0.1)

    def word(self, word):
        if ((len(word) + self.line_length > self.max_len
                and (word and word[0] in self.breaks))
                or self.line_length == -1):
            print("")  # new line
            self.line_length = 0
            word = word.lstrip()
        if self.line_length == 0:  # Indent new lines
            print(" " * self.indent, end="")
            self.line_length = self.indent
        print(word, end="")
        if word.endswith("\n"):  # Indent after AI's line feed
            print(" " * self.indent, end="")
            self.line_length = self.indent
        self.line_length += len(word)


class Tokenizer:
    """ required: import tiktoken; import re;
    usage example:
        cl100 = Tokenizer()
        number_of_tokens = cl100.count("my string")
    """
    def __init__(self, model="cl100k_base"):
        self.tokenizer = tiktoken.get_encoding(model)
        self.chat_strip_match = re.compile(r'<\|.*?\|>')
        self.intype = None

    def ucount(self, text):
        encoded_text = self.tokenizer.encode(text)
        return len(encoded_text)

    def count(self, text):
        text = self.chat_strip_match.sub('', text)
        encoded_text = self.tokenizer.encode(text)
        return len(encoded_text)


class BotDate:
    """ .start/.now : object creation date/time; current date/time
        .set/.get   : start/reset timer, elapsed time
        .print      : formatted date/time from epoch seconds
    """
    def __init__(self, format_spec="%Y-%m-%d %H:%M%p"):
        self.format_spec = format_spec
        self.created_time = time.time()
        self.start_time = 0
        self.stats1 = []
        self.stats2 = []

    def stats_reset(self):
        self.stats1 = []
        self.stats2 = []
        

    def start(self):
        return self.format_time(self.created_time)

    def now(self):
        return self.format_time(time.time())

    def print(self, epoch_seconds): # format input seconds
        return self.format_time(epoch_seconds)

    def format_time(self, epoch_seconds):
        formatted_time = time.strftime(self.format_spec, time.localtime(epoch_seconds))
        return formatted_time

    def set(self):
        self.start_time = time.perf_counter()  # Record the current time when set is called

    def get(self):  # elapsed time value str
        if self.start_time is None:
            return "X.XX"
        else:
            elapsed_time = time.perf_counter() - self.start_time
            return elapsed_time


bdate = BotDate()
tok = Tokenizer()
p = Printer()
latency = 0
user = """Write an article about kittens""".strip()

models = ['gpt-3.5-turbo-1106', 'gpt-3.5-turbo-0613']
trials = 3
stats = {model: {"total response time": [],
                 "latency (s)": [],
                 "response tokens": [],
                 "total rate": [],
                 "stream rate": [],
                 } for model in models}

for i in range(trials):
    for model in models:
        print(f"\n[{model}]")
        time.sleep(.2)
        bdate.set()
        # call the chat API using the openai package and model parameters
        try:
            response = client.chat.completions.create(
                messages=[
                          # {"role": "system", "content": "You are a helpful assistant"},
                          {"role": "user", "content": user}],
                model=model,
                top_p=0.0, stream=True, max_tokens=256)
        except openai.APIConnectionError as e:
            print("The server could not be reached")
            print(e.__cause__)  # an underlying Exception, likely raised within httpx.
            continue  # skip this trial so the stream loop below isn't reached
        except openai.RateLimitError as e:
            print(f"OpenAI rate error {e.status_code}: {e.response}")
            continue
        except openai.APIStatusError as e:
            print(f"OpenAI error {e.status_code}: {e.response}")
            continue

        # capture the words emitted by the response generator
        reply = ""
        for part in response:
            if reply == "":
                latency = bdate.get()
            if not (part.choices[0].finish_reason):
                word = part.choices[0].delta.content or ""
                if reply == "" and word == "\n":
                    word = ""
                reply += word
                p.word(word)
        total = bdate.get()
        # extend model stats lists with total, latency, tokens for model
        stats[model]["total response time"].append(total)
        stats[model]["latency (s)"].append(latency)
        tokens = tok.count(reply)
        stats[model]["response tokens"].append(tokens)
        stats[model]["total rate"].append(tokens/total)
        stats[model]["stream rate"].append((tokens-1)/(1 if total-latency == 0 else total-latency))

print("\n\n")
for key in stats:
    print(f"Report for {trials} trials of {key}:")
    for sub_key in stats[key]:
        values = stats[key][sub_key]
        min_value = min(values)
        max_value = max(values)
        avg_value = sum(values) / len(values)
        print(f"- {sub_key.ljust(20, '.')}"
              f"Min:{str(f'{min_value:.3f}'.zfill(7))} "
              f"Max:{str(f'{max_value:.3f}'.zfill(7))} "
              f"Avg:{str(f'{avg_value:.3f}'.zfill(7))}")


    print()
```
2 Likes

Dub this. 1106 gets stuck every few dozen requests for me.

5 Likes

Hi, I am seeing the same problem. Does anyone know why? Is there a solution for it? Every dozen requests or so it gets stuck, and results are returned only after around 10 minutes.

3 Likes

I pinpointed the problem in my case (hopefully it will help you as well).

If you are expecting JSON output from gpt-3.5-turbo-1106 and passing the parameter "response_format": { "type": "json_object" }, you also need to indicate in the prompt that you want JSON output (and ideally a description of the fields); see the sketch at the end of this post.

Here’s the official info from https://platform.openai.com/docs/guides/text-generation/json-mode

  • When using JSON mode, always instruct the model to produce JSON via some message in the conversation, for example via your system message. If you don’t include an explicit instruction to generate JSON, the model may generate an unending stream of whitespace and the request may run continually until it reaches the token limit. To help ensure you don’t forget, the API will throw an error if the string "JSON" does not appear somewhere in the context.

In my case I had JSON mentioned in the prompt, but the instruction was to output JSON only under certain conditions, so I was not getting the error, but was getting the model stuck forever every 10–20 requests.
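
For reference, a minimal sketch of a request that follows that rule (standard openai 1.x client; the field names in the system message are placeholders, not my actual schema):

```python
# Sketch: JSON mode plus an unconditional "answer in JSON" instruction.
# Field names ('title', 'tags') are placeholders for your own schema.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    messages=[
        # "JSON" must appear in the context, and the instruction should be
        # unconditional, otherwise the model can loop on whitespace.
        {"role": "system", "content": "Always reply with a JSON object containing "
                                      "the keys 'title' (string) and 'tags' (list of strings)."},
        {"role": "user", "content": "Summarize this paragraph about kittens: ..."},
    ],
)
print(response.choices[0].message.content)
```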

1 Like

Tony, thank you for your prompt response. I am using LangChain for my application and I have tried passing the parameter you suggested as follows: model_kwargs={"response_format": {"type": "json_object"}}, but it is still getting stuck. Do you have any other ideas as to why this might be happening?

You need to provide your code, man. Guessing is not fun :slight_smile:
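
In the meantime, here’s a rough sketch of how I’d expect it to be wired up in LangChain (the ChatOpenAI import path and message classes depend on your LangChain version, and the keys in the system message are placeholders):

```python
# Sketch only: adjust the imports to your LangChain version.
from langchain.chat_models import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage

llm = ChatOpenAI(
    model="gpt-3.5-turbo-1106",
    temperature=0,
    model_kwargs={"response_format": {"type": "json_object"}},
)

messages = [
    # The unconditional "reply in JSON" instruction matters as much as
    # the response_format kwarg.
    SystemMessage(content="Always answer with a JSON object with the keys "
                          "'answer' (string) and 'confidence' (number)."),
    HumanMessage(content="Is the sky blue?"),
]
print(llm(messages).content)
```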

Hi @frunzghazaryan . Is the API still slow for you? I’m wondering if it’s worth upgrading. I am currently on Tier 1 and requests get stuck every few tries.

@ken0ryu yes, it is still slow, and the tier upgrade did not make any difference.

I am facing similar issues. I recently upgraded to 1106 and was happy to see the option to guarantee JSON responses. However, about 1 in 10–15 requests fails with a socket timeout on my end (I increased the timeout to up to 3 minutes just to check; what doesn’t respond in 60 seconds normally doesn’t respond at all). The same input with the same prompt works fine when sent again, and in such successful cases the response also arrives within a few seconds.
I already have JSON in my system message and in the response format. I have reported the issue via their chat support and am hoping it gets resolved.

In your prompt instructions, is it clear that the AI must (!) always output JSON? Do you give a specification of the output JSON in the prompt instructions?

The fact that it works 9 times out of 10, and the 10th is a timeout with no response at all, not even a finish reason, means that whether the desired output is JSON has nothing to do with your request not reaching a responsive model.

It can be cat facts or quantum physics: you just don’t get any reply, and you have to set a timeout to catch the lack of response.
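
If the request simply never returns, the practical workaround is a client-side timeout plus a retry, roughly like this sketch (openai 1.x client; the timeout values and the helper name are arbitrary, tune them to your own latency budget):

```python
# Sketch: give up on a stuck request after a read timeout and retry.
# Timeout values and the helper name are arbitrary choices.
import httpx
from openai import OpenAI, APITimeoutError

client = OpenAI(timeout=httpx.Timeout(30.0, connect=5.0))

def chat_with_retry(messages, model="gpt-3.5-turbo-1106", attempts=3):
    for attempt in range(1, attempts + 1):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except APITimeoutError:
            print(f"attempt {attempt} timed out, retrying...")
    raise RuntimeError("no response after retries")

result = chat_with_retry([{"role": "user", "content": "Reply with one cat fact."}])
print(result.choices[0].message.content)
```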

1 Like

For me with the right prompt it works without timeouts. Just did a test with 100+ cycles. Not a single timeout. Results:


LM params: {'temperature': 0, 'response_format': {'type': 'json_object'}, 'timeout': 5, 'model': 'gpt-3.5-turbo-1106'}
Generated successfully: 100.00% (110/110)
Valid responses: 90.00% (99/110)
Average generation time: 2.30

Valid responses: here I’m validating the JSON against a pydantic class.
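
Roughly like this sketch (the field names here are placeholders, not my actual schema):

```python
# Sketch: a response counts as "valid" only if it parses as JSON and
# matches a pydantic model. Field names are placeholders.
import json
from pydantic import BaseModel, ValidationError

class Answer(BaseModel):
    title: str
    tags: list[str]

def is_valid(raw: str) -> bool:
    try:
        Answer(**json.loads(raw))
        return True
    except (json.JSONDecodeError, ValidationError, TypeError):
        return False

print(is_valid('{"title": "Kittens", "tags": ["cats", "pets"]}'))  # True
print(is_valid('{"title": "Kittens"}'))                            # False, missing "tags"
```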