It launched with super-fast responses. Then it gradually slowed down to the point of being a problem: it used to take 1-2 seconds, and now it takes more than 6 seconds.
This is the newest GPT-3.5 Turbo model that OpenAI has just announced.
So my question is: why does this keep happening?
I have done some research, and people seem to have several theories.
Servers are overloaded.
Seriously? OpenAI doesn't have the money to buy more servers? I don't think so.
Technical problem
Other engineers and I will be ready to help if the issue is identified. But again, it was working fast at launch, so the problem seems to be somewhere else.
Are they doing it intentionally?
This is another theory that needs attention: is OpenAI intentionally slowing things down to force us to switch to more expensive models, or playing some other trick to make more money?
My online business and sales depend on the OpenAI API and its response speed.
That is why this is not only annoying but also harmful to businesses.
Please identify the problem and solve it ASAP, and if there are other issues, let us know what is happening.
There are several reasons why this is happening. The first is your account usage tier: if you are in Tier 1, 2 or 3, you may be on higher-latency servers. This can be rectified either by using and paying for more usage month on month, or by prepaying an amount to your credit balance and then waiting the required number of days; details can be found here: OpenAI Platform
Second is excess user load: there will be periods during the day when more users than average are online using the system and performance may drop. This will get better with time and additional compute allocation.
Third is deliberate disruption: a number of abnormal data patterns indicative of a DDoS attack have been detected over the past 24 hours, and there may still be isolated groups of attackers online causing system performance to drop.
Excess user load.
This one is really surprising: why does this problem still exist? OpenAI makes billions of dollars now and just took $250 from me, so why not buy 3x more servers than needed?
Deliberate disruption: it was slowing down gradually, day by day, and it was not getting faster again. If that were the case, it would have sped back up after the attacks stopped.
I think I have met all the requirements by prepaying the $250, and the issue still remains. What are the next steps?
OK, so you have paid $250, and you will notice the second requirement of at least 30 days since your first successful payment. How long ago was your first successful payment?
And has your tier changed from 3 to 4? It can also take time to be moved to a low-latency server. It should also be noted that in the wake of DevDay there has been a significant increase in activity, and there have been possible DDoS attacks, all of which contribute to a reduction in performance.
The developer forum has no ability to look at accounts or access details related to your current usage; that is best done by using the support bot at help.openai.com (bottom-right corner) and leaving your contact details and a description of your issue.
I’ve noticed a good number of long pauses and a few timeouts.
There is a new finish_reason: content_filter. That means you're getting moderated, and it may be that some original prompts sit there waiting on a moderator.
The payment could take some time to actually be charged to the card and percolate through to backend systems.
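For illustration, here is a minimal sketch (the prompt is made up) of checking finish_reason on a non-streaming call with the v1 Python client, to see whether a slow or truncated reply was actually cut short by moderation rather than by server load:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": "Write an article about kittens"}],
    max_tokens=256,
)
choice = response.choices[0]
if choice.finish_reason == "content_filter":
    print("Reply was cut off by the content filter")
else:
    print(choice.message.content)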
@_j Of course it's faster than the old and deprecated gpt-3.5-turbo-0613 model.
I'm saying that when it launched, it was taking even less than 1 second, and it has slowed down to 6 seconds now. So it has the capability to work extremely fast.
And still, even now, folks from OpenAI are talking about server overloads.
This is very disappointing and will push people to look for alternatives such as LLaMA and others, because OpenAI is making tons of money yet cannot solve this issue. I'm sure Google has 100x more traffic, so why aren't they overloaded?
OpenAI should buy more servers and solve this issue. It was supposed to be open source, but now they take money, so they should take responsibility.
“I was the first one on the server and it was fast!”
“So angry that this is only twice the rate that gpt-3.5-turbo has been producing for the last months.”
Free benchmark code I just ran, plus bonus utilities. If you aren't also getting the same rate at a higher tier than mine, wait for your payment to process.
Python, with openai == 1.2.2 and tiktoken:
import openai
import jsonschema
import time
import re
import os
import tiktoken
import httpx

from openai import OpenAI

# the client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI(timeout=httpx.Timeout(15.0, read=5.0, write=10.0, connect=3.0))
class Printer:
    """
    A class for formatted text output, supporting word wrapping, indentation and line breaks.

    Attributes:
        max_len (int): Maximum line length.
        indent (int): Indentation size.
        breaks (list): Characters treated as break points for wrapping.
        line_length (int): Current line length.

    Methods:
        word(word): Prints a word with the defined formatting rules.
        document(text): Splits text into words and prints them one at a time.
        reset(): Starts a new line without printing anything.
    """

    def __init__(self, max_len=80, indent=0, breaks=[" ", "-"]):
        self.max_len = max_len
        self.indent = indent
        self.breaks = breaks
        self.line_length = -1

    def reset(self):
        self.line_length = 0

    def document(self, text):
        # Define a regular expression pattern to split text into words
        word_pattern = re.compile(r"[\w']+|[.,!?;]")
        # Split the text into words including ending punctuation
        words = word_pattern.findall(text)
        for chunk in words:
            self.word(chunk)
            time.sleep(0.1)

    def word(self, word):
        if ((len(word) + self.line_length > self.max_len
                and (word and word[0] in self.breaks))
                or self.line_length == -1):
            print("")  # new line
            self.line_length = 0
            word = word.lstrip()
        if self.line_length == 0:  # Indent new lines
            print(" " * self.indent, end="")
            self.line_length = self.indent
        print(word, end="")
        if word.endswith("\n"):  # Indent after AI's line feed
            print(" " * self.indent, end="")
            self.line_length = self.indent
        self.line_length += len(word)
class Tokenizer:
    """ required: import tiktoken; import re;
    usage example:
        cl100 = Tokenizer()
        number_of_tokens = cl100.count("my string")
    """

    def __init__(self, model="cl100k_base"):
        self.tokenizer = tiktoken.get_encoding(model)
        self.chat_strip_match = re.compile(r'<\|.*?\|>')
        self.intype = None

    def ucount(self, text):
        encoded_text = self.tokenizer.encode(text)
        return len(encoded_text)

    def count(self, text):
        text = self.chat_strip_match.sub('', text)
        encoded_text = self.tokenizer.encode(text)
        return len(encoded_text)
class BotDate:
    """ .start/.now : object creation date/time; current date/time
        .set/.get   : start/reset timer, elapsed time
        .print      : formatted date/time from epoch seconds
    """

    def __init__(self, format_spec="%Y-%m-%d %H:%M%p"):
        self.format_spec = format_spec
        self.created_time = time.time()
        self.start_time = 0
        self.stats1 = []
        self.stats2 = []

    def stats_reset(self):
        self.stats1 = []
        self.stats2 = []

    def start(self):
        return self.format_time(self.created_time)

    def now(self):
        return self.format_time(time.time())

    def print(self, epoch_seconds):  # format input seconds
        return self.format_time(epoch_seconds)

    def format_time(self, epoch_seconds):
        formatted_time = time.strftime(self.format_spec, time.localtime(epoch_seconds))
        return formatted_time

    def set(self):
        self.start_time = time.perf_counter()  # Record the current time when set is called

    def get(self):  # elapsed time value
        if self.start_time is None:
            return "X.XX"
        else:
            elapsed_time = time.perf_counter() - self.start_time
            return elapsed_time
bdate = BotDate()
tok = Tokenizer()
p = Printer()
latency = 0
user = """Write an article about kittens""".strip()
models = ['gpt-3.5-turbo-1106', 'gpt-3.5-turbo-0613']
trials = 3
stats = {model: {"total response time": [],
                 "latency (s)": [],
                 "response tokens": [],
                 "total rate": [],
                 "stream rate": [],
                 } for model in models}

for i in range(trials):
    for model in models:
        print(f"\n[{model}]")
        time.sleep(.2)
        bdate.set()
        # call the chat API using the openai package and model parameters
        try:
            response = client.chat.completions.create(
                messages=[
                    # {"role": "system", "content": "You are a helpful assistant"},
                    {"role": "user", "content": user}],
                model=model,
                top_p=0.0, stream=True, max_tokens=256)
        except openai.APIConnectionError as e:
            print("The server could not be reached")
            print(e.__cause__)  # an underlying Exception, likely raised within httpx.
            raise
        except openai.RateLimitError as e:
            print(f"OpenAI rate limit error {e.status_code}: {e.response}")
            raise
        except openai.APIStatusError as e:
            print(f"OpenAI error {e.status_code}: {e.response}")
            raise

        # capture the words emitted by the response generator
        reply = ""
        for part in response:
            if reply == "":
                latency = bdate.get()
            if not (part.choices[0].finish_reason):
                word = part.choices[0].delta.content or ""
                if reply == "" and word == "\n":
                    word = ""
                reply += word
                p.word(word)
        total = bdate.get()

        # extend model stats lists with total, latency, tokens for model
        stats[model]["total response time"].append(total)
        stats[model]["latency (s)"].append(latency)
        tokens = tok.count(reply)
        stats[model]["response tokens"].append(tokens)
        stats[model]["total rate"].append(tokens/total)
        stats[model]["stream rate"].append((tokens-1)/(1 if total-latency == 0 else total-latency))

print("\n\n")
for key in stats:
    print(f"Report for {trials} trials of {key}:")
    for sub_key in stats[key]:
        values = stats[key][sub_key]
        min_value = min(values)
        max_value = max(values)
        avg_value = sum(values) / len(values)
        print(f"- {sub_key.ljust(20, '.')}"
              f"Min:{str(f'{min_value:.3f}'.zfill(7))} "
              f"Max:{str(f'{max_value:.3f}'.zfill(7))} "
              f"Avg:{str(f'{avg_value:.3f}'.zfill(7))}")
    print()
Hi, I am seeing the same problem. Does anyone know why? Is there a solution for it? Every dozen requests or so it gets stuck, and results are returned after around 10 minutes.
I pinpointed the problem in my case (hopefully it will help you as well).
If you are expecting JSON output from gpt-3.5-turbo-1106 and passing the parameter "response_format": { "type": "json_object" }, you also need to indicate in the prompt that you want JSON output (and ideally a description of the fields).
When using JSON mode, always instruct the model to produce JSON via some message in the conversation, for example via your system message. If you don’t include an explicit instruction to generate JSON, the model may generate an unending stream of whitespace and the request may run continually until it reaches the token limit. To help ensure you don’t forget, the API will throw an error if the string "JSON" does not appear somewhere in the context.
In my case I had JSON mentioned in the prompt, but the instruction was to output JSON only under certain conditions, so I was not getting the error but was getting the model stuck forever every 10-20 requests.
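For illustration, a minimal sketch (not the exact code from the case above; the schema is made up) of a JSON-mode request where the system message unconditionally demands JSON, which is what avoids the runaway-whitespace hang:

from openai import OpenAI

client = OpenAI(timeout=30)  # bound the request so a stuck call fails instead of hanging

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    messages=[
        # the word "JSON" must appear in the context, and the instruction is unconditional
        {"role": "system", "content": 'Always answer with a JSON object of the form '
                                      '{"summary": string, "keywords": [string]}.'},
        {"role": "user", "content": "Summarize: kittens are small cats."},
    ],
    max_tokens=256,  # hard cap in case the model still emits runaway whitespace
)
print(response.choices[0].message.content)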
Tony, thank you for your prompt response. I am using LangChain for my application and I have tried passing the parameter you suggested as follows: model_kwargs={"response_format": {"type": "json_object"}}, but it is still getting stuck. Do you have any other ideas as to why this might be happening?
Hi @frunzghazaryan. Is the API still slow for you? I'm wondering if it's worth upgrading. I am currently on Tier 1 and requests get stuck every few tries.
I am facing similar issues. I recently upgraded to 1106 and was happy to see the option to guarantee JSON responses. However, about 1 in 10-15 requests fails with a socket timeout on my end (I increased the timeout to up to 3 minutes just to check; what doesn't respond in 60 seconds normally doesn't respond at all). The same input with the same prompt works fine when sent again, and in those successful cases the response arrives within a few seconds.
I already have JSON in my system message and in the response format. I have reported the issue on their chat support and am hoping it gets resolved.
In your prompt instructions, is it clear that the AI must (!) always output JSON? Do you give a specification of the output JSON in the prompt instructions?
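As a hedged sketch only (the LangChain API varies by version, and the model_name, messages, and schema here are made up for illustration), the model_kwargs setup mentioned earlier could be paired with a system message that always demands JSON, something like:

from langchain.chat_models import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage

llm = ChatOpenAI(
    model_name="gpt-3.5-turbo-1106",
    model_kwargs={"response_format": {"type": "json_object"}},  # JSON mode
    request_timeout=60,  # fail instead of hanging on a stuck request
)
messages = [
    # unconditional JSON instruction, with the word "JSON" present in the context
    SystemMessage(content='Always reply with a JSON object like {"answer": "..."}.'),
    HumanMessage(content="Summarize: kittens are small cats."),
]
print(llm(messages).content)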
The fact that it works 9 times out of 10, and the 10th is a timeout with no response at all, not even a finish reason, means that whether the desired output is JSON or not has nothing to do with your request not connecting to a responsive model.
It could be cat facts or quantum physics; you just don't get any reply, and you would have to set a timeout to catch the lack of response.
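A minimal sketch of that timeout-and-retry approach with the v1 Python client (the model and messages are just placeholders):

import openai
from openai import OpenAI

# short per-request timeout so a hung request raises instead of blocking for minutes;
# retries handled manually below rather than by the client
client = OpenAI(timeout=20.0, max_retries=0)

def ask(messages, attempts=3):
    for attempt in range(attempts):
        try:
            return client.chat.completions.create(
                model="gpt-3.5-turbo-1106",
                messages=messages,
            )
        except openai.APITimeoutError:
            print(f"Timed out (attempt {attempt + 1}), retrying...")
    raise RuntimeError("No response after retries")

result = ask([{"role": "user", "content": "Write an article about kittens"}])
print(result.choices[0].message.content)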