Struggling with the speed of the API requests

AB29 · November 20, 2023, 9:56pm

Hi everyone,

My requests to the “gpt-3.5-turbo” model via the API are very slow, ~20-40 secs to execute 20 concurrent requests, sometimes up to ~600 secs and I wanted to know if there is anything I can do to improve this

I perform the requests via a Python script running locally on my machine, using the concurrent.futures / ThreadPoolExecutor to execute 20 requests in parallel at a time. Requests are ~280/50 token long I/O and I’m still on tier 1 as a user.

Thank you!

Foxalabs · November 20, 2023, 11:26pm

Hi and welcome to the Developer Forum!

OpenAI API services are ran on a Tiered system with the top tiers on lower latency servers. Tiers are based on API spend and can be increased by purchasing credits and waiting the required period of time, details here:

https://platform.openai.com/docs/guides/rate-limits/usage-tiers?context=tier-free

_j · November 20, 2023, 11:45pm

Usage limits is a new notation there to be aware of also. These pages should have a version history. Seemingly nonsensical except were you to try to ramp up to huge levels within weeks - for example when making a new second account.

oieieio · November 20, 2023, 11:59pm

Hi AB29, here are some results for tier 2 latency, this code was shared by a forum member. Runs in a colab notebook for me but you can tweak it for a local machine if you want to experiment.

import openai
import jsonschema
import time
import re
import os
import tiktoken
import httpx

from openai import OpenAI

client = OpenAI()

#openai.api_key = key

from openai import OpenAI
client = OpenAI(timeout=httpx.Timeout(15.0, read=5.0, write=10.0, connect=3.0))

class Printer:
    """
    A class for formatted text output, supporting word wrapping, indentation and line breaks.

    Attributes:
        max_len (int): Maximum line length.
        indent (int): Indentation size.
        breaks (str): Characters treated as line breaks.
        line_length (int): Current line length.

    Methods:
        print_word(word): Prints a word with the defined formatting rules.
        reset(): Starts a new line without printing anything.
    """

    def __init__(self, max_len=80, indent=0, breaks=[" ", "-"]):
        self.max_len = max_len
        self.indent = indent
        self.breaks = breaks
        self.line_length = -1

    def reset(self):
        self.line_length = 0

    def document(self, text):
        # Define a regular expression pattern to split text into words
        word_pattern = re.compile(r"[\w']+|[.,!?;]")
        # Split the text into words including ending punctuation
        words = word_pattern.findall(text)
        for chunk in words:
            self.word(chunk)
            time.sleep(0.1)

    def word(self, word):
        if ((len(word) + self.line_length > self.max_len
                and (word and word[0] in self.breaks))
                or self.line_length == -1):
            print("")  # new line
            self.line_length = 0
            word = word.lstrip()
        if self.line_length == 0:  # Indent new lines
            print(" " * self.indent, end="")
            self.line_length = self.indent
        print(word, end="")
        if word.endswith("\n"):  # Indent after AI's line feed
            print(" " * self.indent, end="")
            self.line_length = self.indent
        self.line_length += len(word)


class Tokenizer:
    """ required: import tiktoken; import re;
    usage example:
        cl100 = Tokenizer()
        number_of_tokens = cl100.count("my string")
    """
    def __init__(self, model="cl100k_base"):
        self.tokenizer = tiktoken.get_encoding(model)
        self.chat_strip_match = re.compile(r'<\|.*?\|>')
        self.intype = None

    def ucount(self, text):
        encoded_text = self.tokenizer.encode(text)
        return len(encoded_text)

    def count(self, text):
        text = self.chat_strip_match.sub('', text)
        encoded_text = self.tokenizer.encode(text)
        return len(encoded_text)


class BotDate:
    """ .start/.now : object creation date/time; current date/time
        .set/.get   : start/reset timer, elapsed time
        .print      : formatted date/time from epoch seconds
    """
    def __init__(self, format_spec="%Y-%m-%d %H:%M%p"):
        self.format_spec = format_spec
        self.created_time = time.time()
        self.start_time = 0
        self.stats1 = []
        self.stats2 = []

    def stats_reset(self):
        self.stats1 = []
        self.stats2 = []


    def start(self):
        return self.format_time(self.created_time)

    def now(self):
        return self.format_time(time.time())

    def print(self, epoch_seconds): # format input seconds
        return self.format_time(epoch_seconds)

    def format_time(self, epoch_seconds):
        formatted_time = time.strftime(self.format_spec, time.localtime(epoch_seconds))
        return formatted_time

    def set(self):
        self.start_time = time.perf_counter()  # Record the current time when set is called

    def get(self):  # elapsed time value str
        if self.start_time is None:
            return "X.XX"
        else:
            elapsed_time = time.perf_counter() - self.start_time
            return elapsed_time


bdate = BotDate()
tok = Tokenizer()
p = Printer()
latency = 0
user = """Write an article about kittens""".strip()

models = ['gpt-3.5-turbo-1106', 'gpt-3.5-turbo-0613']
trials = 3
stats = {model: {"total response time": [],
                 "latency (s)": [],
                 "response tokens": [],
                 "total rate": [],
                 "stream rate": [],
                 } for model in models}

for i in range(trials):
    for model in models:
        print(f"\n[{model}]")
        time.sleep(.2)
        bdate.set()
        # call the chat API using the openai package and model parameters
        try:
            response = client.chat.completions.create(
                messages=[
                          # {"role": "system", "content": "You are a helpful assistant"},
                          {"role": "user", "content": user}],
                model=model,
                top_p=0.0, stream=True, max_tokens=256)
        except openai.APIConnectionError as e:
            print("The server could not be reached")
            print(e.__cause__)  # an underlying Exception, likely raised within httpx.
        except openai.RateLimitError as e:
            print(f"OpenAI rate error {e.status_code}: (e.response)")
        except openai.APIStatusError as e:
            print(f"OpenAI error {e.status_code}: (e.response)")

        # capture the words emitted by the response generator
        reply = ""
        for part in response:
            if reply == "":
                latency = bdate.get()
            if not (part.choices[0].finish_reason):
                word = part.choices[0].delta.content or ""
                if reply == "" and word == "\n":
                    word = ""
                reply += word
                p.word(word)
        total = bdate.get()
        # extend model stats lists with total, latency, tokens for model
        stats[model]["total response time"].append(total)
        stats[model]["latency (s)"].append(latency)
        tokens = tok.count(reply)
        stats[model]["response tokens"].append(tokens)
        stats[model]["total rate"].append(tokens/total)
        stats[model]["stream rate"].append((tokens-1)/(1 if total-latency == 0 else total-latency))

print("\n\n")
for key in stats:
    print(f"Report for {trials} trials of {key}:")
    for sub_key in stats[key]:
        values = stats[key][sub_key]
        min_value = min(values)
        max_value = max(values)
        avg_value = sum(values) / len(values)
        print(f"- {sub_key.ljust(20, '.')}"
              f"Min:{str(f'{min_value:.3f}'.zfill(7))} "
              f"Max:{str(f'{max_value:.3f}'.zfill(7))} "
              f"Avg:{str(f'{avg_value:.3f}'.zfill(7))}")


    print()

_j · November 21, 2023, 12:13am

Wow, what a code mess. It’s like someone just pasted some random classes out of other libraries they wrote just to get the job done. (hint, it was me).

It seems that only tier 1 is really getting hit with slow output, for now. 40-50 tokens per second for gpt-3.5-turbo has and had been typical before some accounts were put into a degraded state, previously without reason given – and a lot faster than you can read.

oieieio · November 21, 2023, 12:15am

lol yes it is, but it works and thanks for sharing it

AB29 · November 21, 2023, 6:38pm

Thank you Foxabilo! Is there a way to accelerate the tier upgrade with pre-payment or something like that ?

logankilpatrick · December 4, 2023, 3:24pm

Hey! It turns out there was a bug on our end that could result in timeouts in certain scenarios. We have since fixed the issue. Please let us know in a new thread if you end up seeing similar issues again. Thanks again for reporting this!

logankilpatrick · December 4, 2023, 3:25pm

Yes, @AB29 you can pre-purchase credits which would allow your usage tier to go up, even if you don’t use all the credits right away.

Topic		Replies	Views
ChatGPT API responses are very slow API	31	28261	December 12, 2023
Pay to use GTP4 Turbo. Can't use it API chatgpt , api-rate-limits , account-tiers	11	3068	November 27, 2023
Chat Completion API super slow and hanging API	8	2049	December 13, 2023
How to speed up GPT4 generation Feedback gpt-4 , chatgpt , api	10	5661	January 29, 2024
Is there a faster way to get out of free tier? API	2	129	August 29, 2024

Struggling with the speed of the API requests

Related topics