gpt-3.5-turbo-0613 is 2-3x slower than gpt-3.5-turbo-0301

As of 29 Sep 2023, the latest gpt-3.5 model (gpt-3.5-turbo-0613) is significantly slower than the older legacy version (gpt-3.5-turbo-0301).

Here’s a snippet that summarises a paragraph of text with identical parameters, differing only in the model used.

Gist to reproduce the issue:

import time
import requests
import openai
from typing import List, Union, Tuple, Dict

def get_openai_api_response_sync(
    prompt_message: List, model: str, max_tokens: int,
) -> Union[Tuple[str, Dict], None]:
    with requests.Session() as session:
        openai.requestssession = session  # reuse one HTTP session for the call
        response = openai.ChatCompletion.create(
            messages=prompt_message,
            model=model,
            max_tokens=max_tokens,
        )
        if response:
            return response["choices"][0]["message"]["content"], dict(response["usage"])
    return None

prompt_message = [
    {
        "role": "system",
        "content": "Given a large piece of text, write a summary in 3-4 sentences",
    },
    {
        "role": "user",
        "content": "Text : Throughout the history of mankind, music has served as an essential element of society, representing the emotions, experiences, and philosophies of its composers and listeners alike. From the ancient flutes crafted by early humans to the complex symphonies penned in the age of Romanticism, music is an art form that expresses the vast spectrum of human sentiment. It transcends tangible barriers and abstract differences, soothing souls and inspiring change. Beyond its emotional appeal, music also wields the power to shape cognition and development in exquisite, profound ways. Children who learn music often exhibit improved cognitive skills and academic proficiency. Research has further divulged enhanced neural plasticity and memory consolidation in musicians. On a societal level, music taps into the cultural tapestry of races and nations, bridging diverse cultures and fostering communal harmony. It has been leveraged as a vehicle for change during pivotal historical events and stands as an unceasing source of unity and strength in turbulent times. Music, thus, is not just an auditory pleasure but a medium of holistic human expression, cognitive enhancement, and societal cohesion. Summary : ",
    },
]

print("--------- Test with gpt-3.5-turbo-0613 (new & slower) ---------")
start = time.time()
model = "gpt-3.5-turbo-0613"  
out = get_openai_api_response_sync(
    prompt_message=prompt_message,
    model=model,
    max_tokens=50,
)
end = time.time()
print(f"Took {(end - start) * 1000} ms with {model}\n\n")

print("--------- Test with gpt-3.5-turbo-0301 (old & faster) ---------")
start = time.time()
model = "gpt-3.5-turbo-0301" 
out = get_openai_api_response_sync(
    prompt_message=prompt_message,
    model=model,
    max_tokens=50,
)
end = time.time()
print(f"\nTook {(end - start) * 1000} ms with {model}\n\n")


--------- Test with gpt-3.5-turbo-0613 (new & slower) ---------
('Music has played a significant role throughout history, expressing human emotions and experiences. It goes beyond cultural differences, inspiring change and fostering unity. Learning music has been shown to improve cognitive skills and academic performance in children, while musicians also exhibit enhanced neural plasticity', {'prompt_tokens': 247, 'completion_tokens': 50, 'total_tokens': 297}) 

Took 2617.2635555267334 ms with gpt-3.5-turbo-0613

--------- Test with gpt-3.5-turbo-0301 (old & faster) ---------
('Music has been an essential part of human society throughout history, expressing a wide range of emotions and experiences. It has the power to shape cognition and development, with children who learn music exhibiting improved cognitive skills and academic proficiency. Music also serves as a vehicle', {'prompt_tokens': 249, 'completion_tokens': 50, 'total_tokens': 299})

Took 912.5313758850098 ms with gpt-3.5-turbo-0301

gpt-3.5-turbo-0613 - 2600 ms
gpt-3.5-turbo-0301 - 900 ms

That’s a huge difference between the same type of model but different checkpoints.
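For a rough throughput comparison, the usage dict returned by the snippet can be combined with the wall-clock times measured above. A minimal sketch (the helper name is mine, not from the snippet; it underestimates pure streaming rate slightly since elapsed time includes request latency):

```python
def tokens_per_second(usage: dict, elapsed_ms: float) -> float:
    """Completion throughput from the API usage dict and wall-clock time."""
    return usage["completion_tokens"] / (elapsed_ms / 1000.0)

# figures taken from the two runs above (50 completion tokens each)
print(round(tokens_per_second({"completion_tokens": 50}, 2617.26), 1))  # 0613: 19.1 tok/s
print(round(tokens_per_second({"completion_tokens": 50}, 912.53), 1))   # 0301: 54.8 tok/s
```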


Yep, looks like -0301, standing by but not being utilized by everybody, is quite fast!

Report for gpt-3.5-turbo:

For total, Min: 10.129, Max: 10.724, Avg: 10.30
For latency, Min: 0.412, Max: 1.061, Avg: 0.62
For tokens, Min: 196, Max: 196, Avg: 196.00
For rate, Min: 18.28, Max: 19.35, Avg: 19.03 tokens/s

Report for gpt-3.5-turbo-0301:

For total, Min: 2.915, Max: 3.907, Avg: 3.42
For latency, Min: 0.199, Max: 0.501, Avg: 0.35
For tokens, Min: 196, Max: 196, Avg: 196.00
For rate, Min: 50.17, Max: 67.24, Avg: 58.12 tokens/s
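Taking the Avg rate lines from the two reports, that works out to roughly a 3x throughput gap (assuming the gpt-3.5-turbo alias was pointing at -0613 at the time):

```python
avg_rate_alias = 19.03  # "gpt-3.5-turbo" avg rate, tokens/s
avg_rate_0301 = 58.12   # "gpt-3.5-turbo-0301" avg rate, tokens/s
print(round(avg_rate_0301 / avg_rate_alias, 2))  # ~3.05x
```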


Supporting classes
import openai
import jsonschema
import time
import re
import os
import tiktoken
openai.api_key = key  # your API key string

class Tokenizer:
    """ required: import tiktoken; import re;
    usage example:
        cl100 = Tokenizer()
        number_of_tokens = cl100.count("my string")
    """
    def __init__(self, model="cl100k_base"):
        self.tokenizer = tiktoken.get_encoding(model)
        self.chat_strip_match = re.compile(r'<\|.*?\|>')
        self.intype = None

    def ucount(self, text):
        encoded_text = self.tokenizer.encode(text)
        return len(encoded_text)

    def count(self, text):
        text = self.chat_strip_match.sub('', text)
        encoded_text = self.tokenizer.encode(text)
        return len(encoded_text)
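The difference between `ucount` and `count` is only the regex strip of ChatML control tokens before encoding. The stripping pattern in isolation (standalone, so it runs without the class or tiktoken):

```python
import re

# same pattern the Tokenizer class compiles
chat_strip_match = re.compile(r'<\|.*?\|>')

text = "<|im_start|>user\nhello world<|im_end|>"
print(chat_strip_match.sub('', text))  # control tokens removed: "user\nhello world"
```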

class BotDate:
    """ .start/.now : object creation date/time; current date/time
        .set/.get   : start/reset timer, elapsed time
        .print      : formatted date/time from epoch seconds
    """
    def __init__(self, format_spec="%Y-%m-%d %H:%M%p"):
        self.format_spec = format_spec
        self.created_time = time.time()
        self.start_time = None  # timer not started until .set() is called
        self.stats1 = []
        self.stats2 = []

    def stats_reset(self):
        self.stats1 = []
        self.stats2 = []
    def set(self):
        self.start_time = time.time()  # Record the current time when set is called

    def start(self):
        return self.format_time(self.created_time)

    def now(self):
        return self.format_time(time.time())

    def print(self, epoch_seconds): # format input seconds
        return self.format_time(epoch_seconds)

    def format_time(self, epoch_seconds):
        formatted_time = time.strftime(self.format_spec, time.localtime(epoch_seconds))
        return formatted_time

    def get(self):  # elapsed time value str
        if self.start_time is None:
            return "X.XX"
        elapsed_time = time.time() - self.start_time
        return round(elapsed_time, 3)

bdate = BotDate()
tok = Tokenizer()
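The latency/total split used in the benchmark loop below is just two reads of one timer: `.set()` starts it, and `.get()` is read once at the first streamed token (latency) and again when the stream ends (total). Stripped down to plain `time.time()`, with sleeps standing in for network waits:

```python
import time

start = time.time()
time.sleep(0.02)                     # stand-in for time to first token
latency = round(time.time() - start, 3)
time.sleep(0.03)                     # stand-in for the rest of the stream
total = round(time.time() - start, 3)
assert latency < total               # first token arrives before the stream ends
```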
A prompt that produces 275 input tokens:
user = """Write verbose user documentation based on this class method docstring:
    Extends the input message dictionary or list of dictionaries with a 'tokens' field,
    which contains the token count of the 'role' and 'content' fields
    (and optionally the 'name' field). The token count is calculated using the
    'scount' method, which strips out any text enclosed within "<|" and "|>" before counting the tokens.

        message (dict or list): A dictionary or a list of dictionaries. The ChatML format.
        Each dictionary must have a 'role' field and a 'content' field, and may optionally
        have a 'name' field. The 'role' and 'content' fields are strings, and the
        'name' field, if present, is also a string.

        The input message dictionary or list of dictionaries, extended with a 'tokens' field
        in each dictionary. The 'tokens' field contains the token count of the 'role' and
        'content' fields (and optionally the 'name' field), calculated using the 'scount'
        method. The total token count also includes a fixed overhead of 3 control tokens.

        KeyError: If a dictionary does not have a 'role' or 'content' field.
    """
models = ["gpt-3.5-turbo", "gpt-3.5-turbo-0301"]
trials = 4
stats = {model: {"total": [], "latency": [], "tokens": [], "rate": []} for model in models}

for i in range(trials):
    for model in models:
        bdate.set()  # start the timer for this trial
        # call the chat API using the openai package and model parameters
        response = openai.ChatCompletion.create(
            messages = [{"role": "user", "content": user}],
            model = model,
            top_p = 0.0, stream = True, max_tokens = 200)

        # capture the words emitted by the response generator
        reply = ""
        for delta in response:
            if reply == "":
                latency = bdate.get()  # time to first token
            if not delta['choices'][0]['finish_reason']:
                word = delta['choices'][0]['delta']['content']
                reply += word
                print(word, end="")
        total = bdate.get()  # total elapsed time
        tokens = tok.count(reply)
        # extend model stats lists with total, latency, tokens, rate
        stats[model]["total"].append(total)
        stats[model]["latency"].append(latency)
        stats[model]["tokens"].append(tokens)
        stats[model]["rate"].append(round(tokens / total, 2))

for key in stats:
    print(f"Report for {key}:")
    for sub_key in stats[key]:
        values = stats[key][sub_key]
        min_value = min(values)
        max_value = max(values)
        avg_value = sum(values) / len(values)
        print(f"For {sub_key}, Min: {min_value}, Max: {max_value}, Avg: {avg_value:.2f}")

Now for what everybody wants to know: how does a fine-tuned model stack up? I have gpt-3.5-turbo with a bare-minimum tune…

Report for ft:gpt-3.5-turbo-0613:aaaaa::3333333:

For total, Min: 2.217, Max: 20.147, Avg: 6.80
For latency, Min: 0.2, Max: 18.205, Avg: 4.76
For tokens, Min: 196, Max: 196, Avg: 196.00
For rate, Min: 9.73, Max: 88.41, Avg: 65.19 tokens/s

A horrible cold-start time on a fine-tuned model that hadn't been used since yesterday.

Round 2

For total, Min: 2.319, Max: 2.848, Avg: 2.61
For latency, Min: 0.275, Max: 0.704, Avg: 0.53
For tokens, Min: 196, Max: 196, Avg: 196.00
For rate, Min: 68.82, Max: 84.52, Avg: 75.52 tokens/s

But once warm, it produces output tokens faster than the non-tuned models.

Quite a few reports of the API being slow. No idea how they expect businesses to use this. Total crap response time.