Rate becomes slower over time (GPT o1 mini)

I am running a loop of 240K queries to gpt-4o-mini.
In the beginning it takes about 1 second per iteration.
Gradually it becomes about 20 seconds per iteration.
I don’t hit a rate limit error.
It’s just slow.
Why?

Are you using the async client?

o1-series models use test-time compute, which means your API requests can stay open considerably longer than with vanilla chat completion models.

What sort of deployment environment are you running the requests on?

240,000 simultaneous connections (or even batched ones) will require a massive number of open sockets.
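
If you do run calls concurrently, one way to keep the number of open sockets in check is to cap the number of in-flight requests with a semaphore. Here is a rough sketch (not your code; it assumes the AsyncOpenAI client and gpt-4o-mini, with a small stand-in query list):

import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(10)  # at most 10 requests (and sockets) in flight at once

async def ask(prompt: str) -> str:
    # Acquire a slot before opening a connection to the API
    async with semaphore:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

async def main():
    prompts = [f"Query {i}" for i in range(1_000)]  # stand-in for the real workload
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(f"Received {len(answers)} responses")

asyncio.run(main())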

Sorry, I meant gpt-4o-mini. I am running on a Mac with Python and moved to joblib's Parallel to simulate some traces in parallel. At first it runs super fast, hitting the rate limit (good), but after a while it becomes super slow again. I am monitoring the messages and the token counts, and they have not changed. It must be something to do with an issue with the API.

Can you share the current code you’re running?

thanks!

Note these are not 240,000 simultaneous connections. Just n~10 threads, each running serially.
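
For context, the pattern is roughly this (a simplified sketch, not the exact code; it assumes the synchronous OpenAI client and joblib's threading backend):

from joblib import Parallel, delayed
from openai import OpenAI

client = OpenAI()

def run_trace(prompt: str) -> str:
    # One synchronous call per trace; each worker thread processes its queries serially
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

prompts = [f"Trace {i}" for i in range(240_000)]

# ~10 threads, each running its share of calls one after another
results = Parallel(n_jobs=10, backend="threading")(
    delayed(run_trace)(p) for p in prompts
)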

Can anyone help? :frowning: No support from OpenAI. @sps

I’d recommend using asyncio and the AsyncOpenAI client to make asynchronous API calls.

Here's some example code to test Time to First Token (TTFT) over a span of 50 API calls:
import asyncio
import time
import matplotlib.pyplot as plt
from openai import AsyncOpenAI

client = AsyncOpenAI()

system_message = {
    "role": "system",
    "content": "You are a master dad joke maker."
}

user_message = {
    "role": "user",
    "content": "Tell me a dad joke."
}

messages = [system_message, user_message]

# Async function to make a single API call using streaming and measure time to first token
async def measure_time_to_first_token():
    start_time = time.time()
    response_text = ""
    
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True
    )
    
    async for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            # Return as soon as the first content chunk arrives;
            # response_text therefore holds only that first chunk
            response_text += chunk.choices[0].delta.content
            time_to_first_token = time.time() - start_time
            return time_to_first_token, response_text

# Async function to make multiple API calls concurrently
async def main():
    tasks = [measure_time_to_first_token() for _ in range(50)]
    results = await asyncio.gather(*tasks)
    
    times_to_first_token, responses = zip(*results)
    
    for i, (ttft, response_text) in enumerate(zip(times_to_first_token, responses)):
        print(f"Response {i+1}: {response_text}\nTime to first token: {ttft} seconds")
    
    # Print average time to first token
    average_time_to_first_token = sum(times_to_first_token) / len(times_to_first_token)
    print(f"Average time to first token over 50 calls: {average_time_to_first_token} seconds")
    
    # Plot the TTFT variation across the 50 calls
    plt.figure(figsize=(12, 6))
    plt.plot(times_to_first_token, marker='o', linestyle='-', color='#FFA500', markersize=8, markerfacecolor='#FF4500')
    plt.xlabel('API Call Number', fontsize=14, color='white')
    plt.ylabel('Time to First Token (seconds)', fontsize=14, color='white')
    plt.title('Variance in Time to First Token for 50 API Calls', fontsize=16, color='white')
    plt.grid(True, linestyle='--', alpha=0.6)
    
    # Set background color
    plt.gca().set_facecolor('#0E1117')
    plt.gcf().set_facecolor('#0E1117')
    plt.gca().spines['top'].set_color('white')
    plt.gca().spines['bottom'].set_color('white')
    plt.gca().spines['left'].set_color('white')
    plt.gca().spines['right'].set_color('white')
    plt.gca().tick_params(axis='x', colors='white')
    plt.gca().tick_params(axis='y', colors='white')
    
    # Show the plot
    plt.show()

# Run the async main function
asyncio.run(main())

Here are the results from the test run:
