I am running a loop of 240K queries to gpt-4o-mini.
In the beginning it takes about 1 second per iteration.
Gradually it becomes about 20 seconds per iteration.
I don't hit a rate limit error.
It's just slow.
Why?
Are you using the async client?
o1-series models use test-time compute, which means your API requests can stay open for a relatively longer time than with vanilla chat completion models.
What sort of deployment environment are you running the requests on?
240,000 simultaneous connections (or even batched ones) will require a massive number of open sockets.
Sorry, I meant gpt-4o-mini. I am running on a Mac with Python. I moved to joblib Parallel to simulate some traces in parallel. At first it runs super fast, hitting the rate limit (good), but after a while it becomes super slow again. I am monitoring the messages and token counts; they have not changed. It must be something to do with an issue on the API side.
Can you share the current code you’re running?
thanks!
Note these are not 240,000 simultaneous connections, just n≈10 threads, each running its requests serially.
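Roughly, the shape of the loop is like this (a simplified sketch, not my exact code; the real prompts and post-processing are omitted):

from joblib import Parallel, delayed
from openai import OpenAI

client = OpenAI()

# Placeholder prompts; the real run goes through ~240K of them.
prompts = ["Tell me a dad joke."] * 1000

def run_one_trace(prompt: str) -> str:
    # One blocking chat completion per trace (hypothetical example prompt).
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# ~10 worker threads, each working through its share of the prompts serially.
results = Parallel(n_jobs=10, backend="threading")(
    delayed(run_one_trace)(p) for p in prompts
)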
Can anyone help? No support from OpenAI. @sps
I'd recommend using asyncio and the AsyncOpenAI client to make asynchronous API calls.
Here's some example code to test time to first token (TTFT) over a span of 50 API calls:
import asyncio
import time

import matplotlib.pyplot as plt
from openai import AsyncOpenAI

client = AsyncOpenAI()

system_message = {
    "role": "system",
    "content": "You are a master dad joke maker."
}
user_message = {
    "role": "user",
    "content": "Tell me a dad joke."
}
messages = [system_message, user_message]

# Async function to make a single streaming API call and measure time to first token
async def measure_time_to_first_token():
    start_time = time.time()
    response_text = ""
    time_to_first_token = None
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True
    )
    async for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            if time_to_first_token is None:
                # Record the latency of the very first content chunk
                time_to_first_token = time.time() - start_time
            response_text += chunk.choices[0].delta.content
    return time_to_first_token, response_text

# Async function to make multiple API calls concurrently
async def main():
    tasks = [measure_time_to_first_token() for _ in range(50)]
    results = await asyncio.gather(*tasks)
    times_to_first_token, responses = zip(*results)

    for i, (ttft, response_text) in enumerate(zip(times_to_first_token, responses)):
        print(f"Response {i+1}: {response_text}\nTime to first token: {ttft} seconds")

    # Print average time to first token
    average_time_to_first_token = sum(times_to_first_token) / len(times_to_first_token)
    print(f"Average time to first token over 50 calls: {average_time_to_first_token} seconds")

    # Plot the TTFT variance
    plt.figure(figsize=(12, 6))
    plt.plot(times_to_first_token, marker='o', linestyle='-', color='#FFA500', markersize=8, markerfacecolor='#FF4500')
    plt.xlabel('API Call Number', fontsize=14, color='white')
    plt.ylabel('Time to First Token (seconds)', fontsize=14, color='white')
    plt.title('Variance in Time to First Token for 50 API Calls', fontsize=16, color='white')
    plt.grid(True, linestyle='--', alpha=0.6)

    # Set background color
    plt.gca().set_facecolor('#0E1117')
    plt.gcf().set_facecolor('#0E1117')
    plt.gca().spines['top'].set_color('white')
    plt.gca().spines['bottom'].set_color('white')
    plt.gca().spines['left'].set_color('white')
    plt.gca().spines['right'].set_color('white')
    plt.gca().tick_params(axis='x', colors='white')
    plt.gca().tick_params(axis='y', colors='white')

    # Show the plot
    plt.show()

# Run the async main function
asyncio.run(main())
Here are the results from the test run:
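Separately, when you scale from this 50-call test to the full 240K-query run, it helps to put an explicit cap on how many requests are in flight at once rather than letting the client or OS queue them. Here is a minimal sketch using asyncio.Semaphore (the limit of 10 and the placeholder prompts are just assumptions; tune them to your rate limits):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def run_all(prompts: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(10)  # assumed concurrency cap; tune to your rate limits

    async def ask(prompt: str) -> str:
        # Only 10 requests are in flight at any time; the rest wait here.
        async with semaphore:
            response = await client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content

    return await asyncio.gather(*(ask(p) for p in prompts))

# Placeholder prompts; swap in your real ~240K queries.
answers = asyncio.run(run_all(["Tell me a dad joke."] * 100))
print(len(answers), "responses collected")

Bounding concurrency this way keeps the run under your rate limits and avoids piling up a large number of open sockets.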