I am running a loop of 240K queries to gpt-4o-mini.
In the beginning it takes about 1 second per iteration.
Gradually it becomes about 20 seconds per iteration.
I don't hit a rate limit error.
It's just slow.
Why?
Are you using the async client?
o1-series models use test-time compute, which means your API requests can stay open for a relatively longer time than with vanilla chat completion models.
What sort of deployment environment are you running the requests on?
240,000 simultaneous connections (or even batched ones) will require a massive number of open sockets.
Sorry, I meant gpt-4o-mini. I am running on a Mac with Python. I moved to joblib Parallel to simulate some traces in parallel. At first it runs super fast, hitting the rate limit (good), but after a while it becomes super slow again. I am monitoring the messages and token counts; they have not changed. It must be something to do with an issue on the API side.
Can you share the current code you’re running?
thanks!
Note these are not 240,000 simultaneous connections, just n≈10 threads, each running its requests serially.
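Roughly, the shape of the loop is like this (a simplified sketch, not my exact code; the real prompts and post-processing are omitted):

from joblib import Parallel, delayed
from openai import OpenAI

client = OpenAI()

# Placeholder prompts; the real run goes through ~240K of them.
prompts = ["Tell me a dad joke."] * 1000

def run_one_trace(prompt: str) -> str:
    # One blocking chat completion per trace (hypothetical example prompt).
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# ~10 worker threads, each working through its share of the prompts serially.
results = Parallel(n_jobs=10, backend="threading")(
    delayed(run_one_trace)(p) for p in prompts
)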
Can anyone help? No support from OpenAI. @sps
I'd recommend using asyncio and the AsyncOpenAI client to make asynchronous API calls.
Here's some example code to test time to first token (TTFT) over a span of 50 API calls:
import asyncio
import time

import matplotlib.pyplot as plt
from openai import AsyncOpenAI

client = AsyncOpenAI()

system_message = {
    "role": "system",
    "content": "You are a master dad joke maker."
}
user_message = {
    "role": "user",
    "content": "Tell me a dad joke."
}
messages = [system_message, user_message]

# Async function to make a single streaming API call and measure time to first token
async def measure_time_to_first_token():
    start_time = time.time()
    response_text = ""
    time_to_first_token = None
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True
    )
    async for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            if time_to_first_token is None:
                # Record the latency of the very first content chunk
                time_to_first_token = time.time() - start_time
            response_text += chunk.choices[0].delta.content
    return time_to_first_token, response_text

# Async function to make multiple API calls concurrently
async def main():
    tasks = [measure_time_to_first_token() for _ in range(50)]
    results = await asyncio.gather(*tasks)
    times_to_first_token, responses = zip(*results)

    for i, (ttft, response_text) in enumerate(zip(times_to_first_token, responses)):
        print(f"Response {i+1}: {response_text}\nTime to first token: {ttft} seconds")

    # Print average time to first token
    average_time_to_first_token = sum(times_to_first_token) / len(times_to_first_token)
    print(f"Average time to first token over 50 calls: {average_time_to_first_token} seconds")

    # Plot the TTFT variance
    plt.figure(figsize=(12, 6))
    plt.plot(times_to_first_token, marker='o', linestyle='-', color='#FFA500', markersize=8, markerfacecolor='#FF4500')
    plt.xlabel('API Call Number', fontsize=14, color='white')
    plt.ylabel('Time to First Token (seconds)', fontsize=14, color='white')
    plt.title('Variance in Time to First Token for 50 API Calls', fontsize=16, color='white')
    plt.grid(True, linestyle='--', alpha=0.6)

    # Set background color
    plt.gca().set_facecolor('#0E1117')
    plt.gcf().set_facecolor('#0E1117')
    plt.gca().spines['top'].set_color('white')
    plt.gca().spines['bottom'].set_color('white')
    plt.gca().spines['left'].set_color('white')
    plt.gca().spines['right'].set_color('white')
    plt.gca().tick_params(axis='x', colors='white')
    plt.gca().tick_params(axis='y', colors='white')

    # Show the plot
    plt.show()

# Run the async main function
asyncio.run(main())
Here are the results from the test run:
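Separately, when you scale from this 50-call test to the full 240K-query run, it helps to put an explicit cap on how many requests are in flight at once rather than letting the client or OS queue them. Here is a minimal sketch using asyncio.Semaphore (the limit of 10 and the placeholder prompts are just assumptions; tune them to your rate limits):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def run_all(prompts: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(10)  # assumed concurrency cap; tune to your rate limits

    async def ask(prompt: str) -> str:
        # Only 10 requests are in flight at any time; the rest wait here.
        async with semaphore:
            response = await client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content

    return await asyncio.gather(*(ask(p) for p in prompts))

# Placeholder prompts; swap in your real ~240K queries.
answers = asyncio.run(run_all(["Tell me a dad joke."] * 100))
print(len(answers), "responses collected")

Bounding concurrency this way keeps the run under your rate limits and avoids piling up a large number of open sockets.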