Concurrency Rate Limiting: A $10,000 Issue

We are observing some irregularities in API response latency, despite being well within our Tier 5 rate limits. We'd love to better understand how the concurrency rate limits actually work, as they have very tangible, real impacts on our users' experience with us.

The irregularity we are observing is this: when sending multiple API requests in parallel to gpt-4o-mini (138 requests in parallel; total token count ~15,000 and roughly ~100 tokens/request), we see a return latency of 40 seconds. This is far below the Tier 5 request limit of 30,000 RPM. When we send only 5 API requests (total token count 638), the total time is 1.7 seconds. We can see no reason why the larger batch should take so much longer.

I conducted a more thorough investigation. We have various translation tasks we're interested in doing, and we're decomposing the translation of an entire long document into translating multiple paragraphs in parallel.

The most obvious conclusion is that there is some kind of invisible additional rate limiting going on behind the scenes, beyond what is outlined in the guide.

Per the guide, there is a cap of 150M tokens/minute and 30,000 requests/minute. We don't come anywhere close to that in these tests, but there is a dramatic difference in total processing speed, presumably because of some unpredictable concurrency throttling.

As we are about to adopt this system in production, on a wide scale with many customers, with potentially thousands of requests happening simultaneously, it’s crucial for us to have more clarity on how exactly the rate limiting is happening.

When I asked support for more specific numbers on concurrency limits, so we can predict their impact and plan around them (for example by using other solutions like AWS), or to be connected with a sales representative to develop a custom solution for our enterprise, I was in effect told that our organization needs to spend $10,000 USD/month to be qualified to speak with a sales agent.

Is there any way forward here for us? We’d love to keep developing with this platform, but it’s hard for us to navigate blind to the concurrency limits.

1 Like

Hi Eric,

I will raise your question with the team at OpenAI, but that will be next week now. I have not seen this documented, but I will check.

2 Likes

@Foxalabs Hi Spencer, this issue still has not been resolved and we still haven’t received any information. This is a pressing and urgent situation for us.

Again, in essence, the problem is really simple - the total network latency seems to scale with the number of concurrent requests, and not in a way that corresponds with our rate limits.

I made a dummy series of requests using a 62-token prompt and producing 62 output tokens. I duplicated this request and ran it in parallel (asynchronously / concurrently) 5 times, 50 times, and 100 times.

Again, recall that every single request is identical to every other one. Because they are being run in parallel, we would expect them all to have roughly the same network latency, aside from some minor millisecond differences due to network congestion (negligible on our AWS server).

Instead:

5 concurrent requests → 2.32 seconds (avg)
50 concurrent requests → 4.90 seconds (avg)
100 concurrent requests → 9.22 seconds (avg)

Manually checking the 100-concurrent-requests data, we find that we don’t come anywhere close to exhausting our rate limit.

This is a significant bug for enterprise-level scaling.

Please provide your code for making these requests.

import openai
import json
import asyncio
import time
import tiktoken
import pandas as pd

tokenizer = tiktoken.get_encoding("cl100k_base") 
with open('APIFILEPATH', 'r') as file:
    api_keys = json.load(file)
openai.api_key = api_keys['apikeyName']

input_sentence1 = '''Muchas de las aportaciones de Galileo le generaron un 
grave conflicto con la Iglesia Católica, la cual defendía un tipo de pensamiento completamente contrario
 al que intentaba imponer Galileo con sus descubrimientos, su método científico y su empirismo.'''
sentences = [input_sentence1] * 100

async def translate_sentence(sentence):

    messages = [
        {"role": "system", "content": "You are an expert Spanish to English translator"},
        {"role": "user", "content": f"Translate: {sentence}"}
    ]
    
    completion = await asyncio.to_thread(
        openai.chat.completions.with_raw_response.create,
        model="gpt-4o-mini",
        messages=messages,
        temperature=0
    )

    # initialize dictionary - we'll use this for our CSV
    dc = {"input_tokens":str(len(tokenizer.encode(sentence)))}
    dc["request_id"] = completion.headers["x-request-id"]
    dc["x-ratelimit-limit-requests"] = completion.headers["x-ratelimit-limit-requests"]
    dc["x-ratelimit-remaining-requests"] = completion.headers["x-ratelimit-remaining-requests"]
    dc["x-ratelimit-remaining-tokens"] = completion.headers["x-ratelimit-remaining-tokens"]
    dc["x-ratelimit-reset-requests"] = completion.headers["x-ratelimit-reset-requests"]
    dc["x-ratelimit-reset-tokens"] = completion.headers["x-ratelimit-reset-tokens"]
    
    return dc

async def main():
    start = time.time()
    tasks = [translate_sentence(sentence) for sentence in sentences]
    dicts = await asyncio.gather(*tasks)
    duration = time.time() - start

    # convert our dictionaries into a pandas DataFrame and append it to the CSV
    df = pd.DataFrame(dicts)
    df['total_duration'] = duration
    df['concurrent_requests'] = len(sentences)
    df = df.transpose()
    df.to_csv("inference4_ratest.csv", mode='a', header=False)

#Execute main
start = time.time()
asyncio.run(main())
print("Concurrent requests:", len(sentences))
print("Time taken:", time.time() - start)

Simply edit the line sentences = [input_sentence1] * 100 to determine the number of concurrent requests (e.g. * 50 or * 5).

1 Like

Thanks for the prompt reply with good code.

Here’s my speculation:

The parallelism isn’t perfectly parallel.

E.g. asking for 100 requests doesn't actually perform 100 simultaneous requests; they get batched, either based on the number of available CPU threads or in chunks of some fixed size.

This is why you're seeing 100 requests take approximately twice as long as 50 requests (I'm presuming you're not using a server with 100+ available threads).
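
For what it's worth, here's a minimal, API-free sketch of that effect (the time.sleep is just a stand-in for a blocking HTTP call, and the 100-task count and max_workers value are arbitrary examples): asyncio.to_thread() hands work to the event loop's default ThreadPoolExecutor, whose default size is min(32, os.cpu_count() + 4) in CPython, so 100 blocking calls only overlap that many at a time and the rest queue behind them.

import asyncio
import concurrent.futures
import os
import time

def blocking_call(i):
    # stand-in for a blocking HTTP request
    time.sleep(1)
    return i

async def main():
    # Uncomment to widen the default pool so more blocking calls truly overlap:
    # loop = asyncio.get_running_loop()
    # loop.set_default_executor(concurrent.futures.ThreadPoolExecutor(max_workers=100))

    start = time.time()
    await asyncio.gather(*(asyncio.to_thread(blocking_call, i) for i in range(100)))
    print("default pool size here:", min(32, os.cpu_count() + 4))
    print("100 one-second sleeps took:", round(time.time() - start, 2), "seconds")

asyncio.run(main())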

1 Like

The imperfect parallelism matches everything I've been seeing. I'm trying to figure out when the batching kicks in, but it's not clear yet.

And no, we're not using a server with 100+ available threads, but we can employ one if needed.

What do you think are ways around this to avoid hitting the auto backend batching?

(a) same client (IP address) but using distinct API keys (or distinct project or even distinct account API keys)?

(b) same API key, but different client threads? i.e. the client subdivides 100 requests into 5 chunks of 20 each; 5 threads are opened, each sending 20 concurrent requests? (A rough sketch of this is below.)

I feel like (b) would be interpreted as still coming from the same client by the OpenAI backend (same IP address), but not sure.

Any ideas? It's critical that we're able to do 100-200 requests as fast as we can manage, getting as close to single-request latency as possible. We'll do a lot on our backend to make that happen.
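
To make (b) concrete, here's roughly what I have in mind - a hypothetical sketch that reuses the translate_sentence() coroutine from my code above, with the chunk size and thread count picked arbitrarily (whether this actually buys anything is exactly what I'm unsure about):

import asyncio
import concurrent.futures

CHUNK_SIZE = 20  # arbitrary example value

async def run_chunk(chunk):
    # each worker thread runs its own event loop, with its own default thread pool
    return await asyncio.gather(*(translate_sentence(s) for s in chunk))

def run_chunk_in_thread(chunk):
    return asyncio.run(run_chunk(chunk))

def chunked_translate(sentences):
    chunks = [sentences[i:i + CHUNK_SIZE] for i in range(0, len(sentences), CHUNK_SIZE)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        results = pool.map(run_chunk_in_thread, chunks)
    return [item for chunk_result in results for item in chunk_result]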

There's no "auto backend batching"; it's a fundamental limitation of your hardware.

Say you have a waffle restaurant (your computer), and you own 12 waffle makers (it has a 12-core CPU); they're the fancy kind that rotate and can make two waffles each (so you have 24 threads available).

If a high school football team comes into your restaurant after winning their big homecoming game and orders 100 waffles, no matter how you try you simply cannot make 100 waffles at once. The best you can ever do is have 24 cooking at a time. As each waffle finishes you can start another, but once you are dealing with an order for more than 24 waffles you're going to get backed up.

You can buy more waffle makers (get a CPU with more available cores and threads) or open more waffle restaurants (use more computers), but your one little waffle store as it is will never be able to exceed 24 waffles cooking at once.

I understand multithreading/multicore processing, but I guess I'm confused about where the actual inference and processing is taking place. It was my understanding that the actual inference happens on OpenAI's servers - our local hardware (the computer/server sending the request) surely doesn't matter in terms of actual processing speed. Wouldn't the limitation be the number of cores OpenAI is putting at our disposal during inference? But perhaps that's what you mean by "our hardware".

The analogy would be more fitting if you had 12 waiters who can each carry 2 tables' orders. They still need to take the order to the chef and wait for it to be finished.

The waiter in this case unfortunately has to wait for the waffle to be made & delivered before taking more orders.

You would also want to not use the OpenAI client library and instead get as low-level as possible for extra control. You may run into connection pooling issues (this may or may not be true).

The processing takes place on the OpenAI servers. Your local thread is still tied up by the request and needs to wait for the response.

If absolutely required, you could use cloud computing to spin up as many low-resource virtual machines as needed, so you're capable of sending all the requests at the exact same time.
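
On the connection pooling point above: if you do stay with the OpenAI Python library, it accepts a custom httpx client, and httpx's default pool tops out at 100 connections. Here's a minimal sketch of raising that limit (the specific numbers and the placeholder API key are just examples, not a recommendation from anyone in this thread):

import httpx
import openai

# example limits; httpx defaults to max_connections=100, max_keepalive_connections=20
limits = httpx.Limits(max_connections=200, max_keepalive_connections=100)

client = openai.OpenAI(
    api_key="YOUR_API_KEY",  # placeholder
    http_client=httpx.Client(limits=limits, timeout=httpx.Timeout(60.0)),
)

Since the default pool already allows 100 connections, this is more something to watch once you're pushing 200+ concurrent requests.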

1 Like

Maybe a better analogy would be phone lines and phone calls?

Anyway, the point is once a thread on your PC sends a request it must wait for the response to come back.

2 Likes

I have spoken with OAI about this and it will be looked into, so hopefully it will be checked. The response I received was that there should not be a smaller limit for parallel calls.

4 Likes

Great, thanks all! I appreciate the responses @Foxalabs @elmstedt @RonaldGRuckus

Hi,

The Eng team at OpenAI took a look and spotted an error in your code: you weren't actually running all of the tests in parallel. This is the corrected code, which has been tested to run as expected, i.e. with a very quick response.

def timed_create(*args, **kwargs):
    s = time.time()
    response = client.chat.completions.with_raw_response.create(*args, **kwargs)
    print(f"rid1={response.headers['x-request-id']},total_time={time.time()-s}")
    return response

async def translate_sentence(sentence):

    s = time.time()
    messages = [
        {"role": "system", "content": "You are an expert Spanish to English translator"},
        {"role": "user", "content": f"Translate: {sentence}"}
    ]

    response = await asyncio.to_thread(
        timed_create,
        #client.chat.completions.with_raw_response.create,
        model="gpt-4o-mini",
        messages=messages,
        temperature=0,
    )
    r = json.loads(response.text)
5 Likes

I noticed that you're using the cl100k_base encoding for counting tokens; however, gpt-4o-mini uses the o200k_base encoding, which has improved multilingual tokenization.

A better way would be to simply get the encoding for the model:

import tiktoken

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4o-mini")

So, in case you're using cl100k_base in your production code, I'd recommend getting the proper encoding for the model you're using.
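
For illustration, here's a quick comparison you could run (the sample text and printed counts are just for demonstration, and this assumes a recent tiktoken release that knows about gpt-4o-mini):

import tiktoken

text = "Muchas de las aportaciones de Galileo le generaron un grave conflicto con la Iglesia Católica."

cl100k = tiktoken.get_encoding("cl100k_base")        # encoding used in the code above
o200k = tiktoken.encoding_for_model("gpt-4o-mini")   # resolves to o200k_base

# the counts will typically differ, especially for non-English text
print("cl100k_base tokens:", len(cl100k.encode(text)))
print("o200k_base tokens: ", len(o200k.encode(text)))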

3 Likes