Concurrency Rate Limiting: A $10,000 Issue

We are observing some irregularities in API response latency despite being well within our Tier 5 rate limits. We'd love to better understand how the concurrency rate limits actually work, as they have a very tangible, real impact on our users' experience.

The irregularity we are observing: when we send multiple API requests in parallel to gpt-4o-mini (138 requests in parallel; total token count ~15,000, roughly ~100 tokens/request), the return latency is 40 seconds, even though 138 requests is far below the Tier 5 request limit of 30,000 RPM. When we send only 5 API requests (total token count 638), the total time is 1.7 seconds. We can see no reason why the larger batch should take so much longer.

I conducted a more thorough investigation. We have various translation tasks we're interested in, and we decompose the translation of an entire long document into translating multiple paragraphs in parallel.

The most obvious conclusion is that some kind of invisible additional rate limiting is going on behind the scenes, beyond what is outlined in the guide.

Per the guide, the caps are 150M tokens/minute and 30,000 requests/minute. Our tests don't come anywhere close to either, yet there is a dramatic difference in total processing speed, presumably because of some unpredictable concurrency throttling.

As we are about to adopt this system in production at wide scale, with many customers and potentially thousands of simultaneous requests, it's crucial for us to have more clarity on exactly how the rate limiting works.

When I asked support for specific numbers on concurrency limits, so we could predict their impact and plan around them (for example by using other solutions such as AWS), or to be connected with a sales representative to develop a custom enterprise solution, I was in effect told our organization needs to spend $10,000 USD/month to qualify to speak with a sales agent.

Is there any way forward here for us? We’d love to keep developing with this platform, but it’s hard for us to navigate blind to the concurrency limits.

1 Like

Hi Eric,

I will raise your question with the team at OpenAI, but that will be next week now. I have not seen this documented, but I will check.

2 Likes

@Foxalabs Hi Spencer, this issue still has not been resolved and we still haven’t received any information. This is a pressing and urgent situation for us.

Again, in essence, the problem is really simple - the total network latency seems to scale with the number of concurrent requests, and not in a way that corresponds with our rate limits.

I made a dummy series of requests, each with a 62-token prompt and producing 62 output tokens. I duplicated this request to run in parallel (asynchronously/concurrently) 5 times, 50 times, and 100 times.

Again, recall that every request is identical to every other one. Because they run in parallel, we would expect them all to have roughly the same latency, aside from minor millisecond differences due to network congestion (negligible on our AWS server).

Instead:

5 concurrent requests → 2.32 seconds (avg)
50 concurrent requests → 4.90 seconds (avg)
100 concurrent requests → 9.22 seconds (avg)

Manually checking the 100-concurrent-requests data, we find that we don’t come anywhere close to exhausting our rate limit.

This is a significant bug for enterprise-level scaling.

Please provide your code for making these requests.

import openai
import json
import asyncio
import time
import tiktoken
import pandas as pd

tokenizer = tiktoken.get_encoding("cl100k_base") 
with open('APIFILEPATH', 'r') as file:
    api_keys = json.load(file)
openai.api_key = api_keys['apikeyName']

input_sentence1 = '''Muchas de las aportaciones de Galileo le generaron un 
grave conflicto con la Iglesia Católica, la cual defendía un tipo de pensamiento completamente contrario
 al que intentaba imponer Galileo con sus descubrimientos, su método científico y su empirismo.'''
sentences = [input_sentence1] * 100

async def translate_sentence(sentence):

    messages = [
        {"role": "system", "content": "You are an expert Spanish to English translator"},
        {"role": "user", "content": f"Translate: {sentence}"}
    ]
    
    completion = await asyncio.to_thread(
        openai.chat.completions.with_raw_response.create,
        model="gpt-4o-mini",
        messages=messages,
        temperature=0
    )

    # Initialize dictionary - we'll use this for our CSV
    dc = {"input_tokens":str(len(tokenizer.encode(sentence)))}
    dc["request_id"] = completion.headers["x-request-id"]
    dc["x-ratelimit-limit-requests"] = completion.headers["x-ratelimit-limit-requests"]
    dc["x-ratelimit-remaining-requests"] = completion.headers["x-ratelimit-remaining-requests"]
    dc["x-ratelimit-remaining-tokens"] = completion.headers["x-ratelimit-remaining-tokens"]
    dc["x-ratelimit-reset-requests"] = completion.headers["x-ratelimit-reset-requests"]
    dc["x-ratelimit-reset-tokens"] = completion.headers["x-ratelimit-reset-tokens"]
    
    return dc

async def main():
    start = time.time()
    tasks = [translate_sentence(sentence) for sentence in sentences]
    dicts = await asyncio.gather(*tasks)
    duration = time.time() - start

    # Convert our dictionaries into a pandas DataFrame, save it as CSV
    df = pd.DataFrame(dicts)
    df['total_duration'] = duration
    df['concurrent_requests'] = len(sentences)
    df = df.transpose()
    df.to_csv("inference4_ratest.csv", mode='a', header=False)

#Execute main
start = time.time()
asyncio.run(main())
print("Concurrent requests:", len(sentences))
print("Time taken:", time.time() - start)

Simply edit the line sentences = [input_sentence1] * 100 to set the number of concurrent requests (e.g., * 50 or * 5).

1 Like

Thanks for the prompt reply with good code.

Here’s my speculation:

The parallelism isn’t perfectly parallel.

E.g., asking for 100 requests doesn't actually perform 100 simultaneous requests; they get batched, either based on the number of available CPU threads or at some fixed number.

This is why you're seeing 100 requests take approximately twice as long as 50 requests (I'm presuming you're not using a server with 100+ available threads).
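
If that speculation is right, the cap comes from asyncio's default thread pool rather than from the API. A minimal sketch to check it and to lift it, assuming CPython's documented default pool size of min(32, cpu_count + 4); the 128 below is just an illustrative value:

import asyncio
import os
from concurrent.futures import ThreadPoolExecutor

# asyncio.to_thread() runs jobs on the event loop's default ThreadPoolExecutor.
# With the default of min(32, cpu_count + 4), a 12-core box gets 16 worker
# threads, so 100 to_thread() calls execute in waves rather than all at once.
print("cpu_count:", os.cpu_count())
print("default pool size:", min(32, (os.cpu_count() or 1) + 4))

async def main():
    # Installing a larger default executor lifts the cap for to_thread() calls.
    loop = asyncio.get_running_loop()
    loop.set_default_executor(ThreadPoolExecutor(max_workers=128))
    # ... subsequent asyncio.to_thread(...) calls share this larger pool

asyncio.run(main())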

1 Like

The imperfect parallelism matches everything I've been seeing. I'm trying to figure out when the batching kicks in, but it's not clear yet.

And no, not using a server with 100+ available threads but we can employ that if needed.

What do you think are ways around this to avoid hitting the auto backend batching?

(a) same client (IP address) but using distinct API keys (or distinct project or even distinct account API keys)?

(b) same API key, but different client threads? I.e., the client subdivides 100 requests into 5 chunks of 20 each, and 5 threads are opened, each sending 20 concurrent requests?

I feel like (b) would still be interpreted by the OpenAI backend as coming from the same client (same IP address), but I'm not sure.

Any ideas? It's critical that we can send 100-200 requests as fast as we can manage, getting as close to single-request latency as possible. We'll do a lot on our backend to make that happen; one alternative we're also considering is sketched below.
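
For context, a rough, untested sketch of that alternative: drop asyncio.to_thread entirely and use the SDK's async client with a semaphore to cap in-flight requests, so concurrency isn't tied to OS threads. The semaphore size here is arbitrary.

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(100)  # cap on in-flight requests; tune as needed

async def translate(sentence):
    async with semaphore:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are an expert Spanish to English translator"},
                {"role": "user", "content": f"Translate: {sentence}"},
            ],
            temperature=0,
        )
        return resp.choices[0].message.content

async def main():
    sentences = ["Esta es una prueba básica de traducción."] * 100
    results = await asyncio.gather(*(translate(s) for s in sentences))
    print("completed:", len(results))

asyncio.run(main())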

There's no "auto backend batching"; it's a fundamental limitation of your hardware.

Say you have a waffle restaurant (computer), and you own 12 waffle makers (it has a 12-core CPU); they're the fancy kind that rotate and can make two waffles each (you have 24 threads available).

If a high school football team comes into your restaurant after winning their big homecoming game and orders 100 waffles, no matter how you try, you simply cannot make 100 waffles at once. The best you can ever do is have 24 cooking at a time. As each waffle finishes you can start another, but once you're dealing with an order for more than 24 waffles, you're going to get backed up.

You can buy more waffle makers (get a CPU with more available cores and threads) or open more waffle restaurants (use more computers), but your one little waffle store, as it is, will never be able to exceed 24 waffles cooking at once.

1 Like

I understand multithreading and multi-core processing, but I'm confused about where the actual inference and processing take place. My understanding was that the inference happens on OpenAI's servers, so my local hardware (the computer/server sending the request) surely doesn't matter for the actual processing speed. Wouldn't the limitation be the number of cores OpenAI puts at our disposal during inference? But perhaps that's what you mean by "our hardware".

A more fitting analogy: you have 12 waiters who can each carry 2 tables' orders. They still need to take the orders to the chef and wait for them to be finished.

The waiter in this case unfortunately has to wait for the waffle to be made & delivered before taking more orders.

You may also want to skip the OpenAI client library and go as low-level as possible for extra control. You may run into connection-pooling issues (this may or may not be true).

The processing takes place on OpenAI's servers, but your local thread is still tied to the request and needs to wait for the response.

If absolutely required, you could use cloud computing to spin up as many low-resource virtual machines as needed, so all the requests can be sent at exactly the same time.
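
If connection pooling does turn out to matter, one knob (a sketch, not a recommendation; the limit values below are arbitrary) is handing the Python SDK your own httpx client with a wider pool:

import httpx
from openai import AsyncOpenAI

# Widen the HTTP connection pool the SDK uses for concurrent requests.
client = AsyncOpenAI(
    http_client=httpx.AsyncClient(
        limits=httpx.Limits(max_connections=200, max_keepalive_connections=50),
        timeout=httpx.Timeout(60.0),
    )
)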

1 Like

Maybe a better analogy would be phone lines and phone calls?

Anyway, the point is that once a thread on your PC sends a request, it must wait for the response to come back.

2 Likes

I have spoken with OAI about this and it will be looked into, so hopefully it will be checked. The response I received was that there should not be a smaller limit for parallel calls.

4 Likes

Great, thanks all! I appreciate the responses @Foxalabs @anon22939549 @anon10827405

Hi,

The engineering team at OpenAI took a look and spotted an error in your code: you weren't actually running all of the tests in parallel. This is the corrected code, which has been tested to run as expected, i.e. with a very quick response.

def timed_create(*args, **kwargs):
    s = time.time()
    response = client.chat.completions.with_raw_response.create(*args, **kwargs)
    print(f"rid1={response.headers['x-request-id']},total_time={time.time()-s}")
    return response

async def translate_sentence(sentence):

    s = time.time()
    messages = [
        {"role": "system", "content": "You are an expert Spanish to English translator"},
        {"role": "user", "content": f"Translate: {sentence}"}
    ]

    response = await asyncio.to_thread(
        timed_create,
        #client.chat.completions.with_raw_response.create,
        model="gpt-4o-mini",
        messages=messages,
        temperature=0,
    )
    r = json.loads(response.text)
6 Likes

I noticed that you're using cl100k_base encodings for counting tokens; however, gpt-4o-mini uses o200k_base encodings, which have improved multilingual tokenization.

A better way would be to simply get encodings for the model:

import tiktoken

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4o-mini")

So, if you're using cl100k_base in your production code, I'd recommend getting the proper encoding for the model you're using.
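
For example, a quick comparison on a fragment of the test sentence (assuming a tiktoken version recent enough to know gpt-4o-mini; exact counts depend on the text):

import tiktoken

text = "Muchas de las aportaciones de Galileo le generaron un grave conflicto con la Iglesia Católica."

cl100k = tiktoken.get_encoding("cl100k_base")
o200k = tiktoken.encoding_for_model("gpt-4o-mini")  # resolves to o200k_base

# o200k_base typically needs fewer tokens for non-English text like this
print("cl100k_base tokens:", len(cl100k.encode(text)))
print("o200k_base tokens: ", len(o200k.encode(text)))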

3 Likes

I am having the same issue: sending 127 requests across 4 models concurrently, and all models show slowness, especially the OpenAI ones.

Is this a known issue?

I would advise reposting this as a new topic, as this one has already been answered. Also, can you give more detail as to what the graphs are showing?

Frequency of what? Etc.

2 Likes

I would say it is a distribution diagram: 127 requests (the sample size), and the diagram shows how many requests take how long.

The thing is that it makes little sense to measure this, because problems solved by the model require different amounts of time, mostly based on the complexity of the problem, not because of a caching issue. It is not like SQL.

If you test with the same prompt 127 times, you might get faster results on subsequent prompts because of caching, but that is also not guaranteed.

You can speed up models, but it is either more expensive (hundreds of thousands, I would assume) or less accurate.
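
If caching is the suspicion, a hedged way to check is the usage object on each response, which in recent API versions reports cached prompt tokens; note that prompt caching only kicks in for prompts of roughly 1024+ tokens, so short test prompts won't hit it. A sketch:

from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Translate to English: Esta es una prueba."}],
)
# cached_tokens stays 0 unless a sufficiently long, repeated prompt prefix
# (roughly 1024+ tokens) was served from the prompt cache.
print("prompt tokens:", resp.usage.prompt_tokens)
details = getattr(resp.usage, "prompt_tokens_details", None)
print("cached tokens:", getattr(details, "cached_tokens", None))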

An AI model, a product of the AI team, took a look at the code. I then gave it an unbelievably high rate limit and number of trials to fulfill efficiently, with the instruction not to immediately burst a parallel blast of pool items on startup.

Don’t run this code without changing the global parameters… :laughing:

Code block of example

Example: “Scalable Async Rate-limited Producer-Consumer”

Below is the complete, best-practices Python file using this pattern.
It will handle millions of requests at 100,000 RPM efficiently, limited only by the bandwidth of the API and your I/O.

from __future__ import annotations

import asyncio
import statistics
import time

from openai import AsyncOpenAI

# Adjustable Parameters
RATELIMIT_PER_MIN = 100_000         # Max API requests per minute
NUM_REQUESTS      = 1_000_000       # Total requests to make (scale as needed)
MAX_PARALLEL      = 500             # Number of concurrent workers (tuneable)

client = AsyncOpenAI()

class AsyncRateLimiter:
    """
    Asyncio rate limiter with locking. Enforces an average request interval based
    on RATELIMIT_PER_MIN (requests per minute), regardless of task count.
    """
    def __init__(self, rate_limit: int):
        self._interval = 60.0 / rate_limit      # seconds between dispatches
        self._lock     = asyncio.Lock()
        self._last     = time.monotonic() - self._interval

    async def wait(self) -> None:
        async with self._lock:
            now = time.monotonic()
            elapsed = now - self._last
            if elapsed < self._interval:
                sleep_time = self._interval - elapsed
                await asyncio.sleep(sleep_time)
                now = time.monotonic()
            self._last = now

async def translate_sentence(sentence: str, idx: int) -> float:
    """
    Sends one translation request. Returns elapsed API request time.
    """
    messages = [
      {
        "role"   : "system",
        "content": (
          """You are an expert Spanish to English translator"""
        ).strip(),
      },
      {
        "role"   : "user",
        "content": (
          f"Translate: sentence #{idx}: " +
          sentence
        ),
      },
    ]
    t0 = time.perf_counter()
    response = await client.chat.completions.create(
      model       = "gpt-4o-mini",
      messages    = messages,
      temperature = 0,
    )
    elapsed = time.perf_counter() - t0
    # Optionally: print(f"idx={idx} elapsed={elapsed:.3f}s")
    return elapsed

async def worker(
  queue: asyncio.Queue[tuple[int | None, str]],
  rate_limiter: AsyncRateLimiter,
  timings: list[float],
) -> None:
    while True:
        try:
            idx, sentence = await queue.get()
        except asyncio.CancelledError:
            break
        if idx is None:  # Sentinel - signals no more work
            queue.task_done()
            break
        await rate_limiter.wait()
        elapsed = await translate_sentence(sentence, idx)
        timings.append(elapsed)
        queue.task_done()

async def run_benchmark(
  num_requests: int,
  rate_limiter: AsyncRateLimiter,
  base_sentence: str,
  max_parallel: int,
) -> None:
    """
    Schedules API translation requests using a queue and a fixed worker pool.
    Reports median and average metrics.
    """
    timings: list[float] = []
    queue: asyncio.Queue[tuple[int | None, str]] = asyncio.Queue()

    # Enqueue all tasks
    for idx in range(1, num_requests + 1):
        await queue.put((idx, base_sentence))

    # Add sentinel "None" entries to allow workers to exit
    for _ in range(max_parallel):
        await queue.put((None, ""))

    workers = [
      asyncio.create_task(worker(queue, rate_limiter, timings))
      for _ in range(max_parallel)
    ]

    wall_start = time.perf_counter()
    await queue.join()  # Wait for all work to be done
    wall_elapsed = time.perf_counter() - wall_start

    # Clean up workers (allow them to see sentinel and exit cleanly)
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)

    timings_sorted = sorted(timings)
    avg    = sum(timings_sorted) / len(timings_sorted)
    median = statistics.median(timings_sorted)
    print("-" * 60)
    print(
      f"Sent {num_requests} requests "
      f"in {wall_elapsed:.2f} seconds"
    )
    print(f"Average API request duration: {avg:.3f} seconds")
    print(f"Median API request duration:  {median:.3f} seconds")
    print(
      f"Aggregate throughput: "
      f"{num_requests / wall_elapsed:.2f} requests/sec"
    )
    print("-" * 60)

async def main() -> None:
    sentence = (
      """Esta es una prueba básica de traducción."""
    ).strip()
    rate_limiter = AsyncRateLimiter(RATELIMIT_PER_MIN)
    await run_benchmark(
      num_requests = NUM_REQUESTS,
      rate_limiter = rate_limiter,
      base_sentence = sentence,
      max_parallel = MAX_PARALLEL,
    )

if __name__ == "__main__":
    asyncio.run(main())

Key Design Advantages

  • Queue-based, resource-safe design: Fixed number of worker tasks (MAX_PARALLEL), regardless of total work. No burst, no OOM.
  • Rate limiter acts only at the dispatch point.
  • Efficient memory use: Only as many coroutines exist as are actually concurrently sending traffic.
  • Scales to millions of jobs at huge RPM, limited only by your hardware and network.
  • No initial burst: the limiter locks at each dispatch, pacing every call.

Period Between API Calls at 100,000 RPM

interval = 60.0 / 100_000  # = 0.0006 seconds = 0.6 ms
  • 0.0006 seconds (0.6 milliseconds) between consecutive launches

Adjustability and Tuning

  • You adjust for throughput by tuning MAX_PARALLEL – more for high throughput, up to the point where network or API rate limiting becomes the bottleneck.
  • Higher MAX_PARALLEL lets your workers make more efficient use of the allowable quota, without any memory ballooning.
    The rate limiter always controls actual instantaneous dispatch.

Independent analysis

The previous detailed solution fully addresses the user’s concerns, correctly identifying the limitation in the original “spawn everything at once” strategy and offering a significantly improved version using the optimal asyncio worker-queue-rate-limiter pattern.

:white_check_mark: What is done well:

  • Clearly identifies the crucial architectural weakness of spawning all tasks simultaneously.

  • Explains explicitly why this causes memory and scheduling overhead issues at very high numbers.

  • Presents the recommended pattern (async producer-consumer queue with multiple fixed workers plus scalable async rate-limiter).

  • Provides clear, highly performant Python code that’s properly structured and production-quality.

  • Carefully calculates and clarifies intervals and throughput handling, ensuring the reader understands exactly the operational performance expected.

:white_check_mark: Meets All Requirements:

  • Efficiently scales to millions of tasks without performance degradation.

  • Ensures no initial request burst, evenly distributing tasks according to rate-limit intervals.

  • Clearly calculated periods between requests, confirming correct intervals for 100,000 RPM or similar rate limits.

  • Accurate and informative comments and instructional text throughout, ensuring high instructional quality.


While that also produces statistics (and any caching would be broken by a noncing index I had injected early in the translation job input), you must control the output length so responses are consistent; otherwise the sampling used in generating responses makes per-request statistics less meaningful.
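
A minimal sketch of that setup, assuming the standard Python SDK (the max_tokens value and trial count are arbitrary): nonce the input and cap the output so per-trial timings compare like with like.

from openai import OpenAI

client = OpenAI()
sentence = "Esta es una prueba básica de traducción."

for i in range(3):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        # The trial number acts as a nonce so repeated inputs aren't identical
        messages=[{"role": "user", "content": f"Translate (trial {i}): {sentence}"}],
        temperature=0,
        max_tokens=64,  # cap the output so every trial generates a comparable number of tokens
    )
    print(i, resp.usage.completion_tokens)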

Here's the possible inspiration for the post with the graphs, which I had produced just prior, employing max_tokens and a task all but guaranteed to fill it: This week's launches: o3, o4-mini, GPT-4.1, and Codex CLI - #3 by _j

The difficulty of the task is unlikely to affect the generation rate of gpt-4o models.

1 Like