Concurrency Rate Limiting: A $10,000 Issue

Again, in essence, the problem is simple: total network latency scales with the number of concurrent requests, and the scaling cannot be explained by our rate limits.

I built a dummy request with a 62-token prompt that produces 62 output tokens, then ran identical copies of it in parallel (asynchronously / concurrently): 5 at a time, 50 at a time, and 100 at a time. A sketch of the harness follows.
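For concreteness, here is a minimal sketch of that kind of harness, assuming an OpenAI-style chat completions endpoint and an aiohttp client. The URL, model name, and payload are illustrative stand-ins, not the exact script we ran:

```python
import asyncio
import time

import aiohttp  # assumed async HTTP client; any equivalent works

API_URL = "https://api.openai.com/v1/chat/completions"  # illustrative endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}       # placeholder key

PAYLOAD = {
    "model": "gpt-4o",                                  # illustrative model name
    "messages": [{"role": "user", "content": "..."}],   # stands in for the ~62-token prompt
    "max_tokens": 62,
}

async def timed_request(session: aiohttp.ClientSession) -> float:
    """Fire one request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    async with session.post(API_URL, headers=HEADERS, json=PAYLOAD) as resp:
        await resp.json()  # drain the body so timing covers the full response
    return time.perf_counter() - start

async def run_batch(n: int) -> None:
    """Launch n identical requests concurrently and report the mean latency."""
    async with aiohttp.ClientSession() as session:
        latencies = await asyncio.gather(*(timed_request(session) for _ in range(n)))
    print(f"{n} concurrent requests -> {sum(latencies) / n:.2f} s (avg)")

async def main() -> None:
    for n in (5, 50, 100):
        await run_batch(n)

asyncio.run(main())
```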

Again, recall that every single request is identical to every other one. Because they run in parallel, we would expect them all to have roughly the same network latency, aside from minor millisecond-level differences due to network congestion (negligible on our AWS server).

Instead:

5 concurrent requests → 2.32 seconds (avg)
50 concurrent requests → 4.90 seconds (avg)
100 concurrent requests → 9.22 seconds (avg)

Manually checking the 100-concurrent-request data, we find that we don't come anywhere close to exhausting our rate limit.
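For what it's worth, here is how one could spot-check that on each response, assuming an OpenAI-style API that reports x-ratelimit-* response headers (the header names are an assumption; adapt them to your provider). This variant of the request function from the sketch above logs the remaining headroom:

```python
# Hypothetical spot-check: log the rate-limit headers that OpenAI-style
# endpoints attach to each response, to confirm headroom remains.
async def timed_request_with_headers(session: aiohttp.ClientSession) -> float:
    start = time.perf_counter()
    async with session.post(API_URL, headers=HEADERS, json=PAYLOAD) as resp:
        await resp.json()
        print(
            "remaining requests:", resp.headers.get("x-ratelimit-remaining-requests"),
            "| remaining tokens:", resp.headers.get("x-ratelimit-remaining-tokens"),
        )
    return time.perf_counter() - start
```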

This is a significant bug for enterprise-level scaling.