Variable Response Times in Concurrent API Calls with OpenAI's ChatCompletion API

I’m experiencing variable response times when making concurrent API calls to OpenAI’s ChatCompletion API using Python’s ThreadPoolExecutor. I send 3 different prompts in parallel, but the execution times vary significantly, with some requests taking much longer than others even though all are triggered simultaneously.

For example, the response times are:

  • 2.58 seconds
  • 4.44 seconds
  • 12.00 seconds

Environment

  • Model: gpt-4o-mini
  • Tier: 1
  • Avg tokens: input ≈ 5000, output ≈ 300

I’m looking for insights into optimizing these calls for more consistent response times and any strategies for effectively managing concurrent requests to the API.
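
For reference, here is a minimal sketch of the setup being described, assuming the current `openai` Python SDK (v1-style client) with the API key in the environment; the prompts are placeholders standing in for the ~5000-token inputs:

```python
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder prompts standing in for the three ~5000-token inputs.
PROMPTS = ["<prompt 1>", "<prompt 2>", "<prompt 3>"]


def timed_call(prompt: str) -> float:
    """Send one chat completion request and return its wall-clock latency."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,  # matches the ~300-token outputs above
    )
    return time.perf_counter() - start


# All three requests are submitted at effectively the same moment;
# the per-request latencies still come back spread far apart.
with ThreadPoolExecutor(max_workers=3) as pool:
    for latency in pool.map(timed_call, PROMPTS):
        print(f"{latency:.2f} seconds")
```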


Seeing the same problem:

  • 15 concurrent requests
  • to gpt-4o-mini
  • 7000 input tokens each
  • 200 output tokens each

This results in vastly different times: 8s, 13s, 29s.