Variable Response Times in Concurrent API Calls with OpenAI's ChatCompletion API

I’m experiencing highly variable response times when making concurrent calls to OpenAI’s ChatCompletion API using Python’s ThreadPoolExecutor. I send 3 different prompts in parallel, but the execution times of the 3 requests vary significantly, with some taking several times longer than others despite all being triggered simultaneously (a minimal sketch of my setup follows the timings below).

For example, in one run the three response times were:

  • 2.58 seconds
  • 4.44 seconds
  • 12.00 seconds
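
Here is a minimal sketch of the pattern I’m using. The prompts shown are placeholders (the real ones average ~5000 input tokens), and the exact parameters are simplified:

```python
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI  # openai>=1.0 Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder prompts; the real ones are ~5000 input tokens each.
prompts = ["prompt 1", "prompt 2", "prompt 3"]

def timed_call(prompt: str) -> float:
    """Send one chat completion request and return its wall-clock latency."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,  # outputs average ~300 tokens
    )
    return time.perf_counter() - start

# All three requests are submitted at effectively the same time.
with ThreadPoolExecutor(max_workers=3) as executor:
    for latency in executor.map(timed_call, prompts):
        print(f"{latency:.2f} seconds")
```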

Environment

  • Model: GPT-4o-mini
  • Tier: 1
  • Avg token count: input ≈ 5000, output ≈ 300

I’m looking for insights into optimizing these calls for more consistent response times and any strategies for effectively managing concurrent requests to the API.
