I’m seeing highly variable response times when making concurrent calls to OpenAI’s Chat Completions API with Python’s ThreadPoolExecutor. I dispatch 3 different prompts in parallel, but their completion times vary significantly, with some requests taking several times longer than others even though all 3 are submitted at essentially the same time.
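For reference, here is a simplified sketch of my dispatch logic (the prompt strings are placeholders, and I’ve trimmed response handling; the real prompts are ~5000 input tokens each):

```python
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder prompts -- the real ones are ~5000 input tokens each
prompts = ["prompt_1 ...", "prompt_2 ...", "prompt_3 ..."]

def timed_call(prompt: str) -> float:
    """Send one chat completion request and return its wall-clock latency."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
    )
    return time.perf_counter() - start

# All 3 requests are submitted at effectively the same time
with ThreadPoolExecutor(max_workers=3) as executor:
    latencies = list(executor.map(timed_call, prompts))

for i, latency in enumerate(latencies, 1):
    print(f"request {i}: {latency:.2f}s")
```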
For example, a single run produced these response times:
- 2.58 seconds
- 4.44 seconds
- 12.00 seconds
Environment
- Model: gpt-4o-mini
- Usage tier: 1
- Avg tokens per request: input ≈ 5000, output ≈ 300
I’m looking for insights into why the latency varies this much, and for strategies to get more consistent response times when managing concurrent requests to the API.