I’m experiencing a consistent latency issue with the OpenAI API: many requests take 20–30+ seconds to complete, even after several optimizations. I’m using the model to extract entities with an ontology-based function-calling setup, and my input text averages around 30,000 characters (~8–9k tokens).

I’ve tried switching models (including gpt-4.1-mini), simplified the function schema, and verified with a Stopwatch in .NET that the delay happens inside the OpenAI API call, not in my application. Even when the output is small, response times frequently exceed 20 seconds, which feels unusually high and inconsistent.

I suspect the slowdown may be caused by internal queueing related to TPM/RPM limits, concurrency restrictions, or deployment capacity, but I’m not sure how these contribute or how to diagnose them. I have access to the OpenAI dashboard, but I’m unclear which metrics (rate limits, autoscaling behavior, concurrency settings, instance count, etc.) directly affect this type of latency, or how to interpret the breakdown between queue time and compute time.

I would appreciate guidance on whether this latency is expected for large ontology-based extraction prompts, how to determine whether my requests are being queued, and which configuration changes (adjusting rate limits, enabling autoscaling, modifying deployment settings, or switching model versions) could improve consistency and reduce response times.
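In case it helps others reproduce the measurement: one way I’ve seen to separate server-side time from network overhead is to compare wall-clock time against the `openai-processing-ms` response header, which (as far as I know) reports how long the API spent handling the request. Below is a minimal Python sketch of that breakdown; the header name is the only assumption, and the numbers in the example are hypothetical, not from my logs.

```python
import time


def latency_breakdown(total_seconds: float, headers: dict) -> dict:
    """Split wall-clock latency into server processing time vs. everything
    else (network, TLS, client overhead, and any queueing not reported
    by the server), using the openai-processing-ms response header."""
    processing_s = int(headers.get("openai-processing-ms", 0)) / 1000.0
    return {
        "total_s": round(total_seconds, 3),
        "server_processing_s": round(processing_s, 3),
        "other_s": round(max(total_seconds - processing_s, 0.0), 3),
    }


# Hypothetical example: a 24 s call where the server reported
# 21.5 s of processing time -> only ~2.5 s is outside the API.
start = time.monotonic()
# ... make the chat.completions request here and capture its headers ...
fake_headers = {"openai-processing-ms": "21500"}  # placeholder values
print(latency_breakdown(24.0, fake_headers))
```

If `server_processing_s` dominates, the time is genuinely being spent (or queued) on the API side rather than in the client or network, which would at least narrow down where to look in the dashboard.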