We’ve been running into latency issues when trying out GPT-3.5 Turbo and GPT through the API.
We’re asking the model to create questions based on a simple prompt and return the data as JSON. With text-davinci-003 we wait around 15–25 seconds for the full result to arrive, but with the better models it’s around 45 seconds minimum, often well over 60. Max tokens are around 2000 for the prompt and response combined.
Is this normal or way off? Either way, these latencies make the API pretty much unusable for production.
In my experience, response latency varies a lot. Depending on the time of day and the traffic, my query returns usually take between 24 and 87 seconds (I have a pipeline with multiple GPT calls involved, usually with tenacity used to retry on RateLimitError).
While it is definitely not fit for high-speed production as it is, maybe with Foundry becoming available in the near future, that might solve the problem.
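For anyone curious what the retry-on-RateLimitError pattern mentioned above looks like: tenacity handles this declaratively, but here is a minimal hand-rolled sketch of the same idea. All names here (`RateLimitError`, `with_retries`, `flaky_call`) are illustrative stand-ins, not the real SDK:

```python
# Minimal sketch of retrying on rate-limit errors with exponential
# backoff + jitter. The tenacity library provides the same behavior
# via its @retry decorator; this is just the idea spelled out.
import random
import time

class RateLimitError(Exception):
    """Stand-in for the API client's rate-limit exception."""

def with_retries(fn, max_attempts=6, base_delay=1.0):
    """Call fn(), retrying on RateLimitError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts, surface the error
            # back off: 1s, 2s, 4s, ... with jitter, capped at 60s
            delay = min(base_delay * 2 ** (attempt - 1), 60) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Usage: a call that fails twice, then succeeds on the third attempt
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "ok"

result = with_retries(flaky_call, base_delay=0.01)
```

Note this only smooths over rate limits; it doesn’t reduce the per-call latency itself, and the backoff adds to total wall-clock time when the API is under load.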
At the moment this is (unfortunately) quite common. Around a week ago it was more like 1–4 seconds and quite stable in that range.
If you look in the forum you’ll see a lot of threads regarding high latency and timeouts. In my opinion this is related to the change in infrastructure and the high demand at the moment.
Hope this helps you better assess the situation.