Welcome to the community forums! Good to have you.
Just to clarify: you’re not using streaming, right? And by latency you’re not referring to time-to-first-token, but rather the full request → response time, is that correct?
This could be pretty concerning, because it might imply that a lot of people using streaming are getting billed for what they would consider timeout errors.
Yes, request → response time. We are using a proxy monitoring platform and supposedly the time-to-first-token is 0 ms, but I’d take that with a grain of salt. Worth saying these issues were happening before we started using the proxy.
Any advice on how we could go about debugging this? Since it seems random, we considered killing the connection after a certain time limit and retrying, but that would still lead to long response times.
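One option, if you do go the cancel-and-retry route, is to let the client enforce the cap for you and log which attempts hit it. Here is a minimal sketch in TypeScript, assuming the official openai Node SDK (v4), which as far as I know accepts timeout and maxRetries options on the client; the 20 s cutoff is just an illustrative value:

// Sketch: cap each request at a hard time limit and let the client retry,
// instead of waiting out a slow call. Values are illustrative, not recommendations.
import OpenAI from "openai";

const client = new OpenAI({
  timeout: 20 * 1000, // abort any request that takes longer than 20 s
  maxRetries: 2,      // the SDK retries timed-out/failed requests automatically
});

const started = Date.now();
const completion = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "ping" }],
});
console.log(`request finished in ${Date.now() - started} ms`);
console.log(completion.choices[0].message.content);

As you say, a retried request still shows up as a long overall response time, but at least the logs will tell you how often and at what times the slow calls happen.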
May I ask which platform you are using to monitor this? I have also been observing much higher GPT latencies recently and am trying to get to the bottom of it. In fact, I use the Python client library, and I notice some calls don’t even seem to return after two minutes!
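If it helps, one way to tell whether it’s time-to-first-token or the generation itself that blows up is to log both on a streaming call. A rough sketch in TypeScript (assuming the official openai Node SDK and an OPENAI_API_KEY environment variable; the same idea applies to the Python client):

// Rough sketch: log time-to-first-token and total latency for one streaming call.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function timeOneRequest(prompt: string) {
  const start = Date.now();
  let firstTokenAt: number | null = null;

  const stream = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });

  for await (const chunk of stream) {
    if (firstTokenAt === null && chunk.choices[0]?.delta?.content) {
      firstTokenAt = Date.now();
    }
  }

  console.log(`time-to-first-token: ${firstTokenAt ? firstTokenAt - start : "n/a"} ms`);
  console.log(`total request time:  ${Date.now() - start} ms`);
}

await timeOneRequest("Reply with a single short sentence.");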
Same problems here with Assistants, starting today. For example, retrieving information from existing threads can take several seconds to tens of seconds, and similarly when creating threads, etc.
I am facing the exact same problem here; latencies have suddenly become significantly longer. I have an optimised prompt that normally takes 1 second, and now it suddenly takes 30 seconds. I am using both gpt-4o and gpt-4o-mini and see the same excessive latency with both.
Same here… requests are taking more than 31.28 s when they normally take less than 5 s. It is slow both streaming and non-streaming. Our prod environment cannot process requests from customers; it is effectively an outage.
It’s solved for me now. Thanks! I am watching the OpenAI Status page.
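In case it’s useful to anyone watching for incidents programmatically: the status page appears to be a standard Statuspage instance, so (assuming that’s the case and the usual /api/v2/status.json endpoint is exposed) a quick check could look something like this:

// Hypothetical sketch: poll the status page's summary endpoint.
// Assumes status.openai.com is a standard Statuspage instance exposing /api/v2/status.json.
const res = await fetch("https://status.openai.com/api/v2/status.json");
const data = await res.json();
console.log(`status indicator: ${data.status.indicator}`);     // e.g. "none", "minor", "major"
console.log(`status description: ${data.status.description}`);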
Here’s a minimal reproducible example of the abnormally high latencies. I am also seeing similar times when using gpt-4o-mini, and after converting the code below to LangChain.
I have 2 API keys.
One is very long: ‘sk-proj-***********************************************************************************************************************************’
The other is short: ‘sk-***********************************’
They show different speeds: the long key is faster, and the short key is slow.
How…
import { ChatOpenAI } from "langchain/chat_models/openai";
import { StringOutputParser } from "langchain/schema/output_parser";
import { SystemMessage, HumanMessage } from "langchain/schema";

const key = "sk-...."; // API key (truncated here)

// Placeholder prompts; the originals were not included in the snippet.
const systemPrompt = new SystemMessage("You are a helpful assistant.");
const humanPrompt = new HumanMessage("Say hello in one sentence.");

const parser = new StringOutputParser();
const model = new ChatOpenAI({
  modelName: "gpt-4o",
  temperature: 0.6,
  openAIApiKey: key,
});

// Single non-streaming call; this is the request that shows the abnormally long wait.
const res = await model.pipe(parser).invoke([systemPrompt, humanPrompt]);
console.log(res);
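One thing that might be worth trying with this repro: time the equivalent request through the openai SDK directly, to rule out anything the LangChain layer adds. A rough sketch, assuming the official openai Node package and an OPENAI_API_KEY environment variable (the prompts here are just placeholders):

// Rough sketch: time an equivalent non-streaming request via the openai SDK directly,
// to see whether the slowness is on the model side or added by the LangChain layer.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const start = Date.now();
const completion = await client.chat.completions.create({
  model: "gpt-4o",
  temperature: 0.6,
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Say hello in one sentence." },
  ],
});
console.log(`direct SDK call took ${Date.now() - start} ms`);
console.log(completion.choices[0].message.content);

If the direct call is consistently fast while the LangChain path is slow, the problem is in the wrapper or its configuration; if both are slow, it points at the API itself.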
We are using AWS Lambdas (Ireland based) to make the requests. The proxy uses Cloudflare Workers, so it should be quick, but the problem happens even without the proxy.