We have been seeing increased response time/latency on the chat completion API since last week or so. Our usage doesn't actually require the streaming API: streaming is enabled, but the feature waits for the whole stream to complete before moving on to the next step, so it is not essential here. Is this something others are noticing as well, or is it specific to us and our usage? I see another team under the same org complaining about it too, so I know it is not specific to my team's usage.
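Since the downstream step blocks until the full response anyway, the call could simply be made without streaming. A minimal sketch of what I mean, assuming the official `openai` Python SDK (v1.x client style) and a placeholder prompt:

```python
# Sketch: a plain non-streaming call, since the consumer waits for the
# whole response before proceeding anyway. The messages are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Process the attached content."},
    ],
    # stream defaults to False; the full completion comes back as one
    # response object instead of a chunk iterator
)
print(response.choices[0].message.content)
```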
Last week's worth of inference times for 256 tokens: it's up a bit… but seems about right for this time of the week.
How do we access this graph? Our total token count is around 4k with the gpt-3.5-turbo model, if we can take a look at that specific data. Aside from the slowdown, this specific feature has a complex task/prompt; it takes a pretty long time to execute and latency is high. We are looking into improving/optimizing the response time as well.
You can find it here
4k is fairly heavy, basically maxing out the context for every call; it would certainly benefit from some prompt optimisation.
The problem is that we need to pass the content to get accurate results and prevent hallucination. The content can vary, and some of it can be pretty small as well. Do you have any recommendations for resolving this differently? I think gpt-4 might be faster, but I can't tell yet; I will measure this. I am already thinking about prompt optimization: getting rid of some additional details, changing the max content token counts, and seeing how speed is affected. As far as I know, token count and task complexity slow down the response as well.
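To put numbers on the gpt-4 vs gpt-3.5-turbo comparison, something like this rough timing probe is what I have in mind; the prompt, models, and `max_tokens` are illustrative placeholders, and it assumes the official `openai` Python SDK:

```python
# Rough latency probe for comparing models / context sizes.
# "sample_context" is a stand-in for the real content passed by the feature.
import time
from openai import OpenAI

client = OpenAI()

def time_completion(model: str, context: str) -> float:
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer using only the provided content."},
            {"role": "user", "content": context},
        ],
        max_tokens=256,
    )
    return time.perf_counter() - start

sample_context = "..."  # the real content goes here
for model in ("gpt-3.5-turbo", "gpt-4"):
    print(model, f"{time_completion(model, sample_context):.2f}s")
```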
Deltas are not really of any use to the API due to the stateless nature of the model: there would be no starting point to apply any delta to, so… great idea, but not applicable here. If you need 4k of context, then you need it. The only real way to reduce that would be to first pass it to the model for summarisation. That would work, but you gain nothing, as you still require the entire thing to be fed into the API at some point.
The models are only going to get faster and cheaper as time moves forward, barring local blips from infrastructure issues, that is. So this is a problem that will solve itself, given some time.
I thought about summarization, but it would add unnecessary additional cost, plus I might need to summarize in chunks and then get a summary of all the summaries (see the sketch below). It works fine for now with what I am doing. Is there any other tool/library that can be used? I am going to explore LangChain, since it seems like a good option for large token counts or text/HTML content if I am not mistaken, but it might not help for this.
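The chunk-then-summarize-the-summaries idea can be sketched without any extra library; a rough map-reduce pass, assuming the official `openai` Python SDK and a hypothetical `CHUNK_CHARS` cutoff chosen to stay well under the model's context limit:

```python
# Map-reduce summarization sketch: summarize each chunk, then summarize
# the combined chunk summaries. CHUNK_CHARS and the prompts are
# illustrative placeholders, not values from this thread. Note that every
# call here adds cost, which is exactly the trade-off mentioned above.
from openai import OpenAI

client = OpenAI()
CHUNK_CHARS = 8000  # rough character cutoff per chunk; tune for your content

def summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Summarise the user's text concisely."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

def summarize_long(text: str) -> str:
    # Map step: one summary per chunk
    chunks = [text[i:i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]
    partial = [summarize(chunk) for chunk in chunks]
    # Reduce step: a summary of all the summaries
    return summarize("\n\n".join(partial))
```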