We have been seeing increased response time/latency on the chat completion API since last week or so. Our usage doesn't actually require the streaming API: streaming is enabled, but the feature waits for the whole stream to complete before moving on to the next step, so it is not essential here. Is this something others are noticing as well, or is it specific to us and our usage? I see another team under the same org complaining about it too, so I know it is not specific to my team's usage.
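Since the downstream step blocks until the full response anyway, the call could simply be made without streaming. A minimal sketch of what I mean, assuming the official `openai` Python SDK (v1.x client style) and a placeholder prompt:

```python
# Sketch: a plain non-streaming call, since the consumer waits for the
# whole response before proceeding anyway. The messages are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Process the attached content."},
    ],
    # stream defaults to False; the full completion comes back as one
    # response object instead of a chunk iterator
)
print(response.choices[0].message.content)
```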
Last week's worth of inference times for 256 tokens: it's up a bit… but seems about right for this time of the week.
How do we access this graph? Our total token count is around 4k with the gpt-3.5-turbo model, if we can take a look at that specific data. Aside from the slowdown, this specific feature has a complex task/prompt; it takes a pretty long time to execute and latency is high. We are looking into improving/optimizing the response time as well.
You can find it here
4k is fairly heavy, basically maxing out the context for every call; it would certainly benefit from some prompt optimisation.
The problem is that we need to pass the content to get accurate results and prevent hallucination. The content can vary, and some of it can be pretty small as well. Do you have any recommendations for resolving this differently? I think gpt-4 might be faster, but I can't tell yet; I will measure this. I am already thinking about prompt optimization: getting rid of some additional details, changing the max content token counts, and seeing how speed is affected. As far as I know, token count and task complexity slow down the response as well.
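To put numbers on the gpt-4 vs gpt-3.5-turbo comparison, something like this rough timing probe is what I have in mind; the prompt, models, and `max_tokens` are illustrative placeholders, and it assumes the official `openai` Python SDK:

```python
# Rough latency probe for comparing models / context sizes.
# "sample_context" is a stand-in for the real content passed by the feature.
import time
from openai import OpenAI

client = OpenAI()

def time_completion(model: str, context: str) -> float:
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer using only the provided content."},
            {"role": "user", "content": context},
        ],
        max_tokens=256,
    )
    return time.perf_counter() - start

sample_context = "..."  # the real content goes here
for model in ("gpt-3.5-turbo", "gpt-4"):
    print(model, f"{time_completion(model, sample_context):.2f}s")
```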
Deltas are not really of any use to the API due to the stateless nature of the model: there would be no starting point to apply any delta to, so… great idea, but not applicable here. If you need 4k of context, then you need it. The only real way to reduce that would be to first pass it to the model for summarisation. That would work, but you gain nothing, as you still require the entire thing to be fed into the API at some point.
The models are only going to get faster and cheaper as time moves forward, barring local blips from infrastructure issues, that is. So this is a problem that will solve itself, given some time.
I thought about summarization, but it would add unnecessary additional cost, plus I might need to summarize in chunks and then get a summary of all the summaries (see the sketch below). It works fine for now with what I am doing. Is there any other tool/library that can be used? I am going to explore LangChain, since it seems like a good option for large token counts or text/HTML content if I am not mistaken, but it might not help for this.
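The chunk-then-summarize-the-summaries idea can be sketched without any extra library; a rough map-reduce pass, assuming the official `openai` Python SDK and a hypothetical `CHUNK_CHARS` cutoff chosen to stay well under the model's context limit:

```python
# Map-reduce summarization sketch: summarize each chunk, then summarize
# the combined chunk summaries. CHUNK_CHARS and the prompts are
# illustrative placeholders, not values from this thread. Note that every
# call here adds cost, which is exactly the trade-off mentioned above.
from openai import OpenAI

client = OpenAI()
CHUNK_CHARS = 8000  # rough character cutoff per chunk; tune for your content

def summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Summarise the user's text concisely."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

def summarize_long(text: str) -> str:
    # Map step: one summary per chunk
    chunks = [text[i:i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]
    partial = [summarize(chunk) for chunk in chunks]
    # Reduce step: a summary of all the summaries
    return summarize("\n\n".join(partial))
```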