Hi Community,
I’ve been working on a project that uses the GPT-3.5-Turbo model. I demoed a version of the project in mid-July, and the maximum response-generation timeout for the project is set to 8 seconds. For all my use cases, the demo worked perfectly. But running the exact same code today, the use cases are failing. In particular, the chat completion requests that result in a function call have significantly higher latency.
I’m running in the exact same environment as last month, so I’m sure almost nothing has changed on my setup. I’m wondering whether the APIs have gotten slower. Has anyone else experienced the same thing?
Thanks!
Hi,
There is always time-of-day and day-of-week variation; I’m sure there will also turn out to be time-of-year variation, but we are still in year one. There will also be local variation due to infrastructure and related connectivity factors.
One should allow a large enough operating envelope to prevent timeout errors from becoming a significant factor. I realise that some chained systems may see a compounded lengthening of interactions when this is factored in.
You should understand that this is typical of a beta development environment, and your end users should be aware that significant variation in performance can be expected in the early phases of the project.
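As a rough illustration of what I mean by a wider envelope (a minimal sketch assuming the pre-v1 Python openai library; the timeout and retry values are only examples, not recommendations), you can give each call a generous timeout and retry with back-off rather than failing hard at 8 seconds:

```python
import time
import openai

def chat_with_retry(messages, functions=None, max_attempts=3, timeout_s=30):
    """Call the chat completion endpoint, retrying with back-off on errors/timeouts."""
    kwargs = {"model": "gpt-3.5-turbo", "messages": messages, "request_timeout": timeout_s}
    if functions:
        kwargs["functions"] = functions
    for attempt in range(max_attempts):
        try:
            return openai.ChatCompletion.create(**kwargs)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1 s, 2 s, 4 s between attempts
```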
Thanks for your response @Foxalabs.
I’ve been trying to quantify the variation I have seen over the last few days. I have run the program at different times throughout the day (from 4 am to 4 pm), but I see the same problem. From my observations, the issue affects specific requests only.
For requests that result in plain text output, the response is just as quick as before. But when the LLM response includes a function call request, it seems to take unusually long to generate. Subsequent text-generation calls after that are still fine.
So, at least from my observations, it seems like there is something wrong with function calls. I’m not sure whether something changed with function calling in the last month.
Indeed, function calling can take a significant amount of additional time to process. I’m not 100% certain, but looking at the source code for the API, there does seem to be a different model used for function results, so that is likely the cause. I can only suggest quantifying these results and accounting for the increased delay.
It could be worth exploring the complexity of the function calls, as they ultimately have to be processed by a model, so large complex functions will slow things down.
Thanks again @Foxalabs for checking this.
but looking at the source code for the API, there does seem to be a different model used for function results, so that is likely the cause
It’d be great if you could give me some reference to this.
Also, the functions themselves are not too complicated. Right now, for my testing I have disabled any complicated logic.
Let me explain this with a scenario to be more specific about the issue I’m facing.
Example: The user can ask my program to summarize a news article and then send the generated response to the user’s phone.
I have a function that I pass with the GPT chat completion request, called send_to_phone(param). The param is a JSON object that looks like this: { "Title": "…", "Content": "…" }.
The function just posts an API call on a background thread, so the function itself doesn’t add much latency.
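For reference, a rough sketch of what that function does (the endpoint URL here is a placeholder; the real one is internal to my project):

```python
import json
import threading
import requests

def send_to_phone(param: str) -> str:
    """Post the summary to the phone endpoint on a background thread."""
    payload = json.loads(param)  # {"Title": "...", "Content": "..."}

    def _post():
        # placeholder URL standing in for my internal notification endpoint
        requests.post("https://example.com/notify", json=payload, timeout=10)

    threading.Thread(target=_post, daemon=True).start()
    return "queued"  # returned to the model as the function result
```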
I have an LLM executor that works like this (a rough sketch follows the list):
1. Take the user’s text input
2. Run it through the LLM
3. Check whether the LLM requested a function call
4. If a function call was requested, run the function and go back to step 2 with the result
5. If no function call was requested, return the LLM-generated output
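To make the loop concrete, here is a minimal sketch of the executor, assuming the pre-v1 openai Python library and the send_to_phone sketch above; the function schema and model name are simplified placeholders for what I actually use:

```python
import openai

# simplified version of the function schema advertised to the model
FUNCTIONS = [{
    "name": "send_to_phone",
    "description": "Send a title and content to the user's phone",
    "parameters": {
        "type": "object",
        "properties": {
            "Title": {"type": "string"},
            "Content": {"type": "string"},
        },
        "required": ["Title", "Content"],
    },
}]

def run_executor(user_text: str, max_rounds: int = 5) -> str:
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_rounds):
        response = openai.ChatCompletion.create(   # step 2: run through the LLM
            model="gpt-3.5-turbo",
            messages=messages,
            functions=FUNCTIONS,
        )
        message = response["choices"][0]["message"]
        call = message.get("function_call")        # step 3: did it request a function?
        if not call:
            return message["content"]              # step 5: no function call, return text
        result = send_to_phone(call["arguments"])  # step 4: run the function...
        messages.append(message)
        messages.append({                          # ...and feed the result back to step 2
            "role": "function",
            "name": call["name"],
            "content": result,
        })
    return "Stopped after too many function-call rounds."
```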
Where I do see the unusual latency is on the first pass: step 2 takes unusually long to complete. This wasn’t happening in July.
Also, I know that latency increases with the number of tokens generated, and the param to the function is sometimes very long. But even when the input has a small number of tokens, I still see longer waits, which is strange considering it was working a couple of weeks ago.
The possibility of a different model was raised in a chat I had with another forum member; if I can find the thread, I’ll link it here later.
Although I’ve not found any great differences for basic use myself, it might be worth adding some detailed logging around the system events to see exactly where things are getting delayed. Are you positive it’s between the API hand-off and the receipt of a reply, or could there be a third step/party involved? (I’m thinking it might be a delay in the message-to-phone handler.)
Also, are we talking about differences of a few seconds, i.e. from 8 seconds to 25 seconds, or from 8 seconds to 5 minutes?
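Something along these lines would do (a minimal sketch, again assuming the pre-v1 Python openai library), timing only the API round trip so it can’t be confused with your own post-processing:

```python
import logging
import time
import openai

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("latency")

def timed_chat(messages, functions=None, **kwargs):
    """Log how long each chat completion round trip takes and how it finished."""
    if functions:
        kwargs["functions"] = functions
    start = time.perf_counter()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=messages, **kwargs
    )
    elapsed = time.perf_counter() - start
    choice = response["choices"][0]
    log.info("API round trip: %.2f s, finish_reason=%s, completion_tokens=%s",
             elapsed, choice["finish_reason"], response["usage"]["completion_tokens"])
    return response
```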
I’m talking about latency differences in seconds: it used to be around 2 or 3 seconds; now it’s around 8 to 10 seconds. I checked the rest of my code too, optimized some parts, and added additional metrics. I’ll continue to monitor this.
But for now, I was able to drop the use case that caused the original issue.
Thank you for your insights.
“Slower” will happen in much the same way your GPU slows down when it is busy decoding 4K video on another screen.
In fact, this is the perfect analogy in this context, just at company scale.
That’s why I was expecting to see differences in behavior by time of day, but I don’t see that happening here. I am now timing my executions and consolidating all the data; in what I’ve collected so far, I don’t see any differences.
Hello, I have the same issue but with much longer times than mentioned.
In a process where I make 3 consecutive requests, one after the other, I used to get the final result in about 5 seconds. Since this Wednesday, the process has been taking up to 8 minutes.
Does anyone else have this problem?
Can someone recommend a solution?
I am using the “gpt-3.5-turbo-16k” model.
Should I switch models?
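For what it’s worth, to narrow it down I’m going to time each of the three requests separately, roughly like this (a sketch with the pre-v1 Python openai library; the prompts are placeholders, and each step feeds the previous answer into the next request):

```python
import time
import openai

# placeholder prompts standing in for my three real requests
steps = ["step 1 prompt", "step 2 prompt", "step 3 prompt"]

context = ""
for i, prompt in enumerate(steps, start=1):
    start = time.perf_counter()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": f"{prompt}\n\n{context}"}],
        request_timeout=60,  # fail fast instead of hanging for minutes
    )
    context = response["choices"][0]["message"]["content"]
    print(f"request {i}: {time.perf_counter() - start:.1f} s")
```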