Hi Community,
I’ve been working on a project that uses the GPT-3.5-Turbo model. I demoed a version of the project in mid-July, and the maximum response-generation timeout for the project is set to 8 seconds. For all my use cases, the demo worked perfectly. But running the exact same code today, the use cases are failing. In particular, the chat completion requests that result in a function call have significantly higher latency.
I’m running in the exact same environment as last month, so I’m sure almost nothing has changed on my setup. I’m wondering whether the APIs have gotten slower. Has anyone else experienced the same thing?
Thanks!
Hi,
There is always time-of-day and day-of-week variation; I’m sure there will also turn out to be time-of-year variation, but we are still in year one. There will also be local variation due to infrastructure and related connectivity factors.
One should allow a large enough operating envelope to prevent timeout errors from becoming a significant factor. I realise that some chained systems may see a compounded lengthening of interactions when this is factored in.
You should understand that this is typical of a beta development environment, and your end users should be aware that significant variation in performance can be expected in the early phases of the project.
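As a rough illustration of what I mean by a wider envelope (a minimal sketch assuming the pre-v1 Python openai library; the timeout and retry values are only examples, not recommendations), you can give each call a generous timeout and retry with back-off rather than failing hard at 8 seconds:

```python
import time
import openai

def chat_with_retry(messages, functions=None, max_attempts=3, timeout_s=30):
    """Call the chat completion endpoint, retrying with back-off on errors/timeouts."""
    kwargs = {"model": "gpt-3.5-turbo", "messages": messages, "request_timeout": timeout_s}
    if functions:
        kwargs["functions"] = functions
    for attempt in range(max_attempts):
        try:
            return openai.ChatCompletion.create(**kwargs)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1 s, 2 s, 4 s between attempts
```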
Thanks for your response @Foxalabs.
I’ve been trying to quantify the variation I have seen over the last few days. I have run the program at different times throughout the day (from 4 am to 4 pm), but I see the same problem. From my observations, the issue affects specific requests only.
For requests that result in plain text output, the response is just as quick as before. But when the LLM response includes a function call request, it seems to take unusually long to generate. Subsequent text-generation calls after that are still fine.
So, at least from my observations, it seems like there is something wrong with function calls. I’m not sure whether something changed with function calling in the last month.
Indeed, function calling can take a significant amount of additional time to process. I’m not 100% certain, but looking at the source code for the API, there does seem to be a different model used for function results, so that is likely the cause. I can only suggest quantifying these results and accounting for the increased delay.
It could be worth exploring the complexity of the function calls, as they ultimately have to be processed by a model, so large complex functions will slow things down.
Thanks again @Foxalabs for checking this.
but looking at the source code for the API, there does seem to be a different model used for function results, so that is likely the cause
It’d be great if you could give me some reference to this.
Also, the functions themselves are not too complicated. Right now, for my testing I have disabled any complicated logic.
Let me explain this with a scenario to be more specific about the issue I’m facing.
Example: The user can ask my program to summarize a news article and then send the generated response to the user’s phone.
I have a function that I pass with the GPT chat completion request, called send_to_phone(param). The param is a JSON object that looks like this: { "Title": "…", "Content": "…" }.
The function just posts an API call on a background thread, so the function itself doesn’t add much latency.
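For reference, a rough sketch of what that function does (the endpoint URL here is a placeholder; the real one is internal to my project):

```python
import json
import threading
import requests

def send_to_phone(param: str) -> str:
    """Post the summary to the phone endpoint on a background thread."""
    payload = json.loads(param)  # {"Title": "...", "Content": "..."}

    def _post():
        # placeholder URL standing in for my internal notification endpoint
        requests.post("https://example.com/notify", json=payload, timeout=10)

    threading.Thread(target=_post, daemon=True).start()
    return "queued"  # returned to the model as the function result
```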
I have an LLM executor that works like this (a rough sketch follows the list):
1. Take the user’s text input
2. Run it through the LLM
3. Check whether the LLM requested a function call
4. If a function call was requested, run the function and go back to step 2 with the result
5. If no function call was requested, return the LLM-generated output
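To make the loop concrete, here is a minimal sketch of the executor, assuming the pre-v1 openai Python library and the send_to_phone sketch above; the function schema and model name are simplified placeholders for what I actually use:

```python
import openai

# simplified version of the function schema advertised to the model
FUNCTIONS = [{
    "name": "send_to_phone",
    "description": "Send a title and content to the user's phone",
    "parameters": {
        "type": "object",
        "properties": {
            "Title": {"type": "string"},
            "Content": {"type": "string"},
        },
        "required": ["Title", "Content"],
    },
}]

def run_executor(user_text: str, max_rounds: int = 5) -> str:
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_rounds):
        response = openai.ChatCompletion.create(   # step 2: run through the LLM
            model="gpt-3.5-turbo",
            messages=messages,
            functions=FUNCTIONS,
        )
        message = response["choices"][0]["message"]
        call = message.get("function_call")        # step 3: did it request a function?
        if not call:
            return message["content"]              # step 5: no function call, return text
        result = send_to_phone(call["arguments"])  # step 4: run the function...
        messages.append(message)
        messages.append({                          # ...and feed the result back to step 2
            "role": "function",
            "name": call["name"],
            "content": result,
        })
    return "Stopped after too many function-call rounds."
```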
Where I do see the unusual latency is on the first pass: step 2 takes unusually long to complete. This wasn’t happening in July.
Also, I know that latency increases with the number of tokens generated, and the param to the function is sometimes very long. But even when the input has a small number of tokens, I still see longer waits, which is strange considering it was working a couple of weeks ago.
The possibility of a different model was raised in a chat I had with another forum member; if I can find the thread, I’ll link it here later.
Although I’ve not found any great differences for basic use myself, it might be worth adding some detailed logging around the system events to see exactly where things are getting delayed. Are you positive it’s between the API hand-off and the receipt of a reply, or could there be a third step/party involved? (I’m thinking it might be a delay in the message-to-phone handler.)
Also, are we talking about differences of a few seconds, i.e. from 8 seconds to 25 seconds, or from 8 seconds to 5 minutes?
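Something along these lines would do (a minimal sketch, again assuming the pre-v1 Python openai library), timing only the API round trip so it can’t be confused with your own post-processing:

```python
import logging
import time
import openai

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("latency")

def timed_chat(messages, functions=None, **kwargs):
    """Log how long each chat completion round trip takes and how it finished."""
    if functions:
        kwargs["functions"] = functions
    start = time.perf_counter()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=messages, **kwargs
    )
    elapsed = time.perf_counter() - start
    choice = response["choices"][0]
    log.info("API round trip: %.2f s, finish_reason=%s, completion_tokens=%s",
             elapsed, choice["finish_reason"], response["usage"]["completion_tokens"])
    return response
```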
I’m talking about latency differences in seconds: it used to be around 2 or 3 seconds; now it’s around 8 to 10 seconds. I checked the rest of my code too, optimized some parts, and added additional metrics. I’ll continue to monitor this.
But for now, I was able to drop the use case that caused the original issue.
Thank you for your insights.
“Slower” will happen in much the same way your GPU slows down when it is busy decoding 4K video on another screen.
In fact, this is the perfect analogy in this context, just at company scale.
That’s why I was expecting to see differences in behavior by time of day, but I don’t see that happening here. I am now timing my executions and consolidating all the data; in what I’ve collected so far, I don’t see any differences.
Hello, I have the same issue but with much longer times than mentioned.
In a process where I make 3 consecutive requests, one after the other, I used to get the final result in about 5 seconds. Since this Wednesday, the process has been taking up to 8 minutes.
Does anyone else have this problem?
Can someone recommend a solution?
I am using the “gpt-3.5-turbo-16k” model.
Should I switch models?
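For what it’s worth, to narrow it down I’m going to time each of the three requests separately, roughly like this (a sketch with the pre-v1 Python openai library; the prompts are placeholders, and each step feeds the previous answer into the next request):

```python
import time
import openai

# placeholder prompts standing in for my three real requests
steps = ["step 1 prompt", "step 2 prompt", "step 3 prompt"]

context = ""
for i, prompt in enumerate(steps, start=1):
    start = time.perf_counter()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": f"{prompt}\n\n{context}"}],
        request_timeout=60,  # fail fast instead of hanging for minutes
    )
    context = response["choices"][0]["message"]["content"]
    print(f"request {i}: {time.perf_counter() - start:.1f} s")
```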