Realtime Speech-to-Speech: Poor function calling performance vs vanilla gpt4o

I’ve done quite a bit of testing with the new realtime speech-to-speech API versus our current implementation of an AI phone agent that uses vanilla gpt4o.

The new realtime model is noticeably worse at calling functions than vanilla gpt4o. Even with a lot of prompt reinforcement, the agent often simply does not call the function, even when it tells the caller via audio that it did.

Anyone else noticing similar results? I’ve asked a few others I know are testing it and they have confirmed they are experiencing this as well.


I haven’t tested the realtime endpoint yet. Is there a workaround you’re thinking of implementing to get around this?


Yeah, I’m thinking about monitoring its output with gpt4o-mini to see if/when it misses running a function.

For example, if it outputs “OK, I’m sending you an SMS now” but there is no function call, the “checker” LLM call should be able to detect that so I can run the function myself.
The prompt would be something like: “Here’s a list of functions able to be run: XYZ. Here is the output of the last agent’s response: ‘OK, I’m sending you the SMS now’ (but no function call was made). Is there a function that matches?”

Anyway, not sure if it will work, but I’ll try it out.
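A minimal sketch of that checker pass, assuming the setup described above. All names here are hypothetical, and the LLM call is abstracted behind an `ask_llm` callable so gpt4o-mini (or anything else) can be plugged in:

```python
from typing import Callable, Optional, Sequence


def build_checker_prompt(functions: Sequence[str], agent_reply: str) -> str:
    """Assemble the cross-check prompt sketched in the post above."""
    return (
        "Here's a list of functions able to be run: "
        + ", ".join(functions)
        + '. Here is the output of the last agent\'s response: "'
        + agent_reply
        + '" (no function call was made). '
        + "If one of the functions matches what the agent claimed to do, "
        + "reply with its name; otherwise reply NONE."
    )


def detect_missed_call(
    functions: Sequence[str],
    agent_reply: str,
    function_was_called: bool,
    ask_llm: Callable[[str], str],
) -> Optional[str]:
    """Return the name of a function the agent promised but never called."""
    if function_was_called:
        return None  # the realtime model did its job; nothing to repair
    answer = ask_llm(build_checker_prompt(functions, agent_reply)).strip()
    return None if answer == "NONE" else answer
```

In production, `ask_llm` would wrap a gpt4o-mini chat-completion call; keeping it as a parameter means the detection flow can be exercised without touching the network.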


I’ll probably give the realtime endpoint a try on Sunday. If you can, feel free to share your results! I’ll report back to this post if I don’t forget.


I’ve also noticed the same problem. Your approach to catch misses is likely what we will do as well, for now.


Well, I implemented the function calling cross-check via gpt4o-mini. It works pretty well! It’s definitely catching missed function calls. Hopefully it’s just a bandaid I can remove later, though :crossed_fingers:

Another thought I had: gpt4o-realtime seems to have a minimum temperature of 0.6, while we run gpt4o in prod at 0.1. I wonder if part of the problem is simply the temperature being too high :thinking:
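If the 0.6 floor is real, the only client-side option is to clamp whatever temperature you use elsewhere up to it. A small sketch; the `session.update` payload shape is an assumption and should be checked against the current realtime API docs:

```python
def realtime_temperature(requested: float) -> float:
    # Assumed accepted range for the realtime API is 0.6-1.2, so our
    # usual prod value of 0.1 gets clamped up to the 0.6 minimum.
    return max(0.6, min(requested, 1.2))


# Hypothetical session.update event for the realtime websocket:
session_update = {
    "type": "session.update",
    "session": {"temperature": realtime_temperature(0.1)},
}
```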


Could be! I do the same, always the lowest temperature.

I’ve noticed something similar with gpt-4o-canvas: if you use a “gpt”, it won’t be able to interact with the canvas (if you “turn off” the gpt, you can interact with the canvas again). I think that has to do with the canvas’s function calling only being associated with that specific model, though, so it’s probably unrelated. You might be right about the temperature, but I’m not sure.

Also, glad that it worked, and thanks for reporting back!