Realtime Speech-to-Speech: Poor function calling performance vs vanilla gpt4o

I’ve done quite a bit of testing with the new realtime speech-to-speech API versus our current implementation of an AI phone agent that uses vanilla gpt4o.

The new realtime model is noticeably worse at calling functions than vanilla gpt4o. Even with a lot of prompt reinforcement, the agent often simply does not call the function, even when it tells the caller via audio that it did.

Anyone else noticing similar results? I’ve asked a few others I know are testing it and they have confirmed they are experiencing this as well.


I haven’t tested the realtime endpoint yet. Is there a workaround you’re thinking of implementing to get around this?


Yeah, I’m thinking about monitoring its output with gpt4o-mini to see if/when it misses running a function.

For example, if it outputs “OK, I’m sending you an SMS now” but there is no function call, the “checker” LLM call should be able to detect that so I can run the function myself.
The prompt would be something like: “Here’s a list of functions able to be run: XYZ. Here is the output of the last agent’s response: ‘OK, I’m sending you the SMS now’ (but no function call was made). Is there a function that matches?”

Anyway, not sure if it will work, but I’ll try it out.
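A minimal sketch of that checker pass, assuming the setup described above. All names here are hypothetical, and the LLM call is abstracted behind an `ask_llm` callable so gpt4o-mini (or anything else) can be plugged in:

```python
from typing import Callable, Optional, Sequence


def build_checker_prompt(functions: Sequence[str], agent_reply: str) -> str:
    """Assemble the cross-check prompt sketched in the post above."""
    return (
        "Here's a list of functions able to be run: "
        + ", ".join(functions)
        + '. Here is the output of the last agent\'s response: "'
        + agent_reply
        + '" (no function call was made). '
        + "If one of the functions matches what the agent claimed to do, "
        + "reply with its name; otherwise reply NONE."
    )


def detect_missed_call(
    functions: Sequence[str],
    agent_reply: str,
    function_was_called: bool,
    ask_llm: Callable[[str], str],
) -> Optional[str]:
    """Return the name of a function the agent promised but never called."""
    if function_was_called:
        return None  # the realtime model did its job; nothing to repair
    answer = ask_llm(build_checker_prompt(functions, agent_reply)).strip()
    return None if answer == "NONE" else answer
```

In production, `ask_llm` would wrap a gpt4o-mini chat-completion call; keeping it as a parameter means the detection flow can be exercised without touching the network.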


I’ll probably give the realtime endpoint a try on Sunday. If you can, feel free to share your results! I’ll report back to this post if I don’t forget.


I’ve also noticed the same problem. Your approach to catch misses is likely what we will do as well, for now.


Well, I implemented the function calling cross-check via gpt4o-mini. It works pretty well! It’s definitely catching missed function calls. Hopefully it’s just a bandaid I can remove later, though :crossed_fingers:

Another thought I had: gpt4o-realtime seems to have a minimum temperature of 0.6, while we run gpt4o in prod at 0.1. I wonder if part of the problem is simply the temperature being too high :thinking:
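If the 0.6 floor is real, the only client-side option is to clamp whatever temperature you use elsewhere up to it. A small sketch; the `session.update` payload shape is an assumption and should be checked against the current realtime API docs:

```python
def realtime_temperature(requested: float) -> float:
    # Assumed accepted range for the realtime API is 0.6-1.2, so our
    # usual prod value of 0.1 gets clamped up to the 0.6 minimum.
    return max(0.6, min(requested, 1.2))


# Hypothetical session.update event for the realtime websocket:
session_update = {
    "type": "session.update",
    "session": {"temperature": realtime_temperature(0.1)},
}
```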


Could be! I do the same, always the lowest temperature.

I’ve noticed something similar with gpt-4o-canvas: if you use a “gpt”, it won’t be able to interact with the canvas (if you “turn off” the gpt, you can interact with the canvas again). I think that has to do with the canvas’s function calling only being associated with that specific model, though, so it’s probably unrelated. You might be right about the temperature, but I’m not sure.

Also, glad that it worked, and thanks for reporting back!