I have experienced this as well. Overall the realtime models are amazing at low-latency, non-text interactions, but I do find that I need to add extra layers to get really smooth platform/tool interactions when doing complex things like calling tools with 6+ complex params. Some examples:
- hesitation to actually call tools… cases where a mainstream text-to-text model would clearly call a tool do not result in a tool call from realtime
- parameters need extra clarity, and enums sometimes need to be spelled out either in the tool descriptions or in the overall session instructions (see the sketch after this list)
- some tools get ignored altogether and are never called
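For the enum point, something along these lines has helped me. This is just a rough sketch; the tool, parameter names, and enum values are made-up placeholders, the idea is simply to repeat the allowed values in the schema, the description, and the session instructions:

```python
# Hypothetical tool in the realtime session tool format (flat, no nested
# "function" key). The enum is stated three times on purpose: in the schema,
# in the parameter description, and in the session instructions below.
SHIPPING_TOOL = {
    "type": "function",
    "name": "set_shipping_speed",
    "description": "Set the shipping speed. Must be one of: standard, express, overnight.",
    "parameters": {
        "type": "object",
        "properties": {
            "speed": {
                "type": "string",
                "enum": ["standard", "express", "overnight"],
                "description": "One of: standard, express, overnight.",
            }
        },
        "required": ["speed"],
    },
}

# Reinforce the same constraint in the session-level instructions.
SESSION_INSTRUCTIONS = (
    "When setting shipping speed, the only valid values are "
    "'standard', 'express', and 'overnight'."
)
```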
My solution (workaround) for this until the realtime models get better is to give realtime a single tool, something like “tool_agent”, and have it pass that tool agent all the necessary context in natural-language form, then use a quick completion call to a different model to extract the actual tool call that needs to be made. This lowers the burden on the realtime model and lets you hook in a reasoning model at a higher level to determine which tool to call and how to call it. You can pack the “tool_agent” call with all the metadata, the conversation history, etc. to make sure it really nails the calls. It's a scruffy approach, but it works well, and I can test every new realtime model to decide when the patch can be removed.
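Here is a minimal sketch of that pattern with the Python SDK. The tool names, the example “real” tool, and the downstream model choice are all placeholder assumptions, not my exact setup:

```python
# Sketch of the "tool_agent" delegation pattern. Assumes OPENAI_API_KEY is set.
import json
from openai import OpenAI

client = OpenAI()

# The single tool exposed to the realtime session (realtime tools use a flat
# schema, with no nested "function" key).
TOOL_AGENT = {
    "type": "function",
    "name": "tool_agent",
    "description": (
        "Delegate any platform action to a backend agent. Describe what the "
        "user wants in natural language and include every relevant detail."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "request": {
                "type": "string",
                "description": "Natural-language description of the action to perform.",
            }
        },
        "required": ["request"],
    },
}

# The real tools with their full schemas. Only the completion model sees these
# (chat-completions tools nest the schema under "function").
REAL_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "create_order",
            "description": "Create an order with shipping and quantity details.",
            "parameters": {
                "type": "object",
                "properties": {
                    "sku": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "shipping_speed": {
                        "type": "string",
                        "enum": ["standard", "express", "overnight"],
                    },
                },
                "required": ["sku", "quantity"],
            },
        },
    },
]


def resolve_tool_call(nl_request: str, history: list[dict]) -> dict | None:
    """Turn the realtime model's NL request into a concrete tool call.

    `history` is the running conversation as chat-style {"role", "content"}
    messages, packed in so the extractor model has full context.
    """
    resp = client.chat.completions.create(
        model="gpt-4.1",  # assumption: any strong text/reasoning model works here
        messages=history + [{"role": "user", "content": nl_request}],
        tools=REAL_TOOLS,
        tool_choice="auto",
    )
    calls = resp.choices[0].message.tool_calls
    if not calls:
        return None  # extractor decided no tool call is needed
    call = calls[0]
    return {"name": call.function.name, "arguments": json.loads(call.function.arguments)}
```

When the realtime session emits a tool_agent call, you run `resolve_tool_call` on its `request` argument, execute whatever it returns, and feed the result back to realtime as the tool_agent output.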
In practice, I do let realtime call the tools it seems to understand (not sure what the pattern is, but I would characterize them generally as conversational tools: what the user wants, what the discussion is about, sentiment, etc.) and only use this tool_agent approach to encapsulate the tools it struggles with.
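With that split, the session tool list ends up looking roughly like this (reusing `TOOL_AGENT` from the sketch above; the conversational tool is again a made-up example):

```python
# Sketch of a session.update payload for the realtime connection: simple
# conversational tools stay with the realtime model, everything complex
# funnels through the single delegation tool.
SESSION_UPDATE = {
    "type": "session.update",
    "session": {
        "tool_choice": "auto",
        "tools": [
            # A conversational tool the realtime model handles directly.
            {
                "type": "function",
                "name": "log_sentiment",
                "description": "Record the user's current sentiment.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "sentiment": {
                            "type": "string",
                            "enum": ["positive", "neutral", "negative"],
                        }
                    },
                    "required": ["sentiment"],
                },
            },
            # Everything else goes through tool_agent (defined in the earlier sketch).
            TOOL_AGENT,
        ],
    },
}
```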
After trying lots of other audio models, it's clear to me that gpt-realtime is the only choice, and engineering some guardrails is a small price to pay for the win! Hope that helps.
Thank you for taking the time to report this issue. Could you please share additional details, such as the call_id? That would help the team diagnose and fix it.