Realtime API sometimes creates speech before a tool call, sometimes doesn't

There is no real control to instruct the system whether it should create speech before calling a function.
For example I tried to prevent the system to generate speech before calling a function by explicitly instructing it multiple times in the base prompt:

Do not announce your function calls or tool calling plans, this is very important.
Never generate audio or text before calling a function.
If you want to call a tool just say nothing.
When you need to use a tool, call it immediately without explaining that you’re about to do so.

Furthermore I implemented a fake conversation history (basically in context learning), explicitly containing one user prompt followed by a function call and an answer (showing the system that no audio is generated before calling the function). This helped but there is still a ~15% chance that it generates speech before calling a function. Also, as far as I know, there is no way that we know in hindsight whether the audio will be followed by a function call or not.

The only thing left I could do is change the function descriptions, explicitly mentioning that it should not create audio when calling the function.