I’m experimenting with the beta Realtime API in a purely speech-to-speech scenario. According to this API reference, transcription via Whisper is not native to the main speech-to-speech model; it’s an optional, asynchronous feature.
My goal is to use function calling to produce structured JSON output based on spoken user input. The model itself seems to handle the audio directly, so I'm not sure whether enabling Whisper transcription is necessary for my function-calling flow or whether it's purely optional.
Specifically, at the end of the conversation I need to produce a structured JSON response containing the key information the user provided through speech. Does that require me to enable `input_audio_transcription`? Or can the model handle speech-to-speech natively and still trigger function calls that produce the JSON data?
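To make the question concrete, here's roughly the `session.update` event I'm planning to send over the Realtime websocket. The tool name `record_user_details` and its schema are placeholders I made up for my use case, not anything from the API reference:

```python
# Sketch of the session configuration I have in mind.
# "record_user_details" and its parameter schema are my own placeholders.
import json

session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["audio", "text"],
        "tools": [
            {
                "type": "function",
                "name": "record_user_details",  # placeholder tool name
                "description": "Capture the key details the user provided during the call.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "reason_for_call": {"type": "string"},
                    },
                    "required": ["name", "reason_for_call"],
                },
            }
        ],
        "tool_choice": "auto",
        # This is the part I'm unsure about -- do I also need:
        # "input_audio_transcription": {"model": "whisper-1"},
    },
}

# ws.send(json.dumps(session_update))  # sent once the websocket is open
```

My assumption is that the model would call `record_user_details` with the structured arguments at the end of the conversation, and I'd read them from the function-call events on the websocket without ever needing a transcript. Is that correct, or does function calling rely on transcription under the hood?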
Thanks in advance!