I was wondering if the realtime api also gives you back the original user input audio file that it sends to Whisper for transcription? I can only retrieve the transcript now but I might need audio file too for my workflow.
It does not.
You’d have to:
- use your own voice activity detector.
- Isolate the speech to send.
- You’ve got your audio
Then use the API:
- add a conversation item for user
- do a response.create
- get audio back
Sorry if that breaks the illusion of “real-time”, but that’s what’s actually being done for you by the API with VAD. It doesn’t fit in with WebRTC, but then WebRTC authentication doesn’t fit in with any good pattern of security.