Is the Realtime API directly speech-to-speech?

The only “text conversion” involved is the transcript of the output that the API provides to you. That transcript comes from a separate transcription service that converts the audio to text.

There is still a conversion step: WAV audio is encoded into a tokenized spectral representation that the model can understand (audio tokens, not text), and a reverse codec turns the output back into audio. The codec itself is proprietary.
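To make "tokenized spectral audio" a bit more concrete, here is a toy sketch of the general idea: split PCM samples into frames, take a coarse magnitude spectrum of each frame, and quantize it into one integer token. This is not the Realtime API's codec (that one is proprietary and certainly far more sophisticated); the frame size, band count, quantization levels, and function names below are all hypothetical choices for illustration.

```python
import math

# All constants are hypothetical, chosen only to keep the toy small.
FRAME = 64   # samples per frame
BANDS = 4    # coarse spectral bands per frame
LEVELS = 8   # quantization levels per band


def frame_spectrum(frame):
    """Return DFT magnitudes (scaled by 1/n) for the first n/2 bins."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(math.hypot(re, im) / n)
    return mags


def tokenize(samples):
    """Map PCM samples in [-1, 1] to one integer token per full frame."""
    tokens = []
    for start in range(0, len(samples) - FRAME + 1, FRAME):
        mags = frame_spectrum(samples[start:start + FRAME])
        width = len(mags) // BANDS
        token = 0
        for b in range(BANDS):
            # Sum the energy in this band, then quantize it to a small integer.
            energy = sum(mags[b * width:(b + 1) * width])
            level = min(LEVELS - 1, int(energy * 2 * LEVELS))
            # Pack the per-band levels into a single base-LEVELS number.
            token = token * LEVELS + level
        tokens.append(token)
    return tokens


def tone(bin_index, n=FRAME):
    """One frame of a sine wave that is exactly periodic in the frame."""
    return [math.sin(2 * math.pi * bin_index * i / n) for i in range(n)]
```

Feeding it a low tone and a high tone, e.g. `tokenize(tone(2))` and `tokenize(tone(20))`, yields different tokens, while silence maps to token 0. The point is only that the model's input is a sequence of discrete audio tokens derived from the spectrum, with no text in between.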
