The only “text conversion” is the transcript of the output that is provided to you. That transcript comes from a separate transcription service that converts audio to text.
There is a conversion, though: WAV audio is encoded into a tokenized spectral representation that the model understands (not text), and a reverse codec produces the output audio. This codec is proprietary.
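Since the real codec is proprietary, the following is only a toy sketch of the general idea of turning audio into non-text tokens and back. It assumes a naive DFT magnitude spectrum per frame and a nearest-neighbor lookup in a tiny hand-made codebook. Every name here (`tokenize`, `detokenize`, the frame size, the codebook) is hypothetical and says nothing about how the actual model works.

```python
import cmath
import math


def frame_signal(samples, frame_size, hop):
    """Split a list of samples into frames of frame_size, stepping by hop."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, hop)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for the non-negative frequency bins of one frame."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def nearest_code(vector, codebook):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )


def tokenize(samples, codebook, frame_size=8, hop=8):
    """Audio samples -> list of integer token IDs (one per frame, no text involved)."""
    return [nearest_code(magnitude_spectrum(f), codebook)
            for f in frame_signal(samples, frame_size, hop)]


def detokenize(tokens, codebook):
    """Token IDs -> (lossy) spectral frames; a real codec would then synthesize audio."""
    return [codebook[t] for t in tokens]


# Hypothetical two-entry codebook: a "low tone" frame and a "high tone" frame.
codebook = [
    [0.0, 4.0, 0.0, 0.0, 0.0],  # energy in DFT bin 1
    [0.0, 0.0, 0.0, 4.0, 0.0],  # energy in DFT bin 3
]

# Eight samples of a bin-1 sine followed by eight samples of a bin-3 sine.
low = [math.sin(2 * math.pi * 1 * t / 8) for t in range(8)]
high = [math.sin(2 * math.pi * 3 * t / 8) for t in range(8)]
samples = low + high

tokens = tokenize(samples, codebook)
frames = detokenize(tokens, codebook)
```

The point of the sketch is that the tokens are integer IDs for spectral patterns, so at no stage does the audio pass through text. A production codec would use a learned codebook and a neural decoder rather than a lookup table.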