I have seen in the new Realtime API reference, under the transcription object of the session.update event, that a new STT model `gpt-4o-transcribe-latest` is supported. But for some reason, when I use it in the new GA session, it doesn't work: it never emits conversation.input_audio_transcription.delta at all. When I revert back to `gpt-4o-transcribe` it works, so I'm not sure what's happening on OpenAI's side.
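For context, here is a minimal sketch of the kind of session.update payload involved. The field names follow the beta-style shape implied by the event names in this thread (`input_audio_transcription`); the GA API may nest these differently, so treat this as an assumption and check the current reference:

```python
import json

# Sketch of a session.update payload enabling input-audio transcription.
# Field layout is an assumption based on the beta Realtime API shape.
session_update = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            # "gpt-4o-transcribe" is the model that reliably emitted
            # delta events in this thread; the "-latest" alias did not.
            "model": "gpt-4o-transcribe",
        },
    },
}

# The payload is sent as JSON over the realtime websocket connection.
payload = json.dumps(session_update)
```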
Also, the transcripts coming in from Realtime don't always align with the audio completely: there are instances where the transcript is missing phrases that are present in the audio, and the other way around too.
Also, the events we receive are not always correct and sequential; there is some mismatch in the timing of the events being sent, which makes it hard to align the audio with the text.
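One way to cope with interleaved or mistimed events is to key the deltas by `item_id` rather than relying on arrival order. A small sketch of that idea, assuming the beta-style event shape (`type`, `item_id`, `delta`, `transcript` fields), which may differ in the GA API:

```python
from collections import defaultdict

class TranscriptAssembler:
    """Collects transcription events per item_id so each item's transcript
    can be reassembled even when events from different items interleave.
    The event field names here are assumptions based on the beta API."""

    def __init__(self):
        self.parts = defaultdict(list)   # item_id -> list of delta strings
        self.completed = {}              # item_id -> final transcript

    def handle(self, event: dict) -> None:
        etype = event.get("type", "")
        item_id = event.get("item_id")
        if etype.endswith("input_audio_transcription.delta"):
            self.parts[item_id].append(event["delta"])
        elif etype.endswith("input_audio_transcription.completed"):
            # Prefer the server's final transcript over concatenated deltas.
            self.completed[item_id] = event.get(
                "transcript", "".join(self.parts[item_id])
            )

    def transcript(self, item_id: str) -> str:
        return self.completed.get(item_id, "".join(self.parts[item_id]))
```

This doesn't fix the timing mismatch itself, but it keeps the text-audio pairing stable per conversation item.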
Thanks for sharing, but may I ask if this is answering Madhun's question?
We have the same experience on our end. The audio-to-audio experience is pretty good, but as soon as we compare the user's audio input with the transcription, it gets worse. On top of Madhun's issues, we sometimes see different languages between audio and text…
@Co_Brainers I’m partially answering Madhun’s question… how I get the transcripts. Specific things I am not addressing:
use of "…-latest" models: I don't like those since they are highly volatile… basically they are just a pointer to the real model, so why not choose that one explicitly? I would not do that in general, and I have no opinion on how to make it work
transcription models don't match realtime models: I do see this in practice. The realtime models work directly off of audio tokens, meaning they don't need text… the fact that you're getting text is just a convenience, and there's no guarantee that the content as understood by the realtime model will match the text as transcribed by a different model. If this is a big problem, I would start by using full models (not -mini). Ultimately you may have to chain the pipeline microphone→STT→LLM→TTS→speaker so that text is truth… but latency will be a challenge.
Thank you very much for the elaborate answer. I understand both points you have made. We had a working pipeline of Google ASR → GPT-4o → Google TTS, which was pretty good so far, with no latency problems either. The disadvantages in recognizing named entities, e.g. names and email addresses, drove us to try s2s, and the results are much better; the downside is the divergence between voice/text and the volatility in language detection.
We only need realtime s2s, so I am currently thinking of changing to a post-processing approach to get the transcription done. br andre
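The post-processing idea can be sketched as follows: keep speech-to-speech for the live interaction, buffer the raw input audio as it arrives, and run one batch STT pass over the full recording afterwards, so transcript quality no longer competes with latency. The offline transcription function here is a stand-in for a real batch STT call:

```python
class SessionRecorder:
    """Buffers raw audio chunks during a realtime s2s session so they can
    be transcribed offline after the session ends."""

    def __init__(self):
        self.chunks: list[bytes] = []

    def on_audio_chunk(self, chunk: bytes) -> None:
        # Called from the realtime audio callback as chunks arrive.
        self.chunks.append(chunk)

    def finalize(self) -> bytes:
        return b"".join(self.chunks)

def transcribe_offline(audio: bytes) -> str:
    # Stand-in: in practice this would be a single call to a batch
    # transcription endpoint on the full recording.
    return f"<transcript of {len(audio)} bytes>"
```

A usage pattern would be: feed `on_audio_chunk` from the session, then call `transcribe_offline(recorder.finalize())` once the session closes. The transcript arrives late, but it comes from a single consistent STT pass instead of the realtime model's side-channel text.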