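OpenAI’s Realtime API can optionally provide the user-side transcript. Can you use that? The Realtime API is a voice-to-voice model, so it does not need a text transcript of the user’s speech in order to respond. If you ask for one, OpenAI runs the user’s audio through a separate transcription model and sends you the result. To enable this, set input_audio_transcription in your session.update event and choose the transcription model there (for example, whisper-1). Then, during the conversation, listen for the server event ‘conversation.item.input_audio_transcription.completed’, which carries the user-side transcript. The similarly named ‘response.audio_transcript.done’ event carries the transcript of the model’s spoken reply, not the user’s speech.
For details on ‘response.audio_transcript.done’, see https://platform.openai.com/docs/api-reference/realtime-server-events/response/audio_transcript/done; the user-side event is documented in the same server-events reference.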
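As a rough illustration of the two steps (enable transcription in the session, then pick transcripts out of the event stream), here is a minimal Python sketch. It assumes the beta-style session shape, where the setting is session.input_audio_transcription, and it assumes that user-side transcripts arrive as conversation.item.input_audio_transcription.completed while response.audio_transcript.done carries the assistant's spoken reply. Newer API versions may nest or rename these fields, so check the current reference. The helper names build_session_update and extract_transcript are hypothetical, and the WebSocket connection itself is left out.

```python
import json

# Event names as assumed in the lead-in; verify against the current API reference.
USER_TRANSCRIPT_EVENT = "conversation.item.input_audio_transcription.completed"
ASSISTANT_TRANSCRIPT_EVENT = "response.audio_transcript.done"


def build_session_update(model="whisper-1"):
    """Return a JSON session.update message that requests user-side transcripts.

    Hypothetical helper: the payload shape follows the beta-style session object.
    """
    return json.dumps({
        "type": "session.update",
        "session": {
            "input_audio_transcription": {"model": model},
        },
    })


def extract_transcript(raw_message):
    """Return (speaker, transcript) for transcript events, otherwise None.

    Hypothetical helper: raw_message is one JSON text frame from the server.
    """
    event = json.loads(raw_message)
    event_type = event.get("type")
    if event_type == USER_TRANSCRIPT_EVENT:
        return ("user", event.get("transcript", ""))
    if event_type == ASSISTANT_TRANSCRIPT_EVENT:
        return ("assistant", event.get("transcript", ""))
    return None
```

In use, you would send the string from build_session_update() over the open Realtime WebSocket once the session is created, then pass every incoming text frame to extract_transcript() and log the pairs it returns.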