Building a two-way audio transcriber using the OpenAI Realtime API

I have built a transcriber using Python (asyncio) + websockets that captures streaming microphone audio and transcribes it correctly. I now want to extend it to capture both the input and the output audio of a call (e.g. Google Meet, MS Teams) and transcribe the two streams separately with minimal latency. I'm a new dev, so I'm unsure about the architecture and tech stack needed to implement this.

Should I open two separate websocket connections, one per stream, and run the output-audio capture on its own thread feeding a queue alongside the input capture? A rough sketch of what I have in mind is below.

Also, what if the user has a Bluetooth device handling both input and output? I want the capture to be device-independent. Can I still pick up the output audio and process it to send over a websocket connection to the API? My loopback attempt is in the second sketch.

Any help with the architecture and the tech stack to choose for minimal added latency would be appreciated. Thanks!
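Here's a minimal sketch of the two-connection idea, based on the send/receive loop I already have working for a single stream. I'm assuming one Realtime session per stream, the `input_audio_buffer.append` event for pushing base64 PCM16 chunks, and the `websockets` package (the header argument is `additional_headers` on recent versions, `extra_headers` on older ones); `transcribe_stream` and the queue hand-off are my own names:

```python
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

# Assuming the Realtime transcription endpoint and headers; adjust to
# whatever your working single-stream version already uses.
REALTIME_URL = "wss://api.openai.com/v1/realtime?intent=transcription"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def transcribe_stream(label: str, audio_queue: asyncio.Queue) -> None:
    """Run one Realtime session for one audio stream (mic or loopback)."""
    # `additional_headers` on websockets >= 14; `extra_headers` on older versions
    async with websockets.connect(REALTIME_URL, additional_headers=HEADERS) as ws:

        async def send_audio() -> None:
            while True:
                chunk = await audio_queue.get()  # raw PCM16 bytes from capture
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(chunk).decode("ascii"),
                }))

        async def receive_events() -> None:
            async for message in ws:
                event = json.loads(message)
                # Tag transcription events with the stream they came from
                if "transcription" in event.get("type", ""):
                    print(f"[{label}] {event}")

        await asyncio.gather(send_audio(), receive_events())

async def main() -> None:
    mic_q: asyncio.Queue = asyncio.Queue()
    loopback_q: asyncio.Queue = asyncio.Queue()
    # Capture threads/tasks (not shown) push PCM16 chunks into these queues
    await asyncio.gather(
        transcribe_stream("ME", mic_q),
        transcribe_stream("THEM", loopback_q),
    )

asyncio.run(main())
```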
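And this is how I was thinking of grabbing the output side in a device-independent way, using the `soundcard` package's loopback support (`get_microphone(..., include_loopback=True)` mirrors whatever the default speaker is playing, so it should follow a Bluetooth headset automatically; from what I've read this works on Windows/WASAPI and Linux/PulseAudio, while macOS needs a virtual device like BlackHole). `push_to_queue` is a placeholder for the hand-off into the asyncio side:

```python
import numpy as np       # soundcard returns float32 numpy frames
import soundcard as sc   # pip install soundcard

SAMPLE_RATE = 24000      # match whatever the Realtime session expects
CHUNK_FRAMES = 2400      # 100 ms per chunk at 24 kHz

# A loopback "microphone" mirrors whatever the default speaker plays,
# so it should follow the user to a Bluetooth headset automatically.
speaker = sc.default_speaker()
loopback = sc.get_microphone(speaker.name, include_loopback=True)

with loopback.recorder(samplerate=SAMPLE_RATE, channels=1) as rec:
    while True:  # meant to run in its own capture thread
        frames = rec.record(numframes=CHUNK_FRAMES)  # float32 in [-1, 1]
        pcm16 = (np.clip(frames[:, 0], -1.0, 1.0) * 32767).astype(np.int16)
        # push_to_queue is a placeholder for handing bytes to the asyncio
        # side, e.g. loop.call_soon_threadsafe(loopback_q.put_nowait, ...)
        push_to_queue(pcm16.tobytes())
```

Does this two-session, queue-per-stream layout make sense, or is there a lower-latency way to do it?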