Building a two-way audio transcriber using the OpenAI Realtime API

I have built a transcriber using Python (asyncio) + websockets that captures streaming microphone audio and transcribes it correctly. I now want to extend it to capture both the input and the output audio of a call (e.g. Google Meet, MS Teams) and transcribe the two streams separately with minimal latency. I'm a new dev, so I'm unsure about the architecture and tech stack needed to implement this.

Should I open two separate websocket connections, one per stream, and run the output-audio capture on its own thread feeding a queue alongside the input capture? A rough sketch of what I have in mind is below.

Also, what if the user has a Bluetooth device handling both input and output? I want the capture to be device-independent. Can I still pick up the output audio and process it to send over a websocket connection to the API? My loopback attempt is in the second sketch.

Any help with the architecture and the tech stack to choose for minimal added latency would be appreciated. Thanks!
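Here's a minimal sketch of the two-connection idea, based on the send/receive loop I already have working for a single stream. I'm assuming one Realtime session per stream, the `input_audio_buffer.append` event for pushing base64 PCM16 chunks, and the `websockets` package (the header argument is `additional_headers` on recent versions, `extra_headers` on older ones); `transcribe_stream` and the queue hand-off are my own names:

```python
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

# Assuming the Realtime transcription endpoint and headers; adjust to
# whatever your working single-stream version already uses.
REALTIME_URL = "wss://api.openai.com/v1/realtime?intent=transcription"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def transcribe_stream(label: str, audio_queue: asyncio.Queue) -> None:
    """Run one Realtime session for one audio stream (mic or loopback)."""
    # `additional_headers` on websockets >= 14; `extra_headers` on older versions
    async with websockets.connect(REALTIME_URL, additional_headers=HEADERS) as ws:

        async def send_audio() -> None:
            while True:
                chunk = await audio_queue.get()  # raw PCM16 bytes from capture
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(chunk).decode("ascii"),
                }))

        async def receive_events() -> None:
            async for message in ws:
                event = json.loads(message)
                # Tag transcription events with the stream they came from
                if "transcription" in event.get("type", ""):
                    print(f"[{label}] {event}")

        await asyncio.gather(send_audio(), receive_events())

async def main() -> None:
    mic_q: asyncio.Queue = asyncio.Queue()
    loopback_q: asyncio.Queue = asyncio.Queue()
    # Capture threads/tasks (not shown) push PCM16 chunks into these queues
    await asyncio.gather(
        transcribe_stream("ME", mic_q),
        transcribe_stream("THEM", loopback_q),
    )

asyncio.run(main())
```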
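And this is how I was thinking of grabbing the output side in a device-independent way, using the `soundcard` package's loopback support (`get_microphone(..., include_loopback=True)` mirrors whatever the default speaker is playing, so it should follow a Bluetooth headset automatically; from what I've read this works on Windows/WASAPI and Linux/PulseAudio, while macOS needs a virtual device like BlackHole). `push_to_queue` is a placeholder for the hand-off into the asyncio side:

```python
import numpy as np       # soundcard returns float32 numpy frames
import soundcard as sc   # pip install soundcard

SAMPLE_RATE = 24000      # match whatever the Realtime session expects
CHUNK_FRAMES = 2400      # 100 ms per chunk at 24 kHz

# A loopback "microphone" mirrors whatever the default speaker plays,
# so it should follow the user to a Bluetooth headset automatically.
speaker = sc.default_speaker()
loopback = sc.get_microphone(speaker.name, include_loopback=True)

with loopback.recorder(samplerate=SAMPLE_RATE, channels=1) as rec:
    while True:  # meant to run in its own capture thread
        frames = rec.record(numframes=CHUNK_FRAMES)  # float32 in [-1, 1]
        pcm16 = (np.clip(frames[:, 0], -1.0, 1.0) * 32767).astype(np.int16)
        # push_to_queue is a placeholder for handing bytes to the asyncio
        # side, e.g. loop.call_soon_threadsafe(loopback_q.put_nowait, ...)
        push_to_queue(pcm16.tobytes())
```

Does this two-session, queue-per-stream layout make sense, or is there a lower-latency way to do it?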