I have built a transcriber using Python (asyncio) + WebSockets that captures streaming input audio and transcribes it correctly. I now want to extend it to capture both the input and output audio of a call (e.g. GMeet, MS Teams) and transcribe the two streams separately with minimal latency. I'm a new dev, so I'm unsure about the architecture and tech stack this requires.

Should I open two separate WebSocket connections, and spin up a separate thread (apart from the input-audio one) to drain the output-audio queue? What if I'm using a Bluetooth device for both input and output? I want the solution to be device-independent. Can I still capture the output audio and send it over a WebSocket connection to the API?

Any guidance on the architecture and a tech stack with minimal added latency would be appreciated. Thanks!
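For context, here is a rough sketch of the direction I'm imagining: one asyncio task per audio stream, each with its own capture queue and its own WebSocket connection, rather than extra threads. This assumes the `sounddevice` library for capture and the `websockets` library for transport; the endpoint URL, device names, and the `stream` query parameter are placeholders, and capturing output audio still depends on the OS exposing a loopback/monitor device (a PulseAudio "monitor" source on Linux, a virtual device like BlackHole on macOS, or a WASAPI loopback driver on Windows):

```python
import asyncio
import sounddevice as sd
import websockets

SAMPLE_RATE = 16000
CHUNK_MS = 100
API_URL = "wss://example.com/transcribe"  # placeholder endpoint

async def stream_device(device, label):
    """Capture one audio device and forward raw PCM chunks over its own WebSocket."""
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()

    def callback(indata, frames, time_info, status):
        # PortAudio invokes this from its own thread, so hand the
        # bytes over to the asyncio loop thread-safely.
        loop.call_soon_threadsafe(queue.put_nowait, bytes(indata))

    # The "stream" query parameter is a made-up way to tag mic vs. system audio.
    async with websockets.connect(f"{API_URL}?stream={label}") as ws:
        with sd.RawInputStream(device=device, samplerate=SAMPLE_RATE,
                               channels=1, dtype="int16",
                               blocksize=SAMPLE_RATE * CHUNK_MS // 1000,
                               callback=callback):
            while True:
                chunk = await queue.get()
                await ws.send(chunk)

async def main():
    # Device names are assumptions: pick your actual mic and a
    # loopback/monitor device from sd.query_devices().
    await asyncio.gather(
        stream_device("default", "mic"),
        stream_device("Monitor of Built-in Audio", "system"),
    )

if __name__ == "__main__":
    asyncio.run(main())
```

Is this a reasonable shape, or is there a better pattern for keeping the two streams independent while staying device-agnostic?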