Adding voice (Whisper) input support to ChatKit

When using the ChatKit embed, is there a recommended way to capture microphone input,
transcribe the speech with Whisper, and feed that text into the same ChatKit conversation?

Or is native audio / microphone input planned for a future release of the embedded ChatKit?

2 Likes

I can’t speak to roadmap but my approach for this is to open a gpt-realtime conversation (have done websockets from py or webRTC from ts) to capture low-latency transcription and pipe it to my chatkit window as typed text. My chatkit agent has a reply field that I push back to realtime as text so the realtime model stays informed about what’s happening. Not counting on the realtime model to do any tool calling or reasoning on the reply other than to keep the user busy.

Bottom line… realtime does the low-latency audio in a super simple setup, chatkit is a smarter back-end that does the work. sort of a Cyrano de Bergerac setup. Turn detection is tricky and I don’t have it all nailed down but its much better than trying to detect audio turns just to do STT and TTS in and out of chatkit.