Hi there! I’ve been working on a small project that shows how you can build your own web app that supports real-time voice conversations, combined with GPT-4o for questions and answers about photos and video.
Real-time API: probably high on the roadmap already, but frontend-only authentication is not that secure, since the API key is exposed in the browser
Very limited p/d tokens in the lowest tier (my guess: a conversation of 5 minutes at most)
It would be great if the Realtime API client gave you feedback when you reach the limits (afaik that is not standard?)
Nice to have the React app as a reference, but I was also a bit overwhelmed by all the WAV libraries and components. It was a bit hard to find out what was really needed to get the basics up and running
Again, I really see the combination (real-time voice + photo/video) as a useful tool for blind people
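
On the point about limit feedback: as far as I can tell, the Realtime API sends a `rate_limits.updated` server event after each response, so a client could surface a warning itself. Below is a rough sketch of that idea, not something taken from the reference app. The event shape (`name`, `limit`, `remaining`, `reset_seconds`) is my assumption of what the server sends, and `rateLimitWarnings` and its 10% threshold are hypothetical.

```typescript
// Assumed shape of one entry in a `rate_limits.updated` server event.
interface RateLimit {
  name: string;          // e.g. "requests" or "tokens"
  limit: number;         // maximum allowed in the current window
  remaining: number;     // what is left in the current window
  reset_seconds: number; // seconds until the window resets
}

// Loose event type so any parsed server event can be passed in.
interface ServerEvent {
  type: string;
  rate_limits?: RateLimit[];
}

// Hypothetical helper: returns one warning string per limit whose remaining
// share is at or below `threshold` (default 10%). Other events yield [].
function rateLimitWarnings(event: ServerEvent, threshold = 0.1): string[] {
  if (event.type !== "rate_limits.updated" || !event.rate_limits) {
    return [];
  }
  return event.rate_limits
    .filter((l) => l.limit > 0 && l.remaining / l.limit <= threshold)
    .map(
      (l) =>
        `${l.name}: ${l.remaining}/${l.limit} left, resets in ${Math.ceil(l.reset_seconds)}s`
    );
}
```

In the web app you would call this on every parsed message from the socket and show the returned strings in the UI, so the user knows the conversation is about to be cut off instead of it just going silent.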