In each `response.done` event sent by the server you can count the tokens used.
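A minimal sketch of what that looks like when handling server events over the WebSocket connection. The field names under `response.usage` (input/output totals and the text/audio breakdown) reflect my understanding of the event payload, so double-check them against your API version:

```python
import json

# Running totals across the session (my own bookkeeping, not part of the API).
totals = {"input_tokens": 0, "output_tokens": 0}

def handle_server_event(raw_message: str) -> None:
    """Inspect each server event; response.done carries the usage for that response."""
    event = json.loads(raw_message)
    if event.get("type") != "response.done":
        return

    usage = event["response"].get("usage", {})
    totals["input_tokens"] += usage.get("input_tokens", 0)
    totals["output_tokens"] += usage.get("output_tokens", 0)

    # Usage is also broken down by modality (text vs. audio),
    # which matters because audio tokens are priced differently.
    details = usage.get("output_token_details", {})
    print(
        f"response done: {usage.get('total_tokens', 0)} total tokens "
        f"({details.get('text_tokens', 0)} text out, {details.get('audio_tokens', 0)} audio out)"
    )
```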
In my limited testing of the Realtime API, it can:
- Whisper - though it sometimes gets cut off and flagged.
- Sing - though it is always immediately cut off and flagged as a violation.
- Speak with various accents (Jamaican, Russian, etc.)
- Produce ambient sounds during speech (this happens spontaneously).
- Speak in a higher- or lower-pitched voice.
It cannot:
- Detect how the user speaks (loudly/quietly, whispering/not whispering).
For the above to work, you need to make sure your system prompt explicitly guides the model toward that behavior.
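For example, I steer that behavior through the session instructions. This is just a sketch assuming an already-open WebSocket (`ws`) to the Realtime API; the wording of the instructions is up to you:

```python
import json

def set_voice_instructions(ws) -> None:
    """Update the session so the system prompt explicitly allows/guides the voice behavior."""
    # 'ws' is assumed to be an open WebSocket connection to the Realtime API.
    ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "instructions": (
                "You are a friendly storyteller. Speak with a gentle Jamaican accent, "
                "and when the user asks you to whisper, lower your voice and speak softly."
            ),
        },
    }))
```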
Hey, can someone share their experience on whether it’s better to integrate the Realtime API directly or through LiveKit, Twilio, or Agora?
I’d like to know which one works best, and how to differentiate one service from another. Maybe someone from OpenAI can shed some light here?
I did it with LiveKit; easy setup, and good tools and framework.
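For comparison, the "direct" route is essentially one authenticated WebSocket. A rough sketch using the `websockets` package; the URL, model name, and beta header are from the docs as I remember them, so verify them before relying on this:

```python
import asyncio
import json
import os

import websockets  # pip install websockets

async def main() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: newer websockets releases name this keyword 'additional_headers'.
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Ask for a simple response to confirm the connection works.
        await ws.send(json.dumps({"type": "response.create"}))
        async for raw in ws:
            event = json.loads(raw)
            print(event["type"])
            if event["type"] == "response.done":
                break

asyncio.run(main())
```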
The Realtime API is fantastic, but it’s extremely expensive, which makes deploying commercial applications unfeasible.
Do you see any reductions in pricing happening anytime soon?
Thanks
Prompt Caching now being available for the Realtime API should make commercial deployments more feasible.
Hoping to see the service costs drop down as well, but this is pretty new, so only time will tell!
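If it helps, the cached portion should show up in the same `response.done` usage breakdown, so you can verify the savings per response. The field names here (`input_token_details.cached_tokens`) are my assumption about the payload and worth double-checking:

```python
def cached_fraction(usage: dict) -> float:
    """Return the share of input tokens served from the prompt cache for one response."""
    details = usage.get("input_token_details", {})
    input_tokens = usage.get("input_tokens", 0)
    if not input_tokens:
        return 0.0
    return details.get("cached_tokens", 0) / input_tokens
```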