Hi guys, my team and I have been building a custom voice AI agent/assistant in a non-English language. We are trying to build a model for various Slavic languages.
While the base model is already quite OK, I still have some specific questions for anybody who has done similar experiments in the past.
I'm very much interested in:
What language did you build for?
What settings worked best for you?
Has anybody tried to apply a custom voice after the model output (for instance, we would like the output to sound like a fictional character)? Is this even possible?
What Realtime API voice did you end up using?
My observations after some experimenting:
For Slavic languages, Alloy, Coral and Verse seem to be the best voice options.
A temperature of around 0.9 to 1.0 seems to give the best ratio of creative expression to correct pronunciation.
The model performs much better if you stop the first sequence and continue with a new (second) one. I'm not sure how or why, but multiple tests showed this behaviour.
Custom functions for adding more rules work really well.
Only gpt-4o gives the desired outputs; mini does not perform well.
It's very expensive if you are planning to use it at scale (even with caching).
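To make the settings above concrete, here is a rough sketch of how they could be bundled into a Realtime API `session.update` event. The field names follow the beta Realtime API as I understand it, so check them against the current reference before relying on them. The tool name `get_pronunciation_rules`, its parameters and the instruction text are made up for illustration; they are not something we actually shipped.

```python
import json


def build_session_update(voice="alloy", temperature=0.9):
    """Build a `session.update` event as a plain dict (no network call)."""
    return {
        "type": "session.update",
        "session": {
            # One of the voices that sounded best for Slavic languages in our tests.
            "voice": voice,
            # Around 0.9 to 1.0 gave the best expression/pronunciation balance.
            "temperature": temperature,
            # Placeholder instruction text, an assumption for this sketch.
            "instructions": "Always answer in the user's language.",
            "tools": [
                {
                    "type": "function",
                    # Hypothetical custom function used to inject extra rules.
                    "name": "get_pronunciation_rules",
                    "description": "Return extra pronunciation rules for a language.",
                    "parameters": {
                        "type": "object",
                        "properties": {"language": {"type": "string"}},
                        "required": ["language"],
                    },
                }
            ],
        },
    }


# The event would be serialized and sent over the already-open WebSocket.
payload = json.dumps(build_session_update(voice="coral", temperature=1.0))
```

Sending `payload` right after the session is created should apply the voice, temperature and tool list for the following responses, assuming the API version in use still accepts these fields.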
Thank you for any info about good practices, fuckups and learnings. I will share mine as well.
I can’t answer all of these, but I can give you insights on this one:
Assuming you mean the actual sound of a voice and not the way it enunciates, these kinds of vocal output requests are typically handled through ElevenLabs.
While it’s possible to essentially fine-tune models to sound like a particular “fictional character”, this is also illegal without explicit permission from the voice actor who voiced that character.
VAs have the rights to their voice and how it’s used. You’re not making it sound like a fictional character, you’re replicating the voice of the voice actor who plays that fictional character.
That is why people use ElevenLabs: it's emotive, and its voice options were trained on VAs who allow their voices to be used for this kind of work.
First, to address the legal and copyright issues: we have our own voice, or we would just like to use a generic “duck-sounding, quacking” voice manipulator that would make the voice agent sound more fun and friendly.
I would use ElevenLabs, but they don't support the language I need, which is why I'm looking for different approaches.
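One direction that does not depend on a third-party voice service is to post-process the audio the model returns before playing it. Below is a minimal sketch of a naive pitch shift, assuming the output arrives as raw 16-bit mono PCM (which is what the Realtime API's `pcm16` format produces, as far as I know). The function name `pitch_shift_pcm16` is mine, and this is not a real "duck" effect: it just resamples, so raising the pitch also shortens the audio. A proper effect would need time stretching or formant shifting on top.

```python
from array import array


def pitch_shift_pcm16(pcm: bytes, factor: float) -> bytes:
    """Naively shift pitch by resampling 16-bit mono PCM.

    factor > 1 raises the pitch (and shortens the clip), factor < 1 lowers it.
    Assumes the host byte order matches the audio (little-endian on typical machines)
    and that `pcm` contains a whole number of 2-byte samples.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    samples = array("h")
    samples.frombytes(pcm)
    if len(samples) == 0:
        return b""
    last = len(samples) - 1
    out_len = max(1, int(len(samples) / factor))
    out = array("h")
    for i in range(out_len):
        # Position in the source signal for this output sample.
        pos = i * factor
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Linear interpolation between the two neighbouring source samples.
        value = samples[left] * (1.0 - frac) + samples[right] * frac
        out.append(int(round(value)))
    return out.tobytes()
```

Each audio chunk from the model would be passed through `pitch_shift_pcm16(chunk, 1.4)` (the 1.4 is an arbitrary starting value) before being queued for playback. Processing chunk by chunk can cause small clicks at chunk boundaries, so buffering a full response first is the safer option for a first test.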