Realtime Russian voice: any path to Custom Voices or stable pronunciation control?

Hi everyone.

We are building a Russian-language AI phone assistant using the OpenAI Realtime API for live phone conversations.

The integration works technically: sessions start, audio is generated, and voice/style instructions are delivered. We can influence general style, warmth, and pace, but we cannot get stable native Russian pronunciation through prompting alone.

The issue is not only phrasing. It is low-level speech quality: English-like melody, unstable Russian consonants, occasional nasal/metallic/synthetic resonance, and inconsistent articulation of Russian sounds and clusters.

We already tried:

  • a long pronunciation/style mask;
  • a shorter compact Russian-only instruction set;
  • positive pronunciation instructions;
  • negative instructions such as avoiding English-like R/L/T/N/M artifacts;
  • different speed/style settings.

The result improves somewhat, but it is not deterministic enough for production calls.

Has anyone found a reliable OpenAI-supported path for this?

Specifically:

  1. Is there any path to Custom Voices for Realtime?
  2. Is there any private beta, enterprise option, or allowlist for custom voice creation or voice tuning?
  3. Are pronunciation dictionaries, phoneme markup, SSML-style controls, or server-side voice profiles available anywhere in the OpenAI audio stack?
  4. If not, what is the best practical OpenAI Realtime setup for natural Russian voice agents?

We are not trying to clone a public person. We can provide a professional voice actor and explicit consent if a custom voice workflow exists.

Any practical guidance would be appreciated.

Hi @SerFer55, I went through the same thing with a Russian phone assistant on Realtime.

On your questions: as far as I can tell, Realtime still has no SSML, phoneme markup, or pronunciation dictionaries. The main built-in option is the “Reference Pronunciations” section in the system prompt. It helps a little, but it is not enough for stable Russian. Custom voices for Realtime do seem to exist, but only through enterprise/account-team approval; I have not found a public way to bring your own voice actor in yet.

What I think is actually going wrong: a lot of the “English melody” and odd vowels come from wrong stress. Russian stress moves around, and when it is off, vowel reduction goes off too, so the whole thing sounds foreign no matter how good the style prompt is. That is my reading of the problem so far.

What I am trying is a small preprocessing step before Realtime: normalize the text, run RUAccent for stress + ё, then send the stress-marked text into Realtime and A/B test it against Common Voice Russian clips. If Realtime ignores the marks, the fallback is probably a hybrid: Realtime for the conversation and a dedicated Russian TTS for the audio. That would add roughly 300–600 ms of latency, but we would get proper phonetic control.
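A minimal sketch of that preprocessing step, under stated assumptions. It assumes the accentizer returns text with a `+` placed directly before each stressed vowel (my understanding of RUAccent's output, not confirmed in this thread) and converts that to a combining acute accent. `accentize` is a hypothetical stand-in for the real RUAccent call so the sketch runs without the model; `normalize_text`, `plus_to_acute`, and `fake_accentize` are illustrative names. Whether Realtime respects either mark format is exactly the open question being tested.

```python
import re
import unicodedata
from typing import Callable

RUSSIAN_VOWELS = "аеёиоуыэюяАЕЁИОУЫЭЮЯ"
COMBINING_ACUTE = "\u0301"


def normalize_text(text: str) -> str:
    """Light cleanup: NFC form, collapsed whitespace, no space before punctuation."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([?!.,;:])", r"\1", text)


def plus_to_acute(marked: str) -> str:
    """Turn '+<vowel>' into '<vowel>' followed by a combining acute accent."""

    def repl(match: re.Match) -> str:
        vowel = match.group(1)
        # A restored ё already signals the stress, so it gets no extra mark.
        return vowel if vowel in "ёЁ" else vowel + COMBINING_ACUTE

    return re.sub(rf"\+([{RUSSIAN_VOWELS}])", repl, marked)


def preprocess(text: str, accentize: Callable[[str], str]) -> str:
    """normalize -> accentize (stress + ё) -> convert marks."""
    return plus_to_acute(accentize(normalize_text(text)))


def fake_accentize(text: str) -> str:
    """Hypothetical stand-in for the real accentizer; handles one phrase only.

    With the real library this would be the RUAccent call
    (assumed to be something like accentizer.process_all(text)).
    """
    return text.replace("Мои места", "Мо+и мест+а")


if __name__ == "__main__":
    print(preprocess("  Мои   места ?", fake_accentize))
```

Keeping the accentizer behind a plain callable makes it easy to swap mark formats (plain `+`, acute accent, or none) between test arms without touching the rest of the pipeline.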

It is still early for me and this is not production-proven yet, but if you want to compare notes or throw a few real call phrases at the same test, I would be happy to share what I find. This feels like an open problem for everyone on Realtime + Russian right now.

Thanks, this matches what we are seeing too.

One thing we tested is a “voice mask” approach. It is not SSML and not a real custom voice. It is basically a long, structured instruction layer for the Realtime session that tries to keep the Russian voice more stable.

In our setup we use it in three layers:

1. A base voice/session mask, sent in `session.update`.
   It describes the desired Russian phone voice: female assistant, natural business phone style, short phrases, Russian-only delivery, slower articulation, warmer but not theatrical tone, and specific negative constraints against English-like melody/articulation.

2. A stronger opening mask for the first phrase.
   The first response is where the model often drifts most, so we add extra constraints for the greeting only. We do not reuse that full opening mask for all replies, because it contains “say only the greeting and wait” type rules.

3. A smaller “voice anchor” / safe response layer before normal replies.
   This repeats the most important voice/articulation constraints without carrying over the first-greeting logic.
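As a rough illustration of the three layers above, here is a sketch that only builds the JSON events, without opening a connection. The mask texts are short placeholders, not the production mask. The event shapes (`session.update` with `session.instructions`, `response.create` with `response.instructions`) follow the Realtime client events as I understand them, so check field names and any other required session fields against the current API reference. Repeating the base mask in each response is an assumption, made in case per-response instructions replace the session-level ones.

```python
import json

# Placeholder texts (hypothetical); the real mask is much longer.
BASE_MASK = "Speak only Russian. Natural business phone style, short phrases."
OPENING_MASK = "This is the greeting. Say only the greeting, then wait."
VOICE_ANCHOR = "Keep the same Russian articulation and pace as before."


def session_update_event(base_mask: str) -> dict:
    """Layer 1: base voice/session mask, sent once before the call."""
    return {"type": "session.update", "session": {"instructions": base_mask}}


def response_create_event(turn_index: int) -> dict:
    """Layer 2 for the first reply, layer 3 for every later reply."""
    layer = OPENING_MASK if turn_index == 0 else VOICE_ANCHOR
    # Assumption: the base mask is repeated so it still applies if
    # per-response instructions override the session-level ones.
    return {
        "type": "response.create",
        "response": {"instructions": f"{BASE_MASK}\n\n{layer}"},
    }


if __name__ == "__main__":
    for event in (
        session_update_event(BASE_MASK),
        response_create_event(0),
        response_create_event(1),
    ):
        print(json.dumps(event, ensure_ascii=False))
```

Selecting the layer by turn index keeps the "say only the greeting and wait" rules out of every reply after the first.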

What it does:
- makes the voice more consistent between turns;
- reduces some of the English-like intonation;
- helps the first phrase sound closer to the later phrases;
- keeps the assistant in short phone-style Russian;
- improves stability, but does not give deterministic phoneme-level pronunciation.

What it does not do:
- it does not provide real phoneme control;
- it does not replace SSML;
- it does not solve all Russian stress/vowel reduction issues;
- it is still prompt-based, so results vary.

Latency-wise, the base mask is applied before the call and reused, so it does not add much per-turn latency once the Realtime session is ready. We also do a silent prewarm: the model generates a tiny hidden phrase before the phone call is accepted, so the actual caller does not hear that warmup. Cold start/prewarm can take several seconds, but we hide it behind a ready gate. During the actual call, the mask itself is not the main latency source; the normal Realtime turn latency is still mostly transcription + response generation + audio output.
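The prewarm and ready gate can be sketched with plain `asyncio`, independent of any telephony or Realtime SDK. `ReadyGate`, `fake_warmup`, and the timeouts are hypothetical names and values for illustration; the real warmup would apply the mask and generate the hidden phrase.

```python
import asyncio
from typing import Awaitable, Callable


class ReadyGate:
    """Blocks call acceptance until the Realtime session has been prewarmed."""

    def __init__(self) -> None:
        # Create the gate inside a running event loop (see demo below).
        self._ready = asyncio.Event()

    async def prewarm(self, warmup: Callable[[], Awaitable[None]]) -> None:
        """Run the warmup (mask setup plus a tiny hidden phrase), then open the gate."""
        await warmup()
        self._ready.set()

    async def accept_call(self, timeout: float) -> bool:
        """Return True once the session is ready, False if the wait times out."""
        try:
            await asyncio.wait_for(self._ready.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def demo() -> tuple:
    gate = ReadyGate()

    async def fake_warmup() -> None:
        # Stand-in for the cold start; real prewarm can take several seconds.
        await asyncio.sleep(0.2)

    task = asyncio.create_task(gate.prewarm(fake_warmup))
    early = await gate.accept_call(timeout=0.01)  # session still warming up
    late = await gate.accept_call(timeout=2.0)    # waits until the gate opens
    await task
    return early, late


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The caller never hears the warmup audio because the phone leg is only answered after `accept_call` returns True.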

The way I am thinking about it: your mask stabilizes prosody and turn-to-turn consistency, and the stress preprocessing handles the vowel-reduction/lexical-stress errors that prompting cannot reach. They seem complementary rather than competing.

I am going to add the mask as another arm in the A/B test so I can measure baseline vs mask-only vs stress-only vs mask+stress on the same Common Voice phrases. Once I have numbers, I will share them here. Also, if you can drop 10 to 20 real call phrases where pronunciation breaks worst, I will run them through all four arms and post the comparison. It would be useful for both of us.

That makes sense. I agree that these two approaches are probably complementary: the mask helps with delivery/prosody/consistency, while stress + ё preprocessing may help with the Russian-specific vowel reduction and lexical stress issues.

Here is a sanitized set of production-like Russian phone phrases where we usually hear the biggest pronunciation drift. These are not private customer data, just typical call phrases from our domain:

  1. Добрый день, меня зовут Марина, я ассистент сервиса «Мои места».
  2. Подскажите, удобно сейчас коротко поговорить?
  3. Мы помогаем компаниям возвращать клиентов через бонусы и рекомендации.
  4. Это не реклама в обычном смысле, я просто уточню пару вопросов.
  5. У вас уже есть клиентская база или программа лояльности?
  6. Клиенты чаще возвращаются, когда у них есть понятный бонус за повторный визит.
  7. Мы можем показать, как это работает на примере вашей компании.
  8. Если сейчас неудобно, я могу перезвонить позже.
  9. Правильно понимаю, что вам интереснее возврат клиентов, а не просто новые отзывы?
  10. Скажите, пожалуйста, кто у вас обычно отвечает за маркетинг или работу с клиентами?
  11. Я не буду занимать много времени, буквально один вопрос.
  12. Мы не заменяем вашу CRM, а добавляем понятный слой лояльности.
  13. Можно отправить короткую информацию, чтобы вы посмотрели спокойно?
  14. Если тема неактуальна, я зафиксирую и больше не буду беспокоить.
  15. Хорошо, тогда спасибо за время, хорошего дня.

The words/parts that often expose the issue are: “Мои места”, “лояльности”, “рекомендации”, “возвращать клиентов”, “клиентская база”, “повторный визит”, “зафиксирую”, “неактуальна”, and the general telephone greeting rhythm.

For the stress-marked version, I would prefer not to add manual stress marks by hand, because it is easy to make mistakes. It would be better to run the same list through your RUAccent pipeline and compare:

  • baseline
  • mask-only
  • stress/ё-only
  • mask + stress/ё

If you post the comparison, especially on these phone-style phrases, it would be very useful. My expectation is that the mask will help the voice stay more consistent, while stress preprocessing may help the words sound less foreign.
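The four-arm comparison is a 2x2 design (mask on/off, stress on/off). A small sketch of how the test jobs could be laid out, assuming a hypothetical `accentize` callable for the stress/ё step and a single mask string; the actual harness used in this thread is not shown, so all names here are illustrative:

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List

ARM_NAMES = {
    (False, False): "baseline",
    (True, False): "mask-only",
    (False, True): "stress-only",
    (True, True): "mask+stress",
}


@dataclass(frozen=True)
class Arm:
    name: str
    use_mask: bool
    use_stress: bool


def build_arms() -> List[Arm]:
    """All four combinations of mask on/off and stress on/off."""
    return [Arm(ARM_NAMES[(m, s)], m, s) for m, s in product([False, True], repeat=2)]


def build_jobs(
    phrases: List[str],
    arms: List[Arm],
    accentize: Callable[[str], str],
    mask: str,
) -> List[Dict[str, str]]:
    """One job per (phrase, arm); the reference text always stays unmarked."""
    jobs = []
    for phrase in phrases:
        for arm in arms:
            jobs.append(
                {
                    "arm": arm.name,
                    "reference": phrase,  # used for scoring and listening notes
                    "text": accentize(phrase) if arm.use_stress else phrase,
                    "instructions": mask if arm.use_mask else "",
                }
            )
    return jobs


if __name__ == "__main__":
    demo_jobs = build_jobs(["Мои места"], build_arms(), lambda t: t, "MASK")
    for job in demo_jobs:
        print(job["arm"], "|", job["text"])
```

Keeping the unmarked reference next to each job means every arm is scored and listened to against the same target phrase.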

Hey Chirag,

thanks for joining in. Good to see you here.

I guess you couldn’t link this - since your account doesn’t allow it yet.

Thanks Jochen. Yes, link posting is restricted on my account for now. I am using Common Voice Russian as the eval corpus and running @SerFer55’s 15 phrases through RUAccent for the 4-way comparison (baseline / mask / stress / both). I will post results here once the run is done.

That does not sound Russian to me. Same for many other phrases in the set…

I just played a bit with gpt-4o-mini-tts on shimmer (speed 1.25):

Russian native speaker, in a dialog on the phone, tone friendly and helpful, fluent and natural, appropriate for a call dialog with a prospect. Speech respects dynamic stress, tonalities and russian language rhythm specifics. Samples are taken out of the conversation, so adjust accordingly.

  • У вас уже есть клиентская база или программа лояльности ?

Way better than the others, but the clauses still have English dynamics… I think a model that understands both the context and text-to-speech markup would be good. Not sure where to find one.

BTW, this one sounds more natural on the same settings:

А Вы пользуетесь каким-либо Ц-Р-М или программой бонусов?

@SerFer55 I have tested the 15 phone phrases from the thread. Setup: RUAccent + stress (automatic), gpt-realtime-1.5, voice alloy, GA Realtime, WER/CER via local Whisper against the reference text.

The observations were:

  • mask + stress together performed best
  • mask alone and stress alone had similar WER (~0.48–0.49); neither matched the combined arm
  • RUAccent stress alone did not stop paraphrasing, but the voice mask did
  • «Мои места» improved with the mask: ASR match went 0.25 → 0.62; stress-only stayed at 0.25
  • acoustic stress accuracy did not move much (~0.44 across arms)

Automated stress scoring is noisy and is not a substitute for listening tests. I wish I knew Russian.

This automated eval on 15 phrases suggests mask + RUAccent stress is the best combination for getting Realtime to read scripted lines faithfully (WER 1.08 → 0.05). The mask stops paraphrasing; stress helps on top. Stress scores and MOS barely moved, though, so ear tests on real calls are still needed. Still, it is worth piloting mask + stress together for scripted phone speech. Try it out!
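For readers wondering how a WER above 1 is possible: WER divides substitutions + deletions + insertions by the number of reference words, so a paraphrased reply with extra words can exceed 1.0. The sketch below uses the standard word-level edit distance; it is not necessarily the scoring script behind the numbers above. The normalization choices (lowercasing, ё → е, stripping stress marks and punctuation) are assumptions, and the example hypothesis is an invented paraphrase.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase, fold ё to е, drop stress marks and punctuation, split into words."""
    text = text.lower().replace("ё", "е")
    text = text.replace("\u0301", "").replace("+", "")
    return re.findall(r"\w+", text)


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,                           # deletion
                    cur[j - 1] + 1,                        # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of an ASR transcript against the scripted reference."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    scripted = "Я могу перезвонить позже."
    faithful = "я могу перезвонить позже"
    paraphrased = "конечно давайте тогда я вам наберу завтра утром если удобно"
    print(wer(scripted, faithful))     # faithful reading
    print(wer(scripted, paraphrased))  # long paraphrase, exceeds 1.0
```

This is why a drop from about 1.08 to 0.05 mostly reflects the model reading the script instead of rewording it, and says little about stress or vowel quality on its own.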

I am only allowed to post one media item per post, so the second table is here!

Need a Twilio account for testing?

I need someone who understands Russian. The issue is that I do not: I can make fixes in the pipeline, but someone needs to do the ear tests and share feedback.

Hi Chirag,

Yes, we can help with the Russian ear tests. We are native Russian speakers, so we can listen to the generated samples and give detailed feedback on stress, vowel reduction, English-like melody, consonants, and overall naturalness in real phone-style Russian.

What we would like to try now is your Russian normalization approach combined with our existing full Realtime voice mask.

Our current setup is:

  1. Full Realtime voice mask as the base session/style layer.
  2. Stronger opening mask for the first phrase only.
  3. Smaller safe voice anchor before normal replies.
  4. VoxImplant phone bridge for real call testing, so we do not need Twilio for our side right now.

Could you please share the exact preprocessing format/pipeline you used before sending text to Realtime?

Specifically:

  • how RUAccent output should be represented in the final text;
  • how stress marks should be written;
  • whether ё should always be restored;
  • whether punctuation should be changed to improve Russian prosody;
  • whether phrases should be simplified/rephrased before Realtime;
  • whether you found a format that Realtime respects better than plain stress marks.

If possible, could you run or show the normalized output for the same 15 Russian phone phrases from this thread? Then we can apply:

  • mask only;
  • stress/ё normalization only;
  • full mask + stress/ё normalization;

and do real Russian listening tests on our side.

We do not want to post the full production mask publicly, but we can describe its structure and test your normalization layer together with it. If you can share a small script, pseudocode, or input/output examples, we can integrate it into our phone assistant and report back with native-speaker feedback.

Thanks again. This looks like the most practical path so far: not replacing the mask, but adding Russian phrase normalization before speech generation.

Hey Chirag,

We have 3 native speakers in the Empire chat on Signal. Just post some phrases + mp3/wav there.

Sure @jochenschultz, I am sharing the .wav variation files in the Empire chat on Signal.

Hi Chirag,

Small note: we are not on Signal. We are available only on Telegram.

Could you please share the .wav variations there instead, or post a downloadable link here in the thread if that is easier?

We can then listen to the samples as native Russian speakers and give detailed feedback on which version sounds best: mask only, stress/ё only, or full mask + stress/ё normalization.