Realtime API: Poor Portuguese call quality with gpt-realtime-mini / gpt-realtime

Hi everyone,

I’d like to share an issue I’m experiencing in a real production scenario using the OpenAI Realtime API for phone calls in Portuguese.

I have tested different Realtime models, including gpt-realtime-mini and gpt-realtime, but the problem is very similar across them.

In Brazil, many users answer calls using speakerphone mode. This captures a lot of background noise from the environment and causes interference. As a result, the AI frequently starts talking by itself or responds when the user has not clearly finished speaking.

Another frequent issue is speech recognition accuracy when collecting data. When the user provides an address, name, or other important information, the AI often understands completely different words. Sometimes the words are not even similar: the person says “Apple”, but the AI understands “Oscar”.

Currently, I’m using Realtime directly via SIP, with transcription enabled, and I accept the call using these parameters:

'audio' => [
    'input' => [
        'format' => ['type' => 'audio/pcmu'],

        'transcription' => [
            'model' => 'gpt-4o-transcribe',
            'language' => 'pt',
            'prompt' => 'Transcribe with maximum fidelity. Proper names are critical. Do not correct names. If you’re unsure, keep exactly what you heard.'
        ],

        'turn_detection' => [
            'type' => 'server_vad',
            'threshold' => 0.7,
            'prefix_padding_ms' => 300,
            'silence_duration_ms' => 1100,
            'idle_timeout_ms' => 12000,

            'create_response' => true,
            'interrupt_response' => true,
        ],
    ],

    'output' => [
        'format' => ['type' => 'audio/pcmu'],
        'voice' => 'marin',
    ],
]

Has anyone had a better experience with calls in noisy environments, especially with non-English-speaking users on speakerphone?

Any recommendations for improving transcription accuracy, turn detection, VAD configuration, or reducing cases where the AI starts speaking by itself would be very welcome.

I already tested gpt-realtime-1.5 and gpt-realtime-2, but they are not acceptable for my use case right now because they are too expensive. Also, in my tests, the same issues still happened with them.

For the noise and background interference, I would consider adding a filter (band-pass filter, spectral denoising, noise reduction, etc.). There are many kinds of filters, so try a few to see which one fits your case best. Note that the API also supports a noise_reduction setting as a parameter. See Realtime transcription in the docs.
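As a rough illustration of the band-pass idea, here is a minimal pure-Python sketch that cascades a second-order high-pass and low-pass filter around the usual telephone voice band. The 300 Hz and 3400 Hz corners, the 8 kHz sample rate, and every function name are assumptions for illustration; the samples are expected to be linear PCM floats, so G.711 audio would need decoding first. A production setup would more likely use an existing DSP library or the filtering built into the PBX.

```python
import math

def biquad_coeffs(kind, cutoff_hz, sample_rate, q=0.7071):
    """Normalized (b0, b1, b2, a1, a2) for a 2nd-order low-pass or high-pass."""
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2.0 * q)
    if kind == "highpass":
        b0, b1, b2 = (1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2
    elif kind == "lowpass":
        b0, b1, b2 = (1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2
    else:
        raise ValueError(f"unknown filter kind: {kind}")
    a0, a1, a2 = 1 + alpha, -2 * cos_w0, 1 - alpha
    return (b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0)

def biquad(samples, coeffs):
    """Run one biquad section (direct form I) over a list of float samples."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def voice_bandpass(samples, sample_rate=8000, low_hz=300.0, high_hz=3400.0):
    """Keep roughly the telephone voice band; attenuate hum and rumble below it."""
    high_passed = biquad(samples, biquad_coeffs("highpass", low_hz, sample_rate))
    return biquad(high_passed, biquad_coeffs("lowpass", high_hz, sample_rate))

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def tone(freq_hz, sample_rate=8000, seconds=1.0):
    """Synthetic sine wave, used here only to check the filter's behaviour."""
    count = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(count)]
```

Feeding `tone(50.0)` and `tone(1000.0)` through `voice_bandpass` and comparing `rms` before and after is a quick way to confirm that low-frequency hum is reduced while mid-band speech energy passes through. A filter this simple will not remove voices or music in the background, which is where spectral denoising or the API's own noise reduction comes in.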

The recognition accuracy may be related to the noise. I would try the filters first. If that doesn’t resolve it, try other models, although I would say the gpt realtime models are already good. I have some projects using Gemini speech-to-text and the Live API, and they’re good too.

Another approach is to add a second pass that takes the transcribed text and improves it, but it is probably not worth it for real-time cases because of the added delay.
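When the expected values come from a closed set (for example, items on a menu), a cheaper variant of that second pass is a local fuzzy match, which adds almost no delay. Note that this swaps the second model turn for plain string matching with Python's standard `difflib`; the `MENU` list, the function names, and the 0.8 cutoff are hypothetical, and it will not help with open-ended values such as arbitrary personal names.

```python
import difflib
import unicodedata

MENU = ["calabresa", "marguerita", "portuguesa", "quatro queijos"]  # hypothetical

def normalize(text):
    """Lowercase and strip accents so 'Márcio' and 'marcio' compare equal."""
    decomposed = unicodedata.normalize("NFD", text.lower().strip())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def correct_term(heard, vocabulary, cutoff=0.8):
    """Return the closest vocabulary entry, or None if nothing is close enough."""
    by_normalized = {normalize(entry): entry for entry in vocabulary}
    matches = difflib.get_close_matches(
        normalize(heard), list(by_normalized), n=1, cutoff=cutoff
    )
    return by_normalized[matches[0]] if matches else None
```

With this list, `correct_term("calabreza", MENU)` returns `"calabresa"`, while an unrelated word such as `"Oscar"` returns `None`, which the application can treat as a signal to ask the caller again rather than guess.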

Thanks for laying this out so clearly @leandro-ligmee, and good suggestion from @rafa3 on filtering/noise reduction.

This sounds less like one single model issue and more like the usual phone-call stack problem: speakerphone + background noise + μ-law audio + VAD deciding “that was enough speech” too early.

A few things I’d try before changing models:

  • Enable the Realtime input noise reduction option if you are not already using it.
  • Pre-process audio before SIP/Realtime if possible: band-pass for voice, noise suppression, AGC, echo cancellation.
  • Raise silence_duration_ms a bit more for Portuguese phone calls, since users may pause mid-address or mid-name.
  • Consider setting create_response: false and manually creating the response only after you’re confident the user finished. That can reduce “AI starts talking by itself” cases.
  • For addresses/names, don’t rely on one pass. Ask for confirmation: “Did you say Rua X?” or collect critical fields twice in a structured way.
  • Add domain hints in the transcription prompt, like expected city names, street formats, common Brazilian names, etc. The generic “maximum fidelity” instruction may not be enough.
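The create_response: false idea from the list above could look roughly like the sketch below. It assumes the server still emits a `conversation.item.input_audio_transcription.completed` event carrying a `transcript` field when transcription is enabled, and that the client triggers the reply with a `response.create` event; the `MIN_CHARS` threshold and the function names are made up for illustration. Waiting for the transcript adds some latency to every turn, so this is a trade-off rather than a free fix.

```python
MIN_CHARS = 2  # hypothetical threshold: ignore empty or one-character transcripts

TRANSCRIPT_DONE = "conversation.item.input_audio_transcription.completed"

def turn_detection_config():
    """Same values as in the thread, but with automatic responses switched off."""
    return {
        "type": "server_vad",
        "threshold": 0.7,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 1100,
        "create_response": False,
        "interrupt_response": True,
    }

def events_to_send(server_event):
    """Return the client events to send in reaction to one server event.

    A response is requested only after a non-trivial transcript arrives, so a
    burst of background noise that yields no words does not make the AI speak.
    """
    if server_event.get("type") != TRANSCRIPT_DONE:
        return []
    transcript = (server_event.get("transcript") or "").strip()
    if len(transcript) < MIN_CHARS:
        return []  # probably noise: stay silent and keep listening
    return [{"type": "response.create"}]
```

Each dict returned by `events_to_send` would be serialized and sent over whatever control channel the integration already uses for the call. The length check is deliberately crude; a real gate might also look at whether the transcript is in the expected language or matches the current question.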

I agree with @rafa3 that noise is probably the first thing to attack. If the mic input is messy, the transcription and VAD will both behave worse, even with better models.

Would be useful to know whether you’re seeing more false starts during silence/background noise, or mostly while the user is still speaking. Those usually need slightly different fixes.

-Mark G.

Hi @rafa3, thank you for your reply.

I tried your recommendation of setting noise_reduction to far_field. It seems to improve the handling of background noise a little, but the call quality is still not good enough in some of my scenarios, such as taking pizza orders.

I have also tested gpt-realtime-mini-2025-10-06, gpt-realtime-1.5, and gpt-realtime-2, but I experienced the same issues. In fact, in my Portuguese-speaking use case, those models performed even worse than gpt-realtime-mini / gpt-realtime.

During the call, the voice sometimes changes unexpectedly, almost as if another person is speaking. This happens intermittently and makes the experience feel unstable.

I am still looking for possible solutions, such as improving the prompt or adjusting other parameters.

Thank you again for your help.

Hi everyone,

After using and observing the behavior of the latest changes for a few days, here is what I noticed.

Using noise_reduction as far_field did improve the call quality somewhat, but it is still not good enough. In many cases, the AI still misunderstands names, numbers, and what the customer says, even after the customer repeats it two or three times. This creates a poor user experience.

I attached a call example here: https://youtu.be/Qo4vQNA0wQI

In this call, you can hear the user saying his name, “Marcio”, but the AI understood it as “Lucas”. In the same call, the user says he would like to pay by credit card, but the AI understood it as cash/money and got stuck in a loop asking the same question again. There were several mistakes in the same call.

Model used in this example: gpt-realtime-mini-2025-10-06

Audio input parameters:

'input' => [
    'format' => [
        'type' => 'audio/pcmu'
    ],
    'noise_reduction' => [
        'type' => 'far_field',
    ],
    'transcription' => [
        'model' => 'gpt-4o-transcribe',
        'language' => 'pt',
        'prompt' => 'Transcribe with maximum fidelity. Proper names are critical. Do not correct names. If you’re unsure, keep exactly what you heard.'
    ],
    'turn_detection' => [
        'type' => 'server_vad',
        'threshold' => 0.7,
        'prefix_padding_ms' => 300,
        'silence_duration_ms' => 1100,
        'idle_timeout_ms' => 12000,
        'create_response' => true,
        'interrupt_response' => true,
    ],
],

I would appreciate any guidance or suggestions on how to improve Portuguese call quality, especially for proper names, numbers, and payment methods.

Thanks for sharing the example and configuration details, @leandro-ligmee.

A few areas that may be worth testing:

  • Audio quality: You're currently using audio/pcmu (8 kHz telephony audio), which can make names, numbers, and payment methods harder to recognize. If your setup allows it, testing higher-quality audio (such as 16 kHz PCM) may help.
  • Noise reduction: far_field can be useful in some environments, but phone calls are often closer to a near_field use case. It could be worth comparing near_field, far_field, and no noise reduction to see which performs best.
  • VAD settings: A slightly lower threshold (for example, 0.5-0.6) may help avoid clipping softer speech.
  • Confirmation of critical details: For names, numbers, addresses, or payment methods, adding a confirmation step can help catch recognition errors before they affect downstream actions.
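To compare those settings systematically rather than one call at a time, a small helper can generate every combination to test. This is only a sketch: the base values are copied from the configuration posted in this thread, the candidate lists and function name are assumptions, and it assumes that setting noise_reduction to null (None here) turns it off.

```python
import copy
import itertools

BASE_INPUT = {
    "format": {"type": "audio/pcmu"},
    "noise_reduction": {"type": "far_field"},
    "turn_detection": {
        "type": "server_vad",
        "threshold": 0.7,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 1100,
    },
}

NOISE_REDUCTION = [None, "near_field", "far_field"]  # None = disabled (assumed)
THRESHOLDS = [0.5, 0.6, 0.7]

def build_variants(base_input):
    """Return one input config per (noise reduction, VAD threshold) pair."""
    variants = []
    for mode, threshold in itertools.product(NOISE_REDUCTION, THRESHOLDS):
        config = copy.deepcopy(base_input)  # never mutate the shared base
        config["noise_reduction"] = None if mode is None else {"type": mode}
        config["turn_detection"]["threshold"] = threshold
        variants.append(config)
    return variants
```

`build_variants(BASE_INPUT)` yields nine configurations. Replaying the same set of recorded calls against each one, and counting false starts and misheard fields per variant, gives a fairer comparison than impressions from live traffic.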

Your transcription prompt already emphasizes accuracy, so the biggest gains may come from the audio pipeline rather than prompt adjustments alone.

Interesting example, especially for Portuguese telephony workflows. It would be useful to know whether the transcription issues are consistent across calls or concentrated around specific terms such as names and payment methods.

-Mark G.

Thanks for the suggestions.

I did some additional testing on the SIP/RTP side with Asterisk/PJSIP, and it looks like increasing the audio quality is not currently practical in this setup.

The OpenAI SIP endpoint accepted G.711 only:

  • PCMU/8000 → accepted, call completed

  • PCMA/8000 → accepted, call completed

  • G722/8000 → rejected with 400 Bad Request

  • L16/16000 → rejected with 400 Bad Request

  • L16/24000 → rejected with 400 Bad Request

For example, this was rejected:

m=audio 15822 RTP/SAVP 123 101
a=rtpmap:123 L16/24000
a=rtpmap:101 telephone-event/8000
a=ptime:20
a=sendrecv

This was accepted:

m=audio 41544 RTP/SAVP 8 101
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=ptime:20
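For anyone repeating these tests, a small parser can classify an SDP offer before it is sent. The sketch below only reads a=rtpmap lines (static payload types such as 0 or 8 offered without an rtpmap line are not handled), and the function names are hypothetical. It does not encode any documented rule of the OpenAI SIP endpoint; it just checks whether an offer matches the G.711-only pattern that was accepted in the tests above.

```python
G711 = {"PCMU", "PCMA"}

def offered_codecs(sdp):
    """Return [(payload_type, encoding_name, clock_rate)] from a=rtpmap lines."""
    codecs = []
    for line in sdp.splitlines():
        line = line.strip()
        if not line.startswith("a=rtpmap:"):
            continue
        payload, _, encoding = line[len("a=rtpmap:"):].partition(" ")
        name, _, rest = encoding.partition("/")
        clock_rate = int(rest.split("/")[0])  # drop optional channel count
        codecs.append((int(payload), name.upper(), clock_rate))
    return codecs

def is_g711_only(sdp):
    """True when every audio codec (ignoring DTMF events) is 8 kHz PCMU/PCMA."""
    audio = [c for c in offered_codecs(sdp) if c[1] != "TELEPHONE-EVENT"]
    return bool(audio) and all(
        name in G711 and rate == 8000 for _, name, rate in audio
    )

REJECTED_OFFER = """m=audio 15822 RTP/SAVP 123 101
a=rtpmap:123 L16/24000
a=rtpmap:101 telephone-event/8000
a=ptime:20
a=sendrecv"""

ACCEPTED_OFFER = """m=audio 41544 RTP/SAVP 8 101
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=ptime:20"""
```

Here `is_g711_only(ACCEPTED_OFFER)` is `True` and `is_g711_only(REJECTED_OFFER)` is `False`, matching the two offers shown above.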

I also tried setting the calls.accept configuration to audio/pcm with rate: 24000, but when the SIP endpoint is configured with ulaw or alaw, the actual negotiated media remains G.711:

NativeFormats: (ulaw)
ReadFormat: ulaw
WriteFormat: ulaw

So it seems that audio/pcm in calls.accept does not make the SIP/RTP leg accept or negotiate L16/24000. At least in my tests, the SIP endpoint only works with G.711 (PCMU/PCMA) at 8 kHz.

For PSTN SIP trunks this is also a practical limitation, because all carriers I work with provide only 8 kHz codecs such as PCMU/PCMA. Even in the WhatsApp Calling SIP scenario, where Opus may be available on the Meta side, the audio would still be transcoded down to G.711 before reaching OpenAI if the OpenAI SIP leg only accepts PCMU/PCMA.

So, unless there is a specific SDP format required for PCM over SIP, or some other supported wideband codec on the OpenAI SIP endpoint, increasing the audio sample rate is not currently feasible with direct SIP integration.

It would be useful to clarify whether audio/pcm in calls.accept is expected to apply to SIP/RTP codec negotiation, or only to non-SIP Realtime media flows.

The issue is not equally distributed across all speech. General conversation is often understandable, but the most problematic parts are critical short entities: names, numbers, addresses, and payment methods. This is especially problematic because those are exactly the fields that need high accuracy in telephony workflows.

In Portuguese phone calls, the model can follow the overall intent, but it frequently mishears proper names or short payment-related terms, even when the user speaks naturally. That makes the workflow risky unless we add explicit confirmation steps.

Thanks for laying all this out, @leandro-ligmee. This one’s definitely tricky because there are a few problems overlapping at once: noisy speakerphone audio, VAD false starts, Portuguese name/number recognition, and the lower audio quality that comes with SIP/G.711.

A few things I’d try next:

  1. Compare near_field, far_field, and no noise reduction. Phone calls are often closer to a near-field setup, so near_field may behave better than far_field.
  2. Add audio cleanup before sending it to the API if you can: echo cancellation, gain control, band-pass filtering, and noise suppression can make a big difference for VAD.
  3. Tune turn detection more conservatively. A longer silence_duration_ms can help avoid cutting off Portuguese speech mid-pause. A slightly lower threshold may help with softer speech, but it also makes the VAD more sensitive to background noise, so test the two changes separately.
  4. For names, phone numbers, and payment methods, use a stricter confirmation flow. For example, collect one field, repeat it back clearly, and only continue after the user confirms. For names or numbers, spelling things out character by character can help a lot.
  5. For payment, closed questions usually work better: “cartão de crédito ou dinheiro?” Then confirm the answer before moving on.
  6. Since SIP is limited here to 8 kHz G.711, it’s also worth testing the same flow over WebRTC or WebSocket with higher-quality PCM audio. That should help separate “model/prompt issue” from “audio transport issue.”

The biggest practical change is probably combining cleaner audio input with a more structured conversation flow. Don’t let the assistant guess when the audio is unclear. Have it ask “pode repetir?” instead, especially for names, numbers, and payments.
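One way to enforce the "don't guess" rule for the payment question is to map the transcript onto a closed set in application code and fall back to “pode repetir?” whenever the match is missing or ambiguous. The sketch below is illustrative only: the keyword table, labels, prompts, and function names are assumptions, and a real flow would pass the chosen prompt back to the assistant as its next instruction.

```python
import re
import unicodedata

# Hypothetical keyword table for the two options in the closed question.
PAYMENT_KEYWORDS = {
    "credito": "credit_card",
    "cartao": "credit_card",
    "dinheiro": "cash",
}

def strip_accents(text):
    """Lowercase and remove diacritics: 'Cartão de Crédito' -> 'cartao de credito'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def parse_payment(transcript):
    """Return 'credit_card' or 'cash', or None if unclear or contradictory."""
    words = set(re.findall(r"[a-z]+", strip_accents(transcript)))
    found = {label for keyword, label in PAYMENT_KEYWORDS.items() if keyword in words}
    return found.pop() if len(found) == 1 else None

def next_prompt(transcript):
    """Pick what the assistant should say next for the payment step."""
    choice = parse_payment(transcript)
    if choice is None:
        return "Pode repetir? Cartão de crédito ou dinheiro?"
    label = "cartão de crédito" if choice == "credit_card" else "dinheiro"
    return f"Você disse {label}, correto?"
```

A transcript such as "No cartão de crédito, por favor" maps to `credit_card`, while "Lucas" or an answer naming both options maps to `None` and triggers the repeat prompt instead of a guess. Matching whole words keeps the check strict, at the cost of missing answers the keyword table does not anticipate.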

-Mark G.

Thanks, Mark.

I did some additional testing based on your suggestions.

Regarding noise reduction, I tested near_field, far_field, and no noise reduction. In my scenario, near_field works reasonably well when the customer is speaking directly into the phone, but around 20% of our customers use speakerphone. For those cases, far_field seems to behave better, so for now it looks like the best compromise for our customer base.

I also tried applying audio cleanup/filtering on the SIP/Asterisk side before sending the call to OpenAI. I enabled filtering on the SIP channel, but honestly I did not notice a meaningful difference in the final behavior. It may help a little in some cases, but it does not seem to solve the main issue.

About higher-quality audio: I checked the calls.accept documentation (Accept call | OpenAI API Reference) and saw that audio/pcm at 24 kHz is available. However, in my SIP tests I could not get this working over the SIP/RTP leg. With Asterisk/PJSIP:

  • PCMU/8000 was accepted

  • PCMA/8000 was accepted

  • G722/8000 was rejected with 400 Bad Request

  • L16/16000 was rejected with 400 Bad Request

  • L16/24000 was rejected with 400 Bad Request

So, even if I configure calls.accept with audio/pcm at 24 kHz, the actual SIP channel still negotiates G.711 when the Asterisk endpoint is configured with ulaw or alaw. In my case, I cannot use WebRTC for this integration, so I am limited to SIP.

Regarding confirmation flows: I already have confirmation steps for names, addresses, payment methods, and other critical fields. The assistant repeats the collected information and asks the user to confirm. This helps, but for use cases that require high accuracy, such as taking pizza orders by phone, it is still not enough. The general conversation often works, but short critical entities such as names, numbers, addresses, and payment methods are still risky.

After many changes and tests, my impression is that the adjustments help and the quality improves somewhat, but with SIP/G.711 in Portuguese, the reliability is still not at the level I would need for accurate order-taking workflows.

I would be interested to hear from other people using the Realtime API over SIP in languages other than English. How is it working for you? Are you seeing similar issues with names, numbers, and structured data collection?