Realtime API: Poor Portuguese call quality with gpt-realtime-mini / gpt-realtime

Hi everyone,

I’d like to share an issue I’m experiencing in a real production scenario using the OpenAI Realtime API for phone calls in Portuguese.

I have tested different Realtime models, including gpt-realtime-mini and gpt-realtime, but the problem is very similar across them.

In Brazil, many users answer calls using speakerphone mode. This captures a lot of background noise from the environment and causes interference. As a result, the AI frequently starts talking by itself or responds when the user has not clearly finished speaking.

Another frequent issue is speech recognition accuracy when collecting data. For example, when the user provides an address, name, or other important information, the AI often understands completely different words. Sometimes the words are not even similar. For example, the person says “Apple”, but the AI understands “Oscar”.

Currently, I’m using Realtime directly via SIP, with transcription enabled, and I accept the call using these parameters:

'audio' => [
    'input' => [
        'format' => ['type' => 'audio/pcmu'],

        'transcription' => [
            'model' => 'gpt-4o-transcribe',
            'language' => 'pt',
            'prompt' => 'Transcribe with maximum fidelity. Proper names are critical. Do not correct names. If you’re unsure, keep exactly what you heard.'
        ],

        'turn_detection' => [
            'type' => 'server_vad',
            'threshold' => 0.7,
            'prefix_padding_ms' => 300,
            'silence_duration_ms' => 1100,
            'idle_timeout_ms' => 12000,

            'create_response' => true,
            'interrupt_response' => true,
        ],
    ],

    'output' => [
        'format' => ['type' => 'audio/pcmu'],
        'voice' => 'marin',
    ],
]

Has anyone had a better experience with calls in noisy environments, especially with non-English-speaking users on speakerphone?

Any recommendations for improving transcription accuracy, turn detection, VAD configuration, or reducing cases where the AI starts speaking by itself would be very welcome.

I already tested gpt-realtime-1.5 and gpt-realtime-2, but they are not acceptable for my use case right now because they are too expensive. Also, in my tests, the same issues still happened with them.

For the noise and background interference, I would consider adding a filter (band-pass, spectral denoising, noise suppression, etc.). There are many kinds of filters, so try a few and see which one fits your case best. It is also worth knowing that the API supports a noise_reduction parameter; see the Realtime transcription documentation.
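To make the band-pass idea concrete, here is a minimal sketch of a single biquad band-pass filter in plain Python. It assumes you can get at decoded samples as floats at 8 kHz somewhere before the audio reaches the Realtime leg; that access point, the function names, and the center/Q values are all illustrative assumptions, not something the API provides. G.711 audio is already band-limited, so whether this helps on a given line is something to measure rather than assume.

```python
import math


def bandpass_coefficients(center_hz, q, sample_rate):
    """Normalized biquad band-pass coefficients (unity gain at center_hz)."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def bandpass(samples, center_hz=1000.0, q=0.33, sample_rate=8000):
    """Filter a list of float samples; the defaults give a wide passband
    around the telephone voice band (illustrative values, tune by ear)."""
    (b0, b1, b2), (a1, a2) = bandpass_coefficients(center_hz, q, sample_rate)
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        # Direct form I difference equation.
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def rms(samples):
    """Root-mean-square level, used here only to compare input and output."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def tone(freq_hz, sample_rate=8000, seconds=1.0):
    """Synthetic sine tone for quick experiments."""
    n = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]
```

Comparing `rms(bandpass(tone(50.0)))` with `rms(bandpass(tone(1000.0)))` shows low-frequency hum being attenuated while a mid-band tone passes. In production a dedicated noise suppressor or DSP library is likely a better fit than a hand-rolled filter.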

As for recognition accuracy, it may be related to the noise. I would try the filters first. If that doesn’t resolve it, try other models. That said, I would say the gpt-realtime models are already good. I have some projects using Gemini speech-to-text and the Live API, and they are good too.

Another approach is to add a second pass that takes the transcribed text and corrects it, but it is probably not worthwhile for real-time use because of the added delay.
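A cheaper variant of the second-pass idea, with essentially no added delay, is to snap the transcript to a closed vocabulary instead of sending it through another model turn. The sketch below uses fuzzy string matching from Python's standard `difflib`, which is a swapped-in technique and not what was described above. It only suits fields with a known set of values (payment methods, menu items), not open-ended names. The `PAYMENT_METHODS` list and the cutoff are made-up examples.

```python
import difflib
import unicodedata

# Hypothetical closed vocabulary for one field of the call flow.
PAYMENT_METHODS = ["cartão de crédito", "cartão de débito", "dinheiro", "pix"]


def strip_accents(text):
    """Lowercase and drop combining marks so 'crédito' matches 'credito'."""
    decomposed = unicodedata.normalize("NFD", text.lower().strip())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def snap_to_vocabulary(heard, vocabulary, cutoff=0.75):
    """Return the closest vocabulary entry, or None if nothing is close.

    None is the signal to ask the caller again instead of guessing.
    """
    keys = {strip_accents(entry): entry for entry in vocabulary}
    matches = difflib.get_close_matches(
        strip_accents(heard), list(keys), n=1, cutoff=cutoff
    )
    return keys[matches[0]] if matches else None
```

A transcript that is far from every option returns `None`, which can be used to trigger a re-ask rather than letting a wrong value flow into the order.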

Thanks for laying this out so clearly @leandro-ligmee, and good suggestion from @rafa3 on filtering/noise reduction.

This sounds less like one single model issue and more like the usual phone-call stack problem: speakerphone + background noise + μ-law audio + VAD deciding “that was enough speech” too early.

A few things I’d try before changing models:

  • Enable the Realtime input noise reduction option if you are not already using it.
  • Pre-process audio before SIP/Realtime if possible: band-pass for voice, noise suppression, AGC, echo cancellation.
  • Raise silence_duration_ms a bit more for Portuguese phone calls, since users may pause mid-address or mid-name.
  • Consider setting create_response: false and manually creating the response only after you’re confident the user finished. That can reduce “AI starts talking by itself” cases.
  • For addresses/names, don’t rely on one pass. Ask for confirmation: “Did you say Rua X?” or collect critical fields twice in a structured way.
  • Add domain hints in the transcription prompt, like expected city names, street formats, common Brazilian names, etc. The generic “maximum fidelity” instruction may not be enough.
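For the create_response: false suggestion in the list above, a rough sketch of the decision logic could look like the following. It assumes turn_detection still commits the audio, that server events reach your application as parsed JSON (for example over a sideband connection to the call), and that the "is this real speech" check is a placeholder heuristic of my own. Waiting for the transcription event before responding adds some latency, so treat this as something to test, not a drop-in fix.

```python
# Server event emitted when input transcription for a user turn finishes.
TRANSCRIPT_DONE = "conversation.item.input_audio_transcription.completed"


def looks_like_speech(transcript):
    """Placeholder heuristic: require at least one letter or digit."""
    return any(ch.isalnum() for ch in transcript)


def client_events_for(server_event):
    """Map one parsed server event to the client events to send back.

    Returns an empty list when the model should stay silent, for example
    when the turn was triggered by noise and the transcript is empty.
    """
    if server_event.get("type") != TRANSCRIPT_DONE:
        return []
    transcript = (server_event.get("transcript") or "").strip()
    if not looks_like_speech(transcript):
        return []
    # Explicitly ask the model to respond now.
    return [{"type": "response.create"}]
```

Each returned dict would be serialized with `json.dumps` and sent on whatever connection you use for client events. A stricter gate (minimum length, a short debounce timer) may be needed if noise still produces plausible-looking transcripts.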

I agree with @rafa3 that noise is probably the first thing to attack. If the mic input is messy, the transcription and VAD will both behave worse, even with better models.

Would be useful to know whether you’re seeing more false starts during silence/background noise, or mostly while the user is still speaking. Those usually need slightly different fixes.

-Mark G.

Hi @rafa3, thank you for your reply.

I tried your recommendation of setting noise_reduction to far_field. It seems to improve the handling of background noise a little, but the call quality is still not good enough in some of my scenarios, such as ordering pizza.

I have also tested gpt-realtime-mini-2025-10-06, gpt-realtime-1.5, and gpt-realtime-2, but I experienced the same issues. In fact, in my Portuguese-speaking use case, those models performed even worse than gpt-realtime-mini / gpt-realtime.

During the call, the voice sometimes changes unexpectedly, almost as if another person is speaking. This happens intermittently and makes the experience feel unstable.

I am still looking for possible solutions, such as improving the prompt or adjusting other parameters.

Thank you again for your help.

Hi everyone,

After using and observing the behavior of the latest changes for a few days, here is what I noticed.

Using noise_reduction as far_field did improve the call quality somewhat, but it is still not good enough. In many cases, the AI still misunderstands names, numbers, and what the customer says, even after the customer repeats it two or three times. This creates a poor user experience.

I attached a call example here: https://youtu.be/Qo4vQNA0wQI

In this call, you can hear the user saying his name, “Marcio”, but the AI understood it as “Lucas”. In the same call, the user says he would like to pay by credit card, but the AI understood it as cash/money and got stuck in a loop asking the same question again. There were several mistakes in the same call.

Model used in this example: gpt-realtime-mini-2025-10-06

Audio input parameters:

'input' => [
    'format' => [
        'type' => 'audio/pcmu'
    ],
    'noise_reduction' => [
        'type' => 'far_field',
    ],
    'transcription' => [
        'model' => 'gpt-4o-transcribe',
        'language' => 'pt',
        'prompt' => 'Transcribe with maximum fidelity. Proper names are critical. Do not correct names. If you’re unsure, keep exactly what you heard.'
    ],
    'turn_detection' => [
        'type' => 'server_vad',
        'threshold' => 0.7,
        'prefix_padding_ms' => 300,
        'silence_duration_ms' => 1100,
        'idle_timeout_ms' => 12000,
        'create_response' => true,
        'interrupt_response' => true,
    ],
],

I would appreciate any guidance or suggestions on how to improve Portuguese call quality, especially for proper names, numbers, and payment methods.

Thanks for sharing the example and configuration details, @leandro-ligmee.

A few areas that may be worth testing:

  • Audio quality: You're currently using audio/pcmu (8 kHz telephony audio), which can make names, numbers, and payment methods harder to recognize. If your setup allows it, testing higher-quality audio (such as audio/pcm at 24 kHz) may help.
  • Noise reduction: far_field can be useful in some environments, but phone calls are often closer to a near_field use case. It could be worth comparing near_field, far_field, and no noise reduction to see which performs best.
  • VAD settings: A slightly lower threshold (for example, 0.5-0.6) may help avoid clipping softer speech.
  • Confirmation of critical details: For names, numbers, addresses, or payment methods, adding a confirmation step can help catch recognition errors before they affect downstream actions.
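To run the noise reduction and VAD comparison from the list above systematically, one option is to generate every combination up front and tag each test call with a label. This is only a sketch: the key names are copied from the configuration posted earlier in the thread, the threshold values are examples, and using `None` (JSON null) to disable noise reduction reflects my reading of the API reference.

```python
import itertools

NOISE_REDUCTION_TYPES = ["near_field", "far_field", None]  # None = disabled
VAD_THRESHOLDS = [0.5, 0.6, 0.7]  # example values to compare


def build_input_config(noise_type, threshold):
    """Build one audio.input config variant as a plain dict."""
    return {
        "format": {"type": "audio/pcmu"},
        "noise_reduction": {"type": noise_type} if noise_type else None,
        "transcription": {"model": "gpt-4o-transcribe", "language": "pt"},
        "turn_detection": {
            "type": "server_vad",
            "threshold": threshold,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 1100,
            "create_response": True,
            "interrupt_response": True,
        },
    }


def build_test_matrix():
    """Return labeled config variants covering every combination."""
    return [
        {"label": f"nr={nt or 'off'}_vad={th}", "input": build_input_config(nt, th)}
        for nt, th in itertools.product(NOISE_REDUCTION_TYPES, VAD_THRESHOLDS)
    ]
```

Logging the label next to each call recording makes it easier to compare misrecognized names and false starts per variant instead of relying on impressions.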

Your transcription prompt already emphasizes accuracy, so the biggest gains may come from the audio pipeline rather than prompt adjustments alone.

Interesting example, especially for Portuguese telephony workflows. It would be useful to know whether the transcription issues are consistent across calls or concentrated around specific terms such as names and payment methods.

-Mark G.

Thanks for the suggestions.

I did some additional testing on the SIP/RTP side with Asterisk/PJSIP, and it looks like increasing the audio quality is not currently practical in this setup.

The OpenAI SIP endpoint accepted G.711 only:

  • PCMU/8000 → accepted, call completed
  • PCMA/8000 → accepted, call completed
  • G722/8000 → rejected with 400 Bad Request
  • L16/16000 → rejected with 400 Bad Request
  • L16/24000 → rejected with 400 Bad Request

For example, this was rejected:

m=audio 15822 RTP/SAVP 123 101
a=rtpmap:123 L16/24000
a=rtpmap:101 telephone-event/8000
a=ptime:20
a=sendrecv

This was accepted:

m=audio 41544 RTP/SAVP 8 101
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=ptime:20

I also tried setting the calls.accept configuration to audio/pcm with rate: 24000, but when the SIP endpoint is configured with ulaw or alaw, the actual negotiated media remains G.711:

NativeFormats: (ulaw)
ReadFormat: ulaw
WriteFormat: ulaw

So it seems that audio/pcm in calls.accept does not make the SIP/RTP leg accept or negotiate L16/24000. At least in my tests, the SIP endpoint only works with G.711 (PCMU/PCMA) at 8 kHz.
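For anyone reproducing these tests, a small helper can list the codecs an SDP offer actually contains and flag whether it stays within G.711. This is a sketch based only on the behavior observed above, not on a documented codec list for the OpenAI SIP endpoint. The static payload table covers just the three entries relevant here, and the function names are my own.

```python
# Static RTP payload types that may appear without an a=rtpmap line.
STATIC_PAYLOADS = {0: ("PCMU", 8000), 8: ("PCMA", 8000), 9: ("G722", 8000)}
G711 = {("PCMU", 8000), ("PCMA", 8000)}


def offered_audio_codecs(sdp):
    """Return (name, clock_rate) for each payload type on the m=audio line."""
    payload_types = []
    rtpmap = {}
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m=audio"):
            # m=audio <port> <proto> <pt> <pt> ...
            payload_types = [int(pt) for pt in line.split()[3:]]
        elif line.startswith("a=rtpmap:"):
            pt, encoding = line[len("a=rtpmap:"):].split(None, 1)
            parts = encoding.split("/")
            rtpmap[int(pt)] = (parts[0], int(parts[1]))
    codecs = []
    for pt in payload_types:
        codec = rtpmap.get(pt) or STATIC_PAYLOADS.get(pt)
        if codec:
            codecs.append(codec)
    return codecs


def is_g711_only(sdp):
    """True if every speech codec offered is PCMU or PCMA at 8 kHz."""
    speech = [
        (name.upper(), rate)
        for name, rate in offered_audio_codecs(sdp)
        if name.lower() != "telephone-event"
    ]
    return bool(speech) and all(codec in G711 for codec in speech)
```

Running it against the two offers above returns `False` for the L16/24000 offer and `True` for the PCMA/8000 offer, which matches what the endpoint rejected and accepted in these tests.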

For PSTN SIP trunks this is also a practical limitation, because all carriers I work with provide only 8 kHz codecs such as PCMU/PCMA. Even in the WhatsApp Calling SIP scenario, where Opus may be available on the Meta side, the audio would still be transcoded down to G.711 before reaching OpenAI if the OpenAI SIP leg only accepts PCMU/PCMA.

So, unless there is a specific SDP format required for PCM over SIP, or some other supported wideband codec on the OpenAI SIP endpoint, increasing the audio sample rate is not currently feasible with direct SIP integration.

It would be useful to clarify whether audio/pcm in calls.accept is expected to apply to SIP/RTP codec negotiation, or only to non-SIP Realtime media flows.

The issue is not equally distributed across all speech. General conversation is often understandable, but the most problematic parts are critical short entities: names, numbers, addresses, and payment methods. This is especially problematic because those are exactly the fields that need high accuracy in telephony workflows.

In Portuguese phone calls, the model can follow the overall intent, but it frequently mishears proper names or short payment-related terms, even when the user speaks naturally. That makes the workflow risky unless we add explicit confirmation steps.
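One way to structure those confirmation steps on the application side is to treat each critical field as unconfirmed until the caller explicitly says yes. The sketch below is an assumption about how that could be wired up, with a hypothetical `CriticalField` class and a small, incomplete list of Portuguese yes/no words. The prompt text would be fed to the model or a TTS step of your choice.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Small illustrative word lists; real calls will need broader coverage.
YES_WORDS = {"sim", "isso", "correto", "exato"}
NO_WORDS = {"não", "nao", "errado", "incorreto"}


@dataclass
class CriticalField:
    label: str
    candidate: Optional[str] = None
    confirmed: bool = False

    def hear(self, value):
        """Store what was understood; it stays unconfirmed until answered."""
        self.candidate = value.strip()
        self.confirmed = False

    def confirmation_prompt(self):
        """Question to read back to the caller (Portuguese for this use case)."""
        return f'Só para confirmar, {self.label}: "{self.candidate}". Está correto?'

    def answer(self, reply):
        """Apply the caller's reply; a 'no' clears the value so it is re-asked."""
        words = set(re.findall(r"\w+", reply.lower()))
        if words & NO_WORDS:
            self.candidate = None
            self.confirmed = False
        elif words & YES_WORDS:
            self.confirmed = True
        return self.confirmed
```

Negative words are checked first so that a reply like "não está correto" is not mistaken for a yes. Downstream actions such as placing the order would only read fields where `confirmed` is true.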