We’re very happy with the improved performance on alphanumeric accuracy. Tool calling feels a lot faster too. We’ve had to adjust our realtime agents because 1.5 is more ‘descriptive’ about the actions it takes, so we’re now explicitly specifying when it should and shouldn’t announce that it is taking an action.
Downside: One thing that really stands out to us is how the intonation for Dutch and Flemish has regressed from the previous gpt-realtime, even if we spend significant time prompting for it. Many of our clients still prefer the previous model for that reason, even with the reduced performance on alphanumerics.
Tool calling is noticeably faster and smoother at the same time.
Alphanumeric transcription accuracy is clearly better, and the model seems to accept and take user corrections into account more easily, which reduces frustration.
To share more details about the intonation from a French customer’s point of view:
One of our use cases for the Realtime API is handling customers who have been waiting too long in our call-center queue, in order to eventually reprioritize the call.
I launched an A/B test to measure gpt-realtime-1.5 performance against gpt-realtime, using the exact same prompts and tools, with calls redirected randomly between the following variants:
Variant A (gpt-realtime): 716 sessions
→ 642 calls successfully redirected by the AI after completing its task.
→ 74 customers hung up while speaking to the AI Agent.
Variant B (gpt-realtime-1.5): 507 sessions
→ 429 calls successfully redirected by the AI after completing its task.
→ 78 customers hung up while speaking to the AI Agent.
This means we lose an additional ~5 percentage points of our customers with the gpt-realtime-1.5 model, with 99.2% confidence (p=0.008).
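For what it’s worth, the reported significance can be reproduced with a standard two-tailed two-proportion z-test on the hang-up counts (a sketch; the original poster didn’t say which test they used, so the pooled z-test here is an assumption that happens to match the stated p-value):

```python
import math

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Two-tailed two-proportion z-test; returns (z statistic, p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                       # pooled hang-up rate
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return z, math.erfc(abs(z) / math.sqrt(2))           # two-tailed p-value

# Variant A (gpt-realtime): 74 hang-ups out of 716 sessions
# Variant B (gpt-realtime-1.5): 78 hang-ups out of 507 sessions
z, p = two_proportion_z_test(74, 716, 78, 507)
print(f"z = {z:.2f}, p = {p:.4f}")  # p comes out around 0.008, as reported
```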
Listening to some call recordings, I can hear that the intonation is less natural, and customers seem less confident speaking to this new model (at least in French).
It is nice and seems more accurate to some extent. But if you ask gpt-realtime to laugh, the model will produce laughter; in 1.5’s case, it will say the word “laugh”. If your prompt says “laugh hysterically”, it will say “laugh hysterically” instead of actually laughing hysterically. gpt-realtime makes the agent actually laugh.
Can someone explain the difference between a realtime session using gpt-realtime-1.5 AND the transcription model gpt-4o-transcribe, versus a transcription session using just gpt-4o-transcribe?
The purpose is confusing to me.
gpt-realtime-1.5 is advertised as a 10% increase in transcription accuracy, yet the transcription model in its settings is still gpt-4o-transcribe.
gpt-realtime-1.5 + gpt-4o-transcribe in a realtime session means you are building a live voice agent. The main model is gpt-realtime-1.5. It handles the conversation, turn-taking, and responses. The gpt-4o-transcribe setting is just the ASR layer that produces a text transcript of the user’s speech. The docs describe this as a normal Realtime conversation session with a separate input transcription model attached.
A transcription session with gpt-4o-transcribe means you are not running a voice agent at all. You are just streaming audio and getting text back. OpenAI says these sessions “typically don’t contain responses from the model” and always use type: "transcription".
So the difference is really about purpose:
realtime session = talk with an assistant in real time
transcription session = speech-to-text only
Why this feels confusing: both can mention gpt-4o-transcribe, but in the first case it is only the transcript engine, not the main conversational model.
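To make the distinction concrete, here is a sketch of the two session shapes as event payloads. The field names follow the Realtime API docs at the time of writing and have changed between API versions, so treat the exact keys as illustrative rather than definitive:

```python
# Realtime conversation session: gpt-realtime-1.5 drives the conversation;
# gpt-4o-transcribe is only the side-channel ASR for the user's input audio.
realtime_session = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-1.5",
        "input_audio_transcription": {"model": "gpt-4o-transcribe"},
    },
}

# Transcription-only session: no conversational model and no responses,
# just streaming speech-to-text.
transcription_session = {
    "type": "transcription",
    "input_audio_transcription": {"model": "gpt-4o-transcribe"},
}
```

In the first payload gpt-4o-transcribe never generates responses; it only annotates the user’s turns with text.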
One more nuance. The cookbook says the built-in transcription in a realtime session uses a separate ASR model such as gpt-4o-transcribe. If you want the Realtime model itself to do the transcription, you can run an out-of-band transcription pass. From what I observed, that can reduce mismatch because the same model handles both understanding and transcription.
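An out-of-band pass like the cookbook describes is just a `response.create` that doesn’t write into the default conversation. A hedged sketch (the `conversation: "none"` mechanism is documented for out-of-band responses, but the instruction wording and metadata tag here are my own illustrations):

```python
# Ask the realtime model itself to transcribe the last user turn,
# without the result becoming part of the spoken conversation.
oob_transcription = {
    "type": "response.create",
    "response": {
        "conversation": "none",        # out-of-band: skip the default conversation
        "modalities": ["text"],        # text-only output, no audio response
        "instructions": "Transcribe the user's most recent audio verbatim.",
        "metadata": {"purpose": "oob_transcription"},  # tag for routing the result
    },
}
```

You then match the resulting response events by the metadata tag and treat their text output as the transcript.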
I’m building an AI voice training tool for reception training. GPT-realtime roleplays as virtual hotel guests with distinct accents and personalities that users can practice real scenarios with. With gpt-realtime v1, I had reached a point where the voices conveyed strong emotions and convincing accents. It felt genuinely realistic.
Tested gpt-realtime-1.5 and it’s a clear downgrade:
Accents are almost entirely gone. In v1, the model could deliver speech with convincing regional accents that made conversations feel authentic. In 1.5, this is essentially stripped out.
The voice sounds noticeably more robotic. Whatever was making v1 feel natural and human-like has been lost. The output in 1.5 feels flat and synthetic in comparison.
Emotion is barely there. v1 had genuine warmth, inflection, and emotional range in its delivery. 1.5 sounds like it’s reading a script with no feeling behind it.
I’ve seen the benchmarks: +5% audio reasoning, +10% alphanumeric transcription, +7% instruction following, but none of that matters for our use case if the voice itself sounds worse. It seems that lately, OpenAI has been chasing the benchmarks above all else, even when it’s been proven time and time again that they do not drive adoption or real-world use.
Even so, instruction following and tool calling improvements are great for enterprise agent workflows, but not at the cost of the qualities that made realtime voice compelling in the first place.
For now, we are staying on gpt-realtime (v1) and will continue to monitor updates.
Is anyone else seeing the same thing? I’d like to know whether this is being tracked internally or if there are plans to bring back the expressiveness that v1 had in 1.5.
Yes, we have the same issue. We’re keeping most of our workloads on v1 for now and are flagging the issues with v1.5. For non-English languages and for emotion/accents it has been a noticeable degradation, bad enough that we can’t put it in front of clients.
I’ve been using gpt-realtime (v1) heavily for Hebrew, and it’s honestly one of the best speech-to-speech models available for the language, the prosody, intonation, and natural flow are genuinely impressive.
When v1.5 dropped, I was excited. After testing it, I’m disappointed and reverting back to v1.
The regression is noticeable immediately: the intonation sounds flat and unnatural in Hebrew, and the overall delivery feels robotic compared to v1. I’m not talking about a minor difference - users would notice this in production.
I also ran into the issue others mentioned here: the model says the word “laugh” instead of actually laughing, which breaks any attempt at natural emotional expression.
I’ve tried adjusting system prompt instructions (filler words, emotion tags, personality constraints) but I can’t get v1.5 to match what v1 does out of the box for Hebrew.
Some questions for the team:
Was Hebrew included in the v1.5 evaluation set? It’s an RTL language with very specific prosody patterns and I suspect it may have been underrepresented in QA.
Is there a timeline for addressing the intonation regression?
Will v1 remain available past the April 3rd migration deadline for developers who are actively seeing regressions?
I appreciate the improvements in tool calling reliability and alphanumeric accuracy - those are real gains. But voice naturalness is the core value proposition for my use case, and right now v1.5 is a step back for Hebrew. Happy to provide audio samples if that helps the team diagnose the issue.
I’m seeing the same issues as others have pointed out.
There is some regression in quality in some areas, but text-based output is actually broken.
The area where I saw this is transcriptions (following the out-of-band transcription configuration).
We use gpt-realtime in Japanese. We tested gpt-realtime-1.5 and found a 10% reduction in accuracy in our automated tests (likely fixable via prompt tuning). However, we were unable to adopt gpt-realtime-1.5 due to a very noticeable reduction in pronunciation quality: the model often sounds robotic, or as if a westerner were speaking Japanese rather than a native speaker. For this reason we had to revert to gpt-realtime.
Similar to the post above about Hebrew, I wonder whether Japanese was included in the evaluation set, and whether the team has any way of programmatically checking for regressions in Japanese pronunciation quality.
Are there any plans for an improved model for Japanese, or is the long-term direction to focus more on English?
I’m currently working with Realtime Model 1.5 and evaluating its behavior with our prompts.
I have a prompt that is approximately 20,000 tokens (measured using the tiktoken tokenizer), and I’d like to better understand the model’s practical limits when handling inputs of this size.
Specifically, I’m looking for clarification on the following:
Maximum Supported Context Length
What is the hard context window limit (input + output) for gpt-realtime-1.5?
While you’re at it, could you also give me the context window limit for realtime-mini?
Should I have one prompt for realtime 1.5 and another specially designed prompt for realtime-mini? I’m asking because OpenAI ships only one realtime prompting guideline, the Realtime Prompting Guide.
Effective Context Utilization
Even if 20k tokens are within the limit, is there any degradation in:
attention quality
instruction adherence
latency or streaming performance
when operating at this scale?
Recommended Prompt Size
Is there a recommended “safe” or optimal prompt size (e.g., 50–70% of the max context) for reliable performance in real-time applications?
Token Accounting Details
Does the context window include:
system + user + assistant messages
tool calls / function outputs
streaming buffers or internal state?
Best Practices for Large Prompts
Are there recommended strategies (chunking, summarization, retrieval, etc.) when working with prompts in the ~20k token range in a real-time setting?
For context, this is the summary of a realtime call we are working with.
Based on the documentation, gpt-realtime-1.5 has a 32,000-token total context window, meaning the combined input and output must stay within that limit.
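Assuming that 32k figure, a quick back-of-the-envelope budget shows why a 20k-token system prompt is tight for a realtime session (the 4k output reserve below is an illustrative choice, not a documented number):

```python
def remaining_budget(context_window: int, prompt_tokens: int, output_reserve: int) -> int:
    """Tokens left for conversation history once the system prompt
    and a reserve for model output are accounted for."""
    return context_window - prompt_tokens - output_reserve

# 32k window, ~20k system prompt, ~4k reserved for responses:
# only ~8k tokens remain for the rolling conversation before truncation kicks in.
left = remaining_budget(32_000, 20_000, 4_000)
print(left)  # 8000
```

In practice that means a long session will start evicting or summarizing earlier turns fairly quickly, which argues for trimming the prompt or moving reference material out of it (retrieval, tool lookups) rather than running near the ceiling.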
I can see why that would be frustrating; human-like details like tone and accents are what really make voice interactions feel real, especially for training scenarios. Benchmarks are great, but they don’t always reflect actual user experience.
It also highlights how important natural delivery is for practical use cases, whether it’s training tools or simpler things like voice queries, where clarity and a natural feel make a big difference in usability.