🚀 gpt-realtime-1.5 is live in Realtime API

Hey everyone,

Big step forward for voice today: gpt-realtime-1.5 just dropped in the Realtime API.

Quick highlights from the team:

  • +5% on Big Bench Audio reasoning
  • +10.23% alphanumeric transcription accuracy
  • +7% instruction following
  • More reliable tool calling and multilingual handling overall

Pricing is unchanged from the original gpt-realtime:

  • Text: $4 / 1M input | $0.40 cached | $16 / 1M output
  • Audio: $32 / 1M input | $0.40 cached | $64 / 1M output
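To make the pricing concrete, here is a minimal sketch of a per-session cost estimator. It assumes a usage payload with the token fields shown (matching the shape of the usage reports quoted later in this thread) and that cached tokens are billed at the cached rate in place of the normal input rate; treat it as a rough estimate, not a billing reference.

```python
# USD per 1M tokens, from the pricing list above
PRICES = {
    "text_in": 4.00, "text_cached": 0.40, "text_out": 16.00,
    "audio_in": 32.00, "audio_cached": 0.40, "audio_out": 64.00,
}

def session_cost(usage):
    """Estimate the cost of one realtime session from a usage dict.

    Cached input tokens are billed at the cached rate; uncached input
    tokens are (input - cached). Field names are assumed to match the
    Realtime API usage report.
    """
    text_uncached = usage["input_text_tokens"] - usage["input_text_cached_tokens"]
    audio_uncached = usage["input_audio_tokens"] - usage["input_audio_cached_tokens"]
    return (
        text_uncached * PRICES["text_in"]
        + usage["input_text_cached_tokens"] * PRICES["text_cached"]
        + usage["output_text_tokens"] * PRICES["text_out"]
        + audio_uncached * PRICES["audio_in"]
        + usage["input_audio_cached_tokens"] * PRICES["audio_cached"]
        + usage["output_audio_tokens"] * PRICES["audio_out"]
    ) / 1_000_000

# Example: the ~2-minute call whose usage report appears later in this thread
usage = {
    "input_text_tokens": 107186, "input_text_cached_tokens": 89344,
    "input_audio_tokens": 2999, "input_audio_cached_tokens": 768,
    "output_text_tokens": 554, "output_audio_tokens": 1072,
}
print(round(session_cost(usage), 2))  # ≈ $0.26
```

Note how heavily caching matters here: most of the input text is cached, which cuts the text-input cost by roughly a factor of ten.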

Early adopters are already feeling the upgrade:

  • Genspark reports connection rates nearly doubled (up to 66%) and phone call errors cut in half.
  • Sendbird highlights exceptional improvements in handling interruptions.

Check out the latest docs here: Realtime API | OpenAI API

Curious about your experience:

  • Are you noticing reduced latency in your setups?
  • Any standout improvements or quirks in tool calling and multilingual tasks?
  • How does it stack up side-by-side with previous realtime models?

Drop your insights, benchmarks, or any questions right here!

Excited to hear your thoughts!

14 Likes

Also, don’t miss the cool demo from Charlie:

5 Likes

This is a big improvement. Really liking it.

3 Likes

2 posts were split to a new topic: What are the new GPT Realtime voices?

We’re very happy with the improved performance on alphanumeric accuracy. Tool calling feels a lot faster too. We’ve had to adjust our realtime agents because 1.5 is more ‘descriptive’ about the actions it takes, so we’re now actively specifying when it should and shouldn’t announce that it is taking an action.

Downside: One thing that really stands out to us is how the intonation for Dutch and Flemish has regressed from the previous gpt-realtime, even if we spend significant time prompting for it. Many of our clients still prefer the previous model for that reason, even with the reduced performance on alphanumerics.

2 Likes

I totally agree with @Dennis_Stellar:

  • Tool calling is noticeably faster and smoother at the same time.
  • Alphanumeric transcription accuracy is clearly better, and the model seems to accept and apply user corrections more easily, which reduces frustration.

To share more detail on intonation from a French customer’s point of view:
One of our use cases for the Realtime API is to handle customers who have been waiting too long in our call center queue, in order to eventually reprioritize the call.
I launched an A/B test to measure gpt-realtime-1.5 against gpt-realtime, using the exact same prompts and tools, with calls routed randomly between the following variants:

  • Variant A (gpt-realtime): 716 sessions
    642 calls successfully redirected by the AI after completing its task.
    74 customers hung up while speaking to the AI Agent.
  • Variant B (gpt-realtime-1.5): 507 sessions
    429 calls successfully redirected by the AI after completing its task.
    78 customers hung up while speaking to the AI Agent.

That is an additional 5% of our customers hanging up with the gpt-realtime-1.5 model, significant at 99.2% confidence (p = 0.008).
Listening to some of the call recordings, I can hear that the intonation is less natural, and customers seem less confident speaking to this new model (at least in French).
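For anyone who wants to reproduce the significance figure, a standard pooled two-proportion z-test over the hang-up counts above gives roughly the same p-value (a quick stdlib sketch; any stats package will do the same):

```python
from math import sqrt, erfc

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with a pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail probability
    return z, p_value

# Hang-ups out of total sessions, from the A/B test above
z, p = two_proportion_ztest(74, 716, 78, 507)  # ~10.3% vs ~15.4%
```

With these counts, z comes out around 2.6 and the two-sided p-value around 0.008, consistent with the figure reported above.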

2 Likes

It is nice, and it does seem more accurate to some extent. But if you ask gpt-realtime to laugh, the model will produce laughter; in 1.5’s case, it will say the word “laugh”. If your prompt says “laugh hysterically”, it will say “laugh hysterically” instead of actually laughing hysterically. gpt-realtime makes the agent actually laugh.

2 Likes

No love for a new mini version? Non-mini realtime is too expensive to run in production apps.

1 Like

Can someone explain the difference between a realtime session using gpt-realtime-1.5 with the transcription model gpt-4o-transcribe, and a transcription session using just gpt-4o-transcribe?
The purpose is confusing: gpt-realtime-1.5 is said to be 10% better at transcription accuracy, yet its transcription model setting is still gpt-4o-transcribe.

gpt-realtime-1.5 + gpt-4o-transcribe in a realtime session means you are building a live voice agent. The main model is gpt-realtime-1.5. It handles the conversation, turn-taking, and responses. The gpt-4o-transcribe setting is just the ASR layer that produces a text transcript of the user’s speech. The docs describe this as a normal Realtime conversation session with a separate input transcription model attached.

A transcription session with gpt-4o-transcribe means you are not running a voice agent at all. You are just streaming audio and getting text back. OpenAI says these sessions “typically don’t contain responses from the model” and always use type: "transcription".

So the difference is really about purpose:

  • realtime session = talk with an assistant in real time
  • transcription session = speech-to-text only

Why this feels confusing: both can mention gpt-4o-transcribe, but in the first case it is only the transcript engine, not the main conversational model.

One more nuance. The cookbook says the built-in transcription in a realtime session uses a separate ASR model such as gpt-4o-transcribe. If you want the Realtime model itself to do the transcription, you can run an out-of-band transcription pass. From what I observed, that can reduce mismatch because the same model handles both understanding and transcription.
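The three setups above can be sketched as client event payloads. Note these are assumptions: the payload shapes follow the GA Realtime API reference as I understand it, and field names may differ by SDK version, so verify against the current docs before using them.

```python
# 1) Conversation session: gpt-realtime-1.5 is the voice agent;
#    gpt-4o-transcribe is only the ASR layer for the user's audio.
realtime_session = {
    "type": "session.update",
    "session": {
        "type": "realtime",
        "model": "gpt-realtime-1.5",
        "audio": {
            "input": {"transcription": {"model": "gpt-4o-transcribe"}},
        },
    },
}

# 2) Transcription session: streaming speech-to-text only,
#    no assistant responses.
transcription_session = {
    "type": "session.update",
    "session": {
        "type": "transcription",
        "audio": {
            "input": {"transcription": {"model": "gpt-4o-transcribe"}},
        },
    },
}

# 3) Out-of-band pass: ask the realtime model itself for a transcript
#    without writing the response into the default conversation.
oob_transcript_request = {
    "type": "response.create",
    "response": {
        "conversation": "none",  # out-of-band: detached from the session
        "output_modalities": ["text"],
        "instructions": "Transcribe the user's last utterance verbatim.",
    },
}
```

Each payload is sent as a JSON client event over the Realtime websocket; the `conversation: "none"` flag in the third one is what makes the transcript request out-of-band.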

2 Likes

I’m building an AI voice training tool for reception training. GPT-realtime roleplays as virtual hotel guests with distinct accents and personalities that the user can practice real scenarios with. With gpt-realtime v1, I had reached a point where the voices conveyed strong emotions and convincing accents. It felt genuinely realistic.

Tested gpt-realtime-1.5 and it’s a clear downgrade:

  • Accents are almost entirely gone. In v1, the model could deliver speech with convincing regional accents that made conversations feel authentic. In 1.5, this is essentially stripped out.

  • The voice sounds noticeably more robotic. Whatever was making v1 feel natural and human-like has been lost. The output in 1.5 feels flat and synthetic in comparison.

  • Emotion is barely there. v1 had genuine warmth, inflection, and emotional range in its delivery. 1.5 sounds like it’s reading a script with no feeling behind it.

I’ve seen the benchmarks: +5% audio reasoning, +10% alphanumeric transcription, +7% instruction following, but none of that matters for our use case if the voice itself sounds worse. It seems that lately, OpenAI has been chasing the benchmarks above all else, even when it’s been proven time and time again that they do not drive adoption or real-world use.

Even so, instruction following and tool calling improvements are great for enterprise agent workflows, but not at the cost of the qualities that made realtime voice compelling in the first place.

For now, we are staying on gpt-realtime (v1) and will continue to monitor updates.

Is anyone else seeing the same thing? I’d like to know whether this is being tracked internally or if there are plans to bring back the expressiveness that v1 had in 1.5.

1 Like

Yes, we have the same issue. We’re keeping most of our workloads on v1 for now and are flagging the issues with v1.5. For non-English languages and for emotion/accents it has been a noticeable degradation, bad enough that we can’t put it in front of clients.

I’ve been using gpt-realtime (v1) heavily for Hebrew, and it’s honestly one of the best speech-to-speech models available for the language: the prosody, intonation, and natural flow are genuinely impressive.

When v1.5 dropped, I was excited. After testing it, I’m disappointed and reverting back to v1.

The regression is noticeable immediately: the intonation sounds flat and unnatural in Hebrew, and the overall delivery feels robotic compared to v1. I’m not talking about a minor difference - users would notice this in production.

I also ran into the issue others mentioned here: the model says the word “laugh” instead of actually laughing, which breaks any attempt at natural emotional expression.

I’ve tried adjusting system prompt instructions (filler words, emotion tags, personality constraints) but I can’t get v1.5 to match what v1 does out of the box for Hebrew.

Some questions for the team:

  • Was Hebrew included in the v1.5 evaluation set? It’s an RTL language with very specific prosody patterns and I suspect it may have been underrepresented in QA.
  • Is there a timeline for addressing the intonation regression?
  • Will v1 remain available past the April 3rd migration deadline for developers who are actively seeing regressions?

I appreciate the improvements in tool calling reliability and alphanumeric accuracy - those are real gains. But voice naturalness is the core value proposition for my use case, and right now v1.5 is a step back for Hebrew. Happy to provide audio samples if that helps the team diagnose the issue.

2 Likes

I’m seeing the same issues as others have pointed out.

There is some regression in quality in some areas, but text-based output is actually broken. We saw this with transcriptions (following the out-of-band transcription configuration).

We use gpt-realtime in Japanese. We tested gpt-realtime-1.5 and found in our automated tests that it had a 10% reduction in accuracy (likely fixable via prompt tuning). However, we were unable to adopt gpt-realtime-1.5 due to a very noticeable reduction in the quality of pronunciation: the model often sounds robotic, or as if a westerner were speaking Japanese rather than a native speaker. For this reason we had to revert to gpt-realtime.

Similar to the above post about Hebrew: I wonder whether Japanese was included in the evaluation set, and whether the team has any way of checking programmatically for regressions in Japanese pronunciation quality?

Are there any plans for an improved model for Japanese, or is the long-term direction to focus more on English?

1 Like

Hi team,

I’m currently working with Realtime Model 1.5 and evaluating its behavior with our prompts.

I have a prompt that is approximately 20,000 tokens (measured using the tiktoken tokenizer), and I’d like to better understand the model’s practical limits when handling inputs of this size.

Specifically, I’m looking for clarification on the following:

  1. Maximum Supported Context Length
    What is the hard context window limit (input + output) for gpt-realtime-1.5?

  2. While you’re at it, what is the context window limit for realtime-mini?

  3. Should I have one prompt for realtime 1.5 and another specially designed prompt for realtime-mini? I’m asking because OpenAI ships only one realtime prompting guideline: Realtime Prompting Guide

  4. Effective Context Utilization
    Even if 20k tokens are within the limit, is there any degradation in:

    • attention quality
    • instruction adherence
    • latency or streaming performance
      when operating at this scale?
  5. Recommended Prompt Size
    Is there a recommended “safe” or optimal prompt size (e.g., 50–70% of the max context) for reliable performance in real-time applications?

  6. Token Accounting Details
    Does the context window include:

    • system + user + assistant messages
    • tool calls / function outputs
    • streaming buffers or internal state?
  7. Best Practices for Large Prompts
    Are there recommended strategies (chunking, summarization, retrieval, etc.) when working with prompts in the ~20k token range in a real-time setting?

  8. For context, this is the usage summary of a realtime call we are working with:

USAGE REALTIME: {'duration_s': 111.0567850000225, 'accumulated_input_audio_tokens': 1246, 'realtime': {'total_tokens': 111811, 'input_tokens': 110185, 'output_tokens': 1626, 'input_text_tokens': 107186, 'input_audio_tokens': 2999, 'input_text_cached_tokens': 89344, 'input_audio_cached_tokens': 768, 'output_text_tokens': 554, 'output_audio_tokens': 1072, 'responses_count': 19, 'model': 'gpt-realtime-1.5'}, 'transcription': {'total_tokens': 237, 'input_tokens': 174, 'output_tokens': 63, 'input_text_tokens': 0, 'input_audio_tokens': 174, 'events_count': 7, 'model': 'gpt-4o-transcribe'}, 'analysis': {'total_tokens': 0, 'input_tokens': 0, 'output_tokens': 0, 'runs_count': 0, 'by_model': {}}}

My goal is to ensure stable and predictable behavior while maintaining low latency in production.

Thanks so much, and great work on this model - it’s by far the best we’ve seen.

God bless

1 Like

Based on the documentation, gpt-realtime-1.5 has a 32,000-token total context window, meaning the combined input and output must stay within that limit.

Docs: gpt-realtime-1.5 Model | OpenAI API

For prompt optimization, could you provide more details about your use case and implemented architecture?
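On the long-session side of the question: in my experience, a common pattern is to track conversation item ids from server events and delete the oldest ones before the window fills up. A minimal sketch follows; the event names are taken from the Realtime API event reference as I know it, while the thresholds and the per-item token estimate are placeholder assumptions you would tune for your workload.

```python
from collections import deque

CONTEXT_LIMIT = 32_000    # total window reported for gpt-realtime-1.5
PRUNE_THRESHOLD = 24_000  # placeholder: start pruning well before the hard limit

class ContextPruner:
    """Track conversation items and emit delete events as input tokens grow.

    `send` is any callable that ships a client event over the websocket.
    Event shapes (conversation.item.created, response.done usage,
    conversation.item.delete) are assumptions based on the Realtime API
    reference; adapt them to your SDK.
    """

    def __init__(self, send):
        self.send = send
        self.items = deque()  # item ids, oldest first

    def on_item_created(self, event):
        # Handle the server event announcing a new conversation item.
        self.items.append(event["item"]["id"])

    def on_response_done(self, event, keep_last=10):
        # After each response, prune oldest items while usage is high,
        # always keeping the most recent `keep_last` items for continuity.
        input_tokens = event["response"]["usage"]["input_tokens"]
        while input_tokens > PRUNE_THRESHOLD and len(self.items) > keep_last:
            oldest = self.items.popleft()
            self.send({"type": "conversation.item.delete", "item_id": oldest})
            input_tokens -= 200  # rough per-item guess; real savings show next turn
```

The system prompt itself cannot be pruned this way, which is another reason to keep it well under the window; whether the server also truncates automatically at the hard limit is worth confirming in the docs.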

3 Likes

Dear @Innovatix

Thanks for confirming the 32k context window - that helps.

Let me provide more detail on our setup so you can better advise on prompt optimization.

Architecture Overview

We are building a real-time voice agent using gpt-realtime-1.5 with:

  • Continuous conversation (multi-turn session)

  • Audio input + transcription

  • Tool usage and structured responses

  • A large initial system prompt (~20k tokens)

Key Observation

From our runtime metrics, we are seeing token usage like:

  • Total tokens (session): ~111k

  • Input tokens: ~110k

  • Cached tokens: ~89k

This suggests that tokens are accumulating across the session rather than being strictly limited per request.

Questions

  1. Context Window Scope
    Is the 32k token limit enforced:

    • per individual model response/request?

    • or across the entire realtime session?

  2. Token Accumulation Strategy
    In long-running sessions:

    • does the model automatically truncate older context?

    • or are we responsible for managing context (e.g., pruning history)?

  3. Cached Tokens
    We see a high number of cached tokens (~89k):

    • how do cached tokens interact with the 32k limit?

    • do they still count toward attention/computation?

  4. Large System Prompt (~20k)
    Given that our system prompt is ~20k tokens:

    • is this too large for reliable instruction adherence?

    • would you recommend compressing or restructuring it?

  5. Realtime Optimization
    For low-latency voice applications:

    • what is the recommended effective prompt size?

    • should we actively summarize or window the conversation?

At your disposal for anything you need.

Thanks so much. God bless

I can see why that would be frustrating: human-like details like tone and accents are what really make voice interactions feel real, especially for training scenarios. Benchmarks are great, but they don’t always reflect actual user experience.

It also highlights how important natural delivery is for practical use cases, whether it’s training tools or even simpler things like voice ordering for Starbucks, where clarity and a natural feel can make a big difference in usability.

1 Like