We’re very happy with the improved alphanumeric accuracy. Tool calling feels a lot faster too. We’ve had to adjust our realtime agents because 1.5 is more ‘descriptive’ about the actions it takes, so we’re now explicitly specifying when it should or shouldn’t indicate that it is taking an action.
Downside: One thing that really stands out to us is how the intonation for Dutch and Flemish has regressed from the previous gpt-realtime, even if we spend significant time prompting for it. Many of our clients still prefer the previous model for that reason, even with the reduced performance on alphanumerics.
Tool calling is noticeably faster and smoother.
Alphanumeric transcription accuracy is clearly better, and the model seems to take user corrections into account more readily, which reduces frustration.
To share more details about the intonation from a French customer's point of view:
One of our use cases for the Realtime API is handling customers who have been waiting too long in our call center queue, so that the call can eventually be reprioritized.
I launched an A/B test to measure gpt-realtime-1.5 against gpt-realtime, using the exact same prompts and tools, with calls randomly assigned to one of the following variants:
Variant A (gpt-realtime): 716 sessions
→ 642 calls successfully redirected by the AI after completing its task.
→ 74 customers hung up while speaking to the AI Agent.
Variant B (gpt-realtime-1.5): 507 sessions
→ 429 calls successfully redirected by the AI after completing its task.
→ 78 customers hung up while speaking to the AI Agent.
That is roughly 5 percentage points more customers hanging up with gpt-realtime-1.5 (15.4% vs. 10.3%), with a confidence of 99.2% (p=0.008).
Listening to some call recordings, I can hear that the intonation is less natural, and customers seem less confident speaking to this new model (at least in French).
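For anyone who wants to check the numbers, here is a minimal sketch that recomputes the hang-up rates and a significance level from the session counts above. It assumes a two-sided pooled two-proportion z-test; the post does not say which test was actually used, and the function name `two_proportion_z_test` is made up for illustration.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided pooled z-test for a difference between two proportions.

    Assumption: this is one common way to test an A/B difference;
    it is not necessarily the test used in the original analysis.
    """
    rate_a = events_a / n_a
    rate_b = events_b / n_b
    # Pooled rate under the null hypothesis that both variants behave the same.
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal: P(|Z| > |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Hang-ups and session counts taken from the two variants above.
rate_a, rate_b, z, p_value = two_proportion_z_test(74, 716, 78, 507)

print(f"hang-up rate A (gpt-realtime):     {rate_a:.1%}")
print(f"hang-up rate B (gpt-realtime-1.5): {rate_b:.1%}")
print(f"difference: {(rate_b - rate_a) * 100:.1f} percentage points")
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

Under that assumption the result lands close to the reported p=0.008, with hang-up rates of about 10.3% and 15.4%.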
It is nice and seems more accurate to some extent. But if you ask gpt-realtime to laugh, the model produces actual laughter, whereas 1.5 just says “laugh”. If your prompt says “laugh hysterically”, it will say “laugh hysterically” instead of actually laughing hysterically. gpt-realtime makes the agent actually laugh.