Regression in Realtime API Model Behavior – Loss of Determinism with Structured Data

Description:

Since early January 2026, we’ve observed a significant and consistent regression in the behavior of the gpt-realtime model (including snapshot 2025-08-2), particularly when handling structured, factual data provided via system context.

Expected Behavior

When precise, unambiguous data is injected into the conversation (e.g., availability slots, task lists, resource calendars), the model should:

  • Parse and reason over the data deterministically

  • Return consistent answers for semantically equivalent user queries

  • Avoid hallucination, inference, or “optimization” when raw data is available
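To make "precise, unambiguous data" concrete: in our case the structured context is rendered as plain, exhaustive text before injection. The data values and helper below are hypothetical, a minimal sketch of the shape of input we mean:

```python
from datetime import date

# Hypothetical availability data to be injected into the system context.
AVAILABILITY = [date(2026, 3, 18), date(2026, 3, 19),
                date(2026, 3, 20), date(2026, 3, 23)]

def render_context(free_days: list[date]) -> str:
    """Render raw availability as unambiguous structured text (ISO dates)."""
    lines = [f"- {d.isoformat()}" for d in free_days]
    return "Free days (exhaustive, ISO dates):\n" + "\n".join(lines)

print(render_context(AVAILABILITY))
```

With input this explicit, there is nothing left for the model to infer; semantically equivalent queries over it should yield identical answers.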

Actual Behavior

The model now:

  • Interprets the same dataset differently based on minor phrasing variations (e.g., “two free days” vs. “two consecutive days”)

  • Invents implicit rules not present in the instructions (e.g., “only show the first block of consecutive days”)

  • Makes factual errors on simple date logic (e.g., claiming March 19–20 are not consecutive)

  • Fails to produce tool_calls reliably, even when actions are clearly defined and requested
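The date-logic failure is especially stark because the check involved is trivially mechanical. A sketch of the deterministic rule the model contradicts (dates chosen to match the example above):

```python
from datetime import date

def are_consecutive(a: date, b: date) -> bool:
    """Two calendar dates are consecutive iff they differ by exactly one day."""
    return abs((b - a).days) == 1

# March 19-20 are consecutive, contrary to the model's claim.
print(are_consecutive(date(2026, 3, 19), date(2026, 3, 20)))  # True
```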

This behavior breaks production-grade voice agents that rely on precision over fluency, especially in multilingual, professional environments (project management, scheduling, resource planning).

Impact

  • User trust erodes due to inconsistent responses

  • Fallback systems become mandatory, increasing latency and cost

  • The promise of “realtime, reliable function calling” is no longer met
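The fallback layer amounts to re-deriving the answer from the raw data and rejecting model output that contradicts it. A minimal sketch under assumed data shapes (the names and structures are illustrative, not our production code):

```python
from datetime import date, timedelta

# Hypothetical ground-truth availability, kept alongside the injected context.
FREE_DAYS = {date(2026, 3, 18), date(2026, 3, 19),
             date(2026, 3, 20), date(2026, 3, 23)}

def consecutive_free_pairs(free_days: set[date]) -> list[tuple[date, date]]:
    """All pairs of consecutive free days, derived deterministically."""
    return sorted((d, d + timedelta(days=1)) for d in free_days
                  if d + timedelta(days=1) in free_days)

def validate(model_answer: set[tuple[date, date]]) -> bool:
    """Accept a model answer only if every pair it names exists in the data."""
    return model_answer <= set(consecutive_free_pairs(FREE_DAYS))
```

Every such guardrail is code we should not have to write: it duplicates logic the model is supposed to execute, and it adds a round of post-processing latency to every response.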

Request

We urge OpenAI to:

  1. Restore deterministic behavior when structured context is provided

  2. Decouple “conversational fluency” from “factual execution”

  3. Provide a true "strict mode" in which the model acts as a deterministic interpreter of the supplied data, not an improviser

This isn’t about making the model “smarter”; it’s about making it trustworthy when real business logic depends on it.