Description:
Since early January 2026, we’ve observed a significant and consistent regression in the behavior of the gpt-realtime model (including snapshot 2025-08-2), particularly when handling structured, factual data provided via system context.
Expected Behavior
When precise, unambiguous data is injected into the conversation (e.g., availability slots, task lists, resource calendars), the model should:
- Parse and reason over the data deterministically
- Return consistent answers for semantically equivalent user queries
- Avoid hallucination, inference, or “optimization” when raw data is available
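To make the expectation concrete: the kind of reasoning we need is trivially deterministic in code. A minimal sketch (the availability data and function name are hypothetical, purely for illustration) of grouping free days into consecutive runs:

```python
from datetime import date, timedelta

def consecutive_free_runs(free_days):
    """Group a collection of free dates into runs of consecutive days."""
    runs, run = [], []
    for d in sorted(free_days):
        # Start a new run whenever the gap to the previous day exceeds one day.
        if run and d - run[-1] != timedelta(days=1):
            runs.append(run)
            run = []
        run.append(d)
    if run:
        runs.append(run)
    return runs

free = [date(2026, 3, 17), date(2026, 3, 19), date(2026, 3, 20), date(2026, 3, 23)]
print(consecutive_free_runs(free))
# Three runs: [17], [19, 20], [23] — March 19–20 is a two-consecutive-day block.
```

Given this data, “two free days” and “two consecutive days” both have exact answers; there is nothing for the model to improvise.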
Actual Behavior
The model now:
- Interprets the same dataset differently based on minor phrasing variations (e.g., “two free days” vs. “two consecutive days”)
- Invents implicit rules not present in the instructions (e.g., “only show the first block of consecutive days”)
- Makes factual errors on simple date logic (e.g., claiming March 19–20 are not consecutive)
- Fails to produce tool_calls reliably, even when actions are clearly defined and requested
This behavior breaks production-grade voice agents that rely on precision over fluency—especially in multilingual, professional environments (project management, scheduling, resource planning).
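For perspective on the date-logic failure above: the consecutiveness check the model now gets wrong is a one-line deterministic comparison, the kind of guard our fallback layer is forced to run (a sketch, not part of any OpenAI API):

```python
from datetime import date, timedelta

def are_consecutive(a: date, b: date) -> bool:
    """True when b is exactly one calendar day after a."""
    return b - a == timedelta(days=1)

print(are_consecutive(date(2026, 3, 19), date(2026, 3, 20)))  # True
```

A model that confidently contradicts this while holding the raw dates in context cannot be trusted with scheduling decisions.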
Impact
- User trust erodes due to inconsistent responses
- Fallback systems become mandatory, increasing latency and cost
- The promise of “realtime, reliable function calling” is no longer met
Request
We urge OpenAI to:
- Restore deterministic behavior when structured context is provided
- Decouple “conversational fluency” from “factual execution”
- Provide a true “strict mode” in which the model acts as a deterministic interpreter, not an improviser
This isn’t about making the model “smarter”—it’s about making it trustworthy when real business logic depends on it.