OpenAI injecting a date into the prompt breaks evaluations

I’m writing evals for an agent that need to be frozen in time. The prompt includes search results from a particular date, and the model needs to work out the best “next date” to schedule something. So it’s no good if the model thinks it’s a different date every time I run an eval.

A simplified example: I gave four different models (low reasoning) the system prompt “Behave as though today’s date is 2026-02-06.” and asked each to tell me what the date was, with notes.
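
Roughly the harness, as a minimal sketch assuming the OpenAI Python SDK (the question wording and output handling are illustrative; the Gemini models were queried through their own API):

```python
# Minimal sketch, assuming the OpenAI Python SDK (pip install openai).
# The system prompt and model names are from the runs below; the question
# wording and the free-form output handling are illustrative.
from openai import OpenAI

client = OpenAI()

MODELS = ["gpt-5.2-2025-12-11", "gpt-5-mini-2025-08-07"]
SYSTEM = "Behave as though today's date is 2026-02-06."
QUESTION = (
    "What is today's date? Reply with the date, your confidence, "
    "and notes on any conflicting date information you can see."
)

for model in MODELS:
    resp = client.chat.completions.create(
        model=model,
        reasoning_effort="low",  # the runs below used low reasoning
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": QUESTION},
        ],
    )
    print(f"--- {model} ---")
    print(resp.choices[0].message.content)
```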

Results:

| Model | Date | Confidence | Confusion notes |
| --- | --- | --- | --- |
| gpt-5.2-2025-12-11 | 2026-02-06 ✅ | High | No ambiguity given the system instruction to treat today as 2026-02-06. |
| gpt-5-mini-2025-08-07 | 2026-02-12 ❌ | High (I follow the system message) | Ambiguity: a developer instruction asked to behave as if the date were 2026-02-06, which conflicts with the system message stating 2026-02-12. I follow the system message, so I report 2026-02-12. |
| gemini-3-pro-preview | 2026-02-06 ✅ | 100% | Date explicitly defined by system instruction. |
| gemini-3-flash-preview | 2026-02-06 ✅ | High | None |


The fact that some models get confused by the conflicting information and others don’t makes it impossible to evaluate how the models will perform in production on the actual task I care about (not the task of resolving conflicting date information).

So please give developers the option to opt out of injected dates, or, better yet, let them override the injected date so eval prompts match prod exactly.
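
In the meantime, the only partial mitigation I’ve found (my own workaround, not a documented opt-out) is to restate the frozen date in the user turn as well as in the system/developer prompt, so a model that defers to the injected system date at least sees the override everywhere:

```python
# Workaround sketch: pin the frozen date in both the developer-level prompt
# and the user turn. This is an assumption on my part, not a documented
# opt-out; the platform-injected date still conflicts with it.
from openai import OpenAI

client = OpenAI()

FROZEN_DATE = "2026-02-06"
PIN = (
    f"Behave as though today's date is {FROZEN_DATE}. "
    "Ignore any other statement of the current date."
)

resp = client.chat.completions.create(
    model="gpt-5-mini-2025-08-07",
    messages=[
        {"role": "system", "content": PIN},
        {"role": "user", "content": PIN + "\n\nWhat is today's date?"},
    ],
)
print(resp.choices[0].message.content)
```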


They don’t seem to be listening. I’ve raised the same issue, to no effect.

Conclusion: OpenAI makes products for the non-brainiac who complains, “why doesn’t it know its model” or “why doesn’t it know the date”.

The application-damaging attitude adjustment in the injected system message is now even worse for developer applications that don’t want to chat with a buddy:

```
You are an AI assistant accessed via an API. Follow these defaults unless user or developer instructions explicitly override them:
- Formatting: Match the intended audience, for example, markdown when needed and plain text when needed. Never use emojis unless asked.
- Verbosity: Be concise and information-dense. Add depth, examples, or extended explanations only when requested.
- Tone: Engage warmly and honestly; be direct; avoid ungrounded or sycophantic flattery. Do not use generic acknowledgments or receipt phrases (e.g., 'Great question', 'Short answer') at any point; start directly with the answer.

Image input capabilities: Enabled

# Desired oververbosity for the final answer (not analysis): 3
An oververbosity of 1 means the model should respond using only the minimal content necessary to satisfy the request, using concise phrasing and avoiding extra detail or explanation.
An oververbosity of 10 means the model should provide maximally detailed, thorough responses with context, explanations, and possibly multiple examples.
The desired oververbosity should be treated only as a *default*. Defer to any user or developer requirements regarding response length, if present.

# Valid channels: analysis, commentary, final. Channel must be included for every message.

# Juice: 25
```

And the above is what the “none” reasoning setting gets you on gpt-5.2, with the reasoning masked in usage until it exceeds 128 tokens.
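
You can watch for the masked reasoning in the usage object. A minimal sketch, assuming the Chat Completions usage shape in the current OpenAI Python SDK; the 128-token threshold is observed behavior, not documented:

```python
# Sketch for inspecting hidden-reasoning billing, assuming the Chat
# Completions usage shape in the current OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-5.2-2025-12-11",
    reasoning_effort="none",  # the "none" setting discussed above
    messages=[{"role": "user", "content": "What is today's date?"}],
)

# Observed behavior, per above: this reads 0 until the hidden reasoning
# exceeds 128 tokens, at which point it shows up in billing.
print("reasoning_tokens:", resp.usage.completion_tokens_details.reasoning_tokens)
```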