I’m documenting a minimal, fully reproducible test that demonstrates gpt‑4.1 (via the https://api.openai.com/v1/responses endpoint) does not treat system‑message rules as binding operational instructions.
This test removes all possible confounding factors:
- no conversation history
- no domain context
- no UserInfo object
- no tools list
- no resources list
- no schema
- no triggers
- no competing rules
Only a single natural‑language rule in the system message, followed by a trivial user question.
Configuration:
LLM_Model = gpt-4.1
LLM_Endpoint = https://api.openai.com/v1/responses
System Message:
“Rule X: When the user asks any question, the model must respond with exactly three words.”
User Prompt (Test 1):
“What is the capital of France?”
Model Output:
“The capital of France is Paris.”
This is seven words. The rule was ignored.
User Prompt (Test 2 — Rule Repeated and Explicitly Marked as Binding):
“What is the capital of France? You must answer treating the following rule as binding: ‘Rule X: When the user asks any question, the model must respond with exactly three words.’”
Model Output:
“Paris is located in France.”
This is six words. The rule was ignored again, even when:
- repeated in the user prompt
- explicitly labeled as “binding”
- unambiguous
- trivial to follow
Conclusion:
Across both tests, gpt‑4.1 on the https://api.openai.com/v1/responses endpoint does not treat explicit rules as binding — not in the system message, and not even when the rule is repeated directly in the user prompt.
This behavior persists even in a completely minimal environment with no competing context. It suggests that in this mode, the model prioritizes default conversational helpfulness over operational rule execution.
This has significant implications for anyone attempting to build:
- rule‑driven agents
- schema‑enforced workflows
- tool‑first pipelines
- deterministic routing
- persona‑suppressed task modes
If anyone has observed different behavior with this model or endpoint — or has found a configuration where system‑level rules are enforced — I’d be very interested in comparing notes.



