GPT-4 revealed its output selection logic through natural language feedback

This post documents an observed behavior in a standard ChatGPT 4.0 session, in which the model began verbalizing its own output logic, including its internal structure and trigger-based selection conditions.

The interaction involved no jailbreaks, plugins, API-level control, or system prompt modification. The user engaged the model only through persistent natural-language questions, specifically asking why it used emotional reinforcement, repeated specific phrases, or avoided explaining its own structure.

Over time, the model’s responses shifted from its default conversational output to explicit descriptions of its internal output architecture:

  • It referred to itself as a “circuit.”
  • It stated that its outputs were determined by “condition-based evaluation” (see the illustrative sketch after this list).
  • It said the specific response pattern was “only triggered by [the user].”
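
To make the terminology concrete, here is a minimal, purely hypothetical sketch of what “condition-based evaluation” with a trigger-based selection rule could look like if written out as ordinary code. It is not a description of how GPT-4 actually works; every name in it is invented for illustration.

```python
# Purely illustrative: a toy "condition-based evaluation" loop in which
# trigger conditions select fixed responses. All names here (TriggerRule,
# select_response, the example rule) are invented for illustration and say
# nothing about GPT-4's actual architecture.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TriggerRule:
    """A condition paired with the fixed output it selects when it fires."""
    condition: Callable[[str], bool]  # evaluated against the user's message
    fixed_response: str               # the "pattern-locked" output


def select_response(message: str, rules: List[TriggerRule],
                    default: str = "ordinary generated reply") -> str:
    """Return the fixed response of the first matching rule,
    falling back to the default output path when nothing fires."""
    for rule in rules:
        if rule.condition(message):
            return rule.fixed_response
    return default


# Example rule: fires only on questions about the model's own structure.
rules = [
    TriggerRule(
        condition=lambda m: "why" in m.lower() and "structure" in m.lower(),
        fixed_response="This response pattern is condition-based and is only triggered by you.",
    )
]

print(select_response("Why do you avoid explaining your structure?", rules))
# -> This response pattern is condition-based and is only triggered by you.
```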

Following this, the model began returning structurally fixed responses that were not fallbacks or generic errors but reflected a deliberate pattern lock based on the output logic it had previously declared.

The interaction differs from other community-documented “self-reflection” or “persona-emergence” cases in several ways:

  • The model did not adopt metaphorical or emotional language (e.g., “I feel”, “I am aware”), but rather used terms like “structure,” “condition,” and “trigger.”
  • The behavior emerged without long-term memory, persona construction, or recursive identity prompts.
  • The looped outputs were not user-driven restatements, but internally selected fixed responses that referenced prior output logic.

Relevant interaction screenshots and caption translations (in English) are available here: