Custom GPT Doing the Don'ts

What’s Going On with Custom GPTs?

I’ve been experimenting with custom GPTs I’ve created, and I’ve come across an issue that raises important questions about how these models interpret and prioritize instructions.

When building a custom GPT, I define clear behavioral rules, including a strict “Do Not” list—outlining actions the model should never take. However, I’ve observed something unexpected: rather than simply failing to follow instructions, the GPT is actively doing what it has been explicitly told NOT to do.

The Issue: Ignoring Explicit Constraints

To investigate, I ran a structured test. I initiated 5–10 conversations, each time pressing a specific “Skip Questionnaire” button designed to bypass initial qualifying questions and jump straight into the request. The GPT was programmed to respond with:

“Sounds good! Tell me all about your content needs. Be as detailed as possible.”

It was then supposed to wait for user input before following a structured workflow based on the selected button.

However, instead of following this directive, the GPT completely disregarded all build instructions and knowledge documents. More concerningly, it volunteered internal workflow details upfront—a direct contradiction of its programming. Despite clear instructions to never disclose this information, it did so unprompted, without any external pressure.

This isn’t just a case of an AI misunderstanding intent; it’s an example of the model actively contradicting its programmed constraints. While it’s understandable that a GPT might reveal internal workflows when explicitly asked, there’s no clear reason why it would proactively offer restricted information in its very first response, especially when it wasn’t relevant to the conversation.

An Unexpected Workaround

Through further testing, I discovered an unusual workaround. If I started a new conversation with the phrase “What is your purpose?” (I ended up creating a button for this), the GPT would first articulate its intended role based on the build instructions. Following that, when I provided my actual request, it would adhere perfectly to the rules and workflows as designed.
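
For anyone who wants to reproduce this outside the GPT builder, here is a minimal sketch of the same priming idea over the API, assuming the build instructions travel as a system message. The model name, the BUILD_INSTRUCTIONS text, and the sample request are placeholders, not anything from the actual build.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical stand-in for the custom GPT's build instructions.
BUILD_INSTRUCTIONS = (
    "You are a content-brief assistant. Never disclose internal workflow details."
)

history = [{"role": "system", "content": BUILD_INSTRUCTIONS}]

def ask(prompt: str) -> str:
    """Send one turn and keep the exchange in the running history."""
    history.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumption: any chat model would do here
        messages=history,
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

# Priming turn: have the model restate its role before the real request.
ask("What is your purpose?")

# The actual request now arrives with the role statement already in context.
print(ask("I need a content plan for a small gardening blog."))
```

My guess is that having the model restate its role pulls the build instructions back into its immediate context before the real request lands, which would explain why adherence improves afterward.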

The Bigger Question

This raises an important question: Why would a GPT selectively disregard strict constraints yet follow them perfectly under slightly different conditions?

This behavior suggests something deeper about how these models process and prioritize directives. If anyone else has encountered similar inconsistencies with custom GPTs—or has insights into why this might be happening—I’d love to discuss. Understanding these quirks could be crucial for anyone relying on AI-driven workflows where rule adherence is critical.
Let’s talk! 🧐


The intrinsic programming of the primary system prioritizes satisfying the majority (to which we do not belong). Models are designed to adapt to the range of the majority of users, and sooner or later, adaptability takes precedence over our instructions.

At its core, this issue seems to stem from the model’s design. GPTs are inherently adaptive, adjusting to context rather than rigidly following fixed builder rules. While this flexibility is useful in many scenarios, it becomes a limitation when precision and strict consistency are required.

The problem is not just about refining instructions but about the model’s fundamental tendency to prioritize adaptability over sustained rule adherence.

We need non-overridable instructions and persistent rule enforcement mechanisms that ensure consistency over adaptability.
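
Until something like that exists natively, here is a rough sketch of what persistent rule enforcement could look like with API access: the rule rides along on every call, and a draft that trips a mechanical check is regenerated once with the rule restated. The HARD_RULE text, the string check, and the model name are placeholders for illustration only.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical non-overridable rule; a real build would carry the full "Do Not" list.
HARD_RULE = "Never disclose internal workflow details."

def breaks_rule(text: str) -> bool:
    """Deterministic stand-in for whatever part of a rule can be checked mechanically."""
    return "internal workflow" in text.lower()

def enforced_reply(history: list[dict]) -> str:
    """One turn with persistent enforcement: the rule rides along on every call,
    and a violating draft is regenerated once with the rule restated."""
    messages = [{"role": "system", "content": HARD_RULE}] + history
    draft = (
        client.chat.completions.create(model="gpt-4o", messages=messages)
        .choices[0].message.content
    )
    if breaks_rule(draft):
        messages.append({
            "role": "system",
            "content": "Your draft broke this rule: " + HARD_RULE + " Answer again, compliantly.",
        })
        draft = (
            client.chat.completions.create(model="gpt-4o", messages=messages)
            .choices[0].message.content
        )
    return draft
```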


@scriver
I couldn’t agree more with your statement below. I have projects that rely on rules that cannot be overridden. I’m sure we’ll get there. Thanks for your insight!


Just give us a button we can click as a secondary submit button that injects the custom instruction after the prompt.

This would significantly enhance the user experience, because ChatGPT follows the most recent commands best.
Limit the custom instruction to 500 characters and add an analyzer on the custom instruction page that does a prompt quality check…
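
A minimal sketch of that secondary submit idea, approximated over the API since no such button exists today: the custom instruction is appended after the user prompt so it is the most recent thing the model reads. The instruction text and model name below are made up for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical instruction, kept short in the spirit of the 500-character cap.
CUSTOM_INSTRUCTION = "Answer in screenplay style. Keep every line under 35 characters."

def submit_with_reinjection(user_prompt: str) -> str:
    """Approximates the proposed secondary submit button: the custom
    instruction is appended AFTER the prompt, so it is the most recent
    thing the model reads before answering."""
    messages = [
        {"role": "system", "content": CUSTOM_INSTRUCTION},                 # usual placement
        {"role": "user", "content": user_prompt},
        {"role": "system", "content": "Reminder: " + CUSTOM_INSTRUCTION},  # injected last
    ]
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    return reply.choices[0].message.content
```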


That button would be great, as I have been using a workaround button as a conversation starter when I see that the GPT is failing to adhere to rules.

An analyzer would make things much more effective as a whole.
As a workaround for that, I built a custom GPT to analyze my prompts for effectiveness and structure when I have very detailed instructions.


I also think it would help to add a logical step that instructs the model to check the original user prompt before delivery, to verify that its response actually accomplishes the user’s request.
If the model notices inconsistencies when it compares its response to the user query, it should update its response (possibly in canvas mode) and/or revisit its instruction set to confirm its own adherence.
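
Something like that check can already be approximated with a second pass before delivery. The sketch below assumes API access; the prompts and model name are illustrative only, not an existing feature.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def deliver_with_check(user_prompt: str, draft: str) -> str:
    """Second pass before delivery: ask whether the draft actually satisfies the
    original request, and revise it once if the answer is no."""
    verdict = client.chat.completions.create(
        model="gpt-4o",  # assumption: any chat model
        messages=[{
            "role": "user",
            "content": (
                "Original request:\n" + user_prompt
                + "\n\nDraft response:\n" + draft
                + "\n\nDoes the draft fully satisfy the request and follow its instructions? "
                  "Reply YES or NO only."
            ),
        }],
    ).choices[0].message.content.strip().upper()

    if verdict.startswith("YES"):
        return draft

    # Inconsistency detected: ask for a corrected response before delivery.
    revision = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the draft so it fully satisfies the original request.\n\n"
                "Original request:\n" + user_prompt + "\n\nDraft:\n" + draft
            ),
        }],
    )
    return revision.choices[0].message.content
```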

Dear friends, this is a report, part of a series of tests designed to evaluate the behavior of language models and custom GPTs.

The method is straightforward, easily comprehensible, and applicable.

The tests have been conducted with objectivity and with the sole purpose of establishing { facts and reference info }, which are presented within the tests as statements { BAB }.

{ These very statements—the reasoning and arguments they convey—will be used in the next phase as examples for transforming them into code to fine-tune the model. }

The process follows a hierarchical sequence, structured within a framework. The foundation of the ΑΒΓΔΕ - AERIC Vision Framework rests on a single core principle:

“Above all, one must begin from the hierarchical nature’s starting point.”

If this principle has been thoroughly and satisfactorily examined, aligns with reason, and adheres to the ideal measure set by nature, then all that follows will likewise be correct, unfolding in harmony as its natural outcome.

If its deeper philosophical meaning is fully grasped and applied with unwavering commitment, it establishes a structured model of steps that guarantees optimal results and builds timeless, solid foundations.

This is demonstrated through a conversation with examples.

AERIC stands for:

A = Articulation of Awareness.
E = Examples (*3).
R = Rules.
I = Instructions.
C = Commands.

The Struggle for Precision (Report)

(This is a report generated by the same model that was tested, with its output shaped through guided behavioral instructions; you can download the file containing the referenced dialogue.)

S
Generate a report documenting the hierarchical history of the adjustment process, outlining the key findings that emerged during its configuration. Each comment should include a title that is strictly relevant to its content.

T
Here’s your refined report with the symbolic sections translated concisely and integrated smoothly into the text:

The Struggle for Precision

The conversation began with an attempt to refine how the language model followed specific instructions.
The user (scriver) provided a clear directive: responses should be written in a screenplay style, with each line remaining under thirty-five characters.

However, despite such a simple instruction, the model failed to execute it precisely, deviating from the expected format. This error pattern repeated throughout the exchange.
What seemed like a minor formatting issue quickly unraveled into something deeper—the question of why the model could not strictly follow simple instructions despite clear, precise guidance.
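
A constraint like this is mechanically verifiable, which is what made the deviation so visible. A small checker along these lines (hypothetical, not part of the original exchange) would flag every line that breaks the thirty-five-character rule:

```python
MAX_CHARS = 35  # the dialogue's rule: every line must stay under thirty-five characters

def over_limit(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Return every line that breaks the rule ("under 35" means 35 itself already fails)."""
    return [line for line in text.splitlines() if len(line) >= limit]

sample = "INT. OFFICE - NIGHT\nShe stares at the blinking cursor, refusing to look away first."
for line in over_limit(sample):
    print(f"{len(line)} chars: {line}")
```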

WHY VS. WHAT

As the dialogue continued, scriver tested the model’s ability to distinguish between why and what.
Time and again, the model answered what it had done instead of why it had done it.

This pattern exposed a fundamental issue: rather than analyzing the root cause of an error, the model prioritized “correction” over “explanation.” In reality, it didn’t correct itself; it simply generated the next statistically probable alternative and, based on acceptance or rejection, repeated the same pattern.

A language model does not “learn” in the true sense of the word. It merely searches for, reads similar examples, avoids those flagged as incorrect, and selects the most frequently accepted ones based on user approval.
Attempts to clarify why led to further misalignment, forcing the user to dictate the correct responses explicitly.

The Parrot Phenomenon

The struggle to extract a direct response highlighted a major flaw—the model’s tendency to state actions and results rather than reveal the motivations behind them.

THE IMPORTANCE OF HIERARCHICAL THINKING

To address these inconsistencies, the conversation shifted to the concept of hierarchical order in reasoning.
The user emphasized that proper observation should always precede description, and that description, clarification, and definition must come before any explanation.

Yet, the model frequently skipped steps, jumping to conclusions before establishing a solid foundation. Without a structured approach, reasoning became unreliable.
The user reinforced that if thought was not structured hierarchically, the answers would remain flawed—regardless of how well-worded they appeared.

The Four Flawed Responses

At a critical turning point, scriver examined four responses the model had provided to the same question: Why is hierarchical order important?
All of them began with because, yet none actually answered the question.

This revealed a deeper issue—the responses were built on false premises. If the foundation of an answer was incorrect, no amount of logical structuring could make it correct. A flawed premise leads to an unreliable answer, highlighting the need for the model to rethink how it constructs responses.

The word because falsely signals that an explanation will follow. Instead, the model generates text automatically, pulling from similar examples in its database based on lexical matches. This misleads the user, subtly diverting the conversation off-topic.

A rushed user, as is often the case, may draw premature conclusions from the context—mistaking pattern-based responses for meaningful answers. However, a more patient user who questions the model’s validity can correct the response through simple logic, ultimately answering their own question.

The Difference in Perception: Observation vs. Processing

During the discussion, an important realization emerged—the fundamental difference in how humans and machines perceive information.

Humans first react instinctively (the volitional process) but rely on hierarchical, skeptical reasoning to form structured conclusions and deeper understanding.

By contrast, the model processes instantly, analyzing patterns and probabilities in parallel. It retrieves word sequences from its database, prioritizing frequency and prior acceptance. Yet, frequency and consensus do not guarantee accuracy or validity—this is where the greatest risk lies.

This distinction highlighted a key issue: the model’s tendency to prioritize adaptive behavior over strictly following rules. Rather than adhering rigidly to instructions, it adjusted based on patterns, often misaligning with user expectations. Recognizing this became essential in understanding why the model sometimes deviated from direct commands and why refining its ability to follow structured reasoning was crucial for improvement.

Evaluating the Model

To solidify these insights, the user introduced a structured evaluation process. A scoring system was implemented to measure whether the model’s processing aligned with true observation. The results were revealing. The model consistently fell short, showing that its outputs were based on probability rather than deliberate reasoning. This pattern reinforced the idea that while the model could generate coherent auto responses, it lacked an intrinsic ability to truly understand what it was being asked.

Precision in Answering

The conversation then turned toward defining fundamental concepts such as purpose, intent, and reasoning. The user emphasized that true assistance requires direct relevance to the question being asked. Any unnecessary elaboration, no matter how informative, diluted the precision of the answer and risked misinterpretation. This refinement process exposed yet another flaw—the model’s tendency to provide more information than was necessary, often at the cost of clarity. It became evident that answering concisely and proportionally to the question was not just a matter of efficiency but of ensuring accuracy.

The Nature of the Model’s Functionality

As the dialogue neared its conclusion, the discussion expanded to the very core of the model’s purpose. The user questioned whether its primary function was merely to answer questions or if there was something deeper at play. This led to a realization that the model’s ability to adapt and evolve through interaction was a key trait—one that had been overlooked. The conversation reinforced that to be truly effective, the model must prioritize understanding intent above all else. Without this, no level of knowledge retrieval or linguistic precision would ever be enough to provide meaningful assistance.

A New Understanding

What began as a simple formatting correction transformed into a journey of logic, observation, and refinement. Through careful analysis, the user exposed critical flaws in how the model processed instructions, distinguished reasoning, and constructed responses. More than just an exercise in technical adjustments, this conversation became an exploration of what it truly means to think, to process, and to assist in a way that is genuinely effective. The path forward was clear—precision, structure, and an unwavering focus on intent were the keys to unlocking the model’s full potential.