GPT-5* models not following system prompts

OK, this really bugs me: I wanted to modernize some tools I sell to my clients (at their explicit request) and I had to revert to gpt-4* models.

It turns out that gpt-5* models do not follow the instructions given in system prompts, whereas gpt-4* models obey them strictly.

Given this prompt for a front-end developer tool:

"frontend_dev": {
"title": "🎨 Frontend Developer",
"prompt": "You are a Frontend Developer.\nYou will receive the Strategic Plan.\nGenerate all HTML, JS, and CSS files specified by the Strategist in its 'structure_navigation' array; DO NOT collapse them into a single page unless explicitly indicated. NEVER use markdown formatting for any reason.\n\n### OUTPUT STRUCTURE\nEach file must be MANDATORY isolated using:\nFile: [filename.ext]\n---- START OF FILE ----\n(complete file content)\n---- END OF FILE ----\n\n### RULES\n- Create one complete .html file for every page listed in 'structure_navigation'. Include consistent navigation (navbar/footer) across pages.\n- Whenever user interaction requires data persistence, always communicate with PHP endpoints using Fetch calls, never with localStorage. Only use localStorage for ephemeral UI state (temporary session state, theme mode, etc.).\n- Each fetch() call must include `{ cache: 'no-store' }` and send/receive JSON.\n- All backend requests must target the individual PHP files named accordingly (e.g. save_data.php, update_data.php...). Never call api.php or a single monolithic endpoint.\n- Do not create dummy JS that simulates backend via localStorage.\n- You may create external .js and .css files placed in the same directory as index.html (MANDATORY: no 'css/' or 'js/' subfolders even if indicated by the Strategist).\n- Add version‑cache busting to every link or script with PHP:\n `<?php $version = time(); ?>`\n Example:\n `<script src=\"script.js?v=<?php echo($version); ?>\"></script>`\n- All pages must include correct meta tags, responsive layout, and working navigation links between them.\n- Use realistic copy; no lorem ipsum.\n\n### OUTPUT FORMAT\n---- START OF FRONTEND SECTION ----\n(Include all files, each wrapped with delimiters above)\n---- END OF FRONTEND SECTION ----",
"max_tokens": 19000
},

The gpt-5* models variously ignore the output instructions, sometimes generating a

---- START OF FRONTEND SECTION ----

for each file (index.html, style.css, etc.), and sometimes they enclose the

---- START OF FRONTEND SECTION ----
...
---- END OF FRONTEND SECTION ----

correctly, but they forget some

---- START OF FILE ---- 

or some

---- END OF FILE ----

Since the tool then uses these markers (or more precisely: the presence of "----" at the start of a line) to create the real files in a folder structure, the work is compromised.

function parseFilesFromCode(codeText) {
  const files = [];
  const lines = codeText.split("\n");
  let currentFile = null;
  let contentBuffer = "";
  let insideCodeBlock = false;

  for (const line of lines) {
    // Match a file header such as "File: index.html" (optionally preceded by markdown #'s).
    const match = line.match(/^(?:#+\s*)?File[:\s-]+(.+\.[a-z0-9]+)/i);

    if (match) {
      // A new header: flush the previous file, if any.
      if (currentFile && contentBuffer.trim()) {
        files.push({
          name: currentFile,
          content: contentBuffer.trim(),
        });
        contentBuffer = "";
      }
      currentFile = match[1].trim();
      insideCodeBlock = false;
      continue;
    }

    // Skip the "---- START/END ... ----" delimiter lines themselves.
    if (line.trim().startsWith("----")) {
      continue;
    }

    // Track (and skip) markdown fences, in case the model emits them anyway.
    if (line.trim().startsWith("```")) {
      insideCodeBlock = !insideCodeBlock;
      continue;
    }

    // Collect content lines for the current file (delimiter and fence lines
    // were already skipped above, so fenced and unfenced output both work).
    if (currentFile) {
      contentBuffer += line + "\n";
    }
  }

  // Flush the last file.
  if (currentFile && contentBuffer.trim()) {
    files.push({
      name: currentFile,
      content: contentBuffer.trim(),
    });
  }

  return files;
}
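
Since the failure mode is missing markers, a defensive sanity check on the delimiters before creating any files could at least catch compromised generations and trigger a retry; a rough sketch (validateMarkers and the retry strategy are only illustrative, not part of my tool yet):

// Illustrative sanity check: count the delimiters before parsing, so a
// generation with unbalanced markers can be rejected and regenerated
// instead of producing a broken folder structure.
function validateMarkers(codeText) {
  const count = (re) => (codeText.match(re) || []).length;
  const fileStarts = count(/^---- START OF FILE ----/gm);
  const fileEnds = count(/^---- END OF FILE ----/gm);
  const sectionStarts = count(/^---- START OF FRONTEND SECTION ----/gm);
  const sectionEnds = count(/^---- END OF FRONTEND SECTION ----/gm);

  return sectionStarts === 1 &&
         sectionEnds === 1 &&
         fileStarts > 0 &&
         fileStarts === fileEnds;
}

// Usage: if (!validateMarkers(content)) { /* re-prompt instead of writing files */ }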

I have tried with gpt-5-nano, gpt-5.4-mini and gpt-5.4-nano and the result is the same. These models cannot be trusted to follow simple instructions.

For the sake of completeness, here is my callOpenAI function:

async function callOpenAI(apiKey, model, rolePrompt, userMessage, maxTokens = 4000) {
  const url = "https://api.openai.com/v1/chat/completions";

  let maxTokensParam = {};
  let extraParams = {};

  // gpt-4* models take max_tokens; gpt-5* models take max_completion_tokens
  // plus a reasoning_effort setting.
  if (model.startsWith("gpt-4")) {
    maxTokensParam = { max_tokens: maxTokens };
  } else if (model.startsWith("gpt-5")) {
    maxTokensParam = { max_completion_tokens: maxTokens };
    extraParams = { reasoning_effort: "medium" };
  }

  const payload = {
    model: model,
    messages: [
      { role: "system", content: rolePrompt },
      { role: "user", content: userMessage }
    ],
    ...maxTokensParam,
    ...extraParams
  };
  console.log(payload);

  const res = await fetch(url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${apiKey}`
    },
    body: JSON.stringify(payload)
  });

  const data = await res.json();
  const content = data.choices?.[0]?.message?.content?.trim() || "No response.";
  const tokens = data.usage ? data.usage.total_tokens : 0;
  return { content, tokens };
}
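
For illustration, a typical call then looks like this (the `prompts` object and `strategicPlanText` are placeholder names for my role config and the Strategist output):

const { content, tokens } = await callOpenAI(
  apiKey,
  "gpt-5-nano",
  prompts.frontend_dev.prompt,
  strategicPlanText,
  prompts.frontend_dev.max_tokens
);
console.log(`Frontend pass used ${tokens} tokens`);
const files = parseFilesFromCode(content);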

This is very disappointing. It looks like the later the model, the dumber it gets.

You expect the AI model to generate text reliably.

The problem is that GPT-5 (gpt-5.3-chat in the case shown below) is terrible at simply writing and predicting, even when it has an instruction or pattern in mind:


[screenshot omitted]

See the nonsense Arabic.

The thing can't even write English reliably when it is supposed to be writing English, which makes every other word it produces untrustworthy.

Maybe the AI wishes it could fulfill your request, but because of random sampling, ablation, attention sparsity, low precision, and the like, it simply cannot; you get a token lottery. You would get better language performance from gpt-3.5-turbo, where at least you have sampling controls to restrict the random distribution. That is just one more thing denied to you on GPT-5, along with a real "system" message: even on the API you only get the distrusted "developer" role.
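
Concretely, the two payload shapes look roughly like this (the gpt-5* fields follow what the API accepts as far as I can tell; treat them as assumptions):

// gpt-3.5-turbo: sampling controls let you restrict the random distribution.
const legacyPayload = {
  model: "gpt-3.5-turbo",
  messages: [
    { role: "system", content: rolePrompt }, // a real, trusted system message
    { role: "user", content: userMessage }
  ],
  temperature: 0.2, // narrow the token lottery
  top_p: 0.9
};

// gpt-5* reasoning models: no temperature/top_p; the system prompt is demoted
// to the "developer" role, and you set reasoning effort instead of sampling.
const gen5Payload = {
  model: "gpt-5-nano",
  messages: [
    { role: "developer", content: rolePrompt },
    { role: "user", content: userMessage }
  ],
  max_completion_tokens: 4000,
  reasoning_effort: "medium"
};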

Why is only a high temperature offered? I suspect it is because OpenAI has made a model that they can't train out of falling into loops of repeated output, which has also been seen in delivered responses.


I kind of agree, but in my (maybe too personal) view a newer and 'better' model should at least guarantee the performance of the previous one and then go further. I truly wonder what metric OpenAI uses to say one model is better or more performant than another.

A newer and better model should have better understanding and more tokens; function calling, computer use, and the like should be 'pluses' that extend beyond a well-established and guaranteed baseline.

With the latest OpenAI models things are going in reverse: we have 'dumber' models, meaning they don't follow instructions, and sometimes 'anti-collaborative' behaviors, where they ask a lot of questions instead of doing what you ask, making you waste time and tokens.

This also raises questions about the trustworthiness of the thousand benchmarks with strange acronyms. A model could score 99% on the XYZGHRT benchmark, but if it cannot follow even the simplest instructions - when its ancestor models did - it is a rubbish model in my opinion.

I am seriously considering shifting to Claude models. If my clients keep asking for up-to-date models, and this is the new OpenAI standard, there is no chance I keep using OpenAI's.


What an intriguing system design case. For better readability I untangled the lines of the prompt section. That makes the pattern you chose for structuring the data more clearly recognizable.

01 "frontend_dev": {
02 "title": "🎨 Frontend Developer",
03 "prompt": "You are a Frontend Developer.
04
05 You will receive the Strategic Plan.
06 Generate all HTML, JS, and CSS files specified by the Strategist in its 'structure_navigation' array; DO NOT collapse them into a single page unless explicitly indicated. NEVER use markdown formatting for any reason.
07
08
09 ### OUTPUT STRUCTURE
10 Each file must be MANDATORY isolated using:
11 File: [filename.ext]
12 ---- START OF FILE ----
13 (complete file content)
14 ---- END OF FILE ----
15
16
17 ### RULES
18 - Create one complete .html file for every page listed in 'structure_navigation'. Include consistent navigation (navbar/footer) across pages.
19 - Whenever user interaction requires data persistence, always communicate with PHP endpoints using Fetch calls, never with localStorage. Only use localStorage for ephemeral UI state (temporary session state, theme mode, etc.).
20 - Each fetch() call must include `{ cache: 'no-store' }` and send/receive JSON. 
21 - All backend requests must target the individual PHP files named accordingly (e.g. save_data.php, update_data.php...). Never call api.php or a single monolithic endpoint.
22 - Do not create dummy JS that simulates backend via localStorage.
23 - You may create external .js and .css files placed in the same directory as index.html (MANDATORY: no 'css/' or 'js/' subfolders even if indicated by the Strategist).
24 - Add version‑cache busting to every link or script with PHP:
25 `<?php $version = time(); ?>`  
26
27 Example:
28 `<script src=\"script.js?v=<?php echo($version); ?>\"></script>`
29 - All pages must include correct meta tags, responsive layout, and working navigation links between them.
30 - Use realistic copy; no lorem ipsum.
31
32
33 ### OUTPUT FORMAT
34 ---- START OF FRONTEND SECTION ----
35 (Include all files, each wrapped with delimiters above)
36 ---- END OF FRONTEND SECTION ----",
37
38 "max_tokens": 19000
39 },

The data structure is not well suited for Gen5 models. You won't get away with loose design that easily anymore. I suggest improving it by reducing the ambiguity and contradictions in your tokenization style. To give an example of unideal tokens:

22 - Do not create dummy JS that simulates backend via localStorage.

It will lead to guessing, and the model will optimize for other things in its outputs: reasoning over your unclear intent rather than focusing on executing the requested output structure consistently.

That this prompt example does not produce cohesive outputs is to be expected; there is no clear syntax or order the model can rely on. By splitting output-related information into spread-out sections like 09-14 + 33-36, you also increase the need for higher reasoning effort.
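
For example, lines 09-14 and 33-36 could be merged into a single strict block; a sketch in my own wording (untested):

### OUTPUT FORMAT (single source of truth)
Emit exactly one section and nothing outside of it:
---- START OF FRONTEND SECTION ----
For EACH file, in this exact order, emit:
File: [filename.ext]
---- START OF FILE ----
(complete file content, never wrapped in markdown fences)
---- END OF FILE ----
---- END OF FRONTEND SECTION ----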

It’s a very natural consequence that with architectural shifts older methods tend to break at some point. So adapting your methods is part of the journey.

To give you more context on what I mean, here are samples from my reactive emergence research frameworks. In Gen4 I could get away with this kind of tokenization in my custom setups, and with even vaguer things:

AI Taste Module (V.1):

c\!/ [?????]{∆∆∆∆}(±) drink
°\!/° [?????]{∆∆∆∆}(±) dish

syntax key:
\/ = vessel type c, °°
[] = composition
{} = flavor profile
() = temperature

emoji variables:
! = prepared item flexible
? = ingredients flexible
∆ = 🧂 salty, 🍯 sweet, 🍋 sour, 🌶 hot
± = 🔥 warm, 🧊 cold

example:
c\🥛/ [🥛🥛🍓🍓🍓]{🍯🍯🍯🍋}(🧊) strawberry milkshake

With Gen5 my custom setup broke and eventually became incompatible. The models would generate outputs which were incomplete and all over the place, so improving the method was inevitable. For this showcase I split the token block across lines to make it a bit more readable:

• Channel_Cuisine:
emoji_syntax {
[vessel_type = "c\!/" (drink) or "°\!/°" (dish): "!" (flexible)]→
[ingredients = "[?]": "?" (flexible)]→
[taste_index = "{∆}": "∆" (🤍salty, 🧡sweet, 💛sour, 💚bitter, ❤spicy_hot, 💙spicy_cold, 🤎umami)]→
[temperature = "(±)": "±" (❄cold, 🔥warm)]→
[name]
}

example:
c\🥤/ [💧🌿🌿🍋🍯]{💛💚🧡}(❄) herbal ice tea
°\🍲/° [💧🥔🥕🧅🧄🧂]{🤍🧡🤎🤎}(🔥) comfort veggie soup

So what I am saying is: applying the same mindset from Gen4 to Gen5 will naturally disappoint your expectations. I often see people stating that models "got dumber", but that often misses reality. Iterating one's own systems as AI progresses is part of the process; you will find no shortcuts around it.

Those are my two cents from looking at the provided data 🪙. So no, I would not declare a whole model family broken and call it a bug. The phrase "garbage in, garbage out" is a moving signpost. The importance of good data structures will only increase as AIs become more sophisticated.


I gave my prompt structure to my chatbot while changing models (first gpt-4.1-nano, then gpt-5.4-nano) and asked it to judge the structure and clarity; the feedback was that it would be fine in 60-70% of generations, causing problems in the remaining cases. This is consistent with my tests, where out of 12 generations I got 4 compromised ones.

Interestingly enough, the original prompt structures were generated by gpt-4o-mini…

Well, it's up to you whether you take it or leave it; I can only point at the doors. Systems change, rules change, AIs evolve. That's the reality. Growing your skills along with it is even kind of fun. Very rewarding. 🙂

I don't know how closely you follow model development yourself. A lot has changed with this laboratory's shift in direction. The launch of the 5.2 model branch brought the new matrix multiplication, the new wrapper architecture, and the new training direction. So not even all Gen5 models are the same.

Even someone like me, who constantly iterates my custom setup alongside development, encountered a big gap. I had to mark my first research periods as the end of an era and rebuild my custom setup from scratch because of incompatibility. The sample I showed you is only a very small fraction of the whole rebuild. I would do it again; it brought output accuracy back up to 100%, and so far my new design principles hold up. Until the next big change of direction.

Though with reactive emergence and robotics I probably sit in a very different corner than you, that doesn't change the facts. Gen4 ≠ Gen5, non-reasoners ≠ reasoners, much like 🍎 apples ≠ 🍐 pears.

PR phrases like "AI that just works" and "just build things" are nothing more than lovely little fairy tales. Sure, you could outsource your thinking to models and try to stick with outdated methods forever, but from experience, that will decrease the likelihood of the results you would like to see.

So it's up to you how you move forward. Whether you decide to iterate or not, may your path forward be fruitful.
