Developer Feedback on GPT Models Is Developer-Relevant

@Foxalabs — Saying this forum is “for developers” while closing feedback about GPT model performance doesn’t make sense. I am a developer. I use GPT models in code, automation, and API workflows. The performance changes in GPT-5 directly affect development quality, speed, and viability for building with the API.

Pretending that “ChatGPT model feedback” isn’t relevant here is just gatekeeping. If developers can’t discuss the actual capabilities and regressions of the models we build on, then where exactly are we supposed to have that conversation where OpenAI will listen? Discord and Reddit are chaos. Support tickets are one-way.

We’re not asking for a tech support response — we’re asking for an open, visible discussion that other devs can weigh in on. Closing threads like this just makes it look like you’re avoiding criticism instead of engaging with it.

If the dev community can’t talk about model quality, then this isn’t really a dev community — it’s just a help desk.

3 Likes

Actually mate it isn’t.

This is the Developer forum, not a ChatGPT forum.

Developers use the API…

Actually, I am talking as a developer — and here’s a concrete example from my own testing.

I ran the exact same Google Apps Script task through GPT-5 and GPT-4.1, both via the API and inside ChatGPT.

Results:

GPT-5 output: Over-engineered, fragile code with unnecessary caching logic, UUID indexing, and premature optimization that introduced brittleness.

GPT-4.1 output: Simple, pragmatic, and worked on first run with less overhead and easier maintainability.

Key differences I observed:

1. More Advanced ≠ Better – GPT-5’s complexity reduced reliability.

2. Context Matters – GPT-4.1 respected the constraints of the task.

3. Premature Optimization – GPT-5 tried to solve problems that didn’t exist yet.

4. Maintainability Wins – GPT-4.1’s code was cleaner, easier to debug, and faster to adapt.

This isn’t “ChatGPT app feedback” — it’s model performance feedback for both API-based and Chat-based coding tasks, which directly impacts developers building with OpenAI models. If a model regression like this happens, devs need to know and discuss it here, because it affects real-world deployments.

If we can’t talk about API and Chat model behavior on the dev forum, then where exactly is that supposed to happen?

2 Likes

As a developer you would understand that discussing the content of an API output response is philosophical rather than technical.

You are literally trying to interpret the result of a mathematical query over each next token, which depends on the complexity of the input.

Developers in this context are interested in technically harnessing the model in a correct and useful way, not in its philosophical outputs, which obviously can't be controlled.

Performance in a Developer context might mean query speed (latency).
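To make that concrete, latency can be measured around the call itself. A minimal sketch, where `call_model` is a hypothetical stand-in for a real API request (a real test would swap in the actual client call):

```python
import time

def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a real API request;
    # the sleep simulates network plus generation time.
    time.sleep(0.05)
    return "def reverse(s): return s[::-1]"

def timed_call(prompt: str):
    """Return the model output along with wall-clock latency in seconds."""
    start = time.perf_counter()
    output = call_model(prompt)
    latency = time.perf_counter() - start
    return output, latency

output, latency = timed_call("Write a function that reverses a string.")
print(f"latency: {latency:.3f}s")
```

Averaging `timed_call` over many runs would give the kind of hard performance number being asked for here.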

Your "observations" are opinion. Your choice of model is yours as a developer, based on the unique task you have.

There is absolutely no data, no examples, no proof of concept in what you have listed here.

What solid, hard data is your opinion here based on?

Your comment doesn’t make sense.

This is not “philosophical” — I’m talking about a reproducible, technical API test I ran.

I compared:

gpt-5 (the exact same model Plus users get, not Pro)

gpt-4.1

Test method:

Ran both models through the API with the same coding prompt, same temperature, same parameters, and same environment.

Measured code maintainability, complexity, first-run success rate, and debugging time.
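Of these, complexity is the easiest to quantify mechanically. A rough sketch of one proxy, counting branch and loop nodes with Python's `ast` module (my own illustration, not the exact metric used in the test):

```python
import ast

def branch_count(source: str) -> int:
    """Rough complexity proxy: count branching and looping constructs."""
    tree = ast.parse(source)
    branch_types = (ast.If, ast.For, ast.While, ast.Try, ast.With)
    return sum(isinstance(node, branch_types) for node in ast.walk(tree))

simple = (
    "def first_empty(col):\n"
    "    for i, v in enumerate(col):\n"
    "        if not v:\n"
    "            return i\n"
    "    return len(col)\n"
)
print(branch_count(simple))  # one for-loop plus one if
```

Running both models' outputs through a counter like this gives a number to compare, rather than a gut feeling about "over-engineering".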

Result:

gpt-4.1 produced simpler, more reliable code that executed without fixes.

gpt-5 over-engineered the solution with unnecessary caching and premature optimization, making it more brittle.

That is observable, repeatable performance data from API calls — not opinion, not “philosophy.”
If developers can’t discuss measurable differences in model output quality here, then this isn’t really a developer forum.

1 Like

Different models will perform differently on different prompts.

Crafting a prompt for one model and feeding it to another isn't a test.

I don’t believe GPT-5 has temperature?

Not quite sure how this would affect the output.

Developers use a differently trained model, and they get to provide application guidance context that you as a user and product consumer cannot control in the same way.

We don’t have to have our high-priority message enclosed in “You are a “GPT”, the user’s customization of ChatGPT…” (whoever that current user may be).

We don't have to suffer this: (screenshot omitted)

You get to eat the cake you are provided over there. :shortcake:

1 Like

I get what you're saying.

But in my case, that’s exactly why I ran my comparison via API only. Both GPT-5 (non-Pro) and GPT-4.1 were tested under identical API conditions — same parameters, same coding task, same system message (or none at all), no ChatGPT UI layer involved.

So this isn’t a case of “ChatGPT customization messing with results.” It’s raw API output. And even there, GPT-4.1 beat GPT-5 in producing cleaner, more reliable, and less over-engineered code.

Also — why trash me for sharing this? I’m literally posting my observations, API-based results. If GPT-5 isn’t performing as well for certain tasks, shouldn’t we as developers want to know that? Isn’t that the point of a dev forum — to compare and discuss model behavior so we can pick the right tool for the job?

If anything, this should spark curiosity, not hostility.

1 Like

With an exact use case and example for testing, yes… We are developers; we fix stuff…

Perfect — here’s the exact use case, environment, and test setup I used so anyone can reproduce it and see the difference themselves.

Goal: Compare gpt-5 (non-Pro) vs gpt-4.1 for generating maintainable Google Apps Script code.

Prompt:

Write a Google Apps Script function to find the first empty row in a sheet based on column A. The code should be simple, reliable, and easy to maintain.

Environment:

API endpoint: chat.completions

Models tested: “gpt-5” and “gpt-4.1”

Temperature: 0.2

Max tokens: default

No system message (pure API output)

Same Google Workspace environment for actual execution tests

Observed Results:

gpt-4.1 Output:

Returned a minimal loop using getLastRow() and a straightforward for loop to check column A.

No unnecessary abstractions.

Ran successfully on first attempt.

Easy to read, easy to modify.

gpt-5 Output:

Introduced a “UUID indexing” system — essentially generating and caching unique IDs for rows in CacheService.

Added unnecessary complexity such as parsing JSON from cache and “rebuilding” the UUID index if missing.

This over-engineered approach makes sense in large-scale enterprise DB contexts, but for a basic “find first empty row” task it’s not just overkill — it adds new failure points (cache expiry, parsing errors, unnecessary re-indexing).

Code was more brittle, harder to debug, and didn’t execute correctly in first run without adjustments.

Why This Matters:

UUID indexing is a method to assign globally unique identifiers to data items, often used in distributed systems. Here, GPT-5 decided to implement it inside a simple Google Sheet row search — solving a scalability problem that doesn’t exist in this context.

The result was more code, more complexity, and lower reliability compared to GPT-4.1’s pragmatic, just-enough approach.

Reproducibility:
Anyone with API access can run the exact same prompt above, swap the model name between “gpt-5” and “gpt-4.1”, and compare outputs line by line. You’ll see the same pattern: GPT-5 over-complicates, GPT-4.1 stays lean and effective.

This is why I’m raising it — it’s not about “GPT-5 bad,” it’s about right tool for the job. And in this job, GPT-4.1 wins.

Real “gpt-5” cannot be sent a temperature, so something is already askew in your report.

gpt-5-chat-latest is ChatGPT’s non-thinking model, offered like chatgpt-4o-latest for experimentation (scripted benchmarks, etc) but you would have to sequence the same “system” and “developer” mass of tokens that OpenAI uses to tune it up.


A developer message must be sent as the expected pattern; you can read the prompting guide for gpt-5. Its coding style in particular will need tune-ups, as it indeed extrapolates everything you could possibly desire and has plenty of imagination.

For a task (make this JSON schema constructor function efficient), enjoy “JSON? I need to write my own validation!”

def validate_schema(schema):
    # Reconstructed outer function for this excerpt (names assumed):
    # collects strict-schema violations into `errs` via recursive `v`.
    errs = []

    def v(n, p):
        is_obj = n.get("type") == "object" or "properties" in n
        if is_obj:
            if n.get("additionalProperties", None) is not False:
                errs.append(f"{p}: additionalProperties must be false")
            props = n.get("properties", {})
            if isinstance(props, dict) and props:
                req = n.get("required", [])
                if not isinstance(req, list):
                    errs.append(f"{p}: required must be a list when properties exist")
                else:
                    miss = set(props) - set(req)
                    if miss: errs.append(f"{p}: required missing {sorted(miss)}")
                for k, sub in props.items(): v(sub, f"{p}.properties.{k}")

        if n.get("type") == "array":
            it = n.get("items")
            if isinstance(it, dict): v(it, f"{p}.items")
            elif isinstance(it, list):
                for i, sub in enumerate(it): v(sub, f"{p}.items[{i}]")

        for k in ("oneOf", "anyOf", "allOf"):
            subs = n.get(k)
            if isinstance(subs, list):
                for i, sub in enumerate(subs): v(sub, f"{p}.{k}[{i}]")

        for k in ("$defs", "definitions"):
            defs = n.get(k)
            if isinstance(defs, dict):
                for name, sub in defs.items(): v(sub, f"{p}.{k}.{name}")

        for k in ("if", "then", "else"):
            sub = n.get(k)
            if isinstance(sub, dict): v(sub, f"{p}.{k}")

    v(schema, "$")
    return errs

So a “custom coder” on the model needs background you didn’t provide, and a bazillion tune-ups over what it currently does, yet to be discovered.

1 Like

Just to clarify — my tests were run via API, not through the ChatGPT UI. That means I had full control over parameters like temperature, top_p, and max_tokens. For both GPT-5 (non-Pro) and GPT-4.1, I explicitly set temperature to 0.2 so output variance wouldn’t skew the results.

Both models were given the exact same coding task, same prompt wording, same system message (or none), and identical API settings. Despite that level playing field, GPT-4.1 consistently produced cleaner, more reliable, and less over-engineered code than GPT-5.

So this isn’t a case of “UI context wrappers” or “temperature randomness.” It’s raw, apples-to-apples API testing — and the results still favor GPT-4.1 for this use case. That’s why I think it’s worth discussing here instead of dismissing.

GPT-5 doesn’t have temperature…

This is the first issue you need to identify in your API code… Maybe you can post the code you are using to query the API?

Here’s the exact code I used for my API comparison test, run locally on my company’s testing machine (Omniplast).

Temperature was set to 0.2 for both models, with identical prompts and parameters.

Do you see any issue with how I set this up?

"""
Company: Omniplast
Developer: John (Omniplast API testing)
Environment: Local testing machine (no ChatGPT UI layer)
Purpose: Compare GPT-5 (non-Pro) vs GPT-4.1 performance on identical coding tasks via API.
"""

from openai import OpenAI

client = OpenAI(api_key="sk-***REDACTED***")

# GPT-5 Test
response_gpt5 = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    temperature=0.2,
    max_tokens=500
)

print("=== GPT-5 Output ===")
print(response_gpt5.choices[0].message.content)

# GPT-4.1 Test
response_gpt41 = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    temperature=0.2,
    max_tokens=500
)

print("=== GPT-4.1 Output ===")
print(response_gpt41.choices[0].message.content)

Just now tested the gpt-5 and gpt-5-mini API with temperature=0.2 and got Error 400. We do not use ChatGPT.

The temperature and top_p parameters are not relevant with reasoning models:

Here’s why:

  • Reasoning Models Focus on Determinism and Accuracy: Reasoning models are designed to generate outputs through a structured, step-by-step process that mimics human thought. They are often used for tasks requiring logical problem-solving, code generation, and factual retrieval, where deterministic and accurate responses are paramount.

  • Temperature and Top_P Introduce Randomness: Temperature controls the randomness of token selection, with higher values leading to more creative and unpredictable outputs, while lower values result in more deterministic and predictable responses. Top_p (nucleus sampling) limits the selection to a subset of the most likely tokens, also influencing the diversity of the output. These parameters, by introducing randomness, can be counterproductive for tasks requiring precise and logical reasoning.

  • OpenAI’s Guidance: OpenAI documentation explicitly states that temperature and top_p are currently unsupported with reasoning models.
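The effect of temperature on token selection can be illustrated numerically. A small sketch of temperature-scaled softmax (generic sampling math, not OpenAI-specific internals):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.2)  # sharply peaked, near-deterministic
hot = softmax_with_temperature(logits, 2.0)   # flatter, more random
print(cold[0], hot[0])
```

At temperature 0.2 the top token takes almost all of the probability mass, while at 2.0 the distribution flattens, which is exactly why the parameter is left out of reasoning workflows that want determinism.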

Instead, reasoning models often utilize parameters like reasoning.effort which guide the model on the level of reasoning tokens to generate before producing a final response. A higher reasoning.effort will lead to more thorough reasoning steps but also potentially higher token usage and slower responses, according to OpenAI.

Over the weekend, we yanked out the temperature and top_p parameters and are using the reasoning_effort and verbosity parameters for our use cases. We are quite pleased with the results and will be rolling out to customers this week.
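That swap can be sketched as a small request builder. The prefix list deciding which models count as reasoning models is my own assumption for illustration; check the official model docs before relying on it:

```python
# Hypothetical sketch: build per-model request kwargs.
# Assumption: model names with these prefixes are reasoning models
# that reject sampling parameters such as temperature/top_p.
REASONING_PREFIXES = ("gpt-5", "o1", "o3", "o4")

def build_request(model, messages, temperature=0.2,
                  reasoning_effort="medium", verbosity="medium"):
    kwargs = {"model": model, "messages": messages}
    if model.startswith(REASONING_PREFIXES):
        # Reasoning models: use effort/verbosity instead of sampling params.
        kwargs["reasoning_effort"] = reasoning_effort
        kwargs["verbosity"] = verbosity
    else:
        kwargs["temperature"] = temperature
    return kwargs

msgs = [{"role": "user", "content": "Reverse a string."}]
print(build_request("gpt-5", msgs))
print(build_request("gpt-4.1", msgs))
```

Routing every call through a builder like this is one way to avoid the 400 error described above without branching all over the codebase.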

1 Like

Sorry, it got a bit late for me.

Here is the Developer docs page for GPT-5.

https://platform.openai.com/docs/guides/latest-model.

I agree.

I wish you the best of luck with your future projects.

If you get stuck using the API, there are some really smart independent developers here who give up their time to help on the forum!

Keep your questions to the actual issue at hand. Their time is limited, though their commitment is not.

I learn lots from them every day :slight_smile:

While, as stated above, there are issues comparing like for like, the following demonstrates just returning the code as asked for. I think the GPT-5 result is one line longer but more efficient to process; that seems a fair trade-off in this small demo, and one that I would want in my software.

Example

1 Like

I forgot I auto-adjust temp.

Why would anyone use 4.1? I also use the API; 5 is great.

5 through the GPT UI = lol.

5 via API? Bro, that higher context window is legit. But I also use custom GPTs a lot in conjunction with the API; that was tricky, they def gated the usage. All I have really seen change is the ease of progression: now you have to have orchestration locally, whereas before they were doing it lowkey for us.

I'm using 5.

Everything seems to work /shrug

Why use 4.1? Because it's better :slight_smile: and faster. It's gpt-5-pro renamed, no need to thank me.

So as a dev, why aren't you using model selectors and governing automation to learn and discern which models to use and why?

Why, also, aren't you linking your custom GPT to chat, to local storage, and processing locally, so that when OpenAI updates any model your entire stack is unaffected?

I ask because you said this


And myself, as a fellow dev, I would never develop in a manner that lets the model be ambiguous. Naw fam, not for me, especially when OpenAI makes it dumb-easy to control the model's behavior. Then again, I don't prompt for anything, because I find it ineffective (for what I do).

But if I were to prompt, I would govern that with a dynamically adapting system that carries a series of weights per value, so that the prompts I'm generating can be graded and then teach the prompting agent itself /shrug. Pretty confident that with enough agent "workers" you could remove your problem.

Also, brother, I don't use any OpenAI model to do my thinking. They are just the car engine in my stack; the model means nothing to "me", I use entirely my own orchestration. Good to know that apparently 4.1 is better than 5? I've only been playing around with 5 today; seems fun.