I am posting this because my recent support experience exposed a failure mode that should concern anyone relying on OpenAI’s paid products for technical work: when something goes wrong, “human review” can look indistinguishable from another automated support loop.
The issue was not vague. I reported two specific, reproducible problems.
First, ChatGPT Web allowed me to explicitly select GPT-5.5 / Latest, but responses and retries sometimes showed that GPT-5.4 Thinking had been used instead. This matters because model selection is not cosmetic. When a user selects a specific model for a difficult task, especially after learning that a lower model is inadequate for that task, silent routing or misleading labeling burns time, usage, and trust.
Second, the Codex CLI displayed my account as Free even though I am a Plus subscriber. The relevant usage limits in the CLI matched the Codex web dashboard, which strongly indicated that the CLI was authenticated to the same account and that the “Free” label was not merely a different-account problem.
I provided concrete evidence: platform, browser, reproduction timing, cross-browser reproduction, CLI version, and CLI status output. I also explained why the standard “maybe you are logged into the wrong account” explanation did not fit the facts.
The initial support interaction followed a predictable pattern. It floated automatic model switching, a session mismatch, browser issues, and an account mismatch as explanations, and suggested incognito testing. Some of these possibilities were reasonable as first-pass triage. But after I supplied counter-evidence, the support flow continued to ask for data already provided or to recommend tests already ruled out.
One support suggestion was also technically wrong: I was advised to run a CLI status command in a way that is not equivalent to launching the CLI and then entering /status. That bad instruction consumed usage. I therefore asked for compensation or at least a specialist review.
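For clarity, the sequence I am referring to is the interactive one, roughly as follows (the `codex` binary name is as it appears on my machine; this is an illustration of the workflow I described to support, not official documentation):

```text
$ codex        # launch the interactive CLI session
> /status      # then enter /status at the prompt to check plan and usage limits
```

The command I was told to run instead was not equivalent to this, and following it consumed usage.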
The support assistant then told me the issue had been escalated to a support specialist.
That is where the failure became more serious.
The “specialist” response did not engage with the evidence. It was a generic troubleshooting checklist:
- sign out and sign back in;
- try incognito/private browsing;
- clear app/session state;
- verify the model label in a new chat;
- log out and reauthenticate the CLI;
- update the CLI;
- check for account mismatch;
- minimize usage during testing;
- gather screenshots and CLI output.
The problem is that these were not new next steps. They were mostly a restatement of the already-exhausted first-line script. Some directly ignored the evidence already supplied. For example, cross-browser reproduction had already been reported. The CLI output had already been provided. The account-mismatch theory had already been addressed by the matching usage limits between the web dashboard and CLI.
So the “human review” did not pass the basic test of human review: reading the case history.
That is why I titled this post bluntly. I am not claiming to know whether the response was actually written by a human, an AI assistant, or a human clicking through an AI-generated macro. I am saying that from the user’s perspective, it failed the Turing test. It behaved like an automated system that had not incorporated the prior context.
This is especially frustrating because OpenAI’s products are now being used as serious development tools. When the product misreports the selected model or mislabels paid account status, the support path needs to preserve and reason over technical evidence. It cannot reset the conversation to “try incognito” after the user has already reproduced the problem in another browser.
The unresolved issues are:
- Model selection / routing transparency: ChatGPT Web displayed GPT-5.5 / Latest, while response metadata showed GPT-5.4 Thinking. Users need to know what model actually handled their request, and whether "Latest" is a hard selection, a routing label, or a best-effort alias.
- Silent fallback or misleading labeling: If fallback occurs, it should be disclosed. If no fallback occurred and the UI label is wrong, that is still a product bug. In both cases, the user should not be left guessing.
- Codex CLI account-plan display: The CLI showed the account as Free despite Plus subscription status and usage limits consistent with paid access. If this is only a display bug, it should be acknowledged as such. If it affects entitlements, that is more serious.
- Incorrect support guidance consuming usage: A support instruction caused avoidable usage consumption because the CLI command guidance was wrong. That deserves review, not another generic checklist.
- Escalation quality: "Escalated to a specialist" should mean the next reply reads the evidence, addresses the actual reproduction, and distinguishes between already-tried steps and new diagnostic steps.
The minimum useful response would have been something like:
We reviewed the reproduction. The issue appears to be either a model-picker labeling/routing bug or an undocumented behavior of the “Latest” selector. We have attached the conversation ID and timestamps to an internal report. We also reviewed the Codex CLI status output and agree that the “Free” label conflicts with the observed entitlement/usage data. We are filing that separately as a CLI display bug. We acknowledge that the prior CLI instruction was incorrect and will review whether usage compensation is appropriate.
That would have passed as human review.
Instead, I got a reset to tier-one troubleshooting.
I am asking OpenAI to improve this process. If support uses AI drafting internally, fine. But the result must still demonstrate case comprehension. A paid user should not have to repeatedly prove the same facts because every layer of support forgets or ignores the prior message.
For developer products in particular, support needs a higher standard:
- preserve reproduction details;
- read prior evidence before responding;
- distinguish "already tried" from "please try";
- acknowledge product ambiguity directly;
- avoid making users spend usage to prove support's own hypothesis;
- provide a clear bug-report path when UI labels and actual behavior diverge.
The issue here is not merely that something broke. Software breaks.
The issue is that the support escalation behaved as if no escalation had happened.