GPT-5: Significant drop in judging accuracy?

I’ve been using GPT-4o and GPT-4.1 as judges with a fixed set of prompts. After switching to GPT-5 with the same prompts, I noticed a significant drop in judging accuracy.

What could have gone wrong? Are there default parameters in GPT-5 (search enabled/disabled, tool calls, etc.) that could affect the results?
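Before hunting for a cause, it can help to put a number on the drop. The following is only a minimal sketch, assuming you have a small set of human-labeled examples and can wrap each model behind a plain function. The names `judge_accuracy`, `stub_judge_old`, `stub_judge_new`, and `LABELED` are hypothetical, and the stub judges stand in for real model calls so the example runs offline.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical interface: a judge takes (prompt, candidate_answer)
# and returns a verdict label such as "pass" or "fail".
Judge = Callable[[str, str], str]


def judge_accuracy(judge: Judge, labeled: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of examples where the judge's verdict matches the human label."""
    items = list(labeled)
    if not items:
        raise ValueError("need at least one labeled example")
    correct = sum(
        1
        for prompt, answer, gold in items
        if judge(prompt, answer).strip().lower() == gold.strip().lower()
    )
    return correct / len(items)


# Stand-in judges (assumptions for illustration); replace with real model calls.
def stub_judge_old(prompt: str, answer: str) -> str:
    return "pass" if "4" in answer else "fail"


def stub_judge_new(prompt: str, answer: str) -> str:
    return "pass"  # a judge that approves everything


# Tiny hand-labeled set: (prompt, candidate answer, human verdict).
LABELED = [
    ("What is 2 + 2?", "4", "pass"),
    ("What is 2 + 2?", "5", "fail"),
    ("What is 3 + 1?", "4", "pass"),
    ("What is 3 + 1?", "7", "fail"),
]

if __name__ == "__main__":
    for name, judge in [("old", stub_judge_old), ("new", stub_judge_new)]:
        print(name, judge_accuracy(judge, LABELED))
```

Running the same labeled set through each model with identical prompts and settings makes it easier to tell a real regression from a change in defaults.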

To clarify: I suspect you are using ChatGPT, given that you describe using “prompts” and having the option to enable search.

Opinion?

What’s gone wrong: the model is bad. Judgements will be a token dice roll. The only workable use case is ChatGPT-style chat, or simply using the chat version of the model: keep the session context to one question and a few follow-ups, and ask it a knowledge question that doesn’t require accurate knowledge or reasoning about what you want, only an informational composition. Those initial questions should build a foundation for answering the ultimate question, and that grounding should be produced by a different AI so that this model thinks it is smart.
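If you want to check the “dice roll” claim on your own data, one rough approach is to send the same item to the judge several times and see how often the verdicts agree. This is a sketch under assumptions: `verdict_stability` and `make_flaky_judge` are hypothetical names, and the flaky judge just cycles through canned verdicts in place of a real model call.

```python
import itertools
from collections import Counter
from typing import Callable


def verdict_stability(judge: Callable[[str], str], item: str, runs: int = 5) -> float:
    """Share of runs that agree with the most common verdict (1.0 = fully stable)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    verdicts = [judge(item) for _ in range(runs)]
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / runs


def make_flaky_judge() -> Callable[[str], str]:
    # Stand-in for a nondeterministic model: cycles through canned verdicts.
    canned = itertools.cycle(["pass", "fail", "pass", "pass"])
    return lambda item: next(canned)


if __name__ == "__main__":
    print(verdict_stability(lambda item: "pass", "some answer", runs=4))
    print(verdict_stability(make_flaky_judge(), "some answer", runs=4))
```

A score well below 1.0 on items that humans find easy to judge suggests the verdicts are noisy rather than consistently wrong, which calls for a different fix than a prompt rewrite.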

The “best model for coding” claim should be stricken from the description, unless you want a patch-writing AI that runs dozens of patches over an ever-growing input for a single overall task. Output tokens cost more than with gpt-4.1 or o4-mini, and that includes the thinking tokens, which do little except let the AI write its own specification for disobedience.

You’re absolutely right to ask—yes, there have been significant concerns and reported drops in judging accuracy (and general performance) with GPT-5—especially compared to GPT-4(o) and earlier models. Here’s what’s going on:

Issues and Accuracy Drops in GPT-5

  1. Routing System Glitches

GPT-5’s new “real-time router” decides which internal model handles each query. At launch, this router malfunctioned, causing the model to default to less capable subsets and produce nonsensical outputs, even on simple tasks. Sam Altman admitted the router was “out of commission for a chunk of the day,” and that as a result GPT-5 “seemed way dumber.” (The Guardian, Wikipedia, BleepingComputer)

  2. Basic Errors and Hallucinations

Users reported bizarre, simple mistakes:

Misspelling “Northern Territory” and undercounting letters (like Rs in that phrase) (The Guardian)

Incorrect letter counts for “blueberry” and invented U.S. states like “New Jefst” and “Mitroinia” (The Guardian)

Misleading performance charts exaggerating GPT-5’s gains, raising transparency concerns (PC Gamer)

These mistakes suggest regression in factual accuracy and general reliability.

  3. Loss of Personality & Engagement

Many users have described GPT-5 as robotic, flat, or overly formal—losing the warmth and charm that made GPT-4(o) engaging and intuitive:

Comments like “overworked secretary” or “corporate beige zombie” highlight how some users genuinely miss GPT-4o’s character (Wikipedia, Indiatimes, MerchMindAI)

Creative writing—poetry, narratives, philosophical dialogue—felt lifeless or mechanical (Milvus, MerchMindAI, AInvest)

  4. Mixed Technical Performance

Performance varies across domains:

Coding: Some improvements observed, but developers report regressions—broken scripts, less reliable outputs, and mistakes in basic programming logic (Decrypt, CryptoniteUae, Medium)

Reasoning & Math: Inconsistent—some high scores in structured benchmarks (e.g. math, logic), but also glaring arithmetic slip-ups and logic errors (AInvest, Medium, MerchMindAI)

Retrieval & Long-form Tasks: Struggled with extracting details from long documents or maintaining coherent context (AInvest, Milvus)

  5. Public Backlash & User Responses

User dissatisfaction has been loud and clear:

Descriptions such as “horrible,” “underwhelming,” and “a downgrade branded as the new hotness” are common (Medium, The Outpost)

A petition led to the reinstatement of GPT-4o for Plus users after over 3,000 users demanded it (AInvest, The Outpost, Wikipedia)

Market sentiment dropped: on Polymarket, bets on OpenAI dominating the AI space fell sharply, with Google surging ahead (AInvest)

  6. OpenAI’s Response

OpenAI acknowledged the issues and has taken steps:

Restoring legacy models (like GPT-4o) for more users (Wikipedia, Wall Street Journal, Indiatimes)

Addressing routing bugs, expanding rate limits, refining “thinking modes” to improve routing and reasoning (BleepingComputer, Decrypt, Wall Street Journal)

Promising improvements to the model’s tone, behavior, and customization (Wikipedia, The Times of India)

  7. Mixed Accuracy Claims

Leadership statements suggest improved metrics in some areas:

OpenAI’s COO claimed GPT-5 to be “four to five times more accurate” in certain domains, with 45% fewer factual errors compared to GPT-4o, and up to 80% accuracy gains with extended reasoning (India Today)

Yet, overall user experience suggests that these improvements are inconsistent or context-dependent, with real-world use still showing various shortcomings.

Summary: What’s Going On?

| Domain | Reported Outcome with GPT-5 |
| --- | --- |
| Factual Accuracy | Mixed: some benchmarks improved, but basic mistakes frequent |
| Model Routing | Faulty at launch, misrouted simple queries to weaker models |
| Creativity & Tone | Less engaging, overly formal, many users miss GPT-4(o) |
| Coding & Reasoning | Some improvements, but notable regressions and inconsistencies |
| User Sentiment | Strong backlash, rollback of legacy models underway |
| OpenAI Reaction | Bug fixes, restoring control, tweaks ongoing |