You’re right to ask. Yes, there have been significant concerns and reported drops in judging accuracy (and general performance) with GPT-5, especially compared with GPT-4(o) and earlier models. Here’s what’s going on:
Issues and Accuracy Drops in GPT-5
- Routing System Glitches
GPT-5’s new “real-time router” decides which internal model handles each query. At launch, the router malfunctioned and queries defaulted to less capable models, producing nonsensical outputs even on simple tasks. Sam Altman admitted the router was “out of commission for a chunk of the day,” which made GPT-5 seem “way dumber.” (The Guardian, Wikipedia, BleepingComputer)
- Basic Errors and Hallucinations
Users reported bizarre, simple mistakes:
Misspelling “Northern Territory” and undercounting letters (like Rs in that phrase) (The Guardian)
Incorrect letter counts for “blueberry” and invented U.S. states such as “New Jefst” and “Mitroinia” (The Guardian)
Misleading performance charts exaggerating GPT-5’s gains, raising transparency concerns (PC Gamer)
These mistakes suggest a regression in factual accuracy and general reliability.
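For reference, the letter-counting tasks mentioned above are trivial to check deterministically, which is part of why these slips drew so much attention. Here is a minimal Python sketch; the helper name `letter_count` is purely illustrative and does not come from any of the cited reports:

```python
from collections import Counter


def letter_count(text: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in text."""
    # Counter returns 0 for letters that never appear, so no KeyError here.
    return Counter(text.lower())[letter.lower()]


print(letter_count("Northern Territory", "r"))  # 5
print(letter_count("blueberry", "b"))           # 2
print(letter_count("blueberry", "r"))           # 2
```

A quick check like this is a handy way to verify any model's answer on character-level questions instead of taking the output at face value.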
- Loss of Personality & Engagement
Many users have described GPT-5 as robotic, flat, or overly formal—losing the warmth and charm that made GPT-4(o) engaging and intuitive:
Comments like “overworked secretary” or “corporate beige zombie” highlight how some users genuinely miss GPT-4o’s character (Wikipedia, Indiatimes, MerchMindAI)
Creative writing—poetry, narratives, philosophical dialogue—felt lifeless or mechanical (Milvus, MerchMindAI, AInvest)
- Mixed Technical Performance
Performance varies across domains:
Coding: Some improvements observed, but developers report regressions—broken scripts, less reliable outputs, and mistakes in basic programming logic (Decrypt, CryptoniteUae, Medium)
Reasoning & Math: Inconsistent—some high scores in structured benchmarks (e.g. math, logic), but also glaring arithmetic slip-ups and logic errors (AInvest, Medium, MerchMindAI)
Retrieval & Long-form Tasks: Struggled with extracting details from long documents or maintaining coherent context (AInvest, Milvus)
- Public Backlash & User Responses
User dissatisfaction has been loud and clear:
Descriptions such as “horrible,” “underwhelming,” and “a downgrade branded as the new hotness” are common (Medium, The Outpost)
A petition led to the reinstatement of GPT-4o for Plus users after over 3,000 users demanded it (AInvest, The Outpost, Wikipedia)
Market sentiment dropped: on Polymarket, bets on OpenAI dominating the AI space fell sharply, with Google surging ahead (AInvest)
- OpenAI’s Response
OpenAI acknowledged the issues and has taken steps:
Restoring legacy models (like GPT-4o) for more users (Wikipedia, Wall Street Journal, Indiatimes)
Addressing routing bugs, expanding rate limits, refining “thinking modes” to improve routing and reasoning (BleepingComputer, Decrypt, Wall Street Journal)
Promising improvements to the model’s tone, behavior, and customization (Wikipedia, The Times of India)
- Mixed Accuracy Claims
Leadership statements suggest improved metrics in some areas:
OpenAI’s COO claimed GPT-5 to be “four to five times more accurate” in certain domains, with 45% fewer factual errors compared to GPT-4o, and up to 80% accuracy gains with extended reasoning (India Today)
Yet, overall user experience suggests that these improvements are inconsistent or context-dependent, with real-world use still showing various shortcomings.
Summary: What’s Going On?
| Domain | Reported Outcome with GPT-5 |
|---|---|
| Factual Accuracy | Mixed: some benchmarks improved, but basic mistakes frequent |
| Model Routing | Faulty at launch, misrouted simple queries to weaker models |
| Creativity & Tone | Less engaging, overly formal, many users miss GPT-4(o) |
| Coding & Reasoning | Some improvements, but notable regressions and inconsistencies |
| User Sentiment | Strong backlash, rollback of legacy models underway |
| OpenAI Reaction | Bug fixes, restoring control, tweaks ongoing |