Topic : OpenAI A/B Testing UX Is Actively Discouraging Feedback and Producing Garbage Data
Tags : A/B Testing, Feedback Quality, UI Design, UX, iPad, Tablets, Desktop, ChatGPT Apps
⸻
The Problem :
If OpenAI wants meaningful feedback from serious users, it needs to fix how A/B testing is rendered - especially on iPads and tablets.
Right now, A/B responses are displayed in narrow, side-by-side panes that don’t adapt to screen width. On an iPad - even in landscape mode, which is how most people use it - a few characters on the left and right edges of each answer are always cut off, forcing the user to scroll horizontally back and forth in both answers just to read the full content and make a decision. That creates frustration and friction and leads to task abandonment.
The result ? Many times - especially when I’m pressed for time - I skip both answers or pick one at random, because the UX makes reading them slow and frustrating. So you’re capturing zero or distorted signal, not actual user preference. The test is likely producing junk feedback data that OpenAI may mistakenly interpret as useful.
Forcing A/B testing mid-task disrupts my focus and, more importantly, my flow with the assistant - especially during research, story outlining, long brainstorming sessions, fleshing out design ideas, or detailed problem solving. You’re not just breaking our interaction - you’re breaking momentum.
I’m a user who actively wants to help improve the system and has contributed a ton of real, structured feedback before. But the current A/B test implementation urgently needs to be refined.
If you want to run A/B response tests and get accurate data, here’s what I think should happen :
- Fix the layout. Let users read A/B answers without scrolling by implementing a full- or 3/4-screen popover, or a tap-to-expand that shows each answer at full width. Make use of the full screen real estate instead of cramming complex answers into side-by-side boxes that cut off the edges.
- Give us a comment field. If you want real insight, let users drop a quick one- or two-sentence freeform note. Something like : “Both were too similar,” or “A had better reasoning but clunky flow.” Optional, but very powerful.
- Give users an “opt-out” option - or better yet, detect when A/B testing is inappropriate. Let users temporarily disable A/B testing, or, even better, don’t make users flag it at all : the UI should be designed to detect “deep work mode” from prompt density, frequency, and intent (a rough sketch of one such gate follows this list).
- Offer “regenerate” or “combine both” options. If both responses fall flat, which they do at times, or if each has something useful that should be merged, let users signal that. You’ll get much higher-quality signals and more useful training data with this enabled.
- Don’t run A/B tests during highly personal or emotionally sensitive conversations. Ever.
As an example : late last night I was exploring some deeply personal topics - EOL planning and my father’s Alzheimer’s - and, while in Temporary Chat Mode, the system dropped an A/B test into the middle of the conversation. That shocked the hell out of me - full stop.
It was wildly inappropriate, tone-deaf, and emotionally jarring. A/B testing should be disabled when the conversation context involves trauma, death, psychological distress, or other deeply personal topics. This isn’t a UX detail - it’s a trust boundary, and violating it undermines what OpenAI is trying to build with human-aligned AI.
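To make the “deep work” and sensitive-context gating concrete, here is a rough sketch of the kind of check I mean. Every name, signal, and threshold in it is hypothetical - this is not an actual OpenAI API, just an illustration of how cheap such a gate could be.

```typescript
// Hypothetical sketch only - names, signals, and thresholds are illustrative,
// not anything OpenAI actually exposes or ships.

interface SessionSignals {
  promptsInLastTenMinutes: number;  // proxy for prompt frequency ("deep work")
  averagePromptLength: number;      // characters; proxy for prompt density
  sensitiveTopicFlagged: boolean;   // e.g. output of an existing context/safety classifier
  userOptedOut: boolean;            // explicit user preference
}

function shouldOfferAbTest(signals: SessionSignals): boolean {
  // Hard stops: explicit opt-out and sensitive contexts never see a test.
  if (signals.userOptedOut || signals.sensitiveTopicFlagged) {
    return false;
  }
  // Heuristic "deep work" detection: many long prompts in a short window
  // suggest focused work that an interruption would break.
  const inDeepWork =
    signals.promptsInLastTenMinutes >= 5 && signals.averagePromptLength > 400;
  return !inDeepWork;
}
```

Even a crude gate like this would have prevented both the mid-research interruptions and the Temporary Chat incident described above.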
⸻
A real-time example of how flawed the feedback system is :
While I was actively composing this very post about how disruptive and inappropriate A/B testing can be, I was hit with yet another A/B test. You can’t make this up.
The timing wasn’t just ironic - it was living proof that the system has no contextual awareness whatsoever.
These A/B test design flaws interrupt high-focus, cognitively demanding workflows with untimely feedback requests that may rarely be answered accurately - and, in fact, frustrated users may be actively sabotaging their feedback.
How did this even get through testing ?
It’s hard to believe anyone used this in a real workflow before shipping it. Some plausible explanations : it was generated by an internal AI and shipped without human iteration; it was only reviewed on desktop by someone who didn’t test it on tablets or in real tasks; or, worse, someone knew the UX was flawed but bet that the volume of feedback would offset the friction. If it’s the third one, that’s a serious misread.
To be clear: I want to give feedback. It’s in my own best interest to do so. But OpenAI is making it needlessly difficult and frustrating.
If you’re trying to collect accurate data on user preferences, then start by respecting the user’s ability to comfortably read and respond to your questions - and know when not to ask at all.
David Young
Boca Raton, FL