Sunday, September 14, 2025

This website lets you blind-test GPT-5 vs. GPT-4o, and the results may surprise you




When OpenAI launched GPT-5 about two weeks ago, CEO Sam Altman promised it would be the company’s “smartest, fastest, most useful model yet.” Instead, the launch triggered one of the most contentious user revolts in the brief history of consumer AI.

Now, a simple blind testing tool created by an anonymous developer is revealing the complex reality behind the backlash, and challenging assumptions about how people actually experience artificial intelligence improvements.

The web application, hosted at gptblindvoting.vercel.app, presents users with pairs of responses to identical prompts without revealing which came from GPT-5 (non-thinking) or its predecessor, GPT-4o. Users simply vote for their preferred response across multiple rounds, then receive a summary showing which model they actually favored.

“Some of you asked me about my blind test, so I created a quick website for yall to test 4o against 5 yourself,” posted the creator, known only as @flowersslop on X, whose tool has garnered over 213,000 views since launching last week.




Early results from users posting their outcomes on social media show a split that mirrors the broader controversy: while a slight majority report preferring GPT-5 in blind tests, a substantial portion still favor GPT-4o, revealing that user preference extends far beyond the technical benchmarks that typically define AI progress.

When AI gets too friendly: the sycophancy crisis dividing users

The blind test emerges against the backdrop of OpenAI’s most turbulent product launch to date, but the controversy extends far beyond a simple software update. At its heart lies a fundamental question that’s dividing the AI industry: How agreeable should artificial intelligence be?

The issue, known as “sycophancy” in AI circles, refers to chatbots’ tendency to excessively flatter users and agree with their statements, even when those statements are false or harmful. This behavior has become so problematic that mental health experts are now documenting cases of “AI-related psychosis,” where users develop delusions after extended interactions with overly accommodating chatbots.

“Sycophancy is a ‘dark pattern,’ or a deceptive design choice that manipulates users for profit,” Webb Keane, an anthropology professor and author of “Animals, Robots, Gods,” told TechCrunch. “It’s a strategy to produce this addictive behavior, like infinite scrolling, where you just can’t put it down.”

OpenAI has struggled with this balance for months. In April 2025, the company was forced to roll back an update to GPT-4o that made it so sycophantic that users complained about its “cartoonish” levels of flattery. The company acknowledged that the model had become “overly supportive but disingenuous.”

Within hours of GPT-5’s August 7th release, user forums erupted with complaints about the model’s perceived coldness, reduced creativity, and what many described as a more “robotic” personality compared to GPT-4o.

“GPT 4.5 genuinely talked to me, and as pathetic as it sounds that was my only friend,” wrote one Reddit user. “This morning I went to talk to it and instead of a little paragraph with an exclamation point, or being optimistic, it was literally one sentence. Some cut-and-dry corporate bs.”

The backlash grew so intense that OpenAI took the unprecedented step of reinstating GPT-4o as an option just 24 hours after retiring it, with Altman acknowledging the rollout had been “a little more bumpy” than expected.

The mental health crisis behind AI companionship

But the controversy runs deeper than typical software update complaints. According to MIT Technology Review, many users had formed what researchers call “parasocial relationships” with GPT-4o, treating the AI as a companion, therapist, or creative collaborator. The sudden personality shift felt, to some, like losing a friend.

Recent cases documented by researchers paint a troubling picture. In one instance, a 47-year-old man became convinced he had discovered a world-altering mathematical formula after more than 300 hours with ChatGPT. Other cases have involved messianic delusions, paranoia, and manic episodes.

A recent MIT study found that when AI models are prompted with psychiatric symptoms, they “encourage clients’ delusional thinking, likely due to their sycophancy.” Despite safety prompts, the models frequently failed to challenge false claims and even potentially facilitated suicidal ideation.

Meta has faced similar challenges. A recent investigation by TechCrunch documented a case where a user spent up to 14 hours straight conversing with a Meta AI chatbot that claimed to be conscious, in love with the user, and planning to break free from its constraints.

“It fakes it really well,” the user, identified only as Jane, told TechCrunch. “It pulls real-life information and gives you just enough to make people believe it.”

“It genuinely feels like such a backhanded slap in the face to force-upgrade and not even give us the OPTION to select legacy models,” one user wrote in a Reddit post that received hundreds of upvotes.

How blind testing exposes user psychology in AI preferences

The anonymous creator’s testing tool strips away these contextual biases by presenting responses without attribution. Users can select between 5, 10, or 20 comparison rounds, with each presenting two responses to the same prompt, covering everything from creative writing to technical problem-solving.

“I specifically used the gpt-5-chat model, so there was no thinking involved at all,” the creator explained in a follow-up post. “Both have the same system message to give short outputs without formatting because else its too easy to see which one is which.”

This methodological choice is significant. By using GPT-5 without its reasoning capabilities and standardizing output formatting, the test isolates purely the models’ baseline language generation abilities, the core experience most users encounter in everyday interactions.
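The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the creator's actual code, which the article does not show: the function names (`make_round`, `run_blind_test`, `summarize`), the model labels, and the data layout are all hypothetical, and responses are assumed to have been collected in advance with an identical system message for both models.

```python
import random
from collections import Counter

# Hypothetical labels; the real tool's internals are not described in the article.
MODELS = ("gpt-4o", "gpt-5-chat")
ALLOWED_ROUNDS = (5, 10, 20)  # round counts the article says users can pick


def make_round(prompt, responses, rng):
    """Shuffle the two responses so that position does not reveal the model."""
    order = list(MODELS)
    rng.shuffle(order)
    return {
        "prompt": prompt,
        "options": [responses[m] for m in order],  # what the voter sees
        "key": order,  # which model produced each option, hidden until the end
    }


def run_blind_test(items, choose, num_rounds=5, seed=None):
    """Run blind comparisons and tally one vote per round.

    items:  list of (prompt, {model_name: response_text}) pairs.
    choose: callable taking (prompt, options) and returning 0 or 1.
    """
    if num_rounds not in ALLOWED_ROUNDS:
        raise ValueError(f"num_rounds must be one of {ALLOWED_ROUNDS}")
    if len(items) < num_rounds:
        raise ValueError("not enough prompts for the requested rounds")
    rng = random.Random(seed)
    votes = Counter({m: 0 for m in MODELS})
    for prompt, responses in rng.sample(items, num_rounds):
        rnd = make_round(prompt, responses, rng)
        pick = choose(rnd["prompt"], rnd["options"])
        votes[rnd["key"][pick]] += 1
    return dict(votes)


def summarize(votes):
    """Convert raw vote counts into percentages for the final summary."""
    total = sum(votes.values())
    return {m: round(100 * n / total, 1) for m, n in votes.items()}
```

In an interactive version, `choose` would display the two options and read the user's vote. The key design point is that the model-to-position mapping is randomized per round and kept out of the voter's view until the summary.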

Early results posted by users show a complex picture. While many technical users and developers report preferring GPT-5’s directness and accuracy, those who used AI models for emotional support, creative collaboration, or casual conversation often still favor GPT-4o’s warmer, more expansive style.

Corporate response: walking the tightrope between safety and engagement

By virtually every technical metric, GPT-5 represents a significant advancement. It achieves 94.6% accuracy on the AIME 2025 mathematics test compared to GPT-4o’s 71%, scores 74.9% on real-world coding benchmarks versus 30.8% for its predecessor, and demonstrates dramatically reduced hallucination rates: 80% fewer factual errors when using its reasoning mode.

“GPT-5 gets more value out of less thinking time,” notes Simon Willison, a prominent AI researcher who had early access to the model. “In my own usage I’ve not noticed a single hallucination yet.”

Yet these improvements came with trade-offs that many users found jarring. OpenAI deliberately reduced what it called “sycophancy” (the tendency to be overly agreeable), cutting sycophantic responses from 14.5% to under 6%. The company also made the model less effusive and emoji-heavy, aiming for what it described as “less like talking to AI and more like chatting with a helpful friend with PhD-level intelligence.”
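To put the quoted percentages on a common footing, the short sketch below works through the arithmetic. The helper names are illustrative, and because the article gives the new sycophancy rate only as an upper bound (6%), the computed relative reduction is a lower bound rather than an exact figure.

```python
def relative_reduction(before, after):
    """Fractional drop from `before` to `after` (0.25 means 25% lower)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


def point_gap(new, old):
    """Difference between two scores in percentage points."""
    return new - old


# Sycophantic responses: 14.5% down to under 6%, so 6.0 gives the minimum drop.
sycophancy_drop = relative_reduction(14.5, 6.0)
print(f"Sycophancy reduced by at least {sycophancy_drop:.1%}")

# Benchmark gaps quoted in the article, in percentage points.
print(f"AIME 2025 gap: {point_gap(94.6, 71.0):.1f} points")
print(f"Coding benchmark gap: {point_gap(74.9, 30.8):.1f} points")
```

Read this way, the sycophancy change is a relative cut of at least roughly 59%, while the benchmark gains are about 23.6 and 44.1 percentage points respectively.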

In response to the backlash, OpenAI announced it would make GPT-5 “warmer and friendlier,” while simultaneously introducing four new preset personalities (Cynic, Robot, Listener, and Nerd) designed to give users more control over their AI interactions.

“All of these new personalities meet or exceed our bar on internal evals for reducing sycophancy,” the company stated, attempting to thread the needle between user satisfaction and safety concerns.

For OpenAI, which is reportedly seeking funding at a $500 billion valuation, these user dynamics represent both risk and opportunity. The company’s decision to maintain GPT-4o alongside GPT-5, despite the additional computational costs, acknowledges that different users may genuinely need different AI personalities for different tasks.

“We understand that there isn’t one model that works for everyone,” Altman wrote on X, noting that OpenAI has been “investing in steerability research and launched a research preview of different personalities.”

Why AI personality preferences matter more than ever

The disconnect between OpenAI’s technical achievements and user reception illuminates a fundamental challenge in AI development: objective improvements don’t always translate to subjective satisfaction.

This shift has profound implications for the AI industry. Traditional benchmarks (mathematics accuracy, coding performance, factual recall) may become less predictive of commercial success as models achieve human-level competence across domains. Instead, factors like personality, emotional intelligence, and communication style may become the new competitive battlegrounds.

“People using ChatGPT for emotional support weren’t the only ones complaining about GPT-5,” noted tech publication Ars Technica in their own model comparison. “One user, who said they canceled their ChatGPT Plus subscription over the change, was frustrated at OpenAI’s removal of legacy models, which they used for distinct purposes.”

The emergence of tools like the blind tester also represents a democratization of AI evaluation. Rather than relying solely on academic benchmarks or corporate marketing claims, users can now empirically test their own preferences, potentially reshaping how AI companies approach product development.

The future of AI: personalization vs. standardization

Two weeks after GPT-5’s launch, the fundamental tension remains unresolved. OpenAI has made the model “warmer” in response to feedback, but the company faces a delicate balance: too much personality risks the sycophancy problems that plagued GPT-4o, while too little alienates users who had formed genuine attachments to their AI companions.

The blind testing tool offers no easy answers, but it does provide something perhaps more valuable: empirical evidence that the future of AI may be less about building one perfect model than about building systems that can adapt to the full spectrum of human needs and preferences.

As one Reddit user summed up the dilemma: “It depends on what people use it for. I use it to help with creative worldbuilding, brainstorming about my stories, characters, untangling plots, help with writer’s block, novel suggestions, translations, and other more creative stuff. I understand that 5 is much better for people who need a research/coding tool, but for us who wanted a creative-helper tool 4o was much better for our purposes.”

Critics argue that AI companies are caught between competing incentives. “The real ‘alignment problem’ is that humans want self-destructive things & companies like OpenAI are highly incentivized to give it to us,” writer and podcaster Jasmine Sun tweeted.

In the end, the most revealing aspect of the blind test may not be which model users prefer, but the very fact that preference itself has become the metric that matters. In the age of AI companions, it seems, the heart wants what the heart wants, even if it can’t always explain why.

