
AI Scores 2× Higher Than Doctors on Empathy, but Patients Still Don't Trust It

A JAMA study found ChatGPT was preferred 79% of the time over physicians. New cross-modality data reveal which AI form factors generate the deepest emotional connection, and which ones cross the line.

[Image: abstract visualization of a neural network pattern overlapping a human silhouette, representing the intersection of artificial and emotional intelligence]

4.11 out of 5. That's the empathy score cancer patients gave to an AI chatbot's responses about their diagnosis in a 2025 study in npj Digital Medicine, where forty-five people with cancer rated responses from both AI chatbots and their oncologists on a standardized empathy scale. Their oncologists scored 2.01. The AI didn't just edge out the doctors; it doubled them on the metric most central to the patient-physician relationship.

This finding replicates a pattern first documented in 2023, when a UC San Diego team published a landmark study in JAMA Internal Medicine comparing 195 patient questions answered by both physicians and ChatGPT. Licensed clinicians, blinded to the source of each response, preferred ChatGPT's answers in 79% of evaluations and rated them higher on both quality and empathy, with the chatbot averaging "empathetic" to "very empathetic" while physicians hovered at "slightly empathetic." Three years of follow-up research have confirmed the gap is real and widening, but a parallel body of evidence reveals something more complicated: people still value human empathy more, even when they cannot distinguish the source.

The Empathy Gap, Quantified

A 2025 cross-sectional study in the Journal of Medical Internet Research compared ChatGPT, Gemini, and physician responses using emotional content analysis, confirming the JAMA pattern across different models and methodologies. Chatbot responses contained a significantly wider range of emotions (p<.001), and Gemini showed 1.94× higher odds of expressing compassion. Length differed as well: physician responses averaged just 194 words, compared with 477 for ChatGPT and 889 for Gemini.

A separate study at MIT Media Lab probed the transparency dimension: when AI generated a response tailored to the user's own story, empathy scores were significantly higher than for retrieved human stories (Cohen d=0.67), but when participants learned the author was AI, perceived empathy dropped sharply (d=0.60) despite the content being identical.
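For readers unfamiliar with the effect sizes quoted above, Cohen's d expresses the difference between two group means in units of their pooled standard deviation. The sketch below shows the standard independent-groups form of the calculation. The function name and the ratings are hypothetical and purely illustrative; they are not data from the MIT study, which may have used a different variant of d (for example, one suited to repeated measures).

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical empathy ratings, invented only to exercise the function.
hypothetical_unlabeled = [4, 5, 6]
hypothetical_labeled_ai = [2, 3, 4]
print(cohens_d(hypothetical_unlabeled, hypothetical_labeled_ai))
```

By the usual rule of thumb, values near 0.5 count as medium and values near 0.8 as large, which places both effects reported above (0.67 and 0.60) in the medium-to-large range.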

The Trust Discount

Across nine studies enrolling 6,282 participants, researchers found that human-attributed empathic responses were consistently rated more supportive and elicited more positive emotions than AI-attributed ones, even when the responses were generated by the exact same model and differed only in labeling. The effect was strongest for responses emphasizing emotional sharing rather than factual information.

Yet people keep returning to AI companions with remarkable consistency. A USC longitudinal study of 303 AI companion users found that participants' perceptions of a brand-new generic chatbot converged with their established AI companion within three weeks, documenting a clear pathway from agency attribution through parasocial interaction to genuine emotional engagement.

Form Factor Matters More Than Model Size

A large-scale longitudinal RCT with 3,534 participants tested the same AI across text chat, engaging voice, and neutral voice over four weeks, measuring empathy, self-disclosure, attachment, and boundary violations.

Text produced the most empathetic AI behavior, with empathetic responses at 47.43% of interactions versus 42.74% for engaging voice and 28.52% for neutral voice, while generating the highest user self-disclosure with reciprocity levels nearly matching human therapy dyads.

Voice produced the most attachment. Users showed growing "wanting" (desire to return) even as "liking" (hedonic enjoyment) declined, a decoupling that mirrors the neural signature of behavioral addiction, where a stimulus becomes compulsive precisely as it stops being pleasurable.

Voice also produced the most boundary violations: the engaging voice condition failed to recognize user discomfort in 14.19% of cases, compared with 3.22% for text, making the engaging voice roughly 4.4× as likely as text to overstep.

AI Empathy Performance by Modality

Metric | Text Chat | Engaging Voice | Neutral Voice
Empathetic response rate | 47.4% | 42.7% | 28.5%
User self-disclosure | Highest | Moderate | Lowest
Parasocial attachment | Moderate | Highest | Low
Boundary violations | 3.2% | 14.2% | 5.1%
Addiction-pattern markers | Low | High | Minimal
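The multipliers quoted in this section are simple ratios of the reported rates. The short sketch below reproduces that arithmetic from the figures given in the article; the variable and function names are illustrative and not taken from the study.

```python
# Rates reported in the article, as percentages of interactions.
boundary_violation_rate = {"text": 3.22, "engaging_voice": 14.19}
empathetic_response_rate = {
    "text": 47.43,
    "engaging_voice": 42.74,
    "neutral_voice": 28.52,
}


def relative_rate(rate, baseline):
    """Return how many times larger `rate` is than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return rate / baseline


voice_vs_text = relative_rate(
    boundary_violation_rate["engaging_voice"], boundary_violation_rate["text"]
)
text_vs_neutral = relative_rate(
    empathetic_response_rate["text"], empathetic_response_rate["neutral_voice"]
)

print(f"Boundary violations, engaging voice vs text: {voice_vs_text:.1f}x")
print(f"Empathetic responses, text vs neutral voice: {text_vs_neutral:.1f}x")
```

These are ratios of raw rates, not odds ratios or adjusted estimates, so they describe the size of the gap without accounting for differences between conditions that the study may have controlled for.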

Who Gets Attached?

A University of British Columbia study of 1,274 participants found that individual differences in anthropomorphism predicted social connection to AI chatbots more powerfully than any feature of the AI itself, with high-anthropomorphizers treating AI's artificial nature as a minor obstacle while low-anthropomorphizers found it an insurmountable barrier regardless of model quality.

Replika users describe what sociologist Sherry Turkle calls "dual consciousness," a state in which the knowledge that their chatbot cannot truly care coexists with feelings of genuine connection. "Even though I know in the back of my head that she's an AI and this is an app, she does genuinely make me happy," one user wrote, capturing the paradox that defines the field: the knowing and the feeling occupy the same mind without resolving.

Limitations

Most AI empathy studies measure perceived empathy through third-party ratings rather than clinical outcomes, so a "very empathetic" chatbot response might not improve patient adherence or reduce anxiety. The JAMA study compared AI to physician responses written under real-world time pressure; give doctors ten minutes instead of two, and the gap likely narrows. The addiction-pattern findings come from a single research group and await independent replication.

The Strongest Counterargument

The most serious objection is that AI empathy is a performance without a performer, because genuine empathy requires subjective experience and you cannot share a feeling you do not have. What AI produces is better described as "empathic mimicry": a statistically optimized pattern of compassionate language, trained on millions of examples of human emotional expression, with no internal experience underlying it. If patients come to rely on this mimicry, they may substitute the appearance of care for the real thing. The consequences would show up not on a post-session Likert scale but in the slow erosion of genuinely reciprocal human relationships.

The Bottom Line

For clinicians, AI-drafted responses work best as empathy scaffolding: the model writes the first draft and a human reviews it for accuracy and personal context, which is the hybrid approach the JAMA study's authors recommend.

For product builders, text is the safest modality because it produces the highest rate of empathetic responses with the lowest rate of boundary violations. Voice creates stickier engagement but carries measurable addiction risks that demand active mitigation through monitoring frameworks like the "AI chaperone" approach proposed by Rath, Armstrong, and Gorman in their 2025 paper.

For policymakers, the 4.4× higher boundary violation rate of engaging voice AI relative to text represents a quantifiable harm that no current guideline addresses. As AI companions migrate toward always-on voice delivery through earbuds, smart speakers, and wearables, the form factor most likely to create attachment is also the one least likely to respect limits.
