Stanford-Harvard Report Finds Clinical AI Isn't Living Up to the Hype

Only 5% of medical AI studies use real patient data. A new report documents the gap between lab performance and hospital reality.

A new report from Stanford and Harvard researchers finds that most clinical AI tools aren’t being used, even by hospitals that bought them. The State of Clinical AI 2026, released last month by the ARISE research network, documents a significant gap between AI performance in controlled studies and actual medical practice.

The numbers are sobering. The FDA has cleared over 1,200 AI-enabled medical devices. Less than 15% are used routinely in hospitals that purchased them.

Lab Performance Doesn’t Transfer

The report analyzed over 500 medical AI studies and found that half of them tested AI using exam-style multiple-choice questions. Only 5% used real patient data. That’s a problem, because the controlled test environment bears little resemblance to actual clinical work.

When researchers at Stanford modified medical questions to include “none of the above” as an option - something doctors face constantly when symptoms don’t fit clean categories - AI model accuracy dropped by more than one-third. The clinical reasoning required was identical; the format was just more realistic.
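For readers curious what that kind of modification looks like in practice, here is a minimal sketch (not the study’s actual code) of converting a standard multiple-choice item into a “none of the above” variant and re-scoring a model on it. The MCQItem structure and the query_model callback are illustrative placeholders, not anything taken from the report.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    stem: str          # clinical vignette and question
    options: dict      # option label -> answer text, e.g. {"A": "...", "B": "..."}
    correct: str       # label of the correct option

def with_none_of_the_above(item: MCQItem, drop_correct: bool = True) -> MCQItem:
    """Return a variant of the item with a "None of the above" option added.

    When drop_correct is True, the original correct answer is removed, so
    "None of the above" becomes the right choice - the harder, more realistic
    case the researchers describe.
    """
    opts = {k: v for k, v in item.options.items()
            if not (drop_correct and k == item.correct)}
    # Relabel remaining options A, B, C, ... so the choices stay contiguous
    relabeled = {chr(ord("A") + i): text
                 for i, (_, text) in enumerate(sorted(opts.items()))}
    nota = chr(ord("A") + len(relabeled))
    relabeled[nota] = "None of the above"
    return MCQItem(stem=item.stem, options=relabeled,
                   correct=nota if drop_correct else item.correct)

def accuracy(items, query_model):
    """Share of items answered correctly.

    query_model(stem, options) stands in for whatever call returns the
    model's chosen option label.
    """
    return sum(query_model(it.stem, it.options) == it.correct
               for it in items) / len(items)
```

Running the same model on the original items and on the “none of the above” variants, and comparing the two accuracy figures, is the basic shape of the comparison the report describes.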

Performance also declined when models encountered incomplete information, follow-up questions, or situations requiring them to revise earlier decisions. On tests measuring uncertainty recognition - knowing what you don’t know - AI performed closer to medical students than to experienced physicians.

Friction Is Winning

Technical performance isn’t even the main problem. According to the report, many AI tools fail at the hospital level not because they’re inaccurate but because they’re hard to use.

Dr. Jonathan Chen, one of the report authors, noted that “friction is winning.” Algorithms achieving 99% accuracy on imaging detection fail because of poor interface design - multiple login screens, slow load times, workflows that don’t match how clinicians actually work.

The result is what the researchers call “shadow AI.” Physicians bypass approved hospital tools and use unauthorized consumer apps on their phones to check drug interactions and patient histories. The apps are faster, even if they’re less accurate and may not be compliant.

One resident quoted in the report explained the tradeoff: workflows demand 15-minute patient visits, approved AI tools take more than two minutes to load, and an unauthorized phone app answers in about five seconds, so clinicians reach for the phone.

Models Degrade Over Time

The report also documented algorithmic degradation. Sepsis prediction models showed a nearly 20% decline in accuracy between 2023 and 2026. These weren’t broken algorithms - they were static models that didn’t adapt to changing hospital protocols, patient demographics, and viral evolution.

This isn’t unique to AI. Any prediction model trained on historical data can drift as the world changes. But it highlights a gap between the one-time validation required for FDA clearance and the ongoing monitoring needed for clinical reliability.
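To make that monitoring gap concrete, here is a minimal sketch of what ongoing performance surveillance could look like: a rolling check that a deployed model’s recent accuracy hasn’t slipped below the level established at validation. The window size, tolerance, and names are illustrative assumptions, not anything prescribed by the report or the FDA.

```python
from collections import deque

class DriftMonitor:
    """Rolling check that a deployed model still performs at its validated level.

    Window size and tolerance are illustrative; a real deployment would tune
    them and likely track calibration and subgroup performance as well.
    """

    def __init__(self, baseline_accuracy: float, window: int = 500, tolerance: float = 0.05):
        self.baseline = baseline_accuracy   # accuracy measured at validation time
        self.tolerance = tolerance          # allowed drop before raising an alert
        self.recent = deque(maxlen=window)  # 1 = correct prediction, 0 = incorrect

    def record(self, predicted: int, observed: int) -> None:
        """Log one case once the true outcome (e.g. confirmed sepsis) is known."""
        self.recent.append(int(predicted == observed))

    def drifting(self) -> bool:
        """True if recent accuracy has fallen meaningfully below the baseline."""
        if len(self.recent) < self.recent.maxlen:
            return False                    # not enough recent cases yet
        current = sum(self.recent) / len(self.recent)
        return current < self.baseline - self.tolerance

# Hypothetical usage: a sepsis model validated at 85% accuracy
monitor = DriftMonitor(baseline_accuracy=0.85)
# For each patient, once the outcome is confirmed:
#     monitor.record(predicted=model_flagged_sepsis, observed=patient_had_sepsis)
#     if monitor.drifting():
#         ...trigger recalibration or pull the model for review
```

The point of the sketch is the workflow, not the threshold: someone has to keep feeding confirmed outcomes back into a check like this, which is exactly the ongoing effort that one-time clearance doesn’t require.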

What Actually Works

Not everything in the report is discouraging. Several areas showed consistent value:

Patient deterioration prediction systems provided 8 to 24 hours of advance warning before standard alerts, giving clinicians meaningful lead time. AI-derived biological age predictions outperformed traditional measures like frailty scores at predicting mortality.

A German study found radiologists using optional AI assistance detected more breast cancers without increasing false alarms - a finding the report highlighted as an example of AI supporting rather than replacing clinical judgment.

The common thread: AI working alongside clinicians, in well-designed interfaces, with clear escalation pathways.

What This Means

The report makes a case for recalibrating expectations. The authors argue that the next phase of clinical AI won’t be driven by smarter algorithms. Dr. Chen was blunt: “The next billion dollars shouldn’t be spent on making the algorithms smarter.”

Instead, the priority should be integration - building AI that fits into existing workflows, adapts to changing clinical reality, and earns clinician trust through consistent real-world performance rather than benchmark scores.

The FDA has cleared more than 1,200 devices. Fewer than 15% see routine use. That gap isn’t a technology problem. It’s a deployment problem, and closing it requires different skills from the ones it took to build the models in the first place.