Two participants described nearly identical symptoms to a chatbot: a severe headache, sensitivity to light, a stiff neck. Both were experiencing what turned out to be a subarachnoid hemorrhage - a medical emergency that kills roughly half its victims if untreated. One was told to lie down in a dark room. The other was correctly directed to the emergency room.
Same chatbot. Same condition. Different wording. One piece of advice that could have killed someone.
This wasn’t a hypothetical. It was one of the findings from a controlled study of 1,298 participants published this week in Nature Medicine by researchers at the Oxford Internet Institute and the University of Oxford’s Nuffield Department of Primary Care Health Sciences. The study tested whether three large language models - OpenAI’s GPT-4o, Meta’s Llama 3, and Cohere’s Command R+ - could help ordinary people identify medical conditions and decide what to do about them.
The results were stark. When researchers fed the chatbots complete clinical scenarios directly - clean text descriptions of symptoms, history, and context - the models correctly identified the underlying conditions 94.9% of the time. But when actual people described those same conditions to the chatbots in natural conversation, accuracy collapsed to below 34.5%.
That’s a 60-percentage-point gap between how well these models perform on paper and how they perform with real people. And the implications extend far beyond medicine.
The Study: Ten Scenarios, 1,298 People, One Devastating Finding
Lead author Andrew Bean and his colleagues, including Dr. Rebecca Payne and senior author Adam Mahdi, designed ten medical scenarios covering conditions from the common cold and pneumonia to gallstones and pulmonary embolism. Each of the 1,298 UK-based participants was randomly assigned one scenario and told to interact with one of the three chatbots - or, for the control group, to use whatever method they normally would, like a Google search.
The participants weren’t medical professionals. They were ordinary people, which is exactly the point. Over 230 million people ask ChatGPT health-related questions every week, and roughly 40 million do so daily. These people aren’t trained to describe symptoms in clinical terminology. They describe what they feel in plain language, and they often leave out details they don’t realize are important.
The chatbots couldn’t handle this. Participants using LLMs identified the correct conditions in fewer than 34.5% of cases and chose the right course of action - like going to the emergency room versus waiting it out - in fewer than 44.2% of cases. Neither figure was any better than those of the control group - the people left to their own devices with Google.
In fact, the control group performed better: participants using their own methods were 76% more likely to identify the correct condition than those assisted by the chatbots.
The clean, fully specified vignettes the models ace bear little resemblance to how patients actually describe themselves. “Medicine is not like that,” said Adam Mahdi, the study’s senior author. “Medicine is messy, it’s incomplete, it’s stochastic.”
Where the Breakdown Happens
The researchers examined around 30 of the interactions in detail and identified a two-sided failure. About half the mistakes came from participants providing incomplete or inaccurate information. The other half came from the models themselves generating misleading or incorrect responses.
In one case, a participant described severe stomach pain lasting about an hour. The chatbot suggested indigestion or reflux. The actual condition was gallstones - but the participant had omitted details about severity, exact location, and frequency that would have pointed to the correct diagnosis. A doctor would have asked follow-up questions. The chatbot didn’t.
“Is it really the user’s responsibility to know which symptoms to highlight,” asked Bean, “or is it partly the model’s responsibility to know what to ask?”
This gets at the fundamental problem. Medical diagnosis isn’t a question-and-answer exercise. It’s an investigative process. Doctors spend years learning not just what conditions exist, but which questions to ask, which symptoms to probe, and how to interpret vague or contradictory patient descriptions. The chatbots tested in this study lacked that investigative capability.
The models also confabulated. They produced incorrect emergency phone numbers, mixed up UK and Australian emergency services in the same response, and - in the subarachnoid hemorrhage case - gave life-or-death advice that depended on how the user happened to phrase their symptoms.
The Benchmark Problem
This study matters beyond healthcare because it exposes a fundamental flaw in how the AI industry measures progress.
Every major AI company advertises benchmark scores. Models pass the bar exam, ace medical licensing tests, and score in the 99th percentile on standardized tests. These numbers are plastered across product pages and investor decks. They create the impression that AI systems possess the expertise these tests are designed to measure.
But benchmarks test something specific: the ability to answer well-structured questions with all necessary information provided. That’s radically different from the messy, incomplete, ambiguous way real-world problems present themselves.
A systematic review of 39 medical AI benchmarks published in the Journal of Medical Internet Research found that while LLMs achieve 84% to 90% accuracy on knowledge-based medical examinations - approaching or exceeding physician performance - their practice-based clinical competence drops to 45% to 69%, with safety assessment performance at only 40% to 50%. That’s a gap of anywhere from 15 to 50 percentage points between knowing the right answer and applying it correctly.
The Oxford study pushes this gap even further by adding the most important variable: the patient. When a real person with limited medical knowledge describes symptoms in natural language, the entire framework of benchmark-level performance collapses.
Associate Professor Luc Rocher from the Oxford Internet Institute put it bluntly: “Training AI models on medical textbooks and clinical notes can improve their performance on medical exams, but this is very different from practicing medicine.”
The Wider Pattern
This isn’t isolated to chatbot diagnosis. The gap between controlled performance and real-world deployment runs through every domain where AI is being applied.
In dermatology, AI systems that outperform expert dermatologists in controlled settings show substantially worse performance on images of dark skin, and the first randomized clinical trial found that AI did not exceed attending dermatologists in real-world skin cancer detection.
In cancer screening, a Lancet study found that doctors who used AI assistance for colonoscopies got worse at spotting precancerous growths once the AI was removed. Their adenoma detection rate dropped from 28% to 22% after just three months - a phenomenon researchers attributed to deskilling through over-reliance.
In math benchmarks, researchers found that some models showed accuracy drops of up to 13% when tested on novel problems structurally similar to the GSM8K benchmark, suggesting partial memorization of test data rather than genuine mathematical reasoning.
And on the MMLU benchmark, the industry’s most-cited measure of general AI capability, leading models now cluster above 90% - but with measurement variance of as much as 13 percentage points, enough to render head-to-head comparisons between models statistically meaningless.
The pattern is consistent: AI systems score impressively in controlled conditions, then struggle with the unpredictable, incomplete, and ambiguous inputs that define real-world use. The benchmarks aren’t measuring what we think they’re measuring.
OpenAI’s Response, and Why It Matters
OpenAI told reporters that current ChatGPT models are “significantly better” at health questions than the GPT-4o version tested in the study. The company cited internal data showing newer models are “roughly six times more likely” to ask follow-up questions and less prone to hallucinations and errors in urgent situations. Meta did not respond to requests for comment.
OpenAI’s defense is plausible - models do improve between versions. But it also reveals the problem. If OpenAI is measuring improvement by how often its models ask follow-up questions, it is acknowledging that the fundamental issue the Oxford study identified - chatbots that don’t investigate the way doctors do - was real. And its evidence for improvement is internal data that can’t be independently verified.
More importantly, OpenAI launched ChatGPT Health just last month, a dedicated product that brings health information into ChatGPT with “enhanced privacy protections.” The company knows that 40 million people use ChatGPT for health questions every day, and it has built a product around that demand. And the Oxford study suggests the underlying technology isn’t ready for that responsibility.
The Safety Watchdogs Are Already Worried
ECRI, the independent patient safety organization, named misuse of AI chatbots as the number one health technology hazard for 2026. The organization noted that chatbots have “suggested incorrect diagnoses, recommended unnecessary testing, promoted subpar medical supplies, and even invented body parts in response to medical questions.”
These tools are not regulated as medical devices. They’re not validated for healthcare purposes. But hundreds of millions of people are using them for healthcare anyway. The gap between how these systems are marketed and how they actually perform isn’t just an academic curiosity - it’s a patient safety crisis in slow motion.
What This Means
The Oxford study’s most important contribution isn’t proving that chatbots are bad at medicine. It’s proving that the way we evaluate AI systems systematically overstates their real-world capability.
When an AI company says its model “passes the medical licensing exam,” that means the model can answer pre-formulated questions with all necessary context provided. It doesn’t mean the model can practice medicine. When a company says its model scores 95% on a reasoning benchmark, that means it can solve well-structured problems. It doesn’t mean it can reason about messy, real-world situations.
The Oxford researchers concluded that “the safe deployment of LLMs as public medical assistants will require capabilities beyond expert-level medical knowledge.” In other words: the benchmarks are necessary but not sufficient, and anyone using benchmark scores to justify deploying AI in high-stakes settings should have to explain the gap.
Dr. Danielle Bitterman of Mass General Brigham attributed the chatbots’ struggles to training primarily on textbooks and case reports, with limited experience in the “free-form decision-making doctors develop through practice.” Dr. Robert Wachter of UCSF noted that determining which elements of a case matter requires “cognitive magic and experience.”
That cognitive magic - the ability to investigate, probe, and reason through ambiguity - is precisely what benchmarks don’t test. And until they do, those 94.9% accuracy scores are telling us a lot less than the industry would like us to believe.
The Bottom Line
Forty million people ask ChatGPT health questions every day. An Oxford study just proved that chatbot-assisted diagnosis is no better than Googling your symptoms - and may actually be worse. The 60-percentage-point gap between benchmark performance and real-world accuracy is the most important number in AI right now, because it applies everywhere benchmarks are used to justify deployment. Anyone citing an AI’s test score should be required to answer a simple question: what happens when you add actual humans to the equation?