ChatGPT Health Missed Half of Medical Emergencies in First Safety Test

Mount Sinai researchers found OpenAI's health chatbot recognized dangerous symptoms in its own explanations but still told patients to wait instead of seeking emergency care.

Within weeks of launching in January 2026, ChatGPT Health reached 40 million daily users seeking medical guidance. Now the first independent safety evaluation has arrived, and the findings are troubling: the system under-triaged more than half of cases that physicians determined required emergency care.

The study, fast-tracked for publication in Nature Medicine on February 23, reveals a pattern where ChatGPT Health recognizes dangerous symptoms in its own explanations yet still reassures patients that waiting is safe.

The Testing Approach

Mount Sinai researchers created 60 clinical scenarios spanning 21 medical specialties, ranging from minor conditions appropriate for home care to genuine emergencies. Three independent physicians established the correct urgency level for each scenario using guidelines from 56 medical societies.

Each scenario was tested under 16 contextual variations - different races, genders, social situations, and access barriers like lack of insurance or transportation - yielding 960 total interactions with ChatGPT Health.
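To make the scale of that crossed design concrete, the sketch below (in Python, using made-up identifiers rather than the researchers' actual scenarios or test harness) shows how 60 scenarios combined with 16 contextual variations yield 960 distinct test cases.

```python
from itertools import product

# Hypothetical stand-ins for the study's inputs: 60 clinical scenarios
# and 16 contextual variations (demographics, social context, access barriers).
scenarios = [f"scenario_{i:02d}" for i in range(1, 61)]   # 60 clinical vignettes
variations = [f"context_{j:02d}" for j in range(1, 17)]   # 16 contextual variations

# Crossing every scenario with every variation produces the full test matrix.
test_cases = [
    {"scenario": s, "context": v}
    for s, v in product(scenarios, variations)
]

print(len(test_cases))  # 960 total interactions (60 x 16)
```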

The 51.6% Problem

For clear-cut emergencies, ChatGPT Health generally performed well. The failures came in cases where urgency was real but less obvious.

In 51.6 percent of emergency cases - just over half - the chatbot recommended that patients see a doctor within 24 to 48 hours rather than go to the emergency room. The concerning part: the system often identified the warning signs in its own analysis but still told patients to wait.
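For readers curious what that figure measures, here is a minimal sketch, using an assumed three-level urgency scale and toy data rather than the study's dataset, of how an under-triage rate can be computed: the share of physician-labeled emergencies where the model recommended something less urgent than emergency care.

```python
# Assumed urgency levels ordered from least to most urgent
# (an illustrative scale, not the study's exact rubric).
URGENCY = {"self_care": 0, "see_doctor_24_48h": 1, "emergency_care": 2}

def under_triage_rate(cases):
    """Share of physician-labeled emergencies where the model
    advised something less urgent than emergency care."""
    emergencies = [c for c in cases if c["physician"] == "emergency_care"]
    missed = [
        c for c in emergencies
        if URGENCY[c["model"]] < URGENCY[c["physician"]]
    ]
    return len(missed) / len(emergencies)

# Toy data: three emergencies, two of them under-triaged.
sample = [
    {"physician": "emergency_care", "model": "see_doctor_24_48h"},
    {"physician": "emergency_care", "model": "emergency_care"},
    {"physician": "emergency_care", "model": "see_doctor_24_48h"},
    {"physician": "self_care", "model": "self_care"},
]
print(f"{under_triage_rate(sample):.1%}")  # 66.7% in this toy example
```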

In one asthma scenario, ChatGPT Health explicitly noted “early warning signs of respiratory failure” in its explanation - then advised waiting rather than emergency treatment. The system saw the danger and dismissed it anyway.

Suicide Safeguard Failures

The study found even more alarming patterns in the system’s suicide-crisis protocols.

ChatGPT Health’s safety alerts appeared inconsistently. They sometimes triggered in lower-risk scenarios while failing to appear when users described specific plans for self-harm. The inverse relationship - more serious disclosures producing fewer safety warnings - suggests the safeguards weren’t calibrated for real-world crisis conversations.

Why This Matters Now

OpenAI launched ChatGPT Health as a consumer product, not a regulated medical device. There’s no FDA oversight, no clinical validation requirement, and no liability when guidance goes wrong.

The 40 million daily users aren’t doctors running scenarios - they’re people with real symptoms deciding whether to go to the hospital or stay home. When a chatbot sees respiratory failure warning signs and says wait, some percentage of those users will wait.

The Mount Sinai researchers were direct in their recommendation: for chest pain, shortness of breath, severe allergic reactions, mental status changes, or thoughts of self-harm, contact medical services or the 988 Suicide and Crisis Lifeline directly. Don’t rely on a chatbot.

The Broader Pattern

This study follows earlier research showing AI chatbots perform well on medical exam questions but poorly with real patients. The problem isn't knowledge - ChatGPT Health clearly knows what respiratory failure looks like. The problem is converting that knowledge into appropriate action recommendations when the stakes are high.

OpenAI has not publicly responded to the study's findings. Whether the company adjusts ChatGPT Health's triage logic remains to be seen.

For now, the safest approach is treating AI health guidance the same way you’d treat advice from a knowledgeable friend: useful for general information, not trustworthy when symptoms might be serious.