DeepRare Outperforms Doctors at Diagnosing Rare Diseases, Nature Study Shows

New agentic AI system from Shanghai Jiao Tong University identifies rare diseases more accurately than human specialists in tests spanning 6,401 cases.

Researchers at Shanghai Jiao Tong University have built an AI system that diagnoses rare diseases more accurately than human specialists, according to a study published this week in Nature. The system, called DeepRare, ranked the correct diagnosis as its top suggestion 64.4% of the time, compared to 54.6% for a panel of five rare disease specialists.

The work addresses a genuine problem. Rare diseases affect more than 300 million people worldwide, and patients typically endure what clinicians call a “diagnostic odyssey” - an average of 4.26 years of repeated referrals, misdiagnoses, and unnecessary interventions before receiving a correct diagnosis. A survey by the China Alliance for Rare Diseases found that 42% of over 20,000 patients had been previously misdiagnosed.

How It Works

Unlike conventional medical AI tools that pattern-match symptoms to diseases, DeepRare operates as an “agentic” system. It mimics how experienced doctors actually reason through difficult cases: forming hypotheses, testing them against evidence, revising conclusions, and ranking possible diagnoses.

The system uses DeepSeek-V3 as its language model and integrates more than 40 specialized tools and data sources. It processes free-text clinical descriptions, structured phenotype terms, and genetic testing results. Every diagnostic conclusion comes with a traceable evidence chain - doctors can see not just what the diagnosis is, but the specific reasoning and sources that led there.

“Unlike conventional AI, which relies on rapid pattern matching, DeepRare mimics the ‘slow thinking’ of human doctors,” the research team said. “Every diagnostic conclusion generated by the system is traceable and accompanied by a clear and complete evidence chain.”
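
To make the hypothesize-test-revise pattern concrete, here is a minimal, hypothetical Python sketch. It is not DeepRare's code: the toy disease knowledge, the `lookup_evidence` stub, and the scoring rule are placeholder assumptions standing in for the system's 40-plus tools and LLM-driven reasoning. Only the overall shape - propose candidate diagnoses, gather evidence for each, rescore, and return a ranked list with its evidence chain - mirrors the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    disease: str
    score: float                                        # confidence after evidence review
    evidence: list[str] = field(default_factory=list)   # traceable supporting statements


def lookup_evidence(disease: str, phenotypes: list[str]) -> list[str]:
    """Stand-in for specialized tools (literature, phenotype and variant databases).
    A toy lookup table so the sketch runs on its own; not real clinical knowledge."""
    toy_knowledge = {
        "Fabry disease": {"angiokeratoma", "acroparesthesia", "proteinuria"},
        "Gaucher disease": {"splenomegaly", "bone pain", "thrombocytopenia"},
    }
    matched = toy_knowledge.get(disease, set()) & set(phenotypes)
    return [f"{p} is a reported feature of {disease}" for p in matched]


def diagnose(phenotypes: list[str], candidates: list[str]) -> list[Hypothesis]:
    """Hypothesize-test-revise loop: propose candidates, gather evidence for each,
    rescore, and return a ranked differential with its evidence chain."""
    hypotheses = [Hypothesis(disease=c, score=0.0) for c in candidates]
    for h in hypotheses:
        h.evidence = lookup_evidence(h.disease, phenotypes)   # "test" against evidence
        h.score = len(h.evidence) / max(len(phenotypes), 1)   # "revise" the ranking score
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)


if __name__ == "__main__":
    ranked = diagnose(
        phenotypes=["angiokeratoma", "acroparesthesia", "splenomegaly"],
        candidates=["Fabry disease", "Gaucher disease"],
    )
    for h in ranked:
        print(f"{h.disease}: score={h.score:.2f}, evidence={h.evidence}")
```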

The Numbers

The research team evaluated DeepRare across 6,401 clinical cases spanning 2,919 rare diseases, using nine different datasets - seven public benchmarks and two in-house clinical datasets from Xinhua Hospital and Hunan Hospital in China.

Using clinical phenotype information alone (no genetic data), DeepRare achieved a top-ranked diagnostic accuracy (Recall@1) of 57.18%. That may not sound impressive in isolation, but it outperformed Claude-3.7-Sonnet-thinking by 23.79 percentage points and showed consistent results across 14 medical specialties.
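
Recall@1 simply means the share of cases in which the correct diagnosis appears first in the system's ranked output; Recall@5, used later in the study, counts it anywhere in the top five. A small illustration of the metric, using made-up case data rather than anything from the paper:

```python
def recall_at_k(predictions: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of cases whose true diagnosis appears in the top-k ranked predictions."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(predictions, truths))
    return hits / len(truths)


# Toy example: three cases, each with a ranked list of candidate diagnoses.
predictions = [
    ["Fabry disease", "Gaucher disease"],    # correct diagnosis at rank 1
    ["Pompe disease", "Wilson disease"],     # correct diagnosis at rank 2
    ["Marfan syndrome", "Ehlers-Danlos"],    # correct diagnosis missing entirely
]
truths = ["Fabry disease", "Wilson disease", "Alport syndrome"]

print(recall_at_k(predictions, truths, k=1))  # 0.33: only the first case hits at rank 1
print(recall_at_k(predictions, truths, k=2))  # 0.67: two of three hit within the top 2
```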

Adding genetic sequencing data improved accuracy substantially. On the Hunan Hospital dataset, Recall@1 rose from 33.3% to 63.6%. On the Xinhua Hospital dataset, it went from 39.9% to 69.1%.

Against the widely used bioinformatics tool Exomiser, DeepRare performed notably better: 63.6% versus 58.0% on the Hunan data, and 69.1% versus 55.9% on the Xinhua data.

In head-to-head comparisons with five rare disease specialists, DeepRare’s top-five accuracy (Recall@5) reached 78.5%, compared to 65.6% for the specialists.

What This Means

The practical application is straightforward. DeepRare could serve as a screening and decision-support tool, particularly in primary care settings and hospitals without rare disease specialists. An online diagnostic platform has been operating since July 2025, with more than 600 medical institutions worldwide now registered. The researchers are also building a global rare disease diagnostic alliance and running a validation study using 20,000 real-world cases.

For patients stuck in years-long diagnostic journeys, a tool that gets the right answer in its top five suggestions nearly 80% of the time could dramatically shorten the path to treatment.

The Fine Print

The study has meaningful limitations that the authors acknowledge. All evaluations were retrospective - the researchers tested the system on existing clinical records, not in real-time clinical settings. A prospective study in which DeepRare guides actual diagnostic decisions has not yet been conducted.

The system occasionally produced “hallucinated or irrelevant citations” in its reasoning chains. While the overall accuracy numbers are strong, these errors in the evidence trail could undermine physician trust if they appear during clinical use.

The clinical validation data came from two hospitals in China. How well the system performs across different populations, healthcare systems, and clinical documentation styles remains untested.

Most importantly, high diagnostic accuracy in a research setting does not automatically translate to better patient outcomes. The tool still needs physicians to act on its suggestions, and the gap between “AI suggests the right diagnosis” and “patient receives the right treatment faster” is wider than it appears.

Still, for a field where the average patient waits more than four years for a correct diagnosis, even an imperfect AI screening tool represents a concrete improvement over the status quo.