AI Chatbots Analyzed Medical Data Faster Than Human Teams, UCSF Study Finds

Half of the tested AI tools produced prediction models that matched or beat human research teams. A master's student and a high schooler built working code in minutes.

A team at UC San Francisco gave eight AI chatbots the same medical data analysis tasks that teams of human researchers had spent months completing. Four of the AI tools produced prediction models that performed as well as the human teams, and some outperformed them.

The study, published February 17 in Cell Reports Medicine, tested whether large language models could analyze complex biological datasets to predict preterm birth, a condition that affects roughly one in ten pregnancies.

What They Actually Did

The researchers used data from the March of Dimes Preterm Birth Data Repository, which contains vaginal microbiome samples, blood samples, and placental tissue data from over 1,200 pregnant women across nine studies. This same dataset had been used in the DREAM Challenge, where teams of researchers worldwide competed to build the best preterm birth prediction models.

The AI tools were given three tasks:

  • Analyze vaginal microbiome data to identify preterm birth indicators
  • Analyze blood samples to estimate pregnancy stage
  • Analyze placental tissue samples to determine pregnancy stage

Researchers provided the AI systems with carefully written natural language prompts describing exactly what they wanted, then ran the AI-generated code against the datasets.
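The paper doesn't reproduce the generated code, but a minimal sketch of the kind of pipeline such a prompt might yield for the first task, classifying preterm versus term birth from microbiome abundances, could look like the following. The synthetic data, feature counts, and random-forest model here are illustrative assumptions, not the study's actual output.

```python
# Illustrative sketch only: a stand-in for the kind of classifier code
# the AI tools were prompted to generate. Data below is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_taxa = 300, 50

# Synthetic relative-abundance matrix (rows: samples, columns: taxa).
X = rng.dirichlet(np.ones(n_taxa), size=n_samples)
# Synthetic labels (1 = preterm, 0 = term), tied to one taxon for demo purposes.
y = (X[:, 0] > np.median(X[:, 0])).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Held-out AUC is the usual success metric for this kind of prediction task.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"held-out AUC: {auc:.2f}")
```

The real datasets involve multiple studies and data types, so the actual generated pipelines would be considerably more involved; this only shows the shape of the task.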

Results: Mixed But Notable

Half the chatbots failed to produce usable code at all. Of the four that worked, some matched human team performance and others exceeded it.

Perhaps more striking than the raw performance: a research pair consisting of UCSF master’s student Reuben Sarwal and high school student Victor Tarca built viable prediction models with AI assistance, generating working code in minutes, a task that would typically take experienced programmers hours or days.

The entire AI project went from inception to journal submission in six months. The original DREAM Challenge took nearly two years to compile results and publish.

“These AI tools could relieve one of the biggest bottlenecks in data science: building our analysis pipelines,” said Marina Sirota, co-senior author and interim director of the Bakar Computational Health Sciences Institute at UCSF.

What This Means

The findings suggest a potential shift in how medical research data gets analyzed. If AI tools can handle the initial coding work, researchers can spend less time on technical implementation and more time designing experiments and interpreting results.

For resource-constrained research teams and institutions without large computational staff, AI assistance could level the playing field. That a master’s student and a high schooler produced competitive results suggests domain expertise can matter more than coding experience when AI handles the implementation.

The speed improvement matters too. In medical research, faster analysis means faster insights. For conditions like preterm birth, where early intervention can save lives, that timeline compression has real clinical implications.

The Fine Print

The 50% failure rate deserves attention. Half the AI tools produced nothing usable. The paper doesn’t name which specific models were tested, but the wide variance in performance means researchers can’t simply pick any chatbot and expect results.

The study authors also emphasize that AI doesn’t replace human expertise. “Scientists still need to be on guard for misleading results,” Sirota noted. The AI generates code based on prompts, but interpreting what that code produces and determining whether the results make biological sense still requires trained researchers.

There’s also the question of what happens when AI-generated code contains subtle errors. A human programmer might catch a conceptual mistake while writing code. AI generates code all at once, and errors might not become apparent until the results look wrong, if they look wrong at all.
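One classic example of the kind of subtle error that "looks right": selecting features on the entire dataset before cross-validation, which leaks test information into training and quietly inflates scores. The sketch below is hypothetical, not code from the study, and uses pure-noise data so any apparent signal is an artifact of the leak.

```python
# Demonstration of a subtle data-leakage bug that produces plausible-looking
# results. Features and labels are pure noise, so honest AUC should be ~0.5.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # noise features (no real signal)
y = rng.integers(0, 2, size=100)   # random labels

# Subtle bug: picking the "most predictive" features on ALL data first.
# The code runs cleanly and the score looks impressive.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y,
                        cv=5, scoring="roc_auc").mean()

# Correct version: feature selection happens inside each CV fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

print(f"leaky AUC:  {leaky:.2f}")   # inflated despite random labels
print(f"honest AUC: {honest:.2f}")  # near chance
```

Nothing in the buggy version crashes or warns; only a reviewer who knows to look for leakage would catch it, which is exactly the kind of vigilance the authors say human researchers still need to supply.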

The study tested a specific, well-defined task with a standardized dataset and clear success metrics. Whether these results generalize to messier, less structured research problems remains to be seen.

Still, as a proof of concept, the results are significant. AI assistance in medical data analysis isn’t theoretical anymore. It’s producing competitive results faster than traditional approaches, at least for some tasks.