If you’ve ever asked ChatGPT to summarize research papers with citations, you’ve probably noticed the citations don’t always… exist. GPT-4o fabricates research citations 78-90% of the time, according to a study published this month in Nature.
OpenScholar, developed by researchers at the Allen Institute for AI and the University of Washington, does the opposite. It matches the citation accuracy of human experts while outperforming GPT-4o, Meta’s models, and other state-of-the-art systems on scientific synthesis tasks.
The best part: everything is open source. The models, the code, and the 45-million-paper corpus are all free to download and run on your own hardware.
The Citation Problem
Large language models are remarkably good at sounding authoritative while making things up. This is particularly damaging in scientific contexts, where incorrect citations can propagate misinformation through the research community.
When researchers tested GPT-4o on scientific literature review tasks, between 78% and 90% of its citations were fabricated. The model would confidently cite papers that don’t exist, or attribute findings to papers that don’t contain them.
OpenScholar takes a fundamentally different approach. Instead of generating plausible-sounding citations from its training data, it uses retrieval-augmented generation (RAG) to search a corpus of 45 million open-access scientific papers, find relevant passages, and cite actual sources for its claims.
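To make the RAG idea concrete, here is a toy sketch of the pattern: retrieve passages from a corpus, then build an answer that cites only what was actually retrieved. The word-overlap scoring, tiny corpus, and answer format are illustrative stand-ins, not OpenScholar's actual retriever or generator.

```python
def retrieve(query, corpus, k=2):
    """Rank passages by word overlap with the query (a crude
    stand-in for a learned dense retriever)."""
    q = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q & set(doc["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer_with_citations(query, corpus):
    """Build an answer that can only cite retrieved passages,
    so every citation points at a real corpus entry."""
    hits = retrieve(query, corpus)
    cites = ", ".join(f"[{h['id']}]" for h in hits)
    return f"Based on {cites}: " + " ".join(h["text"] for h in hits)

corpus = [
    {"id": "arXiv:1", "text": "retrieval improves citation accuracy"},
    {"id": "arXiv:2", "text": "large models hallucinate citations"},
    {"id": "arXiv:3", "text": "protein folding with deep learning"},
]
print(answer_with_citations("citation accuracy in large models", corpus))
```

The key property, which survives at OpenScholar's scale, is that citations are selected from retrieved documents rather than generated from the model's parameters, so a cited paper always exists.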
How It Performs
The researchers created ScholarQABench, a benchmark with 3,000 queries and 250 long-form, expert-written answers across computer science, physics, biomedicine, and neuroscience.
The results:
- OpenScholar answered 51% of computer science questions correctly vs. 45% for GPT-4o
- Scientists preferred OpenScholar responses to human-written answers 51% of the time
- When OpenScholar’s citation methods were combined with GPT-4o, scientists preferred AI answers 70% of the time (compared to 32% for GPT-4o alone)
“Scientists see so many papers coming out every day that it’s impossible to keep up,” said lead researcher Akari Asai. “But existing AI systems weren’t designed for scientists’ specific needs.”
What Makes It Different
OpenScholar pairs a model trained specifically for scientific synthesis with RAG. When you ask it a question, it:
- Searches a corpus of 45 million open-access papers
- Identifies relevant passages with full-text snippet indexing
- Ranks and filters the retrieved content
- Synthesizes an answer with inline citations to actual papers
- Integrates with the Semantic Scholar API for real-time paper access
This means it can incorporate research published after the model was trained, something traditional LLMs can’t do without fine-tuning.
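The real-time paper lookup can be sketched against the Semantic Scholar Graph API’s public paper-search endpoint. The endpoint and field names below come from that public API; the helper functions themselves are illustrative, not OpenScholar’s own integration code.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Public paper-search endpoint of the Semantic Scholar Graph API.
BASE = "https://api.semanticscholar.org/graph/v1/paper/search"

def search_url(query, limit=5):
    """Build a paper-search request URL for the given query."""
    params = {
        "query": query,
        "limit": limit,
        "fields": "title,year,externalIds",  # metadata to return
    }
    return f"{BASE}?{urlencode(params)}"

def search_papers(query, limit=5):
    """Fetch matching papers (requires network access)."""
    with urllib.request.urlopen(search_url(query, limit)) as resp:
        return json.load(resp)["data"]

print(search_url("retrieval augmented generation"))
```

Because retrieval happens at query time, a paper published yesterday is citable today, with no retraining involved.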
Self-Hosting Options
Everything is publicly available:
- Models: Available on Hugging Face (OpenScholar-8B is the main model)
- Code: Open source on GitHub
- Demo: Try it free at openscholar.allen.ai
The 8B-parameter model is small enough to run on consumer hardware with a decent GPU, though performance scales with available memory and compute. For researchers who need offline access or have data-sensitivity requirements, self-hosting means your queries never leave your infrastructure.
What This Means
This is a meaningful win for open science. A specialized open-source model is outperforming one of the most capable commercial LLMs on a task that matters: accurate scientific synthesis with verifiable citations.
It also demonstrates that bigger isn’t always better. OpenScholar’s 8B model beats GPT-4o by being designed for a specific task rather than trying to be good at everything. The retrieval-augmented approach means the model doesn’t need to memorize every paper; it just needs to know how to find and synthesize relevant ones.
For researchers in regulated industries or institutions with strict data policies, this is especially relevant. You can run literature reviews without sending your research questions through commercial API endpoints.
What’s Next
The team is developing DR Tulu (Deep Research Tulu), which extends the approach with multi-step search for more comprehensive long-form research reports. They’re also continuing to expand the paper corpus and improve retrieval accuracy.
If you work with scientific literature and have been frustrated by LLM hallucinations, try the demo. And if you need to run it yourself, the code is waiting.