Every AI chatbot lies to you. The question is how often, how confidently, and whether it’ll at least have the decency to admit when it doesn’t know something.
With ChatGPT, Claude, and Gemini all charging the same $20 a month at their standard tier, you’d think the choice would come down to features or vibes. But there’s a more fundamental question: when you ask these things for facts, which one is most likely to give you real ones?
We dug through the latest benchmarks, independent tests, and hallucination leaderboards from early 2026 to find out. The answer is more complicated, and more interesting, than a single winner.
What “Hallucination” Actually Means
Before the scorecards, a quick clarification. When researchers measure hallucination, they’re typically testing one specific thing: give the model a document, ask it to summarize, then check whether the summary contains information that wasn’t in the original.
The Vectara Hughes Hallucination Evaluation Model (HHEM) leaderboard is the industry standard for this. It’s narrow by design — summarization accuracy is just one slice of “making stuff up” — but it’s the most consistent benchmark we have.
The problem is that real-world hallucination looks different from benchmark hallucination. You don’t usually ask ChatGPT to summarize a document you already have. You ask it questions you don’t know the answer to. And that’s where things get messy.
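To make that protocol concrete, here's a toy version of the evaluation loop. The word-overlap judge below is our own deliberately crude stand-in; Vectara's leaderboard uses a trained entailment model (HHEM) as the judge, but the bookkeeping is the same.

```python
from typing import Callable
import string

def hallucination_rate(
    pairs: list[tuple[str, str]],              # (source document, model summary)
    is_supported: Callable[[str, str], bool],  # judge: is the summary grounded?
) -> float:
    """Fraction of summaries the judge flags as unsupported by their source."""
    flagged = sum(1 for src, summary in pairs if not is_supported(src, summary))
    return flagged / len(pairs)

def _words(text: str) -> set[str]:
    return {w.strip(string.punctuation) for w in text.lower().split()}

# Deliberately crude stand-in judge: flags any summary that introduces words
# absent from the source. Real leaderboards use a trained entailment model.
def naive_judge(source: str, summary: str) -> bool:
    return _words(summary) <= _words(source)

pairs = [
    ("The report covers Q3 revenue of 4 million dollars.",
     "The report covers Q3 revenue."),                     # grounded
    ("The report covers Q3 revenue of 4 million dollars.",
     "Q3 revenue grew 12 percent to 4 million dollars."),  # invents the growth figure
]
print(f"hallucination rate: {hallucination_rate(pairs, naive_judge):.0%}")  # 50%
```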
The Benchmark Numbers
Here’s where the flagship models from all three labs land on Vectara’s updated dataset as of March 2026, sorted best to worst:

| Model | Vectara HHEM, updated dataset (lower is better) | Notes |
|---|---|---|
| Gemini 3.1 Pro | 10.4% | Best of the three labs |
| Claude Sonnet 4.6 | 10.6% | Close second |
| GPT-5.2 (xhigh) | 10.8% | Close third |
| Claude Opus 4.6 | 12.2% | Highest rate of the four |
On the original Vectara dataset, the numbers looked dramatically better: GPT-5 scored 1.4%, and Gemini 2.0 Flash hit 0.7%, the lowest of any model ever tested. But researchers updated the benchmark because models had essentially memorized the old test set.
On the harder, updated evaluation, no model on the leaderboard scores below 4%, and these flagships all land above 10%. The spread between the three labs is under two percentage points. That’s close enough that you shouldn’t pick a model based on this number alone.
Where It Gets Interesting: The Refusal Strategy
Raw hallucination rates miss something important: how a model handles uncertainty.
Claude’s approach is distinctive. Rather than confidently generating an answer it isn’t sure about, Claude is calibrated to refuse: to say “I don’t know” or flag its uncertainty. On the AA-Omniscience benchmark, the earlier Claude 4.1 Opus posted a 0% hallucination rate by refusing to answer rather than guessing. The current Opus 4.6 hits 46.4% accuracy on the same test, with substantially fewer fabrications than models that attempt every question.
Gemini takes the opposite approach. It answers everything, and on AA-Omniscience it leads with 55.3% accuracy. More answers means more chances to be right — and more chances to be wrong.
ChatGPT splits the difference. GPT-5.2 hits 43.8% accuracy on AA-Omniscience, willing to attempt most questions but occasionally hedging.
The practical takeaway: If a wrong answer is worse than no answer — legal research, medical questions, anything with real consequences — Claude’s refusal strategy is structurally safer. If you’d rather get a best-guess answer you can verify yourself, Gemini’s aggressiveness gives you more to work with.
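You can see the shape of this tradeoff in a toy simulation (illustrative numbers only, not measured data): give every question a confidence score, then compare a model that answers everything against one that abstains whenever confidence drops below a threshold.

```python
import random

random.seed(0)

# Illustrative numbers only: each question gets a confidence score, and the
# chance of being correct tracks that confidence (a perfectly calibrated model).
def simulate(n: int = 10_000) -> list[tuple[float, bool]]:
    return [(c, random.random() < c) for c in (random.random() for _ in range(n))]

def score(questions: list[tuple[float, bool]], abstain_below: float) -> dict:
    attempted = [(c, ok) for c, ok in questions if c >= abstain_below]
    wrong = sum(1 for _, ok in attempted if not ok)
    right = sum(1 for _, ok in attempted if ok)
    n = len(questions)
    return {
        "answered": f"{len(attempted) / n:.0%}",
        "accuracy": f"{right / n:.0%}",       # refusals count against accuracy
        "fabrications": f"{wrong / n:.0%}",   # confidently wrong answers
    }

qs = simulate()
print("answer everything:  ", score(qs, abstain_below=0.0))
print("refuse when unsure: ", score(qs, abstain_below=0.6))
```

In this toy setup, answering everything wins on raw accuracy but is confidently wrong about six times as often. That's the Gemini-versus-Claude trade on AA-Omniscience in miniature.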
The Citation Problem
Here’s where things get genuinely alarming. The Columbia Journalism Review tested whether AI models correctly attribute information to the sources they cite. In other words: when an AI says “according to The New York Times,” did The New York Times actually say that?
Perplexity scored best with 37% of its citation responses being incorrect. Best. Meaning even the top performer returned wrong answers — fabricated or misattributed citations — more than a third of the time. Grok-3 scored worst at 94%.
This is the hallucination problem that actually matters for daily use. When ChatGPT gives you a URL, there’s a non-trivial chance it doesn’t exist. When Gemini quotes a study, the study might say something completely different. When any of these tools attribute a claim to a source, you need to verify it yourself.
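Some of that verification can be automated. Here's a minimal sketch of the two cheapest checks, assuming the chatbot handed you a URL and a direct quote; the helper function is ours, not a library API.

```python
import requests  # third-party: pip install requests

def verify_citation(url: str, quote: str, timeout: float = 10.0) -> str:
    """Two cheap checks: does the cited URL resolve, and does the quote appear?
    Passing both doesn't prove the claim; failing either is a red flag.
    Note: matching against raw HTML is naive and misses text split by markup."""
    try:
        resp = requests.get(url, timeout=timeout,
                            headers={"User-Agent": "citation-checker/0.1"})
    except requests.RequestException:
        return "dead link: request failed"
    if resp.status_code >= 400:
        return f"dead link: HTTP {resp.status_code}"
    if quote.lower() not in resp.text.lower():
        return "page exists, but the quoted text is not on it"
    return "quote found at the cited URL"

print(verify_citation("https://example.com", "illustrative examples"))
```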
Domain-Specific Accuracy: Where Models Break Down
The sub-1% hallucination headlines obscure a harder truth. Even the best models’ accuracy collapses in specialized domains:
- Legal questions: 18.7% hallucination rate among top models
- Medical queries: 15.6% hallucination rate
- General knowledge: 0.8% hallucination rate
The models are good at the easy stuff and unreliable at the hard stuff. Which is exactly backwards from what most people need — nobody uses ChatGPT to answer questions they already know the answer to.
A 2026 benchmark across 37 models reported hallucination rates between 15% and 52% on challenging knowledge tasks. Even frontier models aren’t immune.
The Reasoning Gap
Beyond hallucination, there’s a broader accuracy picture from April 2026 benchmarks:
| Benchmark | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| MMLU-Pro (general knowledge) | 91.4% | 92.1% | 91.7% |
| GPQA Diamond (graduate reasoning) | 87.4% | 83.9% | 94.3% |
| Long-context accuracy (1M tokens) | 97.2% | 94.6% | 91.4% |
General knowledge is a wash — all three score above 91%. The gaps open up in specialized reasoning and long-document tasks.
Claude leads on long-context accuracy, which matters if you’re feeding it entire codebases or lengthy research papers. Gemini leads on graduate-level reasoning benchmarks by a significant margin. GPT-5.4 sits in the middle on most measures.
What This Means for You
The “which AI lies least” question doesn’t have a clean answer because they lie differently:
Claude makes up fewer things overall and is more likely to tell you when it doesn’t know. But it’s also more likely to leave you without an answer at all. Best for: legal research, medical information, anything where a confident wrong answer could cause harm.
Gemini attempts more answers and has the best real-time information thanks to native search integration. But it’s also more willing to fabricate when it’s unsure. Best for: current events, time-sensitive research, tasks where you can verify its claims.
ChatGPT is the most consistent across task types, with the fewest dramatic failures in either direction. It won’t refuse as often as Claude, and it won’t hallucinate as boldly as Gemini’s worst cases. Best for: general-purpose work where you need reliable-enough answers across many domains.
The Uncomfortable Bottom Line
All three models still hallucinate at rates that should make you uncomfortable if you’re relying on them for anything important. Even the best score on the updated summarization benchmark is above 4%, and these flagships all sit above 10%. On hard domain questions, error rates balloon to 15-20%.
None of these models should be your sole source of truth for factual claims. Not Claude, not Gemini, not ChatGPT. The differences between them are real but small enough that your verification habits matter more than your model choice.
The most accurate AI setup in 2026 isn’t picking the right model. It’s picking any good model and checking its work.
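In practice, “checking its work” can be as simple as routing one model's answer through a second model for an audit. Here's a sketch using the OpenAI and Anthropic Python SDKs; the model names are placeholders for whatever is current.

```python
# Sketch: ask one model, have a second audit the answer. The model names are
# placeholders; swap in whatever is current. pip install openai anthropic
from openai import OpenAI
import anthropic

openai_client = OpenAI()               # reads OPENAI_API_KEY from the env
claude_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

def ask(question: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-5.2",  # placeholder model name
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def audit(question: str, answer: str) -> str:
    resp = claude_client.messages.create(
        model="claude-opus-4-6",  # placeholder model name
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (f"Question: {question}\n\nProposed answer: {answer}\n\n"
                        "List any factual claims in the answer that you cannot "
                        "verify or believe are wrong. Say so if you're unsure."),
        }],
    )
    return resp.content[0].text

question = "What did the CJR study find about AI citation accuracy?"
print(audit(question, ask(question)))
```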