Cloud OCR services charge per page and send your documents to someone else’s servers. If you’re processing sensitive contracts, medical records, or confidential business documents, that’s a problem. Local OCR solves both issues — you keep your data and your money.
But which local OCR tool actually works? Vision-language models now compete with traditional approaches, and the old standby Tesseract faces serious competition.
The Contenders
We’re testing four open-source OCR tools that represent different approaches to text recognition:
Tesseract 5.x: The venerable open-source standard. It began at Hewlett-Packard in the 1980s, was open-sourced in 2005, and was sponsored by Google for years afterward. Still actively maintained. Works on CPU with minimal setup.
PaddleOCR-VL-1.5: Baidu’s vision-language model approach, released January 2026. Combines traditional OCR with document understanding capabilities.
Surya: A modern, transformer-based OCR system from Datalab supporting 90+ languages. Focused on layout analysis and reading order.
OlmOCR-2: Allen AI’s document parsing toolkit designed for linearizing PDFs for LLM training. Released October 2025, targets complex academic documents.
Benchmark Results
The figures below draw on the olmOCR-Bench test suite (7,000+ test cases across 1,400 documents) plus community benchmarks from Modal and CodeSOTA:
| Model | Overall Accuracy | Speed (pages/sec) | VRAM Required | Handwriting |
|---|---|---|---|---|
| OlmOCR-2 | 82.4% | 1.78 | 16GB | Poor |
| PaddleOCR-VL-1.5 | 94.5%* | 2.1 | 8GB | Fair |
| Surya | 97.4% | 3.2 | 4GB | 87.2% |
| Tesseract 5.x | 72-85% | 8.4 | <1GB | Poor |
*PaddleOCR-VL-1.5 score is on OmniDocBench v1.5, which uses a different methodology than olmOCR-Bench, so it is not directly comparable with the other rows.
Tesseract remains the fastest option by a wide margin: at 8.4 pages/sec it processes pages roughly 4.7 times faster than OlmOCR-2 at 1.78 pages/sec. But accuracy suffers, particularly on complex layouts.
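To make the throughput column concrete, here is a minimal sketch that converts the table’s pages/sec figures into wall-clock time for a batch. The 10,000-page batch size is a hypothetical number chosen for illustration, and the calculation assumes a steady rate with no startup cost.

```python
# Throughput figures (pages/sec) copied from the benchmark table above.
PAGES_PER_SEC = {
    "OlmOCR-2": 1.78,
    "PaddleOCR-VL-1.5": 2.1,
    "Surya": 3.2,
    "Tesseract 5.x": 8.4,
}


def minutes_for(pages: int, pages_per_sec: float) -> float:
    """Wall-clock minutes to process `pages` at a steady rate."""
    return pages / pages_per_sec / 60


def speedup(fast: str, slow: str) -> float:
    """How many times faster `fast` is than `slow`."""
    return PAGES_PER_SEC[fast] / PAGES_PER_SEC[slow]


if __name__ == "__main__":
    batch = 10_000  # hypothetical batch size, for illustration only
    for tool, rate in PAGES_PER_SEC.items():
        print(f"{tool:18s} {minutes_for(batch, rate):6.1f} min")
    ratio = speedup("Tesseract 5.x", "OlmOCR-2")
    print(f"Tesseract vs OlmOCR-2: {ratio:.1f}x")
```

At these rates the hypothetical batch takes Tesseract about 20 minutes and OlmOCR-2 about 94 minutes, which is the same 4.7x gap expressed in time.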
What Each Tool Does Best
Tesseract: The Practical Baseline
If your documents are clean scans with straightforward layouts — single column text, clear fonts, minimal noise — Tesseract still works. Initialization takes under 0.3 seconds compared to 4+ seconds for deep learning alternatives. On a CPU-only machine, it’s often the only realistic option.
The catch: anything beyond basic single-column documents causes problems. Tables get scrambled. Multi-column layouts produce word salad. Handwriting? Forget it.
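For the clean, single-column case, a thin wrapper around the Tesseract command line is often enough. The sketch below assumes the tesseract binary is installed and on PATH; the function names and the scan.png file name are hypothetical. It uses page segmentation mode 6, which tells Tesseract to treat the page as a single uniform block of text.

```python
import shutil
import subprocess


def tesseract_command(image_path: str, lang: str = "eng", psm: int = 6) -> list[str]:
    """Build a Tesseract CLI call that prints recognized text to stdout."""
    # Passing "stdout" as the output base makes Tesseract print the text
    # instead of writing a .txt file. --psm 6 assumes one uniform text block.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path: str, lang: str = "eng", psm: int = 6) -> str:
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # "scan.png" is a placeholder; this only prints the command that would run.
    print(" ".join(tesseract_command("scan.png")))
```

Calling ocr_image("scan.png") on a real file returns the page text as a string. For multi-column pages, changing the mode will not rescue the output reliably, which is where the layout-aware tools below come in.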
Surya: The Balanced Choice
Surya hits a sweet spot for most real-world use cases. Its 97.4% accuracy on typed text rivals commercial services, and it handles 90+ languages out of the box. The 4GB VRAM requirement means it runs on consumer GPUs.
The standout feature is layout analysis. Surya identifies headers, paragraphs, tables, and figures, then preserves reading order. A two-column academic paper comes out as coherent text, not alternating fragments.
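Why reading order matters is easiest to see with a toy example. The sketch below is not how Surya works internally; it is a hand-written illustration with made-up bounding boxes showing how naive top-to-bottom sorting interleaves two columns, while a column-aware ordering keeps each column intact.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def naive_order(boxes):
    """Sort by vertical position, then horizontal: interleaves columns."""
    return sorted(boxes, key=lambda b: (b.y0, b.x0))


def two_column_order(boxes, page_width):
    """Read the left column top to bottom, then the right column."""
    mid = page_width / 2
    left = [b for b in boxes if (b.x0 + b.x1) / 2 < mid]
    right = [b for b in boxes if (b.x0 + b.x1) / 2 >= mid]
    return sorted(left, key=lambda b: b.y0) + sorted(right, key=lambda b: b.y0)


# Made-up layout: two lines in each column of a 600-unit-wide page.
PAGE_WIDTH = 600
boxes = [
    Box(40, 100, 280, 120, "L1"),
    Box(320, 100, 560, 120, "R1"),
    Box(40, 130, 280, 150, "L2"),
    Box(320, 130, 560, 150, "R2"),
]

if __name__ == "__main__":
    print([b.text for b in naive_order(boxes)])                   # L1, R1, L2, R2
    print([b.text for b in two_column_order(boxes, PAGE_WIDTH)])  # L1, L2, R1, R2
```

Real pages have headers that span both columns, figures, and footnotes, so a fixed midpoint split breaks down quickly. That is the gap a dedicated layout model is meant to fill.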
Handwriting recognition hits 87.2% accuracy — usable but not reliable for critical documents. If you’re digitizing handwritten notes, expect to correct errors.
PaddleOCR-VL-1.5: The Document Understanding Engine
Baidu’s PP-OCRv5 engine provides about 13% higher accuracy than previous versions on multilingual documents. The vision-language architecture understands document structure semantically, not just visually.
The January 2026 update added better handling for mixed-language documents — useful for contracts or academic papers with foreign citations. On the OmniDocBench benchmark, it reached 94.5% accuracy for document parsing tasks.
Downsides: Baidu’s documentation is partly in Chinese, and the model takes longer to initialize than alternatives.
OlmOCR-2: The Academic Paper Specialist
Allen AI built OlmOCR-2 specifically for linearizing PDFs into LLM training data. It excels at the dense, complex layouts found in research papers — equations, footnotes, bibliographies, multi-column text with embedded figures.
The October 2025 v0.4.0 release raised benchmark scores by about 4 points through synthetic data and reinforcement learning. At 82.4 on olmOCR-Bench, it’s not the highest-accuracy option, but its output preserves academic document structure better than competitors.
Processing speed of 1.78 pages/sec reflects the model’s complexity. This isn’t a tool for bulk processing invoices — it’s for extracting clean text from research papers that would defeat simpler approaches.
The Privacy Advantage
Running OCR locally means your documents never leave your machine. For legal firms processing client contracts, healthcare organizations handling patient records, or anyone working with confidential business documents, this matters more than benchmark numbers.
Commercial OCR services from Google, Amazon, and Microsoft all process your documents on their servers. Most comply with SOC 2 and similar certifications, but compliance isn’t privacy. Your data still transits to and resides on third-party infrastructure.
Local processing eliminates that exposure entirely. The tradeoff is managing your own infrastructure and accepting whatever accuracy your chosen tool provides.
Setup Requirements
All four tools run on Linux (most common), macOS (with some limitations), and Windows (varying support).
Tesseract requires the least: apt install tesseract-ocr on Debian/Ubuntu. No GPU needed. Works anywhere.
Surya needs Python 3.9+ and a CUDA-capable GPU with 4GB+ VRAM. pip install surya-ocr, then download model weights (~2GB).
PaddleOCR requires Python 3.7+ and ideally a GPU. pip install paddlepaddle paddleocr installs the base version; VL-1.5 requires additional setup.
OlmOCR-2 wants Python 3.10+, 16GB+ VRAM, and patience. Installation involves cloning the repo and managing dependencies manually.
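The requirements above can be turned into a quick preflight check. This is a sketch under stated assumptions: the minimums are the ones quoted in this article (check each project’s own documentation before relying on them), the available VRAM is passed in by hand rather than detected, and the function name runnable_tools is made up for this example.

```python
import shutil
import sys

# Minimums as stated in this article; treat them as assumptions, not as
# authoritative project requirements.
REQUIREMENTS = {
    "Tesseract 5.x": {"python": None, "vram_gb": 0},
    "Surya": {"python": (3, 9), "vram_gb": 4},
    "PaddleOCR-VL-1.5": {"python": (3, 7), "vram_gb": 8},
    "OlmOCR-2": {"python": (3, 10), "vram_gb": 16},
}


def runnable_tools(vram_gb: float, python_version: tuple = sys.version_info[:2]) -> list[str]:
    """Return the tools whose stated minimums this machine meets."""
    ok = []
    for name, req in REQUIREMENTS.items():
        if req["python"] is not None and python_version < req["python"]:
            continue  # interpreter too old for this tool
        if vram_gb < req["vram_gb"]:
            continue  # not enough GPU memory
        ok.append(name)
    return ok


if __name__ == "__main__":
    vram = 8  # hypothetical: set this to your GPU's memory in GB, or 0 for CPU-only
    print("Python:", ".".join(map(str, sys.version_info[:2])))
    print("tesseract on PATH:", shutil.which("tesseract") is not None)
    print("Candidates:", runnable_tools(vram))
```

With 0 GB the list collapses to Tesseract alone, which matches the CPU-only advice earlier in the article.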
Which Should You Use?
Choose Tesseract if: You’re processing clean, single-column documents at high volume on CPU-only infrastructure. Speed matters more than accuracy.
Choose Surya if: You need reliable accuracy across varied document types and multiple languages, and you have a mid-range GPU. This is the default recommendation for most users.
Choose PaddleOCR-VL-1.5 if: Your documents are multilingual or you need semantic document understanding beyond raw text extraction.
Choose OlmOCR-2 if: You’re processing academic papers, research documents, or anything with complex layouts, equations, and citations. Accuracy on these documents justifies the slower speed.
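The four rules above can be written as a small decision function. The Workload fields and the order in which the rules are checked are assumptions made for this sketch; the article does not rank the criteria, so adjust the order to your own priorities.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    has_gpu: bool
    academic_layouts: bool = False          # equations, citations, dense multi-column text
    multilingual_or_semantic: bool = False  # mixed languages or structure-aware parsing


def recommend(w: Workload) -> str:
    """Map a workload description to one of the four tools."""
    if not w.has_gpu:
        return "Tesseract 5.x"      # the only CPU-friendly option in this lineup
    if w.academic_layouts:
        return "OlmOCR-2"
    if w.multilingual_or_semantic:
        return "PaddleOCR-VL-1.5"
    return "Surya"                   # the default recommendation


if __name__ == "__main__":
    print(recommend(Workload(has_gpu=True)))
    print(recommend(Workload(has_gpu=True, academic_layouts=True)))
```

Checking for a GPU first reflects the hardware constraint: without one, the other three tools are impractical regardless of document type.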
What You Can Do
Start with Surya for most document processing needs. The installation is straightforward:
pip install surya-ocr
surya_ocr input.pdf --output_dir ./results
Results come as JSON with bounding boxes and confidence scores, or rendered images with text overlays for verification.
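Confidence scores are most useful when you act on them. The sketch below flags low-confidence lines for manual review. It assumes the results JSON maps each input name to a list of pages, each with a text_lines list whose entries carry text and confidence fields; verify that against the file your Surya version actually writes. The 0.9 threshold and the sample data are invented for illustration.

```python
import json


def low_confidence_lines(results: dict, threshold: float = 0.9) -> list[tuple]:
    """Return (document, page number, text, confidence) for lines below threshold."""
    flagged = []
    for doc_name, pages in results.items():
        for page_number, page in enumerate(pages, start=1):
            for line in page.get("text_lines", []):
                confidence = line.get("confidence", 0.0)  # missing score counts as low
                if confidence < threshold:
                    flagged.append((doc_name, page_number, line.get("text", ""), confidence))
    return flagged


if __name__ == "__main__":
    # Invented sample in the assumed shape. For real output, load the file instead:
    #   with open("results/input/results.json") as f:   # path is an assumption
    #       results = json.load(f)
    results = {
        "input": [
            {
                "text_lines": [
                    {"text": "Invoice 1042", "confidence": 0.98},
                    {"text": "Tota1 due", "confidence": 0.71},
                ]
            }
        ]
    }
    for doc, page, text, conf in low_confidence_lines(results):
        print(f"{doc} p{page}: {text!r} ({conf:.2f})")
```

This kind of triage is particularly helpful for handwriting, where a reviewer can focus on the flagged lines instead of rereading every page.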
If Surya struggles with your specific document types, PaddleOCR-VL-1.5 provides a more capable alternative with different tradeoffs. Tesseract remains the fallback for simple documents or CPU-constrained environments.
The broader point: local OCR has caught up with cloud services for most use cases. You no longer need to send sensitive documents to third parties to get them digitized. The tools are free, the quality is there, and your data stays yours.