Best Local Vision Models in 2026: Image Understanding on Every GPU Tier

Run image analysis, document OCR, and visual reasoning locally. Qwen3-VL, Gemma 3, Phi-4 Vision, and more tested from 8GB to 32GB VRAM with real benchmarks.

Drop a screenshot, a photo of a whiteboard, a scanned document, or a chart into a local AI model and get back text descriptions, extracted data, or answers to questions about what’s in the image. No cloud upload, no API costs, no privacy concerns.

Vision language models (VLMs) have improved dramatically in 2026. A model that fits on an 8GB GPU can now read documents more accurately than GPT-4o could two years ago. Here’s what works at every VRAM tier.

The Benchmarks That Matter

  • MMMU - college-level multi-discipline questions requiring image understanding. The gold standard for general visual reasoning
  • DocVQA - document question answering. Can the model read a receipt, invoice, or form and answer questions about it?
  • ChartQA - understanding charts and graphs. Crucial for data analysis workflows
  • MathVista - mathematical reasoning from visual inputs (diagrams, equations, geometry)
  • OCRBench - text extraction accuracy from images

The Models

| Model | Params | MMMU | DocVQA | ChartQA | MathVista | VRAM (Q4) |
|---|---|---|---|---|---|---|
| Qwen3-VL 8B | 8B | - | 95+ | 88+ | 85.8 | ~6 GB |
| Qwen2.5-VL 7B | 7B | 58.6 | 95.7 | 87.3 | 68.2 | ~6 GB |
| Gemma 3 27B | 27B | 64.9 | 86.6 | 78.0 | - | ~14 GB (QAT) |
| Gemma 3 12B | 12B | 59.6 | 87.1 | 75.7 | - | ~6.6 GB |
| Phi-4-reasoning-vision | 15B | 54.3 | - | 83.3 | 75.2 | ~10 GB |
| Llama 3.2 Vision 11B | 11B | 50.7 | 88.4 | - | 51.5 | ~8 GB |
| Gemma 3 4B | 4B | 48.8 | 75.8 | 68.8 | - | ~2.6 GB |
| SmolVLM2 2.2B | 2.2B | - | - | - | - | ~2 GB |

8GB VRAM {#8gb}

GPUs: RTX 4060, RTX 3060 8GB, RTX 3070

Candidates

| Model | VRAM (Q4) | MMMU | DocVQA | Speed (RTX 4060) |
|---|---|---|---|---|
| Qwen3-VL 8B | ~6 GB | - | 95+ | 40-60 tok/s |
| Qwen2.5-VL 7B | ~6 GB | 58.6 | 95.7 | 40-55 tok/s |
| Gemma 3 4B QAT | ~2.6 GB | 48.8 | 75.8 | 60+ tok/s |

Winner: Qwen3-VL 8B

Qwen3-VL 8B is the new default local vision model. It scores 85.8 on MathVista and handles OCR, charts, screenshots, and photos better than the previous Qwen2.5-VL 7B across the board. At ~6GB Q4, it fits on 8GB GPUs with room for context.

What makes it practical: point it at a screenshot and ask “what’s the error message?” or photograph a whiteboard and get structured notes. The document understanding is genuinely good - it reads forms, receipts, and handwritten text reliably.
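A minimal sketch of that screenshot workflow with Ollama, using the qwen2.5vl:7b tag from the Quick Start below (Qwen3-VL GGUF builds are still rolling out, so swap in a Qwen3-VL tag once your runtime ships one; Ollama's CLI attaches image files whose paths appear in the prompt):

# Ask a local VLM about a screenshot
ollama run qwen2.5vl:7b "What is the error message in this screenshot? ./screenshot.png"

# Turn a whiteboard photo into structured notes
ollama run qwen2.5vl:7b "Transcribe this whiteboard into bullet-point notes: ./whiteboard.jpg"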

Runner-up: Qwen2.5-VL 7B. If Qwen3-VL gives you compatibility issues (GGUF support is still catching up), the previous generation scores 95.7 DocVQA and 87.3 ChartQA - still excellent. More community testing and wider Ollama support.

Ultralight option: Gemma 3 4B QAT at just 2.6GB. Useful for basic image description and simple OCR. The benchmarks are notably lower (48.8 MMMU, 75.8 DocVQA), but if you need vision alongside another model, this is the smallest viable option.

For all use cases at this level, see our 8GB VRAM complete guide.

12GB VRAM {#12gb}

GPUs: RTX 3060 12GB, RTX 4070

Candidates

| Model | VRAM (Q4) | MMMU | DocVQA | Speed (RTX 4070) |
|---|---|---|---|---|
| Qwen3-VL 8B | ~6 GB | - | 95+ | 80-120 tok/s |
| Gemma 3 12B | ~6.6 GB | 59.6 | 87.1 | 50-65 tok/s |
| Phi-4-reasoning-vision 15B | ~10 GB | 54.3 | - | 35-50 tok/s |

Winner: Qwen3-VL 8B (with headroom)

Same model as the 8GB tier, but now with ~6GB free for context. That headroom matters for vision - you can process higher-resolution images and include more text alongside them without running out of memory, and the RTX 4070's extra compute pushes throughput to 80-120 tok/s.
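To actually use that headroom, raise the context window when you call the model; a sketch with Ollama's chat API (the num_ctx value is illustrative - image tokens plus your prompt must fit inside it):

# Larger context window so high-resolution images and long prompts both fit
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5vl:7b",
  "messages": [{
    "role": "user",
    "content": "Summarize this scanned page",
    "images": ["base64_encoded_image_here"]
  }],
  "options": { "num_ctx": 16384 }
}'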

STEM specialist: Phi-4-reasoning-vision 15B at ~10GB squeezes into 12GB. It scores 75.2 MathVista and 83.3 ChartQA, making it the best choice for science diagrams, mathematical proofs, and GUI understanding. Microsoft trained it specifically for structured visual reasoning. The trade-off: it’s slower (35-50 tok/s) and leaves only 2GB for context.

General upgrade: Gemma 3 12B at 6.6GB gives you higher MMMU scores (59.6 vs Gemma 4B’s 48.8) while using similar VRAM to the 8B Qwen. Good as a second model for tasks where broader visual understanding matters more than document accuracy.

For all use cases at this level, see our 12GB VRAM complete guide.

16GB VRAM {#16gb}

GPUs: RTX 4060 Ti 16GB, RTX 5060, Intel Arc A770, AMD RX 7800 XT

Candidates

| Model | VRAM (Q4) | MMMU | DocVQA | ChartQA |
|---|---|---|---|---|
| Gemma 3 27B QAT | ~14 GB | 64.9 | 86.6 | 78.0 |
| Phi-4-reasoning-vision 15B | ~10 GB | 54.3 | - | 83.3 |
| Gemma 3 12B | ~6.6 GB | 59.6 | 87.1 | 75.7 |

Winner: Gemma 3 27B QAT

At 16GB you can run the big one. Gemma 3 27B drops from roughly 54GB at full precision to about 14GB at INT4, and because the weights went through quantization-aware training (QAT), the quality loss from quantizing is minimal. Its 64.9 MMMU is the highest general visual understanding score available at this tier.
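The arithmetic behind those numbers is just parameter count times bytes per weight; a back-of-envelope check (weights only - the KV cache and vision encoder add a little on top):

# Rough weight memory for a 27B model at two precisions
awk 'BEGIN { p = 27e9;
  printf "bf16 (2 bytes/param): %.0f GB\n", p * 2.0 / 1e9;
  printf "INT4 (0.5 bytes/param): %.1f GB\n", p * 0.5 / 1e9 }'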

This model does everything: photo descriptions, document reading, chart analysis, screenshot understanding, and visual reasoning. The broad capability set means you don’t need to switch models for different vision tasks.

Alternative for math/charts: Phi-4-reasoning-vision at ~10GB with 6GB headroom. When you need the strongest performance on mathematical diagrams, science charts, and structured visual reasoning, Phi-4 is still the specialist choice.

For all use cases at this level, see our 16GB VRAM complete guide.

24GB VRAM {#24gb}

GPUs: RTX 3090, RTX 4090

Candidates

| Model | VRAM (Q4) | MMMU | DocVQA | Speed (RTX 4090) |
|---|---|---|---|---|
| Qwen3-VL 32B | ~21 GB | - | - | ~30-40 tok/s |
| Qwen2.5-VL 32B | ~21 GB | - | 96+ | ~30-40 tok/s |
| Gemma 3 27B QAT | ~14 GB | 64.9 | 86.6 | 30-40 tok/s |
| Qwen3-VL 8B | ~6 GB | - | 95+ | 100-140 tok/s |

Winner: Qwen2.5-VL 32B

The largest single-GPU VLM for 24GB cards. Qwen2.5-VL 32B at ~21GB Q4 pushes document understanding to commercial quality. For OCR-heavy workflows - extracting data from invoices, reading multi-page documents, analyzing complex charts - this is the model.
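A sketch of what an extraction call could look like (the qwen2.5vl:32b tag and the field list are assumptions - check which quantization tags your Ollama install actually provides):

# Pull structured data out of an invoice photo; "format": "json" forces JSON output
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5vl:32b",
  "messages": [{
    "role": "user",
    "content": "Extract vendor, date, line items, and total from this invoice as JSON.",
    "images": ["base64_encoded_image_here"]
  }],
  "format": "json"
}'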

Speed vs. quality trade-off: If you process many images and need fast throughput, run Qwen3-VL 8B at ~6GB and enjoy 100-140 tok/s on the RTX 4090. The quality gap exists, but for quick image descriptions and basic OCR it’s often not worth waiting for the 32B model.

Dual model: Gemma 3 27B QAT (14GB) + a chat model like Qwen 3 8B (6.5GB) = 20.5GB total. Use the vision model for image tasks and the chat model for text conversations, both loaded simultaneously.
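Ollama can keep both resident if its concurrency limit allows it; a sketch (OLLAMA_MAX_LOADED_MODELS is the server-side setting for simultaneously loaded models, and the qwen3:8b tag is an assumption):

# Allow two models to stay loaded at once (set on the server process)
OLLAMA_MAX_LOADED_MODELS=2 ollama serve &

# Then address each model by name; both remain in VRAM
ollama run gemma3:27b "Describe the chart in ./chart.png"
ollama run qwen3:8b "Draft an email summarizing those chart findings"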

For all use cases at this level, see our 24GB VRAM complete guide.

32GB VRAM {#32gb}

GPUs: RTX 5090

Candidates

| Model | VRAM | MMMU | DocVQA |
|---|---|---|---|
| Qwen3-VL 32B (Q6) | ~28 GB | - | - |
| Qwen2.5-VL 32B (Q6) | ~28 GB | - | 96+ |
| Gemma 3 27B QAT + chat LLM | ~14 + 10 GB | 64.9 | 86.6 |

Winner: Qwen2.5-VL 32B at Q6_K

Same model as the 24GB tier, now at Q6_K quantization (~28GB). The higher-precision quantization (Q6_K instead of Q4) means fewer artifacts in text recognition and more accurate handling of fine detail - important for OCR of small text and complex diagrams.

Full vision pipeline: With 32GB, run a vision model + chat model + speech model simultaneously. Example: Gemma 3 27B QAT (14GB) for vision + Qwen 3 14B (10.7GB) for chat + Kokoro-82M (<1GB) for TTS = ~25GB total. Point a camera at something, ask about it, and hear the answer.
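A rough sketch of how that hand-off could be wired with two Ollama calls and jq (model tags, file names, and the TTS step are assumptions, not a tested pipeline):

# 1. Vision model describes the camera frame (base64 -w0 is GNU; plain base64 on macOS)
DESC=$(curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b",
  "prompt": "Describe what is in this image.",
  "images": ["'"$(base64 -w0 frame.jpg)"'"],
  "stream": false
}' | jq -r '.response')

# 2. Chat model reasons over that description (jq -n builds the JSON payload safely)
ANSWER=$(curl -s http://localhost:11434/api/generate \
  -d "$(jq -n --arg d "$DESC" '{model: "qwen3:14b", prompt: ("In one sentence, what stands out in this scene? " + $d), stream: false}')" \
  | jq -r '.response')

# 3. Hand $ANSWER to your TTS engine (e.g. a local Kokoro-82M server) to speak it
echo "$ANSWER"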

For all use cases at this level, see our 32GB VRAM complete guide.

Cross-Tier Summary

| Tier | Best Pick | VRAM Used | Strength |
|---|---|---|---|
| 8GB | Qwen3-VL 8B | ~6 GB | Document OCR, charts, screenshots |
| 12GB | Qwen3-VL 8B / Phi-4-vision | ~6-10 GB | General / STEM specialist |
| 16GB | Gemma 3 27B QAT | ~14 GB | Best general visual understanding |
| 24GB | Qwen2.5-VL 32B | ~21 GB | Commercial-grade document analysis |
| 32GB | Qwen2.5-VL 32B (Q6) | ~28 GB | Maximum quality |

What Can Vision Models Actually Do?

The practical use cases that work well today:

  • OCR and document reading - extract text from photos, scanned PDFs, receipts, invoices. The Qwen VL models are especially strong here (95%+ DocVQA)
  • Screenshot analysis - describe what’s on screen, identify UI elements, read error messages
  • Chart and graph reading - extract data points, describe trends, answer questions about visualizations
  • Photo description - accessibility descriptions, content moderation, image cataloguing
  • Whiteboard to text - photograph handwritten notes and get structured text output
  • Math from images - photograph an equation or geometry problem and get it solved

What they struggle with:

  • Spatial reasoning - “Is the red object to the left of the blue one?” is still unreliable
  • Counting - “How many people are in this photo?” often gets wrong answers
  • Fine-grained detail - small text, subtle differences between similar objects
  • Video understanding - frame-by-frame analysis is possible but slow and memory-intensive

Quick Start

# Install Ollama and pull a vision model
ollama pull qwen2.5vl:7b     # 8GB tier
ollama pull gemma3:27b        # 16GB+ tier

# Analyze an image from the command line (Ollama attaches image paths found in the prompt)
ollama run qwen2.5vl:7b "Describe this image: ./photo.jpg"

# Or use the API
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5vl:7b",
  "messages": [{
    "role": "user",
    "content": "What text is in this image?",
    "images": ["base64_encoded_image_here"]
  }]
}'
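
To fill in that images field from a real file, base64-encode it first (GNU coreutils shown; on macOS drop the -w0 flag):

# Encode an image and send it to the chat API
IMG=$(base64 -w0 ./photo.jpg)
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5vl:7b",
  "messages": [{
    "role": "user",
    "content": "What text is in this image?",
    "images": ["'"$IMG"'"]
  }]
}'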

For a web interface with image upload support, use Open WebUI with Ollama. We cover the setup in our Ollama + Open WebUI guide.

For more about the newest small vision models, see our Qwen 3.5 Small Models article.