8GB VRAM: Every AI Task You Can Run Locally in 2026

Complete guide to running local AI on 8GB GPUs - chat, coding, translation, vision, speech, and agents. Model picks, benchmarks, and honest limits for RTX 4060, RTX 3070, and similar cards.

NVIDIA graphics card closeup showing cooling fans

Local AI by VRAM Tier - 8GB | 12GB | 16GB | 24GB | 32GB

Deep dives: Chat | Coding | Translation | Vision | Speech | Agents

You have 8GB of VRAM. A year ago that meant toy models and disappointment. In March 2026, it means a genuinely useful local AI setup across chat, coding, translation, vision, speech, and basic agent tasks - all running privately on your own hardware.

Your Hardware

8GB VRAM cards include the RTX 4060, RTX 3070, RTX 3060 8GB, RTX 2080, and GTX 1080. A new RTX 4060 retails for $250-350; the older cards cost less on the used market.

The math: At Q4_K_M quantization (4-bit), a model uses roughly 0.6-0.8 GB per billion parameters. That means 8GB fits models up to about 9B parameters with room for 1-2GB of context. Larger models need more aggressive quantization (Q3_K_M) or simply won’t fit.
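
That rule of thumb can be sketched as a quick calculator. The 1.5 GB overhead figure below is an assumption splitting the difference of the 1-2 GB context budget mentioned above:

```shell
# Rule of thumb: Q4_K_M weights take roughly 0.6-0.8 GB per billion
# parameters; reserve ~1.5 GB (assumed) for context and overhead.
fits_in_8gb() {
  awk -v b="$1" 'BEGIN {
    low = b * 0.6; high = b * 0.8
    s = (high + 1.5 <= 8.0) ? "fits" : "tight or too big"
    printf "%sB model: %.1f-%.1f GB weights -> %s\n", b, low, high, s
  }'
}
fits_in_8gb 7
fits_in_8gb 9
fits_in_8gb 14
```

A 7B model fits comfortably, 9B sits right at the edge, and 14B is out of reach without more aggressive quantization.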

Quick Reference

Use Case              | Best Pick              | VRAM Used | Key Score           | Speed
----------------------|------------------------|-----------|---------------------|--------------
Chat                  | Qwen 3.5 9B            | ~7 GB     | 82.5 MMLU-Pro       | 35-45 tok/s
Coding (autocomplete) | Qwen 2.5 Coder 7B      | ~5 GB     | 88.4% HumanEval     | ~40 tok/s
Coding (chat)         | Qwen 3.5 9B            | ~7 GB     | 65.6% LiveCodeBench | 35-45 tok/s
Translation           | TranslateGemma 4B      | 3.3 GB    | 80.1 COMET22        | Fast
Vision                | Qwen3-VL 8B            | ~6 GB     | 85.8 MathVista      | 40-60 tok/s
Speech (STT)          | Whisper large-v3-turbo | ~6 GB     | ~7.75% WER          | 6x real-time
Speech (TTS)          | Kokoro-82M             | <1 GB     | #1 TTS Arena        | 96x real-time
Agents                | Qwen 3.5 9B            | ~7 GB     | 66.1 BFCL V4        | 35-45 tok/s

Chat & General Assistant

Qwen 3.5 9B scores 82.5 on MMLU-Pro - beating models 13x its size. It handles everyday questions, brainstorming, summarization, and simple analysis well. At ~7GB it’s tight on an 8GB card but workable.

For a lighter option, Gemma 3 4B QAT at 3.5GB leaves 4.5GB free and still handles basic conversations.

For the full comparison with benchmark tables, see our local chat model guide.

Coding

Autocomplete: Qwen 2.5 Coder 7B at 88.4% HumanEval with FIM support and 128K context. At ~5GB, it leaves room for context. Plug it into Continue or Tabby for Copilot-style inline suggestions.
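
Wiring the model into Continue comes down to one config entry. A minimal sketch, written to a local file here so you can inspect it before merging into your own `~/.continue/config.json` (the model tag assumes you pulled it via Ollama):

```shell
# Hedged: a minimal Continue autocomplete entry pointing at Ollama.
# Merge the object below into your existing Continue config.
cat > tab-autocomplete.json <<'EOF'
{
  "tabAutocompleteModel": {
    "title": "Qwen 2.5 Coder 7B",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}
EOF
```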

Chat-based coding: Qwen 3.5 9B again - its 65.6% LiveCodeBench score makes it the strongest code chat option at this tier.

Full comparison and IDE setup: local coding model guide.

Translation

TranslateGemma 4B at just 3.3GB translates 55 languages at quality that surpasses the Gemma 3 4B baseline. Purpose-built translation beats general models at this size.

For 200-language coverage (rare pairs), NLLB-200 1.3B at ~2GB is literal but accurate.

Full comparison: local translation model guide.

Vision

Qwen3-VL 8B at ~6GB reads documents (95%+ DocVQA), understands charts, and describes photos. Point it at a screenshot to extract error messages, or photograph a whiteboard for structured notes.
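
The screenshot workflow can be scripted against Ollama's `/api/generate` endpoint, which accepts base64-encoded images. A sketch, assuming the model is available under the tag `qwen3-vl:8b` (check `ollama list` for the tag on your system):

```shell
# Hedged sketch: build an /api/generate request with a screenshot attached.
# The "images" field takes base64-encoded image data.
IMG_B64=$(base64 screenshot.png 2>/dev/null | tr -d '\n')
PAYLOAD=$(printf '{"model":"qwen3-vl:8b","prompt":"Extract any error messages from this screenshot.","images":["%s"],"stream":false}' "$IMG_B64")
# With Ollama running locally, send it:
# curl -s http://localhost:11434/api/generate -d "$PAYLOAD"
echo "$PAYLOAD"
```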

Ultralight: Gemma 3 4B QAT at 2.6GB for basic image description alongside other models.

Full comparison: local vision model guide.

Speech

Transcription: Whisper large-v3-turbo at ~6GB delivers near-commercial accuracy (~7.75% WER) at 6x faster than real-time. Use Faster-Whisper for the best performance.

Text-to-speech: Kokoro-82M runs at 96x real-time on GPU and 3-5x on CPU. Ranked #1 on the TTS Arena. No voice cloning, but natural built-in voices. At <1GB it fits alongside anything.

Both together: Whisper turbo (6GB) + Kokoro (<1GB) = ~7GB total. Transcription and readback simultaneously.

Full comparison and pipeline setups: local speech model guide.

Agents (Tool Use)

Qwen 3.5 9B scores 66.1 on BFCL V4, outperforming GPT-5 mini (55.5) on function calling. It handles 2-3 step tool chains reliably - enough for smart home control, simple API calls, and basic automations.
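
Function calling goes through the `tools` array on Ollama's `/api/chat` endpoint. A sketch of a single-tool request; the `get_weather` function is hypothetical, and the model tag matches the one used elsewhere in this guide:

```shell
# Hedged sketch: one hypothetical tool definition for /api/chat.
cat > request.json <<'EOF'
{
  "model": "qwen3.5:9b",
  "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "stream": false
}
EOF
# With Ollama running locally:
# curl -s http://localhost:11434/api/chat -d @request.json
```

The model's reply includes a `tool_calls` field when it decides to invoke the function; your code executes the call and feeds the result back as a `tool` message.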

Full comparison and setup: local agent model guide.

Getting Started

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull the essentials for 8GB
ollama pull qwen3.5:9b          # Chat + coding + agents
ollama pull qwen2.5-coder:7b    # Autocomplete
ollama pull translategemma:4b   # Translation

# Start chatting
ollama run qwen3.5:9b

For a web interface, pair Ollama with Open WebUI. For coding, add Continue or Tabby.
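
Open WebUI is easiest to run as a container. A deployment sketch based on the commonly documented Docker invocation (port mapping and image tag may differ in current docs):

```shell
# Hedged: run Open WebUI in Docker, reaching the host's Ollama server.
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
```

Then open http://localhost:3000 in your browser.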

Honest Limits

8GB is the entry tier. Here’s what it can’t do well:

  • Complex multi-step reasoning - models under 14B lose coherence on long chains of logic
  • Multi-model pipelines - fitting two 7B models simultaneously is impossible; you switch between them instead
  • Large context - after loading a 7GB model, you have ~1GB for context. Long documents get truncated
  • Reliability for agents - 2-3 step chains work, but 5+ step workflows fail frequently

If you find yourself constantly hitting these walls, a 12GB card (RTX 3060 12GB, ~$200 used) or 16GB card (RTX 4060 Ti 16GB, ~$400) is a meaningful upgrade. See our 12GB guide or 16GB guide.