Local AI by VRAM Tier - 8GB | 12GB | 16GB | 24GB | 32GB
You have 8GB of VRAM. A year ago that meant toy models and disappointment. In March 2026, it means a genuinely useful local AI setup across chat, coding, translation, vision, speech, and basic agent tasks - all running privately on your own hardware.
Your Hardware
8GB VRAM cards include the RTX 4060, RTX 3060 8GB, RTX 3070, RTX 2080, and GTX 1080. A new RTX 4060 retails for $250-350; the older cards cost less on the used market.
The math: At Q4_K_M quantization (4-bit), a model uses roughly 0.6-0.8 GB per billion parameters. That means 8GB fits models up to about 9B parameters with room for 1-2GB of context. Larger models need more aggressive quantization (Q3_K_M) or simply won’t fit.
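As a quick sketch, the rule of thumb above can be turned into a fit check. The 0.7 GB-per-billion figure is a midpoint assumption between the 0.6-0.8 range, not a measured constant, and the context reserve is adjustable:

```python
def fits_in_vram(params_b: float, vram_gb: float = 8.0,
                 gb_per_b: float = 0.7, context_gb: float = 1.5) -> bool:
    """Rough check: does a Q4_K_M model of params_b billion parameters
    fit in vram_gb, leaving context_gb free for the KV cache?"""
    weights_gb = params_b * gb_per_b
    return weights_gb + context_gb <= vram_gb

# A 9B model needs ~6.3 GB of weights, leaving room for context on 8 GB.
print(fits_in_vram(9))    # True
# A 14B model needs ~9.8 GB of weights alone - over budget before context.
print(fits_in_vram(14))   # False
```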
Quick Reference
| Use Case | Best Pick | VRAM Used | Key Score | Speed |
|---|---|---|---|---|
| Chat | Qwen 3.5 9B | ~7 GB | 82.5 MMLU-Pro | 35-45 tok/s |
| Coding (autocomplete) | Qwen 2.5 Coder 7B | ~5 GB | 88.4% HumanEval | ~40 tok/s |
| Coding (chat) | Qwen 3.5 9B | ~7 GB | 65.6% LiveCodeBench | 35-45 tok/s |
| Translation | TranslateGemma 4B | 3.3 GB | 80.1 COMET22 | Fast |
| Vision | Qwen3-VL 8B | ~6 GB | 85.8 MathVista | 40-60 tok/s |
| Speech (STT) | Whisper large-v3-turbo | ~6 GB | ~7.75% WER | 6x real-time |
| Speech (TTS) | Kokoro-82M | <1 GB | #1 TTS Arena | 96x real-time |
| Agents | Qwen 3.5 9B | ~7 GB | 66.1 BFCL V4 | 35-45 tok/s |
Chat & General Assistant
Qwen 3.5 9B scores 82.5 on MMLU-Pro - beating models 13x its size. It handles everyday questions, brainstorming, summarization, and simple analysis well. At ~7GB it’s tight on an 8GB card but workable.
For a lighter option, Gemma 3 4B QAT at 3.5GB leaves 4.5GB free and still handles basic conversations.
For the full comparison with benchmark tables, see our local chat model guide.
Coding
Autocomplete: Qwen 2.5 Coder 7B at 88.4% HumanEval with FIM support and 128K context. At ~5GB, it leaves room for context. Plug it into Continue or Tabby for Copilot-style inline suggestions.
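Continue and Tabby construct the fill-in-the-middle prompt for you, but it helps to see its shape. A minimal sketch using the FIM special tokens from the Qwen2.5-Coder format - the model generates the code that belongs between the prefix and suffix:

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Build a fill-in-the-middle prompt in the Qwen2.5-Coder token format."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# The editor sends everything before the cursor as prefix,
# everything after it as suffix; the completion fills the gap.
prompt = fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(2, 3))")
print(prompt)
```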
Chat-based coding: Qwen 3.5 9B again - its 65.6% LiveCodeBench score makes it the strongest code chat option at this tier.
Full comparison and IDE setup: local coding model guide.
Translation
TranslateGemma 4B at just 3.3GB translates 55 languages at quality that surpasses the Gemma 3 4B baseline. Purpose-built translation beats general models at this size.
For 200-language coverage (rare pairs), NLLB-200 1.3B at ~2GB is literal but accurate.
Full comparison: local translation model guide.
Vision
Qwen3-VL 8B at ~6GB reads documents (95%+ DocVQA), understands charts, and describes photos. Point it at a screenshot to extract error messages, or photograph a whiteboard for structured notes.
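Feeding a screenshot to a local vision model goes through Ollama's HTTP API, which accepts base64-encoded images in an `images` list. A minimal payload builder - the model tag `qwen3-vl:8b` is an assumption, so check `ollama list` for the exact name on your system:

```python
import base64
import json

def vision_payload(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build a request body for Ollama's /api/generate endpoint.
    Images are passed as base64 strings in the `images` list."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    })

# POST this body to http://localhost:11434/api/generate
body = vision_payload("qwen3-vl:8b", "Extract any error messages.", b"\x89PNG...")
print(json.loads(body)["model"])
```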
Ultralight: Gemma 3 4B QAT at 2.6GB for basic image description alongside other models.
Full comparison: local vision model guide.
Speech
Transcription: Whisper large-v3-turbo at ~6GB delivers near-commercial accuracy (~7.75% WER) at roughly 6x real-time speed. Use Faster-Whisper for the best throughput.
Text-to-speech: Kokoro-82M runs at 96x real-time on GPU and 3-5x on CPU. Ranked #1 on the TTS Arena. No voice cloning, but natural built-in voices. At <1GB it fits alongside anything.
Both together: Whisper turbo (6GB) + Kokoro (<1GB) = ~7GB total. Transcription and readback simultaneously.
Full comparison and pipeline setups: local speech model guide.
Agents (Tool Use)
Qwen 3.5 9B scores 66.1 on BFCL V4, outperforming GPT-5 mini (55.5) on function calling. It handles 2-3 step tool chains reliably - enough for smart home control, simple API calls, and basic automations.
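A tool chain boils down to a loop: the model emits a structured tool call, your code executes it, and the result goes back into the conversation. Ollama's chat API returns tool calls as `{"function": {"name": ..., "arguments": {...}}}` objects; here is a minimal dispatcher for that shape, with a hypothetical smart-home tool for illustration:

```python
def dispatch(tool_call: dict, tools: dict) -> str:
    """Execute one tool call of the shape Ollama's chat API returns."""
    fn = tool_call["function"]
    return tools[fn["name"]](**fn["arguments"])

# Hypothetical tool registry - replace with your real automations.
tools = {
    "set_light": lambda room, on: f"{room} light {'on' if on else 'off'}",
}

# A parsed tool call as the model would produce it.
call = {"function": {"name": "set_light",
                     "arguments": {"room": "kitchen", "on": True}}}
print(dispatch(call, tools))   # kitchen light on
```

In a real agent you append the returned string to the message history as a `tool` role message and ask the model for its next step - that round trip is where 5+ step chains start to drift at this tier.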
Full comparison and setup: local agent model guide.
Getting Started
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull the essentials for 8GB
ollama pull qwen3.5:9b        # Chat + coding + agents
ollama pull qwen2.5-coder:7b  # Autocomplete
ollama pull translategemma:4b # Translation

# Start chatting
ollama run qwen3.5:9b
```
For a web interface, pair Ollama with Open WebUI. For coding, add Continue or Tabby.
Honest Limits
8GB is the entry tier. Here’s what it can’t do well:
- Complex multi-step reasoning - models under 14B lose coherence on long chains of logic
- Multi-model pipelines - two 7B models won't fit simultaneously; you switch between them instead
- Large context - after loading a 7GB model, you have ~1GB for context. Long documents get truncated
- Reliability for agents - 2-3 step chains work, but 5+ step workflows fail frequently
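The context ceiling follows from KV-cache arithmetic: each token stores a key and value tensor per layer. A back-of-envelope estimator - the layer count and head dimensions below are illustrative placeholders for a ~9B model with grouped-query attention, not published figures for any specific model:

```python
def kv_cache_gb(tokens: int, layers: int = 36, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """FP16 KV-cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_val * tokens / 1024**3

# With these assumed dims, an 8K-token context eats roughly 1.1 GB -
# which is about all that's left after a 7 GB model on an 8 GB card.
print(f"{kv_cache_gb(8192):.2f} GB for an 8K context")
```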
If you find yourself constantly hitting these walls, a 12GB card (RTX 3060 12GB, ~$200 used) or 16GB card (RTX 4060 Ti 16GB, ~$400) is a meaningful upgrade. See our 12GB guide or 16GB guide.