Local AI by VRAM Tier - 8GB | 12GB | 16GB | 24GB | 32GB
32GB of GDDR7 on a single consumer GPU. The RTX 5090 delivers 60-80% faster AI performance than the 4090, roughly 213 tok/s average inference, and opens up models that were previously enterprise-only. Near-lossless quantization on 32B models. Llama 3.3 70B on a single card. MoE models at Q8. This is the new ceiling.
Your Hardware
32GB VRAM means one card: the RTX 5090 at $1,999 MSRP. Reality check: as of mid-March 2026, street prices hover around $3,000-5,000+ due to ongoing shortages. Supply is expected to normalize by mid-2026.
The math: At Q4_K_M, 32GB fits models up to about 45B dense parameters. At Q6_K (near-lossless), 32B models fit with headroom. At Q3_K_M, the Llama 3.3 70B squeezes in. MoE models like the Qwen 3 30B fit at Q8_0 (essentially lossless) with room to spare.
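The rule of thumb behind these numbers: weights-only file size in GB ≈ parameters (in billions) × bits per weight ÷ 8, plus a few GB for KV cache and activations. A quick sketch using approximate average bits-per-weight for each GGUF quant (~4.8 for Q4_K_M, ~6.6 for Q6_K):

```shell
# size_gb ≈ params_b * bits_per_weight / 8 (weights only; budget ~2-4 GB extra for context)
for spec in "32 Q4_K_M 4.8" "32 Q6_K 6.6" "45 Q4_K_M 4.8"; do
  set -- $spec
  awk -v p="$1" -v q="$2" -v bpw="$3" \
    'BEGIN { printf "%sB @ %s (%.1f bpw) = %.1f GB\n", p, q, bpw, p * bpw / 8 }'
done
# 32B @ Q4_K_M (4.8 bpw) = 19.2 GB
# 32B @ Q6_K (6.6 bpw) = 26.4 GB
# 45B @ Q4_K_M (4.8 bpw) = 27.0 GB
```

At 26-27 GB of weights, both a 32B at Q6 and a ~45B at Q4 leave room for context inside a 32GB budget.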
Speed: 213 tok/s average inference. Sub-100ms time-to-first-token. 72% improvement over RTX 4090 on NLP tasks.
Quick Reference
| Use Case | Best Pick | VRAM Used | Key Score | Speed (RTX 5090) |
|---|---|---|---|---|
| Chat (speed) | Qwen 3 30B MoE (Q8) | ~28 GB | 30B quality, lossless | ~320 tok/s |
| Chat (quality) | EXAONE 4.0 32B (Q6) | ~28 GB | 92.3 MMLU-Redux | ~50 tok/s |
| Coding (autocomplete) | Qwen 2.5 Coder 32B (Q6) | ~28 GB | 92.7% HumanEval | ~35 tok/s |
| Coding (agentic) | Qwen 3.5 27B (Q6) | ~22 GB | 72.4% SWE-bench | ~45 tok/s |
| Translation | TranslateGemma 27B (Q6) | ~22 GB | MetricX 3.09 | Fast |
| Vision | Qwen2.5-VL 32B (Q6) | ~28 GB | 96+ DocVQA | ~35 tok/s |
| Speech | Full pipeline + LLM | ~28 GB | Premium quality | Real-time |
| Agents | Qwen 3 30B MoE (Q8) | ~28 GB | Fast 10+ step chains | ~320 tok/s |
| Reach | Llama 3.3 70B (Q3) | ~32 GB | 83.6 MMLU | ~20 tok/s |
Chat & General Assistant
Speed king: Qwen 3 30B MoE at Q8_0 (~28GB). Near-lossless quantization means you get the full quality of the 30B MoE at an estimated ~320 tok/s. That’s faster than most cloud APIs, with no network latency, zero marginal cost, and zero data leaving your machine.
Knowledge ceiling: EXAONE 4.0 32B at Q6_K (~28GB). 92.3 MMLU-Redux puts it in frontier territory. At near-lossless quantization, this is the most knowledgeable model you can run on a single consumer GPU.
The 70B reach: Llama 3.3 70B at Q3_K_M (~32GB). Aggressive quantization costs 3-5 MMLU points, and speed drops to ~20 tok/s. But for complex analysis, research questions, and tasks requiring maximum reasoning depth, running a 70B on a single consumer GPU simply wasn’t an option until now.
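If the 70B’s weights plus KV cache don’t quite fit, Ollama can offload some layers to system RAM via the `num_gpu` option (slower, but it runs). A sketch against the local API; the layer count here is a starting guess to tune, not a recommendation:

```shell
# Keep 75 of the model's layers on the GPU, spill the rest to system RAM,
# and cap the context window to limit KV-cache growth.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3:70b-q3_K_M",
  "prompt": "Walk through the trade-offs of index funds vs. individual stocks.",
  "options": {
    "num_gpu": 75,
    "num_ctx": 4096
  }
}'
```

Watch `nvidia-smi` while tuning: the goal is the highest `num_gpu` that doesn’t trigger out-of-memory errors.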
Full comparison: local chat model guide.
Coding
Qwen 2.5 Coder 32B at Q6_K (~28GB). The quality bump from Q4 to Q6 is measurable - fewer hallucinated function names, more accurate API calls, and more reliable FIM completions. This is the closest thing to a local Copilot replacement.
Dual setup: Qwen 2.5 Coder 14B (~9GB) for autocomplete + Qwen 3.5 27B at Q4 (~16GB) for chat = ~25GB total with ~7GB free. Best of both worlds at the highest quality tier.
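One way to keep both models resident with Ollama: raise the loaded-model limit and disable the idle unload before starting the server. The environment variables are standard Ollama settings; the model tags are illustrative, so check the model library for current names:

```shell
# Allow two models in VRAM at once, and keep them loaded instead of
# unloading after the default 5-minute idle timeout.
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_KEEP_ALIVE=-1
ollama serve &

ollama pull qwen2.5-coder:14b   # autocomplete (~9 GB); tag illustrative
ollama pull qwen3.5:27b         # chat; tag illustrative
```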
Full comparison: local coding model guide.
Translation
TranslateGemma 27B at Q6_K (~22GB) is the pick: the best translation model at near-lossless quality, with ~10GB left over for context. Pairing it with Qwen 3 14B (10.7GB) for context-aware literary translation totals ~33GB, slightly over budget, so to run both simultaneously, drop one of them to Q4.
Full comparison: local translation model guide.
Vision
Qwen2.5-VL 32B at Q6_K (~28GB). Higher quantization means fewer OCR artifacts and more accurate fine-detail recognition. For document processing workflows that need to be reliable, this matters.
Full comparison: local vision model guide.
Speech
Premium voice assistant: Whisper large-v3 (10GB) + Qwen3-TTS 1.7B (7GB) + Qwen 3 14B (10.7GB) = ~28GB. Maximum quality across all three components - ten-language voice cloning, accurate transcription, and a genuinely smart conversational AI on a single GPU.
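The STT leg of that pipeline can be exercised on its own with the openai-whisper CLI (`pip install -U openai-whisper`; requires ffmpeg). The file name is a placeholder:

```shell
# Transcribe an English recording on the GPU, writing a plain-text transcript
whisper recording.wav --model large-v3 --device cuda \
  --language en --output_format txt
```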
Full comparison: local speech model guide.
Agents
Qwen 3 30B MoE at Q8_0 for agent workflows. The speed (~320 tok/s) means 10+ step tool chains execute fast enough to feel interactive, and the near-lossless quantization largely eliminates the function-calling formatting errors that plague Q3/Q4 quantized models.
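A single tool-call step against Ollama’s chat API looks like the sketch below; `get_weather` is a made-up example tool, and the model decides whether to answer a `tool_calls` entry in its reply:

```shell
# Offer the model one callable tool; a capable model responds with a
# structured tool_calls object instead of free-form text.
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3:30b-a3b-q8_0",
  "messages": [{ "role": "user", "content": "What is the weather in Berlin?" }],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }]
}'
```

The agent loop is then: execute the requested tool yourself, append its result as a `tool` role message, and call the API again.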
Full comparison: local agent model guide.
Getting Started
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# The 32GB tier essentials
ollama pull qwen3:30b-a3b-q8_0 # Chat + agents (MoE, near-lossless)
ollama pull qwen2.5-coder:32b-q6_K # Coding (near-lossless)
ollama pull qwen3:32b-q6_K # Quality chat
ollama pull translategemma:27b # Translation
# The reach pick
ollama pull llama3.3:70b-q3_K_M # 70B on a single GPU
# Start chatting
ollama run qwen3:30b-a3b-q8_0
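After pulling, it’s worth confirming a model actually fits in VRAM rather than spilling into system RAM, which quietly tanks tok/s. Assuming the NVIDIA driver is installed:

```shell
# Ollama's view: PROCESSOR should read "100% GPU" for a fully resident model
ollama ps
# Driver's view: used vs. total VRAM
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```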
The 32GB Advantage: Quality and Headroom
The difference from 24GB isn’t just bigger models. It’s better models:
- Near-lossless 32B models - Q6_K eliminates virtually all quantization artifacts
- MoE at Q8 - the Qwen 3 30B MoE at essentially lossless quality is the best speed-quality ratio in consumer AI
- 70B on a single card - previously required multi-GPU or Apple Silicon with 64GB+
- Multi-model pipelines - run a full voice assistant (STT + TTS + LLM) at maximum quality
- Future-proofing - as 40B+ dense models become common, 32GB keeps you in the game
The Reality Check
The RTX 5090 is an aspirational card for most people right now. If you have a 4090 (24GB), you’re running 95% of what the 5090 offers. The quality difference between Q4 and Q6 quantization is real but not transformative. The 70B reach pick is interesting but slow.
If you’re deciding between a 4090 now versus waiting for a 5090 at MSRP, the practical answer for most people is: the 4090 does the job. The 5090 is for people who want the absolute best local AI experience and don’t mind paying a premium for the last 10-15% of quality.