Local AI by VRAM Tier - 8GB | 12GB | 16GB | 24GB | 32GB
24GB is where local AI stops being a compromise. The Qwen 3 30B MoE delivers 196 tok/s - faster than most 8B models on lesser hardware. The Qwen 2.5 Coder 32B hits 92.7% HumanEval and made people cancel their Copilot subscriptions. The 32B chat models compete with cloud APIs on quality. This is the enthusiast tier, and it delivers.
Your Hardware
24GB VRAM cards: RTX 3090 ($700-900 used), RTX 4090 ($1,600-2,000). The 4090 is roughly 2x faster than the 3090 for inference (131 tok/s vs ~65 tok/s on Qwen 3 8B). If speed matters as much as capacity, the 4090 is worth the premium.
The math: At Q4_K_M, 24GB fits models up to about 32B parameters with 2-4GB for context. At Q5_K_M, 27B models fit comfortably. MoE models like the Qwen 3 30B (which only activates 3B params per token) use ~18GB, leaving 6GB free.
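That math can be sketched with a simple size estimate. The effective bit-widths below (4.5 bits for Q4_K_M, 5.5 for Q5_K_M) are approximations I'm assuming here, since K-quant formats mix precisions across tensors; treat the results as ballpark figures, not exact file sizes.

```python
def model_vram_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate VRAM footprint of a quantized model in GB
    (billions of params * bytes per weight), excluding context."""
    return params_b * bits_per_weight / 8

# Assumed effective bit-widths for the K-quant formats
Q4_K_M = 4.5
Q5_K_M = 5.5

print(round(model_vram_gb(32, Q4_K_M), 1))  # ~18 GB: a 32B model leaves room for context in 24GB
print(round(model_vram_gb(27, Q5_K_M), 1))  # ~18.6 GB: a 27B model fits comfortably at Q5_K_M
```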
Quick Reference
| Use Case | Best Pick | VRAM Used | Key Score | Speed (RTX 4090) |
|---|---|---|---|---|
| Chat | Qwen 3 30B MoE | ~18 GB | 30B quality | 196 tok/s |
| Chat (quality) | Qwen 3 32B | 22.2 GB | 83.6 MMLU | 34 tok/s |
| Coding (autocomplete) | Qwen 2.5 Coder 32B | ~20 GB | 92.7% HumanEval | ~25 tok/s |
| Coding (agentic) | Qwen 3.5 27B | ~16 GB | 72.4% SWE-bench | ~35 tok/s |
| Translation | TranslateGemma 27B | ~17 GB | MetricX 3.09 | Fast |
| Vision | Qwen2.5-VL 32B | ~21 GB | 96+ DocVQA | 30-40 tok/s |
| Speech | Whisper + Chatterbox + LLM | ~18.5 GB | Full pipeline | Real-time |
| Agents | Qwen 3 30B MoE | ~18 GB | Fast chains | 196 tok/s |
Chat & General Assistant
Speed pick: Qwen 3 30B MoE at 196 tok/s on the RTX 4090. MoE architecture means 30B-class quality at small-model speed. This is the model that killed the “local models are too slow” argument. At ~18GB, you have 6GB free for context.
Quality pick: Qwen 3 32B dense at 83.6 MMLU and 49.5 GPQA Diamond. Slower (34 tok/s) but measurably smarter for complex reasoning and factual accuracy.
Dark horse: EXAONE 4.0 32B scores 92.3 MMLU-Redux - frontier-class knowledge in a local model. Less community testing but the benchmarks are hard to ignore.
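A back-of-envelope calculation shows why the MoE is so much faster than the dense 32B: single-stream decoding is usually memory-bandwidth-bound, so each generated token must stream the *active* weights from VRAM. This sketch assumes the RTX 4090's ~1008 GB/s spec-sheet bandwidth and the Q4 bit-width estimate from earlier; real throughput lands well below these ceilings due to KV-cache reads and compute overhead.

```python
def decode_tok_s_upper_bound(active_params_b: float, bits_per_weight: float,
                             mem_bw_gb_s: float) -> float:
    """Rough ceiling on decode speed for a bandwidth-bound model:
    every token requires one pass over the active weights."""
    bytes_per_token_gb = active_params_b * bits_per_weight / 8
    return mem_bw_gb_s / bytes_per_token_gb

RTX_4090_BW = 1008  # GB/s, spec-sheet figure

# MoE activates only ~3B params per token vs 32B for the dense model
print(round(decode_tok_s_upper_bound(3, 4.5, RTX_4090_BW)))   # theoretical ceiling for the MoE
print(round(decode_tok_s_upper_bound(32, 4.5, RTX_4090_BW)))  # ceiling for the 32B dense
```

The measured numbers (196 vs 34 tok/s) sit below both ceilings, but the ~10x gap in active parameters explains the ~6x gap in real-world speed.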
Full comparison: local chat model guide.
Coding
This tier transforms local coding.
Autocomplete: Qwen 2.5 Coder 32B at 92.7% HumanEval with FIM support and 128K context at ~20GB. The best local autocomplete model at any price. 73.7% on Aider means it can fix its own mistakes.
Agentic coding: Qwen 3.5 27B at 72.4% SWE-bench Verified (tying GPT-5 mini) at ~16GB. It resolves real GitHub issues autonomously through tools like Aider and OpenHands.
Full comparison: local coding model guide.
Translation
TranslateGemma 27B at ~17GB approaches Google Translate quality (MetricX 3.09, COMET22 84.4) for 55 languages. 7GB headroom for context means full-document translation in one pass.
For low-resource languages, Aya Expanse 32B at ~22GB is the strongest option - 25% higher accuracy than competitors on underserved languages.
Full comparison: local translation model guide.
Vision
Qwen2.5-VL 32B at ~21GB pushes document understanding to commercial quality (96+ DocVQA). For OCR-heavy workflows - invoices, scanned PDFs, complex charts - this is the model.
Budget alternative: Gemma 3 27B QAT at 14GB, which leaves enough headroom to run a chat model alongside it.
Full comparison: local vision model guide.
Speech
24GB makes complete voice assistants possible. Whisper turbo (6GB) + Chatterbox (6GB) + Qwen 3 8B (6.5GB) = ~18.5GB. Speak your question, get an AI response in a cloned voice. All local, all real-time.
Or maximize speech quality: Whisper large-v3 (10GB) + Qwen3-TTS 1.7B (7GB) = ~17GB. Best-in-class STT and multilingual TTS with 7GB to spare.
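The budget arithmetic for both pipelines is worth checking explicitly. This sketch uses the component sizes quoted above (the dictionary keys are illustrative labels, not exact model identifiers):

```python
# Component VRAM figures from the pipelines described above (GB)
voice_assistant = {"whisper-turbo": 6.0, "chatterbox-tts": 6.0, "qwen3-8b": 6.5}
quality_pipeline = {"whisper-large-v3": 10.0, "qwen3-tts-1.7b": 7.0}

def fits(pipeline: dict, vram_gb: float = 24.0) -> tuple:
    """Return (total VRAM used, headroom remaining) for a pipeline."""
    used = sum(pipeline.values())
    return used, vram_gb - used

print(fits(voice_assistant))   # (18.5, 5.5): all three models resident at once
print(fits(quality_pipeline))  # (17.0, 7.0): room to add a small LLM
```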
Full comparison: local speech model guide.
Agents
The Qwen 3 30B MoE at 196 tok/s makes agent workflows fast enough to feel interactive. A 5-step tool chain completes in seconds. At ~18GB with 6GB for context, it handles complex tool responses well.
For maximum reliability, the Qwen 3 32B dense at 22.2GB has stronger reasoning for planning complex workflows.
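"Completes in seconds" is easy to verify with arithmetic. Assuming ~200 generated tokens per step (an illustrative figure, not from the article) and ignoring tool execution and prompt-processing time:

```python
def chain_seconds(steps: int, tokens_per_step: int, tok_s: float) -> float:
    """Pure generation time for a multi-step tool chain; tool execution
    and prompt processing add real-world overhead on top of this."""
    return steps * tokens_per_step / tok_s

print(round(chain_seconds(5, 200, 196), 1))  # MoE at 196 tok/s: ~5 seconds
print(round(chain_seconds(5, 200, 34), 1))   # 32B dense at 34 tok/s: ~half a minute
```

That 6x gap is the interactivity difference: the MoE feels like a responsive assistant, the dense model like a batch job.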
Full comparison: local agent model guide.
Getting Started
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# The essentials for 24GB
ollama pull qwen3:30b-a3b       # Chat + agents (MoE, ~18GB)
ollama pull qwen2.5-coder:32b   # Coding autocomplete (~20GB)
ollama pull qwen3:32b           # Quality chat (when you need it)
ollama pull translategemma:27b  # Translation (~17GB)

# Note: these don't all fit simultaneously;
# Ollama automatically loads/unloads as needed
ollama run qwen3:30b-a3b
```
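Beyond the CLI, Ollama serves a local REST API (on port 11434 by default), which is how you'd wire these models into scripts and tools. A minimal sketch, assuming the default endpoint and a running Ollama server:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model: str, prompt: str) -> dict:
    # stream=False returns a single JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send one generation request to a locally running Ollama server."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With the server running and the model pulled:
# print(generate("qwen3:30b-a3b", "Explain MoE routing in one sentence."))
```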
The 24GB Advantage: Everything Fits
The models that define 2026 local AI all fit in 24GB:
- Qwen 3 30B MoE - the speed breakthrough (196 tok/s)
- Qwen 2.5 Coder 32B - the Copilot killer (92.7% HumanEval)
- TranslateGemma 27B - near-Google Translate quality
- Complete voice pipelines - STT + TTS + LLM running simultaneously
- 32B chat models - competing with cloud APIs on quality
This is the tier where “should I use local or cloud?” becomes a genuine question rather than an automatic answer.
When to Upgrade
24GB covers nearly everything. The jump to 32GB (RTX 5090) buys you:
- Higher-precision quants on 32B models (Q6 instead of Q4) - noticeably better quality
- 70B models at Q3 on a single GPU - massive knowledge
- Multi-model pipelines with more headroom
- Future-proofing as 40B+ dense models become common
See our 32GB guide for the frontier tier.