Local AI by VRAM Tier - 8GB | 12GB | 16GB | 24GB | 32GB
A local chatbot that runs on your own hardware, answers your questions without sending data to anyone, and costs nothing per query. That’s the pitch. The reality depends entirely on your GPU.
We tested the major open-weight chat models across five VRAM tiers to find what actually works - not just what looks good on a benchmark chart, but what feels responsive and useful in daily conversations. Every model was run through Ollama at Q4_K_M quantization unless noted otherwise.
How We Compare
The benchmarks that matter for chat:
- MMLU / MMLU-Pro - general knowledge and reasoning across dozens of subjects
- MT-Bench - multi-turn conversation quality scored by GPT-4
- GSM8K - grade school math (basic reasoning sanity check)
- GPQA Diamond - graduate-level science questions (ceiling test)
- Tokens/second - how fast it actually talks back to you
Anything above 15 tok/s feels responsive for chat. Above 30 tok/s is fast enough that you forget it’s local. Below 10 tok/s gets painful.
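If you want to see where your own setup lands on that scale, Ollama already reports the raw numbers: the final `/api/generate` response object includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds). This sketch just does the division; the sample figures are made up:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's /api/generate response fields:
    eval_count tokens produced over eval_duration nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)

# Hypothetical response: 512 tokens generated over 16 s of eval time.
speed = tokens_per_second(512, 16_000_000_000)
print(f"{speed:.1f} tok/s")  # 32.0 tok/s - fast enough to forget it's local
```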
8GB VRAM {#8gb}
GPUs: RTX 4060, RTX 3060 8GB, RTX 3070, GTX 1080
The entry tier. You’re limited to models around 4-9B parameters at Q4 quantization. A year ago this meant mediocre responses. In 2026, it means genuinely useful conversations.
Candidates
| Model | Params | Quant | VRAM Used | MMLU | GSM8K | Speed (RTX 4060) |
|---|---|---|---|---|---|---|
| Qwen 3.5 9B | 9B | Q4_K_M | ~7.0 GB | 82.5 (Pro) | - | 35-45 tok/s |
| Qwen 3 8B | 8B | Q4_K_M | 6.5 GB | 76.9 | 89.8 | ~40 tok/s |
| Gemma 3 4B QAT | 4B | INT4 | 3.5 GB | 43.6 | 89.2 | 60+ tok/s |
| Phi-4-mini | 3.8B | Q4_K_M | ~2.5 GB | 52.8 (Pro) | 88.6 | 70+ tok/s |
| Llama 3.2 3B | 3B | Q4_K_M | ~2 GB | - | 77.7 | 80+ tok/s |
Winner: Qwen 3.5 9B
The Qwen 3.5 9B scores 82.5 on MMLU-Pro - beating OpenAI’s GPT-OSS-120B (80.8) with a fraction of the parameters. It also hits 81.7 on GPQA Diamond, meaning it handles genuinely difficult questions, not just trivia.
At Q4_K_M it fits in 7GB, leaving about 1GB for context on an 8GB card. That’s tight but workable for normal conversations. If you need longer context or want to run other things alongside it, drop to the Qwen 3.5 4B (79.1 MMLU-Pro, ~3GB VRAM) and enjoy substantially more headroom.
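The fit math is simple enough to do yourself: weight footprint is parameters × bits-per-weight / 8, plus overhead for KV cache and runtime buffers. A rough sketch (the flat 1 GB overhead is a placeholder of ours; real KV cache grows with context length):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead_gb: float = 1.0) -> float:
    """Weight footprint plus a flat allowance for KV cache and
    runtime buffers. The overhead figure is a simplification: the
    real KV cache scales with context length."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

# Q4_K_M averages roughly 4.5-5 bits per weight in practice.
print(round(estimate_vram_gb(9, 4.8), 1))  # 6.4
```

Treat the result as a floor, not a guarantee: loader overhead and long contexts push real usage higher, as the measured figures in the tables show.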
Runner-up: Qwen 3 8B. Slightly lower benchmarks (76.9 MMLU) but battle-tested with a massive Ollama community. If the 3.5 9B gives you any trouble, this is the safe fallback.
Speed pick: Gemma 3 4B QAT. Google’s quantization-aware training means it was trained to be quantized, so quality degradation is minimal. At 3.5GB, you can run it alongside Whisper for voice chat. The MMLU score is lower, but the GSM8K math score (89.2%) shows it reasons better than the headline number suggests.
The honest take: 8GB chat models are good for everyday questions, brainstorming, and simple tasks. They struggle with complex multi-step reasoning, nuanced analysis, and tasks requiring deep world knowledge. If you ask them to plan a trip with detailed logistics, you’ll see the limits.
For all use cases at this level, see our 8GB VRAM complete guide.
12GB VRAM {#12gb}
GPUs: RTX 3060 12GB, RTX 4070
The “comfortable” tier. You can run 8-14B models with room to breathe. This is where local chat goes from “usable” to “actually good.”
Candidates
| Model | Params | Quant | VRAM Used | MMLU | GSM8K | Speed (RTX 4070) |
|---|---|---|---|---|---|---|
| Qwen 3 14B | 14B | Q4_K_M | 10.7 GB | 81.1 | 92.5 | ~50 tok/s |
| DeepSeek-R1 14B | 14B | Q4_K_M | ~9 GB | ~79 | ~90 | ~45 tok/s |
| Qwen 3.5 9B | 9B | Q4_K_M | ~7 GB | 82.5 (Pro) | - | ~55 tok/s |
| Phi-4 14B | 14B | Q4_K_M | ~10 GB | 84.8 | ~91 | 25-35 tok/s |
Winner: Qwen 3 14B
The jump from 8B to 14B is where local models start feeling smart rather than just functional. Qwen 3 14B at Q4_K_M uses about 10.7GB VRAM, leaving 1.3GB for context on a 12GB card. It scores 81.1 MMLU, 92.5 GSM8K, and runs at roughly 50 tok/s on an RTX 4070.
The quality difference from the 8B tier is noticeable in conversations. Longer, more coherent responses. Better handling of follow-up questions. More accurate factual recall.
Runner-up: DeepSeek-R1 14B Distill. If you want to see the model think through problems step-by-step, this is the reasoning specialist. It shows its work - literally prints chain-of-thought traces. Performance: 58.6 tok/s on RTX 4090, around 45 tok/s on RTX 4070 at Q4 quantization (9GB VRAM). The reasoning traces make it slower for simple questions but genuinely useful for debugging logic or working through math.
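Those traces arrive wrapped in `<think>...</think>` tags in the distills' output, so if you're piping answers into another tool you can strip them out. A minimal sketch (the tag convention is DeepSeek-R1's; the sample text is made up):

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> traces, keeping only the final answer."""
    return THINK_BLOCK.sub("", text).strip()

raw = "<think>2+2: add the numbers.</think>\nThe answer is 4."
print(strip_reasoning(raw))  # The answer is 4.
```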
Surprise pick: Qwen 3.5 9B again. On 12GB it runs with plenty of context headroom and hits ~55 tok/s. Its MMLU-Pro score (82.5) actually beats the Qwen 3 14B on the harder Pro variant, despite being smaller. If speed and context length matter more than raw knowledge breadth, the 9B at 12GB is arguably the better experience.
For all use cases at this level, see our 12GB VRAM complete guide.
16GB VRAM {#16gb}
GPUs: RTX 4060 Ti 16GB, RTX 5060, Intel Arc A770, AMD RX 7800 XT
The sweet spot. Every 14B model fits comfortably, and you start reaching into 20B territory. This is also where MoE models begin making sense.
Candidates
| Model | Params | Quant | VRAM Used | MMLU | Key Score | Speed (est.) |
|---|---|---|---|---|---|---|
| GPT-OSS 20B | 20B | Q4_K_M | ~14 GB | ~80 | matches o3-mini | ~140 tok/s |
| Qwen 3 14B | 14B | Q4_K_M | 10.7 GB | 81.1 | GSM8K 92.5 | ~60 tok/s |
| Phi-4 14B | 14B | Q5_K_M | ~12 GB | 84.8 | MATH 80.4 | 25-35 tok/s |
| Gemma 3 12B QAT | 12B | INT4 | 6.6 GB | - | 128K context | ~55 tok/s |
Winner: GPT-OSS 20B
OpenAI’s first open-weight model is a plot twist nobody expected. GPT-OSS 20B delivers results comparable to o3-mini on standard benchmarks while using just 14GB of VRAM through Ollama. The reported 140 tok/s on a 16GB GPU is exceptional - faster than most 8B models on lesser hardware.
If you have 16GB, this is the model to try first. It represents OpenAI’s inference optimization expertise applied to an open-weight model, and the speed advantage is dramatic.
Runner-up: Qwen 3 14B at Q5_K_M. With 16GB you can afford higher quantization (Q5 instead of Q4), which recovers 1-2 points on most benchmarks. The extra quality is noticeable in factual accuracy.
STEM specialist: Phi-4 14B. Microsoft’s model scores 80.4% on MATH (competition-level problems) and 82.6% HumanEval. If your chat needs lean technical, Phi-4 is worth the slower speed. Run it at Q5_K_M for best quality at this VRAM tier.
Context champion: Gemma 3 12B QAT at just 6.6GB VRAM leaves nearly 10GB for context - enough for pasting in entire documents and having a conversation about them.
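Note that Ollama won't use that headroom by default: the context window is capped at a modest `num_ctx` unless you raise it. One way is a custom Modelfile (the model tag and the `gemma3-longctx` name below are our examples; substitute whatever tag you pulled):

```
# Modelfile: enlarge the context window to use the spare VRAM
FROM gemma3:12b-it-qat
PARAMETER num_ctx 32768
```

Build and run it with `ollama create gemma3-longctx -f Modelfile`, then `ollama run gemma3-longctx`. The enlarged KV cache is exactly what consumes the spare VRAM, so scale `num_ctx` to the headroom you actually have.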
For all use cases at this level, see our 16GB VRAM complete guide.
24GB VRAM {#24gb}
GPUs: RTX 3090, RTX 4090
This is where local models start competing with cloud APIs on quality. The 27-32B tier was previously the exclusive domain of enterprise hardware. Now a single consumer GPU handles it.
Candidates
| Model | Params | Quant | VRAM Used | MMLU | Key Score | Speed (RTX 4090) |
|---|---|---|---|---|---|---|
| Qwen 3 30B MoE | 30B (MoE) | Q4_K_M | ~18 GB | - | - | 196 tok/s |
| Qwen 3 32B | 32B | Q4_K_M | 22.2 GB | 83.6 | GPQA 49.5 | 34 tok/s |
| EXAONE 4.0 32B | 32B | Q4_K_M | ~22 GB | 92.3 (Redux) | strong coding | ~30 tok/s |
| Gemma 3 27B QAT | 27B | INT4 | 14.1 GB | 78.6 | multimodal | ~40 tok/s |
| DeepSeek-R1 32B | 32B | Q4_K_M | ~22 GB | ~82 | reasoning chains | ~30 tok/s |
Winner: Qwen 3 30B MoE
The game-changer. This mixture-of-experts model delivers 196 tok/s on an RTX 4090 while fitting in 18GB VRAM. That’s faster than most 8B dense models, with 30B-class quality. MoE architecture means only a fraction of the parameters activate per token, so you get big-model knowledge at small-model speed.
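The routing idea is easy to sketch. This toy layer (NumPy, made-up numbers, not Qwen's actual architecture) scores all eight "experts" but runs only the top two per token:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Toy MoE layer: a gating network scores every expert, but only
    the top_k highest-scoring ones run for this token. With 8 experts
    and top_k=2, three quarters of the expert weights sit idle, which
    is where the speed comes from."""
    scores = x @ gate_w                    # one score per expert
    top = np.argsort(scores)[-top_k:]      # indices of the chosen experts
    w = np.exp(scores[top])
    w /= w.sum()                           # softmax over the chosen few
    return sum(wi * experts[i](x) for i, wi in zip(top, w))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
gate_w = rng.normal(size=(d, n_experts))
# Each "expert" is just a random linear map in this sketch.
mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [(lambda m: (lambda v: v @ m))(m) for m in mats]
out = moe_forward(rng.normal(size=d), gate_w, experts)
print(out.shape)  # (16,)
```

The catch, as the VRAM column shows: every expert's weights still have to be loaded, so MoE saves compute per token, not memory.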
This is the model that makes the “local models are too slow” argument obsolete.
Knowledge leader: Qwen 3 32B dense. Higher benchmark scores than the MoE variant (83.6 MMLU, 49.5 GPQA) but at 34 tok/s, it’s roughly 6x slower. Choose this when accuracy matters more than speed - research questions, factual analysis, detailed explanations.
Dark horse: EXAONE 4.0 32B from LG AI Research scores 92.3 on MMLU-Redux, which is competitive with frontier models. It’s newer and less community-tested than Qwen, but the benchmarks are hard to ignore. Worth trying if Qwen 3 32B doesn’t satisfy.
Budget option: Gemma 3 27B QAT fits in just 14.1GB at INT4, leaving 10GB free for context or running other models alongside it. The benchmarks are lower than Qwen, but the multimodal capabilities (it handles images too) and generous context headroom make it uniquely versatile at this tier.
For all use cases at this level, see our 24GB VRAM complete guide.
32GB VRAM {#32gb}
GPUs: RTX 5090
The new frontier. 32GB opens up high-quant 32B models, comfortable 70B at aggressive quantization, and MoE monsters. The RTX 5090 delivers 60-80% faster AI performance than the 4090, with roughly 213 tok/s average inference speed.
Candidates
| Model | Params | Quant | VRAM Used | MMLU | Key Score | Speed (RTX 5090) |
|---|---|---|---|---|---|---|
| Qwen 3 30B MoE | 30B (MoE) | Q8_0 | ~28 GB | - | highest quality MoE | ~320 tok/s |
| Qwen 3 32B | 32B | Q6_K | ~28 GB | 83.6 | near-lossless | ~55 tok/s |
| EXAONE 4.0 32B | 32B | Q6_K | ~28 GB | 92.3 (Redux) | frontier-class | ~50 tok/s |
| Llama 3.3 70B | 70B | Q3_K_M | ~32 GB | 83.6 | 80.5 HumanEval | ~20 tok/s |
Winner: Qwen 3 30B MoE at Q8_0
With 32GB, you can run the MoE model at Q8_0 - nearly lossless quantization - while still having headroom. Estimated speed around 320 tok/s on the RTX 5090. That’s cloud-API-fast with no network round-trips, zero cost per query, and zero data leaving your machine.
Quality ceiling: Qwen 3 32B or EXAONE 4.0 32B at Q6_K quantization. At this level, quantization artifacts are essentially imperceptible. You’re getting the full quality these models can offer. The EXAONE 4.0’s 92.3 MMLU-Redux score puts it in the same conversation as GPT-4-class models.
Reach pick: Llama 3.3 70B at Q3_K_M squeezes into 32GB. The quantization is aggressive and costs 3-5 MMLU points, but you get a 70B model on a single GPU. At ~20 tok/s it’s not fast, but for complex analysis tasks where you want maximum reasoning depth, it’s an option no other consumer GPU can match.
What about Llama 4 Scout? Its MoE architecture (17B active from 109B total) keeps per-token compute small, but all 109B weights still have to be resident: even after quantization, the footprint needs 48GB+ for reasonable performance. On a single 32GB card with aggressive quantization and the overflow offloaded to system RAM, expect 2-3 tok/s - not practical for chat.
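A back-of-envelope check on that footprint, assuming roughly 4.5 bits per weight for a Q4-class quant:

```python
# Llama 4 Scout: 109B total parameters. MoE saves compute, not memory;
# every expert's weights must be resident for the router to pick from.
total_params_b = 109
bits_per_weight = 4.5                    # typical Q4-class figure
weights_gb = total_params_b * bits_per_weight / 8
print(round(weights_gb, 1))              # 61.3 GB of weights alone
```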
For all use cases at this level, see our 32GB VRAM complete guide.
Cross-Tier Summary
| Tier | Best Pick | Runner-Up | Value Pick | Speed Champion |
|---|---|---|---|---|
| 8GB | Qwen 3.5 9B | Qwen 3 8B | Gemma 3 4B QAT | Phi-4-mini (70+ tok/s) |
| 12GB | Qwen 3 14B | DeepSeek-R1 14B | Qwen 3.5 9B | Qwen 3.5 9B (~55 tok/s) |
| 16GB | GPT-OSS 20B | Qwen 3 14B (Q5) | Gemma 3 12B QAT | GPT-OSS 20B (~140 tok/s) |
| 24GB | Qwen 3 30B MoE | Qwen 3 32B | Gemma 3 27B QAT | Qwen 3 30B MoE (196 tok/s) |
| 32GB | Qwen 3 30B MoE Q8 | EXAONE 4.0 32B Q6 | Qwen 3 32B Q6 | Qwen 3 30B MoE (~320 tok/s) |
Quick Start
Every model listed above runs through Ollama. Install it, pull a model, and start chatting:
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pick your model (choose based on your VRAM tier above)
ollama pull qwen3.5:9b      # 8GB tier
ollama pull qwen3:14b       # 12GB tier
ollama pull gpt-oss:20b     # 16GB tier
ollama pull qwen3:30b-a3b   # 24GB tier (MoE)

# Start chatting
ollama run qwen3.5:9b
```
For a ChatGPT-style web interface, pair Ollama with Open WebUI. We have a full setup guide that walks through the process.
What These Models Can’t Do
Local chat models have real limits compared to frontier cloud models like Claude, GPT-4.5, or Gemini 2.5 Pro:
- Long, complex instructions - multi-step tasks with many constraints tend to fall apart below 32B parameters
- Cutting-edge knowledge - training data cuts off months before release; these models don’t know about last week
- Creative writing at scale - short-form is fine, novel-length coherence degrades
- Reliable factual accuracy - they hallucinate, especially on obscure topics. Always verify claims.
The 32B models at high quantization come closest to closing the gap. But for now, local chat is best as a private, instant, free alternative for everyday conversations - not as a replacement for frontier models on hard tasks.