Local AI by VRAM Tier - 8GB | 12GB | 16GB | 24GB | 32GB
A local chatbot that runs on your own hardware, answers your questions without sending data to anyone, and costs nothing per query. That’s the pitch. The reality depends entirely on your GPU.
We tested the major open-weight chat models across five VRAM tiers to find what actually works - not just what looks good on a benchmark chart, but what feels responsive and useful in daily conversations. Every model was run through Ollama at Q4_K_M quantization unless noted otherwise.
How We Compare
The benchmarks that matter for chat:
- MMLU / MMLU-Pro - general knowledge and reasoning across dozens of subjects
- MT-Bench - multi-turn conversation quality scored by GPT-4
- GSM8K - grade school math (basic reasoning sanity check)
- GPQA Diamond - graduate-level science questions (ceiling test)
- Tokens/second - how fast it actually talks back to you
Anything above 15 tok/s feels responsive for chat. Above 30 tok/s is fast enough that you forget it’s local. Below 10 tok/s gets painful.
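If you want to see where your own setup lands on that scale, Ollama already reports the raw numbers: the final `/api/generate` response object includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds). This sketch just does the division; the sample figures are made up:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's /api/generate response fields:
    eval_count tokens produced over eval_duration nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)

# Hypothetical response: 512 tokens generated over 16 s of eval time.
speed = tokens_per_second(512, 16_000_000_000)
print(f"{speed:.1f} tok/s")  # 32.0 tok/s - fast enough to forget it's local
```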
8GB VRAM {#8gb}
GPUs: RTX 4060, RTX 3060 8GB, RTX 3070, GTX 1080
The entry tier. You’re limited to models around 4-9B parameters at Q4 quantization. A year ago this meant mediocre responses. In 2026, it means genuinely useful conversations.
Candidates
| Model | Params | Quant | VRAM Used | MMLU | GSM8K | Speed (RTX 4060) |
|---|---|---|---|---|---|---|
| Qwen 3.5 9B | 9B | Q4_K_M | ~7.0 GB | 82.5 (Pro) | - | 35-45 tok/s |
| Qwen 3 8B | 8B | Q4_K_M | 6.5 GB | 76.9 | 89.8 | ~40 tok/s |
| Gemma 3 4B QAT | 4B | INT4 | 3.5 GB | 43.6 | 89.2 | 60+ tok/s |
| Phi-4-mini | 3.8B | Q4_K_M | ~2.5 GB | 52.8 (Pro) | 88.6 | 70+ tok/s |
| Llama 3.2 3B | 3B | Q4_K_M | ~2 GB | - | 77.7 | 80+ tok/s |
Winner: Qwen 3.5 9B
The Qwen 3.5 9B scores 82.5 on MMLU-Pro - beating OpenAI’s GPT-OSS-120B (80.8) with a fraction of the parameters. It also hits 81.7 on GPQA Diamond, meaning it handles genuinely difficult questions, not just trivia.
At Q4_K_M it fits in 7GB, leaving about 1GB for context on an 8GB card. That’s tight but workable for normal conversations. If you need longer context or want to run other things alongside it, drop to the Qwen 3.5 4B (79.1 MMLU-Pro, ~3GB VRAM) and enjoy substantially more headroom.
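The fit math is simple enough to do yourself: weight footprint is parameters × bits-per-weight / 8, plus overhead for KV cache and runtime buffers. A rough sketch (the flat 1 GB overhead is a placeholder of ours; real KV cache grows with context length):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead_gb: float = 1.0) -> float:
    """Weight footprint plus a flat allowance for KV cache and
    runtime buffers. The overhead figure is a simplification: the
    real KV cache scales with context length."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

# Q4_K_M averages roughly 4.5-5 bits per weight in practice.
print(round(estimate_vram_gb(9, 4.8), 1))  # 6.4
```

Treat the result as a floor, not a guarantee: loader overhead and long contexts push real usage higher, as the measured figures in the tables show.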
Runner-up: Qwen 3 8B. Slightly lower benchmarks (76.9 MMLU) but battle-tested with a massive Ollama community. If the 3.5 9B gives you any trouble, this is the safe fallback.
Speed pick: Gemma 3 4B QAT. Google’s quantization-aware training means it was trained to be quantized, so quality degradation is minimal. At 3.5GB, you can run it alongside Whisper for voice chat. The MMLU score is lower, but the GSM8K math score (89.2%) shows it reasons better than the headline number suggests.
The honest take: 8GB chat models are good for everyday questions, brainstorming, and simple tasks. They struggle with complex multi-step reasoning, nuanced analysis, and tasks requiring deep world knowledge. If you ask them to plan a trip with detailed logistics, you’ll see the limits.
For all use cases at this level, see our 8GB VRAM complete guide.
12GB VRAM {#12gb}
GPUs: RTX 3060 12GB, RTX 4070
The “comfortable” tier. You can run 8-14B models with room to breathe. This is where local chat goes from “usable” to “actually good.”
Candidates
| Model | Params | Quant | VRAM Used | MMLU | GSM8K | Speed (RTX 4070) |
|---|---|---|---|---|---|---|
| Qwen 3 14B | 14B | Q4_K_M | 10.7 GB | 81.1 | 92.5 | ~50 tok/s |
| DeepSeek-R1 14B | 14B | Q4_K_M | ~9 GB | ~79 | ~90 | ~45 tok/s |
| Qwen 3.5 9B | 9B | Q4_K_M | ~7 GB | 82.5 (Pro) | - | ~55 tok/s |
| Phi-4 14B | 14B | Q4_K_M | ~10 GB | 84.8 | ~91 | 25-35 tok/s |
Winner: Qwen 3 14B
The jump from 8B to 14B is where local models start feeling smart rather than just functional. Qwen 3 14B at Q4_K_M uses about 10.7GB VRAM, leaving 1.3GB for context on a 12GB card. It scores 81.1 MMLU, 92.5 GSM8K, and runs at roughly 50 tok/s on an RTX 4070.
The quality difference from the 8B tier is noticeable in conversations. Longer, more coherent responses. Better handling of follow-up questions. More accurate factual recall.
Runner-up: DeepSeek-R1 14B Distill. If you want to see the model think through problems step-by-step, this is the reasoning specialist. It shows its work - literally prints chain-of-thought traces. Performance: 58.6 tok/s on RTX 4090, around 45 tok/s on RTX 4070 at Q4 quantization (9GB VRAM). The reasoning traces make it slower for simple questions but genuinely useful for debugging logic or working through math.
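Those traces arrive wrapped in `<think>...</think>` tags in the distills' output, so if you're piping answers into another tool you can strip them out. A minimal sketch (the tag convention is DeepSeek-R1's; the sample text is made up):

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> traces, keeping only the final answer."""
    return THINK_BLOCK.sub("", text).strip()

raw = "<think>2+2: add the numbers.</think>\nThe answer is 4."
print(strip_reasoning(raw))  # The answer is 4.
```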
Surprise pick: Qwen 3.5 9B again. On 12GB it runs with plenty of context headroom and hits ~55 tok/s. Its MMLU-Pro score (82.5) actually beats the Qwen 3 14B on the harder Pro variant, despite being smaller. If speed and context length matter more than raw knowledge breadth, the 9B at 12GB is arguably the better experience.
For all use cases at this level, see our 12GB VRAM complete guide.
16GB VRAM {#16gb}
GPUs: RTX 4060 Ti 16GB, RTX 5060, Intel Arc A770, AMD RX 7800 XT
The sweet spot. Every 14B model fits comfortably, and you start reaching into 20B territory. This is also where MoE models begin making sense.
Candidates
| Model | Params | Quant | VRAM Used | MMLU | Key Score | Speed (est.) |
|---|---|---|---|---|---|---|
| GPT-OSS 20B | 20B | Q4_K_M | ~14 GB | ~80 | matches o3-mini | ~140 tok/s |
| Qwen 3 14B | 14B | Q4_K_M | 10.7 GB | 81.1 | GSM8K 92.5 | ~60 tok/s |
| Phi-4 14B | 14B | Q5_K_M | ~12 GB | 84.8 | MATH 80.4 | 25-35 tok/s |
| Gemma 3 12B QAT | 12B | INT4 | 6.6 GB | - | 128K context | ~55 tok/s |
Winner: GPT-OSS 20B
OpenAI’s first open-weight model is a plot twist nobody expected. GPT-OSS 20B delivers results comparable to o3-mini on standard benchmarks while using just 14GB of VRAM through Ollama. The reported 140 tok/s on a 16GB GPU is exceptional - faster than most 8B models on lesser hardware.
If you have 16GB, this is the model to try first. It represents OpenAI’s inference optimization expertise applied to an open-weight model, and the speed advantage is dramatic.
Runner-up: Qwen 3 14B at Q5_K_M. With 16GB you can afford higher quantization (Q5 instead of Q4), which recovers 1-2 points on most benchmarks. The extra quality is noticeable in factual accuracy.
STEM specialist: Phi-4 14B. Microsoft’s model scores 80.4% on MATH (competition-level problems) and 82.6% HumanEval. If your chat needs lean technical, Phi-4 is worth the slower speed. Run it at Q5_K_M for best quality at this VRAM tier.
Context champion: Gemma 3 12B QAT at just 6.6GB VRAM leaves nearly 10GB for context - enough for pasting in entire documents and having a conversation about them.
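Note that Ollama won't use that headroom by default: the context window is capped at a modest `num_ctx` unless you raise it. One way is a custom Modelfile (the model tag and the `gemma3-longctx` name below are our examples; substitute whatever tag you pulled):

```
# Modelfile: enlarge the context window to use the spare VRAM
FROM gemma3:12b-it-qat
PARAMETER num_ctx 32768
```

Build and run it with `ollama create gemma3-longctx -f Modelfile`, then `ollama run gemma3-longctx`. The enlarged KV cache is exactly what consumes the spare VRAM, so scale `num_ctx` to the headroom you actually have.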
For all use cases at this level, see our 16GB VRAM complete guide.
24GB VRAM {#24gb}
GPUs: RTX 3090, RTX 4090
This is where local models start competing with cloud APIs on quality. The 27-32B tier was previously the exclusive domain of enterprise hardware. Now a single consumer GPU handles it.
Candidates
| Model | Params | Quant | VRAM Used | MMLU | Key Score | Speed (RTX 4090) |
|---|---|---|---|---|---|---|
| Qwen 3 30B MoE | 30B (MoE) | Q4_K_M | ~18 GB | - | - | 196 tok/s |
| Qwen 3 32B | 32B | Q4_K_M | 22.2 GB | 83.6 | GPQA 49.5 | 34 tok/s |
| EXAONE 4.0 32B | 32B | Q4_K_M | ~22 GB | 92.3 (Redux) | strong coding | ~30 tok/s |
| Gemma 3 27B QAT | 27B | INT4 | 14.1 GB | 78.6 | multimodal | ~40 tok/s |
| DeepSeek-R1 32B | 32B | Q4_K_M | ~22 GB | ~82 | reasoning chains | ~30 tok/s |
Winner: Qwen 3 30B MoE
The game-changer. This mixture-of-experts model delivers 196 tok/s on an RTX 4090 while fitting in 18GB VRAM. That’s faster than most 8B dense models, with 30B-class quality. MoE architecture means only a fraction of the parameters activate per token, so you get big-model knowledge at small-model speed.
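The routing idea is easy to sketch. This toy layer (NumPy, made-up numbers, not Qwen's actual architecture) scores all eight "experts" but runs only the top two per token:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Toy MoE layer: a gating network scores every expert, but only
    the top_k highest-scoring ones run for this token. With 8 experts
    and top_k=2, three quarters of the expert weights sit idle, which
    is where the speed comes from."""
    scores = x @ gate_w                    # one score per expert
    top = np.argsort(scores)[-top_k:]      # indices of the chosen experts
    w = np.exp(scores[top])
    w /= w.sum()                           # softmax over the chosen few
    return sum(wi * experts[i](x) for i, wi in zip(top, w))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
gate_w = rng.normal(size=(d, n_experts))
# Each "expert" is just a random linear map in this sketch.
mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [(lambda m: (lambda v: v @ m))(m) for m in mats]
out = moe_forward(rng.normal(size=d), gate_w, experts)
print(out.shape)  # (16,)
```

The catch, as the VRAM column shows: every expert's weights still have to be loaded, so MoE saves compute per token, not memory.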
This is the model that makes the “local models are too slow” argument obsolete.
Knowledge leader: Qwen 3 32B dense. Higher benchmark scores than the MoE variant (83.6 MMLU, 49.5 GPQA) but at 34 tok/s, it’s roughly 6x slower. Choose this when accuracy matters more than speed - research questions, factual analysis, detailed explanations.
Dark horse: EXAONE 4.0 32B from LG AI Research scores 92.3 on MMLU-Redux, which is competitive with frontier models. It’s newer and less community-tested than Qwen, but the benchmarks are hard to ignore. Worth trying if Qwen 3 32B doesn’t satisfy.
Budget option: Gemma 3 27B QAT fits in just 14.1GB at INT4, leaving 10GB free for context or running other models alongside it. The benchmarks are lower than Qwen, but the multimodal capabilities (it handles images too) and generous context headroom make it uniquely versatile at this tier.
For all use cases at this level, see our 24GB VRAM complete guide.
32GB VRAM {#32gb}
GPUs: RTX 5090
The new frontier. 32GB opens up high-quant 32B models, comfortable 70B at aggressive quantization, and MoE monsters. The RTX 5090 delivers 60-80% faster AI performance than the 4090, with roughly 213 tok/s average inference speed.
Candidates
| Model | Params | Quant | VRAM Used | MMLU | Key Score | Speed (RTX 5090) |
|---|---|---|---|---|---|---|
| Qwen 3 30B MoE | 30B (MoE) | Q8_0 | ~28 GB | - | highest quality MoE | ~320 tok/s |
| Qwen 3 32B | 32B | Q6_K | ~28 GB | 83.6 | near-lossless | ~55 tok/s |
| EXAONE 4.0 32B | 32B | Q6_K | ~28 GB | 92.3 (Redux) | frontier-class | ~50 tok/s |
| Llama 3.3 70B | 70B | Q3_K_M | ~32 GB | 83.6 | 80.5 HumanEval | ~20 tok/s |
Winner: Qwen 3 30B MoE at Q8_0
With 32GB, you can run the MoE model at Q8_0 - nearly lossless quantization - while still having headroom. Estimated speed around 320 tok/s on the RTX 5090. That’s cloud-API-fast with no network round-trips, zero cost per query, and zero data leaving your machine.
Quality ceiling: Qwen 3 32B or EXAONE 4.0 32B at Q6_K quantization. At this level, quantization artifacts are essentially imperceptible. You’re getting the full quality these models can offer. The EXAONE 4.0’s 92.3 MMLU-Redux score puts it in the same conversation as GPT-4-class models.
Reach pick: Llama 3.3 70B at Q3_K_M squeezes into 32GB. The quantization is aggressive and costs 3-5 MMLU points, but you get a 70B model on a single GPU. At ~20 tok/s it’s not fast, but for complex analysis tasks where you want maximum reasoning depth, it’s an option no other consumer GPU can match.
What about Llama 4 Scout? Its MoE architecture (17B active from 109B total) keeps per-token compute small, but all 109B weights still have to be resident: even after quantization, the footprint needs 48GB+ for reasonable performance. On a single 32GB card with aggressive quantization and the overflow offloaded to system RAM, expect 2-3 tok/s - not practical for chat.
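A back-of-envelope check on that footprint, assuming roughly 4.5 bits per weight for a Q4-class quant:

```python
# Llama 4 Scout: 109B total parameters. MoE saves compute, not memory;
# every expert's weights must be resident for the router to pick from.
total_params_b = 109
bits_per_weight = 4.5                    # typical Q4-class figure
weights_gb = total_params_b * bits_per_weight / 8
print(round(weights_gb, 1))              # 61.3 GB of weights alone
```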
For all use cases at this level, see our 32GB VRAM complete guide.
Cross-Tier Summary
| Tier | Best Pick | Runner-Up | Value Pick | Speed Champion |
|---|---|---|---|---|
| 8GB | Qwen 3.5 9B | Qwen 3 8B | Gemma 3 4B QAT | Phi-4-mini (70+ tok/s) |
| 12GB | Qwen 3 14B | DeepSeek-R1 14B | Qwen 3.5 9B | Qwen 3.5 9B (~55 tok/s) |
| 16GB | GPT-OSS 20B | Qwen 3 14B (Q5) | Gemma 3 12B QAT | GPT-OSS 20B (~140 tok/s) |
| 24GB | Qwen 3 30B MoE | Qwen 3 32B | Gemma 3 27B QAT | Qwen 3 30B MoE (196 tok/s) |
| 32GB | Qwen 3 30B MoE Q8 | EXAONE 4.0 32B Q6 | Qwen 3 32B Q6 | Qwen 3 30B MoE (~320 tok/s) |
Quick Start
Every model listed above runs through Ollama. Install it, pull a model, and start chatting:
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pick your model (choose based on your VRAM tier above)
ollama pull qwen3.5:9b      # 8GB tier
ollama pull qwen3:14b       # 12GB tier
ollama pull gpt-oss:20b     # 16GB tier
ollama pull qwen3:30b-a3b   # 24GB tier (MoE)

# Start chatting
ollama run qwen3.5:9b
```
For a ChatGPT-style web interface, pair Ollama with Open WebUI. We have a full setup guide that walks through the process.
What These Models Can’t Do
Local chat models have real limits compared to frontier cloud models like Claude, GPT-4.5, or Gemini 2.5 Pro:
- Long, complex instructions - multi-step tasks with many constraints tend to fall apart below 32B parameters
- Cutting-edge knowledge - training data cuts off months before release; these models don’t know about last week
- Creative writing at scale - short-form is fine, novel-length coherence degrades
- Reliable factual accuracy - they hallucinate, especially on obscure topics. Always verify claims.
The 32B models at high quantization come closest to closing the gap. But for now, local chat is best as a private, instant, free alternative for everyday conversations - not as a replacement for frontier models on hard tasks.