Best Local Models for Chat in 2026: Every VRAM Tier Tested

Head-to-head comparison of local chat and assistant models from 8GB to 32GB VRAM. Benchmarks, real speeds, and honest assessments of what your GPU can actually run.


Local AI by VRAM Tier - 8GB | 12GB | 16GB | 24GB | 32GB

Deep dives: Chat | Coding | Translation | Vision | Speech | Agents

A local chatbot that runs on your own hardware, answers your questions without sending data to anyone, and costs nothing per query. That’s the pitch. The reality depends entirely on your GPU.

We tested the major open-weight chat models across five VRAM tiers to find what actually works - not just what looks good on a benchmark chart, but what feels responsive and useful in daily conversations. Every model was run through Ollama at Q4_K_M quantization unless noted otherwise.

How We Compare

The benchmarks that matter for chat:

  • MMLU / MMLU-Pro - general knowledge and reasoning across dozens of subjects
  • MT-Bench - multi-turn conversation quality scored by GPT-4
  • GSM8K - grade school math (basic reasoning sanity check)
  • GPQA Diamond - graduate-level science questions (ceiling test)
  • Tokens/second - how fast it actually talks back to you

Anything above 15 tok/s feels responsive for chat. Above 30 tok/s is fast enough that you forget it’s local. Below 10 tok/s gets painful.
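You can measure tok/s on your own hardware: Ollama's non-streaming responses include `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating). A minimal sketch in Python, assuming a local Ollama server on the default port 11434 and using only the standard library:

```python
import json
import urllib.request

def tokens_per_second(response: dict) -> float:
    """Compute generation speed from Ollama's response metadata.

    Non-streaming /api/generate responses report eval_count (tokens
    generated) and eval_duration (nanoseconds spent generating).
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)

def benchmark(model: str, prompt: str = "Explain RAID levels briefly.") -> float:
    """Run one generation against a local Ollama server and return tok/s."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return tokens_per_second(json.load(resp))

# Example: a response that generated 450 tokens over 10 seconds of eval time
sample = {"eval_count": 450, "eval_duration": 10_000_000_000}
print(tokens_per_second(sample))  # → 45.0
```

Run `benchmark("qwen3:8b")` (or whatever tag you pulled) a few times and average, since the first call includes model load time.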

8GB VRAM {#8gb}

GPUs: RTX 4060, RTX 3060 8GB, RTX 3070, GTX 1080

The entry tier. You’re limited to models around 4-9B parameters at Q4 quantization. A year ago this meant mediocre responses. In 2026, it means genuinely useful conversations.

Candidates

| Model | Params | Quant | VRAM Used | MMLU | GSM8K | Speed (RTX 4060) |
|---|---|---|---|---|---|---|
| Qwen 3.5 9B | 9B | Q4_K_M | ~7.0 GB | 82.5 (Pro) | - | 35-45 tok/s |
| Qwen 3 8B | 8B | Q4_K_M | 6.5 GB | 76.9 | 89.8 | ~40 tok/s |
| Gemma 3 4B QAT | 4B | INT4 | 3.5 GB | 43.6 | 89.2 | 60+ tok/s |
| Phi-4-mini | 3.8B | Q4_K_M | ~2.5 GB | 52.8 (Pro) | 88.6 | 70+ tok/s |
| Llama 3.2 3B | 3B | Q4_K_M | ~2 GB | - | 77.7 | 80+ tok/s |

Winner: Qwen 3.5 9B

The Qwen 3.5 9B scores 82.5 on MMLU-Pro - beating OpenAI’s GPT-OSS-120B (80.8) with a fraction of the parameters. It also hits 81.7 on GPQA Diamond, meaning it handles genuinely difficult questions, not just trivia.

At Q4_K_M it fits in 7GB, leaving about 1GB for context on an 8GB card. That’s tight but workable for normal conversations. If you need longer context or want to run other things alongside it, drop to the Qwen 3.5 4B (79.1 MMLU-Pro, ~3GB VRAM) and enjoy substantially more headroom.
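If you want to sanity-check whether a model fits your card before pulling gigabytes of weights, a rough rule of thumb is that Q4_K_M GGUF files average about 4.85 bits per weight (an approximation; the exact figure varies by architecture). A back-of-envelope sketch:

```python
def q4_vram_estimate_gb(params_b: float, bits_per_weight: float = 4.85) -> float:
    """Rough weight-file size for a Q4_K_M quant, in GiB.

    Covers weights only; KV cache and runtime buffers add more on top.
    The 4.85 bits/weight figure is an approximation for Q4_K_M mixes.
    """
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

for params in (3, 4, 8, 9, 14):
    print(f"{params}B -> ~{q4_vram_estimate_gb(params):.1f} GB of weights")
```

Actual VRAM use lands 1-2 GB higher once the KV cache and runtime buffers load, which is why the 9B shows ~7 GB in the table above rather than the ~5 GB the weights alone would suggest.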

Runner-up: Qwen 3 8B. Slightly lower benchmarks (76.9 MMLU) but battle-tested with a massive Ollama community. If the 3.5 9B gives you any trouble, this is the safe fallback.

Speed pick: Gemma 3 4B QAT. Google’s quantization-aware training means it was trained to be quantized, so quality degradation is minimal. At 3.5GB, you can run it alongside Whisper for voice chat. The MMLU score is lower, but the GSM8K math score (89.2%) shows it reasons better than the headline number suggests.

The honest take: 8GB chat models are good for everyday questions, brainstorming, and simple tasks. They struggle with complex multi-step reasoning, nuanced analysis, and tasks requiring deep world knowledge. If you ask them to plan a trip with detailed logistics, you’ll see the limits.

For all use cases at this level, see our 8GB VRAM complete guide.

12GB VRAM {#12gb}

GPUs: RTX 3060 12GB, RTX 4070

The “comfortable” tier. You can run 8-14B models with room to breathe. This is where local chat goes from “usable” to “actually good.”

Candidates

| Model | Params | Quant | VRAM Used | MMLU | GSM8K | Speed (RTX 4070) |
|---|---|---|---|---|---|---|
| Qwen 3 14B | 14B | Q4_K_M | 10.7 GB | 81.1 | 92.5 | ~50 tok/s |
| DeepSeek-R1 14B | 14B | Q4_K_M | ~9 GB | ~79 | ~90 | ~45 tok/s |
| Qwen 3.5 9B | 9B | Q4_K_M | ~7 GB | 82.5 (Pro) | - | ~55 tok/s |
| Phi-4 14B | 14B | Q4_K_M | ~10 GB | 84.8 | ~91 | 25-35 tok/s |

Winner: Qwen 3 14B

The jump from 8B to 14B is where local models start feeling smart rather than just functional. Qwen 3 14B at Q4_K_M uses about 10.7GB VRAM, leaving 1.3GB for context on a 12GB card. It scores 81.1 MMLU, 92.5 GSM8K, and runs at roughly 50 tok/s on an RTX 4070.

The quality difference from the 8B tier is noticeable in conversations. Longer, more coherent responses. Better handling of follow-up questions. More accurate factual recall.

Runner-up: DeepSeek-R1 14B Distill. If you want to see the model think through problems step-by-step, this is the reasoning specialist. It shows its work - literally prints chain-of-thought traces. Performance: 58.6 tok/s on RTX 4090, around 45 tok/s on RTX 4070 at Q4 quantization (9GB VRAM). The reasoning traces make it slower for simple questions but genuinely useful for debugging logic or working through math.
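The R1 distills wrap those chain-of-thought traces in `<think>…</think>` tags ahead of the final answer. If you are piping output into another tool, you probably want to strip them; a small sketch (the tag format matches what the R1 distills emit, but verify against your build):

```python
import re

# Matches a <think>...</think> block plus trailing whitespace, across newlines.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> traces, keeping only the final answer."""
    return THINK_BLOCK.sub("", text).strip()

raw = "<think>9.11 vs 9.9: compare 0.11 and 0.90...</think>9.9 is larger."
print(strip_reasoning(raw))  # → 9.9 is larger.
```

For interactive chat you will usually want to keep the traces visible; this only matters when the model's answer feeds a script.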

Surprise pick: Qwen 3.5 9B again. On 12GB it runs with plenty of context headroom and hits ~55 tok/s. Its MMLU-Pro score (82.5) actually beats the Qwen 3 14B on the harder Pro variant, despite being smaller. If speed and context length matter more than raw knowledge breadth, the 9B at 12GB is arguably the better experience.

For all use cases at this level, see our 12GB VRAM complete guide.

16GB VRAM {#16gb}

GPUs: RTX 4060 Ti 16GB, RTX 5060, Intel Arc A770, AMD RX 7800 XT

The sweet spot. Every 14B model fits comfortably, and you start reaching into 20B territory. This is also where MoE models begin making sense.

Candidates

| Model | Params | Quant | VRAM Used | MMLU | Key Score | Speed (est.) |
|---|---|---|---|---|---|---|
| GPT-OSS 20B | 20B | Q4_K_M | ~14 GB | ~80 | matches o3-mini | ~140 tok/s |
| Qwen 3 14B | 14B | Q4_K_M | 10.7 GB | 81.1 | GSM8K 92.5 | ~60 tok/s |
| Phi-4 14B | 14B | Q5_K_M | ~12 GB | 84.8 | MATH 80.4 | 25-35 tok/s |
| Gemma 3 12B QAT | 12B | INT4 | 6.6 GB | - | 128K context | ~55 tok/s |

Winner: GPT-OSS 20B

OpenAI’s first open-weight model is a plot twist nobody expected. GPT-OSS 20B delivers results comparable to o3-mini on standard benchmarks while using just 14GB of VRAM through Ollama. The reported 140 tok/s on a 16GB GPU is exceptional - faster than most 8B models on lesser hardware.

If you have 16GB, this is the model to try first. It represents OpenAI’s inference optimization expertise applied to an open-weight model, and the speed advantage is dramatic.

Runner-up: Qwen 3 14B at Q5_K_M. With 16GB you can afford higher quantization (Q5 instead of Q4), which recovers 1-2 points on most benchmarks. The extra quality is noticeable in factual accuracy.

STEM specialist: Phi-4 14B. Microsoft’s model scores 80.4% on MATH (competition-level problems) and 82.6% HumanEval. If your chat needs lean technical, Phi-4 is worth the slower speed. Run it at Q5_K_M for best quality at this VRAM tier.

Context champion: Gemma 3 12B QAT at just 6.6GB VRAM leaves nearly 10GB for context - enough for pasting in entire documents and having a conversation about them.
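To actually use that headroom you have to raise Ollama's context window, which defaults to a few thousand tokens. One way is per-request via the API's `options.num_ctx` field; a sketch of the request body (the model tag is illustrative - check `ollama list` for what you actually pulled):

```python
import json

def generate_payload(model: str, prompt: str, num_ctx: int = 32768) -> bytes:
    """Build a /api/generate request body with an enlarged context window."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context size in tokens
    }).encode()

body = generate_payload("gemma3:12b-it-qat", "Summarize this document: ...")
print(json.loads(body)["options"]["num_ctx"])  # → 32768
```

In the interactive `ollama run` REPL, `/set parameter num_ctx 32768` does the same thing for the current session. Note that a bigger window costs VRAM for the KV cache, so that spare 10GB gets spent quickly.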

For all use cases at this level, see our 16GB VRAM complete guide.

24GB VRAM {#24gb}

GPUs: RTX 3090, RTX 4090

This is where local models start competing with cloud APIs on quality. The 27-32B tier was previously the exclusive domain of enterprise hardware. Now a single consumer GPU handles it.

Candidates

| Model | Params | Quant | VRAM Used | MMLU | Key Score | Speed (RTX 4090) |
|---|---|---|---|---|---|---|
| Qwen 3 30B MoE | 30B (MoE) | Q4_K_M | ~18 GB | - | - | 196 tok/s |
| Qwen 3 32B | 32B | Q4_K_M | 22.2 GB | 83.6 | GPQA 49.5 | 34 tok/s |
| EXAONE 4.0 32B | 32B | Q4_K_M | ~22 GB | 92.3 (Redux) | strong coding | ~30 tok/s |
| Gemma 3 27B QAT | 27B | INT4 | 14.1 GB | 78.6 | multimodal | ~40 tok/s |
| DeepSeek-R1 32B | 32B | Q4_K_M | ~22 GB | ~82 | reasoning chains | ~30 tok/s |

Winner: Qwen 3 30B MoE

The game-changer. This mixture-of-experts model delivers 196 tok/s on an RTX 4090 while fitting in 18GB VRAM. That’s faster than most 8B dense models, with 30B-class quality. MoE architecture means only a fraction of the parameters activate per token, so you get big-model knowledge at small-model speed.

This is the model that makes the “local models are too slow” argument obsolete.
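The arithmetic behind that speed: VRAM scales with total parameters (every expert must stay resident), while per-token compute scales with active parameters. Assuming roughly 3B active out of 30B total (our reading of the "a3b" in the Ollama tag), a quick sketch shows why it decodes like a small model:

```python
total_params_b = 30.0   # all experts stored in VRAM
active_params_b = 3.0   # parameters used per token (assumption from the "a3b" tag)

compute_fraction = active_params_b / total_params_b
print(f"Per-token compute: {compute_fraction:.0%} of a dense 30B")  # → 10%

# A dense model with the same per-token FLOPs would be ~3B parameters,
# which is consistent with the MoE's decode speed beating most 8B dense models.
```

The trade-off is that you pay for the full 30B in memory; MoE buys speed, not VRAM savings.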

Knowledge leader: Qwen 3 32B dense. Higher benchmark scores than the MoE variant (83.6 MMLU, 49.5 GPQA) but at 34 tok/s, it’s roughly 6x slower. Choose this when accuracy matters more than speed - research questions, factual analysis, detailed explanations.

Dark horse: EXAONE 4.0 32B from LG AI Research scores 92.3 on MMLU-Redux, which is competitive with frontier models. It’s newer and less community-tested than Qwen, but the benchmarks are hard to ignore. Worth trying if Qwen 3 32B leaves you wanting more.

Budget option: Gemma 3 27B QAT fits in just 14.1GB at INT4, leaving 10GB free for context or running other models alongside it. The benchmarks are lower than Qwen, but the multimodal capabilities (it handles images too) and generous context headroom make it uniquely versatile at this tier.

For all use cases at this level, see our 24GB VRAM complete guide.

32GB VRAM {#32gb}

GPUs: RTX 5090

The new frontier. 32GB opens up high-quant 32B models, comfortable 70B at aggressive quantization, and MoE monsters. The RTX 5090 delivers 60-80% faster AI performance than the 4090, with roughly 213 tok/s average inference speed.

Candidates

| Model | Params | Quant | VRAM Used | MMLU | Key Score | Speed (RTX 5090) |
|---|---|---|---|---|---|---|
| Qwen 3 30B MoE | 30B (MoE) | Q8_0 | ~28 GB | - | highest quality MoE | ~320 tok/s |
| Qwen 3 32B | 32B | Q6_K | ~28 GB | 83.6 | near-lossless | ~55 tok/s |
| EXAONE 4.0 32B | 32B | Q6_K | ~28 GB | 92.3 (Redux) | frontier-class | ~50 tok/s |
| Llama 3.3 70B | 70B | Q3_K_M | ~32 GB | 83.6 | 80.5 HumanEval | ~20 tok/s |

Winner: Qwen 3 30B MoE at Q8_0

With 32GB, you can run the MoE model at Q8_0 - nearly lossless quantization - while still having headroom. Estimated speed around 320 tok/s on the RTX 5090. That’s cloud-API-fast with zero latency, zero cost per query, and zero data leaving your machine.

Quality ceiling: Qwen 3 32B or EXAONE 4.0 32B at Q6_K quantization. At this level, quantization artifacts are essentially imperceptible. You’re getting the full quality these models can offer. The EXAONE 4.0’s 92.3 MMLU-Redux score puts it in the same conversation as GPT-4-class models.

Reach pick: Llama 3.3 70B at Q3_K_M squeezes into 32GB. The quantization is aggressive and costs 3-5 MMLU points, but you get a 70B model on a single GPU. At ~20 tok/s it’s not fast, but for complex analysis tasks where you want maximum reasoning depth, it’s an option no other consumer GPU can match.

What about Llama 4 Scout? The MoE architecture (17B active from 109B total) technically has the compute requirements of a small model, but the full weight footprint after quantization still needs 48GB+ for reasonable performance. On a single 32GB card with aggressive quantization, expect 2-3 tok/s - not practical for chat.

For all use cases at this level, see our 32GB VRAM complete guide.

Cross-Tier Summary

| Tier | Best Pick | Runner-Up | Value Pick | Speed Champion |
|---|---|---|---|---|
| 8GB | Qwen 3.5 9B | Qwen 3 8B | Gemma 3 4B QAT | Phi-4-mini (70+ tok/s) |
| 12GB | Qwen 3 14B | DeepSeek-R1 14B | Qwen 3.5 9B | Qwen 3.5 9B (~55 tok/s) |
| 16GB | GPT-OSS 20B | Qwen 3 14B (Q5) | Gemma 3 12B QAT | GPT-OSS 20B (~140 tok/s) |
| 24GB | Qwen 3 30B MoE | Qwen 3 32B | Gemma 3 27B QAT | Qwen 3 30B MoE (196 tok/s) |
| 32GB | Qwen 3 30B MoE Q8 | EXAONE 4.0 32B Q6 | Qwen 3 32B Q6 | Qwen 3 30B MoE (~320 tok/s) |

Quick Start

Every model listed above runs through Ollama. Install it, pull a model, and start chatting:

```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pick your model (choose based on your VRAM tier above)
ollama pull qwen3.5:9b          # 8GB tier
ollama pull qwen3:14b           # 12GB tier
ollama pull gpt-oss:20b         # 16GB tier
ollama pull qwen3:30b-a3b       # 24GB tier (MoE)

# Start chatting
ollama run qwen3.5:9b
```

For a ChatGPT-style web interface, pair Ollama with Open WebUI. We have a full setup guide that walks through the process.
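If you’d rather script conversations than use a web UI, Ollama also exposes a `/api/chat` endpoint that accepts a running message history. A minimal multi-turn sketch, assuming the default server on localhost:11434 (stdlib only; the model tag is whatever you pulled):

```python
import json
import urllib.request

OLLAMA_CHAT = "http://localhost:11434/api/chat"

def build_chat_body(model: str, history: list[dict]) -> bytes:
    """Serialize a /api/chat request body from the accumulated history."""
    return json.dumps({"model": model, "messages": history, "stream": False}).encode()

def chat(model: str, history: list[dict], user_msg: str) -> str:
    """Append a user turn, call /api/chat, record and return the reply."""
    history.append({"role": "user", "content": user_msg})
    req = urllib.request.Request(
        OLLAMA_CHAT,
        data=build_chat_body(model, history),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

# history = []
# print(chat("qwen3:14b", history, "RAID 1 vs RAID 10 - what's the difference?"))
# print(chat("qwen3:14b", history, "Which for a 4-drive NAS?"))  # follow-up sees context
```

Because the full history is resent each turn, long conversations eventually hit the context window; trim old turns from `history` when that happens.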

What These Models Can’t Do

Local chat models have real limits compared to frontier cloud models like Claude, GPT-4.5, or Gemini 2.5 Pro:

  • Long, complex instructions - multi-step tasks with many constraints tend to fall apart below 32B parameters
  • Cutting-edge knowledge - these models were trained months ago; they don’t know about last week
  • Creative writing at scale - short-form is fine, novel-length coherence degrades
  • Reliable factual accuracy - they hallucinate, especially on obscure topics. Always verify claims.

The 32B models at high quantization come closest to closing the gap. But for now, local chat is best as a private, instant, free alternative for everyday conversations - not as a replacement for frontier models on hard tasks.