Best Local Models for Coding in 2026: Every VRAM Tier Tested

Which open-weight coding model should you run locally? HumanEval, SWE-bench, and real-world tests from 8GB to 32GB GPUs, with setup instructions for IDE integration.

Code on a computer screen with syntax highlighting

Local AI by VRAM Tier - 8GB | 12GB | 16GB | 24GB | 32GB

Deep dives: Chat | Coding | Translation | Vision | Speech | Agents

GitHub Copilot costs $19/month, sends your code to Microsoft’s servers, and still gets things wrong. Running a coding model locally costs nothing per query, keeps your proprietary code on your machine, and - depending on your GPU - gets surprisingly close in quality.

The question is which model to run. There are two distinct use cases: autocomplete (fill-in-the-middle / FIM, the inline suggestions as you type) and chat (asking questions about code, debugging, generating functions). Some models do both. Some only do one well. We tested them all.
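The distinction shows up at the prompt level. An autocomplete model isn't chatted with - it's fed the code before and after your cursor in a special fill-in-the-middle format and asked to produce what goes between. A minimal sketch using Qwen 2.5 Coder's FIM sentinels (the `<|fim_prefix|>`/`<|fim_suffix|>`/`<|fim_middle|>` tokens are Qwen-specific; other FIM models use different sentinel strings):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt in Qwen 2.5 Coder's format.

    The model generates the text that belongs between prefix and suffix,
    i.e. the completion at the cursor position.
    """
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Cursor sits after "return " - the model fills in the expression.
prompt = build_fim_prompt(
    prefix="def is_even(n: int) -> bool:\n    return ",
    suffix="\n\nprint(is_even(4))\n",
)
print(prompt)
```

Editor plugins like Continue and Tabby build these prompts for you; the point is that a model without FIM training simply can't play this game, no matter how good it is at chat.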

The Benchmarks That Matter

  • HumanEval - generate a Python function from a docstring. The classic test, though most frontier models now score 90%+
  • SWE-bench Verified - fix a real GitHub issue autonomously. The closest thing to actual software engineering. This is the benchmark that separates toys from tools
  • LiveCodeBench - competitive programming problems. Tests algorithmic reasoning under pressure
  • Aider - code repair benchmark. Can the model fix broken code based on error messages?
  • FIM support - does the model support fill-in-the-middle for autocomplete? Not all chat models do
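The HumanEval and LiveCodeBench percentages are pass@1 scores: sample n completions per problem, count how many pass the unit tests, and estimate the probability that at least one of k random draws passes. A sketch of the standard unbiased estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them passed the tests."""
    if n - c < k:  # every possible size-k draw contains a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples, 50 passing: pass@1 reduces to the raw success rate.
print(pass_at_k(200, 50, 1))  # 0.25
```

So an "88.4% HumanEval" model passes roughly 88 of HumanEval's 164 problems on the first try.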

8GB VRAM {#8gb}

GPUs: RTX 4060, RTX 3060 8GB, RTX 3070, GTX 1080

Candidates

| Model | Params | HumanEval | LiveCodeBench | FIM | VRAM (Q4) |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 7B | 7B | 88.4% | 37.6% | Yes | ~5 GB |
| Qwen 3.5 9B | 9B | - | 65.6% | No | 6.6 GB |
| DeepSeek Coder V2 Lite | 16B (2.4B active) | 81.1% | - | Yes | ~5 GB |
| Qwen 3.5 4B | 4B | - | 55.8% | No | 3.4 GB |
| StarCoder2 3B | 3B | ~34% | - | Yes | ~2 GB |

For autocomplete: Qwen 2.5 Coder 7B

The Qwen 2.5 Coder 7B at 88.4% HumanEval beats CodeStral-22B and DeepSeek Coder 33B despite being a fraction of the size. It supports FIM natively, handles 128K context (enough to include large files), and covers 92+ programming languages. At ~5GB VRAM, it leaves 3GB free on an 8GB card.

This is your autocomplete model. Plug it into Continue or Tabby and you get inline suggestions that rival Copilot for common patterns.

For chat: Qwen 3.5 9B

If you need to ask questions about code, debug errors, or generate entire functions from descriptions, the Qwen 3.5 9B is the better pick. Its 65.6% LiveCodeBench score shows stronger algorithmic reasoning than the Coder 7B. No FIM support, but for chat-based coding it doesn’t matter.

Budget option: DeepSeek Coder V2 Lite is a hidden gem - an MoE model with 16B total parameters but only 2.4B active. At ~5GB VRAM it fits easily on 8GB, supports FIM, and scores 81.1% HumanEval. The trade-off is a smaller ecosystem, with fewer quantizations and less community support than Qwen.

The honest take: 8GB coding models handle autocompletion well but struggle with complex refactoring, multi-file changes, and understanding large codebases. Use them for the 80% of coding that’s routine, and reach for a cloud model for the hard stuff.
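The VRAM figures in the table follow from simple arithmetic: a Q4 quantization stores roughly half a byte per parameter, plus overhead for the KV cache and runtime buffers. A back-of-envelope sketch - the 0.56 bytes/param and 1 GB overhead constants are our rough assumptions, and real usage grows with context length:

```python
def q4_vram_gb(params_billion: float,
               bytes_per_param: float = 0.56,  # ~Q4_K_M average (assumption)
               overhead_gb: float = 1.0) -> float:
    """Rough VRAM footprint of a Q4 model, before long-context KV cache."""
    return params_billion * bytes_per_param + overhead_gb

for params, card_gb in [(7, 8), (14, 12), (32, 24)]:
    need = q4_vram_gb(params)
    print(f"{params}B needs ~{need:.1f} GB -> fits a {card_gb} GB card: {need < card_gb}")
```

This is why a 7B model "leaves 3GB free" on an 8GB card - and why that free space matters once you start pasting whole files into the context.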

For all use cases at this level, see our 8GB VRAM complete guide.

12GB VRAM {#12gb}

GPUs: RTX 3060 12GB, RTX 4070

Candidates

| Model | Params | HumanEval | LiveCodeBench | FIM | VRAM (Q4) |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 14B | 14B | ~89% | - | Yes | ~9 GB |
| Codestral 22B | 22B | 81% | - | Yes | ~11 GB |
| Qwen 3 14B | 14B | 72.2% | - | No | 10.7 GB |
| Phi-4 14B | 14B | 82.6% | - | No | ~10 GB |

Winner: Qwen 2.5 Coder 14B

At ~89% HumanEval with FIM support and 128K context, the Qwen 2.5 Coder 14B is the clear pick for this tier. It fits in 9GB, leaving 3GB for context on a 12GB card. This model handles multi-language projects, understands framework-specific patterns (React, Django, Rails), and generates tests alongside implementations.

Autocomplete alternative: Codestral 22B from Mistral. At Q4 quantization it squeezes into ~11GB, which is tight on 12GB but workable. It scores 81% HumanEval and was purpose-built for code with FIM support, 32K context, and strong multi-language performance (73.75% on Kotlin-HumanEval). The wider language coverage makes it worth trying if you work outside Python/JS.

Reasoning upgrade: Phi-4 14B. Microsoft’s model scores 82.6% HumanEval and 80.4% MATH, making it the best choice when your coding problems involve mathematical or algorithmic reasoning. No FIM support - this is a chat-only coding model.

For all use cases at this level, see our 12GB VRAM complete guide.

16GB VRAM {#16gb}

GPUs: RTX 4060 Ti 16GB, RTX 5060, Intel Arc A770, AMD RX 7800 XT

Candidates

| Model | Params | HumanEval | SWE-bench | FIM | VRAM (Q4) |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 14B | 14B | ~89% | - | Yes | ~9 GB |
| Qwen 3.5 9B | 9B | - | - | No | 6.6 GB |
| GPT-OSS 20B | 20B | - | - | No | ~14 GB |
| Codestral 22B | 22B | 81% | - | Yes | ~11 GB |

Winner: Qwen 2.5 Coder 14B (with headroom)

Same model as the 12GB tier, but now with 7GB of headroom for context. That extra room matters for coding - you can paste in entire files, error logs, and test outputs alongside your question. The model stays fully in VRAM with room to spare.

Dual model setup: With 16GB, you can realistically run two smaller models. Use Qwen 2.5 Coder 7B (~5GB) for autocomplete and Qwen 3.5 9B (6.6GB) for chat. Together they use about 12GB, leaving 4GB for context. This gives you the best of both worlds - fast inline completions and intelligent code chat.

Alternative: GPT-OSS 20B at ~14GB. While not specifically a coding model, OpenAI’s open-weight release matches o3-mini on benchmarks and its general reasoning makes it a strong code chat companion. The 140 tok/s speed on 16GB GPUs means near-instant responses.

For all use cases at this level, see our 16GB VRAM complete guide.

24GB VRAM {#24gb}

GPUs: RTX 3090, RTX 4090

This tier transforms local coding. You get access to models that genuinely compete with cloud coding assistants.

Candidates

| Model | Params | HumanEval | SWE-bench | FIM | VRAM (Q4) |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 32B | 32B | 92.7% | - | Yes | ~20 GB |
| Qwen 3.5 27B | 27B | - | 72.4% | No | ~16 GB |
| Qwen3-Coder 30B MoE | 30B (3B active) | - | 50.3% | No | ~18 GB |
| Qwen 3 32B | 32B | 72.1% | - | No | 22.2 GB |

For autocomplete: Qwen 2.5 Coder 32B

92.7% HumanEval. That’s not a typo. The 32B Coder model with FIM support, 128K context, and 92+ language coverage at ~20GB VRAM is the best local autocomplete model available at any price. It also scores 73.7% on Aider (code repair), meaning it can fix its own mistakes when pointed at error messages.

This model is why people with RTX 4090s stopped paying for Copilot.

For agentic coding: Qwen 3.5 27B

If you’re using AI for more than autocomplete - running it through Aider, OpenHands, or similar tools - the Qwen 3.5 27B’s 72.4% SWE-bench Verified (tying GPT-5 mini) means it can actually resolve real GitHub issues autonomously. At ~16GB Q4, it fits with 8GB of context headroom on a 24GB card.

Speed pick: Qwen3-Coder 30B MoE activates only 3B parameters per token, so it runs fast while delivering 30B-class coding capability. The SWE-bench score (50.3%) is lower than the dense 27B, but the speed advantage makes it ideal for interactive coding sessions where you want rapid responses.

For all use cases at this level, see our 24GB VRAM complete guide.

32GB VRAM {#32gb}

GPUs: RTX 5090

Candidates

| Model | Params | HumanEval | SWE-bench | FIM | VRAM (Q4/Q6) |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 32B | 32B | 92.7% | - | Yes | ~28 GB (Q6) |
| Qwen 3.5 27B | 27B | - | 72.4% | No | ~22 GB (Q6) |
| Qwen3-Coder 30B MoE | 30B (3B active) | - | 50.3% | No | ~25 GB (Q8) |

Winner: Qwen 2.5 Coder 32B at Q6_K

Same model as the 24GB tier, but now at Q6_K quantization (~28GB). The quality bump from Q4 to Q6 is measurable - about 1-2 points on HumanEval and noticeably fewer hallucinated function names and incorrect API calls. With the RTX 5090’s throughput, this is the closest thing to a local Copilot replacement.
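The Q4-to-Q6 difference is mostly bytes per parameter. A sketch of where the ~20GB vs ~28GB figures come from - the bytes-per-param constants are rough averages for llama.cpp-style K-quants (our assumption), and weights alone exclude KV cache and runtime overhead:

```python
BYTES_PER_PARAM = {"Q4_K_M": 0.56, "Q6_K": 0.82, "Q8_0": 1.06}  # rough averages (assumption)

def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate in-VRAM size of the weights alone at a given quantization."""
    return params_billion * BYTES_PER_PARAM[quant]

for quant in ("Q4_K_M", "Q6_K"):
    print(f"32B at {quant}: ~{weights_gb(32, quant):.0f} GB of weights")
```

Add a couple of gigabytes for KV cache and buffers and you land near the table's ~20GB (Q4) and ~28GB (Q6) totals.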

Dual setup: The tempting pairing - Qwen 2.5 Coder 32B (Q4, ~20GB) for autocomplete plus Qwen 3.5 27B (Q4, ~16GB) for chat - adds up to 36GB and won't fit. Instead, pair Qwen 2.5 Coder 14B (~9GB) for autocomplete with Qwen 3.5 27B (~16GB) for chat. That's 25GB total, leaving 7GB for context. Best of both worlds at near-top quality.

The stretch: Qwen3-Coder-Next 80B MoE scores 64.6% on SWE-rebench Pass@5 - the #1 score on the entire leaderboard, beating Claude and GPT-5. But it needs ~38GB+ VRAM, which doesn’t fit on a single 32GB card. If you’re willing to do CPU offloading and accept 5-10 tok/s, it’s technically runnable. Not recommended for interactive coding.

For all use cases at this level, see our 32GB VRAM complete guide.

Cross-Tier Summary

| Tier | Autocomplete Pick | Chat/Agent Pick | HumanEval | SWE-bench |
|---|---|---|---|---|
| 8GB | Qwen 2.5 Coder 7B | Qwen 3.5 9B | 88.4% | - |
| 12GB | Qwen 2.5 Coder 14B | Phi-4 14B | ~89% | - |
| 16GB | Qwen 2.5 Coder 14B | GPT-OSS 20B | ~89% | - |
| 24GB | Qwen 2.5 Coder 32B | Qwen 3.5 27B | 92.7% | 72.4% |
| 32GB | Qwen 2.5 Coder 32B (Q6) | Qwen 3.5 27B (Q6) | 92.7% | 72.4% |

IDE Integration

Autocomplete with Continue (VS Code / JetBrains)

# Pull your autocomplete model
ollama pull qwen2.5-coder:7b    # 8GB tier
ollama pull qwen2.5-coder:14b   # 12-16GB tier
ollama pull qwen2.5-coder:32b   # 24GB+ tier

Install Continue in VS Code or JetBrains, point it at your local Ollama server (http://localhost:11434), and enable tab autocomplete. The experience is nearly identical to Copilot.
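Continue reads its model list from a config file. A minimal sketch of the JSON shape - field names follow Continue's `config.json` schema at the time of writing (newer Continue versions use `config.yaml`, so check their docs), and the `qwen3.5:9b` chat tag is our assumed Ollama name; verify yours with `ollama list`:

```json
{
  "models": [
    {
      "title": "Qwen 3.5 9B (chat)",
      "provider": "ollama",
      "model": "qwen3.5:9b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen 2.5 Coder 7B (autocomplete)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}
```

Splitting the two roles like this is what makes the dual-model setups from the 16GB and 32GB tiers practical: a small FIM model answers keystrokes, a larger chat model answers questions.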

We have a detailed setup guide: Self-Host Code Completion with Continue + Ollama.

Autocomplete with Tabby

Tabby is purpose-built for self-hosted code completion. It handles model management, indexing your codebase for context-aware suggestions, and has plugins for VS Code, JetBrains, and Vim.

Full setup walkthrough: Self-Host Tabby AI Code Completion.

Chat with Aider

For terminal-based AI coding that edits files directly:

pip install aider-chat
aider --model ollama/qwen3.5:27b

Aider sends your files to the local model, gets proposed changes, and applies them with git integration. The Qwen 3.5 27B at 72.4% SWE-bench means it resolves real issues, not just simple completions.

What Local Coding Models Can’t Do (Yet)

  • Full repository understanding - they see what you paste in, not your entire codebase. Tabby’s indexing helps, but it’s not the same as Copilot Workspace or Cursor
  • Complex multi-file refactoring - models below 27B struggle to maintain consistency across many files
  • Reliable test generation - they generate tests, but the tests often don’t compile or test the wrong things. Always review
  • Unfamiliar frameworks - training data has a long tail. Obscure libraries get hallucinated APIs

The 24GB+ tier with Qwen 2.5 Coder 32B and Qwen 3.5 27B genuinely threatens paid coding assistants for most workflows. Below that, local models are best as a supplement to your own expertise rather than a replacement for it.