Local AI by VRAM Tier - 8GB | 12GB | 16GB | 24GB | 32GB
GitHub Copilot costs $10-19 per month depending on plan, sends your code to Microsoft's servers, and still gets things wrong. Running a coding model locally costs nothing per query, keeps your proprietary code on your machine, and - depending on your GPU - gets surprisingly close in quality.
The question is which model to run. There are two distinct use cases: autocomplete (fill-in-the-middle / FIM, the inline suggestions as you type) and chat (asking questions about code, debugging, generating functions). Some models do both. Some only do one well. We tested them all.
The Benchmarks That Matter
- HumanEval - generate a Python function from a docstring. The classic test, though most frontier models now score 90%+
- SWE-bench Verified - fix a real GitHub issue autonomously. The closest thing to actual software engineering. This is the benchmark that separates toys from tools
- LiveCodeBench - competitive programming problems. Tests algorithmic reasoning under pressure
- Aider - code repair benchmark. Can the model fix broken code based on error messages?
- FIM support - does the model support fill-in-the-middle for autocomplete? Not all chat models do
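FIM matters because autocomplete hands the model the code on both sides of the cursor and asks it to fill the gap. A minimal sketch of how an editor plugin assembles such a prompt, using the sentinel tokens from Qwen 2.5 Coder's published FIM format (StarCoder2 and Codestral use different token names, so check each model card):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt.

    Token names follow Qwen 2.5 Coder's FIM format; other FIM-capable
    models define their own sentinel tokens.
    """
    return (
        f"<|fim_prefix|>{prefix}"
        f"<|fim_suffix|>{suffix}"
        f"<|fim_middle|>"
    )

# Code before the cursor...
prefix = "def add(a, b):\n    return "
# ...and after it; the model generates the middle.
suffix = "\n\nprint(add(1, 2))\n"
prompt = build_fim_prompt(prefix, suffix)
```

Tools like Continue and Tabby build these prompts for you; the sketch just shows why a chat-only model without FIM training can't slot into the autocomplete role.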
8GB VRAM {#8gb}
GPUs: RTX 4060, RTX 3060 8GB, RTX 3070, GTX 1080
Candidates
| Model | Params | HumanEval | LiveCodeBench | FIM | VRAM (Q4) |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 7B | 7B | 88.4% | 37.6% | Yes | ~5 GB |
| Qwen 3.5 9B | 9B | - | 65.6% | No | 6.6 GB |
| DeepSeek Coder V2 Lite | 16B (2.4B active) | 81.1% | - | Yes | ~5 GB |
| Qwen 3.5 4B | 4B | - | 55.8% | No | 3.4 GB |
| StarCoder2 3B | 3B | ~34% | - | Yes | ~2 GB |
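The VRAM column can be sanity-checked with back-of-envelope arithmetic: quantized weights take roughly params × bits-per-weight / 8 bytes, with KV cache and runtime buffers on top. A rough sketch - the bits-per-weight values are approximate llama.cpp figures, and real GGUF files mix quant types per tensor:

```python
# Approximate bits per weight for common llama.cpp quant formats.
# Ballpark figures only; actual file sizes vary by quant recipe.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}

def weight_gb(params_billion: float, quant: str) -> float:
    """Estimated size of the model weights alone, in GB."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

# 7B at Q4 -> roughly 4.2 GB of weights; the ~5 GB quoted in the
# table adds KV cache and runtime overhead on top of this.
print(round(weight_gb(7, "Q4_K_M"), 1))
```

The same arithmetic explains the higher tiers: 32B at Q4 is ~19 GB of weights, and at Q6 it climbs to ~26 GB, which is why Q6 only becomes practical at 32GB VRAM.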
For autocomplete: Qwen 2.5 Coder 7B
The Qwen 2.5 Coder 7B at 88.4% HumanEval beats Codestral 22B and DeepSeek Coder 33B despite being a fraction of the size. It supports FIM natively, handles 128K context (enough to include large files), and covers 92+ programming languages. At ~5GB VRAM, it leaves 3GB free on an 8GB card.
This is your autocomplete model. Plug it into Continue or Tabby and you get inline suggestions that rival Copilot for common patterns.
For chat: Qwen 3.5 9B
If you need to ask questions about code, debug errors, or generate entire functions from descriptions, the Qwen 3.5 9B is the better pick. Its 65.6% LiveCodeBench score shows stronger algorithmic reasoning than the Coder 7B. No FIM support, but for chat-based coding it doesn’t matter.
Budget option: DeepSeek Coder V2 Lite is a hidden gem - an MoE model with 16B total parameters but only 2.4B active. At ~5GB VRAM it fits easily on 8GB, supports FIM, and scores 81.1% HumanEval. The trade-off is thinner community support and tooling coverage than Qwen.
The honest take: 8GB coding models handle autocompletion well but struggle with complex refactoring, multi-file changes, and understanding large codebases. Use them for the 80% of coding that’s routine, and reach for a cloud model for the hard stuff.
For all use cases at this level, see our 8GB VRAM complete guide.
12GB VRAM {#12gb}
GPUs: RTX 3060 12GB, RTX 4070
Candidates
| Model | Params | HumanEval | LiveCodeBench | FIM | VRAM (Q4) |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 14B | 14B | ~89% | - | Yes | ~9 GB |
| Codestral 22B | 22B | 81% | - | Yes | ~11 GB |
| Qwen 3 14B | 14B | 72.2% | - | No | 10.7 GB |
| Phi-4 14B | 14B | 82.6% | - | No | ~10 GB |
Winner: Qwen 2.5 Coder 14B
At ~89% HumanEval with FIM support and 128K context, the Qwen 2.5 Coder 14B is the clear pick for this tier. It fits in 9GB, leaving 3GB for context on a 12GB card. This model handles multi-language projects, understands framework-specific patterns (React, Django, Rails), and generates tests alongside implementations.
Autocomplete alternative: Codestral 22B from Mistral. At Q4 quantization it squeezes into ~11GB, which is tight on 12GB but workable. It scores 81% HumanEval and was purpose-built for code with FIM support, 32K context, and strong multi-language performance (73.75% on Kotlin-HumanEval). The wider language coverage makes it worth trying if you work outside Python/JS.
Reasoning upgrade: Phi-4 14B. Microsoft’s model scores 82.6% HumanEval and 80.4% MATH, making it the best choice when your coding problems involve mathematical or algorithmic reasoning. No FIM support - this is a chat-only coding model.
For all use cases at this level, see our 12GB VRAM complete guide.
16GB VRAM {#16gb}
GPUs: RTX 4060 Ti 16GB, RTX 5060, Intel Arc A770, AMD RX 7800 XT
Candidates
| Model | Params | HumanEval | SWE-bench | FIM | VRAM (Q4) |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 14B | 14B | ~89% | - | Yes | ~9 GB |
| Qwen 3.5 9B | 9B | - | - | No | 6.6 GB |
| GPT-OSS 20B | 20B | - | - | No | ~14 GB |
| Codestral 22B | 22B | 81% | - | Yes | ~11 GB |
Winner: Qwen 2.5 Coder 14B (with headroom)
Same model as the 12GB tier, but now with 7GB of headroom for context. That extra room matters for coding - you can paste in entire files, error logs, and test outputs alongside your question. The model stays fully in VRAM with room to spare.
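How much context that headroom buys can be estimated: with grouped-query attention, the KV cache costs 2 × layers × kv_heads × head_dim × bytes-per-value per token. A sketch with illustrative dimensions - the layer and head counts below are assumptions for a 14B-class GQA model, not exact Qwen 2.5 figures:

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_val: int = 2) -> float:
    """KV cache size in GB: keys + values for every layer and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
    return tokens * per_token / 1e9

# Illustrative 14B-class GQA dims (assumed; check the model config):
# 48 layers, 8 KV heads, head_dim 128, fp16 cache.
print(round(kv_cache_gb(32_000, 48, 8, 128), 2))  # prints 6.29
```

Under those assumed dimensions, a 32K-token context costs about 6.3 GB - which is roughly what the 7GB of headroom on a 16GB card accommodates, and why the same model feels cramped on 12GB.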
Dual model setup: With 16GB, you can realistically run two smaller models. Use Qwen 2.5 Coder 7B (~5GB) for autocomplete and Qwen 3.5 9B (6.6GB) for chat. Together they use about 12GB, leaving 4GB for context. This gives you the best of both worlds - fast inline completions and intelligent code chat.
Alternative: GPT-OSS 20B at ~14GB. While not specifically a coding model, OpenAI’s open-weight release matches o3-mini on benchmarks and its general reasoning makes it a strong code chat companion. The 140 tok/s speed on 16GB GPUs means near-instant responses.
For all use cases at this level, see our 16GB VRAM complete guide.
24GB VRAM {#24gb}
GPUs: RTX 3090, RTX 4090
This tier transforms local coding. You get access to models that genuinely compete with cloud coding assistants.
Candidates
| Model | Params | HumanEval | SWE-bench | FIM | VRAM (Q4) |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 32B | 32B | 92.7% | - | Yes | ~20 GB |
| Qwen 3.5 27B | 27B | - | 72.4% | No | ~16 GB |
| Qwen3-Coder 30B MoE | 30B (3B active) | - | 50.3% | No | ~18 GB |
| Qwen 3 32B | 32B | 72.1% | - | No | 22.2 GB |
For autocomplete: Qwen 2.5 Coder 32B
92.7% HumanEval. That’s not a typo. The 32B Coder model with FIM support, 128K context, and 92+ language coverage at ~20GB VRAM is the best local autocomplete model available at any price. It also scores 73.7% on Aider (code repair), meaning it can fix its own mistakes when pointed at error messages.
This model is why people with RTX 4090s stopped paying for Copilot.
For agentic coding: Qwen 3.5 27B
If you’re using AI for more than autocomplete - running it through Aider, OpenHands, or similar tools - the Qwen 3.5 27B’s 72.4% SWE-bench Verified (tying GPT-5 mini) means it can actually resolve real GitHub issues autonomously. At ~16GB Q4, it fits with 8GB of context headroom on a 24GB card.
Speed pick: Qwen3-Coder 30B MoE activates only 3B parameters per token, so it runs fast while delivering 30B-class coding capability. The SWE-bench score (50.3%) is lower than the dense 27B, but the speed advantage makes it ideal for interactive coding sessions where you want rapid responses.
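Why a 30B MoE with 3B active parameters is fast: at batch size 1, decode speed is bounded by memory bandwidth divided by the bytes read per token, and an MoE only streams its active experts' weights. A back-of-envelope sketch - the 1000 GB/s figure is roughly RTX 4090-class bandwidth, and ~0.6 bytes/param reflects Q4 weights:

```python
def decode_tok_per_s(active_params_b: float, bandwidth_gb_s: float,
                     bytes_per_param: float = 0.6) -> float:
    """Upper-bound decode speed: each token must stream the active
    weights from VRAM once, so speed <= bandwidth / bytes-per-token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Dense 32B vs. 30B MoE with 3B active, both Q4, ~1000 GB/s VRAM:
dense = decode_tok_per_s(32, 1000)  # ~52 tok/s ceiling
moe = decode_tok_per_s(3, 1000)     # ~556 tok/s ceiling
```

These are ceilings, not measurements - real throughput is lower - but the order-of-magnitude gap is why the MoE wins for interactive sessions despite the lower SWE-bench score.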
For all use cases at this level, see our 24GB VRAM complete guide.
32GB VRAM {#32gb}
GPUs: RTX 5090
Candidates
| Model | Params | HumanEval | SWE-bench | FIM | VRAM (Q4/Q6) |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 32B | 32B | 92.7% | - | Yes | ~28 GB (Q6) |
| Qwen 3.5 27B | 27B | - | 72.4% | No | ~22 GB (Q6) |
| Qwen3-Coder 30B MoE | 30B (3B active) | - | 50.3% | No | ~25 GB (Q8) |
Winner: Qwen 2.5 Coder 32B at Q6_K
Same model as the 24GB tier, but now at Q6_K quantization (~28GB). The quality bump from Q4 to Q6 is measurable - about 1-2 points on HumanEval and noticeably fewer hallucinated function names and incorrect API calls. With the RTX 5090’s throughput, this is the closest thing to a local Copilot replacement.
Dual setup: Qwen 2.5 Coder 32B (Q4, ~20GB) alongside Qwen 3.5 27B (Q4, ~16GB) would need 36GB, which doesn't fit. Instead, pair Qwen 2.5 Coder 14B (~9GB) for autocomplete with Qwen 3.5 27B (~16GB) for chat. That's 25GB total, leaving 7GB for context. Best of both worlds at the highest quality.
The stretch: Qwen3-Coder-Next 80B MoE scores 64.6% on SWE-rebench Pass@5 - the #1 score on the entire leaderboard, beating Claude and GPT-5. But it needs ~38GB+ VRAM, which doesn’t fit on a single 32GB card. If you’re willing to do CPU offloading and accept 5-10 tok/s, it’s technically runnable. Not recommended for interactive coding.
For all use cases at this level, see our 32GB VRAM complete guide.
Cross-Tier Summary
| Tier | Autocomplete Pick | Chat/Agent Pick | HumanEval | SWE-bench |
|---|---|---|---|---|
| 8GB | Qwen 2.5 Coder 7B | Qwen 3.5 9B | 88.4% | - |
| 12GB | Qwen 2.5 Coder 14B | Phi-4 14B | ~89% | - |
| 16GB | Qwen 2.5 Coder 14B | GPT-OSS 20B | ~89% | - |
| 24GB | Qwen 2.5 Coder 32B | Qwen 3.5 27B | 92.7% | 72.4% |
| 32GB | Qwen 2.5 Coder 32B (Q6) | Qwen 3.5 27B (Q6) | 92.7% | 72.4% |
IDE Integration
Autocomplete with Continue (VS Code / JetBrains)
# Pull your autocomplete model
ollama pull qwen2.5-coder:7b # 8GB tier
ollama pull qwen2.5-coder:14b # 12-16GB tier
ollama pull qwen2.5-coder:32b # 24GB+ tier
Install Continue in VS Code or JetBrains, point it at your local Ollama server (http://localhost:11434), and enable tab autocomplete. The experience is nearly identical to Copilot.
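A minimal Continue `config.json` sketch for the autocomplete model - field names can shift between Continue releases (newer versions also accept a YAML config), so treat this as a starting point rather than a canonical config:

```json
{
  "tabAutocompleteModel": {
    "title": "Qwen 2.5 Coder 7B",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}
```

Swap the `model` string for `qwen2.5-coder:14b` or `:32b` to match your tier; Continue reaches Ollama on http://localhost:11434 by default.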
We have a detailed setup guide: Self-Host Code Completion with Continue + Ollama.
Autocomplete with Tabby
Tabby is purpose-built for self-hosted code completion. It handles model management, indexing your codebase for context-aware suggestions, and has plugins for VS Code, JetBrains, and Vim.
Full setup walkthrough: Self-Host Tabby AI Code Completion.
Chat with Aider
For terminal-based AI coding that edits files directly:
pip install aider-chat
# set OLLAMA_API_BASE if Ollama isn't on the default http://localhost:11434
aider --model ollama/qwen3.5:27b
Aider sends your files to the local model, gets proposed changes, and applies them with git integration. The Qwen 3.5 27B at 72.4% SWE-bench means it resolves real issues, not just simple completions.
What Local Coding Models Can’t Do (Yet)
- Full repository understanding - they see what you paste in, not your entire codebase. Tabby’s indexing helps, but it’s not the same as Copilot Workspace or Cursor
- Complex multi-file refactoring - models below 27B struggle to maintain consistency across many files
- Reliable test generation - they generate tests, but the tests often don’t compile or test the wrong things. Always review
- Unfamiliar frameworks - training data has a long tail. Obscure libraries get hallucinated APIs
The 24GB+ tier with Qwen 2.5 Coder 32B and Qwen 3.5 27B genuinely threatens paid coding assistants for most workflows. Below that, local models are best as a supplement to your own expertise rather than a replacement for it.