GitHub Copilot costs $10 a month for Pro, $39 for Pro+, and $19 per seat for business teams. As you type, the code around your cursor gets sent to Microsoft’s servers: snippets, variable names, half-finished functions in your proprietary codebase, all of it processed on infrastructure you don’t control. And starting June 2026, GitHub is switching to usage-based billing, meaning your costs could climb even higher depending on how much you use it.
There’s a free alternative that keeps everything on your machine. Continue.dev is an open-source AI code assistant for VS Code and JetBrains that connects to Ollama running locally. You get tab completions, a chat sidebar, and inline edits — all running on your own hardware, completely offline, with zero data leaving your machine.
This guide gets you from nothing to working AI code completion in about 20 minutes.
What You Need
Minimum hardware:
- 8 GB RAM (for the 1.5B autocomplete model alone)
- Any modern CPU (Apple Silicon, Intel 12th gen+, AMD Ryzen 5000+)
Recommended hardware:
- 16 GB RAM or a GPU with 8+ GB VRAM
- This lets you run a larger chat model alongside the autocomplete model
Software:
- VS Code or a JetBrains IDE
- macOS, Linux, or Windows
No GPU required. The small autocomplete models run fine on CPU, though a GPU makes responses noticeably faster.
Step 1: Install Ollama
Ollama is the engine that runs AI models locally. Installation is one command.
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com.
Start the Ollama service if it isn’t already running:
ollama serve
On macOS and Windows, the Ollama desktop app starts the service automatically. On Linux, the install script sets up a systemd service.
Verify it’s running:
curl http://localhost:11434
You should see “Ollama is running.”
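If you would rather script this check (say, in a project setup script), here is a minimal Python sketch of the same test. The function name `ollama_is_running` is made up for this example; it only assumes the default address used above and the "Ollama is running" text the server returns. It deliberately skips any system proxy, since the server is local.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the server at base_url answers the way Ollama does."""
    # Empty ProxyHandler: talk to the local server directly, ignoring proxy env vars.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat all as "not running".
        return False
    return "Ollama is running" in body
```

Calling `ollama_is_running()` with no arguments checks the default port; pass a different `base_url` if you moved Ollama elsewhere.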
Step 2: Pull the Models
You need two models: a small, fast one for tab completions and a larger one for chat and edits.
For autocomplete — Qwen2.5-Coder 1.5B is the sweet spot. It’s trained specifically on code, runs fast enough for real-time completions, and needs only about 1.5 GB of memory:
ollama pull qwen2.5-coder:1.5b
For chat and edits — Qwen2.5-Coder 7B gives you a much more capable model for explaining code, writing functions from descriptions, and refactoring. Needs about 4.5 GB:
ollama pull qwen2.5-coder:7b
If you have 16+ GB of RAM or a 12 GB GPU, you can run both simultaneously. Otherwise the models swap in and out as needed — you’ll just notice a brief pause when switching between autocomplete and chat.
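The arithmetic behind that rule of thumb is simple enough to sketch. The model sizes below are the figures quoted in this guide; the 4 GB of headroom for the OS, your editor, and the models' context buffers is an assumption, not a measured number, so adjust it for your machine.

```python
def fits_together(model_sizes_gb, available_gb, headroom_gb=4.0):
    """True if all models plus headroom fit in the available memory."""
    return sum(model_sizes_gb) + headroom_gb <= available_gb


AUTOCOMPLETE_GB = 1.5  # qwen2.5-coder:1.5b, figure from this guide
CHAT_GB = 4.5          # qwen2.5-coder:7b, figure from this guide

for ram_gb in (8, 16):
    ok = fits_together([AUTOCOMPLETE_GB, CHAT_GB], ram_gb)
    print(f"{ram_gb} GB: {'both stay loaded' if ok else 'models swap in and out'}")
```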
Optional upgrade: If your hardware can handle it, qwen2.5-coder:14b or deepseek-coder-v2:16b provide noticeably better results for chat and complex edits. DeepSeek Coder V2 uses a mixture-of-experts architecture, so despite having 16B total parameters, only about 2.4B are active per token. That keeps generation fast, although the full set of weights still has to fit in memory.
Step 3: Install Continue
Open VS Code and install the Continue extension:
- Press Ctrl+Shift+X (or Cmd+Shift+X on macOS) to open Extensions
- Search for “Continue”
- Install “Continue - Codestral, GPT-4o, Claude, Gemini, Llama, etc.”
- Restart VS Code
You’ll see a new Continue icon in the sidebar — a square with rounded corners.
For JetBrains, search for “Continue” in Settings → Plugins → Marketplace.
Step 4: Configure Continue for Local Models
Click the gear icon in the Continue sidebar to open your configuration. Continue uses a config.yaml file. Replace the default contents with:
name: Local Copilot
version: 0.0.1
schema: v1
models:
  - name: Qwen2.5-Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
  - name: Qwen2.5-Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    apiBase: http://localhost:11434
    roles:
      - autocomplete
    autocompleteOptions:
      debounceDelay: 250
      maxPromptTokens: 1024
      multilineCompletions: auto
Save the file. Continue will pick up the changes immediately.
What These Settings Mean
- debounceDelay: 250 — waits 250ms after you stop typing before requesting a completion. Prevents hammering your CPU/GPU on every keystroke.
- maxPromptTokens: 1024 — limits how much surrounding code gets sent as context. Higher values give better suggestions but slow things down.
- multilineCompletions: auto — lets the model decide whether to suggest a single line or a multi-line block based on context.
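To see what debouncing buys you, here is a small simulation. It is not Continue's implementation, just a sketch of the general technique: each keystroke schedules a request `debounce_delay_ms` later, and any keystroke that lands before that deadline cancels it. The function name and the sample timestamps are invented for illustration.

```python
def requests_fired(keystroke_times_ms, debounce_delay_ms=250):
    """Return the times (ms) at which a completion request would be sent."""
    fired = []
    times = sorted(keystroke_times_ms)
    for i, t in enumerate(times):
        deadline = t + debounce_delay_ms
        is_last = i == len(times) - 1
        # The request survives only if no later keystroke arrives before the deadline.
        if is_last or times[i + 1] >= deadline:
            fired.append(deadline)
    return fired


typing = [0, 100, 200, 600]        # three quick keystrokes, a pause, one more
print(requests_fired(typing))      # [450, 850]
print(requests_fired(typing, 50))  # [50, 150, 250, 650]
```

With the 250ms delay, four keystrokes produce two requests; drop the delay to 50ms and every keystroke triggers one.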
Step 5: Test It
Open any code file and start typing. After a brief pause, you should see ghost text suggestions appear — just like Copilot. Press Tab to accept.
Try these to verify everything works:
- Tab completion: Type a function signature and pause. You should see the body suggested.
- Chat: Press Ctrl+L to open the chat sidebar. Ask “explain this function” with code selected.
- Inline edit: Select code and press Ctrl+I. Type “add error handling” or “convert to TypeScript.”
If completions aren’t appearing, check that Ollama is running (curl http://localhost:11434) and that you pulled both models.
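For the second check, Ollama's HTTP API lists the models you have pulled at `/api/tags`. The sketch below compares that list against the two models this guide uses. The helper names `missing_models` and `fetch_tags` are made up for this example, and `fetch_tags` assumes Ollama is on its default port.

```python
import json
import urllib.request

REQUIRED = ["qwen2.5-coder:1.5b", "qwen2.5-coder:7b"]


def missing_models(tags_response, required=REQUIRED):
    """Return the required model tags absent from an /api/tags response dict."""
    installed = {m["name"] for m in tags_response.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the list of locally pulled models (needs Ollama running)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

Running `print(missing_models(fetch_tags()))` prints an empty list when both models are present; anything it does print still needs an `ollama pull`.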
Tuning for Your Hardware
Low-end machines (8 GB RAM, no GPU)
Stick with the 1.5B model for everything:
models:
  - name: Qwen2.5-Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - chat
      - edit
      - autocomplete
    autocompleteOptions:
      debounceDelay: 350
      maxPromptTokens: 512
Increase debounceDelay to 350ms and reduce maxPromptTokens to 512 to keep things responsive. Completions will be less context-aware but still useful.
High-end machines (32+ GB RAM or 12+ GB VRAM)
Use a 14B or 32B model for chat and keep the 1.5B for autocomplete. The 32B download is about 20 GB, so on a 12 GB GPU pick the 14B instead:
models:
  - name: Qwen2.5-Coder 32B
    provider: ollama
    model: qwen2.5-coder:32b
    roles:
      - chat
      - edit
  - name: Qwen2.5-Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete
    autocompleteOptions:
      debounceDelay: 200
      maxPromptTokens: 2048
The 32B model scores 88.4% on HumanEval — beating GPT-4’s 87.1% — and runs well on Apple Silicon Macs with 32 GB unified memory.
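The tuning advice in this section boils down to a lookup on how much memory you have. Here is a rough sketch of that mapping. The thresholds are the tiers described in this guide, not hard limits, and the function (a hypothetical `pick_models`) looks only at RAM and ignores VRAM to keep things simple.

```python
def pick_models(ram_gb):
    """Map available memory to the model tiers described in this guide."""
    autocomplete = "qwen2.5-coder:1.5b"
    if ram_gb >= 32:
        chat = "qwen2.5-coder:32b"
    elif ram_gb >= 16:
        chat = "qwen2.5-coder:7b"
    else:
        chat = autocomplete  # low-end: one small model covers every role
    return {"chat": chat, "edit": chat, "autocomplete": autocomplete}


for ram_gb in (8, 16, 32):
    print(ram_gb, pick_models(ram_gb))
```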
How It Compares to Copilot
After a week of using this setup alongside Copilot on real projects, the differences become clear:
Where local wins:
- Privacy: zero code leaves your machine
- Cost: $0/month after hardware you already own
- Speed: on a decent GPU, completions often appear faster than Copilot because there’s no network round-trip
- No outages: works offline, on planes, behind corporate firewalls
- No content filtering: the model suggests whatever it thinks is correct
Where Copilot still has an edge:
- Multi-file context: Copilot’s larger cloud models consider more of your codebase
- Training data freshness: cloud models update faster with new libraries and APIs
- Zero setup: sign in and go
The gap is narrower than it was a year ago. The Qwen2.5-Coder models were trained on 5.5 trillion tokens of code-heavy data, and the 32B model is reported to match GPT-4o on several coding benchmarks. For day-to-day autocomplete (finishing function bodies, generating boilerplate, suggesting variable names) the local models are good enough that you’ll stop noticing the difference.
Going Further
Add Tabby for team use. If you need a shared, self-hosted code completion server for multiple developers, Tabby runs as a Docker container and serves completions over HTTP. One GPU server can handle 15-25 concurrent users.
Index your codebase. Continue supports codebase indexing for context-aware answers. Add to your config:
context:
  - provider: codebase
    params:
      nRetrieve: 25
      nFinal: 5
This embeds your project files and retrieves relevant snippets when you ask a question in chat, bringing some of Copilot’s multi-file awareness to your local setup.
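The two parameters describe a two-stage search: fetch nRetrieve candidates by embedding similarity, then rerank them and keep the best nFinal. The sketch below illustrates that general pattern with toy two-dimensional vectors. It is not Continue's code, and `rerank_score` stands in for whatever reranker you would actually use.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, snippets, rerank_score, n_retrieve=25, n_final=5):
    """Two-stage retrieval over a list of (text, embedding) pairs."""
    # Stage 1: keep the n_retrieve snippets closest to the query embedding.
    by_similarity = sorted(snippets, key=lambda s: cosine(query_vec, s[1]), reverse=True)
    candidates = by_similarity[:n_retrieve]
    # Stage 2: rescore only those candidates and keep the best n_final.
    reranked = sorted(candidates, key=lambda s: rerank_score(s[0]), reverse=True)
    return [text for text, _ in reranked[:n_final]]


snippets = [
    ("def parse_config", [1.0, 0.0]),
    ("def load_yaml", [0.9, 0.1]),
    ("def render_html", [0.0, 1.0]),
]
prefers_yaml = lambda text: 1.0 if "yaml" in text else 0.0  # hypothetical reranker
print(retrieve([1.0, 0.0], snippets, prefers_yaml, n_retrieve=2, n_final=1))
```

A wide first stage keeps recall high; the narrow second stage keeps the prompt short, which matters for small local models.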
Try different models. The open-weight model space moves fast. DeepSeek V4-Pro hit 80.6% on SWE-bench Verified, making it the current top open-source coding model. Swap models in your config without changing anything else — that’s the advantage of a modular setup.
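Before swapping a new model into your config, it can help to time it on a prompt you care about. This sketch talks to Ollama's `/api/generate` endpoint directly with streaming turned off. The helper names are made up for this example, and `time_model` needs Ollama running with the model already pulled.

```python
import json
import time
import urllib.request


def build_generate_request(model, prompt, base_url="http://localhost:11434"):
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def time_model(model, prompt, base_url="http://localhost:11434"):
    """Return (seconds, response_text) for one prompt."""
    request = build_generate_request(model, prompt, base_url)
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=300) as resp:
        text = json.load(resp)["response"]
    return time.perf_counter() - start, text
```

Run `time_model` once per candidate model with the same prompt. The first call includes the time to load the model into memory, so a second run gives a fairer number.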
What You Can Do
The whole setup takes three installs (Ollama, the two models, and the Continue extension) and one config file. If Copilot’s subscription ever bothered you, or if you work with code you can’t send to Microsoft’s servers, this is the practical alternative. The models are good, the tooling is mature, and the cost is zero.