GitHub Copilot costs $10 a month for individuals and $19 per user for business. Every completion request sends your surrounding code to Microsoft’s servers, and on the individual plan your prompts and suggestions can be used for product improvement unless you find the opt-out toggle and trust that it works. If that trade-off bothers you, there is now a stack that gives you inline code completion, chat, and multi-file editing entirely on your own hardware, for free. It is called Continue + Ollama, and you can set it up in about 30 minutes.
This guide walks through the whole process: installing Ollama, pulling the right models, configuring Continue in VS Code, and tuning it until autocomplete feels snappy. Everything runs locally. Nothing phones home.
Why this stack
Continue.dev is an open-source VS Code and JetBrains extension with over 20,000 GitHub stars. It plugs into any OpenAI-compatible backend, which means you can point it at Ollama and get both tab autocomplete and a chat sidebar without any cloud dependency. It supports model-switching per task — small fast model for completions, bigger model for chat — and the config is a single YAML file.
Ollama handles the model runtime. It downloads quantized models, manages GPU offloading, and exposes a local API on port 11434. It runs on macOS, Linux, and Windows, and it supports NVIDIA CUDA, AMD ROCm, and Apple Silicon’s unified memory natively.
The combination gives you something that actually feels like Copilot — ghost-text suggestions as you type, a chat panel for questions, and inline editing — without the subscription or the data leaving your machine.
What you need
Here are the realistic hardware requirements. Do not let anyone tell you a 7B model runs great on 8 GB of RAM — it will run, but slowly.
| Setup | GPU VRAM / Unified Memory | System RAM | What you get |
|---|---|---|---|
| Minimum | 8 GB | 16 GB | 7B chat + 1.5B autocomplete. Usable, occasional pauses |
| Comfortable | 16 GB | 32 GB | 14B chat + 7B autocomplete. Smooth daily driver |
| Ideal | 24 GB+ | 32 GB+ | 32B chat + 7B autocomplete. Near-cloud quality |
Concrete GPU examples: an RTX 3060 12 GB handles 7B models well. An RTX 4070 Ti Super (16 GB) or M2 Pro (16 GB unified) handles the comfortable tier. An RTX 4090 (24 GB) or M3 Max (36 GB+) handles the ideal tier.
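You can sanity-check the table above with a back-of-envelope estimate: a quantized model needs roughly parameters × bits-per-weight / 8 bytes, plus headroom for the KV cache and runtime buffers. A rough sketch (the 20% overhead factor is an assumption for illustration, not an Ollama figure; Ollama's default quantization is around 4-bit):

```python
def approx_vram_gb(params_billions: float, bits_per_weight: int = 4,
                   overhead: float = 1.2) -> float:
    """Back-of-envelope VRAM estimate for a quantized model.

    params_billions: model size in billions of parameters
    bits_per_weight: quantization level (~4-bit is the common default)
    overhead: assumed fudge factor for KV cache and runtime buffers
    """
    return params_billions * bits_per_weight / 8 * overhead

# 7B at 4-bit: ~4.2 GB, fits an 8 GB card with room for context
print(round(approx_vram_gb(7), 1))
# 32B at 4-bit: ~19.2 GB, which is why the ideal tier wants 24 GB
print(round(approx_vram_gb(32), 1))
```

The estimate explains the tiers: a 7B model leaves several gigabytes free on a 12 GB card for context and the autocomplete model, while 32B only fits comfortably at 24 GB.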
You also need VS Code (or a fork like VSCodium) and about 10–20 GB of disk space for models.
Step 1: Install Ollama
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com.
Start the service:
ollama serve
On macOS and Windows the desktop app starts this automatically. On Linux you may want to run it as a systemd service. Verify it is working:
curl http://localhost:11434
You should see “Ollama is running.”
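If you are curious what that local API looks like, the /api/generate endpoint streams newline-delimited JSON: one chunk per line with a "response" fragment, ending in a record with "done": true. A minimal sketch of reassembling such a stream (the captured lines below are illustrative; the actual tokens and extra fields will differ on your machine):

```python
import json

# Illustrative sample of Ollama's streaming /api/generate output
# (shape only; real responses include more fields, e.g. timestamps).
sample_stream = """\
{"model":"qwen2.5-coder:1.5b","response":"def","done":false}
{"model":"qwen2.5-coder:1.5b","response":" add","done":false}
{"model":"qwen2.5-coder:1.5b","response":"(a, b):","done":false}
{"model":"qwen2.5-coder:1.5b","response":"","done":true}
"""

def join_stream(ndjson_text: str) -> str:
    """Concatenate the 'response' fields of a streamed generation."""
    return "".join(json.loads(line)["response"]
                   for line in ndjson_text.splitlines() if line.strip())

print(join_stream(sample_stream))  # def add(a, b):
```

Continue does exactly this kind of stream handling for you; you never need to touch the API directly, but it helps when debugging to know what is flowing over port 11434.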
Step 2: Pull your models
You need two models: a larger one for chat and editing, and a small fast one for autocomplete. The Qwen 2.5 Coder family is currently the strongest open-weight option for code-specific tasks, consistently outperforming alternatives on HumanEval and fill-in-middle benchmarks at equivalent sizes.
For the comfortable tier (16 GB VRAM):
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:1.5b
For the ideal tier (24 GB+ VRAM):
ollama pull qwen2.5-coder:32b
ollama pull qwen2.5-coder:7b
For the minimum tier (8 GB VRAM):
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:0.5b
Use exact tags. Running `ollama pull qwen2.5-coder` without a size tag gives you the default, which may not be what you want. Verify your models are ready:
ollama list
What about Qwen 3? Qwen 3 has better reasoning and a larger context window (256K vs 128K), but benchmarks show Qwen 2.5 Coder still leads on pure code generation tasks like fill-in-middle completion, which is what autocomplete uses. Qwen 3 is a solid choice for the chat model if you want better explanations, but for the autocomplete slot, stick with Qwen 2.5 Coder.
Step 3: Install Continue
Open VS Code and install Continue from the extension marketplace. Search for “Continue” by Continue.dev, or open Quick Open (Ctrl+P) and run:
ext install Continue.continue
After installation, you will see a Continue icon in the sidebar. Click it and it will create a config directory at ~/.continue/.
Step 4: Configure Continue
Continue uses a `config.yaml` file. Open it from the Continue sidebar (gear icon) or edit it directly at `~/.continue/config.yaml`.
Here is a working config for the comfortable tier:
name: Local Copilot
version: 0.0.1
schema: v1

models:
  - name: Qwen 2.5 Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
      - apply

  - name: Qwen 2.5 Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete
For the ideal tier, swap the chat model:
models:
  - name: Qwen 2.5 Coder 32B
    provider: ollama
    model: qwen2.5-coder:32b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
      - apply

  - name: Qwen 2.5 Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    roles:
      - autocomplete
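For the minimum tier, the same pattern applies with the smaller models pulled in Step 2 (this simply mirrors the configs above with different tags; adjust the display names to taste):

```yaml
models:
  - name: Qwen 2.5 Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
      - apply

  - name: Qwen 2.5 Coder 0.5B
    provider: ollama
    model: qwen2.5-coder:0.5b
    roles:
      - autocomplete
```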
Save the file. Continue should detect the changes and connect to Ollama automatically.
Step 5: Tune autocomplete
The default autocomplete settings work, but a few tweaks make a noticeable difference. Add this to your config:
tabAutocompleteOptions:
  debounceDelay: 300
  maxPromptTokens: 1024
  multilineCompletions: always
  onlyMyCode: true
`debounceDelay` controls how long Continue waits after you stop typing before triggering a completion. 300 ms is a good balance: low enough to feel responsive, high enough to avoid thrashing the GPU on every character. If your hardware is fast, drop it to 200.
`maxPromptTokens` limits how much context gets sent to the model. 1024 tokens is enough for the surrounding function. Raising it gives more context but slows inference on smaller models.
`multilineCompletions: always` tells Continue to suggest full blocks of code, not just single-line completions.
`onlyMyCode: true` prevents Continue from sending library and node_modules code as context.
Step 6: Test it
Open any code file and start typing. You should see ghost-text suggestions appear after a brief pause. Press Tab to accept.
Open the Continue chat panel (Ctrl+L or Cmd+L) and ask it a question about your code. It should respond using the larger model.
If nothing appears:
- Check Ollama is running: `ollama ps`
- Check the models are pulled: `ollama list`
- Check Continue’s output panel in VS Code for error messages
- Make sure `apiBase` points to `http://localhost:11434`
Performance expectations
On an Apple M2 Pro or RTX 3060 with the 1.5B autocomplete model, expect single-line completions in under 350 ms — below the threshold where most people notice a lag. Multi-line completions from the 7B model take 500–800 ms, which is comparable to Copilot’s latency over the network.
The chat model is where you feel the hardware gap. A 7B model on 8 GB VRAM generates about 30–45 tokens per second. A 32B model on 24 GB VRAM gets 15–25 tokens per second. Both are fast enough for interactive use but not instant.
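To translate tokens per second into wall-clock feel, divide the reply length by the generation rate. The 300-token reply length below is an assumption chosen for illustration; real answers vary widely:

```python
def reply_seconds(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream a reply at a given generation rate."""
    return tokens / tokens_per_second

# A ~300-token answer at the rates quoted above:
print(reply_seconds(300, 40))  # 7B at 40 tok/s -> 7.5 seconds
print(reply_seconds(300, 20))  # 32B at 20 tok/s -> 15.0 seconds
```

Because the output streams token by token, even the 15-second case feels responsive: you start reading the answer almost immediately.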
If you want faster chat, consider using Qwen 3 8B instead of Qwen 2.5 Coder 7B for the chat role — it is faster at generation while giving better reasoning in explanations, just slightly worse at raw code completion.
Beyond the basics
Once the core setup works, a few additions are worth the effort:
Context providers. Continue supports adding your codebase as context for chat. Add a @codebase context provider in the config and it will index your project for retrieval-augmented answers. This is where a larger chat model pays off.
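In `config.yaml` that looks something like the fragment below. This is a sketch based on Continue's context-provider config; provider names and schema details change between releases, so check the current Continue docs before copying it:

```yaml
context:
  - provider: codebase
  - provider: file
  - provider: code
```

With `codebase` enabled, typing `@codebase` in the chat panel pulls indexed snippets from your project into the prompt.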
Multiple model profiles. You can define several configs and switch between them. One for heavy reasoning work with a 32B model, one for quick completions with a 7B model, one that points at a cloud API for when you need frontier-level quality.
Tabby as an alternative. If you want a more opinionated, team-friendly setup, Tabby is worth looking at. It is a self-contained server that indexes your codebase, supports multiple users with LDAP auth, and ships as a single Docker container. The trade-off is less flexibility in model choice.
What you give up
Honesty matters, so here is what a local setup cannot match today:
- Training data breadth. Copilot’s models are trained on all of GitHub. Your local 7B model knows less. It will miss obscure APIs and niche frameworks more often.
- Speed on complex completions. A 32B local model is good. GPT-4-class cloud models are still better at multi-step reasoning across files.
- Zero setup. Copilot is one click. This is 30 minutes, plus occasional model updates.
What you gain: complete privacy, no subscription costs, offline operation, and the knowledge that your proprietary code stays on your machine. For most day-to-day coding — writing functions, filling in boilerplate, explaining errors, generating tests — a well-configured local setup covers 80–90% of what Copilot does.
What you can do right now
- Install Ollama and pull `qwen2.5-coder:7b` and `qwen2.5-coder:1.5b`
- Install Continue in VS Code
- Paste the config above into `~/.continue/config.yaml`
- Start coding
The whole thing takes about 30 minutes. If you have been on the fence about paying for Copilot — or uncomfortable with where your code goes — this is the weekend project that actually delivers.