Ditch GitHub Copilot: build your own AI coding assistant with Continue, Ollama, and Qwen Coder

A step-by-step guide to running a fully local, private AI code completion setup in VS Code that costs nothing and sends zero data to the cloud.

GitHub Copilot costs $10 a month for individuals and $19 per user for business plans. Your code context is sent to Microsoft's servers, and on the individual plan your snippets can be used for product improvement unless you find and trust the opt-out toggle. If that trade-off bothers you, there is now a stack that gives you inline code completion, chat, and multi-file editing entirely on your own hardware, for free. It is called Continue + Ollama, and you can set it up in about 30 minutes.

This guide walks through the whole process: installing Ollama, pulling the right models, configuring Continue in VS Code, and tuning it until autocomplete feels snappy. Everything runs locally. Nothing phones home.

Why this stack

Continue.dev is an open-source VS Code and JetBrains extension with over 20,000 GitHub stars. It plugs into any OpenAI-compatible backend, which means you can point it at Ollama and get both tab autocomplete and a chat sidebar without any cloud dependency. It supports model-switching per task — small fast model for completions, bigger model for chat — and the config is a single YAML file.

Ollama handles the model runtime. It downloads quantized models, manages GPU offloading, and exposes a local API on port 11434. It runs on macOS, Linux, and Windows, and it supports NVIDIA CUDA, AMD ROCm, and Apple Silicon’s unified memory natively.
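
To make "OpenAI-compatible" concrete: once Ollama is running (Step 1) and a model is pulled (Step 2), any client that speaks the OpenAI chat API can talk to it on port 11434. A quick sketch, assuming the qwen2.5-coder:1.5b model pulled later in this guide:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:1.5b",
    "messages": [{"role": "user", "content": "Write a Python one-liner that squares a number."}]
  }'

Continue talks to the same local server through its ollama provider; either way, nothing leaves localhost.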

The combination gives you something that actually feels like Copilot — ghost-text suggestions as you type, a chat panel for questions, and inline editing — without the subscription or the data leaving your machine.

What you need

Here are the realistic hardware requirements. Do not let anyone tell you a 7B model runs great on 8 GB of RAM — it will run, but slowly.

Setup | GPU VRAM / Unified Memory | System RAM | What you get
Minimum | 8 GB | 16 GB | 7B chat + 0.5B autocomplete. Usable, with occasional pauses
Comfortable | 16 GB | 32 GB | 7B chat + 1.5B autocomplete. Smooth daily driver
Ideal | 24 GB+ | 32 GB+ | 32B chat + 7B autocomplete. Near-cloud quality

Concrete GPU examples: an RTX 3060 12 GB handles 7B models well. An RTX 4070 Ti Super (16 GB) or M2 Pro (16 GB unified) handles the comfortable tier. An RTX 4090 (24 GB) or M3 Max (36 GB+) handles the ideal tier.

You also need VS Code (or a fork like VSCodium) and roughly 10–25 GB of disk space for models, depending on which tier you pick.

Step 1: Install Ollama

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.ai/install.sh | sh

Windows: Download the installer from ollama.com.

Start the service:

ollama serve

On macOS and Windows the desktop app starts the server automatically; if you used the Homebrew formula instead of the app, run ollama serve yourself or let brew services start ollama manage it. On Linux, the install script registers a systemd service for you. Verify it is working:

curl http://localhost:11434

You should see “Ollama is running.”
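
If you went the Linux route, the usual systemctl commands cover the common cases (assuming the default service name, ollama, that the install script creates):

# start now and on every boot
sudo systemctl enable --now ollama
# confirm it is active
systemctl status ollama
# tail the logs if something looks off
journalctl -u ollama -f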

Step 2: Pull your models

You need two models: a larger one for chat and editing, and a small fast one for autocomplete. The Qwen 2.5 Coder family is currently the strongest open-weight option for code-specific tasks, consistently outperforming alternatives on HumanEval and fill-in-middle benchmarks at equivalent sizes.

For the comfortable tier (16 GB VRAM):

ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:1.5b

For the ideal tier (24 GB+ VRAM):

ollama pull qwen2.5-coder:32b
ollama pull qwen2.5-coder:7b

For the minimum tier (8 GB VRAM):

ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:0.5b

Use exact tags. Running ollama pull qwen2.5-coder without a size tag gives you the default, which may not be what you want. Verify your models are ready:

ollama list
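
If you want to sanity-check a model before wiring it into the editor, run a one-off prompt from the terminal; passing the prompt as an argument makes ollama run print the answer and exit instead of dropping into an interactive session:

ollama run qwen2.5-coder:1.5b "Write a Python function that reverses a string."

If that produces reasonable code, the model side of the stack is working.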

What about Qwen 3? Qwen 3 has better reasoning and a larger context window (256K vs 128K), but benchmarks show Qwen 2.5 Coder still leads on pure code generation tasks like fill-in-middle completion, which is what autocomplete uses. Qwen 3 is a solid choice for the chat model if you want better explanations, but for the autocomplete slot, stick with Qwen 2.5 Coder.

Step 3: Install Continue

Open VS Code and install Continue from the extension marketplace. Search for “Continue” by Continue.dev, or open Quick Open (Ctrl+P / Cmd+P) and run:

ext install Continue.continue

After installation, you will see a Continue icon in the sidebar. Click it and it will create a config directory at ~/.continue/.

Step 4: Configure Continue

Continue uses a config.yaml file. Open it from the Continue sidebar (gear icon) or edit it directly at ~/.continue/config.yaml.

Here is a working config for the comfortable tier:

name: Local Copilot
version: 0.0.1
schema: v1
models:
  - name: Qwen 2.5 Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
      - apply
  - name: Qwen 2.5 Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete

For the ideal tier, swap the chat model:

models:
  - name: Qwen 2.5 Coder 32B
    provider: ollama
    model: qwen2.5-coder:32b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
      - apply
  - name: Qwen 2.5 Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    roles:
      - autocomplete

Save the file. Continue should detect the changes and connect to Ollama automatically.

Step 5: Tune autocomplete

The default autocomplete settings work, but a few tweaks make a noticeable difference. Add this to your config:

tabAutocompleteOptions:
  debounceDelay: 300
  maxPromptTokens: 1024
  multilineCompletions: always
  onlyMyCode: true

debounceDelay controls how long Continue waits after you stop typing before triggering a completion. 300 ms is a good balance — low enough to feel responsive, high enough to avoid thrashing the GPU on every character. If your hardware is fast, drop it to 200.

maxPromptTokens limits how much context gets sent to the model. 1024 tokens is enough for the surrounding function. Raising it gives more context but slows inference on smaller models.

multilineCompletions: always tells Continue to suggest full blocks of code, not just single-line completions.

onlyMyCode: true prevents Continue from sending library and node_modules code as context.

Step 6: Test it

Open any code file and start typing. You should see ghost-text suggestions appear after a brief pause. Press Tab to accept.

Open the Continue chat panel (Ctrl+L or Cmd+L) and ask it a question about your code. It should respond using the larger model.

If nothing appears:

  1. Check Ollama is running: ollama ps
  2. Check the models are pulled: ollama list
  3. Check Continue’s output panel in VS Code for error messages
  4. Make sure apiBase points to http://localhost:11434
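
A quick way to tell whether a problem lives on the Ollama side or the Continue side is to call the completion API directly; if this returns JSON with a response field, Ollama is fine and the issue is in the extension config (the model tag below assumes the comfortable-tier setup):

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:1.5b",
  "prompt": "def fibonacci(n):",
  "stream": false
}'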

Performance expectations

On an Apple M2 Pro or RTX 3060 with the 1.5B autocomplete model, expect single-line completions in under 350 ms — below the threshold where most people notice a lag. Multi-line completions from the 7B model take 500–800 ms, which is comparable to Copilot’s latency over the network.

The chat model is where you feel the hardware gap. A 7B model on 8 GB VRAM generates about 30–45 tokens per second. A 32B model on 24 GB VRAM gets 15–25 tokens per second. Both are fast enough for interactive use but not instant.

If you want better chat answers at roughly the same speed, consider using Qwen 3 8B instead of Qwen 2.5 Coder 7B for the chat role: it reasons better in explanations, at the cost of slightly weaker raw code completion.
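
If you go that route, pull the model first (ollama pull qwen3:8b, assuming Ollama's current tag for it) and swap only the chat-side entry in config.yaml, leaving autocomplete on Qwen 2.5 Coder:

models:
  - name: Qwen 3 8B
    provider: ollama
    model: qwen3:8b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
      - apply
  - name: Qwen 2.5 Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete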

Beyond the basics

Once the core setup works, a few additions are worth the effort:

Context providers. Continue supports adding your codebase as context for chat. Add a @codebase context provider in the config and it will index your project for retrieval-augmented answers. This is where a larger chat model pays off.
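
A sketch of what that can look like in config.yaml, alongside the model entries from earlier. Codebase retrieval needs an embeddings model to build the index; here I assume Ollama's nomic-embed-text (pull it with ollama pull nomic-embed-text), and since Continue's YAML keys can shift between releases, check its config reference if the extension rejects the block:

models:
  # ...existing chat and autocomplete entries...
  - name: Nomic Embed
    provider: ollama
    model: nomic-embed-text
    roles:
      - embed
context:
  - provider: codebase

Then type @codebase in the chat input and Continue will pull indexed snippets from your project into the prompt.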

Multiple model profiles. You can define several configs and switch between them. One for heavy reasoning work with a 32B model, one for quick completions with a 7B model, one that points at a cloud API for when you need frontier-level quality.
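
As an illustration of that last option, a hosted model can sit in the same models list as the local ones, and you pick between them from the model dropdown in the chat panel. A sketch, assuming an OpenAI key (provider, model, and apiKey are the relevant fields; keep the key out of version control):

models:
  # ...local Qwen entries from above...
  - name: GPT-4o (cloud fallback)
    provider: openai
    model: gpt-4o
    apiKey: <your-openai-api-key>
    roles:
      - chat
      - edit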

Tabby as an alternative. If you want a more opinionated, team-friendly setup, Tabby is worth looking at. It is a self-contained server that indexes your codebase, supports multiple users with LDAP auth, and ships as a single Docker container. The trade-off is less flexibility in model choice.
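
If you want to kick its tires, Tabby's quick-start is a single docker run. The flags and model names below follow the project's published example and change between releases, so treat this as a sketch and check the Tabby docs before copying it:

docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby serve --model StarCoder-1B --device cuda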

What you give up

Honesty matters, so here is what a local setup cannot match today:

  • Training data breadth. Copilot’s models are trained on a vast amount of public GitHub code. Your local 7B model knows less. It will miss obscure APIs and niche frameworks more often.
  • Complex, multi-file reasoning. A 32B local model is good. GPT-4-class cloud models are still better at multi-step reasoning across files.
  • Zero setup. Copilot is one click. This is 30 minutes, plus occasional model updates.

What you gain: complete privacy, no subscription costs, offline operation, and the knowledge that your proprietary code stays on your machine. For most day-to-day coding — writing functions, filling in boilerplate, explaining errors, generating tests — a well-configured local setup covers 80–90% of what Copilot does.

What you can do right now

  1. Install Ollama and pull qwen2.5-coder:7b and qwen2.5-coder:1.5b
  2. Install Continue in VS Code
  3. Paste the config above into ~/.continue/config.yaml
  4. Start coding

The whole thing takes about 30 minutes. If you have been on the fence about paying for Copilot — or uncomfortable with where your code goes — this is the weekend project that actually delivers.