Self-Host Your Own AI Code Completion With Continue and Ollama

GitHub Copilot costs $10 a month for Pro, $39 for Pro+, and $19 per seat for business teams. As you type, the code around your cursor gets sent to Microsoft’s servers: snippets, variable names, and half-finished functions from your proprietary codebase, all processed on infrastructure you don’t control. And starting June 2026, GitHub is switching to usage-based billing, meaning your costs could climb even higher depending on how much you use it.

There’s a free alternative that keeps everything on your machine. Continue.dev is an open-source AI code assistant for VS Code and JetBrains that connects to Ollama running locally. You get tab completions, a chat sidebar, and inline edits — all running on your own hardware, completely offline, with zero data leaving your machine.

This guide gets you from nothing to working AI code completion in about 20 minutes.

What You Need

Minimum hardware:

8 GB RAM (for the 1.5B autocomplete model alone)
Any modern CPU (Apple Silicon, Intel 12th gen+, AMD Ryzen 5000+)

Recommended hardware:

16 GB RAM or a GPU with 8+ GB VRAM
This lets you run a larger chat model alongside the autocomplete model

Software:

VS Code or a JetBrains IDE
macOS, Linux, or Windows

No GPU required. The small autocomplete models run fine on CPU, though a GPU makes responses noticeably faster.

Step 1: Install Ollama

Ollama is the engine that runs AI models locally. Installation is one command.

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com.

If the service isn’t already running, start it manually:

ollama serve

On macOS and Windows, the Ollama desktop app starts the service automatically. On Linux, the install script sets up a systemd service.

Verify it’s running:

curl http://localhost:11434

You should see “Ollama is running.”
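If you would rather script that check (for example in a project setup script), here is a minimal Python sketch using only the standard library. The function names are made up for illustration; the URL and the expected response text are the ones shown above.

```python
import urllib.error
import urllib.request


def looks_like_ollama(body: str) -> bool:
    """Return True if the response body matches Ollama's root greeting."""
    return "Ollama is running" in body


def is_ollama_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if a server at base_url answers like Ollama does."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: all mean "not reachable".
        return False
    return looks_like_ollama(body)
```

Calling is_ollama_running() with no arguments checks the default local port and returns False instead of raising when nothing is listening.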

Step 2: Pull the Models

You need two models: a small, fast one for tab completions and a larger one for chat and edits.

For autocomplete — Qwen2.5-Coder 1.5B is the sweet spot. It’s trained specifically on code, runs fast enough for real-time completions, and needs only about 1.5 GB of memory:

ollama pull qwen2.5-coder:1.5b

For chat and edits — Qwen2.5-Coder 7B gives you a much more capable model for explaining code, writing functions from descriptions, and refactoring. Needs about 4.5 GB:

ollama pull qwen2.5-coder:7b

If you have 16+ GB of RAM or a 12 GB GPU, you can run both simultaneously. Otherwise the models swap in and out as needed — you’ll just notice a brief pause when switching between autocomplete and chat.
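As a rough sanity check on that arithmetic, the sketch below adds up the approximate footprints quoted above and compares them with the memory you have free. The 2 GB headroom for context buffers is an assumption of this example, not a number from Ollama, and real usage varies with context length and quantization.

```python
from typing import Iterable

# Approximate footprints in GB, taken from the figures quoted in this guide.
MODEL_MEMORY_GB = {
    "qwen2.5-coder:1.5b": 1.5,
    "qwen2.5-coder:7b": 4.5,
}


def fits_together(models: Iterable[str], free_gb: float,
                  headroom_gb: float = 2.0) -> bool:
    """Return True if all models plus some headroom fit in free_gb of memory."""
    needed = sum(MODEL_MEMORY_GB[name] for name in models)
    # headroom_gb is an assumed allowance for context buffers and overhead.
    return needed + headroom_gb <= free_gb
```

With both models listed, 12 GB of free memory passes the check and 6 GB does not, which is when you would expect the swap pause described above.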

Optional upgrade: If your hardware can handle it, qwen2.5-coder:14b or deepseek-coder-v2:16b provide noticeably better results for chat and complex edits. DeepSeek Coder V2 uses a mixture-of-experts architecture: of its 16B total parameters, only about 2.4B are active per token, so it generates text at roughly the speed of a much smaller model. All 16B weights still have to fit in memory, so budget RAM or VRAM for the full model.

Step 3: Install Continue

Open VS Code and install the Continue extension:

Press Ctrl+Shift+X (or Cmd+Shift+X on macOS) to open Extensions
Search for “Continue”
Install “Continue - Codestral, GPT-4o, Claude, Gemini, Llama, etc.”
Restart VS Code

You’ll see a new Continue icon in the sidebar — a square with rounded corners.

For JetBrains, search for “Continue” in Settings → Plugins → Marketplace.

Step 4: Configure Continue for Local Models

Click the gear icon in the Continue sidebar to open your configuration. Continue uses a config.yaml file. Replace the default contents with:

name: Local Copilot
version: 0.0.1
schema: v1
models:
  - name: Qwen2.5-Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit

  - name: Qwen2.5-Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    apiBase: http://localhost:11434
    roles:
      - autocomplete
    autocompleteOptions:
      debounceDelay: 250
      maxPromptTokens: 1024
      multilineCompletions: auto

Save the file. Continue will pick up the changes immediately.

What These Settings Mean

debounceDelay: 250 — waits 250ms after you stop typing before requesting a completion. Prevents hammering your CPU/GPU on every keystroke.
maxPromptTokens: 1024 — limits how much surrounding code gets sent as context. Higher values give better suggestions but slow things down.
multilineCompletions: auto — lets the model decide whether to suggest a single line or a multi-line block based on context.
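To make the debounce behavior concrete, here is a small Python simulation. It illustrates the general debounce idea and is not Continue's actual implementation: given the times of your keystrokes, it works out when a completion request would go out.

```python
from typing import List


def debounced_request_times(keystrokes_ms: List[int],
                            delay_ms: int = 250) -> List[int]:
    """Given sorted keystroke timestamps, return when requests would fire."""
    fire_times = []
    for i, t in enumerate(keystrokes_ms):
        is_last = i == len(keystrokes_ms) - 1
        # A request fires only if the next keystroke comes at least
        # delay_ms later (or there is no next keystroke at all).
        if is_last or keystrokes_ms[i + 1] - t >= delay_ms:
            fire_times.append(t + delay_ms)
    return fire_times


# Three quick keystrokes, a pause, then one more keystroke.
print(debounced_request_times([0, 100, 200, 600]))  # [450, 850]
```

Four keystrokes produce only two requests here. Raising the delay trades a little latency for fewer wasted requests, which is why the low-end profile later in this guide uses 350ms.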

Step 5: Test It

Open any code file and start typing. After a brief pause, you should see ghost text suggestions appear — just like Copilot. Press Tab to accept.

Try these to verify everything works:

Tab completion: Type a function signature and pause. You should see the body suggested.
Chat: Press Ctrl+L (Cmd+L on macOS) to open the chat sidebar. Ask “explain this function” with code selected.
Inline edit: Select code and press Ctrl+I (Cmd+I on macOS). Type “add error handling” or “convert to TypeScript.”

If completions aren’t appearing, check that Ollama is running (curl http://localhost:11434) and that you pulled both models.
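One way to confirm that both models are actually installed is to ask Ollama for its model list via the /api/tags endpoint and compare it against what the config expects. The helper names below are hypothetical; the sketch assumes the response contains a "models" list whose entries carry a "name" field.

```python
import json
import urllib.request
from typing import Iterable, List

REQUIRED_MODELS = ("qwen2.5-coder:1.5b", "qwen2.5-coder:7b")


def missing_models(tags_payload: dict,
                   required: Iterable[str] = REQUIRED_MODELS) -> List[str]:
    """Return the required model names absent from an /api/tags response."""
    installed = {entry.get("name") for entry in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(base_url: str = "http://localhost:11434",
               timeout: float = 5.0) -> dict:
    """Fetch the list of locally installed models from a running Ollama."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return json.load(response)


# Usage, with Ollama running:
#   print(missing_models(fetch_tags()) or "all models present")
```

Any name it reports as missing can be fetched with ollama pull, as in Step 2.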

Tuning for Your Hardware

Low-end machines (8 GB RAM, no GPU)

Stick with the 1.5B model for everything:

models:
  - name: Qwen2.5-Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - chat
      - edit
      - autocomplete
    autocompleteOptions:
      debounceDelay: 350
      maxPromptTokens: 512

Increase debounceDelay to 350ms and reduce maxPromptTokens to 512 to keep things responsive. Completions will be less context-aware but still useful.

High-end machines (32+ GB RAM or 12+ GB VRAM)

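Use a 14B or 32B model for chat and keep the 1.5B for autocomplete. The example below uses the 32B model, which needs roughly 20 GB of memory at Ollama’s default quantization; on a 12 GB GPU, substitute qwen2.5-coder:14b: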

models:
  - name: Qwen2.5-Coder 32B
    provider: ollama
    model: qwen2.5-coder:32b
    roles:
      - chat
      - edit

  - name: Qwen2.5-Coder 1.5B
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete
    autocompleteOptions:
      debounceDelay: 200
      maxPromptTokens: 2048

Qwen’s published results put the 32B instruct model at 92.7% on HumanEval, roughly level with GPT-4o, and it runs well on Apple Silicon Macs with 32 GB unified memory.
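The three hardware tiers in this guide can be summed up as a small lookup. The function below is only a sketch of that decision: the thresholds and settings come from the sections above, while picking 14B over 32B when RAM is under 32 GB is an assumption of this example.

```python
def pick_profile(ram_gb: float, vram_gb: float = 0.0) -> dict:
    """Map available memory to the model and autocomplete settings above."""
    small = "qwen2.5-coder:1.5b"
    if ram_gb >= 32 or vram_gb >= 12:
        # Assumption: fall back to 14B when only the GPU qualifies.
        chat = "qwen2.5-coder:32b" if ram_gb >= 32 else "qwen2.5-coder:14b"
        return {"chat": chat, "autocomplete": small,
                "debounceDelay": 200, "maxPromptTokens": 2048}
    if ram_gb >= 16 or vram_gb >= 8:
        # The default setup from Step 4.
        return {"chat": "qwen2.5-coder:7b", "autocomplete": small,
                "debounceDelay": 250, "maxPromptTokens": 1024}
    # Low-end: one small model handles every role.
    return {"chat": small, "autocomplete": small,
            "debounceDelay": 350, "maxPromptTokens": 512}
```

For example, pick_profile(8) returns the single-model low-end settings, and pick_profile(64) returns the 32B chat model with the faster 200ms debounce.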

How It Compares to Copilot

After a week of using this setup alongside Copilot on real projects, the differences become clear:

Where local wins:

Privacy: zero code leaves your machine
Cost: $0/month after hardware you already own
Speed: on a decent GPU, completions often appear faster than Copilot because there’s no network round-trip
No outages: works offline, on planes, behind corporate firewalls
No content filtering: the model suggests whatever it thinks is correct

Where Copilot still has an edge:

Multi-file context: Copilot’s larger cloud models consider more of your codebase
Training data freshness: cloud models update faster with new libraries and APIs
Zero setup: sign in and go

The gap is narrower than it was a year ago. The Qwen2.5-Coder models were trained on 5.5 trillion tokens of code-heavy data, and the largest 32B variant is competitive with GPT-4o on several coding benchmarks; the smaller models trail it. For day-to-day autocomplete (finishing function bodies, generating boilerplate, suggesting variable names), the local models are good enough that you’ll stop noticing the difference.

Going Further

Add Tabby for team use. If you need a shared, self-hosted code completion server for multiple developers, Tabby runs as a Docker container and serves completions over HTTP. One GPU server can handle 15-25 concurrent users.

Index your codebase. Continue supports codebase indexing for context-aware answers. Add to your config:

context:
  - provider: codebase
    params:
      nRetrieve: 25
      nFinal: 5

This embeds your project files and retrieves relevant snippets when you ask about your codebase in chat: nRetrieve sets how many candidates are pulled from the index, and nFinal how many survive into the prompt. It brings some of Copilot’s multi-file awareness to your local setup.
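The two numbers in that config describe a two-stage lookup: pull a wide set of candidates by embedding similarity, then keep only the best few. The toy Python sketch below shows the general retrieve-then-trim pattern with hand-written vectors. It is an illustration under assumed names, not Continue's indexing code, and the optional rerank hook stands in for whatever re-ranking a real pipeline applies.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             snippets: Dict[str, Sequence[float]],
             n_retrieve: int = 25,
             n_final: int = 5,
             rerank: Optional[Callable[[List[str]], List[str]]] = None) -> List[str]:
    """Return up to n_final snippet names most relevant to query_vec."""
    ranked = sorted(snippets,
                    key=lambda name: cosine(query_vec, snippets[name]),
                    reverse=True)
    candidates = ranked[:n_retrieve]      # stage 1: wide net
    if rerank is not None:
        candidates = rerank(candidates)   # stage 2: optional re-ordering
    return candidates[:n_final]           # only these reach the prompt
```

A larger nRetrieve gives the second stage more to choose from, while nFinal caps how much of the model's context window the retrieved snippets can consume.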

Try different models. The open-weight model space moves fast. DeepSeek V4-Pro hit 80.6% on SWE-bench Verified, making it the current top open-source coding model. Swap models in your config without changing anything else — that’s the advantage of a modular setup.

What You Can Do

The whole setup takes three installs (Ollama, two models, Continue extension) and one config file. If Copilot’s subscription ever bothered you, or if you work with code you can’t send to Microsoft’s servers, this is the practical alternative. The models are good, the tooling is mature, and the cost is zero.