Self-Host Your Own Code Completion: Replace GitHub Copilot With Continue and Ollama

Step-by-step guide to setting up free, private AI code completion in VS Code using Continue and Qwen2.5-Coder running locally

GitHub Copilot costs $10-39 per month, sends your code to Microsoft’s servers, and on the free and Pro tiers, likely uses your code to train future models. For anyone working with proprietary code or wanting to keep their work private, that’s a problem.

The alternative: run your own code completion locally. It takes about 15 minutes to set up, costs nothing, and your code never leaves your machine.

What You’ll Get

By the end of this guide, you’ll have tab-autocomplete working in VS Code that:

  • Runs entirely on your hardware
  • Works offline
  • Sends zero data to any external server
  • Costs nothing after initial setup
  • Provides quality comparable to Copilot for most tasks

The tradeoff: you need decent hardware, and completions may be slightly slower than cloud-based alternatives.

Requirements

Minimum hardware:

  • 8GB RAM (16GB recommended)
  • Modern CPU with AVX2 support
  • 10GB free disk space

Recommended for best performance:

  • 16GB+ RAM
  • NVIDIA GPU with 8GB+ VRAM, or Apple Silicon Mac
  • SSD storage

If you have an M1/M2/M3 Mac, you’re in luck. Apple Silicon runs local models extremely well thanks to unified memory architecture.

Step 1: Install Ollama

Ollama is the simplest way to run LLMs locally. It handles model downloads, memory management, and provides an API that Continue can talk to.

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com/download.

After installation, start the Ollama service:

ollama serve

On macOS, Ollama runs as a background service automatically after installation.

Step 2: Download a Code Model

Not all LLMs are good at code completion. You want a model specifically trained for fill-in-the-middle (FIM) tasks, which is what tab-autocomplete requires.

The current best option is Qwen2.5-Coder. It scores 88.4% on HumanEval (a standard coding benchmark) at just 7 billion parameters, matching or beating models 3-5x its size.
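
To see what FIM means concretely: the editor sends the code before and after your cursor, wrapped in special tokens, and the model generates the code that belongs in the gap. A minimal sketch of how such a prompt is assembled, using the FIM tokens from Qwen2.5-Coder's model documentation (Continue builds this prompt for you; the sketch is only illustrative):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt using Qwen2.5-Coder's FIM tokens.

    The model is asked to generate the text that belongs between `prefix`
    (code before the cursor) and `suffix` (code after the cursor).
    """
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Cursor sits after the docstring; the model fills in the function body.
prompt = build_fim_prompt(
    'def add(a, b):\n    """Add two numbers."""\n    ',
    "\n\nprint(add(2, 3))",
)
```

A model trained only on left-to-right completion ignores the suffix entirely, which is why chat-tuned models often produce worse tab-autocomplete than smaller FIM-trained ones.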

Download it:

ollama pull qwen2.5-coder:7b

This downloads about 4.5GB. The model needs roughly 6GB of RAM/VRAM when running.
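
Those figures follow from simple arithmetic: size is roughly parameter count times bits per weight. A back-of-envelope sketch, assuming the default Ollama tag is a ~4-bit quantization and that the "7B" model actually has about 7.6 billion parameters (both are assumptions about this particular tag, not exact specs):

```python
def approx_model_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough model size: parameters x bits per weight, converted to GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# ~7.6B parameters at ~4.5 effective bits per weight (4-bit quantization
# plus per-block scaling overhead) gives roughly the observed download size.
weights_gb = approx_model_gb(7.6, 4.5)
print(round(weights_gb, 1))  # ~4.3
```

The KV cache and runtime overhead account for the gap between the ~4.3GB of weights and the ~6GB you see in practice.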

If 7B is too slow on your hardware, try the smaller versions:

ollama pull qwen2.5-coder:3b    # Faster, still good quality
ollama pull qwen2.5-coder:1.5b  # Fastest, acceptable quality

Step 3: Install Continue

Continue is an open-source VS Code extension that provides AI-assisted coding. Unlike Copilot, it lets you use any model backend, including local ones.

Install it from VS Code:

  1. Open VS Code
  2. Go to Extensions (Ctrl/Cmd+Shift+X)
  3. Search for “Continue”
  4. Install the one by Continue.dev

After installation, you’ll see a Continue icon in your sidebar.

Step 4: Configure Autocomplete

Continue needs to be pointed at your local Ollama instance for autocomplete. Open the Continue settings:

  1. Click the Continue icon in the sidebar
  2. Click the gear icon (settings)
  3. Click “Open config.json”

Add or modify the tabAutocompleteModel section:

{
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder (Local)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}

If you downloaded a different model size, adjust the model name accordingly.

Save the file. Continue will automatically reload the configuration.

Step 5: Enable and Test

Make sure autocomplete is enabled:

  1. Open VS Code Settings (Ctrl/Cmd+,)
  2. Search for “Continue autocomplete”
  3. Ensure “Enable Tab Autocomplete” is checked

Now test it. Open or create a Python file and start typing:

def calculate_fibonacci(n):
    """Calculate the nth Fibonacci number."""

Pause after the docstring. You should see a ghost text suggestion appear. Press Tab to accept it.

If nothing appears, check that:

  • Ollama is running (ollama serve)
  • The model is downloaded (ollama list)
  • Continue is properly configured

Tuning for Speed

If completions feel slow, you have several options.

Use a smaller model: Change your config to use the 3B or 1.5B version. The quality difference is noticeable but acceptable for most autocomplete tasks:

{
  "tabAutocompleteModel": {
    "provider": "ollama",
    "model": "qwen2.5-coder:3b"
  }
}

Adjust debounce timing: Continue waits a moment after you stop typing before requesting completions. You can tune this:

{
  "tabAutocompleteOptions": {
    "debounceDelay": 500
  }
}

Lower values mean faster suggestions but more compute load.
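
To make that tradeoff concrete, here is a toy simulation of debouncing (not Continue's actual implementation): a completion request fires only when no further keystroke arrives within the delay window, so a shorter delay fires more requests for the same typing pattern.

```python
def completion_requests(keystroke_times_ms, debounce_ms):
    """Return the times (ms) at which a debounced completion request fires.

    A request fires `debounce_ms` after a keystroke only if no further
    keystroke arrives within that window.
    """
    fires = []
    for i, t in enumerate(keystroke_times_ms):
        is_last = i + 1 == len(keystroke_times_ms)
        if is_last or keystroke_times_ms[i + 1] - t > debounce_ms:
            fires.append(t + debounce_ms)
    return fires

# Four keystrokes: three in quick succession, then one after a long pause.
keystrokes = [0, 200, 400, 1200]
print(completion_requests(keystrokes, 500))  # [900, 1700]  -> 2 requests
print(len(completion_requests(keystrokes, 100)))  # 4 requests, more compute
```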

Keep the model loaded: By default, Ollama unloads models after 5 minutes of inactivity. To keep your code model always loaded:

curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:7b", "keep_alive": -1}'

This tells Ollama to keep the model in memory indefinitely.

Adding Chat (Optional)

Autocomplete is only part of what Copilot offers. You might also want an AI to explain code, answer questions, or help with refactoring.

Add a chat model to your Continue config:

{
  "models": [
    {
      "title": "Qwen2.5-Coder Chat",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b-instruct"
    }
  ],
  "tabAutocompleteModel": {
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}

Download the instruct variant:

ollama pull qwen2.5-coder:7b-instruct

Now you can select code, right-click, and ask Continue to explain or modify it.

How It Compares to Copilot

After a week of daily use, here’s an honest assessment:

Where local wins:

  • Privacy: Your code stays on your machine
  • Cost: Free forever after setup
  • Offline: Works without internet
  • Customization: Use any model, tweak any setting

Where Copilot wins:

  • Speed: Cloud inference is typically faster
  • Context: Copilot has better whole-project awareness
  • Polish: The integration is more seamless
  • Updates: New model improvements happen automatically

Where they’re roughly equal:

  • Basic autocomplete quality for common patterns
  • Function completion from docstrings
  • Boilerplate generation

For proprietary codebases, the privacy benefit alone makes local worth it. For open-source work where you don’t mind cloud processing, Copilot’s polish might be worth $10/month.

Alternative: Tabby for Team Setups

If you need code completion for a team rather than just yourself, look at Tabby. It’s a self-hosted server that multiple developers can connect to.

Tabby advantages:

  • One powerful server, many clients
  • Repository indexing for better context
  • Admin controls and audit logs
  • No need for GPUs on developer machines

The tradeoff is more infrastructure complexity. For solo developers, Continue + Ollama is simpler.

Troubleshooting

Completions aren’t appearing:

  • Check Ollama is running: curl http://localhost:11434/api/tags
  • Restart VS Code
  • Check Continue’s output panel for errors (View > Output > Continue)
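
The /api/tags endpoint returns JSON listing the installed models, so the first two checks can be scripted. A small sketch that parses such a response and confirms your configured model is present (the sample mimics the response shape; the size value is made up):

```python
import json

def model_installed(tags_json: str, model_name: str) -> bool:
    """Check whether `model_name` appears in an Ollama /api/tags response."""
    data = json.loads(tags_json)
    return any(m["name"] == model_name for m in data.get("models", []))

# Sample response in the shape /api/tags returns.
sample = '{"models": [{"name": "qwen2.5-coder:7b", "size": 4680000000}]}'
print(model_installed(sample, "qwen2.5-coder:7b"))  # True
print(model_installed(sample, "qwen2.5-coder:3b"))  # False
```

A mismatch here usually means the model name in config.json doesn't exactly match the tag shown by ollama list.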

Completions are very slow:

  • Use a smaller model
  • Check if something else is using your GPU
  • On laptops, plug in; battery mode throttles performance

Model keeps unloading:

  • Use the keep_alive API call mentioned above
  • Or add OLLAMA_KEEP_ALIVE=-1 to your environment

Out of memory errors:

  • Close other applications
  • Use a quantized model: qwen2.5-coder:7b-q4_K_M
  • Use a smaller model size

What You Can Do

  1. Try it for a week: Give local code completion a real test before deciding. The first day feels slower; by day three, you’ve adapted.

  2. Experiment with models: New code models release regularly. DeepSeek-Coder, Codestral, and others all work with this setup.

  3. Contribute to open source: Both Continue and Ollama are open-source projects that benefit from community contributions and feedback.

  4. Tell your team: If you work somewhere with strict data policies, this setup might be the only way to get AI code assistance approved.

The era of “AI code completion requires sending code to the cloud” is over. The tools are here, they’re free, and they work.