GitHub Copilot costs $10/month for individuals and $19/user/month for business plans. That’s $120-228 per developer per year - and your code gets sent to Microsoft’s servers. There’s another option: run your own AI code assistant locally, with the same inline suggestions and chat features, on hardware you probably already own.
Tabby is an open-source, self-hosted alternative to GitHub Copilot with over 33,000 GitHub stars. It runs on consumer GPUs, Apple Silicon, and even CPU-only setups. Your code never leaves your machine.
This guide walks through setting up Tabby with Docker, connecting it to VS Code, and choosing the right model for your hardware.
What You’ll Get
After following this guide, you’ll have:
- AI code completion that suggests lines and functions as you type
- A chat interface for asking questions about your code
- Repository context awareness - Tabby can index your codebase for smarter suggestions
- Zero monthly fees
- Complete data privacy
Hardware Requirements
Tabby scales down surprisingly well. Here’s what you need for different model sizes:
Small models (1B-3B parameters):
- NVIDIA T4, GTX 1060+, RTX 2060+, or any 10/20-series GPU
- Apple Silicon M1/M2/M3/M4
- ~4GB VRAM minimum
Medium models (7B parameters):
- NVIDIA T4 or better
- Apple Silicon with 16GB+ unified memory
- ~8GB VRAM for int8 quantization
Large models (13B+):
- NVIDIA V100, A100, RTX 3090, or 40-series GPUs
- ~16GB+ VRAM
Don’t have a GPU? Tabby works on CPU too - just expect slower response times. Start with a 1B model and see if it’s usable for your workflow.
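As a back-of-envelope sizing check (my own rule of thumb, not an official Tabby sizing guide): a model needs roughly one byte per parameter at int8 quantization, or two bytes at fp16, plus ~20% headroom for activations and KV cache:

```shell
# Rough VRAM estimate - a hypothetical helper, not from Tabby's docs.
params_b=7   # model size in billions of parameters
bytes=1      # ~1 byte/weight for int8, ~2 for fp16
# Multiply by 1.2 to leave headroom for activations and KV cache.
est_gb=$(( params_b * bytes * 12 / 10 ))
echo "~${est_gb}GB VRAM for a ${params_b}B model"
```

By this estimate, a 7B model at int8 fits comfortably in 8GB of VRAM, which matches the requirements listed above.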
Step 1: Install Docker
If you don’t have Docker installed:
macOS:
```shell
brew install --cask docker
```
Ubuntu/Debian:
```shell
sudo apt update && sudo apt install docker.io
sudo usermod -aG docker $USER
# Log out and back in for the group change to take effect
```
Windows: Download Docker Desktop and install.
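Before moving on, it’s worth confirming Docker is actually on your PATH - a quick check like this avoids confusing errors in the next step:

```shell
# Verify docker is installed; print its version or a hint if missing.
if command -v docker >/dev/null 2>&1; then
  status="docker found: $(docker --version)"
else
  status="docker not found - install it before continuing"
fi
echo "$status"
```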
Step 2: Run Tabby Server
Choose your command based on your hardware:
NVIDIA GPU (recommended):
```shell
docker run -d \
  --name tabby \
  --gpus all \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve \
  --model Qwen2.5-Coder-1.5B \
  --chat-model Qwen2.5-Coder-1.5B-Instruct \
  --device cuda
```
Apple Silicon (Metal): Docker Desktop on macOS runs containers in a VM without GPU access, so Metal won’t work inside Docker. Run the native binary instead:

```shell
brew install tabbyml/tabby/tabby
tabby serve \
  --model Qwen2.5-Coder-1.5B \
  --chat-model Qwen2.5-Coder-1.5B-Instruct \
  --device metal
```
CPU only:
```shell
docker run -d \
  --name tabby \
  -p 8080:8080 \
  -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve \
  --model Qwen2.5-Coder-0.5B \
  --chat-model Qwen2.5-Coder-0.5B-Instruct \
  --device cpu
```
The first run downloads the model - expect a few minutes depending on your connection.
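If you prefer Docker Compose to a long docker run invocation, the NVIDIA variant above maps to a compose file roughly like this (a sketch assuming Compose v2’s GPU device-reservation syntax):

```yaml
services:
  tabby:
    image: tabbyml/tabby
    restart: unless-stopped
    command: serve --model Qwen2.5-Coder-1.5B --chat-model Qwen2.5-Coder-1.5B-Instruct --device cuda
    ports:
      - "8080:8080"
    volumes:
      - $HOME/.tabby:/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Then bring it up with docker compose up -d and down with docker compose down.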
Step 3: Create Your Account
- Open http://localhost:8080 in your browser
- Create an admin account (the first user becomes admin)
- Click your profile icon
- Copy your API token - you’ll need this for the IDE extension
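You can sanity-check the token from the command line before touching the IDE. Tabby exposes a completion endpoint at /v1/completions; this call (YOUR_TOKEN is a placeholder you must replace) prints either the JSON response or a failure hint:

```shell
# Replace YOUR_TOKEN with the token copied from the Tabby web UI.
resp=$(curl -sf -m 5 -X POST http://localhost:8080/v1/completions \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"language": "python", "segments": {"prefix": "def fib(n):"}}' || true)
if [ -n "$resp" ]; then
  msg="token works - got completion: $resp"
else
  msg="request failed - is the server up and the token correct?"
fi
echo "$msg"
```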
Step 4: Connect VS Code
- Open VS Code
- Go to Extensions (Ctrl/Cmd+Shift+X)
- Search for “Tabby” and install the official extension
- Click the Tabby icon in the status bar (bottom right)
- Select “Connect to Server”
- Enter your server URL: http://localhost:8080
- Paste your API token
Test it: open any code file and start typing a function. You should see gray completion suggestions appear. Press Tab to accept.
Choosing the Right Model
Tabby supports dozens of models. Here are the best options for different scenarios:
Best for limited hardware (4-8GB VRAM):
- Qwen2.5-Coder-1.5B - Best quality-to-size ratio
- StarCoder-1B - Battle-tested, slightly older
- CodeGemma-2B - Google’s option, good for general coding
Best for quality (8-16GB VRAM):
- Qwen2.5-Coder-7B - Excellent across all languages
- DeepSeekCoder-6.7B - Strong on complex completions
- CodeLlama-7B - Meta’s coding model, well-tested
Best for power users (16GB+ VRAM):
- Qwen2.5-Coder-14B - Near-commercial quality
- Codestral-22B - Mistral’s coding specialist
- DeepSeek-Coder-V2-Lite - Mixture-of-experts architecture
To switch models, stop and restart Tabby with a different --model flag:
```shell
docker stop tabby && docker rm tabby
# Then run the docker command again with your new model
```
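If you switch models often, a small helper that prints the full command for a given model pair saves retyping. This is a hypothetical convenience script (not part of Tabby), shown in dry-run form so you can review the command before executing it:

```shell
# Hypothetical helper: print the docker run command for a model pair.
# Usage: sh run-tabby.sh [MODEL] [CHAT_MODEL]
model="${1:-Qwen2.5-Coder-7B}"
chat_model="${2:-${model}-Instruct}"
cat <<EOF
docker run -d --name tabby --gpus all -p 8080:8080 \\
  -v \$HOME/.tabby:/data tabbyml/tabby serve \\
  --model ${model} --chat-model ${chat_model} --device cuda
EOF
```

Pipe the output to sh once you’re happy with it.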
Adding Repository Context
Tabby can index your codebase for context-aware completions. This is where it starts to rival Copilot.
- In the Tabby web UI (http://localhost:8080), go to Settings > Repositories
- Add your local repository path or connect a GitHub/GitLab repo
- Wait for indexing to complete
Now when you code, Tabby understands your project’s patterns, naming conventions, and existing functions.
Troubleshooting
“Connection refused” errors:
```shell
# Check if Tabby is running
docker ps | grep tabby
# If it isn't, check the logs
docker logs tabby
```
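A quick way to distinguish "container down" from "extension misconfigured" is to hit Tabby’s health endpoint (/v1/health) directly; this snippet prints one line either way:

```shell
# Probe the Tabby health endpoint; fall back to a hint on failure.
if curl -sf -m 5 http://localhost:8080/v1/health >/dev/null 2>&1; then
  health="server is up"
else
  health="server unreachable - check docker logs tabby"
fi
echo "Tabby: $health"
```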
Slow completions:
- Try a smaller model
- Ensure you’re using GPU acceleration (check the --device flag)
- On NVIDIA, verify the driver is working: nvidia-smi
High memory usage: Models load entirely into memory. Close other GPU-heavy applications or use a smaller model.
macOS Metal not working: Docker containers on macOS run in a VM with no GPU passthrough, so Metal is unavailable inside Docker. Install and run Tabby’s native macOS build (e.g. via Homebrew) instead.
Tabby vs. Copilot: What You’re Trading
What you gain:
- Complete privacy - code never leaves your machine
- No recurring costs after hardware investment
- Works offline
- Customizable models
- Repository-aware context you control
What you lose:
- Copilot’s multi-model approach (GPT-4, Claude access)
- Slightly lower quality on very complex completions
- Initial setup time
- Responsibility for updates and maintenance
For most developers writing standard code - APIs, CRUD operations, data processing - you won’t notice the quality difference. Where Copilot still wins is on unusual edge cases and when you need to tap into GPT-4 or Claude for complex reasoning.
What’s Next
Once you’re comfortable with basic setup:
- Add team members: Tabby supports multi-user with authentication domains and SSO
- Try different chat models: Qwen2-1.5B-Instruct or Qwen3 models work well for the chat interface
- Index documentation: Add your internal docs as context sources
- Set up repository integration: Connect to GitHub/GitLab for PR context
Running your own AI coding assistant takes an hour to set up and saves you at minimum $120/year - more if you have a team. Your code stays private, you control the models, and you’re not dependent on anyone’s API staying online.
The only question is whether your hardware can keep up. Start with a small model, see if the speed is tolerable, and upgrade models as you find the limits. Most developers with any discrete GPU or recent Mac will find it perfectly usable.