How to Build a Private RAG Chatbot With Open WebUI and Ollama

Chat with your own documents locally — no cloud, no subscriptions, no data leaving your machine. Step-by-step setup guide.

You have a pile of PDFs — meeting notes, research papers, internal docs, personal journals — and you want to ask questions about them. The obvious move is to upload them to ChatGPT or Claude. But that means sending your private documents through someone else’s servers, hoping their privacy policy holds, and paying $20/month for the privilege.

There’s a better way. You can run a fully private RAG (Retrieval-Augmented Generation) chatbot on your own machine using Open WebUI and Ollama. No cloud. No subscriptions. No data leaving your computer.

Here’s exactly how to set it up.

What You’re Building

RAG stands for Retrieval-Augmented Generation. Instead of relying only on a language model’s training data, RAG pulls relevant chunks from your documents and feeds them to the model as context. The result: answers grounded in your actual files, with the ability to cite where the information came from.
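
Under the hood there’s no magic: the retrieved chunks are simply pasted into the prompt. Here’s a minimal sketch of what the model actually receives, using Ollama’s generate API once it’s running (Step 1); the bracketed chunks and the question are made-up placeholders:

# Illustration only: the bracketed chunks stand in for text retrieved from your documents
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Answer using only the context below.\n\nContext:\n[chunk 1 from your docs]\n[chunk 2 from your docs]\n\nQuestion: What were the Q3 revenue figures?",
  "stream": false
}'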

Your stack:

  • Ollama runs open-weight language models and embedding models locally
  • Open WebUI provides a polished chat interface with built-in RAG, knowledge bases, and document management
  • Your documents stay on your machine, period

What You’ll Need

Minimum hardware:

  • 8 GB RAM (16 GB recommended)
  • SSD with at least 15 GB free space
  • Any modern CPU (2018 or newer)

For decent performance:

  • 16+ GB RAM
  • A GPU with 8+ GB VRAM (NVIDIA recommended, Apple Silicon works great)
  • An SSD

Software:

  • Ollama (installed in Step 1)
  • Docker, to run Open WebUI (Docker Compose is worth using for long-term setups)

If you’re on a Mac with Apple Silicon, you’re in luck — Ollama runs natively and uses the unified memory architecture efficiently. No GPU setup headaches.

Step 1: Install Ollama

If you haven’t already, install Ollama:

macOS/Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download from ollama.com and run the installer.

Verify it’s working:

ollama --version
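
Ollama also runs a local HTTP server (port 11434 by default) that Open WebUI will talk to later. A quick check that it’s listening:

curl http://localhost:11434/api/version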

Step 2: Pull Your Models

You need two models: one for chatting, one for creating embeddings (turning text into searchable vectors).

Chat model — pick one based on your hardware:

For 8 GB RAM:

ollama pull llama3.2:3b

For 16 GB RAM:

ollama pull qwen3:8b

For 32+ GB RAM or a beefy GPU:

ollama pull qwen3:14b
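
Whichever size you choose, a quick one-off prompt from the terminal confirms the model loads and generates (swap in the tag you actually pulled):

ollama run llama3.2:3b "Summarize retrieval-augmented generation in one sentence."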

Embedding model:

ollama pull nomic-embed-text

nomic-embed-text is the go-to choice here. It uses only ~500 MB of memory, supports an 8,192-token context window, and outperforms OpenAI’s text-embedding-ada-002 on both short and long text tasks. For most RAG setups, it’s all you need.

If you want maximum retrieval precision and have the headroom, mxbai-embed-large is a solid upgrade — it matches text-embedding-3-large performance.
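
You can sanity-check the embedding model before wiring it into Open WebUI by asking Ollama for a vector directly. On recent versions the endpoint is /api/embed (older releases use /api/embeddings); the response should contain a long list of floats:

curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Quarterly revenue grew 12% year over year."
}'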

Step 3: Launch Open WebUI

The fastest way is a single Docker command:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

This connects to Ollama running on your host machine. The -v open-webui:/app/backend/data volume is non-negotiable — it stores your database, uploaded documents, and settings. Skip it and you lose everything the next time the container is recreated, such as when you upgrade.
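
Before moving on, confirm the container actually came up. These are standard Docker checks, nothing Open WebUI-specific:

docker ps --filter name=open-webui
docker logs --tail 20 open-webui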

If you prefer Docker Compose (recommended for long-term use), create a docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    volumes:
      - open-webui:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama:
  open-webui:

Then:

docker compose up -d
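
One gotcha: with this Compose file, Ollama runs in its own container with its own model store, so the models you pulled on the host in Step 2 aren’t visible to it. Pull them again inside the service:

docker compose exec ollama ollama pull llama3.2:3b
docker compose exec ollama ollama pull nomic-embed-text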

Open your browser to http://localhost:3000. Create an admin account on first launch; the account is stored locally and only gates the web interface.

Step 4: Configure the Embedding Model

This is the step most guides skip, and it’s why RAG results often disappoint.

  1. Go to Admin Panel (gear icon) → Settings → Documents
  2. Under Embedding Model, select your Ollama instance and choose nomic-embed-text
  3. Set Chunk Size to 500 tokens (the default, and a good starting point)
  4. Set Chunk Overlap to 50-100 tokens (10-20% of chunk size)

These settings control how your documents get sliced up for searching. Smaller chunks (200-300 tokens) work better for precise factual lookups. Larger chunks (800-1,000 tokens) preserve more context for nuanced topics. Start with 500 and adjust based on your results.

Optional but worthwhile: Enable hybrid search if your version supports it. Hybrid search combines vector search (semantic meaning) with BM25 (keyword matching). This helps enormously with technical documents where specific terms like function names or error codes matter.

Step 5: Create a Knowledge Base

This is where the magic happens.

  1. Click Workspace in the left sidebar
  2. Click Knowledge at the top
  3. Click the + button to create a new knowledge base
  4. Name it something descriptive (e.g., “Project Alpha Docs” or “Tax Records 2025”)
  5. Click the upload icon and add your documents

Open WebUI accepts PDFs, Word documents, markdown files, plain text, and more. It processes them through its document extraction pipeline, chunks them, generates embeddings, and stores everything locally.
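
If you have a large folder to ingest, Open WebUI also exposes a REST API for file uploads that you can script against. Treat this as a sketch, not gospel: the endpoint path has shifted between versions, and the API key (the placeholder $OPEN_WEBUI_API_KEY below) comes from your user settings in the web interface:

# Sketch only: verify the path against your version's API docs
curl -X POST http://localhost:3000/api/v1/files/ \
  -H "Authorization: Bearer $OPEN_WEBUI_API_KEY" \
  -F "file=@/path/to/report.pdf"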

For best results:

  • Group related documents into the same knowledge base
  • Use descriptive filenames — the filename gets indexed too
  • Keep PDFs text-based when possible (scanned image PDFs need OCR and produce worse results)

Step 6: Chat With Your Documents

Start a new chat. Select your preferred model. Then type # to see your available knowledge bases. Select the one you want, type your question, and go.

The model will pull relevant chunks from your documents and use them to form its answer. You’re now chatting with your own files, completely offline.
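
Knowledge bases also work over Open WebUI’s OpenAI-compatible API, which is handy for scripting queries. A sketch, assuming an API key and a knowledge base ID copied from the UI (both are placeholders here):

# Sketch: $OPEN_WEBUI_API_KEY and YOUR_KNOWLEDGE_BASE_ID are placeholders
curl -X POST http://localhost:3000/api/chat/completions \
  -H "Authorization: Bearer $OPEN_WEBUI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "What were the Q3 revenue figures?"}],
    "files": [{"type": "collection", "id": "YOUR_KNOWLEDGE_BASE_ID"}]
  }'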

Tips for better results:

  • Be specific in your questions — “What were the Q3 revenue figures?” works better than “Tell me about revenue”
  • If the model hallucinates, try lowering the temperature in chat settings
  • Reference specific document names if you know where the answer lives
  • Ask follow-up questions to drill deeper into a topic

What This Replaces

This setup directly replaces a ChatGPT Plus subscription for document Q&A. At $20/month for ChatGPT Plus, that’s $240/year, or $1,200 per person over five years — and that gap only widens for teams. Your local setup costs nothing after the initial hardware (which you probably already own).

More importantly, your documents never leave your machine. No training on your data. No privacy policy to trust. No terms of service that can change tomorrow.

Troubleshooting

“Model not found” errors: Make sure Ollama is running (ollama serve) and the models are pulled (ollama list).

Slow responses: The chat model is probably too large for your hardware. Drop down a size — llama3.2:3b runs well even on older machines.

Bad RAG results: Check your chunk size settings. If answers miss key context, increase chunk size. If answers are vague, decrease it. Also verify the embedding model is set to nomic-embed-text, not left on the default.

Docker connection issues: If Open WebUI can’t reach Ollama, make sure you’re using --add-host=host.docker.internal:host-gateway in the Docker run command, or the correct OLLAMA_BASE_URL in Docker Compose.
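
On Linux, another frequent culprit is Ollama binding only to 127.0.0.1, which containers can’t reach. Restarting it bound to all interfaces often fixes this (loosen with care if the machine is exposed to your network):

OLLAMA_HOST=0.0.0.0 ollama serve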

What You Can Do Next

Once the basics are working:

  • Add more knowledge bases for different topics or projects
  • Try different chat models: deepseek-r1:8b is strong for reasoning tasks, qwen3:8b handles multilingual documents well
  • Set up scheduled automations to process new documents automatically
  • Enable reranking with a model like BAAI/bge-reranker-v2-m3 for better retrieval accuracy
  • Share with your household — Open WebUI supports multiple user accounts, all staying local

The whole point is that this grows with you. Start with a few PDFs, then expand as you get comfortable. Your data, your hardware, your rules.