You’ve got documents you want to search with AI - contracts, research papers, meeting notes, code documentation. But uploading them to ChatGPT or Claude means sending potentially sensitive data to external servers. What if you could run the same kind of intelligent document search entirely on your own machine?
That’s what Retrieval-Augmented Generation (RAG) does, and with the right tools, you can set it up locally in about 30 minutes.
What You’re Building
RAG combines two things: a search system that finds relevant chunks of your documents, and an AI model that answers questions using those chunks as context. When you ask “What was our Q3 revenue?”, the system searches your documents for revenue-related passages, then feeds them to the AI along with your question.
The result is an AI that actually knows your documents - not by training on them, but by looking them up when you ask.
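The retrieve-then-generate loop is simple enough to sketch in a few lines of Python. This toy version scores chunks by keyword overlap purely for illustration (AnythingLLM uses embedding similarity instead), and the chunk texts are made up:

```python
import re

# A few made-up document chunks standing in for your indexed files.
chunks = [
    "Q3 revenue was $4.2M, up 8% over Q2.",
    "The office lease renews in January.",
    "Q3 expenses came in under budget.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank chunks by how many words they share with the question (toy scoring)."""
    q_words = set(re.findall(r"\w+", question.lower()))
    score = lambda c: len(q_words & set(re.findall(r"\w+", c.lower())))
    return sorted(docs, key=score, reverse=True)[:k]

def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the prompt the model actually sees: context first, then the question."""
    return "Answer using only this context:\n" + "\n".join(context) + "\n\nQuestion: " + question

question = "What was our Q3 revenue?"
print(build_prompt(question, retrieve(question, chunks)))
```

The key idea is that the model never trains on your files; it just receives the most relevant excerpts as part of each prompt.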
Why Run RAG Locally?
The case for local RAG comes down to what you can’t safely upload to cloud services: medical records, financial documents, legal contracts, proprietary business data, personal journals. The entire process happens on your computer, and files never leave your machine.
There’s also the practical side. Samsung banned ChatGPT after employees accidentally leaked source code. Once data hits a cloud service, it’s stored on external servers, potentially used for training, and difficult to completely delete. Local RAG eliminates that risk.
What You’ll Need
Hardware requirements:
- 16GB RAM minimum (32GB recommended for larger models)
- 8GB+ VRAM for GPU acceleration, or a recent Mac with M-series chip
- 20GB free disk space for models and documents
Software:
- Ollama - runs AI models locally
- AnythingLLM - provides the RAG interface
AnythingLLM is an all-in-one desktop app that handles document processing, vector storage, and chat - no coding required. It connects to Ollama for the AI models, keeping everything local.
Step 1: Install Ollama
Download Ollama from ollama.com and run the installer. On Mac and Windows, it’s a standard app installation. On Linux:
curl -fsSL https://ollama.com/install.sh | sh
Once installed, pull a model. For most hardware, Llama 3.2 3B is a solid starting point:
ollama pull llama3.2:3b
If you have 16GB+ VRAM or an M2/M3 Mac with 32GB+ unified memory, you can run larger models:
ollama pull llama3.1:8b
You’ll also need an embedding model, which converts your documents into searchable vectors. nomic-embed-text offers excellent accuracy and an 8K-token context window:
ollama pull nomic-embed-text
For faster processing on limited hardware, all-minilm works well for English documents:
ollama pull all-minilm
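Under the hood, tools like AnythingLLM get these vectors by calling Ollama’s local HTTP API. If you’re curious what that looks like, here’s a minimal Python sketch against Ollama’s /api/embeddings endpoint (it assumes Ollama is running on its default port and the model has been pulled):

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/embeddings"  # Ollama's default local endpoint

def build_request(model: str, text: str) -> urllib.request.Request:
    """Build the JSON POST request Ollama's embeddings endpoint expects."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )

def embed(model: str, text: str) -> list[float]:
    """Return the embedding vector for `text` (requires Ollama to be running)."""
    with urllib.request.urlopen(build_request(model, text)) as resp:
        return json.loads(resp.read())["embedding"]

# With Ollama running:
#   vec = embed("nomic-embed-text", "Q3 revenue was $4.2M.")
#   len(vec) is the embedding dimension (768 for nomic-embed-text)
```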
Step 2: Install AnythingLLM
Download AnythingLLM Desktop for your operating system. Run the installer and launch the app.
On first launch, you’ll go through a setup wizard. Select “Ollama” as your LLM provider when prompted. The default connection URL is http://127.0.0.1:11434 - AnythingLLM will auto-detect this if Ollama is running.
Select your chat model (the one you pulled earlier, like llama3.2:3b) and your embedding model (nomic-embed-text or all-minilm).
Step 3: Create Your First Workspace
AnythingLLM organizes documents into workspaces. Click “New Workspace” and give it a name - something like “Work Documents” or “Research Papers”.
Each workspace has its own document collection and chat history, so you can keep different projects separate.
Step 4: Add Documents
Click the upload icon in your workspace. AnythingLLM accepts:
- PDF files
- Word documents (.docx)
- Text and markdown files
- Code files
- Web pages (paste URLs)
Drag in your documents or browse to select them. AnythingLLM will:
- Extract text from each document
- Split text into chunks (default ~1000 characters)
- Generate embeddings for each chunk
- Store everything in a local vector database
Processing time depends on document size and your hardware. A 50-page PDF typically takes 30-60 seconds.
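The chunking step itself is straightforward. Here’s a minimal sketch of fixed-size character chunking with a small overlap; the overlap value is illustrative, not AnythingLLM’s exact default:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into ~size-character chunks; overlapping edges keep
    sentences that straddle a boundary visible in both neighbors."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# A 2500-character document becomes three chunks at these settings.
pieces = chunk_text("x" * 2500)
print([len(p) for p in pieces])  # [1000, 1000, 700]
```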
Step 5: Start Asking Questions
With documents loaded, just type your question in the chat. Try something specific:
- “What are the key terms in the consulting agreement?”
- “Summarize the methodology section”
- “Find all mentions of budget overruns”
The system will search your documents, find relevant passages, and generate an answer with context. You can see which document chunks were used by clicking the source citations.
Tuning for Better Results
Chunk Size
Most collections work well in the 500-1500 character range. If answers miss important context, try increasing chunk size in Settings > Documents; larger chunks (1500-2000 characters) preserve more surrounding context at the cost of slower processing.
Top-K Results
This controls how many document chunks are sent to the AI along with your question. The default is usually 4. Increase it to 6-8 if answers seem incomplete; decrease it if responses are slow or unfocused.
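Top-K selection amounts to ranking every stored chunk by similarity to the question’s embedding and keeping the best K. A sketch using cosine similarity over toy 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query: list[float], vectors: list[list[float]], k: int = 4) -> list[int]:
    """Indices of the k stored vectors most similar to the query embedding."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]

query = [1.0, 0.0, 0.0]
stored = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [1.0, 0.1, 0.1], [0.0, 0.0, 1.0]]
print(top_k(query, stored, k=2))  # [0, 2]
```

A larger K gives the model more evidence to draw on, but every extra chunk consumes context window and slows generation, which is why raising it past 6-8 rarely helps.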
Model Selection
For simple document Q&A, smaller models (3B-7B parameters) work fine and respond faster. For complex analysis or synthesis across multiple documents, larger models (13B+) produce better results but need more memory.
Alternative Approaches
AnythingLLM is the easiest option, but alternatives exist:
PrivateGPT is more developer-focused, exposing a RAG pipeline as an API. Good if you want to build custom applications on top.
Open WebUI offers similar RAG features with a different interface. Supports multiple vector databases and content extraction engines.
LangChain + ChromaDB gives you full control if you’re comfortable with Python. More setup work, but maximum flexibility.
What This Doesn’t Do
Local RAG has limitations. Your answers are only as good as your documents - it can’t access information that isn’t in your collection. Response quality depends on your hardware; a laptop will be slower than a desktop with a GPU.
And while local models have improved dramatically, they’re still not quite at GPT-4 or Claude 3.5 Sonnet levels for complex reasoning. For most document Q&A, though, the gap is small enough that the privacy makes it worthwhile.
What You Can Do Now
Start with a small document collection - maybe 10-20 files you frequently reference. Test different questions to understand what works well. Add more documents as you get comfortable.
The setup takes about 30 minutes, and once running, you’ve got a private document AI that never sends your data anywhere. That’s worth the effort.