Otter.ai charges $17/month for transcription, and their free tier caps you at 300 minutes monthly. Every recording you upload goes through their servers. There’s a better option: run OpenAI’s Whisper locally, get unlimited transcription, and keep your audio files on your own hardware.
Whisper is OpenAI’s open-source speech recognition model, trained on 680,000 hours of multilingual audio. It achieves word error rates around 3-4% on clean English speech - competitive with commercial services. The model supports 99 languages and runs entirely offline once installed.
This guide covers three approaches: the quickest setup with Docker, the fastest option for Apple Silicon Macs, and a full-featured solution with speaker identification for meetings.
What You’ll Get
After following this guide:
- Unlimited transcription with no monthly fees
- Audio never leaves your machine
- Support for 99 languages
- Accuracy matching or beating paid services
- Optional speaker identification for meetings and interviews
Choose Your Path
| Approach | Best For | Speed | Setup Difficulty |
|---|---|---|---|
| Faster-Whisper (Docker) | NVIDIA GPU users, server deployments | 4x faster than base Whisper | Easy |
| Whisper.cpp | Mac users, CPU-only systems, edge devices | 3-12x faster with Metal/CoreML | Medium |
| WhisperX | Meetings, podcasts, interviews needing speaker labels | Similar to faster-whisper | Medium |
Option 1: Faster-Whisper with Docker (NVIDIA GPU)
This is the simplest setup if you have an NVIDIA GPU. The LinuxServer Docker image bundles everything including CUDA libraries.
Requirements:
- NVIDIA GPU with 4GB+ VRAM (RTX 3060 or better recommended)
- Docker installed
- NVIDIA Container Toolkit
Step 1: Install NVIDIA Container Toolkit
# Ubuntu/Debian (the old nvidia-docker repository and apt-key are deprecated;
# use the current nvidia-container-toolkit repository)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Step 2: Run Faster-Whisper
docker run -d \
--name=faster-whisper \
-e PUID=1000 \
-e PGID=1000 \
-e TZ=UTC \
-e WHISPER_MODEL=large-v3-turbo \
-p 10300:10300 \
-v /path/to/config:/config \
--gpus all \
lscr.io/linuxserver/faster-whisper:gpu
Replace /path/to/config with where you want to store model files.
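If you prefer Compose, the run command above translates to a docker-compose.yml like this (a sketch using Compose's standard GPU device reservation; adjust the volume path and IDs for your system):

```yaml
services:
  faster-whisper:
    image: lscr.io/linuxserver/faster-whisper:gpu
    container_name: faster-whisper
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=UTC
      - WHISPER_MODEL=large-v3-turbo
    ports:
      - "10300:10300"
    volumes:
      - /path/to/config:/config
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
```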
Step 3: Connect a Client
The container speaks the Wyoming protocol on port 10300 - a streaming TCP protocol designed for Home Assistant voice pipelines - not plain HTTP, so you can't POST files to it with curl. Point Home Assistant's Wyoming integration at your-host:10300, or put Whishper in front of it: Whishper wraps faster-whisper in a clean web UI with file uploads and a REST API.
Option 2: Whisper.cpp (Mac/CPU)
Whisper.cpp is a C/C++ port that runs efficiently on CPUs and, on Apple Silicon, can use the GPU via Metal - or the Neural Engine with a CoreML build (covered below). On an M1 Mac, a 10-minute audio file transcribes in about 2-3 minutes using the medium model.
Step 1: Install via Homebrew
brew install whisper-cpp
Step 2: Download a Model
# Download the large-v3-turbo model (1.5GB)
whisper-cpp-download-ggml-model large-v3-turbo
If your install doesn't include the download helper, fetch the model directly from the whisper.cpp Hugging Face repository and pass its path to -m later:
curl -L -o ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
Available models and their tradeoffs:
| Model | Size | VRAM/RAM | Speed | Accuracy |
|---|---|---|---|---|
| tiny | 75MB | ~1GB | Fastest | Basic |
| base | 142MB | ~1GB | Very fast | Good |
| small | 466MB | ~2GB | Fast | Better |
| medium | 1.5GB | ~5GB | Moderate | Very good |
| large-v3-turbo | 1.5GB | ~5GB | Moderate | Best (practical) |
| large-v3 | 3GB | ~10GB | Slow | Best |
The large-v3-turbo model offers accuracy within 1-2% of the full large-v3 model at 6x faster inference.
Step 3: Transcribe
whisper-cpp -m ~/.cache/whisper-cpp/ggml-large-v3-turbo.bin -f recording.wav
For MP3 or other formats, convert first with FFmpeg:
ffmpeg -i recording.mp3 -ar 16000 -ac 1 -c:a pcm_s16le recording.wav
whisper-cpp -m ~/.cache/whisper-cpp/ggml-large-v3-turbo.bin -f recording.wav
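If you have a folder of recordings, the convert-then-transcribe steps above are easy to batch with a short Python script. This is a sketch that assumes ffmpeg and whisper-cpp are on your PATH and the model sits at the default cache location:

```python
import pathlib
import subprocess

MODEL = pathlib.Path.home() / ".cache/whisper-cpp/ggml-large-v3-turbo.bin"

def commands_for(audio: pathlib.Path):
    """Build the ffmpeg conversion and whisper-cpp transcription commands."""
    wav = audio.with_suffix(".wav")
    convert = ["ffmpeg", "-y", "-i", str(audio),
               "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", str(wav)]
    transcribe = ["whisper-cpp", "-m", str(MODEL), "-f", str(wav)]
    return convert, transcribe

def transcribe_folder(folder: str) -> None:
    """Convert and transcribe every MP3 in a folder, one at a time."""
    for audio in sorted(pathlib.Path(folder).glob("*.mp3")):
        convert, transcribe = commands_for(audio)
        subprocess.run(convert, check=True)
        subprocess.run(transcribe, check=True)
```

commands_for is split out so you can inspect or log the exact commands before anything runs.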
Enable CoreML Acceleration (Mac)
For maximum speed on Apple Silicon, build whisper.cpp with CoreML support so the encoder runs on the Neural Engine (the project has moved from Makefiles to CMake):
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release
You also need a CoreML version of the encoder for your chosen model. Generate it with the repo's helper script (requires a Python environment with coremltools; see the README for the exact dependencies):
./models/generate-coreml-model.sh large-v3-turbo
With the generated .mlmodelc sitting next to the ggml model file, whisper.cpp picks up the CoreML encoder automatically. This can achieve 8-12x faster performance compared to CPU-only mode.
Option 3: WhisperX (Speaker Diarization)
WhisperX adds speaker identification - it tells you who said what. This is essential for meeting transcriptions, interviews, and podcasts.
Requirements:
- NVIDIA GPU with CUDA 12.x
- Hugging Face account (free, needed for speaker diarization model)
Step 1: Install WhisperX
pip install whisperx
Or with conda:
conda create -n whisperx python=3.10
conda activate whisperx
pip install whisperx
Step 2: Get Hugging Face Token
- Create an account at huggingface.co
- Go to Settings > Access Tokens
- Create a new token with read access
- Accept the user agreement for pyannote/speaker-diarization-3.1
Step 3: Transcribe with Speaker Labels
whisperx recording.mp3 \
--model large-v3-turbo \
--hf_token YOUR_HF_TOKEN \
--diarize \
--min_speakers 2 \
--max_speakers 4 \
--language en
The output includes speaker labels:
[SPEAKER_00] 0:00 - 0:15: "Welcome to the meeting. Let's start with the quarterly review."
[SPEAKER_01] 0:16 - 0:28: "Thanks. I'll share the sales numbers first..."
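Once you have speaker-labeled output, small scripts get you useful analytics. Here's a sketch that totals talk time per speaker; it assumes the exact [SPEAKER_XX] mm:ss - mm:ss: "..." line format shown above, so adapt the regex to whichever format (SRT, JSON) you actually export:

```python
import re
from collections import defaultdict

# Matches lines like: [SPEAKER_00] 0:00 - 0:15: "Welcome to the meeting."
LINE = re.compile(r'\[(\w+)\]\s+(\d+):(\d+)\s*-\s*(\d+):(\d+):\s*"(.*)"')

def talk_time(transcript: str) -> dict:
    """Return seconds spoken per speaker label."""
    totals = defaultdict(int)
    for line in transcript.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        speaker, m1, s1, m2, s2, _text = m.groups()
        totals[speaker] += (int(m2) * 60 + int(s2)) - (int(m1) * 60 + int(s1))
    return dict(totals)
```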
Docker Deployment
For a containerized WhisperX API:
git clone https://github.com/murtaza-nasir/whisperx-asr-service
cd whisperx-asr-service
docker build -t whisperx-asr-service .
docker run -p 9000:9000 --gpus all whisperx-asr-service
Access the API documentation at http://localhost:9000/docs.
Hardware Requirements Summary
Minimum (CPU only):
- Any modern CPU
- 4GB RAM for small models, 8GB for medium/large
- Works, but expect transcription to take roughly 5-10x the audio length (1 hour of audio = 5-10 hours)
Recommended (NVIDIA GPU):
- RTX 3060 or better
- 8GB+ VRAM for large models
- Expect 1-2 minutes for 10 minutes of audio
Apple Silicon:
- Any M1/M2/M3/M4 Mac
- 8GB unified memory minimum
- With CoreML: 2-3 minutes for 10 minutes of audio
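Those ballpark factors can be turned into a quick estimator for your own files (the numbers below are this guide's rough figures, not benchmarks):

```python
# Rough realtime factors from the summary above: minutes of processing
# per minute of audio. Ballpark figures, not benchmarks.
FACTORS = {
    "cpu": (5.0, 10.0),          # 5-10x the audio length
    "nvidia_gpu": (0.1, 0.2),    # 1-2 min per 10 min of audio
    "apple_coreml": (0.2, 0.3),  # 2-3 min per 10 min of audio
}

def estimate_minutes(audio_minutes: float, hardware: str):
    """Estimated (best-case, worst-case) transcription time in minutes."""
    lo, hi = FACTORS[hardware]
    return audio_minutes * lo, audio_minutes * hi
```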
Cost Comparison
| Service | Monthly Cost | Annual Cost |
|---|---|---|
| Otter.ai Pro | $17 | $100 (annual billing) |
| Otter.ai Business | $30 | $240 (annual billing) |
| Self-hosted Whisper | $0 | $0 |
The only cost for self-hosting is the one-time hardware investment - and you likely already have suitable hardware.
Limitations
Self-hosted Whisper has tradeoffs compared to cloud services like Otter.ai:
- No mobile app: You’ll need a computer to run transcriptions
- No real-time transcription: Whisper processes recordings, not live audio
- Manual file handling: No automatic meeting integrations
- No cloud sync: Files stay on your machine (this is also a privacy benefit)
For live meeting transcription with calendar integration, cloud services still have an edge. But for transcribing recordings - lectures, interviews, voice memos, podcasts - self-hosted Whisper matches or beats paid services at zero ongoing cost.
Tips for Best Results
Audio quality matters more than model size. A clean recording with the small model often beats a noisy recording with large-v3.
Use the right model for your hardware:
- CPU only: stick with small or base
- 4GB VRAM: medium model
- 8GB+ VRAM: large-v3-turbo
- 12GB+ VRAM: large-v3
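If you're scripting your setup, those rules of thumb encode as a tiny helper (the thresholds are this guide's recommendations, not hard limits):

```python
from typing import Optional

def pick_model(vram_gb: Optional[float] = None) -> str:
    """Pick a Whisper model from available VRAM (None means CPU only).

    Thresholds follow this guide's rules of thumb, not hard limits.
    """
    if vram_gb is None:
        return "small"            # CPU only: stick with small or base
    if vram_gb >= 12:
        return "large-v3"
    if vram_gb >= 8:
        return "large-v3-turbo"
    if vram_gb >= 4:
        return "medium"
    return "small"
```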
For non-English languages, always use multilingual models (not .en variants) and specify the language with --language for better accuracy.
Speaker diarization effectively requires a GPU. WhisperX's pyannote-audio backend technically runs on CPU, but it is painfully slow without CUDA. CPU-only setups should use faster-whisper or whisper.cpp and note speaker changes manually.
What’s Next
Once you have transcription working, consider:
- Setting up a web interface with Whishper
- Automating transcription of new files with a folder watcher
- Piping transcripts to local LLMs for summarization
- Building a searchable archive of your recordings
Your recordings, your hardware, your data. No subscription required.