How to Self-Host Whisper: Replace Otter.ai with Free Local Transcription

Step-by-step guide to running Whisper locally for speech-to-text. No monthly fees, no data leaving your machine, better accuracy than most paid services.


Otter.ai charges $17/month for transcription, and their free tier caps you at 300 minutes monthly. Every recording you upload goes through their servers. There’s a better option: run OpenAI’s Whisper locally, get unlimited transcription, and keep your audio files on your own hardware.

Whisper is OpenAI’s open-source speech recognition model, trained on 680,000 hours of multilingual audio. It achieves word error rates around 3-4% on clean English speech - competitive with commercial services. The model supports 99 languages and runs entirely offline once installed.
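Word error rate is the number of word-level edits (substitutions, insertions, deletions) needed to turn the transcript into the reference, divided by the reference word count. A minimal sketch of that computation (illustrative only, not Whisper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 edit / 6 words
```

A 3-4% WER means roughly one wrong word per 25-30 words of clean speech.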

This guide covers three approaches: the quickest setup with Docker, the fastest option for Apple Silicon Macs, and a full-featured solution with speaker identification for meetings.

What You’ll Get

After following this guide:

  • Unlimited transcription with no monthly fees
  • Audio never leaves your machine
  • Support for 99 languages
  • Accuracy matching or beating paid services
  • Optional speaker identification for meetings and interviews

Choose Your Path

| Approach | Best For | Speed | Setup Difficulty |
|---|---|---|---|
| Faster-Whisper (Docker) | NVIDIA GPU users, server deployments | 4x faster than base Whisper | Easy |
| Whisper.cpp | Mac users, CPU-only systems, edge devices | 3-12x faster with Metal/CoreML | Medium |
| WhisperX | Meetings, podcasts, interviews needing speaker labels | Similar to faster-whisper | Medium |
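The decision logic in the table condenses to a few rules. A hypothetical helper (the function name and rules are mine, distilled from the table above):

```python
def choose_approach(nvidia_gpu: bool, apple_silicon: bool, need_speakers: bool) -> str:
    """Pick a Whisper deployment per the comparison table above."""
    if need_speakers:
        # Speaker diarization requires WhisperX (and a CUDA-capable GPU)
        return "WhisperX"
    if nvidia_gpu:
        return "Faster-Whisper (Docker)"
    # Apple Silicon and plain CPU both do best with whisper.cpp
    return "Whisper.cpp"

print(choose_approach(nvidia_gpu=False, apple_silicon=True, need_speakers=False))
# Whisper.cpp
```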

Option 1: Faster-Whisper with Docker (NVIDIA GPU)

This is the simplest setup if you have an NVIDIA GPU. The LinuxServer Docker image bundles everything including CUDA libraries.

Requirements:

  • NVIDIA GPU with 4GB+ VRAM (RTX 3060 or better recommended)
  • Docker installed
  • NVIDIA Container Toolkit

Step 1: Install NVIDIA Container Toolkit

# Ubuntu/Debian
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Step 2: Run Faster-Whisper

docker run -d \
  --name=faster-whisper \
  --runtime=nvidia \
  -e PUID=1000 \
  -e PGID=1000 \
  -e TZ=UTC \
  -e WHISPER_MODEL=large-v3-turbo \
  -p 10300:10300 \
  -v /path/to/config:/config \
  --gpus all \
  lscr.io/linuxserver/faster-whisper:gpu

Replace /path/to/config with where you want to store model files.

Step 3: Transcribe

The LinuxServer image exposes the Wyoming protocol on port 10300 rather than a plain HTTP API - it is built to plug into voice assistants such as Home Assistant. In Home Assistant, add the Wyoming integration and point it at your-server:10300.

For transcribing files by hand or through a browser, add Whishper, which wraps faster-whisper with an HTTP API and a clean web UI.

Option 2: Whisper.cpp (Mac/CPU)

Whisper.cpp is a C/C++ port that runs efficiently on CPUs and, on Apple Silicon, can use the GPU via Metal and the Neural Engine via CoreML. On an M1 Mac, a 10-minute audio file transcribes in about 2-3 minutes using the medium model.

Step 1: Install via Homebrew

brew install whisper-cpp

Step 2: Download a Model

# Download the large-v3-turbo model (1.5GB)
whisper-cpp-download-ggml-model large-v3-turbo

Available models and their tradeoffs:

| Model | Size | VRAM/RAM | Speed | Accuracy |
|---|---|---|---|---|
| tiny | 75MB | ~1GB | Fastest | Basic |
| base | 142MB | ~1GB | Very fast | Good |
| small | 466MB | ~2GB | Fast | Better |
| medium | 1.5GB | ~5GB | Moderate | Very good |
| large-v3-turbo | 1.5GB | ~5GB | Moderate | Best (practical) |
| large-v3 | 3GB | ~10GB | Slow | Best |

The large-v3-turbo model offers accuracy within 1-2% of the full large-v3 model at 6x faster inference.
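As a rule of thumb, pick the largest model that fits your memory. A hypothetical helper encoding the table's thresholds (the function and cutoffs are mine, taken from the table above, not part of any Whisper tooling):

```python
def pick_model(mem_gb: float) -> str:
    """Return the largest Whisper model that fits in roughly mem_gb of VRAM/RAM."""
    # (required memory in GB, model name), per the model table above;
    # medium is skipped because large-v3-turbo has the same footprint
    tiers = [(10, "large-v3"), (5, "large-v3-turbo"), (2, "small"), (1, "base")]
    for need, name in tiers:
        if mem_gb >= need:
            return name
    return "tiny"

print(pick_model(8))   # large-v3-turbo
print(pick_model(12))  # large-v3
```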

Step 3: Transcribe

whisper-cpp -m ~/.cache/whisper-cpp/ggml-large-v3-turbo.bin -f recording.wav

For MP3 or other formats, convert first with FFmpeg:

ffmpeg -i recording.mp3 -ar 16000 -ac 1 -c:a pcm_s16le recording.wav
whisper-cpp -m ~/.cache/whisper-cpp/ggml-large-v3-turbo.bin -f recording.wav
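If you convert many files, the FFmpeg invocation above is easy to generate programmatically. A small sketch (paths are illustrative) that builds the same argument list for use with subprocess:

```python
import subprocess
from pathlib import Path

def to_whisper_wav_cmd(src: str) -> list[str]:
    """Build the FFmpeg command converting src to 16 kHz mono 16-bit PCM WAV."""
    out = str(Path(src).with_suffix(".wav"))
    return ["ffmpeg", "-i", src,
            "-ar", "16000",       # 16 kHz sample rate expected by whisper.cpp
            "-ac", "1",           # mono
            "-c:a", "pcm_s16le",  # 16-bit PCM
            out]

cmd = to_whisper_wav_cmd("recording.mp3")
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment once FFmpeg is installed
```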

Enable CoreML Acceleration (Mac)

For maximum speed on Apple Silicon, build whisper.cpp with CoreML support:

git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
make clean
WHISPER_COREML=1 make -j

This can achieve 8-12x faster performance compared to CPU-only mode. Note that CoreML also needs a converted model: run the models/generate-coreml-model.sh script from the repo (which requires Python with coremltools), and whisper.cpp will pick up the resulting .mlmodelc automatically when it sits next to the ggml model file.

Option 3: WhisperX (Speaker Diarization)

WhisperX adds speaker identification - it tells you who said what. This is essential for meeting transcriptions, interviews, and podcasts.

Requirements:

  • NVIDIA GPU with CUDA 12.x
  • Hugging Face account (free, needed for speaker diarization model)

Step 1: Install WhisperX

pip install whisperx

Or with conda:

conda create -n whisperx python=3.10
conda activate whisperx
pip install whisperx

Step 2: Get Hugging Face Token

  1. Create an account at huggingface.co
  2. Go to Settings > Access Tokens
  3. Create a new token with read access
  4. Accept the user agreement for pyannote/speaker-diarization-3.1

Step 3: Transcribe with Speaker Labels

whisperx recording.mp3 \
  --model large-v3-turbo \
  --hf_token YOUR_HF_TOKEN \
  --diarize \
  --min_speakers 2 \
  --max_speakers 4 \
  --language en

The output includes speaker labels:

[SPEAKER_00] 0:00 - 0:15: "Welcome to the meeting. Let's start with the quarterly review."
[SPEAKER_01] 0:16 - 0:28: "Thanks. I'll share the sales numbers first..."
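Output in this shape is easy to post-process. A sketch (assuming the exact line format shown above) that totals talk time per speaker:

```python
import re

# Matches lines like: [SPEAKER_00] 0:00 - 0:15: "..."
LINE = re.compile(r"\[(\w+)\] (\d+):(\d+) - (\d+):(\d+)")

def talk_time(transcript: str) -> dict[str, int]:
    """Seconds of speech per speaker, parsed from diarized lines like the sample above."""
    totals: dict[str, int] = {}
    for m in LINE.finditer(transcript):
        speaker = m.group(1)
        start = int(m.group(2)) * 60 + int(m.group(3))
        end = int(m.group(4)) * 60 + int(m.group(5))
        totals[speaker] = totals.get(speaker, 0) + (end - start)
    return totals

sample = '''[SPEAKER_00] 0:00 - 0:15: "Welcome to the meeting."
[SPEAKER_01] 0:16 - 0:28: "Thanks. I'll share the sales numbers."'''
print(talk_time(sample))  # {'SPEAKER_00': 15, 'SPEAKER_01': 12}
```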

Docker Deployment

For a containerized WhisperX API:

git clone https://github.com/murtaza-nasir/whisperx-asr-service
cd whisperx-asr-service
docker build -t whisperx-asr-service .
docker run -p 9000:9000 --gpus all whisperx-asr-service

Access the API documentation at http://localhost:9000/docs.

Hardware Requirements Summary

Minimum (CPU only):

  • Any modern CPU
  • 4GB RAM for small models, 8GB for medium/large
  • Works, but slowly: expect transcription to take roughly 5-10x the audio length with larger models (1 hour of audio = 5-10 hours to transcribe)

Recommended (NVIDIA GPU):

  • RTX 3060 or better
  • 8GB+ VRAM for large models
  • Expect 1-2 minutes for 10 minutes of audio

Apple Silicon:

  • Any M1/M2/M3/M4 Mac
  • 8GB unified memory minimum
  • With CoreML: 2-3 minutes for 10 minutes of audio
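These figures are realtime factors in disguise: seconds of compute per second of audio. A quick sketch converting a factor into an estimate (the factors below are rough assumptions derived from the summary above, not benchmarks):

```python
def transcription_minutes(audio_minutes: float, realtime_factor: float) -> float:
    """Estimated processing time, given compute seconds needed per second of audio."""
    return audio_minutes * realtime_factor

# Rough factors: CPU-only large model ~7x, NVIDIA GPU ~0.15x, Apple Silicon + CoreML ~0.25x
print(transcription_minutes(60, 7))     # ~420 min for an hour of audio on CPU
print(transcription_minutes(10, 0.15))  # ~1.5 min for 10 minutes on an RTX GPU
```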

Cost Comparison

| Service | Monthly Cost | Annual Cost |
|---|---|---|
| Otter.ai Pro | $17 | $100 (annual billing) |
| Otter.ai Business | $30 | $240 (annual billing) |
| Self-hosted Whisper | $0 | $0 |

The only cost for self-hosting is the one-time hardware investment - and you likely already have suitable hardware.
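If you do need to buy hardware, breakeven against a subscription is quick to compute. A sketch (the $300 used-GPU price is a placeholder, not a quote):

```python
import math

def breakeven_months(hardware_cost: float, monthly_fee: float) -> int:
    """Months of subscription fees needed to cover a one-time hardware cost."""
    return math.ceil(hardware_cost / monthly_fee)

print(breakeven_months(300, 17))  # 18 months vs Otter.ai Pro
```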

Limitations

Self-hosted Whisper has tradeoffs compared to cloud services like Otter.ai:

  • No mobile app: You’ll need a computer to run transcriptions
  • No real-time transcription: Whisper processes recordings, not live audio
  • Manual file handling: No automatic meeting integrations
  • No cloud sync: Files stay on your machine (this is also a privacy benefit)

For live meeting transcription with calendar integration, cloud services still have an edge. But for transcribing recordings - lectures, interviews, voice memos, podcasts - self-hosted Whisper matches or beats paid services at zero ongoing cost.

Tips for Best Results

Audio quality matters more than model size. A clean recording with the small model often beats a noisy recording with large-v3.

Use the right model for your hardware:

  • CPU only: stick with small or base
  • 4GB VRAM: medium model
  • 8GB+ VRAM: large-v3-turbo
  • 12GB+ VRAM: large-v3

For non-English languages, always use multilingual models (not .en variants) and specify the language with --language for better accuracy.

Speaker diarization requires GPU. WhisperX’s pyannote-audio backend needs CUDA. CPU-only setups should use faster-whisper or whisper.cpp and manually note speaker changes.

What’s Next

Once you have transcription working, consider:

  • Setting up a web interface with Whishper
  • Automating transcription of new files with a folder watcher
  • Piping transcripts to local LLMs for summarization
  • Building a searchable archive of your recordings

Your recordings, your hardware, your data. No subscription required.