Otter.ai charges $17/month for transcription, and their free tier caps you at 300 minutes monthly. Every recording you upload goes through their servers. There’s a better option: run OpenAI’s Whisper locally, get unlimited transcription, and keep your audio files on your own hardware.
Whisper is OpenAI’s open-source speech recognition model, trained on 680,000 hours of multilingual audio. It achieves word error rates around 3-4% on clean English speech - competitive with commercial services. The model supports 99 languages and runs entirely offline once installed.
This guide covers three approaches: the quickest setup with Docker, the fastest option for Apple Silicon Macs, and a full-featured solution with speaker identification for meetings.
What You’ll Get
After following this guide:
- Unlimited transcription with no monthly fees
- Audio never leaves your machine
- Support for 99 languages
- Accuracy matching or beating paid services
- Optional speaker identification for meetings and interviews
Choose Your Path
| Approach | Best For | Speed | Setup Difficulty |
|---|---|---|---|
| Faster-Whisper (Docker) | NVIDIA GPU users, server deployments | 4x faster than base Whisper | Easy |
| Whisper.cpp | Mac users, CPU-only systems, edge devices | 3-12x faster with Metal/CoreML | Medium |
| WhisperX | Meetings, podcasts, interviews needing speaker labels | Similar to faster-whisper | Medium |
Option 1: Faster-Whisper with Docker (NVIDIA GPU)
This is the simplest setup if you have an NVIDIA GPU. The LinuxServer Docker image bundles everything including CUDA libraries.
Requirements:
- NVIDIA GPU with 4GB+ VRAM (RTX 3060 or better recommended)
- Docker installed
- NVIDIA Container Toolkit
Step 1: Install NVIDIA Container Toolkit
# Ubuntu/Debian (the old nvidia-docker repository and apt-key are deprecated;
# use the current nvidia-container-toolkit repository)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Step 2: Run Faster-Whisper
docker run -d \
--name=faster-whisper \
-e PUID=1000 \
-e PGID=1000 \
-e TZ=UTC \
-e WHISPER_MODEL=large-v3-turbo \
-p 10300:10300 \
-v /path/to/config:/config \
--gpus all \
lscr.io/linuxserver/faster-whisper:gpu
Replace /path/to/config with where you want to store model files.
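If you prefer Compose, the run command above translates to a docker-compose.yml like this (a sketch using Compose's standard GPU device reservation; adjust the volume path and IDs for your system):

```yaml
services:
  faster-whisper:
    image: lscr.io/linuxserver/faster-whisper:gpu
    container_name: faster-whisper
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=UTC
      - WHISPER_MODEL=large-v3-turbo
    ports:
      - "10300:10300"
    volumes:
      - /path/to/config:/config
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
```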
Step 3: Connect a Client
The container speaks the Wyoming protocol on port 10300 - a streaming TCP protocol designed for Home Assistant voice pipelines - not plain HTTP, so you can't POST files to it with curl. Point Home Assistant's Wyoming integration at your-host:10300, or put Whishper in front of it: Whishper wraps faster-whisper in a clean web UI with file uploads and a REST API.
Option 2: Whisper.cpp (Mac/CPU)
Whisper.cpp is a C/C++ port that runs efficiently on CPUs and, on Apple Silicon, can use the GPU via Metal - or the Neural Engine with a CoreML build (covered below). On an M1 Mac, a 10-minute audio file transcribes in about 2-3 minutes using the medium model.
Step 1: Install via Homebrew
brew install whisper-cpp
Step 2: Download a Model
# Download the large-v3-turbo model (1.5GB)
whisper-cpp-download-ggml-model large-v3-turbo
If your install doesn't include the download helper, fetch the model directly from the whisper.cpp Hugging Face repository and pass its path to -m later:
curl -L -o ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
Available models and their tradeoffs:
| Model | Size | VRAM/RAM | Speed | Accuracy |
|---|---|---|---|---|
| tiny | 75MB | ~1GB | Fastest | Basic |
| base | 142MB | ~1GB | Very fast | Good |
| small | 466MB | ~2GB | Fast | Better |
| medium | 1.5GB | ~5GB | Moderate | Very good |
| large-v3-turbo | 1.5GB | ~5GB | Moderate | Best (practical) |
| large-v3 | 3GB | ~10GB | Slow | Best |
The large-v3-turbo model offers accuracy within 1-2% of the full large-v3 model at 6x faster inference.
Step 3: Transcribe
whisper-cpp -m ~/.cache/whisper-cpp/ggml-large-v3-turbo.bin -f recording.wav
For MP3 or other formats, convert first with FFmpeg:
ffmpeg -i recording.mp3 -ar 16000 -ac 1 -c:a pcm_s16le recording.wav
whisper-cpp -m ~/.cache/whisper-cpp/ggml-large-v3-turbo.bin -f recording.wav
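If you have a folder of recordings, the convert-then-transcribe steps above are easy to batch with a short Python script. This is a sketch that assumes ffmpeg and whisper-cpp are on your PATH and the model sits at the default cache location:

```python
import pathlib
import subprocess

MODEL = pathlib.Path.home() / ".cache/whisper-cpp/ggml-large-v3-turbo.bin"

def commands_for(audio: pathlib.Path):
    """Build the ffmpeg conversion and whisper-cpp transcription commands."""
    wav = audio.with_suffix(".wav")
    convert = ["ffmpeg", "-y", "-i", str(audio),
               "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", str(wav)]
    transcribe = ["whisper-cpp", "-m", str(MODEL), "-f", str(wav)]
    return convert, transcribe

def transcribe_folder(folder: str) -> None:
    """Convert and transcribe every MP3 in a folder, one at a time."""
    for audio in sorted(pathlib.Path(folder).glob("*.mp3")):
        convert, transcribe = commands_for(audio)
        subprocess.run(convert, check=True)
        subprocess.run(transcribe, check=True)
```

commands_for is split out so you can inspect or log the exact commands before anything runs.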
Enable CoreML Acceleration (Mac)
For maximum speed on Apple Silicon, build whisper.cpp with CoreML support so the encoder runs on the Neural Engine (the project has moved from Makefiles to CMake):
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release
You also need a CoreML version of the encoder for your chosen model. Generate it with the repo's helper script (requires a Python environment with coremltools; see the README for the exact dependencies):
./models/generate-coreml-model.sh large-v3-turbo
With the generated .mlmodelc sitting next to the ggml model file, whisper.cpp picks up the CoreML encoder automatically. This can achieve 8-12x faster performance compared to CPU-only mode.
Option 3: WhisperX (Speaker Diarization)
WhisperX adds speaker identification - it tells you who said what. This is essential for meeting transcriptions, interviews, and podcasts.
Requirements:
- NVIDIA GPU with CUDA 12.x
- Hugging Face account (free, needed for speaker diarization model)
Step 1: Install WhisperX
pip install whisperx
Or with conda:
conda create -n whisperx python=3.10
conda activate whisperx
pip install whisperx
Step 2: Get Hugging Face Token
- Create an account at huggingface.co
- Go to Settings > Access Tokens
- Create a new token with read access
- Accept the user agreement for pyannote/speaker-diarization-3.1
Step 3: Transcribe with Speaker Labels
whisperx recording.mp3 \
--model large-v3-turbo \
--hf_token YOUR_HF_TOKEN \
--diarize \
--min_speakers 2 \
--max_speakers 4 \
--language en
The output includes speaker labels:
[SPEAKER_00] 0:00 - 0:15: "Welcome to the meeting. Let's start with the quarterly review."
[SPEAKER_01] 0:16 - 0:28: "Thanks. I'll share the sales numbers first..."
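Once you have speaker-labeled output, small scripts get you useful analytics. Here's a sketch that totals talk time per speaker; it assumes the exact [SPEAKER_XX] mm:ss - mm:ss: "..." line format shown above, so adapt the regex to whichever format (SRT, JSON) you actually export:

```python
import re
from collections import defaultdict

# Matches lines like: [SPEAKER_00] 0:00 - 0:15: "Welcome to the meeting."
LINE = re.compile(r'\[(\w+)\]\s+(\d+):(\d+)\s*-\s*(\d+):(\d+):\s*"(.*)"')

def talk_time(transcript: str) -> dict:
    """Return seconds spoken per speaker label."""
    totals = defaultdict(int)
    for line in transcript.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        speaker, m1, s1, m2, s2, _text = m.groups()
        totals[speaker] += (int(m2) * 60 + int(s2)) - (int(m1) * 60 + int(s1))
    return dict(totals)
```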
Docker Deployment
For a containerized WhisperX API:
git clone https://github.com/murtaza-nasir/whisperx-asr-service
cd whisperx-asr-service
docker build -t whisperx-asr-service .
docker run -p 9000:9000 --gpus all whisperx-asr-service
Access the API documentation at http://localhost:9000/docs.
Hardware Requirements Summary
Minimum (CPU only):
- Any modern CPU
- 4GB RAM for small models, 8GB for medium/large
- Works, but expect transcription to take roughly 5-10x the audio length (1 hour of audio = 5-10 hours)
Recommended (NVIDIA GPU):
- RTX 3060 or better
- 8GB+ VRAM for large models
- Expect 1-2 minutes for 10 minutes of audio
Apple Silicon:
- Any M1/M2/M3/M4 Mac
- 8GB unified memory minimum
- With CoreML: 2-3 minutes for 10 minutes of audio
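Those ballpark factors can be turned into a quick estimator for your own files (the numbers below are this guide's rough figures, not benchmarks):

```python
# Rough realtime factors from the summary above: minutes of processing
# per minute of audio. Ballpark figures, not benchmarks.
FACTORS = {
    "cpu": (5.0, 10.0),          # 5-10x the audio length
    "nvidia_gpu": (0.1, 0.2),    # 1-2 min per 10 min of audio
    "apple_coreml": (0.2, 0.3),  # 2-3 min per 10 min of audio
}

def estimate_minutes(audio_minutes: float, hardware: str):
    """Estimated (best-case, worst-case) transcription time in minutes."""
    lo, hi = FACTORS[hardware]
    return audio_minutes * lo, audio_minutes * hi
```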
Cost Comparison
| Service | Monthly Cost | Annual Cost |
|---|---|---|
| Otter.ai Pro | $17 | $100 (annual billing) |
| Otter.ai Business | $30 | $240 (annual billing) |
| Self-hosted Whisper | $0 | $0 |
The only cost for self-hosting is the one-time hardware investment - and you likely already have suitable hardware.
Limitations
Self-hosted Whisper has tradeoffs compared to cloud services like Otter.ai:
- No mobile app: You’ll need a computer to run transcriptions
- No real-time transcription: Whisper processes recordings, not live audio
- Manual file handling: No automatic meeting integrations
- No cloud sync: Files stay on your machine (this is also a privacy benefit)
For live meeting transcription with calendar integration, cloud services still have an edge. But for transcribing recordings - lectures, interviews, voice memos, podcasts - self-hosted Whisper matches or beats paid services at zero ongoing cost.
Tips for Best Results
Audio quality matters more than model size. A clean recording with the small model often beats a noisy recording with large-v3.
Use the right model for your hardware:
- CPU only: stick with small or base
- 4GB VRAM: medium model
- 8GB+ VRAM: large-v3-turbo
- 12GB+ VRAM: large-v3
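If you're scripting your setup, those rules of thumb encode as a tiny helper (the thresholds are this guide's recommendations, not hard limits):

```python
from typing import Optional

def pick_model(vram_gb: Optional[float] = None) -> str:
    """Pick a Whisper model from available VRAM (None means CPU only).

    Thresholds follow this guide's rules of thumb, not hard limits.
    """
    if vram_gb is None:
        return "small"            # CPU only: stick with small or base
    if vram_gb >= 12:
        return "large-v3"
    if vram_gb >= 8:
        return "large-v3-turbo"
    if vram_gb >= 4:
        return "medium"
    return "small"
```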
For non-English languages, always use multilingual models (not .en variants) and specify the language with --language for better accuracy.
Speaker diarization effectively requires a GPU. WhisperX's pyannote-audio backend technically runs on CPU, but it is painfully slow without CUDA. CPU-only setups should use faster-whisper or whisper.cpp and note speaker changes manually.
What’s Next
Once you have transcription working, consider:
- Setting up a web interface with Whishper
- Automating transcription of new files with a folder watcher
- Piping transcripts to local LLMs for summarization
- Building a searchable archive of your recordings
Your recordings, your hardware, your data. No subscription required.