Every word you speak into Otter.ai gets uploaded to their servers. Every meeting, every interview, every private conversation — processed, stored, and used to “improve their services.” Rev.com charges $0.25 per minute for AI transcription or $1.99 per minute for human-assisted, and your audio still goes through their cloud. Otter’s Pro plan runs $16.99/month, and even then, you’re limited to 1,200 minutes.
There’s a better way. OpenAI’s Whisper is a free, open-source speech recognition model that supports 99 languages and matches commercial transcription accuracy. Run it on your own machine and your recordings never leave your hard drive. No subscriptions, no per-minute fees, no privacy trade-offs.
Here are three ways to set it up, from quickest to most full-featured.
Pick Your Model First
Before choosing an approach, you need to pick a Whisper model size. This matters for both accuracy and speed.
| Model | Parameters | VRAM | Disk | Speed vs Large | Best For |
|---|---|---|---|---|---|
| tiny | 39M | ~1 GB | 75 MB | ~10x faster | Quick drafts, real-time |
| base | 74M | ~1 GB | 142 MB | ~7x faster | Decent quality, fast |
| small | 244M | ~2 GB | 466 MB | ~4x faster | Good balance |
| medium | 769M | ~5 GB | 1.5 GB | ~2x faster | High accuracy |
| large-v3 | 1.55B | ~10 GB | 2.9 GB | 1x (baseline) | Maximum accuracy |
| turbo | 809M | ~6 GB | 1.5 GB | ~8x faster | Best all-rounder |
The turbo model is the standout pick for most people. It’s a pruned, fine-tuned version of large-v3 that cuts the decoder from 32 layers to just 4, hitting accuracy comparable to large-v2 at roughly 8x the speed. Unless you need maximum accuracy for professional transcripts or you’re translating non-English audio into English (turbo doesn’t support the translation task), turbo is the one to use.
Running on CPU only? Start with small or base — you can always move up once you know your hardware handles it.
Option 1: faster-whisper (Quickest Setup)
faster-whisper is a reimplementation of Whisper using the CTranslate2 inference engine. It runs up to 4x faster than OpenAI’s original implementation while using less memory, and unlike vanilla Whisper, it doesn’t require FFmpeg to be installed separately.
Requirements: Python 3.8+, pip. GPU optional (NVIDIA with CUDA 12 + cuDNN 9).
Install
pip install faster-whisper
For GPU users who need NVIDIA libraries:
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12
Transcribe
Create a file called transcribe.py:
from faster_whisper import WhisperModel

# GPU: device="cuda", compute_type="float16"
# CPU: device="cpu", compute_type="int8"
model = WhisperModel("turbo", device="cpu", compute_type="int8")

segments, info = model.transcribe("your-audio.mp3", beam_size=5)
print(f"Detected language: {info.language} ({info.language_probability:.0%})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
Run it:
python transcribe.py
That’s it. The first run downloads the model (turbo is about 1.5 GB), then subsequent runs start immediately. You get timestamped segments out of the box.
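Those timestamped segments map directly onto subtitle formats. As a sketch (the `srt_timestamp` and `write_srt` helpers are my own, not part of faster-whisper), here’s how you might dump segments to an .srt file:

```python
def srt_timestamp(seconds: float) -> str:
    """Convert seconds to the HH:MM:SS,mmm format SRT requires."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments, path: str) -> None:
    """Write an iterable of segments (objects with .start, .end, .text) as SRT."""
    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            f.write(f"{i}\n")
            f.write(f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n")
            f.write(f"{seg.text.strip()}\n\n")
```

Pass the `segments` generator from `model.transcribe(...)` straight in — it’s lazy, so transcription happens as the file is written.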
Batch Transcription
For transcribing multiple files at once:
#!/bin/bash
for file in *.mp3 *.wav *.m4a; do
    [ -f "$file" ] || continue
    echo "Transcribing: $file"
    python -c "
from faster_whisper import WhisperModel

model = WhisperModel('turbo', device='cpu', compute_type='int8')
segments, info = model.transcribe('$file', beam_size=5)

with open('${file%.*}.txt', 'w') as f:
    for seg in segments:
        f.write(f'[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}\n')

print('Done: $file')
"
done
When to choose this: You want the fastest path from audio file to text file. No web UI, no Docker — just Python and a pip install.
Option 2: whisper.cpp (No Python, Best for Apple Silicon)
whisper.cpp is a pure C/C++ port of Whisper. Zero Python dependencies, native Metal acceleration on Macs, and it runs on basically anything — even a Raspberry Pi. If you’re on Apple Silicon, this is the fastest path to transcription.
Requirements: CMake, C compiler (Xcode on macOS, gcc/g++ on Linux). No Python needed.
Build on macOS (Metal GPU acceleration is enabled by default)
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build -j --config Release
Build on Linux
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build -j --config Release
For NVIDIA GPU support on Linux, add -DGGML_CUDA=1 to the first cmake command.
Download a Model
sh ./models/download-ggml-model.sh large-v3-turbo
Other options: tiny, base, small, medium, large-v3. Add .en suffix for English-only models (e.g., base.en).
Transcribe
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f your-audio.wav
whisper.cpp expects 16 kHz mono WAV input. Convert other formats with FFmpeg first:
ffmpeg -i your-audio.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
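If you’re not sure whether a file is already in the right shape, Python’s standard wave module can check it before you hand it to whisper-cli. This helper is my own sketch, not part of whisper.cpp:

```python
import wave

def is_whisper_ready(path: str) -> bool:
    """True if the WAV file is 16 kHz, mono, 16-bit PCM -- what whisper.cpp expects."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)

# Quick demonstration: write half a second of silence in the right format.
with wave.open("check.wav", "wb") as w:
    w.setnchannels(1)        # mono
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(16000)    # 16 kHz
    w.writeframes(b"\x00\x00" * 8000)

print(is_whisper_ready("check.wav"))  # True for the file we just wrote
```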
Quick Test
The fastest way to verify everything works is to transcribe the sample clip bundled with the repo:
sh ./models/download-ggml-model.sh base.en
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav
This downloads the base English model and runs inference on the included JFK sample.
When to choose this: You’re on a Mac and want native Metal performance. Or you want zero Python dependencies and maximum portability.
Option 3: Scriberr (Full Web UI With Docker)
If you want a proper application rather than a command-line tool, Scriberr wraps Whisper in a polished web interface with file management, speaker diarization, automatic folder watching, and even AI-powered summarization through Ollama.
Requirements: Docker and Docker Compose.
CPU Setup
Create a docker-compose.yml:
services:
  scriberr:
    image: ghcr.io/rishikanthc/scriberr:latest
    ports:
      - "3000:3000"
    volumes:
      - ./scriberr-data:/app/data
      - ./scriberr-models:/app/models
    environment:
      - PUID=1000
      - PGID=1000
Start it:
docker compose up -d
NVIDIA GPU Setup
For GPU acceleration, use the CUDA image:
services:
  scriberr:
    image: ghcr.io/rishikanthc/scriberr:latest-cuda
    ports:
      - "3000:3000"
    volumes:
      - ./scriberr-data:/app/data
      - ./scriberr-models:/app/models
    environment:
      - PUID=1000
      - PGID=1000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
Open http://localhost:3000 in your browser. Upload an audio file and Scriberr handles the rest.
Key Features
- Drag-and-drop uploads — MP3, WAV, M4A, FLAC, and more
- Speaker diarization — Identifies who said what in multi-speaker recordings
- Folder watcher — Drop files into a directory and they’re transcribed automatically
- Ollama integration — Summarize transcripts locally using your own LLM
- API access — Integrate into automation workflows
- YouTube download — Paste a URL and transcribe the audio directly
When to choose this: You want a polished app experience, process lots of files, or need speaker identification.
Alternative: Whishper (Transcription + Translation)
Worth a mention: Whishper is another self-hosted option that bundles transcription, subtitle generation, and translation (via LibreTranslate) into a single Docker stack. It adds a subtitle editor to the web UI, so you can clean up transcripts without leaving the browser. Heavier setup than Scriberr (MongoDB + Nginx + LibreTranslate containers), but useful if you regularly work with multilingual content or need subtitle export.
The Privacy Calculation
Here’s what these cloud services know about you:
Otter.ai processes every recording on their servers. Their privacy policy allows them to use your data to “improve and develop” their products. Your meetings, interviews, and conversations become training data.
Rev.com uploads your audio for processing — and for their human transcription tier, actual people listen to your recordings.
Self-hosted Whisper sends nothing anywhere. Your audio stays on your disk, gets processed by your CPU or GPU, and the text output stays on your disk. There’s no account to create, no terms to accept, no data pipeline to worry about.
For anyone transcribing sensitive material — legal interviews, medical notes, confidential meetings, personal journals — the choice is straightforward.
Cost Comparison
| | Otter.ai Pro | Rev AI | Self-Hosted |
|---|---|---|---|
| Pricing | $16.99/mo | $0.25/min | Electricity only |
| 10 hours/month | $16.99 | $150 | $0 |
| 50 hours/month | $30/mo (Business) | $750 | $0 |
| Privacy | Cloud-processed | Cloud-processed | Fully local |
| Languages | ~35 | 36+ | 99 |
| Limits | 1,200 min (Pro) | Pay per minute | Your hardware |
The break-even point hits quickly. If you’re transcribing more than an hour or two per month, self-hosting saves money from day one — with zero ongoing fees and complete data control.
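That break-even claim is easy to sanity-check with the prices quoted in the table (Rev AI at $0.25 per minute of audio):

```python
REV_AI_PER_MINUTE = 0.25   # $ per minute of audio, as quoted above

def rev_ai_cost(hours: float) -> float:
    """Monthly Rev AI bill for a given number of transcribed hours."""
    return hours * 60 * REV_AI_PER_MINUTE

print(rev_ai_cost(1))    # 15.0  -- one hour already rivals Otter Pro's $16.99/mo
print(rev_ai_cost(10))   # 150.0
print(rev_ai_cost(50))   # 750.0
```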
Hardware Recommendations
Already own a recent Mac? You’re set. An M1 MacBook Air with 8 GB RAM runs the turbo model comfortably via whisper.cpp with Metal acceleration. Expect roughly real-time transcription speed (1 hour of audio in about 1 hour of processing).
Linux desktop with an NVIDIA GPU? Even a GTX 1060 with 6 GB VRAM handles the turbo model. An RTX 3060 or newer processes audio significantly faster than real-time.
Old laptop or Raspberry Pi? Stick with the tiny or base models. They’re slower and less accurate, but they work. A Raspberry Pi 4 can transcribe — just not quickly.
No GPU, no Apple Silicon? Use faster-whisper with int8 quantization on CPU. The small model on a modern Intel or AMD processor handles most transcription jobs at reasonable speeds.
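The guidance above boils down to a small rule of thumb. This is a sketch of my own, not an official API — the thresholds just encode the VRAM column of the model table:

```python
def pick_model(gpu_vram_gb: float = 0, apple_silicon: bool = False,
               modern_cpu: bool = True) -> str:
    """Suggest a Whisper model size from the hardware guidance above."""
    if apple_silicon or gpu_vram_gb >= 6:
        return "turbo"   # ~6 GB VRAM: best speed/accuracy trade-off
    if gpu_vram_gb >= 2:
        return "small"   # fits in 2 GB VRAM, good balance
    # CPU-only: small on a modern chip, base on old laptops or a Pi
    return "small" if modern_cpu else "base"

print(pick_model(apple_silicon=True))     # turbo
print(pick_model(modern_cpu=False))       # base
```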
What You Can Do
- Start with faster-whisper if you just want text from audio files. One pip install, three lines of Python, done.
- Try whisper.cpp if you’re on a Mac and want native Metal performance without Python.
- Deploy Scriberr if you want a proper web app for your household or small team, especially for meeting transcription with speaker identification.
- Pick the turbo model unless you have a specific reason not to — it’s the sweet spot of speed and accuracy.
- Keep recordings local. Once you’ve set up any of these tools, there’s no reason to upload private audio to someone else’s server again.