Every word you speak into Otter.ai gets uploaded to their servers. Every meeting, every interview, every private conversation — processed, stored, and used to “improve their services.” Rev.com charges $0.25 per minute for AI transcription or $1.99 per minute for human-assisted, and your audio still goes through their cloud. Otter’s Pro plan runs $16.99/month, and even then, you’re limited to 1,200 minutes.
There’s a better way. OpenAI’s Whisper is a free, open-source speech recognition model that supports 99 languages and matches commercial transcription accuracy. Run it on your own machine and your recordings never leave your hard drive. No subscriptions, no per-minute fees, no privacy trade-offs.
Here are three ways to set it up, from quickest to most full-featured.
Pick Your Model First
Before choosing an approach, you need to pick a Whisper model size. This matters for both accuracy and speed.
| Model | Parameters | VRAM | Disk | Speed vs Large | Best For |
|---|---|---|---|---|---|
| tiny | 39M | ~1 GB | 75 MB | ~10x faster | Quick drafts, real-time |
| base | 74M | ~1 GB | 142 MB | ~7x faster | Decent quality, fast |
| small | 244M | ~2 GB | 466 MB | ~4x faster | Good balance |
| medium | 769M | ~5 GB | 1.5 GB | ~2x faster | High accuracy |
| large-v3 | 1.55B | ~10 GB | 2.9 GB | 1x (baseline) | Maximum accuracy |
| turbo | 809M | ~6 GB | 1.5 GB | ~8x faster | Best all-rounder |
The turbo model is the standout pick for most people. It’s a pruned, fine-tuned version of large-v3 that cuts the decoder from 32 layers to just 4, hitting accuracy comparable to large-v2 at roughly 8x the speed. Unless you need maximum accuracy for professional transcripts or you’re translating non-English audio into English (turbo doesn’t support the translation task), turbo is the one to use.
Running on CPU only? Start with small or base — you can always move up once you know your hardware handles it.
Option 1: faster-whisper (Quickest Setup)
faster-whisper is a reimplementation of Whisper using the CTranslate2 inference engine. It runs up to 4x faster than OpenAI’s original implementation while using less memory, and unlike vanilla Whisper, it doesn’t require FFmpeg to be installed separately.
Requirements: Python 3.8+, pip. GPU optional (NVIDIA with CUDA 12 + cuDNN 9).
Install
pip install faster-whisper
For GPU users who need NVIDIA libraries:
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12
Transcribe
Create a file called transcribe.py:
from faster_whisper import WhisperModel

# GPU: device="cuda", compute_type="float16"
# CPU: device="cpu", compute_type="int8"
model = WhisperModel("turbo", device="cpu", compute_type="int8")

segments, info = model.transcribe("your-audio.mp3", beam_size=5)
print(f"Detected language: {info.language} ({info.language_probability:.0%})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
Run it:
python transcribe.py
That’s it. The first run downloads the model (turbo is about 1.5 GB), then subsequent runs start immediately. You get timestamped segments out of the box.
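Those timestamped segments map directly onto subtitle formats. As a sketch (the `srt_timestamp` and `write_srt` helpers are my own, not part of faster-whisper), here’s how you might dump segments to an .srt file:

```python
def srt_timestamp(seconds: float) -> str:
    """Convert seconds to the HH:MM:SS,mmm format SRT requires."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments, path: str) -> None:
    """Write an iterable of segments (objects with .start, .end, .text) as SRT."""
    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            f.write(f"{i}\n")
            f.write(f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n")
            f.write(f"{seg.text.strip()}\n\n")
```

Pass the `segments` generator from `model.transcribe(...)` straight in — it’s lazy, so transcription happens as the file is written.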
Batch Transcription
For transcribing multiple files at once:
#!/bin/bash
for file in *.mp3 *.wav *.m4a; do
    [ -f "$file" ] || continue
    echo "Transcribing: $file"
    python -c "
from faster_whisper import WhisperModel

model = WhisperModel('turbo', device='cpu', compute_type='int8')
segments, info = model.transcribe('$file', beam_size=5)

with open('${file%.*}.txt', 'w') as f:
    for seg in segments:
        f.write(f'[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}\n')

print('Done: $file')
"
done
When to choose this: You want the fastest path from audio file to text file. No web UI, no Docker — just Python and a pip install.
Option 2: whisper.cpp (No Python, Best for Apple Silicon)
whisper.cpp is a pure C/C++ port of Whisper. Zero Python dependencies, native Metal acceleration on Macs, and it runs on basically anything — even a Raspberry Pi. If you’re on Apple Silicon, this is the fastest path to transcription.
Requirements: CMake, C compiler (Xcode on macOS, gcc/g++ on Linux). No Python needed.
Build on macOS (Metal GPU acceleration is enabled by default)
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build -j --config Release
Build on Linux
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build -j --config Release
For NVIDIA GPU support on Linux, add -DGGML_CUDA=1 to the first cmake command.
Download a Model
sh ./models/download-ggml-model.sh large-v3-turbo
Other options: tiny, base, small, medium, large-v3. Add .en suffix for English-only models (e.g., base.en).
Transcribe
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f your-audio.wav
whisper.cpp expects 16 kHz mono WAV input. Convert other formats with FFmpeg first:
ffmpeg -i your-audio.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
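If you’re not sure whether a file is already in the right shape, Python’s standard wave module can check it before you hand it to whisper-cli. This helper is my own sketch, not part of whisper.cpp:

```python
import wave

def is_whisper_ready(path: str) -> bool:
    """True if the WAV file is 16 kHz, mono, 16-bit PCM -- what whisper.cpp expects."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)

# Quick demonstration: write half a second of silence in the right format.
with wave.open("check.wav", "wb") as w:
    w.setnchannels(1)        # mono
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(16000)    # 16 kHz
    w.writeframes(b"\x00\x00" * 8000)

print(is_whisper_ready("check.wav"))  # True for the file we just wrote
```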
Quick Test
The fastest way to verify everything works is to transcribe the sample clip bundled with the repo:
sh ./models/download-ggml-model.sh base.en
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav
This downloads the base English model and runs inference on the included JFK sample.
When to choose this: You’re on a Mac and want native Metal performance. Or you want zero Python dependencies and maximum portability.
Option 3: Scriberr (Full Web UI With Docker)
If you want a proper application rather than a command-line tool, Scriberr wraps Whisper in a polished web interface with file management, speaker diarization, automatic folder watching, and even AI-powered summarization through Ollama.
Requirements: Docker and Docker Compose.
CPU Setup
Create a docker-compose.yml:
services:
  scriberr:
    image: ghcr.io/rishikanthc/scriberr:latest
    ports:
      - "3000:3000"
    volumes:
      - ./scriberr-data:/app/data
      - ./scriberr-models:/app/models
    environment:
      - PUID=1000
      - PGID=1000
Start it:
docker compose up -d
NVIDIA GPU Setup
For GPU acceleration, use the CUDA image:
services:
  scriberr:
    image: ghcr.io/rishikanthc/scriberr:latest-cuda
    ports:
      - "3000:3000"
    volumes:
      - ./scriberr-data:/app/data
      - ./scriberr-models:/app/models
    environment:
      - PUID=1000
      - PGID=1000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
Open http://localhost:3000 in your browser. Upload an audio file and Scriberr handles the rest.
Key Features
- Drag-and-drop uploads — MP3, WAV, M4A, FLAC, and more
- Speaker diarization — Identifies who said what in multi-speaker recordings
- Folder watcher — Drop files into a directory and they’re transcribed automatically
- Ollama integration — Summarize transcripts locally using your own LLM
- API access — Integrate into automation workflows
- YouTube download — Paste a URL and transcribe the audio directly
When to choose this: You want a polished app experience, process lots of files, or need speaker identification.
Alternative: Whishper (Transcription + Translation)
Worth a mention: Whishper is another self-hosted option that bundles transcription, subtitle generation, and translation (via LibreTranslate) into a single Docker stack. It adds a subtitle editor to the web UI, so you can clean up transcripts without leaving the browser. Heavier setup than Scriberr (MongoDB + Nginx + LibreTranslate containers), but useful if you regularly work with multilingual content or need subtitle export.
The Privacy Calculation
Here’s what these cloud services know about you:
Otter.ai processes every recording on their servers. Their privacy policy allows them to use your data to “improve and develop” their products. Your meetings, interviews, and conversations become training data.
Rev.com uploads your audio for processing — and for their human transcription tier, actual people listen to your recordings.
Self-hosted Whisper sends nothing anywhere. Your audio stays on your disk, gets processed by your CPU or GPU, and the text output stays on your disk. There’s no account to create, no terms to accept, no data pipeline to worry about.
For anyone transcribing sensitive material — legal interviews, medical notes, confidential meetings, personal journals — the choice is straightforward.
Cost Comparison
| | Otter.ai Pro | Rev AI | Self-Hosted |
|---|---|---|---|
| Pricing | $16.99/mo | $0.25/min | Electricity only |
| 10 hours/month | $16.99 | $150 | $0 |
| 50 hours/month | $30/mo (Business) | $750 | $0 |
| Privacy | Cloud-processed | Cloud-processed | Fully local |
| Languages | ~35 | 36+ | 99 |
| Limits | 1,200 min (Pro) | Pay per minute | Your hardware |
The break-even point hits quickly. If you’re transcribing more than an hour or two per month, self-hosting saves money from day one — with zero ongoing fees and complete data control.
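That break-even claim is easy to sanity-check with the prices quoted in the table (Rev AI at $0.25 per minute of audio):

```python
REV_AI_PER_MINUTE = 0.25   # $ per minute of audio, as quoted above

def rev_ai_cost(hours: float) -> float:
    """Monthly Rev AI bill for a given number of transcribed hours."""
    return hours * 60 * REV_AI_PER_MINUTE

print(rev_ai_cost(1))    # 15.0  -- one hour already rivals Otter Pro's $16.99/mo
print(rev_ai_cost(10))   # 150.0
print(rev_ai_cost(50))   # 750.0
```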
Hardware Recommendations
Already own a recent Mac? You’re set. An M1 MacBook Air with 8 GB RAM runs the turbo model comfortably via whisper.cpp with Metal acceleration. Expect roughly real-time transcription speed (1 hour of audio in about 1 hour of processing).
Linux desktop with an NVIDIA GPU? Even a GTX 1060 with 6 GB VRAM handles the turbo model. An RTX 3060 or newer processes audio significantly faster than real-time.
Old laptop or Raspberry Pi? Stick with the tiny or base models. They’re slower and less accurate, but they work. A Raspberry Pi 4 can transcribe — just not quickly.
No GPU, no Apple Silicon? Use faster-whisper with int8 quantization on CPU. The small model on a modern Intel or AMD processor handles most transcription jobs at reasonable speeds.
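The guidance above boils down to a small rule of thumb. This is a sketch of my own, not an official API — the thresholds just encode the VRAM column of the model table:

```python
def pick_model(gpu_vram_gb: float = 0, apple_silicon: bool = False,
               modern_cpu: bool = True) -> str:
    """Suggest a Whisper model size from the hardware guidance above."""
    if apple_silicon or gpu_vram_gb >= 6:
        return "turbo"   # ~6 GB VRAM: best speed/accuracy trade-off
    if gpu_vram_gb >= 2:
        return "small"   # fits in 2 GB VRAM, good balance
    # CPU-only: small on a modern chip, base on old laptops or a Pi
    return "small" if modern_cpu else "base"

print(pick_model(apple_silicon=True))     # turbo
print(pick_model(modern_cpu=False))       # base
```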
What You Can Do
- Start with faster-whisper if you just want text from audio files. One pip install, three lines of Python, done.
- Try whisper.cpp if you’re on a Mac and want native Metal performance without Python.
- Deploy Scriberr if you want a proper web app for your household or small team, especially for meeting transcription with speaker identification.
- Pick the turbo model unless you have a specific reason not to — it’s the sweet spot of speed and accuracy.
- Keep recordings local. Once you’ve set up any of these tools, there’s no reason to upload private audio to someone else’s server again.