Self-Host Whisper and Stop Paying for Transcription Services

Complete guide to running Whisper locally for free, private speech-to-text that replaces Otter.ai, Rev, and cloud transcription APIs

Every minute of audio you upload to Otter.ai, Rev, or any cloud transcription service becomes data you don’t control. Your voice is biometric data - unlike a password, you can’t change it if it gets exposed. Running Whisper locally gives you professional-grade transcription that never leaves your machine.

Here’s how to set it up.

Why Local Transcription Matters

Cloud transcription services charge per minute and keep your recordings on their servers. OpenAI’s Whisper API costs $0.006 per minute. That adds up: transcribing 10 hours of meetings monthly costs $3.60 in API fees alone. Otter.ai’s premium tier runs $16.99/month.
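The break-even arithmetic is easy to check; a quick sketch using the figures above (the variable names are just for illustration):

```python
API_RATE = 0.006       # OpenAI Whisper API, USD per audio minute
OTTER_MONTHLY = 16.99  # Otter.ai premium tier, USD per month

hours = 10
api_cost = hours * 60 * API_RATE
print(f"{hours}h/month via the API: ${api_cost:.2f}")  # $3.60

# Hours of API transcription that Otter's flat fee would buy:
breakeven = OTTER_MONTHLY / (60 * API_RATE)
print(f"Otter's fee equals ~{breakeven:.0f} hours of API usage")  # ~47 hours
```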

Self-hosted Whisper costs nothing after the initial setup. More importantly, your audio stays on your hardware. No third-party retention policies, no training data pipelines, no breach exposure.

Local processing works everywhere - on airplanes, in secure facilities, anywhere without internet. And for most use cases, local Whisper matches or beats cloud speed.

Choosing Your Whisper Variant

OpenAI released Whisper as open source. Multiple optimized versions exist:

faster-whisper is the safest default. It uses CTranslate2, a C++ inference engine, running up to 4x faster than the original while using less memory. Works well on both CPU and GPU.

whisper.cpp is a pure C/C++ reimplementation with minimal dependencies. Optimal for Mac with Metal and Core ML acceleration - 8-12x faster than CPU-only on Apple Silicon.

insanely-fast-whisper pushes maximum throughput on high-end NVIDIA GPUs. Overkill for most personal use.

WhisperX adds word-level timestamps and speaker diarization (who said what). Pick this if you need those features.

For most people: use faster-whisper on Windows/Linux with a GPU, or whisper.cpp on Mac.

Model Selection: Size vs Speed

Whisper comes in multiple sizes. Larger models are more accurate but slower:

| Model | Parameters | VRAM | Speed | Use Case |
|---|---|---|---|---|
| tiny | 39M | ~1GB | Very fast | Quick drafts, English only |
| base | 74M | ~1GB | Fast | Casual transcription |
| small | 244M | ~2GB | Moderate | Good accuracy, reasonable speed |
| medium | 769M | ~5GB | Slower | High accuracy |
| large-v3 | 1550M | ~10GB | Slowest | Best accuracy |
| large-v3-turbo | 809M | ~6GB | Fast | Near large-v3 accuracy, 6x faster |

Recommendation: Start with large-v3-turbo. It delivers within 1-2% accuracy of large-v3 while being 6x faster. If you’re on a CPU or have limited VRAM, use small or medium.
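The table can be folded into a simple default-choice rule; `pick_model` is a hypothetical helper, and the thresholds are the approximate VRAM figures above:

```python
def pick_model(vram_gb=None):
    """Map available GPU VRAM (GB) to a Whisper model name; None means CPU-only."""
    if vram_gb is None:
        return "small"            # CPU: favor speed over the last bit of accuracy
    if vram_gb >= 6:
        return "large-v3-turbo"   # ~6GB: near large-v3 accuracy, much faster
    if vram_gb >= 5:
        return "medium"
    if vram_gb >= 2:
        return "small"
    return "base"
```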

Setup Option 1: faster-whisper (GPU/CPU)

Best for Windows and Linux users with NVIDIA GPUs or those running on CPU.

Requirements

  • Python 3.9+ (recent faster-whisper releases dropped 3.8 support)
  • ffmpeg installed
  • For GPU: NVIDIA GPU with CUDA 12 and cuDNN 9

Installation

pip install faster-whisper

For GPU support, install the NVIDIA libraries:

pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*

On Linux, set the library path:

export LD_LIBRARY_PATH=`python -c 'import nvidia.cublas.lib; import nvidia.cudnn.lib; print(":".join([nvidia.cublas.lib.__path__[0], nvidia.cudnn.lib.__path__[0]]))'`:$LD_LIBRARY_PATH

Basic Transcription Script

Create transcribe.py:

from faster_whisper import WhisperModel

# For GPU with 6GB+ VRAM
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

# For CPU (slower but works everywhere)
# model = WhisperModel("large-v3-turbo", device="cpu", compute_type="int8")

def transcribe(audio_file):
    segments, info = model.transcribe(audio_file, beam_size=5)
    print(f"Detected language: {info.language} ({info.language_probability:.0%})")

    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

if __name__ == "__main__":
    import sys
    transcribe(sys.argv[1])

Run it:

python transcribe.py meeting-recording.mp3

Word-Level Timestamps

Need timestamps for each word? Add word_timestamps=True:

segments, _ = model.transcribe("audio.mp3", word_timestamps=True)
for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s -> {word.end:.2f}s] {word.word}")

Voice Activity Detection

Filter out silence and reduce processing time:

segments, _ = model.transcribe(
    "audio.mp3",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500)
)
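The segment start/end times map directly onto subtitle formats. A minimal sketch that renders the `segments` iterable from the calls above as SRT (the function names are just for illustration):

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamps SRT expects."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render faster-whisper segments (.start, .end, .text) as an SRT string."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)
```

Write the result to a .srt file next to the audio and most video players will load it automatically.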

Setup Option 2: whisper.cpp (Mac)

Best for Apple Silicon Macs. Takes advantage of Metal and Core ML acceleration.

Requirements

  • macOS Ventura 13.5 or later
  • Xcode Command Line Tools
  • At least 8GB unified memory (16GB recommended for large models)

Installation

# Install Xcode tools if needed
xcode-select --install

# Clone and build
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
make

# Download a model
bash ./models/download-ggml-model.sh large-v3-turbo

Running Transcription

./main -m models/ggml-large-v3-turbo.bin -f audio.wav

Recent whisper.cpp releases build the binary as build/bin/whisper-cli (via CMake) instead of ./main; if make doesn't produce ./main, substitute that path in the commands below.

For other audio formats, convert first:

ffmpeg -i recording.mp3 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav
./main -m models/ggml-large-v3-turbo.bin -f audio.wav

Enable Core ML (Optional)

For maximum speed on Apple Silicon, rebuild with Core ML:

make clean
WHISPER_COREML=1 make -j

This enables the Apple Neural Engine, achieving 8-12x speedup compared to CPU-only.

Setup Option 3: Docker with Web UI

For a browser-based interface, use Whishper - a self-hostable Docker stack with an upload interface and subtitle generation. The file below is a minimal single-service sketch; the project's official docker-compose also bundles MongoDB and LibreTranslate, so check the Whishper repo if this doesn't start cleanly.

Create docker-compose.yml:

version: "3.8"
services:
  whishper:
    image: pluja/whishper:latest
    container_name: whishper
    environment:
      - WHISPER_MODEL=large-v3-turbo
    volumes:
      - ./uploads:/app/uploads
      - ./models:/app/models
    ports:
      - "5001:5001"
    restart: unless-stopped

Start it:

docker compose up -d

Open http://localhost:5001 in your browser. Upload audio files and get transcripts with a clean interface.

Performance Expectations

Real-world speeds on common hardware:

| Hardware | Model | Speed |
|---|---|---|
| M2 MacBook Air (Metal) | large-v3-turbo | ~10x real-time |
| RTX 4070 | large-v3-turbo | ~15x real-time |
| CPU (8 cores) | small | ~2x real-time |
| CPU (8 cores) | large-v3-turbo | ~0.5x real-time |

“10x real-time” means a 10-minute recording transcribes in 1 minute.

For CPU-only systems, use the int8 compute type and stick with small or medium models. Quality is still good, just slower.

Tips for Best Results

Audio quality matters more than model size. A clean recording with the small model beats a noisy recording with large-v3. Use a decent microphone and minimize background noise.

Normalize your audio if it’s too quiet or has varying volume:

ffmpeg -i input.mp3 -af "loudnorm=I=-16:TP=-1.5:LRA=11" output.mp3

Use VAD filtering for recordings with long pauses. It significantly speeds up processing by skipping silent sections.

Batch process overnight for large archives. Point the script at a folder and let it run:

from pathlib import Path
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cpu", compute_type="int8")

audio_dir = Path("./recordings")
for audio_file in audio_dir.glob("*.mp3"):
    output_file = audio_file.with_suffix(".txt")
    if output_file.exists():
        continue  # skip files transcribed on a previous run
    segments, _ = model.transcribe(str(audio_file))
    output_file.write_text("".join(segment.text for segment in segments))

What You’re Getting vs Cloud Services

| Feature | Self-Hosted Whisper | Otter.ai | OpenAI API |
|---|---|---|---|
| Monthly cost | $0 | $16.99+ | Per minute |
| Data privacy | Complete | Their servers | Their servers |
| Offline use | Yes | No | No |
| Speed | Depends on hardware | Fast | Fast |
| Accuracy | Excellent | Excellent | Excellent |
| Speaker ID | With WhisperX | Yes | No |

Accuracy is comparable across the board - the OpenAI API runs the same Whisper models you'd host yourself, and Otter's engine lands in the same range. The difference is where your data goes and what you pay.

What You Can Do

  1. Start simple. Install faster-whisper, run the basic script, see how it works.
  2. Pick the right model. Try large-v3-turbo first. Drop to small if it’s too slow on your hardware.
  3. Set up a workflow. Create a folder where you drop recordings; write a script that auto-transcribes new files.
  4. Consider the Docker UI if you want a polished browser interface or need to share with non-technical users.
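Step 3 can be sketched as a small polling helper; the folder path and the `transcribe_fn` hookup below are placeholders, and the real transcription call would use faster-whisper from Option 1:

```python
from pathlib import Path

def process_new(watch_dir, transcribe_fn):
    """Transcribe any .mp3 without a matching .txt; return the new transcripts."""
    written = []
    for audio in sorted(Path(watch_dir).glob("*.mp3")):
        transcript = audio.with_suffix(".txt")
        if transcript.exists():
            continue  # already handled on a previous pass
        transcript.write_text(transcribe_fn(audio))
        written.append(transcript)
    return written

# Hook it up to faster-whisper and poll, e.g.:
#   model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
#   def transcribe_fn(path):
#       segments, _ = model.transcribe(str(path))
#       return "".join(s.text for s in segments)
#   while True:
#       process_new("./recordings", transcribe_fn)
#       time.sleep(30)
```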

Your voice recordings contain sensitive information - client calls, strategy discussions, personal notes. There’s no good reason to upload them to servers you don’t control when local transcription works this well.