Stop sending your voice to the cloud: self-host speech-to-text with Whisper in under 20 minutes

A practical guide to running fully local audio transcription with whisper.cpp and faster-whisper — no API keys, no subscriptions, no data leaving your machine.


Every time you upload audio to Otter.ai, Rev, or any cloud transcription service, your voice — meetings, medical notes, personal memos, legal consultations — lands on someone else’s server. Most of these services retain your data for model training unless you pay for enterprise tiers with opt-outs. Some keep it indefinitely.

You do not need any of them. OpenAI’s Whisper model is open-source, runs on consumer hardware, and transcribes English at under 8% word error rate — better than most paid services. This guide covers two ways to set it up: whisper.cpp for people who want a fast, lightweight command-line tool, and faster-whisper for Python users who need scripting flexibility. Both run entirely on your machine.

Which tool should you pick

There are three main ways to run Whisper locally. Here is how they compare:

| Tool | Language | Best for | Speed vs. original | GPU support |
|---|---|---|---|---|
| whisper.cpp | C/C++ | CLI transcription, Apple Silicon | ~2x faster | Metal, CUDA, OpenVINO |
| faster-whisper | Python | Scripting, batch jobs, pipelines | ~4x faster | CUDA (NVIDIA) |
| Original Whisper | Python | Compatibility, research | Baseline | CUDA |

Short version: If you have a Mac, use whisper.cpp — it has native Metal acceleration and needs no Python environment. If you have an NVIDIA GPU and want to build transcription into a script or pipeline, use faster-whisper. Skip the original Whisper implementation unless you specifically need it for research; it is the slowest of the three.

Hardware requirements

Be honest with yourself about what your machine can handle. Bigger models produce better transcriptions but need more resources.

| Model | Parameters | VRAM / Memory | Speed (CPU) | Speed (GPU) | Accuracy (WER) |
|---|---|---|---|---|---|
| tiny | 39M | ~1 GB | Fast | Very fast | ~14% |
| base | 74M | ~1 GB | Fast | Very fast | ~11% |
| small | 244M | ~2 GB | Moderate | Fast | ~9% |
| medium | 769M | ~5 GB | Slow | Moderate | ~8% |
| large-v3 | 1.55B | ~10 GB | Very slow | Moderate | ~7.4% |
| large-v3-turbo | 809M | ~6 GB | Slow | Fast | ~7.75% |

The large-v3-turbo model is the sweet spot for most people. It is 5–6x faster than the full large-v3 with nearly identical accuracy, and it fits in 6 GB of VRAM. If you are CPU-only, stick with small or base — anything bigger becomes painfully slow without a GPU.
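If you script your setup, the decision rule above can be captured in a tiny helper. This is a hypothetical sketch, not part of any Whisper tooling — the thresholds simply mirror the VRAM column in the table:

```python
def recommend_model(vram_gb=None):
    """Pick a Whisper model size.

    vram_gb: available GPU memory in GB, or None for a CPU-only machine.
    Thresholds follow the hardware table above.
    """
    if vram_gb is None:
        return "small"           # CPU-only: anything bigger is painfully slow
    if vram_gb >= 6:
        return "large-v3-turbo"  # the sweet spot: near-large accuracy in ~6 GB
    if vram_gb >= 2:
        return "small"
    return "base"

print(recommend_model(8))     # large-v3-turbo
print(recommend_model(None))  # small
```

Adjust the cutoffs to taste — the point is that the choice is mechanical once you know your hardware.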

Install dependencies

macOS:

xcode-select --install  # if you haven't already
brew install cmake ffmpeg

Ubuntu/Debian:

sudo apt install build-essential cmake ffmpeg

Fedora:

sudo dnf install gcc-c++ cmake ffmpeg

Build whisper.cpp

git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build -j --config Release

On Apple Silicon Macs, Metal acceleration is enabled automatically. You do not need to pass any extra flags — the build system detects it. For NVIDIA GPUs on Linux, add -DGGML_CUDA=1 to the cmake configure step:

cmake -B build -DGGML_CUDA=1
cmake --build build -j --config Release

Download a model

sh ./models/download-ggml-model.sh large-v3-turbo

Other options: tiny, base, small, medium, large-v3. For English-only use, the tiny through medium models have .en variants (e.g., base.en) with slightly better accuracy on English audio; the large models do not.

Transcribe your first file

Whisper expects 16-bit WAV audio at 16 kHz. Convert anything else with ffmpeg first:

ffmpeg -i recording.mp3 -ar 16000 -ac 1 -c:a pcm_s16le recording.wav

Then transcribe:

./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f recording.wav

That is it. Output prints to stdout. Add -otxt for a plain text file, -osrt for SRT subtitles, or -ovtt for WebVTT.

Useful flags

# Output to SRT subtitle file
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f recording.wav -osrt

# Use 8 threads (adjust to your CPU core count)
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f recording.wav -t 8

# Translate from another language to English
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f spanish_audio.wav -l es -tr

# Print progress while transcribing (timestamps are printed by default)
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f recording.wav -pp

Install faster-whisper

pip install faster-whisper

For GPU acceleration, you need NVIDIA drivers plus the CUDA 12 cuBLAS and cuDNN libraries. On Linux you can pull those in from PyPI (pip install nvidia-cublas-cu12 nvidia-cudnn-cu12) instead of installing the full CUDA toolkit.

Basic transcription

Create a file called transcribe.py:

from faster_whisper import WhisperModel

# For NVIDIA GPU:
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

# For CPU-only:
# model = WhisperModel("large-v3-turbo", device="cpu", compute_type="int8")

segments, info = model.transcribe("recording.mp3", beam_size=5)

print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Run it:

python transcribe.py

faster-whisper handles audio format conversion internally — you can feed it MP3, M4A, FLAC, or WAV directly, with no ffmpeg pre-processing. One gotcha: transcribe() returns segments as a lazy generator, so the actual transcription work happens only as you iterate over it.

Batch transcription

Here is a script to transcribe every audio file in a directory:

import sys
from pathlib import Path
from faster_whisper import WhisperModel

audio_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
extensions = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm"}

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

for audio_file in sorted(audio_dir.iterdir()):
    if audio_file.suffix.lower() not in extensions:
        continue

    print(f"\n{'=' * 60}")
    print(f"Transcribing: {audio_file.name}")
    print(f"{'=' * 60}")

    segments, info = model.transcribe(str(audio_file), beam_size=5)

    output_file = audio_file.with_suffix(".txt")
    with open(output_file, "w") as f:
        for segment in segments:
            line = f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}"
            print(line)
            f.write(segment.text + "\n")

    print(f"Saved to: {output_file}")

Save it as batch_transcribe.py and run with python batch_transcribe.py /path/to/audio/files.
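The batch script writes plain text. If you also want SRT subtitles out of faster-whisper, a small formatter is enough. This is a hypothetical helper, not part of the library — it only assumes each segment exposes start and end times in seconds and a text field, which is the shape faster-whisper yields:

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Render an iterable of segments (.start, .end, .text) as SRT."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)
```

Pass it the segments from model.transcribe(...) and write the result to a .srt file next to the audio.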

Performance benchmarks

On a 13-minute audio sample, faster-whisper compared to the alternatives:

| Setup | Time |
|---|---|
| faster-whisper, RTX 3070 Ti, FP16 | 1m 03s |
| whisper.cpp, RTX 3070 Ti | 1m 05s |
| OpenAI Whisper, RTX 3070 Ti | 2m 23s |
| faster-whisper, i7-12700K, INT8 | 1m 42s |
| OpenAI Whisper, i7-12700K | 6m 58s |

faster-whisper on CPU (1m 42s) actually outruns the original Whisper on GPU (2m 23s), and on the same CPU it is a 4x speedup just from using a better inference engine.
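The 4x figure checks out as a quick back-of-the-envelope against the table, with the times converted to seconds:

```python
# Benchmark times from the table, in seconds
openai_cpu = 6 * 60 + 58  # OpenAI Whisper, i7-12700K: 6m 58s
fw_cpu     = 1 * 60 + 42  # faster-whisper INT8, i7-12700K: 1m 42s
openai_gpu = 2 * 60 + 23  # OpenAI Whisper, RTX 3070 Ti: 2m 23s

print(f"CPU-vs-CPU speedup: {openai_cpu / fw_cpu:.1f}x")                 # ~4.1x
print(f"faster-whisper CPU vs OpenAI GPU: {openai_gpu / fw_cpu:.1f}x")   # ~1.4x
```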

Make it a shell command

Typing long commands every time is annoying. Create a wrapper script:

#!/bin/bash
# Save as ~/bin/transcribe and chmod +x ~/bin/transcribe

WHISPER_DIR="$HOME/whisper.cpp"
MODEL="$WHISPER_DIR/models/ggml-large-v3-turbo.bin"

if [ -z "$1" ]; then
    echo "Usage: transcribe <audio-file> [output-format]"
    echo "Formats: txt, srt, vtt (default: txt)"
    exit 1
fi

INPUT="$1"
FORMAT="${2:-txt}"
TEMP_WAV=$(mktemp /tmp/whisper_XXXXXX.wav)

# Convert to 16kHz mono WAV
ffmpeg -i "$INPUT" -ar 16000 -ac 1 -c:a pcm_s16le "$TEMP_WAV" -y -loglevel error

# Transcribe
"$WHISPER_DIR/build/bin/whisper-cli" -m "$MODEL" -f "$TEMP_WAV" -o"$FORMAT" -of "${INPUT%.*}"

rm "$TEMP_WAV"
echo "Done: ${INPUT%.*}.$FORMAT"

Now you can run transcribe meeting.mp3 srt from anywhere.

Beyond Whisper: what else is out there

Whisper is the most mature option, but the field is moving. Here is what else you should know about:

NVIDIA Canary-Qwen-2.5B currently tops the Open ASR Leaderboard with 5.63% WER — better than Whisper’s 7.4%. But it is English-only, requires NeMo toolkit setup, and needs substantially more VRAM. Worth watching for English-heavy workloads if you have a beefy GPU.

Distil-Whisper strips the Whisper large-v3 down to 756M parameters while staying within 1% of its accuracy. It runs 5–6x faster than the full model and works as a drop-in replacement in faster-whisper. Good if you want Whisper-level quality but your GPU has limited VRAM.

OpenWhispr wraps Whisper and NVIDIA Parakeet in a desktop app with a clean GUI. If you want transcription without touching a terminal, this is the easiest path. Open source, works offline, available on macOS, Windows, and Linux.

Moonshine by Useful Sensors is built for edge devices — the smallest model is just 27 MB. If you need transcription on a Raspberry Pi or similar constrained hardware, this is worth a look.

What this means for privacy

Cloud transcription services process some of the most sensitive audio that exists: medical dictation, legal depositions, therapy sessions, journalism interviews, business negotiations. When you upload to a cloud service, you are trusting that company with the full content of those conversations — and trusting that their data handling, retention, and training policies actually match what they claim.

Running transcription locally eliminates that trust requirement entirely. Your audio stays on your hardware. There is no upload, no API call, no server-side logging, no fine print about model training. The accuracy is competitive with paid services, and the setup takes less time than reading most services’ privacy policies.

What you can do

  1. Start with whisper.cpp and the large-v3-turbo model — it covers 90% of use cases and runs well on most hardware from the last few years.
  2. Create the shell wrapper so you can transcribe files in one command.
  3. If you process audio regularly, set up faster-whisper with the batch script to handle whole directories at once.
  4. Audit what you are currently sending to cloud services. If you use Otter, Rev, or any cloud transcription, check their data retention policies. You might be surprised.
  5. For meetings, look into WhisperX — it adds speaker diarization (identifying who said what) on top of Whisper, also fully local.