Every time you upload audio to Otter.ai, Rev, or any cloud transcription service, your voice — meetings, medical notes, personal memos, legal consultations — lands on someone else’s server. Most of these services retain your data for model training unless you pay for enterprise tiers with opt-outs. Some keep it indefinitely.
You do not need any of them. OpenAI’s Whisper model is open-source, runs on consumer hardware, and transcribes English at under 8% word error rate — better than most paid services. This guide covers two ways to set it up: whisper.cpp for people who want a fast, lightweight command-line tool, and faster-whisper for Python users who need scripting flexibility. Both run entirely on your machine.
Which tool should you pick?
There are three main ways to run Whisper locally. Here is how they compare:
| Tool | Language | Best for | Speed vs. original | GPU support |
|---|---|---|---|---|
| whisper.cpp | C/C++ | CLI transcription, Apple Silicon | ~2x faster | Metal, CUDA, OpenVINO |
| faster-whisper | Python | Scripting, batch jobs, pipelines | ~4x faster | CUDA (NVIDIA) |
| Original Whisper | Python | Compatibility, research | Baseline | CUDA |
Short version: If you have a Mac, use whisper.cpp — it has native Metal acceleration and needs no Python environment. If you have an NVIDIA GPU and want to build transcription into a script or pipeline, use faster-whisper. Skip the original Whisper implementation unless you specifically need it for research; it is the slowest of the three.
Hardware requirements
Be honest with yourself about what your machine can handle. Bigger models produce better transcriptions but need more resources.
| Model | Parameters | VRAM / Memory | Speed (CPU) | Speed (GPU) | Accuracy (WER) |
|---|---|---|---|---|---|
| tiny | 39M | ~1 GB | Fast | Very fast | ~14% |
| base | 74M | ~1 GB | Fast | Very fast | ~11% |
| small | 244M | ~2 GB | Moderate | Fast | ~9% |
| medium | 769M | ~5 GB | Slow | Moderate | ~8% |
| large-v3 | 1.55B | ~10 GB | Very slow | Moderate | ~7.4% |
| large-v3-turbo | 809M | ~6 GB | Slow | Fast | ~7.75% |
The large-v3-turbo model is the sweet spot for most people. It is 5–6x faster than the full large-v3 with nearly identical accuracy, and it fits in 6 GB of VRAM. If you are CPU-only, stick with small or base — anything bigger becomes painfully slow without a GPU.
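Not sure what you have to work with? Two quick checks (assuming NVIDIA's driver tools on Linux and a stock macOS install):

# NVIDIA: GPU name and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv

# Apple Silicon: unified memory, shared between CPU and GPU
sysctl -n hw.memsize | awk '{printf "%.0f GB\n", $1/1073741824}'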
Option 1: whisper.cpp (recommended for Mac and CPU users)
Install dependencies
macOS:
xcode-select --install # if you haven't already
brew install cmake ffmpeg
Ubuntu/Debian:
sudo apt install build-essential cmake ffmpeg
Fedora:
sudo dnf install gcc-c++ cmake ffmpeg
Build whisper.cpp
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build -j --config Release
On Apple Silicon Macs, Metal acceleration is enabled automatically. You do not need to pass any extra flags — the build system detects it. For NVIDIA GPUs on Linux, add -DGGML_CUDA=1 to the cmake configure step:
cmake -B build -DGGML_CUDA=1
cmake --build build -j --config Release
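A quick way to confirm the binary actually built and runs:

./build/bin/whisper-cli --help | head -n 5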
Download a model
sh ./models/download-ggml-model.sh large-v3-turbo
Other options: tiny, base, small, medium, large-v3. For English-only audio, the tiny through medium models also come in .en variants (e.g., base.en) that are slightly more accurate on English; the large models have no .en variant.
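For example, to fetch the English-only base model instead:

sh ./models/download-ggml-model.sh base.en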
Transcribe your first file
Whisper expects 16-bit WAV audio at 16 kHz. Convert anything else with ffmpeg first:
ffmpeg -i recording.mp3 -ar 16000 -ac 1 -c:a pcm_s16le recording.wav
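If you have a whole folder of recordings, a short loop handles the conversion (a sketch assuming MP3 input; adjust the glob for other formats):

# Convert every MP3 in the current directory to 16 kHz mono WAV
for f in *.mp3; do
    ffmpeg -i "$f" -ar 16000 -ac 1 -c:a pcm_s16le "${f%.mp3}.wav" -y -loglevel error
done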
Then transcribe:
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f recording.wav
That is it. Output prints to stdout. Add -otxt for a plain text file, -osrt for SRT subtitles, or -ovtt for WebVTT.
Useful flags
# Output to SRT subtitle file
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f recording.wav -osrt
# Use 8 threads (adjust to your CPU core count)
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f recording.wav -t 8
# Translate from another language to English
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f spanish_audio.wav -l es -tr
# Print progress while transcribing (timestamps are printed by default)
./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f recording.wav -pp
Option 2: faster-whisper (recommended for NVIDIA GPUs and Python users)
Install
pip install faster-whisper
For GPU acceleration you need recent NVIDIA drivers plus the cuBLAS and cuDNN libraries for CUDA 12. You do not need the full CUDA toolkit; on Linux, the libraries can be installed straight from pip, as shown below.
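A minimal sketch of that Linux setup, using the package names from faster-whisper's README (pin exact versions per its docs if you hit cuDNN mismatches):

pip install nvidia-cublas-cu12 nvidia-cudnn-cu12

# Point the dynamic loader at the pip-installed libraries
export LD_LIBRARY_PATH=$(python3 -c 'import os, nvidia.cublas.lib, nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))')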
Basic transcription
Create a file called transcribe.py:
from faster_whisper import WhisperModel
# For NVIDIA GPU:
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
# For CPU-only:
# model = WhisperModel("large-v3-turbo", device="cpu", compute_type="int8")
segments, info = model.transcribe("recording.mp3", beam_size=5)
print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
Run it:
python transcribe.py
faster-whisper handles audio format conversion internally — you can feed it MP3, M4A, FLAC, or WAV directly. No ffmpeg pre-processing needed.
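Two transcribe() options worth knowing about: vad_filter runs the bundled Silero voice-activity detector to skip long stretches of silence, and word_timestamps attaches per-word timings to each segment. A sketch of both together:

from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "recording.mp3",
    beam_size=5,
    vad_filter=True,       # skip silence using the bundled Silero VAD
    word_timestamps=True,  # attach per-word start/end times to each segment
)

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s -> {word.end:.2f}s]{word.word}")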
Batch transcription
Here is a script to transcribe every audio file in a directory:
import sys
from pathlib import Path

from faster_whisper import WhisperModel

audio_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
extensions = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm"}

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

for audio_file in sorted(audio_dir.iterdir()):
    if audio_file.suffix.lower() not in extensions:
        continue

    print(f"\n{'=' * 60}")
    print(f"Transcribing: {audio_file.name}")
    print(f"{'=' * 60}")

    segments, info = model.transcribe(str(audio_file), beam_size=5)

    output_file = audio_file.with_suffix(".txt")
    with open(output_file, "w") as f:
        for segment in segments:
            line = f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}"
            print(line)
            f.write(segment.text + "\n")

    print(f"Saved to: {output_file}")
Save it as batch_transcribe.py and run with python batch_transcribe.py /path/to/audio/files.
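Note that faster-whisper hands you raw segments rather than finished subtitle files; there is no built-in equivalent of whisper.cpp's -osrt flag. Formatting SRT yourself is straightforward (a minimal sketch; the helper names are mine):

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments, path):
    """Write faster-whisper segments out as a numbered SRT file."""
    with open(path, "w") as f:
        for i, segment in enumerate(segments, start=1):
            f.write(f"{i}\n")
            f.write(f"{srt_timestamp(segment.start)} --> {srt_timestamp(segment.end)}\n")
            f.write(f"{segment.text.strip()}\n\n")

In the batch script, calling write_srt(segments, audio_file.with_suffix(".srt")) in place of the plain-text loop gets you subtitles instead.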
Performance benchmarks
Here is how faster-whisper stacks up against the alternatives on a 13-minute audio sample:
| Setup | Time |
|---|---|
| faster-whisper, RTX 3070 Ti, FP16 | 1m 03s |
| whisper.cpp, RTX 3070 Ti | 1m 05s |
| OpenAI Whisper, RTX 3070 Ti | 2m 23s |
| faster-whisper, i7-12700K, INT8 | 1m 42s |
| OpenAI Whisper, i7-12700K | 6m 58s |
Look at the CPU rows: faster-whisper on a CPU (1m 42s) actually beats the original Whisper running on a GPU (2m 23s). And against the original implementation on the same CPU, it is roughly a 4x speedup (6m 58s down to 1m 42s) from the inference engine alone.
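Benchmarks like these shift with hardware, model, and audio, so it is worth timing your own setup. The simplest check is to run a representative file through whichever tool you built (sample.wav stands in for your own test file):

time ./build/bin/whisper-cli -m models/ggml-large-v3-turbo.bin -f sample.wav > /dev/null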
Make it a shell command
Typing long commands every time is annoying. Create a wrapper script:
#!/bin/bash
# Save as ~/bin/transcribe and chmod +x ~/bin/transcribe
WHISPER_DIR="$HOME/whisper.cpp"
MODEL="$WHISPER_DIR/models/ggml-large-v3-turbo.bin"
if [ -z "$1" ]; then
    echo "Usage: transcribe <audio-file> [output-format]"
    echo "Formats: txt, srt, vtt (default: txt)"
    exit 1
fi
INPUT="$1"
FORMAT="${2:-txt}"
TEMP_WAV=$(mktemp /tmp/whisper_XXXXXX.wav)
# Convert to 16kHz mono WAV
ffmpeg -i "$INPUT" -ar 16000 -ac 1 -c:a pcm_s16le "$TEMP_WAV" -y -loglevel error
# Transcribe
"$WHISPER_DIR/build/bin/whisper-cli" -m "$MODEL" -f "$TEMP_WAV" -o"$FORMAT" -of "${INPUT%.*}"
rm "$TEMP_WAV"
echo "Done: ${INPUT%.*}.$FORMAT"
Now you can run transcribe meeting.mp3 srt from anywhere.
Beyond Whisper: what else is out there
Whisper is the most mature option, but the field is moving. Here is what else you should know about:
NVIDIA Canary-Qwen-2.5B currently tops the Open ASR Leaderboard with 5.63% WER — better than Whisper’s 7.4%. But it is English-only, requires NeMo toolkit setup, and needs substantially more VRAM. Worth watching for English-heavy workloads if you have a beefy GPU.
Distil-Whisper strips the Whisper large-v3 down to 756M parameters while staying within 1% of its accuracy. It runs 5–6x faster than the full model and works as a drop-in replacement in faster-whisper. Good if you want Whisper-level quality but your GPU has limited VRAM.
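Since it is a drop-in replacement, trying it from faster-whisper is a one-line change. "distil-large-v3" is the model alias faster-whisper currently resolves for it, though check your installed version if the download fails:

from faster_whisper import WhisperModel

# Same API as before; only the model name changes
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("recording.mp3", beam_size=5)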
OpenWhispr wraps Whisper and NVIDIA Parakeet in a desktop app with a clean GUI. If you want transcription without touching a terminal, this is the easiest path. Open source, works offline, available on macOS, Windows, and Linux.
Moonshine by Useful Sensors is built for edge devices — the smallest model is just 27 MB. If you need transcription on a Raspberry Pi or similar constrained hardware, this is worth a look.
What this means for privacy
Cloud transcription services process some of the most sensitive audio that exists: medical dictation, legal depositions, therapy sessions, journalism interviews, business negotiations. When you upload to a cloud service, you are trusting that company with the full content of those conversations — and trusting that their data handling, retention, and training policies actually match what they claim.
Running transcription locally eliminates that trust requirement entirely. Your audio stays on your hardware. There is no upload, no API call, no server-side logging, no fine print about model training. The accuracy is competitive with paid services, and the setup takes less time than reading most services’ privacy policies.
What you can do
- Start with whisper.cpp and the large-v3-turbo model — it covers 90% of use cases and runs well on most hardware from the last few years.
- Create the shell wrapper so you can transcribe files in one command.
- If you process audio regularly, set up faster-whisper with the batch script to handle whole directories at once.
- Audit what you are currently sending to cloud services. If you use Otter, Rev, or any cloud transcription, check their data retention policies. You might be surprised.
- For meetings, look into WhisperX — it adds speaker diarization (identifying who said what) on top of Whisper, also fully local. A quick sketch follows below.
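WhisperX is driven from its own CLI. The flags below match its README at the time of writing; diarization downloads pyannote models, which require a free Hugging Face token, so check whisperx --help against your installed version:

pip install whisperx
whisperx meeting.mp3 --model large-v3 --diarize --hf_token YOUR_HF_TOKEN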