Self-Host Your Own AI Transcription With Faster-Whisper

Otter.ai charges $16.99 a month for its Pro plan. You get 1,200 minutes of transcription and the privilege of uploading every meeting, interview, and voice note to someone else’s servers. Rev charges $0.25 per minute — which adds up fast if you transcribe regularly.

Meanwhile, OpenAI open-sourced Whisper under the MIT license. It transcribes 99 languages, translates speech to English, and generates timestamped subtitles. Run it locally, and every recording stays on your machine. No minute caps. No monthly fees. No audio uploaded anywhere.

The catch with vanilla Whisper is speed. That’s where faster-whisper comes in: a reimplementation on the CTranslate2 inference engine that, according to its maintainers, matches the original’s accuracy at up to 4x the speed while using less memory. On an RTX 4090, it transcribes a one-hour podcast in 50 seconds.

This guide gets you from zero to working local transcription in about 15 minutes.

What You Need

Hardware:

8 GB RAM, no GPU — Runs the small model on CPU. A one-hour file takes around 15 minutes. Fine for occasional use.
A GPU with 4-6 GB VRAM — Runs the small model with INT8 quantization. Same file finishes in about a minute.
A GPU with 8+ GB VRAM — Runs large-v3 at full quality. One hour of audio in under 5 minutes. This is the sweet spot.
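
The tiers above can be written down as a small lookup. This is only a sketch: `pick_settings` and `Settings` are hypothetical names, not part of faster-whisper, and the VRAM cut-offs (including treating anything under 4 GB as "no GPU") are this guide's rules of thumb rather than limits the library enforces.

```python
from typing import NamedTuple, Optional


class Settings(NamedTuple):
    model: str
    device: str
    compute_type: str


def pick_settings(vram_gb: Optional[float]) -> Settings:
    """Map the hardware tiers from the list above to model settings.

    vram_gb is None when there is no NVIDIA GPU. Treating cards under
    4 GB like the CPU tier is an assumption made for this sketch.
    """
    if vram_gb is None or vram_gb < 4:
        # No usable GPU: small model on CPU with 8-bit weights
        return Settings("small", "cpu", "int8")
    if vram_gb < 8:
        # 4-6 GB cards: small model with INT8 quantization on the GPU
        return Settings("small", "cuda", "int8")
    # 8 GB and up: large-v3 in half precision
    return Settings("large-v3", "cuda", "float16")


print(pick_settings(None))
print(pick_settings(6))
print(pick_settings(12))
```

The three fields line up with the arguments used later in this guide: the model name, `device`, and `compute_type`.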

Apple Silicon Macs work well too. An M4 Max transcribes a one-hour podcast in about 2.5 minutes using the large-v3 model.

Software:

Python 3.10 or 3.11
A terminal

Step 1: Install Faster-Whisper

Create a virtual environment and install:

python3.11 -m venv ~/venvs/whisper
source ~/venvs/whisper/bin/activate
pip install faster-whisper

If you have an NVIDIA GPU, install the CUDA dependencies for GPU acceleration:

pip install nvidia-cublas-cu12 "nvidia-cudnn-cu12==9.*"

That’s the entire installation. No Docker required, no complex build chains.

Step 2: Pick Your Model

Faster-whisper downloads models automatically on first use. Here’s what to choose:

Model	Size	Word Error Rate	Speed (1hr audio, RTX 3090)	Best For
tiny	75 MB	~7.7%	12 seconds	Real-time, embedded
base	142 MB	~5.4%	18 seconds	Quick drafts
small	466 MB	3.4%	52 seconds	Good balance
medium	1.5 GB	2.9%	2 min	Most use cases
large-v3	2.9 GB	2.0%	4 min 40 sec	Maximum accuracy
large-v3-turbo	1.5 GB	3.0%	~1 min 45 sec	Fast + accurate

The large-v3-turbo model is the standout here. It cuts the decoder from 32 layers to 4, which in the table above makes it about 2.7x faster than large-v3 (1 min 45 sec versus 4 min 40 sec) at a cost of one percentage point of word error rate on clean English audio. For most real-world transcription (meetings, podcasts, interviews) you won’t notice the difference.
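
To compare models on equal footing, it can help to turn the table's processing times into a real-time factor (seconds of audio handled per second of compute). The sketch below uses only the numbers from the table; the dictionary and function names are made up for illustration, and your own hardware will give different figures.

```python
# Processing time in seconds for one hour of audio, copied from the table (RTX 3090)
SECONDS_PER_HOUR_OF_AUDIO = {
    "tiny": 12,
    "base": 18,
    "small": 52,
    "medium": 120,
    "large-v3": 280,        # 4 min 40 sec
    "large-v3-turbo": 105,  # ~1 min 45 sec
}


def realtime_factor(model: str) -> float:
    """Seconds of audio processed per second of compute."""
    return 3600 / SECONDS_PER_HOUR_OF_AUDIO[model]


def speedup(fast: str, slow: str) -> float:
    """How many times quicker `fast` finishes the same file than `slow`."""
    return SECONDS_PER_HOUR_OF_AUDIO[slow] / SECONDS_PER_HOUR_OF_AUDIO[fast]


for name in SECONDS_PER_HOUR_OF_AUDIO:
    print(f"{name:<15} {realtime_factor(name):6.1f}x real time")

print(f"turbo vs large-v3: {speedup('large-v3-turbo', 'large-v3'):.1f}x faster")
```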

Step 3: Transcribe Something

Create a file called transcribe.py:

from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

segments, info = model.transcribe("recording.mp3", beam_size=5)

print(f"Detected language: {info.language} ({info.language_probability:.0%})")
print()

for segment in segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text.strip()}")

Run it:

python transcribe.py

No GPU? Change device="cuda" to device="cpu" and compute_type="float16" to compute_type="int8". Use a smaller model like small or medium.

Apple Silicon? Use device="cpu" and compute_type="int8". CTranslate2, the engine behind faster-whisper, has no Metal backend, so transcription runs on the CPU rather than the GPU.

The model downloads automatically on first run (about 1.5 GB for large-v3-turbo). After that, it loads from cache in seconds.

Step 4: Get Useful Output

Plain timestamps are fine, but you probably want subtitle files or clean text. Here are three practical output scripts.

SRT subtitles (for video):

from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
segments, info = model.transcribe("video.mp4", beam_size=5)

# UTF-8 keeps accented and non-Latin characters intact on every platform
with open("subtitles.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, 1):
        # SRT timestamps are HH:MM:SS,mmm with a comma before the milliseconds
        start = f"{int(seg.start//3600):02d}:{int(seg.start%3600//60):02d}:{seg.start%60:06.3f}".replace(".", ",")
        end = f"{int(seg.end//3600):02d}:{int(seg.end%3600//60):02d}:{seg.end%60:06.3f}".replace(".", ",")
        f.write(f"{i}\n{start} --> {end}\n{seg.text.strip()}\n\n")
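
The inline timestamp formatting in the script is compact but has one edge case: a value such as 59.9996 seconds is rounded by the format string to 60.000 without carrying into the minutes field. A small helper that rounds to whole milliseconds first avoids that. `srt_timestamp` and `srt_cue` are hypothetical helper names for this sketch, not faster-whisper functions.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm for an SRT cue."""
    # Round once to whole milliseconds so 59.9996 becomes 00:01:00,000
    # rather than 00:00:60,000
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one numbered SRT block, terminated by a blank line."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n\n"


print(srt_cue(1, 0.0, 2.5, " Hello there."), end="")
```

Inside the subtitle loop, this would be called as `f.write(srt_cue(i, seg.start, seg.end, seg.text))`.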

Clean transcript (meeting notes, articles):

from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
segments, info = model.transcribe("meeting.mp3", beam_size=5, vad_filter=True)

text = " ".join(seg.text.strip() for seg in segments)
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write(text)

The vad_filter=True flag enables voice activity detection — it skips silence, background noise, and music, giving you cleaner output and faster processing.
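
A single space-joined block of text can be hard to read for a long meeting. One option, sketched here, is to start a new paragraph wherever the gap between two segments exceeds a threshold. This assumes the segment timestamps still refer to the original audio timeline; `to_paragraphs`, the `Seg` stand-in class, and the 2-second default are all illustrative choices, not library features.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Seg:
    """Stand-in with the three segment fields this sketch reads."""
    start: float
    end: float
    text: str


def to_paragraphs(segments: Iterable, pause: float = 2.0) -> List[str]:
    """Join segment texts, starting a new paragraph after a long pause."""
    paragraphs: List[str] = []
    current: List[str] = []
    last_end = None
    for seg in segments:
        # A gap of at least `pause` seconds closes the current paragraph
        if current and last_end is not None and seg.start - last_end >= pause:
            paragraphs.append(" ".join(current))
            current = []
        current.append(seg.text.strip())
        last_end = seg.end
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


demo = [
    Seg(0.0, 3.0, " Welcome everyone."),
    Seg(3.2, 6.0, " Let's get started."),
    Seg(9.5, 12.0, " First item on the agenda."),
]
print("\n\n".join(to_paragraphs(demo)))
```

In the transcript script, `to_paragraphs(segments)` could take the place of the single `" ".join(...)` call.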

Word-level timestamps (for precise alignment):

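from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
# word_timestamps=True attaches a list of timed words to every segment
segments, info = model.transcribe("audio.mp3", word_timestamps=True)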

for segment in segments:
    for word in segment.words:
        print(f"{word.start:.2f} - {word.end:.2f}: {word.word}")
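
Word-level timing is most useful once the words are regrouped into something displayable. As a rough sketch, the function below packs consecutive words into short caption cues. The 42-character limit is a common subtitling guideline used here as an assumption, and `group_words` and the `W` stand-in class are hypothetical names rather than anything faster-whisper provides.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class W:
    """Stand-in for a timed word: start, end, and the word text."""
    start: float
    end: float
    word: str


def group_words(words: Iterable, max_chars: int = 42) -> List[Tuple[float, float, str]]:
    """Pack consecutive words into (start, end, text) cues of limited length."""
    cues = []
    text, cue_start, cue_end = "", None, None
    for w in words:
        token = w.word.strip()
        candidate = f"{text} {token}" if text else token
        if text and len(candidate) > max_chars:
            # Current cue is full: emit it and start a new one with this word
            cues.append((cue_start, cue_end, text))
            text, cue_start = token, w.start
        else:
            if not text:
                cue_start = w.start
            text = candidate
        cue_end = w.end
    if text:
        cues.append((cue_start, cue_end, text))
    return cues


demo_words = [W(0.0, 0.4, " The"), W(0.4, 0.8, " quick"), W(0.8, 1.2, " brown"), W(1.2, 1.5, " fox")]
for start, end, text in group_words(demo_words, max_chars=10):
    print(f"{start:.2f} - {end:.2f}: {text}")
```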

Step 5: Batch Processing

If you have a folder of recordings to process:

#!/bin/bash
mkdir -p transcripts

for file in recordings/*.mp3 recordings/*.wav recordings/*.m4a; do
    [ -f "$file" ] || continue
    name=$(basename "$file" | sed 's/\.[^.]*$//')
    echo "Transcribing: $file"
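    python - "$file" "transcripts/${name}.txt" <<'PY'
import sys
from faster_whisper import WhisperModel

# Paths arrive as arguments, so filenames containing quotes or spaces are safe
audio_path, out_path = sys.argv[1], sys.argv[2]
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
segments, _ = model.transcribe(audio_path, beam_size=5, vad_filter=True)
with open(out_path, "w", encoding="utf-8") as f:
    for seg in segments:
        f.write(f"[{seg.start:.1f}s] {seg.text.strip()}\n")
PY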
done
echo "Done."

For heavy batch work, faster-whisper also supports batched inference that processes multiple audio chunks in parallel — useful if you have a beefy GPU and dozens of files to churn through.
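
Here is a rough sketch of what that could look like in pure Python, loading the model once for the whole folder instead of once per file. It assumes a recent faster-whisper release that exports `BatchedInferencePipeline`; the helper names, the `batch_size=16` default, and the folder layout are illustrative choices. The folder loop takes the transcriber as an argument so it can be tried out with a stub before pointing it at a GPU.

```python
from pathlib import Path
from typing import Callable, Iterable, List

AUDIO_SUFFIXES = {".mp3", ".wav", ".m4a"}


def make_batched_transcriber(batch_size: int = 16) -> Callable[[str], Iterable]:
    """Load the model once and return a function that yields segments."""
    # Imported lazily so the folder logic below works without the library
    from faster_whisper import BatchedInferencePipeline, WhisperModel

    model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
    pipeline = BatchedInferencePipeline(model=model)

    def transcribe(path: str) -> Iterable:
        segments, _info = pipeline.transcribe(path, batch_size=batch_size)
        return segments

    return transcribe


def transcribe_folder(src: Path, dst: Path, transcribe: Callable[[str], Iterable]) -> List[Path]:
    """Write one timestamped .txt per audio file in src; return the outputs."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for audio in sorted(src.iterdir()):
        if audio.suffix.lower() not in AUDIO_SUFFIXES:
            continue  # skip anything that is not a known audio type
        out = dst / f"{audio.stem}.txt"
        with open(out, "w", encoding="utf-8") as f:
            for seg in transcribe(str(audio)):
                f.write(f"[{seg.start:.1f}s] {seg.text.strip()}\n")
        written.append(out)
    return written
```

A typical call would be `transcribe_folder(Path("recordings"), Path("transcripts"), make_batched_transcriber())`. Larger batch sizes use more VRAM, so the right value depends on your card.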

Add a Web Interface

If you prefer clicking over typing, Whisper-WebUI gives you a browser-based interface with drag-and-drop file upload, real-time progress, and output format selection. It uses faster-whisper under the hood.

git clone https://github.com/jhj0517/Whisper-WebUI.git
cd Whisper-WebUI
pip install -r requirements.txt
python app.py

Open http://localhost:7860 in your browser. Upload a file, pick a model, hit transcribe. It also handles YouTube URLs, microphone input, and speaker diarization — identifying who said what in multi-speaker recordings.

What You Save

Otter.ai Pro: $16.99/month, 1,200 minutes. Otter.ai Business: $30/user/month. Rev AI: $0.25/minute. For a team transcribing 20 hours of meetings per month, that’s $300 on Rev or $30+ per person on Otter.

This setup: $0/month. Unlimited minutes. The electricity cost is negligible — an RTX 3060 draws 25-170W during transcription, and each hour of audio takes under 3 minutes to process.
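
For a sense of scale, the arithmetic behind these figures can be worked through in a few lines. The prices and wattage come from this section; the electricity tariff of $0.30 per kWh is purely an assumption, so substitute your own. Only GPU draw is counted, not the rest of the machine.

```python
REV_PER_MINUTE = 0.25            # USD per audio minute, as quoted above
GPU_WATTS_MAX = 170              # upper end of the RTX 3060 range quoted above
PROCESS_MIN_PER_AUDIO_HOUR = 3   # "under 3 minutes", taken as a worst case
PRICE_PER_KWH = 0.30             # ASSUMPTION: replace with your own tariff


def rev_cost(audio_hours: float) -> float:
    """Per-minute billing for the given hours of audio."""
    return audio_hours * 60 * REV_PER_MINUTE


def local_electricity_cost(audio_hours: float) -> float:
    """GPU energy cost of transcribing the same audio locally."""
    gpu_hours = audio_hours * PROCESS_MIN_PER_AUDIO_HOUR / 60
    kwh = gpu_hours * GPU_WATTS_MAX / 1000
    return kwh * PRICE_PER_KWH


hours = 20
print(f"Rev, {hours} h of audio:   ${rev_cost(hours):.2f}")
print(f"Local, {hours} h of audio: ${local_electricity_cost(hours):.3f} in electricity")
```

Under these assumptions, 20 hours of meetings keep the GPU busy for about one hour, which is roughly 0.17 kWh.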

But the cost savings are secondary to what really matters: your audio never leaves your machine. Legal depositions, medical dictation, confidential interviews, proprietary meeting recordings — none of it gets uploaded to a third-party server. There’s no data retention policy to parse, no GDPR cross-border transfer concern, no risk of a cloud provider breach exposing your recordings. The privacy guarantee is architectural, not contractual.

What You Lose

The accuracy gap between local Whisper and cloud transcription services has narrowed dramatically. Large-v3 hits a 2.0% word error rate on clean English, which is competitive with most commercial offerings.

Where cloud services still have an edge: speaker diarization (knowing who said what) works but isn’t perfect locally, real-time live transcription needs beefy hardware to keep up, and heavy accents or domain-specific jargon can trip up the stock models. Cloud services with custom vocabularies handle niche terminology slightly better.

For most practical use cases — meeting recordings, podcast transcription, lecture notes, interview processing — local Whisper delivers results that are more than good enough. And the models keep improving. Large-v3-turbo didn’t exist a year ago.

Going Further

Translate while transcribing. Whisper can translate any of its 99 supported languages directly to English during transcription. Add task="translate" to your transcribe call.
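
A minimal sketch of a translation call follows. `translate_to_english` is a hypothetical wrapper that accepts any object with faster-whisper's `transcribe()` method. One caveat worth checking against current documentation: OpenAI's notes on the turbo model suggest it was not trained for translation, so the example in the docstring assumes large-v3 instead.

```python
def translate_to_english(model, path: str) -> str:
    """Return an English rendering of the speech in `path`.

    `model` is any object with faster-whisper's transcribe() method, e.g.
    WhisperModel("large-v3", device="cuda", compute_type="float16").
    """
    # task="translate" asks Whisper for English output instead of a
    # transcript in the spoken language
    segments, info = model.transcribe(path, task="translate", beam_size=5)
    print(f"Source language: {info.language}")
    return " ".join(seg.text.strip() for seg in segments)
```

The detected source language is still reported in `info.language`, which is handy for logging what was translated.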

Combine with a local LLM. Pipe your transcript into Ollama to generate meeting summaries, extract action items, or rewrite rough dictation into polished text — all without any data leaving your machine.
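
As a sketch of that hand-off, the snippet below posts a transcript to Ollama's local HTTP API using only the standard library. It assumes Ollama is running on its default port 11434 and that a model has already been pulled; the model name `llama3.1` and the prompt wording are placeholders, not recommendations.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_summary_request(transcript: str, model: str = "llama3.1") -> dict:
    """Build the JSON body for a single, non-streamed completion."""
    prompt = (
        "Summarize this meeting transcript in five bullet points, "
        "then list action items with owners.\n\n" + transcript
    )
    # stream=False asks for one JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}


def summarize(transcript: str, model: str = "llama3.1") -> str:
    """Send the transcript to a locally running Ollama server."""
    body = json.dumps(build_summary_request(transcript, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the transcript from Step 4 on disk, `summarize(open("transcript.txt", encoding="utf-8").read())` would return the summary as a string. Very long transcripts may exceed the model's context window and need to be split first.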

Set up as a service. The Docker one-liner below runs faster-whisper as an API server compatible with OpenAI’s transcription endpoint. Point any app that supports the Whisper API at localhost:9000 and it works transparently:

docker run -d -p 9000:9000 \
    -e ASR_MODEL=large-v3-turbo \
    -e ASR_ENGINE=faster_whisper \
    --gpus all \
    onerahmet/openai-whisper-asr-webservice:latest-gpu

Your recordings. Your hardware. Your transcripts. Nobody else needs to be involved.