ElevenLabs charges $22/month for voice cloning. Otter.ai charges $17/month for transcription. Both send your audio to their servers. Running speech models locally gives you unlimited voice cloning, unlimited transcription, and complete privacy - at no cost beyond the GPU you already own.
This guide covers both halves of local speech: speech-to-text (STT / transcription) and text-to-speech (TTS / voice generation). The two tasks have very different hardware requirements, and you can often run both simultaneously.
Part 1: Speech-to-Text (Transcription)
The Models
OpenAI’s Whisper dominates local STT. It’s open-source, supports 99 languages, and has been optimized by the community into multiple fast variants. Here’s how the sizes compare:
| Model | Parameters | VRAM | WER (English) | Speed vs. Real-Time |
|---|---|---|---|---|
| Whisper tiny | 39M | ~1 GB | ~12% | 30x+ |
| Whisper base | 74M | ~1 GB | ~10% | 20x+ |
| Whisper small | 244M | ~2 GB | ~8% | 10x+ |
| Whisper medium | 769M | ~5 GB | ~7% | 5x+ |
| Whisper large-v3-turbo | 809M | ~6 GB | ~7.75% | 6x faster than large-v3 |
| Whisper large-v3 | 1.5B | ~10 GB | ~7.88% | 1x (baseline) |
Word Error Rate (WER) is measured on standard English benchmarks. Lower is better. For reference, commercial services like AssemblyAI’s Universal-2 hit about 6.68% WER.
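Under the hood, WER is just word-level edit distance divided by the number of reference words. A minimal sketch of the calculation (pure Python; real benchmark scoring also normalizes text - casing, punctuation, number formatting - before comparing):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic single-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,         # deletion
                      d[j - 1] + 1,     # insertion
                      prev + (r != h))  # substitution (free if words match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)

# One dropped word out of six reference words:
print(wer("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```

This is why WER can exceed 100% on pathological output: insertions count as errors but don't grow the reference length.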
The Recommendation: Whisper large-v3-turbo
For nearly everyone, large-v3-turbo is the answer. It achieves almost identical accuracy to the full large-v3 (7.75% vs 7.88% WER) while being 6x faster, by reducing decoder layers from 32 to 4. At ~6GB VRAM, it fits on any 8GB+ GPU with room to spare.
Use Faster-Whisper for the best performance - it’s a CTranslate2-optimized implementation that’s 4x faster than OpenAI’s original code with the same accuracy.
STT by VRAM Tier
| Tier | Best Whisper Model | VRAM Used | Room for TTS? |
|---|---|---|---|
| 8GB | large-v3-turbo | ~6 GB | Tight - Piper/Kokoro only |
| 12GB | large-v3-turbo | ~6 GB | Yes - Chatterbox Turbo fits |
| 16GB | large-v3 (full) | ~10 GB | Yes - most TTS models |
| 24GB | large-v3 (full) | ~10 GB | Yes - everything fits |
| 32GB | large-v3 (full) | ~10 GB | Yes - full pipeline + LLM |
We have a detailed Whisper setup guide: How to Self-Host Whisper: Replace Otter.ai.
Part 2: Text-to-Speech (Voice Generation)
TTS is where the real variety is. Models range from 82M parameters (runs on a potato) to 1.7B (needs a real GPU), with dramatic differences in quality, speed, and voice cloning capability.
The Landscape
| Model | Params | VRAM | Voice Clone? | Languages | Speed | Quality |
|---|---|---|---|---|---|---|
| Kokoro-82M | 82M | <1 GB | No | 8 | 96x real-time (GPU) | Good |
| Piper | Varies | <1 GB | No (pre-trained) | 20+ | Real-time on CPU | Good |
| Chatterbox Turbo | 350M | ~4 GB | Yes | English | Sub-200ms latency | Excellent |
| Chatterbox Original | 500M | ~6 GB | Yes | English | Moderate | Excellent |
| Chatterbox Multi | 500M | ~6 GB | Yes | 10+ | Moderate | Excellent |
| F5-TTS | ~335M | ~6 GB | Yes (zero-shot) | Multi | Sub-7s processing | Excellent |
| Qwen3-TTS 0.6B | 600M | 4-6 GB | Yes (3s sample) | 10 | 97ms latency | Very Good |
| Qwen3-TTS 1.7B | 1.7B | 6-8 GB | Yes (3s sample) | 10 | 97ms latency | Excellent |
Model Breakdown
Kokoro-82M - The speed king. At just 82M parameters, this model runs at 96x real-time on a basic GPU and 3-5x real-time on CPU alone. It ranked #1 on the TTS Spaces Arena, beating models 5-15x its size. No voice cloning, but the built-in voices are natural enough for most uses. If you just need to read text aloud, this is it.
Piper - The edge champion. Piper runs on a Raspberry Pi at real-time speed using ~500MB RAM. It’s purpose-built for home automation (reading notifications, smart home responses) and doesn’t need a GPU at all. Quality is a step below the neural models, but the zero-VRAM requirement means you can run it alongside anything.
Chatterbox - The quality leader. Resemble AI’s Chatterbox family beat ElevenLabs in blind evaluations with a 63.75% preference rate. The Turbo variant (350M) is the sweet spot - one-step generation at sub-200ms latency with voice cloning from a short sample. MIT licensed.
F5-TTS - The voice cloning specialist. F5-TTS does zero-shot voice cloning without fine-tuning - give it a reference clip and it reproduces the voice. At ~6GB VRAM it’s accessible on 8GB GPUs, and the cloning quality rivals commercial services.
Qwen3-TTS - The multilingual powerhouse. Alibaba’s TTS family supports 10 languages with 3-second voice cloning and streaming output. The 1.7B model at 6-8GB VRAM beats ElevenLabs and MiniMax on quality benchmarks. The 0.6B variant at 4-6GB brings the same capabilities to lower hardware.
TTS by VRAM Tier {#tts-tiers}
8GB VRAM {#8gb}
GPUs: RTX 4060, RTX 3060 8GB, RTX 3070
| Combo | STT | TTS | Total VRAM | Use Case |
|---|---|---|---|---|
| Speed focus | Whisper turbo (6GB) | Kokoro-82M (<1GB) | ~7 GB | Fast transcription + readback |
| Voice clone | Whisper medium (5GB) | Chatterbox Turbo (4GB) | ~9 GB | Won’t fit simultaneously |
| Multilingual | Whisper turbo (6GB) | Piper (<1GB) | ~7 GB | 20+ language support |
Best combo: Whisper large-v3-turbo + Kokoro-82M. Both fit comfortably in 8GB and give you near-commercial-quality transcription with fast, natural readback. If you need voice cloning, run Chatterbox Turbo alone (the STT model needs to be unloaded first) or drop to Whisper small (~2GB), which does fit alongside Chatterbox Turbo at ~6GB total.
TTS-only pick: Qwen3-TTS 0.6B at 4-6GB gives you voice cloning in 10 languages on an 8GB card, though you won’t have room for simultaneous Whisper.
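These combos are just VRAM arithmetic. A tiny helper to sanity-check a pairing against your card - the sizes are the rough figures from the tables above, and the headroom term is an assumption standing in for CUDA context and activation buffers:

```python
# Approximate VRAM footprints (GB), taken from the tables above.
VRAM_GB = {
    "whisper-large-v3-turbo": 6, "whisper-large-v3": 10,
    "whisper-medium": 5, "whisper-small": 2,
    "kokoro-82m": 1, "chatterbox-turbo": 4,
    "qwen3-tts-0.6b": 5, "qwen3-tts-1.7b": 7,
}

def fits(card_gb: float, *models: str, headroom_gb: float = 1.0) -> bool:
    """True if the models fit together, leaving headroom for buffers."""
    return sum(VRAM_GB[m] for m in models) + headroom_gb <= card_gb

print(fits(8, "whisper-large-v3-turbo", "kokoro-82m"))  # True: 6 + 1 + 1 <= 8
print(fits(8, "whisper-medium", "chatterbox-turbo"))    # False: 5 + 4 + 1 > 8
```

The same check explains every "won't fit" note in the tier tables below.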
12GB VRAM {#12gb}
GPUs: RTX 3060 12GB, RTX 4070
Now things get comfortable. You can run Whisper and a voice cloning TTS model simultaneously.
| Combo | STT | TTS | Total VRAM |
|---|---|---|---|
| Best quality | Whisper turbo (6GB) | Chatterbox Turbo (4GB) | ~10 GB |
| Multilingual clone | Whisper turbo (6GB) | Qwen3-TTS 0.6B (5GB) | ~11 GB |
| Max quality TTS | Whisper medium (5GB) | Qwen3-TTS 1.7B (7GB) | ~12 GB |
Best combo: Whisper turbo + Chatterbox Turbo. Commercial-grade transcription and voice cloning running simultaneously with 2GB headroom.
16GB VRAM {#16gb}
GPUs: RTX 4060 Ti 16GB, RTX 5060, Arc A770
| Combo | STT | TTS | Total VRAM |
|---|---|---|---|
| Full pipeline | Whisper large-v3 (10GB) | Chatterbox Turbo (4GB) | ~14 GB |
| Voice assistant | Whisper turbo (6GB) | Qwen3-TTS 1.7B (7GB) | ~13 GB |
| Everything + LLM | Whisper turbo (6GB) | Kokoro (<1GB) | ~7 GB (+9GB for LLM) |
Best combo: Whisper large-v3 (full quality) + Chatterbox Turbo. The full-size Whisper model is marginally more accurate than turbo on difficult audio (accented speech, background noise), and 16GB gives you room to run both.
Voice assistant setup: Run Whisper turbo + Kokoro (~7GB combined) and spend the remaining ~9GB on a small LLM to build a complete voice-in, AI-process, voice-out pipeline. Whisper transcribes your speech, the LLM generates a response, Kokoro reads it back. All local, all real-time.
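The pipeline itself is three stages glued together. A sketch with the stages injected as plain callables, so the wiring stands on its own - in practice `transcribe` would wrap faster-whisper, `respond` your local LLM, and `speak` Kokoro; those bindings are assumptions and not shown here:

```python
from typing import Callable

def voice_turn(
    transcribe: Callable[[str], str],  # audio path -> text (e.g. Whisper)
    respond: Callable[[str], str],     # text -> reply (e.g. a local LLM)
    speak: Callable[[str], bytes],     # reply -> audio (e.g. Kokoro)
    audio_path: str,
) -> bytes:
    """One assistant turn: speech in, AI-processed speech out."""
    text = transcribe(audio_path)
    reply = respond(text)
    return speak(reply)

# Stub stages, just to show the flow end to end:
audio = voice_turn(
    transcribe=lambda path: "what time is it",
    respond=lambda q: f"You asked: {q}",
    speak=lambda reply: reply.encode(),
    audio_path="mic_capture.wav",
)
print(audio)  # b'You asked: what time is it'
```

Keeping the stages decoupled like this also makes it trivial to swap Kokoro for Chatterbox later without touching the rest of the loop.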
24GB VRAM {#24gb}
GPUs: RTX 3090, RTX 4090
Everything fits. The question is what to do with the extra room.
| Combo | STT | TTS | LLM | Total VRAM |
|---|---|---|---|---|
| Voice assistant | Whisper turbo (6GB) | Chatterbox Turbo (4GB) | Qwen 3 8B (6.5GB) | ~16.5 GB |
| Power pipeline | Whisper large-v3 (10GB) | Qwen3-TTS 1.7B (7GB) | - | ~17 GB |
| Full stack | Whisper turbo (6GB) | Kokoro (<1GB) | Qwen 3 14B (10.7GB) | ~18 GB |
Best combo: Whisper turbo + Chatterbox Turbo + Qwen 3 8B. A complete voice assistant running locally - speak your question, get an AI response read back in a cloned voice. Total VRAM under 17GB, leaving 7GB free.
32GB VRAM {#32gb}
GPUs: RTX 5090
| Combo | STT | TTS | LLM | Total VRAM |
|---|---|---|---|---|
| Premium assistant | Whisper large-v3 (10GB) | Qwen3-TTS 1.7B (7GB) | Qwen 3 14B (10.7GB) | ~28 GB |
| Speed focus | Whisper turbo (6GB) | Chatterbox Turbo (4GB) | GPT-OSS 20B (14GB) | ~24 GB |
Best combo: Whisper large-v3 + Qwen3-TTS 1.7B + Qwen 3 14B. Maximum quality across all three components. Ten-language voice cloning, accurate transcription, and a genuinely smart conversational AI - all running on a single GPU with 4GB to spare.
Cross-Tier Summary
Best STT by Tier
| Tier | Model | WER | Speed |
|---|---|---|---|
| 8GB+ | Whisper large-v3-turbo | ~7.75% | 6x faster than large-v3 |
| 16GB+ | Whisper large-v3 | ~7.88% | Baseline |
Best TTS by Tier
| Tier | Model | Voice Clone | Quality | Speed |
|---|---|---|---|---|
| Any (CPU) | Piper | No | Good | Real-time on Pi |
| Any (CPU/GPU) | Kokoro-82M | No | Good | 96x real-time |
| 8GB | Qwen3-TTS 0.6B | Yes | Very Good | 97ms latency |
| 12GB+ | Chatterbox Turbo | Yes | Excellent | Sub-200ms |
| 16GB+ | Qwen3-TTS 1.7B | Yes | Excellent | 97ms latency |
Quick Start
Transcription with Faster-Whisper

```bash
pip install faster-whisper
```

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda")
segments, info = model.transcribe("meeting.mp3")
for segment in segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
```
TTS with Chatterbox

```bash
pip install chatterbox-tts
```

```python
import torchaudio

from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
wav = model.generate(
    "Hello, this is running entirely on my own GPU.",
    audio_prompt_path="reference_voice.wav",  # voice to clone
)
torchaudio.save("output.wav", wav, model.sr)
```
TTS with Kokoro (zero GPU needed)

```bash
pip install kokoro-onnx
```

```python
import soundfile as sf

from kokoro_onnx import Kokoro

# Works on CPU - no GPU required
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
audio, sr = kokoro.create("Your text here", voice="af_heart")
sf.write("output.wav", audio, sr)
```
For complete setup guides with Docker, speaker identification, and production configurations: