Best Local Speech Models in 2026: TTS and STT on Every GPU Tier

Voice cloning, transcription, and text-to-speech without the cloud. Whisper, Chatterbox, Qwen3-TTS, Piper, and Kokoro tested from 8GB to 32GB VRAM.



ElevenLabs charges $22/month for voice cloning. Otter.ai charges $17/month for transcription. Both send your audio to their servers. Running speech models locally gives you unlimited voice cloning, unlimited transcription, and complete privacy - all for the one-time cost of the GPU you already own.

This guide covers both halves of local speech: speech-to-text (STT / transcription) and text-to-speech (TTS / voice generation). The two tasks have very different hardware requirements, and you can often run both simultaneously.

Part 1: Speech-to-Text (Transcription)

The Models

OpenAI’s Whisper dominates local STT. It’s open-source, supports 99 languages, and has been optimized by the community into multiple fast variants. Here’s how the sizes compare:

| Model | Parameters | VRAM | WER (English) | Speed vs. Real-Time |
|---|---|---|---|---|
| Whisper tiny | 39M | ~1 GB | ~12% | 30x+ |
| Whisper base | 74M | ~1 GB | ~10% | 20x+ |
| Whisper small | 244M | ~2 GB | ~8% | 10x+ |
| Whisper medium | 769M | ~5 GB | ~7% | 5x+ |
| Whisper large-v3-turbo | 809M | ~6 GB | ~7.75% | 6x faster than large-v3 |
| Whisper large-v3 | 1.5B | ~10 GB | ~7.88% | 1x (baseline) |

Word Error Rate (WER) is measured on standard English benchmarks. Lower is better. For reference, commercial services like AssemblyAI’s Universal-2 hit about 6.68% WER.
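Concretely, WER is the word-level edit distance between the model's transcript and a reference transcript, divided by the number of reference words. A minimal sketch of the metric (illustrative only - published benchmark WER also normalizes text for casing and punctuation before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length,
    computed as a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```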

The Recommendation: Whisper large-v3-turbo

For nearly everyone, large-v3-turbo is the answer. It achieves almost identical accuracy to the full large-v3 (7.75% vs 7.88% WER) while being 6x faster, by reducing decoder layers from 32 to 4. At ~6GB VRAM, it fits on any 8GB+ GPU with room to spare.

Use Faster-Whisper for the best performance - it’s a CTranslate2-optimized implementation that’s 4x faster than OpenAI’s original code with the same accuracy.

STT by VRAM Tier

| Tier | Best Whisper Model | VRAM Used | Room for TTS? |
|---|---|---|---|
| 8GB | large-v3-turbo | ~6 GB | Tight - Piper/Kokoro only |
| 12GB | large-v3-turbo | ~6 GB | Yes - Chatterbox Turbo fits |
| 16GB | large-v3 (full) | ~10 GB | Yes - most TTS models |
| 24GB | large-v3 (full) | ~10 GB | Yes - everything fits |
| 32GB | large-v3 (full) | ~10 GB | Yes - full pipeline + LLM |
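The tier pairings in this guide come down to simple addition against a VRAM budget. A hypothetical helper using the approximate footprints quoted in this article (these are rough figures - actual usage varies with precision, batch size, and implementation):

```python
# Approximate VRAM footprints (GB) as quoted in this guide; real-world usage
# varies with quantization, batch size, and runtime.
VRAM_GB = {
    "whisper-large-v3-turbo": 6, "whisper-large-v3": 10, "whisper-medium": 5,
    "kokoro-82m": 1, "chatterbox-turbo": 4, "qwen3-tts-0.6b": 5, "qwen3-tts-1.7b": 7,
}

def fits(tier_gb: float, *models: str, headroom_gb: float = 0.5) -> bool:
    """True if the listed models fit the tier's VRAM with a little headroom."""
    return sum(VRAM_GB[m] for m in models) + headroom_gb <= tier_gb

print(fits(8, "whisper-large-v3-turbo", "kokoro-82m"))        # True: 6 + 1 fits in 8GB
print(fits(8, "whisper-large-v3-turbo", "chatterbox-turbo"))  # False: 6 + 4 > 8GB
```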

We have a detailed Whisper setup guide: How to Self-Host Whisper: Replace Otter.ai.

Part 2: Text-to-Speech (Voice Generation)

TTS is where the real variety is. Models range from 82M parameters (runs on a potato) to 1.7B (needs a real GPU), with dramatic differences in quality, speed, and voice cloning capability.

The Landscape

| Model | Params | VRAM | Voice Clone? | Languages | Speed | Quality |
|---|---|---|---|---|---|---|
| Kokoro-82M | 82M | <1 GB | No | 8 | 96x real-time (GPU) | Good |
| Piper | Varies | <1 GB | No (pre-trained) | 20+ | Real-time on CPU | Good |
| Chatterbox Turbo | 350M | ~4 GB | Yes | English | Sub-200ms latency | Excellent |
| Chatterbox Original | 500M | ~6 GB | Yes | English | Moderate | Excellent |
| Chatterbox Multi | 500M | ~6 GB | Yes | 10+ | Moderate | Excellent |
| F5-TTS | ~335M | ~6 GB | Yes (zero-shot) | Multi | Sub-7s processing | Excellent |
| Qwen3-TTS 0.6B | 600M | 4-6 GB | Yes (3s sample) | 10 | 97ms latency | Very Good |
| Qwen3-TTS 1.7B | 1.7B | 6-8 GB | Yes (3s sample) | 10 | 97ms latency | Excellent |

Model Breakdown

Kokoro-82M - The speed king. At just 82M parameters, this model runs at 96x real-time on a basic GPU and 3-5x real-time on CPU alone. It ranked #1 on the TTS Spaces Arena, beating models 5-15x its size. No voice cloning, but the built-in voices are natural enough for most uses. If you just need to read text aloud, this is it.

Piper - The edge champion. Piper runs on a Raspberry Pi at real-time speed using ~500MB RAM. It’s purpose-built for home automation (reading notifications, smart home responses) and doesn’t need a GPU at all. Quality is a step below the neural models, but the zero-VRAM requirement means you can run it alongside anything.

Chatterbox - The quality leader. Resemble AI’s Chatterbox family beat ElevenLabs in blind evaluations with a 63.75% preference rate. The Turbo variant (350M) is the sweet spot - one-step generation at sub-200ms latency with voice cloning from a short sample. MIT licensed.

F5-TTS - The voice cloning specialist. F5-TTS does zero-shot voice cloning without fine-tuning - give it a reference clip and it reproduces the voice. At ~6GB VRAM it’s accessible on 8GB GPUs, and the cloning quality rivals commercial services.

Qwen3-TTS - The multilingual powerhouse. Alibaba’s TTS family supports 10 languages with 3-second voice cloning and streaming output. The 1.7B model at 6-8GB VRAM beats ElevenLabs and MiniMax on quality benchmarks. The 0.6B variant at 4-6GB brings the same capabilities to lower hardware.

TTS by VRAM Tier {#tts-tiers}

8GB VRAM {#8gb}

GPUs: RTX 4060, RTX 3060 8GB, RTX 3070

| Combo | STT | TTS | Total VRAM | Use Case |
|---|---|---|---|---|
| Speed focus | Whisper turbo (6GB) | Kokoro-82M (<1GB) | ~7 GB | Fast transcription + readback |
| Voice clone | Whisper medium (5GB) | Chatterbox Turbo (4GB) | ~9 GB | Won't fit simultaneously |
| Multilingual | Whisper turbo (6GB) | Piper (<1GB) | ~7 GB | 20+ language support |

Best combo: Whisper large-v3-turbo + Kokoro-82M. Both fit comfortably in 8GB and give you near-commercial-quality transcription with fast, natural readback. If you need voice cloning, run Chatterbox Turbo alone (the STT model needs to be unloaded first) or drop to Whisper medium.

TTS-only pick: Qwen3-TTS 0.6B at 4-6GB gives you voice cloning in 10 languages on an 8GB card, though you won’t have room for simultaneous Whisper.

12GB VRAM {#12gb}

GPUs: RTX 3060 12GB, RTX 4070

Now things get comfortable. You can run Whisper and a voice cloning TTS model simultaneously.

| Combo | STT | TTS | Total VRAM |
|---|---|---|---|
| Best quality | Whisper turbo (6GB) | Chatterbox Turbo (4GB) | ~10 GB |
| Multilingual clone | Whisper turbo (6GB) | Qwen3-TTS 0.6B (5GB) | ~11 GB |
| Max quality TTS | Whisper medium (5GB) | Qwen3-TTS 1.7B (7GB) | ~12 GB |

Best combo: Whisper turbo + Chatterbox Turbo. Commercial-grade transcription and voice cloning running simultaneously with 2GB headroom.

16GB VRAM {#16gb}

GPUs: RTX 4060 Ti 16GB, RTX 5060, Arc A770

| Combo | STT | TTS | Total VRAM |
|---|---|---|---|
| Full pipeline | Whisper large-v3 (10GB) | Chatterbox Turbo (4GB) | ~14 GB |
| Voice assistant | Whisper turbo (6GB) | Qwen3-TTS 1.7B (7GB) | ~13 GB |
| Everything + LLM | Whisper turbo (6GB) | Kokoro (<1GB) | ~7 GB (+9GB for LLM) |

Best combo: Whisper large-v3 (full quality) + Chatterbox Turbo. The full-size Whisper model is marginally more accurate than turbo on difficult audio (accented speech, background noise), and 16GB gives you room to run both.

Voice assistant setup: Pair Whisper turbo with Kokoro (~7GB combined) and spend the remaining ~9GB on a small LLM - weights plus context - to build a complete voice-in, AI-process, voice-out pipeline. Whisper transcribes your speech, the LLM generates a response, Kokoro reads it back. All local, all real-time.
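The control flow of that pipeline is straightforward. A skeleton with stand-in stage functions (`transcribe`, `respond`, and `speak` here are placeholders, not real APIs - wire in Faster-Whisper, your LLM server, and kokoro-onnx where the comments indicate):

```python
# Voice-assistant loop skeleton. Each stage function is a placeholder:
# swap in faster-whisper, a local LLM client, and kokoro-onnx as noted.
def transcribe(audio: bytes) -> str:
    # placeholder for WhisperModel.transcribe(...) on captured audio
    return audio.decode("utf-8")

def respond(prompt: str) -> str:
    # placeholder for a local LLM call (e.g. an OpenAI-compatible endpoint)
    return f"You said: {prompt}"

def speak(text: str) -> bytes:
    # placeholder for Kokoro.create(...) returning audio samples
    return text.encode("utf-8")

def assistant_turn(audio_in: bytes) -> bytes:
    """One voice-in, voice-out turn: STT -> LLM -> TTS."""
    text = transcribe(audio_in)
    reply = respond(text)
    return speak(reply)

print(assistant_turn(b"what time is it"))  # b'You said: what time is it'
```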

24GB VRAM {#24gb}

GPUs: RTX 3090, RTX 4090

Everything fits. The question is what to do with the extra room.

| Combo | STT | TTS | LLM | Total VRAM |
|---|---|---|---|---|
| Voice assistant | Whisper turbo (6GB) | Chatterbox (6GB) | Qwen 3 8B (6.5GB) | ~18.5 GB |
| Power pipeline | Whisper large-v3 (10GB) | Qwen3-TTS 1.7B (7GB) | - | ~17 GB |
| Full stack | Whisper turbo (6GB) | Kokoro (<1GB) | Qwen 3 14B (10.7GB) | ~18 GB |

Best combo: Whisper turbo + Chatterbox Turbo + Qwen 3 8B. A complete voice assistant running locally - speak your question, get an AI response read back in a cloned voice. Total VRAM under 17GB, leaving 7GB free.

32GB VRAM {#32gb}

GPUs: RTX 5090

| Combo | STT | TTS | LLM | Total VRAM |
|---|---|---|---|---|
| Premium assistant | Whisper large-v3 (10GB) | Qwen3-TTS 1.7B (7GB) | Qwen 3 14B (10.7GB) | ~28 GB |
| Speed focus | Whisper turbo (6GB) | Chatterbox Turbo (4GB) | GPT-OSS 20B (14GB) | ~24 GB |

Best combo: Whisper large-v3 + Qwen3-TTS 1.7B + Qwen 3 14B. Maximum quality across all three components. Ten-language voice cloning, accurate transcription, and a genuinely smart conversational AI - all running on a single GPU with 4GB to spare.

Cross-Tier Summary

Best STT by Tier

| Tier | Model | WER | Speed |
|---|---|---|---|
| 8GB+ | Whisper large-v3-turbo | ~7.75% | 6x faster than large-v3 |
| 16GB+ | Whisper large-v3 | ~7.88% | Baseline |

Best TTS by Tier

| Tier | Model | Voice Clone | Quality | Speed |
|---|---|---|---|---|
| Any (CPU) | Piper | No | Good | Real-time on Pi |
| Any (CPU/GPU) | Kokoro-82M | No | Good | 96x real-time |
| 8GB | Qwen3-TTS 0.6B | Yes | Very Good | 97ms latency |
| 12GB+ | Chatterbox Turbo | Yes | Excellent | Sub-200ms |
| 16GB+ | Qwen3-TTS 1.7B | Yes | Excellent | 97ms latency |

Quick Start

Transcription with Faster-Whisper

```bash
pip install faster-whisper
```

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda")
segments, info = model.transcribe("meeting.mp3")
for segment in segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
```

TTS with Chatterbox

```bash
pip install chatterbox-tts
```

```python
import torchaudio

from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
# Clone the voice from a short reference clip
wav = model.generate("Hello, this is running entirely on my own GPU.",
                     audio_prompt_path="reference_voice.wav")
torchaudio.save("output.wav", wav, model.sr)
```

TTS with Kokoro (zero GPU needed)

```bash
pip install kokoro-onnx
```

```python
# Runs on CPU - no GPU required
import soundfile as sf
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
audio, sample_rate = kokoro.create("Your text here", voice="af_heart")
sf.write("output.wav", audio, sample_rate)
```

For complete setup guides with Docker, speaker identification, and production configurations: