ElevenLabs charges $22/month for voice cloning. Otter.ai charges $17/month for transcription. Both send your audio to their servers. Running speech models locally gives you unlimited voice cloning, unlimited transcription, and complete privacy - at no cost beyond the GPU you already own.
This guide covers both halves of local speech: speech-to-text (STT / transcription) and text-to-speech (TTS / voice generation). The two tasks have very different hardware requirements, and you can often run both simultaneously.
Part 1: Speech-to-Text (Transcription)
The Models
OpenAI’s Whisper dominates local STT. It’s open-source, supports 99 languages, and has been optimized by the community into multiple fast variants. Here’s how the sizes compare:
| Model | Parameters | VRAM | WER (English) | Speed vs. Real-Time |
|---|---|---|---|---|
| Whisper tiny | 39M | ~1 GB | ~12% | 30x+ |
| Whisper base | 74M | ~1 GB | ~10% | 20x+ |
| Whisper small | 244M | ~2 GB | ~8% | 10x+ |
| Whisper medium | 769M | ~5 GB | ~7% | 5x+ |
| Whisper large-v3-turbo | 809M | ~6 GB | ~7.75% | 6x faster than large-v3 |
| Whisper large-v3 | 1.5B | ~10 GB | ~7.88% | 1x (baseline) |
Word Error Rate (WER) is measured on standard English benchmarks. Lower is better. For reference, commercial services like AssemblyAI’s Universal-2 hit about 6.68% WER.
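Under the hood, WER is just word-level edit distance divided by the number of reference words. A minimal sketch of the calculation (pure Python; real benchmark scoring also normalizes text - casing, punctuation, number formatting - before comparing):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic single-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,         # deletion
                      d[j - 1] + 1,     # insertion
                      prev + (r != h))  # substitution (free if words match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)

# One dropped word out of six reference words:
print(wer("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```

This is why WER can exceed 100% on pathological output: insertions count as errors but don't grow the reference length.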
The Recommendation: Whisper large-v3-turbo
For nearly everyone, large-v3-turbo is the answer. It achieves almost identical accuracy to the full large-v3 (7.75% vs 7.88% WER) while being 6x faster, by reducing decoder layers from 32 to 4. At ~6GB VRAM, it fits on any 8GB+ GPU with room to spare.
Use Faster-Whisper for the best performance - it’s a CTranslate2-optimized implementation that’s 4x faster than OpenAI’s original code with the same accuracy.
STT by VRAM Tier
| Tier | Best Whisper Model | VRAM Used | Room for TTS? |
|---|---|---|---|
| 8GB | large-v3-turbo | ~6 GB | Tight - Piper/Kokoro only |
| 12GB | large-v3-turbo | ~6 GB | Yes - Chatterbox Turbo fits |
| 16GB | large-v3 (full) | ~10 GB | Yes - most TTS models |
| 24GB | large-v3 (full) | ~10 GB | Yes - everything fits |
| 32GB | large-v3 (full) | ~10 GB | Yes - full pipeline + LLM |
We have a detailed Whisper setup guide: How to Self-Host Whisper: Replace Otter.ai.
Part 2: Text-to-Speech (Voice Generation)
TTS is where the real variety is. Models range from 82M parameters (runs on a potato) to 1.7B (needs a real GPU), with dramatic differences in quality, speed, and voice cloning capability.
The Landscape
| Model | Params | VRAM | Voice Clone? | Languages | Speed | Quality |
|---|---|---|---|---|---|---|
| Kokoro-82M | 82M | <1 GB | No | 8 | 96x real-time (GPU) | Good |
| Piper | Varies | <1 GB | No (pre-trained) | 20+ | Real-time on CPU | Good |
| Chatterbox Turbo | 350M | ~4 GB | Yes | English | Sub-200ms latency | Excellent |
| Chatterbox Original | 500M | ~6 GB | Yes | English | Moderate | Excellent |
| Chatterbox Multi | 500M | ~6 GB | Yes | 10+ | Moderate | Excellent |
| F5-TTS | ~335M | ~6 GB | Yes (zero-shot) | Multi | Sub-7s processing | Excellent |
| Qwen3-TTS 0.6B | 600M | 4-6 GB | Yes (3s sample) | 10 | 97ms latency | Very Good |
| Qwen3-TTS 1.7B | 1.7B | 6-8 GB | Yes (3s sample) | 10 | 97ms latency | Excellent |
Model Breakdown
Kokoro-82M - The speed king. At just 82M parameters, this model runs at 96x real-time on a basic GPU and 3-5x real-time on CPU alone. It ranked #1 on the TTS Spaces Arena, beating models 5-15x its size. No voice cloning, but the built-in voices are natural enough for most uses. If you just need to read text aloud, this is it.
Piper - The edge champion. Piper runs on a Raspberry Pi at real-time speed using ~500MB RAM. It’s purpose-built for home automation (reading notifications, smart home responses) and doesn’t need a GPU at all. Quality is a step below the neural models, but the zero-VRAM requirement means you can run it alongside anything.
Chatterbox - The quality leader. Resemble AI’s Chatterbox family beat ElevenLabs in blind evaluations with a 63.75% preference rate. The Turbo variant (350M) is the sweet spot - one-step generation at sub-200ms latency with voice cloning from a short sample. MIT licensed.
F5-TTS - The voice cloning specialist. F5-TTS does zero-shot voice cloning without fine-tuning - give it a reference clip and it reproduces the voice. At ~6GB VRAM it’s accessible on 8GB GPUs, and the cloning quality rivals commercial services.
Qwen3-TTS - The multilingual powerhouse. Alibaba’s TTS family supports 10 languages with 3-second voice cloning and streaming output. The 1.7B model at 6-8GB VRAM beats ElevenLabs and MiniMax on quality benchmarks. The 0.6B variant at 4-6GB brings the same capabilities to lower hardware.
TTS by VRAM Tier {#tts-tiers}
8GB VRAM {#8gb}
GPUs: RTX 4060, RTX 3060 8GB, RTX 3070
| Combo | STT | TTS | Total VRAM | Use Case |
|---|---|---|---|---|
| Speed focus | Whisper turbo (6GB) | Kokoro-82M (<1GB) | ~7 GB | Fast transcription + readback |
| Voice clone | Whisper medium (5GB) | Chatterbox Turbo (4GB) | ~9 GB | Won’t fit simultaneously |
| Multilingual | Whisper turbo (6GB) | Piper (<1GB) | ~7 GB | 20+ language support |
Best combo: Whisper large-v3-turbo + Kokoro-82M. Both fit comfortably in 8GB and give you near-commercial-quality transcription with fast, natural readback. If you need voice cloning, run Chatterbox Turbo alone (the STT model needs to be unloaded first) or drop to Whisper small (~2GB), which does fit alongside Chatterbox Turbo at ~6GB total.
TTS-only pick: Qwen3-TTS 0.6B at 4-6GB gives you voice cloning in 10 languages on an 8GB card, though you won’t have room for simultaneous Whisper.
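These combos are just VRAM arithmetic. A tiny helper to sanity-check a pairing against your card - the sizes are the rough figures from the tables above, and the headroom term is an assumption standing in for CUDA context and activation buffers:

```python
# Approximate VRAM footprints (GB), taken from the tables above.
VRAM_GB = {
    "whisper-large-v3-turbo": 6, "whisper-large-v3": 10,
    "whisper-medium": 5, "whisper-small": 2,
    "kokoro-82m": 1, "chatterbox-turbo": 4,
    "qwen3-tts-0.6b": 5, "qwen3-tts-1.7b": 7,
}

def fits(card_gb: float, *models: str, headroom_gb: float = 1.0) -> bool:
    """True if the models fit together, leaving headroom for buffers."""
    return sum(VRAM_GB[m] for m in models) + headroom_gb <= card_gb

print(fits(8, "whisper-large-v3-turbo", "kokoro-82m"))  # True: 6 + 1 + 1 <= 8
print(fits(8, "whisper-medium", "chatterbox-turbo"))    # False: 5 + 4 + 1 > 8
```

The same check explains every "won't fit" note in the tier tables below.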
12GB VRAM {#12gb}
GPUs: RTX 3060 12GB, RTX 4070
Now things get comfortable. You can run Whisper and a voice cloning TTS model simultaneously.
| Combo | STT | TTS | Total VRAM |
|---|---|---|---|
| Best quality | Whisper turbo (6GB) | Chatterbox Turbo (4GB) | ~10 GB |
| Multilingual clone | Whisper turbo (6GB) | Qwen3-TTS 0.6B (5GB) | ~11 GB |
| Max quality TTS | Whisper medium (5GB) | Qwen3-TTS 1.7B (7GB) | ~12 GB |
Best combo: Whisper turbo + Chatterbox Turbo. Commercial-grade transcription and voice cloning running simultaneously with 2GB headroom.
16GB VRAM {#16gb}
GPUs: RTX 4060 Ti 16GB, RTX 5060, Arc A770
| Combo | STT | TTS | Total VRAM |
|---|---|---|---|
| Full pipeline | Whisper large-v3 (10GB) | Chatterbox Turbo (4GB) | ~14 GB |
| Voice assistant | Whisper turbo (6GB) | Qwen3-TTS 1.7B (7GB) | ~13 GB |
| Everything + LLM | Whisper turbo (6GB) | Kokoro (<1GB) | ~7 GB (+9GB for LLM) |
Best combo: Whisper large-v3 (full quality) + Chatterbox Turbo. The full-size Whisper model is marginally more accurate than turbo on difficult audio (accented speech, background noise), and 16GB gives you room to run both.
Voice assistant setup: Run Whisper turbo + Kokoro (~7GB combined) and spend the remaining ~9GB on a small LLM to build a complete voice-in, AI-process, voice-out pipeline. Whisper transcribes your speech, the LLM generates a response, Kokoro reads it back. All local, all real-time.
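The pipeline itself is three stages glued together. A sketch with the stages injected as plain callables, so the wiring stands on its own - in practice `transcribe` would wrap faster-whisper, `respond` your local LLM, and `speak` Kokoro; those bindings are assumptions and not shown here:

```python
from typing import Callable

def voice_turn(
    transcribe: Callable[[str], str],  # audio path -> text (e.g. Whisper)
    respond: Callable[[str], str],     # text -> reply (e.g. a local LLM)
    speak: Callable[[str], bytes],     # reply -> audio (e.g. Kokoro)
    audio_path: str,
) -> bytes:
    """One assistant turn: speech in, AI-processed speech out."""
    text = transcribe(audio_path)
    reply = respond(text)
    return speak(reply)

# Stub stages, just to show the flow end to end:
audio = voice_turn(
    transcribe=lambda path: "what time is it",
    respond=lambda q: f"You asked: {q}",
    speak=lambda reply: reply.encode(),
    audio_path="mic_capture.wav",
)
print(audio)  # b'You asked: what time is it'
```

Keeping the stages decoupled like this also makes it trivial to swap Kokoro for Chatterbox later without touching the rest of the loop.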
24GB VRAM {#24gb}
GPUs: RTX 3090, RTX 4090
Everything fits. The question is what to do with the extra room.
| Combo | STT | TTS | LLM | Total VRAM |
|---|---|---|---|---|
| Voice assistant | Whisper turbo (6GB) | Chatterbox Turbo (4GB) | Qwen 3 8B (6.5GB) | ~16.5 GB |
| Power pipeline | Whisper large-v3 (10GB) | Qwen3-TTS 1.7B (7GB) | - | ~17 GB |
| Full stack | Whisper turbo (6GB) | Kokoro (<1GB) | Qwen 3 14B (10.7GB) | ~18 GB |
Best combo: Whisper turbo + Chatterbox Turbo + Qwen 3 8B. A complete voice assistant running locally - speak your question, get an AI response read back in a cloned voice. Total VRAM under 17GB, leaving 7GB free.
32GB VRAM {#32gb}
GPUs: RTX 5090
| Combo | STT | TTS | LLM | Total VRAM |
|---|---|---|---|---|
| Premium assistant | Whisper large-v3 (10GB) | Qwen3-TTS 1.7B (7GB) | Qwen 3 14B (10.7GB) | ~28 GB |
| Speed focus | Whisper turbo (6GB) | Chatterbox Turbo (4GB) | GPT-OSS 20B (14GB) | ~24 GB |
Best combo: Whisper large-v3 + Qwen3-TTS 1.7B + Qwen 3 14B. Maximum quality across all three components. Ten-language voice cloning, accurate transcription, and a genuinely smart conversational AI - all running on a single GPU with 4GB to spare.
Cross-Tier Summary
Best STT by Tier
| Tier | Model | WER | Speed |
|---|---|---|---|
| 8GB+ | Whisper large-v3-turbo | ~7.75% | 6x faster than large-v3 |
| 16GB+ | Whisper large-v3 | ~7.88% | Baseline |
Best TTS by Tier
| Tier | Model | Voice Clone | Quality | Speed |
|---|---|---|---|---|
| Any (CPU) | Piper | No | Good | Real-time on Pi |
| Any (CPU/GPU) | Kokoro-82M | No | Good | 96x real-time |
| 8GB | Qwen3-TTS 0.6B | Yes | Very Good | 97ms latency |
| 12GB+ | Chatterbox Turbo | Yes | Excellent | Sub-200ms |
| 16GB+ | Qwen3-TTS 1.7B | Yes | Excellent | 97ms latency |
Quick Start
Transcription with Faster-Whisper

```bash
pip install faster-whisper
```

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda")
segments, info = model.transcribe("meeting.mp3")
for segment in segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
```
TTS with Chatterbox

```bash
pip install chatterbox-tts
```

```python
import torchaudio

from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
wav = model.generate(
    "Hello, this is running entirely on my own GPU.",
    audio_prompt_path="reference_voice.wav",  # voice to clone
)
torchaudio.save("output.wav", wav, model.sr)
```
TTS with Kokoro (zero GPU needed)

```bash
pip install kokoro-onnx
```

```python
import soundfile as sf

from kokoro_onnx import Kokoro

# Works on CPU - no GPU required
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
audio, sr = kokoro.create("Your text here", voice="af_heart")
sf.write("output.wav", audio, sr)
```
For complete setup guides with Docker, speaker identification, and production configurations: