ElevenLabs charges $5/month for basic voice cloning with usage caps. Chatterbox TTS is free, runs locally, and beat ElevenLabs in blind listening tests with 63.8% of listeners preferring its output.
Here’s how to set it up on your own hardware in about 15 minutes.
What You Get
Chatterbox is a family of three open-source TTS models from Resemble AI, all MIT licensed:
| Model | Parameters | Languages | Best For |
|---|---|---|---|
| Chatterbox-Turbo | 350M | English | Low-latency, real-time use |
| Chatterbox-Multilingual | 500M | 23 languages | Global applications |
| Chatterbox (original) | 500M | English | Fine-grained emotion control |
All three support zero-shot voice cloning from a single 10-second audio clip. No model training required.
The 23 supported languages include Arabic, Danish, German, Spanish, French, Japanese, Korean, Mandarin Chinese, Portuguese, Russian, Turkish, and more.
What You Need
Minimum:
- 8GB RAM
- 10GB disk space
- Docker installed
Recommended:
- NVIDIA GPU with 8GB+ VRAM (RTX 3060 Ti or better)
- 16GB RAM
Chatterbox-Turbo’s optimized 350M parameter architecture runs reasonably on consumer GPUs. The original model needs 8-16GB VRAM for comfortable use.
CPU-only mode works, but expect significantly slower generation times.
Quick Setup With Docker
The Chatterbox TTS Server project provides a web UI and API with Docker support for NVIDIA CUDA, AMD ROCm, and CPU-only operation.
NVIDIA GPU (CUDA 12.1)
git clone https://github.com/devnen/Chatterbox-TTS-Server.git
cd Chatterbox-TTS-Server
docker compose up -d
AMD GPU (ROCm)
docker compose -f docker-compose-rocm.yml up -d
CPU Only
docker compose -f docker-compose-cpu.yml up -d
Newer NVIDIA GPUs (RTX 5090/Blackwell)
docker compose -f docker-compose-cu128.yml up -d
Access the web UI at http://localhost:8004.
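The first startup can take a while because model weights download on first run. If you want to script against the server, a small poll loop tells you when it is ready. This is a stdlib-only sketch (`wait_for_port` is a helper introduced here, not part of the Chatterbox repo; 8004 is the compose file's default port):

```python
# Poll until the server's web UI port answers, so you know when the
# container has finished loading model weights. Utility sketch using
# only the standard library -- not part of the Chatterbox repo itself.
import socket
import time

def wait_for_port(host="localhost", port=8004, timeout=120):
    """Return True once the port accepts connections, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(2)
    return False

# print("server ready" if wait_for_port() else "timed out")
```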
Using the Web Interface
The server includes a full-featured web UI:
1. **Select a model** - switch between Turbo, Multilingual, or Original via the dropdown
2. **Add reference audio** - upload a 10+ second WAV clip of the voice you want to clone
3. **Enter text** - type what you want the cloned voice to say
4. **Adjust parameters** - temperature, exaggeration, and CFG weight control output variation
5. **Generate** - click and wait a few seconds
The interface includes a waveform player for previewing output before downloading.
Voice Cloning Tips
For best results with your reference audio:
Do:
- Use clean recordings with minimal background noise
- Include natural speech patterns (not monotone reading)
- Match the language of your reference to your output language
- Aim for 10-30 seconds of clear speech
Don’t:
- Use compressed audio (low bitrate MP3s)
- Mix multiple speakers in the reference
- Include music or sound effects
Cross-language voice cloning works - you can provide an English reference and generate Japanese speech - but accent transfer is more reliable when languages match.
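It can help to sanity-check a reference clip against these guidelines before uploading. Here is a minimal sketch using Python's standard `wave` module; `check_reference` is a hypothetical helper, and the thresholds simply mirror the do/don't list above, not hard Chatterbox limits:

```python
# Quick sanity check for a voice-cloning reference clip (WAV only).
# Thresholds follow the tips above: 10-30 s of clean mono speech.
# Utility sketch -- not part of Chatterbox itself.
import wave

def check_reference(path):
    """Return the clip's duration and a list of potential issues."""
    with wave.open(path, "rb") as w:
        sample_rate = w.getframerate()
        duration = w.getnframes() / sample_rate
        channels = w.getnchannels()
    issues = []
    if duration < 10:
        issues.append("clip shorter than 10 s")
    if duration > 30:
        issues.append("clip longer than 30 s")
    if sample_rate < 16000:
        issues.append("sample rate below 16 kHz")
    if channels > 1:
        issues.append("stereo clip; mono is safer")
    return {"duration_s": round(duration, 1), "issues": issues}

# report = check_reference("reference.wav")
```

Compressed formats like MP3 would need a decoding library first; saving the clip as WAV sidesteps that and avoids low-bitrate artifacts.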
API Access
The server exposes an API at /tts for programmatic access. Interactive documentation lives at http://localhost:8004/docs.
Basic curl example:
curl -X POST "http://localhost:8004/tts" \
-F "text=Hello from my local voice clone" \
-F "reference_audio=@my_voice_clip.wav" \
--output output.wav
The API accepts parameters for voice selection, chunking, generation settings, and output format.
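The same call can be made from Python with only the standard library. This sketch assumes the form field names from the curl example above (`text`, `reference_audio`); check the interactive docs at `/docs` for the full parameter list:

```python
# Build a multipart/form-data request for the local TTS server.
# Field names (text, reference_audio) follow the curl example above;
# the server must be running on localhost:8004 before sending.
import uuid
import urllib.request

def build_tts_request(text, ref_path, url="http://localhost:8004/tts"):
    """Return a urllib Request carrying the text and reference clip."""
    boundary = uuid.uuid4().hex
    with open(ref_path, "rb") as f:
        audio = f.read()
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="text"\r\n\r\n{text}\r\n'
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="reference_audio"; '
        f'filename="{ref_path}"\r\n'
        f"Content-Type: audio/wav\r\n\r\n"
    ).encode() + audio + f"\r\n--{boundary}--\r\n".encode()
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers)

# req = build_tts_request("Hello from my local voice clone", "my_voice_clip.wav")
# with urllib.request.urlopen(req) as resp, open("output.wav", "wb") as out:
#     out.write(resp.read())
```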
Alternative: Minimal Python Setup
If you don’t need the web UI, install Chatterbox directly:
pip install chatterbox-tts
Basic usage:
from chatterbox import ChatterboxTurbo
import torchaudio
model = ChatterboxTurbo.from_pretrained(device="cuda")
text = "This is my cloned voice speaking."
wav = model.generate(text, audio_prompt_path="reference.wav")
torchaudio.save("output.wav", wav.cpu(), model.sr)
Add natural expressions with paralinguistic tags:
text = "I can't believe it worked [chuckle] this is amazing!"
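For longer scripts, splitting the text into sentence-sized chunks and calling `generate` once per chunk keeps each call short (the server's API handles chunking for you; this is a rough local equivalent, and the 300-character limit is an illustrative choice, not a Chatterbox requirement):

```python
# Split long text into chunks of roughly max_chars, breaking on
# sentence boundaries, so each model.generate() call stays short.
# The 300-character default is illustrative, not a model limit.
import re

def chunk_text(text, max_chars=300):
    """Group sentences into chunks no longer than max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# wavs = [model.generate(c, audio_prompt_path="reference.wav")
#         for c in chunk_text(long_script)]
```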
Privacy Note
All audio generated by Chatterbox includes PerTh watermarks - imperceptible audio fingerprints for authenticity verification. This is baked into the model for responsible deployment.
Your audio data stays on your hardware. No cloud processing, no uploads, no usage tracking.
Other Options Worth Knowing
Chatterbox isn’t your only choice for local voice cloning:
**Kokoro-82M** - Only 82M parameters, runs on CPU, Apache licensed. Supports 8 languages and 54 voices. No voice cloning, but includes high-quality preset voices.
**OpenVoice V2** - MIT licensed instant voice cloning from MyShell AI. Supports 6 languages natively. Known limitation: accents tend to flatten (British becomes American).
**Fish Speech** - Needs only 4GB VRAM, fast inference. Voice cloning from 15-second samples. Strong quality scores on TTS Arena.
**AllTalk TTS** - Multi-engine frontend supporting Coqui XTTS, F5-TTS, Piper, and more. Good if you want to experiment with different backends through one interface.
What This Replaces
Running Chatterbox locally means:
- No ElevenLabs subscription ($5-330/month)
- No per-character or per-minute charges
- No usage caps
- No uploading sensitive audio to third-party servers
- No API rate limits
The tradeoff is hardware costs and setup time. If you have a decent GPU and 15 minutes, it’s worth trying.
What You Can Do
- Clone the Chatterbox TTS Server repo
- Run `docker compose up -d` with the appropriate compose file for your hardware
- Open `http://localhost:8004` and upload a voice sample
- Generate your first clone
The MIT license means you can use this commercially. Just don’t use it to impersonate people without consent or generate misleading content.