Self-Host Voice Cloning: Replace ElevenLabs With Chatterbox TTS

Clone voices locally with Chatterbox TTS. Free, open-source alternative to ElevenLabs that actually wins blind tests. Docker setup included.

ElevenLabs charges $5/month for basic voice cloning with usage caps. Chatterbox TTS is free, runs locally, and beat ElevenLabs in blind listening tests with 63.8% of listeners preferring its output.

Here’s how to set it up on your own hardware in about 15 minutes.

What You Get

Chatterbox is a family of three open-source TTS models from Resemble AI, all MIT licensed:

| Model | Parameters | Languages | Best For |
|---|---|---|---|
| Chatterbox-Turbo | 350M | English | Low-latency, real-time use |
| Chatterbox-Multilingual | 500M | 23 languages | Global applications |
| Chatterbox (original) | 500M | English | Fine-grained emotion control |

All three support zero-shot voice cloning from a single 10-second audio clip. No model training required.

The 23 supported languages include Arabic, Danish, German, Spanish, French, Japanese, Korean, Mandarin Chinese, Portuguese, Russian, Turkish, and more.

What You Need

Minimum:

  • 8GB RAM
  • 10GB disk space
  • Docker installed

Recommended:

  • NVIDIA GPU with 8GB+ VRAM (RTX 3060 Ti or better)
  • 16GB RAM

Chatterbox-Turbo’s 350M-parameter architecture runs well on consumer GPUs. The original model needs 8-16GB of VRAM for comfortable use.

CPU-only mode works but expect significantly slower generation times.

Quick Setup With Docker

The Chatterbox TTS Server project provides a web UI and API with Docker support for NVIDIA CUDA, AMD ROCm, and CPU-only operation.

NVIDIA GPU (CUDA 12.1)

git clone https://github.com/devnen/Chatterbox-TTS-Server.git
cd Chatterbox-TTS-Server
docker compose up -d

AMD GPU (ROCm)

docker compose -f docker-compose-rocm.yml up -d

CPU Only

docker compose -f docker-compose-cpu.yml up -d

Newer NVIDIA GPUs (RTX 5090/Blackwell)

docker compose -f docker-compose-cu128.yml up -d

Access the web UI at http://localhost:8004.
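The container can take a few minutes to pull model weights on first start, so the UI won't answer immediately. A small readiness poll, using only the Python standard library (the helper name and timeouts are my own; any HTTP response, even a 404, means the server is up):

```python
# Poll the server until it answers, or give up after a timeout.
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout: float = 120.0, interval: float = 2.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=interval)
            return True
        except urllib.error.HTTPError:
            return True  # got an HTTP response: the server is up
        except (urllib.error.URLError, OSError):
            time.sleep(interval)  # not listening yet; retry
    return False

# wait_for_server("http://localhost:8004")  # True once the UI is reachable
```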

Using the Web Interface

The server includes a full-featured web UI:

  1. Select a model - Switch between Turbo, Multilingual, or Original via the dropdown
  2. Add reference audio - Upload a 10+ second WAV clip of the voice you want to clone
  3. Enter text - Type what you want the cloned voice to say
  4. Adjust parameters - Temperature, exaggeration, CFG weight control output variation
  5. Generate - Click and wait a few seconds

The interface includes a waveform player for previewing output before downloading.

Voice Cloning Tips

For best results with your reference audio:

Do:

  • Use clean recordings with minimal background noise
  • Include natural speech patterns (not monotone reading)
  • Match the language of your reference to your output language
  • Aim for 10-30 seconds of clear speech

Don’t:

  • Use compressed audio (low bitrate MP3s)
  • Mix multiple speakers in the reference
  • Include music or sound effects

Cross-language voice cloning works - you can provide an English reference and generate Japanese speech - but accent transfer is more reliable when languages match.
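The 10-30 second guideline is easy to check before uploading. A sketch using the standard-library wave module (uncompressed PCM WAV only; the helper names are mine):

```python
# Check that a reference clip falls in the recommended 10-30 second range.
import wave

def clip_duration_seconds(path: str) -> float:
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def is_good_reference(path: str, lo: float = 10.0, hi: float = 30.0) -> bool:
    return lo <= clip_duration_seconds(path) <= hi
```

This catches length problems but not noise or compression artifacts, so it complements rather than replaces a quick listen.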

API Access

The server exposes an API at /tts for programmatic access. Interactive documentation lives at http://localhost:8004/docs.

Basic curl example:

curl -X POST "http://localhost:8004/tts" \
  -F "text=Hello from my local voice clone" \
  -F "reference_audio=@my_voice_clip.wav" \
  --output output.wav

The API accepts parameters for voice selection, chunking, generation settings, and output format.
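The curl call above can be reproduced from Python without extra dependencies. A sketch that builds the multipart request with the standard library; the field names mirror the curl example, but check /docs for the server's actual schema:

```python
# Build a multipart/form-data POST for the /tts endpoint, stdlib only.
import uuid
import urllib.request

def build_tts_request(url: str, text: str, audio_bytes: bytes,
                      filename: str = "reference.wav") -> urllib.request.Request:
    boundary = uuid.uuid4().hex
    parts = [
        (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="text"\r\n\r\n'
            f'{text}\r\n'
        ).encode(),
        (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="reference_audio"; '
            f'filename="{filename}"\r\n'
            f'Content-Type: audio/wav\r\n\r\n'
        ).encode(),
        audio_bytes + b"\r\n",
        f'--{boundary}--\r\n'.encode(),
    ]
    return urllib.request.Request(
        url,
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )

req = build_tts_request(
    "http://localhost:8004/tts",
    "Hello from my local voice clone",
    b"",  # replace with open("my_voice_clip.wav", "rb").read()
)
# urllib.request.urlopen(req) would send it and return WAV bytes
```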

Alternative: Minimal Python Setup

If you don’t need the web UI, install Chatterbox directly:

pip install chatterbox-tts

Basic usage:

from chatterbox import ChatterboxTurbo
import torchaudio

# Downloads model weights on first run; use device="cpu" if no GPU is available
model = ChatterboxTurbo.from_pretrained(device="cuda")

text = "This is my cloned voice speaking."
# Zero-shot clone: the reference clip conditions the voice, no training step
wav = model.generate(text, audio_prompt_path="reference.wav")

# model.sr is the model's native output sample rate
torchaudio.save("output.wav", wav.cpu(), model.sr)
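If the script needs to run on machines with and without a GPU, the device string can be chosen at runtime instead of hardcoding "cuda". A minimal sketch, assuming PyTorch is installed (Chatterbox depends on it):

```python
# Pick a device at runtime so the same script runs on GPU and CPU machines.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"  # fall back if PyTorch isn't importable

print(f"Running Chatterbox on: {device}")
```

Pass the resulting `device` to `from_pretrained(device=device)`.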

Add natural expressions with paralinguistic tags:

text = "I can't believe it worked [chuckle] this is amazing!"
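Tags like [chuckle] live inline in the prompt text. If you also need a clean transcript of the same string (for subtitles, say), stripping them is a one-liner; this helper and its regex are my own illustration, not part of the Chatterbox API:

```python
# Strip bracketed paralinguistic tags like [chuckle] from a prompt string.
import re

def strip_tags(text: str) -> str:
    return re.sub(r"\s*\[[a-z_]+\]", "", text)

print(strip_tags("I can't believe it worked [chuckle] this is amazing!"))
```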

Privacy Note

All audio generated by Chatterbox includes PerTh watermarks - imperceptible audio fingerprints for authenticity verification. This is baked into the model for responsible deployment.

Your audio data stays on your hardware. No cloud processing, no uploads, no usage tracking.

Other Options Worth Knowing

Chatterbox isn’t your only choice for local voice cloning:

Kokoro-82M - Only 82M parameters, runs on CPU, Apache licensed. Supports 8 languages and 54 voices. No voice cloning, but includes high-quality preset voices.

OpenVoice V2 - MIT licensed instant voice cloning from MyShell AI. Supports 6 languages natively. Known limitation: accents tend to flatten (British becomes American).

Fish Speech - Needs only 4GB VRAM, fast inference. Voice cloning from 15-second samples. Strong quality scores on TTS Arena.

AllTalk TTS - Multi-engine frontend supporting Coqui XTTS, F5-TTS, Piper, and more. Good if you want to experiment with different backends through one interface.

What This Replaces

Running Chatterbox locally means:

  • No ElevenLabs subscription ($5-330/month)
  • No per-character or per-minute charges
  • No usage caps
  • No uploading sensitive audio to third-party servers
  • No API rate limits

The tradeoff is hardware costs and setup time. If you have a decent GPU and 15 minutes, it’s worth trying.

What You Can Do

  1. Clone the Chatterbox TTS Server repo
  2. Run docker compose up -d with the appropriate compose file for your hardware
  3. Open http://localhost:8004 and upload a voice sample
  4. Generate your first clone

The MIT license means you can use this commercially. Just don’t use it to impersonate people without consent or generate misleading content.