Self-Host Voice Cloning: Replace ElevenLabs With Chatterbox TTS

Clone voices locally with Chatterbox TTS. Free, open-source alternative to ElevenLabs that actually wins blind tests. Docker setup included.

ElevenLabs charges $5/month for basic voice cloning with usage caps. Chatterbox TTS is free, runs locally, and beat ElevenLabs in blind listening tests with 63.8% of listeners preferring its output.

Here’s how to set it up on your own hardware in about 15 minutes.

What You Get

Chatterbox is a family of three open-source TTS models from Resemble AI, all MIT licensed:

| Model | Parameters | Languages | Best For |
|---|---|---|---|
| Chatterbox-Turbo | 350M | English | Low-latency, real-time use |
| Chatterbox-Multilingual | 500M | 23 languages | Global applications |
| Chatterbox (original) | 500M | English | Fine-grained emotion control |

All three support zero-shot voice cloning from a single 10-second audio clip. No model training required.

The 23 supported languages include Arabic, Danish, German, Spanish, French, Japanese, Korean, Mandarin Chinese, Portuguese, Russian, Turkish, and more.

What You Need

Minimum:

  • 8GB RAM
  • 10GB disk space
  • Docker installed

Recommended:

  • NVIDIA GPU with 8GB+ VRAM (RTX 3060 Ti or better)
  • 16GB RAM

Chatterbox-Turbo’s 350M-parameter architecture runs well on consumer GPUs. The original model needs 8-16GB of VRAM for comfortable use.

CPU-only mode works but expect significantly slower generation times.

Quick Setup With Docker

The Chatterbox TTS Server project provides a web UI and API with Docker support for NVIDIA CUDA, AMD ROCm, and CPU-only operation.

NVIDIA GPU (CUDA 12.1)

git clone https://github.com/devnen/Chatterbox-TTS-Server.git
cd Chatterbox-TTS-Server
docker compose up -d

AMD GPU (ROCm)

docker compose -f docker-compose-rocm.yml up -d

CPU Only

docker compose -f docker-compose-cpu.yml up -d

Newer NVIDIA GPUs (RTX 5090/Blackwell)

docker compose -f docker-compose-cu128.yml up -d

Access the web UI at http://localhost:8004.
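The container can take a few minutes to pull model weights on first start, so the UI won't answer immediately. A small readiness poll, using only the Python standard library (the helper name and timeouts are my own; any HTTP response, even a 404, means the server is up):

```python
# Poll the server until it answers, or give up after a timeout.
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout: float = 120.0, interval: float = 2.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=interval)
            return True
        except urllib.error.HTTPError:
            return True  # got an HTTP response: the server is up
        except (urllib.error.URLError, OSError):
            time.sleep(interval)  # not listening yet; retry
    return False

# wait_for_server("http://localhost:8004")  # True once the UI is reachable
```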

Using the Web Interface

The server includes a full-featured web UI:

  1. Select a model - Switch between Turbo, Multilingual, or Original via the dropdown
  2. Add reference audio - Upload a 10+ second WAV clip of the voice you want to clone
  3. Enter text - Type what you want the cloned voice to say
  4. Adjust parameters - Temperature, exaggeration, CFG weight control output variation
  5. Generate - Click and wait a few seconds

The interface includes a waveform player for previewing output before downloading.

Voice Cloning Tips

For best results with your reference audio:

Do:

  • Use clean recordings with minimal background noise
  • Include natural speech patterns (not monotone reading)
  • Match the language of your reference to your output language
  • Aim for 10-30 seconds of clear speech

Don’t:

  • Use compressed audio (low bitrate MP3s)
  • Mix multiple speakers in the reference
  • Include music or sound effects

Cross-language voice cloning works - you can provide an English reference and generate Japanese speech - but accent transfer is more reliable when languages match.
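The 10-30 second guideline is easy to check before uploading. A sketch using the standard-library wave module (uncompressed PCM WAV only; the helper names are mine):

```python
# Check that a reference clip falls in the recommended 10-30 second range.
import wave

def clip_duration_seconds(path: str) -> float:
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()

def is_good_reference(path: str, lo: float = 10.0, hi: float = 30.0) -> bool:
    return lo <= clip_duration_seconds(path) <= hi
```

This catches length problems but not noise or compression artifacts, so it complements rather than replaces a quick listen.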

API Access

The server exposes an API at /tts for programmatic access. Interactive documentation lives at http://localhost:8004/docs.

Basic curl example:

curl -X POST "http://localhost:8004/tts" \
  -F "text=Hello from my local voice clone" \
  -F "reference_audio=@my_voice_clip.wav" \
  --output output.wav

The API accepts parameters for voice selection, chunking, generation settings, and output format.
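The curl call above can be reproduced from Python without extra dependencies. A sketch that builds the multipart request with the standard library; the field names mirror the curl example, but check /docs for the server's actual schema:

```python
# Build a multipart/form-data POST for the /tts endpoint, stdlib only.
import uuid
import urllib.request

def build_tts_request(url: str, text: str, audio_bytes: bytes,
                      filename: str = "reference.wav") -> urllib.request.Request:
    boundary = uuid.uuid4().hex
    parts = [
        (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="text"\r\n\r\n'
            f'{text}\r\n'
        ).encode(),
        (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="reference_audio"; '
            f'filename="{filename}"\r\n'
            f'Content-Type: audio/wav\r\n\r\n'
        ).encode(),
        audio_bytes + b"\r\n",
        f'--{boundary}--\r\n'.encode(),
    ]
    return urllib.request.Request(
        url,
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )

req = build_tts_request(
    "http://localhost:8004/tts",
    "Hello from my local voice clone",
    b"",  # replace with open("my_voice_clip.wav", "rb").read()
)
# urllib.request.urlopen(req) would send it and return WAV bytes
```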

Alternative: Minimal Python Setup

If you don’t need the web UI, install Chatterbox directly:

pip install chatterbox-tts

Basic usage:

from chatterbox import ChatterboxTurbo
import torchaudio

# Downloads model weights on first run; use device="cpu" if no GPU is available
model = ChatterboxTurbo.from_pretrained(device="cuda")

text = "This is my cloned voice speaking."
# Zero-shot clone: the reference clip conditions the voice, no training step
wav = model.generate(text, audio_prompt_path="reference.wav")

# model.sr is the model's native output sample rate
torchaudio.save("output.wav", wav.cpu(), model.sr)
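If the script needs to run on machines with and without a GPU, the device string can be chosen at runtime instead of hardcoding "cuda". A minimal sketch, assuming PyTorch is installed (Chatterbox depends on it):

```python
# Pick a device at runtime so the same script runs on GPU and CPU machines.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"  # fall back if PyTorch isn't importable

print(f"Running Chatterbox on: {device}")
```

Pass the resulting `device` to `from_pretrained(device=device)`.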

Add natural expressions with paralinguistic tags:

text = "I can't believe it worked [chuckle] this is amazing!"
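Tags like [chuckle] live inline in the prompt text. If you also need a clean transcript of the same string (for subtitles, say), stripping them is a one-liner; this helper and its regex are my own illustration, not part of the Chatterbox API:

```python
# Strip bracketed paralinguistic tags like [chuckle] from a prompt string.
import re

def strip_tags(text: str) -> str:
    return re.sub(r"\s*\[[a-z_]+\]", "", text)

print(strip_tags("I can't believe it worked [chuckle] this is amazing!"))
```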

Privacy Note

All audio generated by Chatterbox includes PerTh watermarks - imperceptible audio fingerprints for authenticity verification. This is baked into the model for responsible deployment.

Your audio data stays on your hardware. No cloud processing, no uploads, no usage tracking.

Other Options Worth Knowing

Chatterbox isn’t your only choice for local voice cloning:

Kokoro-82M - Only 82M parameters, runs on CPU, Apache licensed. Supports 8 languages and 54 voices. No voice cloning, but includes high-quality preset voices.

OpenVoice V2 - MIT licensed instant voice cloning from MyShell AI. Supports 6 languages natively. Known limitation: accents tend to flatten (British becomes American).

Fish Speech - Needs only 4GB VRAM, fast inference. Voice cloning from 15-second samples. Strong quality scores on TTS Arena.

AllTalk TTS - Multi-engine frontend supporting Coqui XTTS, F5-TTS, Piper, and more. Good if you want to experiment with different backends through one interface.

What This Replaces

Running Chatterbox locally means:

  • No ElevenLabs subscription ($5-330/month)
  • No per-character or per-minute charges
  • No usage caps
  • No uploading sensitive audio to third-party servers
  • No API rate limits

The tradeoff is hardware costs and setup time. If you have a decent GPU and 15 minutes, it’s worth trying.

What You Can Do

  1. Clone the Chatterbox TTS Server repo
  2. Run docker compose up -d with the appropriate compose file for your hardware
  3. Open http://localhost:8004 and upload a voice sample
  4. Generate your first clone

The MIT license means you can use this commercially. Just don’t use it to impersonate people without consent or generate misleading content.