Chatterbox TTS
Release, demos, and perceived quality
- Public demos via Hugging Face and a dedicated demo page impress many listeners: natural, expressive speech and convincing zero-shot cloning from short reference samples.
- Others find the demo cherry-picked: locally they get less emotion, accent drift, or muffled/low-quality results compared to ElevenLabs.
- Some hear artifacts (whooshes, “machine” sounds) and note outputs can get unstable when tweaking CFG/pace.
- A 40-second limit in the public demo is reported but not clearly documented.
Audiobooks and practical use cases
- Multiple users confirm current TTS (including tools using Chatterbox and Kokoro) is “good enough” to narrate whole books, though not at human narrator quality.
- Workflows exist to turn EPUBs into m4b/m4a audiobooks with various open tools; Chatterbox is one more option in that ecosystem.
- People envision future e-books read on-device by AI with richer interactivity (e.g., ask for context mid-book).
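Since the hosted demo reportedly caps clips around 40 seconds, EPUB-to-audiobook workflows typically split chapter text into short sentence-aligned chunks, synthesize each, and concatenate the results. A minimal sketch of the chunking step; the character budget and helper name are illustrative, not part of any particular tool:

```python
import re

def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Split text into sentence-aligned chunks under a character budget.

    max_chars ~400 is a rough proxy for ~30-40 s of speech; tune per voice.
    A single sentence longer than the budget passes through unsplit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

chapter = ("Call me Ishmael. Some years ago, never mind how long precisely, "
           "I thought I would sail about a little and see the watery part of "
           "the world.")
print(chunk_text(chapter, max_chars=80))
```

Each chunk would then be fed to the TTS model and the resulting audio files concatenated (e.g. with ffmpeg) into an m4b.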
Technical characteristics & performance
- Uses an LLM-like backbone over audio tokens from a neural codec; audio generation is framed as next-token prediction, and the resulting token sequence is decoded back to a waveform by the codec.
- VRAM reports around 5–7 GB; runs on consumer GPUs (e.g., 2060, 3090) but not yet well optimized, and real-time is borderline on many setups.
- CPU-only is possible in theory but experiences vary; installation is fragile (Python version, PyTorch/torchaudio pins, system packages, CMake issues).
- Some wrappers (Dockerized APIs, Lightning/Truss examples, CLI tools) aim to simplify deployment and enable longer texts than the hosted demo.
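The architecture described above (an LLM-style backbone predicting neural-codec tokens one at a time) reduces to a familiar greedy decoding loop. A toy sketch; the stub scoring function, vocabulary size, and EOS token are all illustrative, not Chatterbox's actual code:

```python
import random

VOCAB_SIZE = 1024   # size of the codec's token vocabulary (illustrative)
EOS_TOKEN = 0       # hypothetical end-of-audio token

def next_token_logits(prompt_tokens, audio_tokens):
    """Stand-in for the LLM backbone: returns a score per codec token.

    A real model would condition on the text and a reference-voice
    embedding; here we just emit deterministic dummy scores.
    """
    random.seed(len(audio_tokens))
    return [random.random() for _ in range(VOCAB_SIZE)]

def generate_audio_tokens(prompt_tokens, max_tokens=50):
    """Greedy next-token loop over audio tokens.

    The resulting sequence would then be decoded back to a waveform
    by the neural codec's decoder.
    """
    audio_tokens = []
    for _ in range(max_tokens):
        logits = next_token_logits(prompt_tokens, audio_tokens)
        token = max(range(VOCAB_SIZE), key=logits.__getitem__)  # argmax
        if token == EOS_TOKEN:
            break
        audio_tokens.append(token)
    return audio_tokens

tokens = generate_audio_tokens(prompt_tokens=[1, 2, 3], max_tokens=10)
print(len(tokens))
```

Real systems sample with temperature and classifier-free guidance rather than pure argmax, which is where the instability users report when tweaking CFG/pace comes in.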
Openness and licensing debate
- Weights and inference code are released, but training and fine-tuning code are withheld; fine-tuning is offered via a paid API.
- This sparks a long “how open is open?” argument: critics call it “3/10 open” and a marketing move compared to other semi-open TTS models; defenders argue open weights are still meaningfully open.
- Skeptics claim no one will build community fine-tuning; others immediately point to third-party repos that already implement it, including a German fine-tune.
TTS vs speech recognition
- Several argue TTS quality is no longer the bottleneck; speech-to-text (ASR) and downstream handling are.
- Users report good experiences with newer open ASR models (Whisper variants, NVIDIA Parakeet) and note that LLM post-processing can clean transcripts, infer speaker names, and handle diarization.
- Diarization remains a deployment pain point; WhisperX and whisper-diarization are mentioned, along with practical setup advice.
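The glue step behind these pipelines, aligning ASR word timestamps with diarization speaker turns, is conceptually simple: label each word with the turn it overlaps most. A hedged sketch with made-up data shapes (real WhisperX/pyannote output formats differ):

```python
def assign_speakers(words, turns):
    """Label each timestamped word with its best-overlapping speaker turn.

    words: [(word, start, end)], turns: [(speaker, start, end)].
    Illustrative shapes only, not WhisperX's actual output format.
    """
    labeled = []
    for word, w_start, w_end in words:
        best, best_overlap = "UNKNOWN", 0.0
        for speaker, t_start, t_end in turns:
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append((word, best))
    return labeled

words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.2, 1.4)]
turns = [("SPEAKER_00", 0.0, 1.0), ("SPEAKER_01", 1.0, 2.0)]
print(assign_speakers(words, turns))
```

An LLM post-processing pass can then map anonymous labels like `SPEAKER_00` to inferred names from the transcript content.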
Language support, accents, and pronunciation
- Chatterbox supports only English; this frustrates users seeking multilingual TTS (French, German, Japanese, etc.).
- Some report good cloning for “common” accents but systematic accent drift (Scottish → Australian, Australian → RP, RP → Yorkshire).
- Pronunciation of heteronyms and vowel pairs remains a general TTS problem; suggestions include prompting for disambiguation or better phonemizer setups.
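The disambiguation suggestion above can be approximated with a text pre-pass that respells known trouble words before synthesis, leaving the choice of reading to the caller. The rule format and example respellings are illustrative, not a feature of any specific phonemizer:

```python
import re

def respell(text: str, rules: dict[str, str]) -> str:
    """Replace whole words with phonetic respellings before synthesis.

    The caller picks the intended reading, e.g. {"lead": "led"} when
    the text means the metal. Rules and spellings are illustrative.
    """
    for word, spoken in rules.items():
        text = re.sub(rf"\b{re.escape(word)}\b", spoken, text,
                      flags=re.IGNORECASE)
    return text

print(respell("The pipe is made of lead.", {"lead": "led"}))
```

This is crude (it is context-blind and case-flattening); a proper fix is a phonemizer with a pronunciation lexicon, but respelling is a common stopgap.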
Watermarking
- Generated audio is said to include an imperceptible watermark, but in this repo it’s a separate post-processing step that can be disabled via a flag or code change.
- Some see it as “CYA” for abuse concerns or a convenience feature for downstream products, but technically it offers little protection in an open-weight setting.
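Because the watermark is a separate post-processing pass rather than something baked into generation, the control flow is roughly a flag-gated step like the sketch below. All names here are placeholders, not the repo's actual API:

```python
def fake_synthesize(text):
    """Placeholder for the TTS model; returns dummy audio samples."""
    return [0.0] * 16000  # one second of "silence" at 16 kHz

def fake_watermark(audio):
    """Placeholder watermarker; a real one perturbs the signal
    imperceptibly. Here we tag a copy so the effect is visible."""
    return [s + 1e-6 for s in audio]

def generate(text, apply_watermark=True):
    """Watermarking as a distinct post step, so a flag can skip it."""
    audio = fake_synthesize(text)
    if apply_watermark:
        audio = fake_watermark(audio)
    return audio

marked = generate("hello")
plain = generate("hello", apply_watermark=False)
print(marked != plain)
```

This structure is exactly why the protection is weak with open weights: anyone running the code locally can flip the flag or delete the call.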
Terminology, UX, and ecosystem
- Several readers complain that “TTS” isn’t expanded in the README; suggestions include basic writing hygiene and even acronym-expanding browser extensions.
- Users compare Chatterbox with Kokoro, ElevenLabs, PlayHT’s PlayDiffusion, MegaTTS3, Seed-VC, Real-Time Voice Cloning, OpenVoice2, and others as they explore the crowded TTS landscape.
Security and societal concerns
- Users highlight the rising risk of voice-based scams (e.g., “friend” needing urgent gift cards), suggesting shared family passphrases or other verification rituals.
- There’s a sense that realistic cloned voices, coupled with cheap access, will significantly amplify phone fraud, even for non-English accents.