Sopro TTS: A 169M model with zero-shot voice cloning that runs on the CPU
Perceived Audio Quality & Usefulness
- Many praise the project as impressive given its small size and CPU-only constraints; some find it “good enough” and fun to experiment with.
- Others think the demo sounds extremely robotic, distorted, or “warped cassette–like,” and say they cannot imagine anyone mistaking it for a human voice.
- Several note that output quality depends strongly on the reference audio and generation parameters; a poor reference can yield very rough output.
- Some users report only partial similarity to target voices, but still find the results impressive given the low training budget and ease of local use.
Comparisons to Other TTS Models
- Chatterbox-TTS is repeatedly cited as a higher-quality alternative, especially with GPU hardware; its outputs are described as “incredible” compared to Sopro’s artifacts.
- Kokoro (82M) is mentioned as another high-quality, lightweight local model, though some browser/latency issues are reported.
- Other open-source options brought up include Vibe Voice, F5/E2, and Higgs-Audio; some consider Vibe Voice the only viable OSS option for high-quality cloning.
- Commercial systems like ElevenLabs are referenced as current quality benchmarks, especially for speech-to-speech.
Zero-shot Voice Cloning Terminology
- A long subthread debates what “zero-shot” means.
- One camp uses the ML definition: zero-shot = no weight updates for unseen speakers; reference audio at inference is just conditioning context.
- Another camp argues that if you must supply an example voice clip, it’s intuitively “one-shot,” and the term is misleading.
- Consensus: terminology is overloaded and confusing, but in this project “zero-shot” means no per-speaker training or fine-tuning (see the sketch after this list).
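To make the ML-definition camp's point concrete, here is a minimal toy sketch of zero-shot conditioning in PyTorch. All module names and dimensions are illustrative assumptions, not Sopro's actual code or API; the point is only that the reference clip is consumed in a forward pass, with no per-speaker weight updates.

```python
# Toy illustration of "zero-shot" cloning in the ML sense: the reference clip
# only conditions inference; no model weights are updated for the new speaker.
# (ToySpeakerEncoder / ToyDecoder are made-up stand-ins, not Sopro's modules.)
import torch
import torch.nn as nn

class ToySpeakerEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(80, dim)  # pretend input is an 80-bin mel spectrogram

    def forward(self, ref_mel):                # (frames, 80)
        return self.proj(ref_mel).mean(dim=0)  # pool frames into one speaker vector

class ToyDecoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.out = nn.Linear(dim + 16, 80)  # text features (16) + speaker vector -> mel frame

    def forward(self, text_feats, spk):     # (steps, 16), (dim,)
        spk = spk.expand(text_feats.size(0), -1)
        return self.out(torch.cat([text_feats, spk], dim=-1))

encoder, decoder = ToySpeakerEncoder(), ToyDecoder()
ref_mel = torch.randn(200, 80)    # a few seconds of reference-audio features
text_feats = torch.randn(50, 16)  # features for the text to be spoken

# Zero-shot: a single forward pass conditioned on the reference; no optimizer
# and no gradient step, so the weights are never adapted to this speaker.
with torch.no_grad():
    spk_vec = encoder(ref_mel)
    mel_out = decoder(text_feats, spk_vec)

# By contrast, "one-shot adaptation" in the fine-tuning sense would involve
#   loss.backward(); optimizer.step()   # per-speaker weight updates -- not done here
print(mel_out.shape)  # torch.Size([50, 80])
```

Under this reading, the reference clip plays the same role as a prompt to a language model: it is context, not a training example.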
Use Cases & Ethical Concerns
- Positive use cases: local voice assistants, on-demand audiobooks, accessibility and restoring voices lost to disease, automation of phone chores.
- Strong concern about scams and impersonation (e.g., calls to elderly relatives); some question whether the societal downsides outweigh benefits.
- A philosophical thread debates whether “bad technology” exists or only bad uses, with analogies to weapons technologies.
Technical Details & Constraints
- The author describes this as a hobby project trained on a budget of a few hundred dollars, and says enough community interest might justify training a larger, higher-fidelity model.
- The 169M parameter count excludes the Mimi codec's parameters; the model uses FiLM (feature-wise linear modulation) for speaker conditioning, sketched after this list.
- CPU-speed claims (e.g., ~7.5 s to generate ~30 s of audio, a real-time factor of roughly 0.25) are seen as impressive relative to typical GPU-heavy TTS setups.
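For readers unfamiliar with FiLM, the idea is that the speaker embedding predicts a per-channel scale and shift applied to the backbone's hidden activations. The following is a minimal, hypothetical PyTorch sketch; the dimensions, module name, and placement are assumptions, not taken from Sopro's implementation.

```python
# Minimal FiLM (feature-wise linear modulation) layer for speaker conditioning.
# The speaker embedding predicts a per-channel scale (gamma) and shift (beta)
# that modulate the hidden activations. Dimensions are illustrative only.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, spk_dim: int, hidden_dim: int):
        super().__init__()
        # One linear layer predicts both gamma and beta from the speaker vector.
        self.to_scale_shift = nn.Linear(spk_dim, 2 * hidden_dim)

    def forward(self, h: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # h:   (batch, time, hidden_dim) hidden activations of the TTS backbone
        # spk: (batch, spk_dim)          speaker embedding from the reference clip
        gamma, beta = self.to_scale_shift(spk).chunk(2, dim=-1)
        # Broadcast over the time axis and modulate each channel.
        return gamma.unsqueeze(1) * h + beta.unsqueeze(1)

film = FiLM(spk_dim=192, hidden_dim=512)
h = torch.randn(1, 120, 512)  # backbone activations for ~120 frames
spk = torch.randn(1, 192)     # speaker embedding
print(film(h, spk).shape)     # torch.Size([1, 120, 512])
```

One appeal of FiLM-style conditioning is that it adds only a small number of parameters per conditioned layer, which fits a model aiming for a 169M budget.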