Sopro TTS: A 169M model with zero-shot voice cloning that runs on the CPU
Perceived Audio Quality & Usefulness
- Many praise the project as impressive given its small size and CPU-only constraints; some find it “good enough” and fun to experiment with.
- Others think the demo sounds extremely robotic, distorted, or “warped cassette–like,” and say they cannot imagine anyone mistaking it for a human voice.
- Several note that output quality depends strongly on the reference audio and generation parameters; a poor reference can yield very rough output.
- Some users report only partial similarity to target voices, but still find the results impressive given the low training budget and ease of local use.
Comparisons to Other TTS Models
- Chatterbox-TTS is repeatedly cited as a higher-quality alternative, especially with GPU hardware; its outputs are described as “incredible” compared to Sopro’s artifacts.
- Kokoro (82M) is mentioned as another high-quality, lightweight local model, though some browser/latency issues are reported.
- Other open-source options brought up include Vibe Voice, F5/E2, and Higgs-Audio; some consider Vibe Voice the only viable OSS option for high-quality cloning.
- Commercial systems like ElevenLabs are referenced as current quality benchmarks, especially for speech-to-speech.
Zero-shot Voice Cloning Terminology
- A long subthread debates what “zero-shot” means.
- One camp uses the ML definition: zero-shot = no weight updates for unseen speakers; reference audio at inference is just conditioning context.
- Another camp argues that if you must supply an example voice clip, it’s intuitively “one-shot,” and the term is misleading.
- Consensus: terminology is overloaded and confusing, but in this project “zero-shot” means no per-speaker training or fine-tuning (see the sketch after this list).
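To make the ML-definition camp's point concrete, here is a minimal toy sketch of zero-shot conditioning in PyTorch. All module names and dimensions are illustrative assumptions, not Sopro's actual code or API; the point is only that the reference clip is consumed in a forward pass, with no per-speaker weight updates.

```python
# Toy illustration of "zero-shot" cloning in the ML sense: the reference clip
# only conditions inference; no model weights are updated for the new speaker.
# (ToySpeakerEncoder / ToyDecoder are made-up stand-ins, not Sopro's modules.)
import torch
import torch.nn as nn

class ToySpeakerEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(80, dim)  # pretend input is an 80-bin mel spectrogram

    def forward(self, ref_mel):                # (frames, 80)
        return self.proj(ref_mel).mean(dim=0)  # pool frames into one speaker vector

class ToyDecoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.out = nn.Linear(dim + 16, 80)  # text features (16) + speaker vector -> mel frame

    def forward(self, text_feats, spk):     # (steps, 16), (dim,)
        spk = spk.expand(text_feats.size(0), -1)
        return self.out(torch.cat([text_feats, spk], dim=-1))

encoder, decoder = ToySpeakerEncoder(), ToyDecoder()
ref_mel = torch.randn(200, 80)    # a few seconds of reference-audio features
text_feats = torch.randn(50, 16)  # features for the text to be spoken

# Zero-shot: a single forward pass conditioned on the reference; no optimizer
# and no gradient step, so the weights are never adapted to this speaker.
with torch.no_grad():
    spk_vec = encoder(ref_mel)
    mel_out = decoder(text_feats, spk_vec)

# By contrast, "one-shot adaptation" in the fine-tuning sense would involve
#   loss.backward(); optimizer.step()   # per-speaker weight updates -- not done here
print(mel_out.shape)  # torch.Size([50, 80])
```

Under this reading, the reference clip plays the same role as a prompt to a language model: it is context, not a training example.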
Use Cases & Ethical Concerns
- Positive use cases: local voice assistants, on-demand audiobooks, accessibility and restoring voices lost to disease, automation of phone chores.
- Strong concern about scams and impersonation (e.g., calls to elderly relatives); some question whether the societal downsides outweigh benefits.
- A philosophical thread debates whether “bad technology” exists or only bad uses, with analogies to weapons technologies.
Technical Details & Constraints
- The author describes this as a hobby project trained on a budget of a few hundred dollars, and says enough community interest might justify training a larger, higher-fidelity model.
- The 169M parameter count excludes the Mimi codec's parameters; the model uses FiLM (feature-wise linear modulation) for speaker conditioning, sketched after this list.
- CPU-speed claims (e.g., ~7.5 s to generate ~30 s of audio, a real-time factor of roughly 0.25) are seen as impressive relative to typical GPU-heavy TTS setups.
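For readers unfamiliar with FiLM, the idea is that the speaker embedding predicts a per-channel scale and shift applied to the backbone's hidden activations. The following is a minimal, hypothetical PyTorch sketch; the dimensions, module name, and placement are assumptions, not taken from Sopro's implementation.

```python
# Minimal FiLM (feature-wise linear modulation) layer for speaker conditioning.
# The speaker embedding predicts a per-channel scale (gamma) and shift (beta)
# that modulate the hidden activations. Dimensions are illustrative only.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, spk_dim: int, hidden_dim: int):
        super().__init__()
        # One linear layer predicts both gamma and beta from the speaker vector.
        self.to_scale_shift = nn.Linear(spk_dim, 2 * hidden_dim)

    def forward(self, h: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # h:   (batch, time, hidden_dim) hidden activations of the TTS backbone
        # spk: (batch, spk_dim)          speaker embedding from the reference clip
        gamma, beta = self.to_scale_shift(spk).chunk(2, dim=-1)
        # Broadcast over the time axis and modulate each channel.
        return gamma.unsqueeze(1) * h + beta.unsqueeze(1)

film = FiLM(spk_dim=192, hidden_dim=512)
h = torch.randn(1, 120, 512)  # backbone activations for ~120 frames
spk = torch.randn(1, 192)     # speaker embedding
print(film(h, spk).shape)     # torch.Size([1, 120, 512])
```

One appeal of FiLM-style conditioning is that it adds only a small number of parameters per conditioned layer, which fits a model aiming for a 169M budget.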