Show HN: Kitten TTS – 25MB CPU-Only, Open-Source TTS Model
Vision: tiny, offline, everywhere
- Many see this as a step toward small, offline ML models that run on cheap, ubiquitous hardware without GPUs or cloud calls.
- Use cases discussed: toys, home assistants, medical devices, language learning tools, navigation, robots, “smart toasters,” and local voice interfaces layered on local LLMs.
- Some contrast this “pay once, runs anywhere” model with subscription/cloud approaches from big tech.
Language support and scope
- Current model is English-only; multilingual models are said to be “in the works.”
- Several commenters dislike that the README doesn’t explicitly state the language.
- Non‑English inputs (Japanese, Thai, etc.) either fail or produce nonsense. The expectation is separate models per language, similar to other TTS projects.
Quality and voice characteristics
- Opinions diverge sharply: some call the quality “amazing for 25MB CPU-only,” others find it metallic, mechanical, “anime/overacted,” or tiring for long listening.
- Web and Reddit demos are generally rated higher quality than many users’ local runs; some suspect different settings or voices (a minimal local-run sketch follows this list).
- The release is described as an early “preview checkpoint,” around 10% trained, with improved 15M- and 80M-parameter models promised soon.
- Issues reported: weak punctuation/pauses, occasional mispronunciation, problems with very short phrases, but notably good handling of numbers compared to some LLM-based TTS.
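A minimal local-run sketch for comparing voices against the web demo. The package and class names (kittentts.KittenTTS), the model id, the voice ids, and the 24 kHz sample rate are assumptions modeled on the project’s README-style examples, not details confirmed in this thread.

```python
# Hedged sketch: generate a few samples locally with different voices,
# since commenters suspect the web demo uses different settings/voices
# than the local defaults. All identifiers below are assumptions.
import soundfile as sf
from kittentts import KittenTTS  # assumed package/class name

model = KittenTTS("KittenML/kitten-tts-nano-0.1")  # assumed model id

text = "Kitten TTS is a tiny text to speech model that runs on the CPU."

for voice in ("expr-voice-2-f", "expr-voice-3-m"):  # assumed voice ids
    audio = model.generate(text, voice=voice)
    sf.write(f"sample-{voice}.wav", audio, 24000)  # assumed 24 kHz output
```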
Performance and latency
- Benchmarks on a high-end laptop show generation roughly 5× faster than realtime once the model is loaded; low‑end CPUs can be around realtime or slower.
- Some compare unfavorably to Piper on a Raspberry Pi, which feels “almost instant.”
- The current demo has no chunking, so long texts can fail; chunking is planned (a chunking sketch follows this list).
- The browser demo uses ONNX Runtime; it works well in Chrome, but some report Safari/WebGPU issues.
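A sketch of the kind of sentence-level chunking the thread asks for, plus a realtime-factor measurement matching the benchmarks above. This is not the project’s planned implementation; `generate_audio` is a stand-in for whatever TTS call you use, assumed to return a NumPy float array at 24 kHz.

```python
# Hypothetical chunking helper + realtime-factor measurement.
import re
import time
import numpy as np

SAMPLE_RATE = 24_000  # assumed output rate

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split on sentence boundaries, then pack sentences into chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def synthesize_long(text: str, generate_audio) -> np.ndarray:
    """Generate each chunk separately and report the realtime factor."""
    pieces = []
    start = time.perf_counter()
    for chunk in chunk_text(text):
        pieces.append(generate_audio(chunk))
    elapsed = time.perf_counter() - start
    audio = np.concatenate(pieces)
    rtf = (len(audio) / SAMPLE_RATE) / elapsed  # >1 means faster than realtime
    print(f"{len(audio)/SAMPLE_RATE:.1f}s of audio in {elapsed:.1f}s "
          f"({rtf:.1f}x realtime)")
    return audio
```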
Dependencies, packaging, and licensing
- Despite the ~25MB model, Python environments often balloon to multiple gigabytes and are fragile across Python versions; many commenters complain about “dependency hell.”
- ONNX and phonemizer/espeak-ng preprocessing are the main heavy dependencies; maintainers say they’ll try to reduce this and offer a cleaner SDK (a phonemization sketch follows this list).
- While the model is advertised as Apache‑2.0, reliance on a GPL‑3 phonemizer (itself using GPL espeak‑ng) effectively makes the combined project GPL‑3 in practice; there’s a long subthread on GPL compatibility, exceptions, and dual licensing.
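A minimal sketch of the text-to-phoneme step the thread identifies as the heavy, GPL-licensed dependency. Whether Kitten TTS calls phonemizer exactly this way is an assumption; the point is only to show which piece carries the licensing weight.

```python
# phonemizer is GPL-3 and relies on the espeak-ng system library (also GPL).
from phonemizer import phonemize

text = "Kitten TTS runs on the CPU."
phonemes = phonemize(text, language="en-us", backend="espeak", strip=True)
print(phonemes)  # IPA string the acoustic model consumes instead of raw text
```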
Comparisons and alternatives
- Frequently mentioned alternatives: Piper, KokoroTTS, Dia, Chatterbox, SherpaTTS, Coqui XTTS, Fish-Speech, F5‑TTS, Picovoice Orca, plus classic Festival, eSpeak, DECtalk, and SAM.
- Consensus: this model is not yet SOTA in naturalness, but it is notable for combining tiny size, CPU‑only inference, and a permissive license (subject to the GPL issue above).