Pocket TTS: A high quality TTS that gives your CPU a voice
Comparison with Other Local TTS/STT Models
- The thread frequently compares Pocket TTS to Kokoro, Supertonic, Soprano, Chatterbox, Piper, and SherpaTTS.
- Some users feel Kokoro is “better TTS” today, especially given its small size, CPU real‑time performance, and ecosystem; others say Pocket’s voice cloning is the big differentiator.
- For STT, Whisper-distill is common; Parakeet/Canary/Nemotron are suggested as much faster alternatives, though they are often English‑only or cover only a few languages.
Voice Quality, Cloning, and Model Behavior
- Many are impressed by how natural Pocket TTS sounds for a <200M model and how well zero‑shot cloning works with just a few seconds of audio.
- Others note zero‑shot cloning is inherently weaker than fine‑tuned voices for speaker similarity and prosody.
- Several users report serious text-skipping and reordering bugs (e.g., classic literature passages lose or rearrange clauses; song lyrics gain extra repeated phrases).
- A Kyutai contributor attributes this to chunking and suggests shorter inputs as a temporary workaround, with more advanced chunking planned.
- Users also ask for explicit speed control beyond sample rate.
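
The shorter-input workaround mentioned above could be approximated on the caller's side by splitting long text at sentence boundaries before handing each piece to the TTS. The sketch below is only an illustration: the function name split_into_chunks and the max_chars limit are hypothetical, it does not reflect how Pocket TTS chunks text internally, and the thread does not say what input length is safe.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Hypothetical helper: max_chars is an arbitrary illustrative limit.
    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    # Rough heuristic: split after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    passage = (
        "It was the best of times. It was the worst of times! "
        "Was it the age of wisdom? Yes."
    )
    for chunk in split_into_chunks(passage, max_chars=30):
        print(chunk)
```

Each chunk would then be synthesized separately and the audio concatenated. Splitting only at sentence boundaries keeps clauses together, which is the property the reported bugs appear to violate.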
Language Support and Multilingual Needs
- Strong criticism that the model is English‑only; some argue that TTS for real-world use (especially accessibility/screen readers, navigation, messaging) must:
  - Support multiple languages, and
  - Automatically switch languages mid‑sentence or even mid‑word.
- Others call that an unreasonably high bar for a tiny CPU‑friendly model and point out that serving 1.5B English speakers is already valuable.
- Long subthread debates how humans actually code‑switch in speech and note that older non‑AI TTS and screen readers have done automatic language switching for years.
Licensing and Legal Ambiguity
- The repo says MIT, but also has a “Prohibited Uses” section (e.g., crime, voice cloning without consent).
- Commenters point out this conflicts with MIT’s “without restriction” language and likely creates a de‑facto custom license with unclear enforceability.
- Some speculate the code might be MIT while models are under a different, more restrictive license, but this remains unclear from the thread.
Integrations, Tooling, and Offline Use
- Multiple quick integrations appear: MCP servers for assistants, a browser-extension-style reader, plugins for agent frameworks, and local notification tools.
- People appreciate that it runs locally (e.g., uvx pocket-tts serve) and can output WAV to stdout; stdin text support and a small static binary are requested.
- There are questions about minimum laptop hardware, emotion-aware TTS, other languages (including Thai), and whether separate non-English models are planned.
Broader Views on TTS and Market Impact
- Some see rapid TTS progress reminiscent of early Stable Diffusion and talk about cheap, self-generated audiobooks threatening platforms like Audible.
- Others counter that users pay for convenience, that e-books plus DIY TTS may not beat subscriptions, and that human narration still adds artistic value.
- One commenter dismisses “AI in TTS” as unnecessary, while another notes neural/vocoder-based TTS has already been standard for years.