Pocket TTS: A high quality TTS that gives your CPU a voice

Comparison with Other Local TTS/STT Models

  • The thread frequently compares Pocket TTS to Kokoro, Supertonic, Soprano, Chatterbox, Piper, and SherpaTTS.
  • Some users feel Kokoro is “better TTS” today, especially given its small size, CPU real‑time performance, and ecosystem; others say Pocket’s voice cloning is the big differentiator.
  • For STT, Whisper-distill is common; Parakeet/Canary/Nemotron are suggested as much faster alternatives, though they are often English‑only or otherwise limited in language coverage.

Voice Quality, Cloning, and Model Behavior

  • Many are impressed by how natural Pocket TTS sounds for a <200M model and how well zero‑shot cloning works with just a few seconds of audio.
  • Others note zero‑shot cloning is inherently weaker than fine‑tuned voices for speaker similarity and prosody.
  • Several users report serious text-skipping and reordering bugs (e.g., passages from classic literature lose or rearrange clauses; song lyrics gain extra repeated phrases).
  • A Kyutai contributor attributes this to chunking and suggests shorter inputs as a temporary workaround, with more advanced chunking planned.
  • Users also ask for an explicit speaking-speed control, rather than having to adjust the sample rate.
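
The "shorter inputs" workaround mentioned above can be approximated on the caller's side by splitting long text before handing it to the synthesizer. The following is a minimal sketch, not Pocket TTS's own chunking: the function name split_into_chunks, the 200-character default, and the sentence-boundary regex are all assumptions made for illustration.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Sentence boundaries are preferred; a sentence longer than max_chars
    is split between words. A single word longer than max_chars is kept
    whole, so that one chunk may exceed the limit.
    """
    # Hypothetical boundary rule: split after '.', '!' or '?' plus whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Oversized sentences are packed word by word; others go in whole.
        pieces = sentence.split() if len(sentence) > max_chars else [sentence]
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) > max_chars and current:
                # Adding this piece would overflow: close the current chunk.
                chunks.append(current)
                current = piece
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then be synthesized separately and the audio concatenated. Whether this avoids the skipped or reordered text depends on the model, and splitting can change prosody at chunk boundaries.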

Language Support and Multilingual Needs

  • There is strong criticism that the model is English‑only; some argue that TTS useful in the real world (especially for accessibility/screen readers, navigation, messaging) must:
    • Support multiple languages, and
    • Automatically switch languages mid‑sentence or even mid‑word.
  • Others call that an unreasonably high bar for a tiny CPU‑friendly model and point out that serving 1.5B English speakers is already valuable.
  • Long subthread debates how humans actually code‑switch in speech and note that older non‑AI TTS and screen readers have done automatic language switching for years.

Licensing and Legal Ambiguity

  • The repo says MIT, but also has a “Prohibited Uses” section (e.g., crime, voice cloning without consent).
  • Commenters point out this conflicts with MIT’s “without restriction” language and likely creates a de‑facto custom license with unclear enforceability.
  • Some speculate the code might be MIT while models are under a different, more restrictive license, but this remains unclear from the thread.

Integrations, Tooling, and Offline Use

  • Multiple quick integrations appear: MCP servers for assistants, a browser reader that works like an extension, plugins for agent frameworks, and local notification tools.
  • People appreciate that it runs locally (e.g., uvx pocket-tts serve) and can output WAV to stdout; stdin text support and a small static binary are requested.
  • There are questions about minimum laptop hardware, emotion-aware TTS, other languages (including Thai), and whether separate non-English models are planned.
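
Because the tool can write WAV to stdout, its output can be consumed by another process in a pipeline. The sketch below shows one way a downstream script might inspect such audio using only Python's standard library. The function name wav_summary is hypothetical, and the code assumes the complete WAV file has been read into memory and that its header records the true data length, which may not hold for audio that is streamed as it is generated.

```python
import io
import wave


def wav_summary(data: bytes) -> dict:
    """Return basic header facts for a complete in-memory WAV file."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            # Duration is derived from the header's frame count.
            "duration_s": frames / rate,
        }


# Usage in a pipeline (the upstream command is whatever emits WAV on stdout):
#   import sys
#   print(wav_summary(sys.stdin.buffer.read()))
```

Reading from sys.stdin.buffer rather than sys.stdin matters here, since the audio is binary data rather than text.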

Broader Views on TTS and Market Impact

  • Some see rapid TTS progress reminiscent of early Stable Diffusion and talk about cheap, self-generated audiobooks threatening platforms like Audible.
  • Others counter that users pay for convenience, that e-books plus DIY TTS may not beat subscriptions, and that human narration still adds artistic value.
  • One commenter dismisses “AI in TTS” as unnecessary, while another notes neural/vocoder-based TTS has already been standard for years.