2026-01-15

Pocket TTS: A high quality TTS that gives your CPU a voice

Comparison with Other Local TTS/STT Models

The thread frequently compares Pocket TTS to Kokoro, Supertonic, Soprano, Chatterbox, Piper, and SherpaTTS.
Some users feel Kokoro is “better TTS” today, especially given its small size, CPU real‑time performance, and ecosystem; others say Pocket’s voice cloning is the big differentiator.
For STT, Whisper-distill is common; Parakeet/Canary/Nemotron are suggested as much faster alternatives, though they are often English‑only or cover only a few languages.

Voice Quality, Cloning, and Model Behavior

Many are impressed by how natural Pocket TTS sounds for a <200M model and how well zero‑shot cloning works with just a few seconds of audio.
Others note zero‑shot cloning is inherently weaker than fine‑tuned voices for speaker similarity and prosody.
Several users report serious text-skipping and reordering bugs (e.g., classic literature passages lose or rearrange clauses; song lyrics gain extra repeated phrases).
A Kyutai contributor attributes this to chunking and suggests shorter inputs as a temporary workaround, with more advanced chunking planned.
Users also ask for explicit speed control beyond adjusting the playback sample rate.
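
The "shorter inputs" workaround can be approximated on the caller's side by splitting text at sentence boundaries before synthesis. The following is a minimal sketch under stated assumptions: the function name split_into_chunks and the 200-character limit are illustrative choices, not anything from Pocket TTS or the thread, and this is not the more advanced chunking the contributor mentions. The synthesis call itself is left out because the thread does not describe the Python API.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    max_chars is an arbitrary illustrative limit, not a documented
    Pocket TTS setting. A single sentence longer than max_chars is kept
    whole rather than cut mid-sentence.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then be synthesized separately and the audio concatenated in order. Splitting only at sentence boundaries keeps clauses together, which is the property the reported skipping and reordering bugs violate.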

Language Support and Multilingual Needs

Strong criticism that the model is English‑only; some argue useful TTS for real-world use (especially accessibility/screen readers, navigation, messaging) must:
- Support multiple languages, and
- Automatically switch languages mid‑sentence or even mid‑word.
Others call that an unreasonably high bar for a tiny CPU‑friendly model and point out that serving 1.5B English speakers is already valuable.
A long subthread debates how humans actually code‑switch in speech and notes that older non‑AI TTS and screen readers have done automatic language switching for years.

Licensing and Legal Ambiguity

The repo says MIT, but also has a “Prohibited Uses” section (e.g., crime, voice cloning without consent).
Commenters point out this conflicts with MIT’s “without restriction” language and likely creates a de‑facto custom license with unclear enforceability.
Some speculate the code might be MIT while models are under a different, more restrictive license, but this remains unclear from the thread.

Integrations, Tooling, and Offline Use

Multiple quick integrations appear: MCP servers for assistants, a browser reader that works like an extension, plugins for agent frameworks, and local notification tools.
People appreciate that it runs locally (e.g., uvx pocket-tts serve) and can output WAV to stdout; stdin text support and a small static binary are requested.
There are questions about minimum laptop hardware, emotion-aware TTS, other languages (including Thai), and whether separate non-English models are planned.
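
For anyone scripting around the WAV-to-stdout behavior mentioned above, a small consumer can check what arrives on a pipe. This is a sketch only, using Python's standard wave module: the helper names describe_wav and make_silence are hypothetical, the 24000 Hz default is a stand-in value rather than a claim about Pocket TTS output, and the code assumes the stream is a complete WAV file with correct header sizes.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Return basic properties of a complete WAV file held in memory."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


def make_silence(seconds: float, rate: int = 24000) -> bytes:
    """Build a mono 16-bit silent WAV, used here only as stand-in input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()
```

In a real pipeline the bytes would come from sys.stdin.buffer.read() instead of make_silence, with the TTS command on the left-hand side of the pipe.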

Broader Views on TTS and Market Impact

Some see rapid TTS progress reminiscent of early Stable Diffusion and talk about cheap, self-generated audiobooks threatening platforms like Audible.
Others counter that users pay for convenience, that e-books plus DIY TTS may not beat subscriptions, and that human narration still adds artistic value.
One commenter dismisses “AI in TTS” as unnecessary, while another notes neural/vocoder-based TTS has already been standard for years.
