Show HN: Dia, an open-weights TTS model for generating realistic dialogue

Audio Quality and Expressiveness

  • Many listeners find the demos shockingly good, on par with or better than popular closed models and clearly ahead of “robotic” TTS they’re used to.
  • Standout traits: natural dialogue flow, naturally overlapping conversational turns, emotional range (laughs, coughs, yelling), and convincing “podcast/NPR”‑like delivery.
  • Critiques: sample voices are often too energetic and “ad-like,” with little calm, neutral conversation. Some hear speech that is noticeably too fast and accelerates over time; one commenter ties this to the known classifier-free guidance (CFG) speed drift inherited from the Parakeet architecture (a generic CFG sketch follows this list).
  • Users also report artifacts: initial hissing, occasional background “music,” incomplete use of the text prompt, and sometimes cutting off the end of the script. A few prompts (especially with custom non-verbal tags) produce bizarre or profane outputs.
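
For readers unfamiliar with the term, the sketch below shows generic classifier-free guidance (CFG) as used in autoregressive audio-token decoders: each step blends a text-conditioned and an unconditional prediction, and overly high guidance scales can bias decoding toward faster, sharper speech. All names here are illustrative; this is not Dia's code.

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               guidance_scale: float) -> torch.Tensor:
    """Combine conditional and unconditional predictions (generic CFG).

    The next-token distribution is pushed away from the unconditional
    prediction and toward the text-conditioned one. In audio-token
    decoders, overly high scales can bias decoding toward faster, more
    "energetic" speech, which matches the drift commenters describe.
    Illustrative only; not Dia's actual implementation.
    """
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Hypothetical decoding step: two forward passes per token (with and
# without the text condition), combined before sampling the next
# audio token from the guided distribution.
cond = torch.randn(1, 1024)    # logits with the text condition
uncond = torch.randn(1, 1024)  # logits with the text dropped
next_token = torch.distributions.Categorical(
    logits=cfg_logits(cond, uncond, guidance_scale=3.0)
).sample()
```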

Use Cases and Desired Features

  • Strong interest in multi-voice audiobooks: consistent character voices, LLM-driven casting, and expressive narration. Some see this as approaching or eventually rivaling human narrators; others insist humans still add unique value.
  • Other proposed uses: VR games, dialogue-heavy apps, language practice, medical transcription/education, and AI podcast-like experiences.
  • Frequent requests:
    • More languages (Chinese, Finnish, etc.; currently English-only).
    • More than two speakers per scene (the current two-speaker tag format is sketched after this list).
    • Streaming/low-latency output and real-time use.
    • Word-level timing maps and better control of speaker selection.
    • More reliable voice cloning and possibly fine-tuning, beyond zero-shot prompts.
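
For context on these requests, Dia's current prompt is a single script with [S1]/[S2] speaker tags and parenthesized non-verbal cues. The sketch below follows the usage shown in the project README at the time of the thread; the exact call signatures, the returned audio format, and the 44.1 kHz sample rate should be treated as assumptions that may change between versions.

```python
import soundfile as sf
from dia.model import Dia  # assumes the nari-labs `dia` package is installed

# Two speakers are marked with [S1]/[S2] tags; parenthesized cues such as
# (laughs) or (coughs) request non-verbal sounds inline.
script = (
    "[S1] Welcome back to the show. "
    "[S2] Thanks, happy to be here. (laughs) "
    "[S1] Let's get started."
)

# Loading and generation roughly follow the project README; argument
# names, defaults, and the returned array format may differ by version.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")
audio = model.generate(script)

# The project's examples write 44.1 kHz audio (the DAC codec's rate).
sf.write("dialogue.wav", audio, 44100)
```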

Model Architecture, Performance, and Tooling

  • Dia generates an entire conversation in a single pass instead of per-turn stitching, which users see as a conceptual advantage.
  • The model is ~1.6B parameters, open-weights, and currently needs ~10 GB of VRAM, though quantization and other optimizations are planned. Community members report running it (slowly) on Apple Silicon and via Hugging Face Spaces (including ZeroGPU).
  • The architecture is acknowledged as closely inspired by Parakeet and SoundStorm, using Descript’s DAC audio codec and Whisper-D-style tags. Planned future iterations include mixture-of-experts (MoE) layers and sliding-window attention.
  • Community members are already wrapping it in servers, Docker images, and Unity/VR integrations.
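
As an illustration of the “wrapping it in servers” point, below is a minimal sketch of an HTTP wrapper. It assumes the `dia` package's `from_pretrained`/`generate` entry points from the README and 44.1 kHz output; FastAPI and the endpoint shape are choices made for this sketch, not a description of any particular community project.

```python
import io

import soundfile as sf
from fastapi import FastAPI, Response
from pydantic import BaseModel

from dia.model import Dia  # assumes the nari-labs `dia` package is installed

app = FastAPI()
# Load the weights once at startup; every request reuses the same model.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")


class SynthesisRequest(BaseModel):
    # An [S1]/[S2]-tagged dialogue script, as in the project's examples.
    script: str


@app.post("/synthesize")
def synthesize(req: SynthesisRequest) -> Response:
    # Generate the whole conversation in one pass and return WAV bytes.
    audio = model.generate(req.script)
    buf = io.BytesIO()
    sf.write(buf, audio, 44100, format="WAV")
    return Response(content=buf.getvalue(), media_type="audio/wav")
```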

Training Data, Licensing, and Ethics

  • Several commenters press for clarity on training data origin, suspecting heavy use of podcast-style material. This triggers a broader debate about fair use, consent, and the double standard between enforcing FOSS licenses vs. tolerating opaque AI datasets.
  • The license is Apache 2.0; the additional wording about “intended for research and educational use” and forbidden misuse was clarified in the thread to be guidance, not a legally binding extra restriction.
  • Some raise concerns about voice-cloning misuse and ask whether complementary detection tools will be developed.

Naming, UX, and Ecosystem Context

  • The name “Dia” collides with well-known existing projects, prompting criticism that AI projects often reuse established OSS names without due diligence.
  • Users note rough edges in the Notion-based demo page and intermittent Hugging Face issues.
  • Several see Dia as evidence that small teams can now rival large labs in TTS, and call for an open “Stable Diffusion moment” in speech to challenge expensive proprietary services.