Neural audio codecs: how to get audio into LLMs

Overall reception

  • Thread is highly positive about the article: praised as dense, clear, visually excellent, and a strong conceptual overview of neural audio, tokenization, and codecs.
  • Several people mention sharing it with teams or using it to guide current audio/voice projects.

“Real understanding” and tokenization

  • Some push back on the article’s contrast between speech wrappers (ASR→LLM→TTS) and “real speech understanding,” arguing that text tokenization is also a lossy, non-“real” representation.
  • Others note that “understanding” itself isn’t well defined; current systems are judged by behavioral benchmarks, not mechanistic criteria.
  • Related work is cited on learning tokenization and language modeling end-to-end, including for text, images, and audio.

Audio-first models and data constraints

  • Multiple commenters ask why we don’t just tokenize speech directly and build LLMs on speech tokens.
  • Points raised:
    • Audio tokens are far more numerous than text tokens for the same content (at least ~4×), increasing cost; see the back-of-envelope sketch after this list.
    • There is a lot of speech in the world, but far less of it is normalized, labeled, and linguistically clean compared to text.
    • Aligning audio with text (timing) used to be a concern but is now mostly solved by modern ASR; huge timestamped corpora have been built with Whisper-like systems.
  • Some expect audio-first models to eventually surpass text-only LLMs in communicative nuance.
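  A back-of-envelope comparison of token rates illustrates where the cost gap comes from. The frame rates, codebook counts, and BPE average below are assumed for illustration, not figures quoted in the thread:

      # Rough token counts for one minute of speech. All rates below are
      # illustrative assumptions, not measurements from the discussion.
      words_per_min = 150               # typical conversational English
      bpe_tokens_per_word = 1.3         # common BPE average for English text
      text_tokens = words_per_min * bpe_tokens_per_word               # ~195/min

      coarse_audio_tokens = 12.5 * 1 * 60   # one codebook at 12.5 Hz -> 750/min
      full_audio_tokens = 50 * 8 * 60       # eight codebooks at 50 Hz -> 24000/min

      print(f"coarse audio vs text: ~{coarse_audio_tokens / text_tokens:.0f}x")  # ~4x
      print(f"full RVQ vs text:     ~{full_audio_tokens / text_tokens:.0f}x")    # ~123x

  Even the coarsest single-codebook stream lands near the ~4× figure, and full acoustic token stacks sit far beyond it.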

Neural codecs vs traditional codecs (MP3/Opus, formants, physics)

  • Core discussion is how to turn continuous audio into discrete tokens suitable for autoregressive models.
  • Neural codecs (VQ-VAE, RVQ) are favored because they:
    • Achieve very low bitrates (≈1–3 kbps) while preserving intelligibility and prosody.
    • Produce categorical, discrete tokens that are easier for transformers to model than continuous embeddings or heavily compressed bytestreams (a minimal RVQ sketch follows this section).
  • Traditional codecs (MP3/Opus, formant/source–filter models) are discussed:
    • Pros: psychoacoustic design, lower CPU cost, decades of engineering.
    • Cons: bitrates are still comparatively high; bitpacking and psychoacoustic pruning obscure structure that models may need in order to learn semantics and generalize.
    • Some argue that discarding “inaudible” components may hurt learning, even if humans can’t consciously perceive them.
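  The residual vector quantization (RVQ) idea the thread keeps returning to is small enough to sketch directly. Below is a minimal NumPy version with random placeholder codebooks; real codecs learn the codebooks jointly with an encoder/decoder, and the sizes here are illustrative. With 4 codebooks of 256 entries at 50 frames/s, the resulting bitstream is 4 × 8 × 50 = 1.6 kbps, in line with the ≈1–3 kbps range above.

      # Minimal RVQ sketch: each frame embedding is quantized by a stack of
      # codebooks, and every stage quantizes the residual left over by the
      # previous stage. Codebooks are random placeholders, not learned.
      import numpy as np

      rng = np.random.default_rng(0)
      num_stages, codebook_size, dim = 4, 256, 64
      codebooks = rng.normal(size=(num_stages, codebook_size, dim))

      def rvq_encode(frame):
          """Return one code index per stage for a single frame embedding."""
          residual = frame.copy()
          codes = []
          for stage in range(num_stages):
              # Nearest codebook entry to the current residual.
              dists = np.linalg.norm(codebooks[stage] - residual, axis=1)
              idx = int(np.argmin(dists))
              codes.append(idx)
              residual = residual - codebooks[stage][idx]
          return codes

      def rvq_decode(codes):
          """Sum the selected codebook entries back into an approximate frame."""
          return sum(codebooks[s][c] for s, c in enumerate(codes))

      frame = rng.normal(size=dim)
      codes = rvq_encode(frame)                    # e.g. [17, 203, 5, 88]
      recon = rvq_decode(codes)
      print(codes, np.linalg.norm(frame - recon))  # error shrinks with more stages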

Pitch, emotion, and non-verbal cues

  • Several users test current voice LLMs and find they often fail at pitch recognition, melody, accent contrasts, and fine-grained prosody.
  • Debate whether this is:
    • A capability/representation issue: audio tokens are dominated by text-like information, and models are trained mostly to map to and from text.
    • Or an alignment/safety issue: restrictions against accent-matching, voice imitation, or music generation may have suppressed capabilities that were present early on.
  • Example: synthetic TTS data used for training carries little meaningful variation in tone, so models may learn to ignore prosody.
  • There is interest in ASR that outputs not only words but metadata on pitch, emotion, and non-verbal sounds; current mainstream ASR usually drops these.
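  A minimal sketch of what "ASR plus prosody metadata" could look like, assuming word-level timestamps are already available from an ASR pass (the words list and file name below are hypothetical) and using librosa's pyin for f0 estimation:

      # Attach a crude pitch tag to each ASR word using its timestamps.
      import librosa
      import numpy as np

      audio, sr = librosa.load("utterance.wav", sr=16000)
      f0, voiced_flag, _ = librosa.pyin(
          audio, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
      )
      times = librosa.times_like(f0, sr=sr)

      words = [{"word": "really", "start": 0.42, "end": 0.78}]  # hypothetical ASR output
      for w in words:
          mask = (times >= w["start"]) & (times <= w["end"]) & voiced_flag
          if mask.any():
              w["median_f0_hz"] = float(np.nanmedian(f0[mask]))  # per-word pitch tag
      print(words)

  Emotion and non-verbal events would need separate classifiers; the point is only that word timing already gives a hook for attaching this kind of metadata.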

Signal representations: waveforms, spectrograms, and human expertise

  • A side-thread debates whether experienced audio engineers can “read” phonemes or words from raw waveforms.
    • Skeptics say typical DAW waveforms don’t contain enough visible information for that, beyond coarse cues like an “um” or word boundaries.
    • Others report being able to visually distinguish certain consonants/vowels with assistance from tools like Melodyne and spectrograms.
  • Historical work on spectrogram reading is mentioned as an analogy for models processing time–frequency representations (e.g., Whisper).
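  For reference, the time–frequency input that Whisper-style models consume is an 80-band log-mel spectrogram with 25 ms windows and a 10 ms hop at 16 kHz. A short sketch of computing one with librosa (the file name is a placeholder):

      import librosa
      import numpy as np

      audio, sr = librosa.load("speech.wav", sr=16000)
      mel = librosa.feature.melspectrogram(
          y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=80  # 25 ms / 10 ms at 16 kHz
      )
      log_mel = np.log10(np.maximum(mel, 1e-10))   # dynamic-range compression
      print(log_mel.shape)                         # (80, num_frames), ~100 frames/s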

Model architectures and hierarchy

  • Some propose that linear/constant-time sequence models (RWKV, S4) or hierarchical setups might be better suited to audio than full transformers.
    • Idea: a fast, low-level phonetic model plus a slower, high-level transformer that operates on coarser “summary” tokens carrying semantics and emotion (sketched after this list).
  • Related existing work is cited (e.g., hierarchical token stacks in music models, patch-based audio models), supporting the general direction.
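  A toy version of that two-rate idea, with illustrative sizes and module choices (PyTorch; causal masking and training details omitted): a cheap recurrent model runs at the full acoustic token rate, while a larger transformer only sees pooled "summary" vectors at 1/8 of that rate.

      import torch
      import torch.nn as nn

      FINE_VOCAB, DOWNSAMPLE = 1024, 8   # e.g. 50 Hz fine tokens -> 6.25 Hz summaries

      class HierarchicalAudioLM(nn.Module):
          def __init__(self, d_fine=256, d_coarse=1024):
              super().__init__()
              self.fine_embed = nn.Embedding(FINE_VOCAB, d_fine)
              # Fast low-level model: cheap per step, runs at the full token rate.
              self.fine_lm = nn.GRU(d_fine, d_fine, batch_first=True)
              # Strided conv pools DOWNSAMPLE fine steps into one summary vector.
              self.summarize = nn.Conv1d(d_fine, d_coarse,
                                         kernel_size=DOWNSAMPLE, stride=DOWNSAMPLE)
              # Slow high-level transformer over the much shorter summary sequence.
              layer = nn.TransformerEncoderLayer(d_coarse, nhead=8, batch_first=True)
              self.coarse_lm = nn.TransformerEncoder(layer, num_layers=6)
              self.head = nn.Linear(d_fine + d_coarse, FINE_VOCAB)

          def forward(self, fine_tokens):                  # (B, T) token ids
              x = self.fine_embed(fine_tokens)             # (B, T, d_fine)
              local, _ = self.fine_lm(x)                   # (B, T, d_fine)
              summaries = self.summarize(x.transpose(1, 2)).transpose(1, 2)
              context = self.coarse_lm(summaries)          # (B, T/DOWNSAMPLE, d_coarse)
              # Broadcast each slow summary back over its DOWNSAMPLE fine steps.
              context = context.repeat_interleave(DOWNSAMPLE, dim=1)
              return self.head(torch.cat([local, context], dim=-1))

      model = HierarchicalAudioLM()
      logits = model(torch.randint(0, FINE_VOCAB, (2, 400)))  # 8 s of 50 Hz tokens
      print(logits.shape)                                     # (2, 400, 1024)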

Alignment, accents, and social issues

  • Discussion touches on whether voice models should match user accents or deliberately avoid it.
  • Some view non-matching as an overcautious, sociopolitical choice; others emphasize avoiding automated inferences about race from voice.
  • There’s concern about models becoming “phrenology machines” if they predict race/ethnicity from audio.

Practical tools, applications, and accessibility

  • Commenters mention existing tools (podcast editors, Descript-style systems) that already mix ASR and audio manipulation, hinting at near-term use cases: automatic filler removal (sketched at the end of this section), prosody-aware editing, and emotional TTS.
  • Several express excitement about future systems that:
    • Truly understand pronunciation, intonation, and emotion.
    • Can correct second-language accents or respond playfully to how you speak.
  • One commenter criticizes limited public access to some of the discussed tooling (e.g., voice-cloning systems that work from short samples), noting that closed deployment slows community experimentation.
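  As a concrete example of the Descript-style editing mentioned above, filler removal can be driven purely by ASR word timestamps. A sketch using pydub (the words list and file names are hypothetical; real editors also crossfade at cut points):

      # Cut filler words out of a recording using word-level ASR timestamps.
      from pydub import AudioSegment   # pip install pydub (requires ffmpeg)

      FILLERS = {"um", "uh", "erm"}
      words = [                        # hypothetical ASR output with timestamps
          {"word": "so", "start": 0.00, "end": 0.20},
          {"word": "um", "start": 0.25, "end": 0.60},
          {"word": "codecs", "start": 0.70, "end": 1.10},
      ]

      audio = AudioSegment.from_file("take.wav")
      edited = AudioSegment.empty()
      cursor_ms = 0
      for w in words:
          if w["word"].lower().strip(".,") in FILLERS:
              edited += audio[cursor_ms:int(w["start"] * 1000)]  # keep up to the filler
              cursor_ms = int(w["end"] * 1000)                   # skip the filler itself
      edited += audio[cursor_ms:]                                # keep the tail
      edited.export("take_edited.wav", format="wav")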