Neural audio codecs: how to get audio into LLMs
Overall reception
- Thread is highly positive about the article: praised as dense, clear, visually excellent, and a strong conceptual overview of neural audio, tokenization, and codecs.
- Several people mention sharing it with teams or using it to guide current audio/voice projects.
“Real understanding” and tokenization
- Some push back on the article’s contrast between speech wrappers (ASR→LLM→TTS) and “real speech understanding,” arguing that text tokenization is also a lossy, non-“real” representation.
- Others note that “understanding” itself isn’t well defined; current systems are judged by behavioral benchmarks, not mechanistic criteria.
- Related work is cited on learning tokenization and language modeling end-to-end, including for text, images, and audio.
Audio-first models and data constraints
- Multiple commenters ask why we don’t just tokenize speech directly and build LLMs on speech tokens.
- Points raised:
- Audio requires far more tokens than the equivalent text (at least ~4× as many), which raises training and inference cost; see the rough arithmetic after this list.
- There is a great deal of recorded speech in the world, but far less of it is normalized, labeled, and linguistically clean compared with text.
- Aligning audio with text (timing) used to be a concern but is now mostly solved by modern ASR; huge timestamped corpora have been built with Whisper-like systems.
- Some expect audio-first models to eventually surpass text-only LLMs in communicative nuance.
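As a rough back-of-the-envelope illustration of the "at least ~4×" point: the exact ratio depends on the codec frame rate, the number of codebooks, and the text tokenizer, and every rate below is an assumption for illustration rather than a measurement.

```python
# Tokens needed to represent one minute of speech as text vs. as codec tokens.
# All rates are illustrative assumptions, not measurements.

seconds = 60

# Text side: ~150 spoken words per minute, ~1.3 subword tokens per word (assumed).
text_tokens = 150 * 1.3                      # ~195 tokens per minute

# Audio, low end: coarse "semantic" tokens at 25 frames/s, 1 codebook (assumed).
semantic_tokens = seconds * 25 * 1           # 1500 tokens per minute

# Audio, typical RVQ codec: 50 frames/s with 4 residual codebooks (assumed).
acoustic_tokens = seconds * 50 * 4           # 12000 tokens per minute

print(f"text     : {text_tokens:.0f} tokens/min")
print(f"semantic : {semantic_tokens} tokens/min (~{semantic_tokens / text_tokens:.0f}x text)")
print(f"acoustic : {acoustic_tokens} tokens/min (~{acoustic_tokens / text_tokens:.0f}x text)")
```

Even the coarsest audio tokenization in this toy calculation lands several times above the text rate, and multi-codebook acoustic tokens push the gap into the tens.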
Neural codecs vs traditional codecs (MP3/Opus, formants, physics)
- Core discussion is how to turn continuous audio into discrete tokens suitable for autoregressive models.
- Neural codecs (VQ-VAE, RVQ; a minimal RVQ sketch follows this list) are favored because they:
- Achieve very low bitrates (≈1–3 kbps) while preserving intelligibility and prosody.
- Produce categorical, discrete tokens that are easier for transformers than continuous embeddings or heavily compressed bytestreams.
- Traditional codecs (MP3/Opus, formant/source–filter models) are discussed:
- Pros: psychoacoustic design, lower CPU cost, decades of engineering.
- Cons: bitrates are still comparatively high, and bitpacking plus psychoacoustic pruning obscure structure that models may need in order to learn semantics and generalize.
- Some argue that discarding “inaudible” components may hurt learning, even if humans can’t consciously perceive them.
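To make the RVQ idea concrete, here is a minimal sketch of residual vector quantization on a single frame embedding: each stage picks the nearest entry in its codebook and hands the leftover residual to the next stage, so every frame becomes a short tuple of small integers. The codebook count, size, and dimensionality are made-up illustration values, and the codebooks are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(frame, codebooks):
    """Residual vector quantization: quantize a frame embedding with a
    stack of codebooks, each operating on the residual left by the last."""
    indices, residual = [], frame
    for cb in codebooks:                       # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))            # nearest codebook entry
        indices.append(idx)
        residual = residual - cb[idx]          # next stage sees what is left
    return indices

def rvq_decode(indices, codebooks):
    """Sum the chosen entries from each codebook to reconstruct the frame."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# Toy setup (assumed sizes): 4 quantizer stages, 256 entries each, 64-dim frames.
dim, n_stages, cb_size = 64, 4, 256
codebooks = [rng.normal(size=(cb_size, dim)) for _ in range(n_stages)]

frame = rng.normal(size=dim)                   # one encoder output frame
tokens = rvq_encode(frame, codebooks)          # e.g. 4 small integers per frame
recon = rvq_decode(tokens, codebooks)
print(tokens, np.linalg.norm(frame - recon))
```

In a real codec the codebooks are trained jointly with the encoder and decoder, so each stage actually shrinks the residual; the point of the sketch is only the token structure an autoregressive model would consume.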
Pitch, emotion, and non-verbal cues
- Several users test current voice LLMs and find they often fail at pitch recognition, melody, accent contrasts, and fine-grained prosody.
- Debate whether this is:
- A capability/representation issue: audio tokens are dominated by text-like information, and models are trained mostly to map to and from text.
- Or an alignment/safety issue: restrictions against accent-matching, voice imitation, or music generation may have suppressed capabilities that were present early on.
- Example: synthetic TTS data used for training carries little meaningful variation in tone, so models may learn to ignore prosody.
- There is interest in ASR that outputs not only words but metadata on pitch, emotion, and non-verbal sounds; current mainstream ASR usually drops these.
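As a small illustration of the kind of prosody metadata people are asking for, a pitch contour can be attached to each ASR word using its timestamps. The `words` structure below is a hypothetical ASR output format, and `librosa.pyin` serves as an off-the-shelf pitch tracker; this is a sketch, not anyone's production pipeline.

```python
import librosa
import numpy as np

# Hypothetical ASR output with word-level timestamps (seconds).
words = [
    {"word": "really", "start": 0.42, "end": 0.80},
    {"word": "?",      "start": 0.80, "end": 0.95},
]

y, sr = librosa.load("utterance.wav", sr=16000)

# Frame-level fundamental-frequency estimate for the whole clip.
f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
times = librosa.times_like(f0, sr=sr)

for w in words:
    mask = (times >= w["start"]) & (times <= w["end"]) & voiced
    pitch = f0[mask]
    w["mean_f0_hz"] = float(np.nanmean(pitch)) if pitch.size else None
    if pitch.size > 1:
        # Rising vs. falling pitch as a crude prosody tag per word.
        w["contour"] = "rising" if pitch[-1] > pitch[0] else "falling"

print(words)
```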
Signal representations: waveforms, spectrograms, and human expertise
- A side-thread debates whether experienced audio engineers can “read” phonemes or words from raw waveforms.
- Skeptics say typical DAW waveform views don't show enough information for that, at best coarse cues such as an "um" or word boundaries.
- Others report being able to visually distinguish certain consonants/vowels with assistance from tools like Melodyne and spectrograms.
- Historical work on spectrogram reading is mentioned as an analogy for models processing time–frequency representations (e.g., Whisper).
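For context, the time–frequency input mentioned here is typically a log-mel spectrogram. A minimal version with Whisper-like settings (16 kHz audio, 25 ms windows, 10 ms hop, 80 mel bins; the parameter choices are assumptions for illustration):

```python
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)   # resample to 16 kHz

# STFT with 25 ms windows and 10 ms hop, projected onto 80 mel bands.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80
)
log_mel = np.log(np.maximum(mel, 1e-10))       # log-compress, avoiding log(0)

print(log_mel.shape)   # (80, n_frames): one 80-dim column per 10 ms of audio
```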
Model architectures and hierarchy
- Some propose that sequence models with linear-time, constant-memory inference (RWKV, S4) or hierarchical setups might be better suited to audio than full-attention transformers.
- Idea: a fast, low-level phonetic model plus a slower, high-level transformer that operates on coarser "summary" tokens carrying semantics and emotion (sketched after this list).
- Related existing work is cited (e.g., hierarchical token stacks in music models, patch-based audio models), supporting the general direction.
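A minimal sketch of the two-rate idea from the bullet above, with all names, sizes, and the pooling scheme assumed for illustration: a fast recurrent model runs over every codec token, while a slower transformer only sees pooled "summary" vectors covering many frames.

```python
import torch
import torch.nn as nn

class TwoRateAudioModel(nn.Module):
    """Hypothetical hierarchy: a fast GRU over per-frame codec tokens,
    and a slow transformer over pooled summaries of every `chunk` frames."""

    def __init__(self, vocab=1024, dim=256, chunk=8):
        super().__init__()
        self.chunk = chunk
        self.embed = nn.Embedding(vocab, dim)
        self.fast = nn.GRU(dim, dim, batch_first=True)          # low level, every frame
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.slow = nn.TransformerEncoder(layer, num_layers=2)  # high level, 1/chunk rate

    def forward(self, tokens):                  # tokens: (batch, T) codec indices
        x, _ = self.fast(self.embed(tokens))    # (batch, T, dim)
        T = x.shape[1] - x.shape[1] % self.chunk
        # Average-pool each chunk of frames into one "summary" token.
        summaries = x[:, :T].reshape(x.shape[0], -1, self.chunk, x.shape[-1]).mean(dim=2)
        return self.slow(summaries)             # (batch, T // chunk, dim)

model = TwoRateAudioModel()
out = model(torch.randint(0, 1024, (2, 200)))   # 200 codec frames -> 25 summary tokens
print(out.shape)
```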
Alignment, accents, and social issues
- Discussion touches on whether voice models should match user accents or deliberately avoid it.
- Some view non-matching as an overcautious, sociopolitical choice; others emphasize avoiding automated inferences about race from voice.
- There’s concern about models becoming “phrenology machines” if they predict race/ethnicity from audio.
Practical tools, applications, and accessibility
- Commenters mention existing tools (podcast editors, Descript-style systems) that already mix ASR and audio manipulation, hinting at near-term use cases: automatic filler removal, prosody-aware editing, emotional TTS (a filler-removal sketch closes this section).
- Several express excitement about future systems that:
- Truly understand pronunciation, intonation, and emotion.
- Can correct second-language accents or respond playfully to how you speak.
- One commenter criticizes limited public access to some of the discussed tooling (e.g., voice cloning systems requiring short samples), noting that closed deployment slows community experimentation.
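As a sketch of the "automatic filler removal" use case above: given ASR word timestamps (the `words` structure is a hypothetical transcript format), filler spans can be cut straight out of the waveform. Real editors would also smooth the cut points with short crossfades; this only shows the basic mechanism.

```python
import numpy as np
import soundfile as sf

FILLERS = {"um", "uh", "erm", "like"}

def remove_fillers(audio, sr, words):
    """Drop word spans whose text is a filler from the raw waveform.
    `words` is a hypothetical ASR output: [{"word", "start", "end"}, ...] in seconds."""
    keep = np.ones(len(audio), dtype=bool)
    for w in words:
        if w["word"].lower().strip(".,") in FILLERS:
            keep[int(w["start"] * sr):int(w["end"] * sr)] = False
    return audio[keep]

audio, sr = sf.read("podcast.wav")
words = [
    {"word": "um",   "start": 1.20, "end": 1.55},
    {"word": "so",   "start": 1.55, "end": 1.80},
    {"word": "like", "start": 4.10, "end": 4.32},
]
sf.write("podcast_clean.wav", remove_fillers(audio, sr, words), sr)
```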