Neural audio codecs: how to get audio into LLMs
Overall reception
- Thread is highly positive about the article: praised as dense, clear, visually excellent, and a strong conceptual overview of neural audio, tokenization, and codecs.
- Several people mention sharing it with teams or using it to guide current audio/voice projects.
“Real understanding” and tokenization
- Some push back on the article’s contrast between speech wrappers (ASR→LLM→TTS) and “real speech understanding,” arguing that text tokenization is also a lossy, non-“real” representation.
- Others note that “understanding” itself isn’t well defined; current systems are judged by behavioral benchmarks, not mechanistic criteria.
- Related work is cited on learning tokenization and language modeling end-to-end, including for text, images, and audio.
Audio-first models and data constraints
- Multiple commenters ask why we don’t just tokenize speech directly and build LLMs on speech tokens.
- Points raised:
- Audio requires far more tokens than the equivalent text (at least ~4× as many), which raises training and inference cost; see the rough arithmetic after this list.
- There is a great deal of recorded speech in the world, but far less of it is normalized, labeled, and linguistically clean compared with text.
- Aligning audio with text (timing) used to be a concern but is now mostly solved by modern ASR; huge timestamped corpora have been built with Whisper-like systems.
- Some expect audio-first models to eventually surpass text-only LLMs in communicative nuance.
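As a rough back-of-the-envelope illustration of the "at least ~4×" point: the exact ratio depends on the codec frame rate, the number of codebooks, and the text tokenizer, and every rate below is an assumption for illustration rather than a measurement.

```python
# Tokens needed to represent one minute of speech as text vs. as codec tokens.
# All rates are illustrative assumptions, not measurements.

seconds = 60

# Text side: ~150 spoken words per minute, ~1.3 subword tokens per word (assumed).
text_tokens = 150 * 1.3                      # ~195 tokens per minute

# Audio, low end: coarse "semantic" tokens at 25 frames/s, 1 codebook (assumed).
semantic_tokens = seconds * 25 * 1           # 1500 tokens per minute

# Audio, typical RVQ codec: 50 frames/s with 4 residual codebooks (assumed).
acoustic_tokens = seconds * 50 * 4           # 12000 tokens per minute

print(f"text     : {text_tokens:.0f} tokens/min")
print(f"semantic : {semantic_tokens} tokens/min (~{semantic_tokens / text_tokens:.0f}x text)")
print(f"acoustic : {acoustic_tokens} tokens/min (~{acoustic_tokens / text_tokens:.0f}x text)")
```

Even the coarsest audio tokenization in this toy calculation lands several times above the text rate, and multi-codebook acoustic tokens push the gap into the tens.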
Neural codecs vs traditional codecs (MP3/Opus, formants, physics)
- Core discussion is how to turn continuous audio into discrete tokens suitable for autoregressive models.
- Neural codecs (VQ-VAE, RVQ; a minimal RVQ sketch follows this list) are favored because they:
- Achieve very low bitrates (≈1–3 kbps) while preserving intelligibility and prosody.
- Produce categorical, discrete tokens that are easier for transformers than continuous embeddings or heavily compressed bytestreams.
- Traditional codecs (MP3/Opus, formant/source–filter models) are discussed:
- Pros: psychoacoustic design, lower CPU cost, decades of engineering.
- Cons: bitrates are still comparatively high, and bitpacking plus psychoacoustic pruning obscure structure that models may need in order to learn semantics and generalize.
- Some argue that discarding “inaudible” components may hurt learning, even if humans can’t consciously perceive them.
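To make the RVQ idea concrete, here is a minimal sketch of residual vector quantization on a single frame embedding: each stage picks the nearest entry in its codebook and hands the leftover residual to the next stage, so every frame becomes a short tuple of small integers. The codebook count, size, and dimensionality are made-up illustration values, and the codebooks are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(frame, codebooks):
    """Residual vector quantization: quantize a frame embedding with a
    stack of codebooks, each operating on the residual left by the last."""
    indices, residual = [], frame
    for cb in codebooks:                       # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))            # nearest codebook entry
        indices.append(idx)
        residual = residual - cb[idx]          # next stage sees what is left
    return indices

def rvq_decode(indices, codebooks):
    """Sum the chosen entries from each codebook to reconstruct the frame."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# Toy setup (assumed sizes): 4 quantizer stages, 256 entries each, 64-dim frames.
dim, n_stages, cb_size = 64, 4, 256
codebooks = [rng.normal(size=(cb_size, dim)) for _ in range(n_stages)]

frame = rng.normal(size=dim)                   # one encoder output frame
tokens = rvq_encode(frame, codebooks)          # e.g. 4 small integers per frame
recon = rvq_decode(tokens, codebooks)
print(tokens, np.linalg.norm(frame - recon))
```

In a real codec the codebooks are trained jointly with the encoder and decoder, so each stage actually shrinks the residual; the point of the sketch is only the token structure an autoregressive model would consume.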
Pitch, emotion, and non-verbal cues
- Several users test current voice LLMs and find they often fail at pitch recognition, melody, accent contrasts, and fine-grained prosody.
- Debate whether this is:
- A capability/representation issue: audio tokens are dominated by text-like information, and models are trained mostly to map to and from text.
- Or an alignment/safety issue: restrictions against accent-matching, voice imitation, or music generation may have suppressed capabilities that were present early on.
- Example: synthetic TTS data used for training carries little meaningful variation in tone, so models may learn to ignore prosody.
- There is interest in ASR that outputs not only words but metadata on pitch, emotion, and non-verbal sounds; current mainstream ASR usually drops these.
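As a small illustration of the kind of prosody metadata people are asking for, a pitch contour can be attached to each ASR word using its timestamps. The `words` structure below is a hypothetical ASR output format, and `librosa.pyin` serves as an off-the-shelf pitch tracker; this is a sketch, not anyone's production pipeline.

```python
import librosa
import numpy as np

# Hypothetical ASR output with word-level timestamps (seconds).
words = [
    {"word": "really", "start": 0.42, "end": 0.80},
    {"word": "?",      "start": 0.80, "end": 0.95},
]

y, sr = librosa.load("utterance.wav", sr=16000)

# Frame-level fundamental-frequency estimate for the whole clip.
f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
times = librosa.times_like(f0, sr=sr)

for w in words:
    mask = (times >= w["start"]) & (times <= w["end"]) & voiced
    pitch = f0[mask]
    w["mean_f0_hz"] = float(np.nanmean(pitch)) if pitch.size else None
    if pitch.size > 1:
        # Rising vs. falling pitch as a crude prosody tag per word.
        w["contour"] = "rising" if pitch[-1] > pitch[0] else "falling"

print(words)
```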
Signal representations: waveforms, spectrograms, and human expertise
- A side-thread debates whether experienced audio engineers can “read” phonemes or words from raw waveforms.
- Skeptics say typical DAW waveform views don't show enough information for that, at best coarse cues such as an "um" or word boundaries.
- Others report being able to visually distinguish certain consonants/vowels with assistance from tools like Melodyne and spectrograms.
- Historical work on spectrogram reading is mentioned as an analogy for models processing time–frequency representations (e.g., Whisper).
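For context, the time–frequency input mentioned here is typically a log-mel spectrogram. A minimal version with Whisper-like settings (16 kHz audio, 25 ms windows, 10 ms hop, 80 mel bins; the parameter choices are assumptions for illustration):

```python
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)   # resample to 16 kHz

# STFT with 25 ms windows and 10 ms hop, projected onto 80 mel bands.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80
)
log_mel = np.log(np.maximum(mel, 1e-10))       # log-compress, avoiding log(0)

print(log_mel.shape)   # (80, n_frames): one 80-dim column per 10 ms of audio
```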
Model architectures and hierarchy
- Some propose that sequence models with linear-time, constant-memory inference (RWKV, S4) or hierarchical setups might be better suited to audio than full-attention transformers.
- Idea: a fast, low-level phonetic model plus a slower, high-level transformer that operates on coarser "summary" tokens carrying semantics and emotion (sketched after this list).
- Related existing work is cited (e.g., hierarchical token stacks in music models, patch-based audio models), supporting the general direction.
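A minimal sketch of the two-rate idea from the bullet above, with all names, sizes, and the pooling scheme assumed for illustration: a fast recurrent model runs over every codec token, while a slower transformer only sees pooled "summary" vectors covering many frames.

```python
import torch
import torch.nn as nn

class TwoRateAudioModel(nn.Module):
    """Hypothetical hierarchy: a fast GRU over per-frame codec tokens,
    and a slow transformer over pooled summaries of every `chunk` frames."""

    def __init__(self, vocab=1024, dim=256, chunk=8):
        super().__init__()
        self.chunk = chunk
        self.embed = nn.Embedding(vocab, dim)
        self.fast = nn.GRU(dim, dim, batch_first=True)          # low level, every frame
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.slow = nn.TransformerEncoder(layer, num_layers=2)  # high level, 1/chunk rate

    def forward(self, tokens):                  # tokens: (batch, T) codec indices
        x, _ = self.fast(self.embed(tokens))    # (batch, T, dim)
        T = x.shape[1] - x.shape[1] % self.chunk
        # Average-pool each chunk of frames into one "summary" token.
        summaries = x[:, :T].reshape(x.shape[0], -1, self.chunk, x.shape[-1]).mean(dim=2)
        return self.slow(summaries)             # (batch, T // chunk, dim)

model = TwoRateAudioModel()
out = model(torch.randint(0, 1024, (2, 200)))   # 200 codec frames -> 25 summary tokens
print(out.shape)
```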
Alignment, accents, and social issues
- Discussion touches on whether voice models should match user accents or deliberately avoid it.
- Some view non-matching as an overcautious, sociopolitical choice; others emphasize avoiding automated inferences about race from voice.
- There’s concern about models becoming “phrenology machines” if they predict race/ethnicity from audio.
Practical tools, applications, and accessibility
- Commenters mention existing tools (podcast editors, Descript-style systems) that already mix ASR and audio manipulation, hinting at near-term use cases: automatic filler removal, prosody-aware editing, emotional TTS (a filler-removal sketch closes this section).
- Several express excitement about future systems that:
- Truly understand pronunciation, intonation, and emotion.
- Can correct second-language accents or respond playfully to how you speak.
- One commenter criticizes limited public access to some of the discussed tooling (e.g., voice cloning systems requiring short samples), noting that closed deployment slows community experimentation.
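As a sketch of the "automatic filler removal" use case above: given ASR word timestamps (the `words` structure is a hypothetical transcript format), filler spans can be cut straight out of the waveform. Real editors would also smooth the cut points with short crossfades; this only shows the basic mechanism.

```python
import numpy as np
import soundfile as sf

FILLERS = {"um", "uh", "erm", "like"}

def remove_fillers(audio, sr, words):
    """Drop word spans whose text is a filler from the raw waveform.
    `words` is a hypothetical ASR output: [{"word", "start", "end"}, ...] in seconds."""
    keep = np.ones(len(audio), dtype=bool)
    for w in words:
        if w["word"].lower().strip(".,") in FILLERS:
            keep[int(w["start"] * sr):int(w["end"] * sr)] = False
    return audio[keep]

audio, sr = sf.read("podcast.wav")
words = [
    {"word": "um",   "start": 1.20, "end": 1.55},
    {"word": "so",   "start": 1.55, "end": 1.80},
    {"word": "like", "start": 4.10, "end": 4.32},
]
sf.write("podcast_clean.wav", remove_fillers(audio, sr, words), sr)
```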