How Language Models Handle Audio
This post focuses on speech and speech-adjacent audio. Non-speech audio — music, environmental sound — shares some of the same representational challenges but has a different alignment structure and is not the primary focus here.
The Four Mismatches
Most audio-LLM designs are solving one of three visible problems: what vectors should represent the sound, how many of them the LLM must process, and whether the output should be text or sound again. Alignment is the latent problem tying the first two together: how variable-duration acoustic structure maps to discrete linguistic units. Less visible once explicit alignment machinery is removed, but still there.
Together these are the four mismatches between the LLM's expectations and what audio actually is. Most design choices in this post are one way of resolving one of them.
The easiest way to see the problem is to look at the same utterance under different compression strategies.
We'll start by looking at how older systems handled these mismatches explicitly, then follow the same structure into the modern era, where the same problems persist but the machinery handling them is less visible.
Section 1: What Classical ASR Made Explicit
The alignment and rate mismatches are not new problems. Classical ASR (roughly 1985 to 2015) had explicit, inspectable solutions to both. Understanding them makes it easier to see what modern systems are doing differently, and what they gave up.
Most production ASR systems in that era followed the same skeleton: the audio waveform was chopped into 25 ms frames (with a 10 ms hop), each frame was transformed through an FFT, passed through a mel-scale filterbank, log-compressed, and reduced via the discrete cosine transform to a set of cepstral coefficients (MFCCs), commonly 13, though the number varied by system and era. Adding first and second derivatives (deltas and double-deltas) typically gave 39 numbers per frame. That was your feature vector.
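The pipeline just described can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the 16 kHz sample rate, 512-point FFT, 26 mel filters, Hamming window, and the use of a plain finite difference for the deltas are all assumptions chosen for brevity (production front ends typically use a regression window for deltas and add steps such as pre-emphasis).

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fb[m - 1, k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):
            fb[m - 1, k] = (hi - k) / (hi - mid)
    return fb

def mfcc_features(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    frame_len, hop = sr * 25 // 1000, sr * 10 // 1000   # 25 ms window, 10 ms hop
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2   # (T, n_fft/2 + 1)
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # DCT-II basis: column k is cos(pi * k * (n + 0.5) / n_mels).
    n = np.arange(n_mels)[:, None]
    k = np.arange(n_ceps)[None, :]
    ceps = log_mel @ np.cos(np.pi * k * (n + 0.5) / n_mels)     # (T, 13)
    delta = np.gradient(ceps, axis=0)     # finite-difference deltas (simplification)
    delta2 = np.gradient(delta, axis=0)   # double-deltas
    return np.concatenate([ceps, delta, delta2], axis=1)        # (T, 39)

# One second of a 440 Hz tone as a stand-in for speech.
t = np.arange(16000) / 16000.0
signal = np.sin(2 * np.pi * 440.0 * t)
feats = mfcc_features(signal)
print(feats.shape)  # (98, 39)
```

Nothing in this function is trained: running it twice on the same signal gives identical features.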
The feature extraction stage — MFCCs — occupies the same role in the pipeline as neural audio encoders do today: take a variable-length waveform and produce a sequence of fixed-dimensional vectors, one per frame. The key difference: MFCCs are fixed mathematical transforms. The same mel filterbank and DCT applied to the same audio always give the same coefficients, with each coefficient a DCT basis coefficient over the log-mel spectrum. Whisper or HuBERT features are learned from data and opaque: two systems trained on the same data may produce completely different representations. Same output shape; one is a deterministic transform, the other is learned from data.
The HMM acoustic model had to deal with a problem that every audio-LLM still faces: the same word spoken fast and spoken slowly produces a different number of frames. "Cat" from a fast speaker might be 4 frames. "Caaaaat" from a slow speaker might be 12. Both should decode to the same output. This is the variable-length alignment problem.
The HMM's trick: self-loops and marginalization
The HMM solved this with a specific mechanism. Each phoneme was modeled as a short sequence of hidden states (typically 3: onset, middle, release), and each state had a self-loop — a probability of staying in the same state for the next frame. If the self-loop probability is 0.6, the expected state duration is 2.5 frames, but the model accommodates anything from 1 to many frames in that state.
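The 2.5-frame figure follows from the geometric duration distribution that a self-loop implies. Writing \(a\) for the self-loop probability (the symbol is introduced here for illustration), the probability of spending exactly \(d\) frames in a state and the resulting expectation are:

\[
P(d) = a^{\,d-1}(1-a), \qquad \mathbb{E}[d] = \sum_{d \ge 1} d\, a^{\,d-1}(1-a) = \frac{1}{1-a}, \qquad a = 0.6 \;\Rightarrow\; \mathbb{E}[d] = \frac{1}{0.4} = 2.5 \text{ frames}.
\]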
The forward algorithm computes the total probability of the observed frame sequence by summing over all valid state sequences (every possible alignment of frames to states). It does this in \(O(T \times S^2)\) where \(S\) is the number of states (typically small; and for a left-to-right topology the transition matrix is extremely sparse, so in practice the inner loop touches only 2–3 states per frame). The Viterbi algorithm finds the single best alignment. In both cases, duration variation is handled by the math, not by an architectural choice.
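A compact sketch of both recursions, in log space, for a 3-state left-to-right model. The topology (self-loop 0.6, advance 0.4, start in the first state, end in the last) and the random emission scores are illustrative assumptions; the dense \(S \times S\) update is used for clarity rather than exploiting the sparsity mentioned above.

```python
import numpy as np

def forward_and_viterbi(log_emit, log_trans, log_init):
    """log_emit: (T, S) frame log-likelihoods; log_trans: (S, S); log_init: (S,)."""
    T, S = log_emit.shape
    alpha = log_init + log_emit[0]      # forward scores: sum over alignments
    delta = alpha.copy()                # Viterbi scores: best single alignment
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        sum_scores = alpha[:, None] + log_trans   # indexed (from_state, to_state)
        max_scores = delta[:, None] + log_trans
        alpha = np.logaddexp.reduce(sum_scores, axis=0) + log_emit[t]
        back[t] = np.argmax(max_scores, axis=0)
        delta = np.max(max_scores, axis=0) + log_emit[t]
    # A valid alignment must end in the final state; backtrace from there.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return alpha[-1], delta[-1], path[::-1]

S, T, p_stay = 3, 6, 0.6
log_trans = np.full((S, S), -np.inf)
for s in range(S):
    log_trans[s, s] = np.log(p_stay)               # self-loop
    if s + 1 < S:
        log_trans[s, s + 1] = np.log(1.0 - p_stay)  # advance to next state
log_init = np.array([0.0, -np.inf, -np.inf])        # always start in state 0

rng = np.random.default_rng(0)
log_emit = np.log(rng.uniform(0.1, 1.0, size=(T, S)))  # stand-in acoustic scores

total_logp, viterbi_logp, path = forward_and_viterbi(log_emit, log_trans, log_init)
print(total_logp, viterbi_logp, path)
```

The forward score is never smaller than the Viterbi score, since the sum over alignments includes the best one. The returned path is non-decreasing in state index, which is the monotonicity the topology enforces.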
What the HMM brought that attention and CTC don't
Self-attention and CTC both handle variable-length input. What separates the three mechanisms is not capability but what structural prior each encodes:
The HMM has three things attention doesn't.
First, built-in monotonicity: the left-to-right topology means state s₂ can only follow s₁, encoding the prior that speech unfolds sequentially. Attention must learn this from data.
Second, an implicit duration model: self-loop probabilities induce a geometric distribution over how long each state lasts. The model never explicitly says "phoneme /k/ lasts 80ms," but the probability of staying in a state versus leaving it encodes that information. (This geometric prior was recognized even in the HMM era as a poor fit for actual phoneme durations, which are roughly log-normal; hidden semi-Markov models addressed it with explicit duration distributions, at higher cost.) Attention has no explicit duration model of this kind. The mapping of 50 frames to one linguistic unit is buried in the weights, learned implicitly from data rather than encoded structurally.
Third, principled marginalization: the forward algorithm sums over ALL valid paths to each state, integrating out the uncertainty about exactly when each phoneme occurred. Attention does not perform explicit global marginalization over monotone alignments. It learns one set of weights per head, finding weights that work, but alignment structure is implicit rather than structurally guaranteed.
CTC sits between HMMs and attention on this spectrum. It enforces monotonic alignment and marginalizes over alignments via its own forward algorithm with blanks and collapses, but drops the explicit duration model and, in its vanilla form, assumes conditional independence between output labels given the input: each label is chosen independently at the loss level, so vanilla CTC cannot model output label-to-label dependencies. The encoder (a BLSTM or Conformer) does enrich the input representations with frame context, so the independence assumption applies to labels given a richly encoded input — not raw frames — but output label dependencies remain unmodeled at the loss level. CTC inherits HMM alignment structure and grafts it onto neural features.
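The "blanks and collapses" mechanism can be made concrete with a small sketch. `ctc_collapse` applies the collapse rule to one frame-level path, and `ctc_forward` sums the probability of every path that collapses to a given label sequence. The blank index 0, the 3-symbol vocabulary, and the random posteriors are assumptions for illustration; the per-frame product inside the recursion is where the conditional independence assumption shows up.

```python
import numpy as np

BLANK = 0  # assumed blank index

def ctc_collapse(path):
    """Merge repeated labels, then drop blanks: [0,1,1,0,1,2,2,0] -> [1,1,2]."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out

def ctc_forward(probs, labels):
    """probs: (T, V) per-frame posteriors; labels: target sequence without blanks."""
    T = probs.shape[0]
    ext = [BLANK]
    for y in labels:
        ext += [y, BLANK]          # interleave blanks: b y1 b y2 b ...
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                      # stay on the same symbol
            if s >= 1:
                a += alpha[t - 1, s - 1]             # advance by one
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]             # skip the blank between distinct labels
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[-1, -1] + (alpha[-1, -2] if S > 1 else 0.0)

rng = np.random.default_rng(0)
probs = rng.random((5, 3))
probs /= probs.sum(axis=1, keepdims=True)   # 5 frames, vocabulary {blank, 1, 2}

print(ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
print(ctc_forward(probs, [1, 2]))
```

Like the HMM forward algorithm, this marginalizes over monotone alignments only. Unlike the HMM, there is no per-state duration parameter: any number of repeats is allowed at whatever cost the frame posteriors assign.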
CTC-guided adapters (Section 3) do something specific: they run CTC compression before the LLM ever sees the audio, normalizing the frame rate from ~50 frames/sec (typical for SSL encoders like HuBERT/wav2vec 2.0 at 20ms stride; EnCodec/SoundStream commonly operate at 50–75 fps; some newer codecs at 12.5–25 fps) down to roughly the text token rate (~4–8 tokens/sec). The LLM then operates in its native regime, having to learn much less of the rate mismatch because CTC already removed most of it. But CTC carries its own limitations: conditional independence between labels, dependence on a pre-trained CTC model, noisy segment boundaries, monotonic alignment only, and speech-specificity (it won't generalize to music or environmental audio). Most teams choose simpler adapters, betting that attention plus enough data can learn what CTC gives you structurally.
The Two-Fold Problem
A text LLM eats and emits discrete tokens from a fixed vocabulary of ~32K–100K symbols. Each token is a discrete symbolic unit with a dedicated embedding: "cat" maps to token ID 3857, which maps to a row in the embedding table, which the transformer processes. You can read the token and know what it represents. Audio has neither property. The entire field of audio-language models is an answer to two entangled problems.
Problem 1: Content is implicit
When someone says "the cat sat on the mat," the waveform that reaches the microphone contains not just the words but the speaker's identity (male, female, child), their emotional state (calm, angry, sarcastic), the room acoustics (reverb, echo), background noise (traffic, music), prosody (which words are stressed, the intonation contour), and breathing patterns. All of these are superimposed in the same signal. The linguistic content — the actual words — is implicit, buried under everything else.
Text tokens carry none of this. The token "cat" doesn't encode who typed it, what font they used, or what room they were in. The entire audio encoder (Whisper, HuBERT, WavLM, or a neural codec) exists because audio carries all of it: the encoder must separate or at least organize these superimposed signals into a representation the LLM can use.
Problem 2: Tokenization must be invented
Text arrives at a language model already symbolized. Writing systems — alphabets, syllabaries, logographs — are pre-existing symbolic interfaces that humans developed over millennia. BPE tokenization isn't discovering some natural ontology of language; it's compressing a stream of symbols that were already symbols. The tokenizer approximates a pre-existing structure. You can read a BPE token because the structure it represents was already readable before the tokenizer ran.
Audio has no such pre-existing symbolic interface. No writing system encodes waveforms at the level of acoustic detail a model needs. When you apply k-means clustering or residual vector quantization to audio features, you're not approximating something that was already there. You're inventing a symbolic structure from scratch, or choosing to avoid one entirely. What does EnCodec codebook entry 4,371 "mean"? Nothing stable. It's a point in a learned latent space that reconstructs a particular acoustic pattern. The token doesn't contain content the way text tokens do; it's a compression coordinate that was never anything else first. There was no pre-linguistic layer to approximate.
The rate mismatch compounds everything. "The cat sat on the mat" is 6 text tokens. The same sentence spoken at a single-codebook semantic level produces 50–100 audio tokens depending on the encoder and speaking rate, already roughly 8–17× the text count. With a multi-codebook codec (the norm for high-quality audio), multiply by the number of levels: 8 codebooks × 75 fps for 2 seconds gives ~1,200 tokens for the same utterance. The LLM's attention must handle a structure that audio demands and text doesn't.
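The arithmetic is simple enough to check directly. The helper below is a hypothetical convenience function, and the 2-second utterance duration and the 25 and 50 fps single-codebook rates are assumptions chosen to match the 50–100 token range quoted above.

```python
def audio_tokens(duration_s, frame_rate_hz, n_codebooks=1):
    """Tokens needed for an utterance: duration x frame rate x codebook levels."""
    return round(duration_s * frame_rate_hz * n_codebooks)

text_tokens = 6        # "The cat sat on the mat"
duration_s = 2.0       # assumed spoken duration

configs = [
    ("single codebook, 25 fps", 25, 1),
    ("single codebook, 50 fps", 50, 1),
    ("8 codebooks, 75 fps", 75, 8),
]
for name, fps, n_cb in configs:
    n = audio_tokens(duration_s, fps, n_cb)
    print(f"{name}: {n} tokens, {n / text_tokens:.1f}x text")
```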
The discrete tokenization landscape
| Method | Training Objective | Quantizer | Codebooks | Optimized For | Per-Frame Output |
|---|---|---|---|---|---|
| Text BPE (reference) | N/A (statistical compression) | Byte-pair merges | 1 flat | Compress text corpus | "cat" ← you can read it |
| HuBERT (2021) | Masked prediction | k-means on learned features | 1 flat | Linguistic / phonetic structure | 412 ← single cluster ID |
| wav2vec 2.0 (2020) | Contrastive | Gumbel-softmax product quantization | 2 groups × 320 entries each | Discriminative features | G₁: 58 × G₂: 214 |
| SoundStream (Google, 2021) | Reconstruction | Residual VQ (progressive residuals) | Variable Nq via structured dropout | Faithful audio reconstruction (16 kHz) | CB₁: 312, CB₂: 54, CB₃: 91, … |
| EnCodec (Meta AI, 2022) | Reconstruction | Residual VQ (progressive residuals) | 2–32 hierarchical | Faithful audio reconstruction (24 kHz) | CB₁: 847, CB₂: 92, CB₃: 6, … |
| SpeechTokenizer (2023) | Reconstruction + semantic distillation | RVQ + HuBERT loss on CB₁ | 8 hierarchical | Content/acoustic separation | CB₁: 503, CB₂: 71, … |
Two patterns stand out in this table. First, every audio method produces opaque tokens: integers that index into a learned codebook. You cannot read an audio token and know what it means. Second, the training objective shapes what the tokens capture: HuBERT (masked prediction) and wav2vec 2.0 (contrastive) produce tokens more concentrated on linguistic structure, because that's what their objectives reward. SoundStream and EnCodec (both reconstruction-trained) faithfully preserve the entire signal, including speaker identity, room reverb, and breathing that the LLM doesn't need.
- Waveform in → integer indices from a learned codebook out. No "audio alphabet" exists.
- Tokens are opaque — you can't inspect them and read their meaning.
- Substantially more tokens/sec than text. Single-codebook semantic systems run at 25–75 frames/sec (~6–18× text rate). Multi-codebook codecs multiply by N levels: 8 codebooks × 75 fps = 600 tokens/sec vs. ~4 text tokens/sec.
- The codebook is always finite (256–16,384 entries) but entangled — what each entry "means" isn't decomposable.
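To make the "progressive residuals" idea from the table concrete, here is a minimal residual vector quantization sketch. The codebooks are random rather than learned, and the sizes (3 levels, 16 entries, 8 dimensions) are illustrative assumptions; a trained codec would learn the codebooks jointly with its encoder and decoder.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """x: (T, D) frame features; codebooks: list of (K, D) arrays, one per level."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        # Squared distance from each frame's residual to each codebook entry: (T, K).
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = d2.argmin(axis=1)
        indices.append(idx)
        residual = residual - cb[idx]      # the next level quantizes what is left over
    return np.stack(indices, axis=1), residual   # (T, n_levels) integers

def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the selected entry from every level."""
    return sum(cb[indices[:, level]] for level, cb in enumerate(codebooks))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                                     # 4 frames, 8 dims
codebooks = [rng.normal(size=(16, 8)) * 0.5 ** level for level in range(3)]

codes, residual = rvq_encode(x, codebooks)
recon = rvq_decode(codes, codebooks)
print(codes)          # each row is one frame: [CB1 index, CB2 index, CB3 index]
```

Each row of `codes` is the per-frame output shown in the table's last column: a tuple of integers whose only meaning is "which entry was nearest at each level".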
◆ ◆ ◆
Section 3: Discrete vs. Continuous: Where the Fork Actually Happens
Most audio encoders — Whisper, HuBERT, WavLM, wav2vec 2.0 — produce continuous feature vectors. The discrete/continuous split is not about which encoder you use; it is about what happens after the encoder: do you quantize those features into codebook indices, or project them directly into the LLM's embedding space?
What "discrete" and "continuous" actually mean
This distinction is subtler than it first appears. Discrete means selecting an index from a finite set of learned centroids: the input feature gets "snapped" to the nearest codebook entry. Information is lost at the snap. Training is also complicated: argmin is not differentiable, so gradients cannot flow back through the selection step directly. The common workaround (the straight-through estimator) simply pretends the gradient of the discrete selection equals that of the continuous input; good enough in practice, but an approximation. Continuous means multiplying the feature vector by a learned weight matrix W (written W×) to project it into the LLM's embedding space. No quantization, no information loss, and the whole operation is fully differentiable, so standard backprop works end-to-end.
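The snap and the straight-through workaround can be written out by hand. This sketch uses plain NumPy with manually written backward functions (no autograd framework), and the function names, codebook size, and dimensions are assumptions for illustration.

```python
import numpy as np

def quantize(x, codebook):
    """Forward pass: snap each row of x to its nearest codebook entry."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)   # (T, K)
    idx = d2.argmin(axis=1)            # argmin: not differentiable
    return codebook[idx], idx

def quantize_backward_ste(grad_wrt_quantized):
    """Straight-through estimator: pretend d(quantized)/d(x) is the identity."""
    return grad_wrt_quantized.copy()

def project(x, W):
    """Continuous path: a linear map, no snapping."""
    return x @ W

def project_backward(grad_wrt_out, W):
    """Exact gradient of the projection with respect to x."""
    return grad_wrt_out @ W.T

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
codebook = rng.normal(size=(16, 8))
W = rng.normal(size=(8, 12))

q, idx = quantize(x, codebook)
snap_error = np.linalg.norm(x - q, axis=1)     # what the snap discarded, per frame
grad_x_discrete = quantize_backward_ste(np.ones_like(q))
grad_x_continuous = project_backward(np.ones((5, 12)), W)
print(idx, snap_error.round(2))
```

The per-frame `snap_error` is the information lost at the snap; the continuous path has no corresponding term.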
The subtle case: even in the discrete path, some systems use the codebook vector (not just the index) downstream. The values are continuous, but constrained to a finite set of learned points. The distinction that matters: is the representation space unconstrained (any point in \(\mathbb{R}^D\)) or restricted to \(K\) learned centroids?
W projects the feature to any point in \(\mathbb{R}^D\) — no quantization, no information loss, fully differentiable end-to-end. The discrete path produces a shorter description (one integer); the continuous path preserves more signal.
What actually enters the transformer
Even for text, tokens are immediately mapped to continuous vectors; the real interface to the transformer is always a sequence of vectors in \(\mathbb{R}^D\). A discrete token is just a lookup key. The model never sees the integer. Every path has one thing in common: what enters the transformer is always a vector. The difference is how that vector was obtained and whether a discrete bottleneck sits in the middle. Tracing each path end-to-end makes this concrete:
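A minimal trace of the three paths, assuming toy sizes and random tables (every name and dimension here is hypothetical): text IDs go through an embedding lookup, discrete audio IDs go through their own lookup, and continuous encoder features go through a projection. All three end as rows of the same width.

```python
import numpy as np

rng = np.random.default_rng(0)
D_llm, D_enc = 16, 8

text_embed = rng.normal(size=(100, D_llm))   # text embedding table (toy vocab of 100)
audio_embed = rng.normal(size=(32, D_llm))   # embedding table for audio codebook IDs
W = rng.normal(size=(D_enc, D_llm))          # continuous adapter projection

text_ids = np.array([3, 57, 9])              # text path: integers are lookup keys
audio_ids = np.array([4, 4, 17, 30, 2])      # discrete audio path: also lookup keys
enc_feats = rng.normal(size=(5, D_enc))      # continuous audio path: encoder frames

text_vecs = text_embed[text_ids]             # (3, D_llm)
disc_vecs = audio_embed[audio_ids]           # (5, D_llm), limited to 32 possible rows
cont_vecs = enc_feats @ W                    # (5, D_llm), any point in R^D

# What the transformer receives is one sequence of vectors, whatever the path.
llm_input = np.concatenate([cont_vecs, text_vecs], axis=0)
print(text_vecs.shape, disc_vecs.shape, cont_vecs.shape, llm_input.shape)
```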
From spectrograms to encoder features
Encoder features are the latest step in a lineage, not a new kind of object. A mel spectrogram is a matrix: mel frequency bins × time frames. The encoder output has the same shape: learned feature dimensions × time frames. It is a virtual spectrogram in a learned feature space — one whose dimensions are trained to surface structure the downstream task needs. The adapter then projects this into a virtual spectrogram in LLM embedding space. Same matrix shape at every stage, different coordinate system.
The LLM and the encoder operate in different geometric spaces. The LLM expects \(D_{\text{llm}}\)-dimensional vectors trained on a particular distributional geometry; the encoder produces features in a geometry learned from audio. The adapter maps between them, with the encoder's training objective determining what structure the features carry into that translation.
The assumption is that a pretrained audio encoder has already extracted the right structure: the information the LLM needs is accessible in the encoder's feature space, even if not in the right geometry yet. This is often true, but it breaks when the encoder's pretraining task diverges from what the LLM task needs. An encoder trained for reconstruction (EnCodec) faithfully preserves reverb and breathing; an encoder trained for masked prediction (HuBERT) concentrates on phonetic structure. Neither is universally right. The adapter can partially compensate, but it can't recover information the encoder discarded. What gets thrown away at the encoder stage is gone.
The adapter design space
After the encoder runs, you have a matrix of shape \(T \times D_{\text{enc}}\): \(T\) time frames, each represented as a \(D_{\text{enc}}\)-dimensional vector. The problem is that this is the wrong shape in two ways. The LLM expects \(D_{\text{llm}}\)-dimensional vectors (typically 4096 for a 7B model, not the encoder's 512 or 1024). And \(T\) is too large: 10 seconds of audio at 50 frames/sec gives 500 encoder frames, far more than the ~40 text tokens a person would use to transcribe the same utterance.
The adapter is the learned module that solves both problems: it maps from encoder space (\(D_{\text{enc}}\)) to LLM embedding space (\(D_{\text{llm}}\)), and usually compresses the time axis from \(T\) frames to \(T'\) tokens. The encoder and LLM are typically pretrained separately and frozen; the adapter is what gets trained to bridge them. How aggressively it compresses, and whether that compression is uniform or content-sensitive, defines the four main families below.
Linear / MLP applies a per-frame projection: each frame independently mapped to LLM dimension. No temporal compression: \(T' = T\). The LLM gets the full 500-frame sequence for 10 seconds of audio. This solves the dimensionality mismatch (\(D_{\text{enc}} \to D_{\text{llm}}\)) without touching the temporal mismatch. The assumption is that the LLM's attention can handle a 500-token audio prefix without degradation, which is increasingly true as context windows grow, but comes at a real compute cost. What it throws away: nothing at the adapter stage. What it defers: the temporal alignment problem, which the LLM must now learn to handle implicitly. It breaks in memory-constrained settings or when audio is long enough to crowd out the text context. Ultravox uses this approach: a two-layer MLP on Whisper encoder frames, relying on growing context windows to absorb the cost.
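A shape-level sketch of a per-frame MLP adapter. The dimensions are shrunk for readability (a real setup would use something like 1024 → 4096), and the ReLU activation, hidden width, and random weights are assumptions; this is not Ultravox's actual code.

```python
import numpy as np

def mlp_adapter(frames, W1, b1, W2, b2):
    """Map every frame independently from D_enc to D_llm; T is unchanged."""
    hidden = np.maximum(frames @ W1 + b1, 0.0)   # ReLU (activation is an assumption)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
T, D_enc, D_hidden, D_llm = 500, 64, 128, 256    # toy sizes
frames = rng.normal(size=(T, D_enc))             # 10 s of audio at 50 frames/sec
W1, b1 = rng.normal(size=(D_enc, D_hidden)) * 0.1, np.zeros(D_hidden)
W2, b2 = rng.normal(size=(D_hidden, D_llm)) * 0.1, np.zeros(D_llm)

audio_prefix = mlp_adapter(frames, W1, b1, W2, b2)
print(audio_prefix.shape)  # (500, 256)
```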
Strided / Pooled groups k frames and averages (or strides) them into one embedding. \(T' = T/k\). At k=5, 500 frames become 100. This solves both mismatches: dimensionality and sequence length. The assumption is that temporal compression is uniform: every k-frame window has equal informational density. That assumption is wrong for speech: silence compresses fine; dense consonant clusters lose detail. It breaks on fast speech, accented speech, or any audio with rapid acoustic change, where exactly the frames that need the most resolution get pooled together. Qwen-Audio uses adjacent-frame stacking / stride-style compression before projection: simple architecture, manageable sequence length, fixed-rate resolution cost.
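Two common fixed-rate variants, sketched under the assumption that leftover frames at the end are simply dropped: averaging each window of k frames, and stacking the k frames into one wider vector (which a projection would then map to the LLM dimension).

```python
import numpy as np

def pool_frames(frames, k):
    """Average every k consecutive frames: (T, D) -> (T // k, D)."""
    T, D = frames.shape
    usable = (T // k) * k                       # drop the ragged tail (assumption)
    return frames[:usable].reshape(T // k, k, D).mean(axis=1)

def stack_frames(frames, k):
    """Concatenate every k consecutive frames: (T, D) -> (T // k, k * D)."""
    T, D = frames.shape
    usable = (T // k) * k
    return frames[:usable].reshape(T // k, k * D)

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 64))
pooled = pool_frames(frames, 5)
stacked = stack_frames(frames, 5)
print(pooled.shape, stacked.shape)  # (100, 64) (100, 320)
```

Neither function looks at the content of a window, which is the uniformity assumption criticized above: a window of silence and a window containing a consonant cluster receive the same budget.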
Q-Former / Perceiver uses a small set of learned query vectors that cross-attend to all encoder frames. The number of queries determines the output length: \(T' = Q\) (e.g. 32). This solves the sequence-length problem most aggressively — 500 frames → 32 tokens — while letting attention learn what to keep. The assumption is that 32 learned queries are sufficient to capture everything the LLM task needs; that the information bottleneck forces useful compression rather than information loss. It breaks when the task requires fine-grained temporal detail (speaker diarization, precise word timing) that a 32-slot bottleneck discards. The queries are trained, not hardcoded, so what survives depends heavily on the downstream loss. SALMONN uses 32 query tokens per window, borrowing the cross-attention mechanism from BLIP-2 (Li et al., 2023). Flamingo used a different approach — a Perceiver Resampler with gated cross-attention, not Q-Former.
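The core of this family is cross-attention from a fixed set of learned queries. The sketch below is a single-head, single-layer simplification with random weights; an actual Q-Former or Perceiver Resampler adds multiple layers, multiple heads, and feed-forward blocks, so treat this only as an illustration of why the output length equals the number of queries.

```python
import numpy as np

def query_resampler(frames, queries, Wk, Wv):
    """Cross-attend Q learned queries over T frames: (T, D_enc) -> (Q, d)."""
    keys = frames @ Wk                                   # (T, d)
    values = frames @ Wv                                 # (T, d)
    scores = queries @ keys.T / np.sqrt(queries.shape[1])  # (Q, T)
    scores = scores - scores.max(axis=1, keepdims=True)    # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values                              # (Q, d)

rng = np.random.default_rng(0)
D_enc, d, Q = 64, 48, 32
queries = rng.normal(size=(Q, d))        # would be trained; random here
Wk = rng.normal(size=(D_enc, d)) * 0.1
Wv = rng.normal(size=(D_enc, d)) * 0.1

short_clip = rng.normal(size=(50, D_enc))     # 1 s at 50 frames/sec
long_clip = rng.normal(size=(1500, D_enc))    # 30 s at 50 frames/sec
out_short = query_resampler(short_clip, queries, Wk, Wv)
out_long = query_resampler(long_clip, queries, Wk, Wv)
print(out_short.shape, out_long.shape)  # (32, 48) (32, 48)
```

Each output row is a weighted average over all frames, so frame order and exact timing survive only to the extent the trained queries learn to encode them.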
CTC-guided uses a pre-trained CTC model to identify which frames correspond to the same linguistic unit, then merges them. Content-adaptive: silence compresses heavily, dense speech less. This is the mechanism closest to the HMM's approach from Section 1: it removes duration variability before the LLM, normalizing audio to roughly the text token rate. The assumption is that the CTC model's segmentation is reliable enough to guide compression, and that the right granularity for merging is the CTC boundary, not a fixed window. It breaks when CTC alignment is wrong (noisy audio, accents, out-of-vocabulary words), only works for speech, and the merge strategy (average? last frame? attention-weighted?) is itself an unresolved choice that can further degrade quality when CTC makes errors. Despite being conceptually appealing — it is the approach most directly analogous to what classical ASR did explicitly — it has not become a standard building block in deployed audio-LLMs. There is no single clean reason: the dependency on a separate CTC model, the speech-only constraint, and the unresolved merge strategy each add friction, and simpler adapters turn out to work well enough that the added complexity is hard to justify.
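One possible version of the merge step, assuming frame-level CTC argmax labels are already available. The choices made here (drop blank frames, average each run of identical labels) are just one of the open options named above, and the label sequence is made up for illustration.

```python
import numpy as np

def ctc_merge(frames, frame_labels, blank=0):
    """Average each run of consecutive frames that share a non-blank CTC label."""
    segments, start = [], 0
    T = len(frame_labels)
    for t in range(1, T + 1):
        if t == T or frame_labels[t] != frame_labels[start]:
            if frame_labels[start] != blank:
                segments.append(frames[start:t].mean(axis=0))  # merge strategy: mean
            start = t
    if not segments:
        return np.zeros((0, frames.shape[1]))
    return np.stack(segments)

rng = np.random.default_rng(0)
frames = rng.normal(size=(12, 8))
# Hypothetical per-frame argmax labels from a CTC model (0 = blank).
frame_labels = [0, 0, 3, 3, 3, 0, 5, 5, 0, 0, 7, 0]

merged = ctc_merge(frames, frame_labels)
print(merged.shape)  # (3, 8)
```

The output length depends on the content (three labeled runs here), not on the clip duration. It also depends entirely on the labels being right: a wrong argmax changes which frames get averaged together.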
- Linear / MLP. Tradeoff: preserves all detail, LLM gets long sequence. Used by: Ultravox, LLaSA.
- Strided / Pooled. Tradeoff: shorter sequence, uniform (silence compressed equally). Used by: Qwen-Audio, WavLLM, SpeechVerse (1D conv).
- Q-Former / Perceiver. Tradeoff: aggressive compression, attention learns what to keep. Used by: SALMONN, Pengi.
How Audio Comes Out
Getting audio into an LLM and getting it back out are not symmetric problems. For many language-understanding tasks, you can afford to be lossy on input: a continuous projection that drops some acoustic detail still lets the LLM understand the words. On output, the LLM must produce something that decodes all the way back to a waveform, which demands much more specificity. Voice agents, speech synthesis, and audio continuation all require it.
Generation approaches split along a second axis that is worth naming explicitly. Semantic tokens are low-rate representations — HuBERT discrete units, or (with the RVQ caveats from Section 2) loosely the lower codebook levels — that capture linguistic and prosodic content with enough compression to fit inside an LLM context window. Acoustic tokens are the higher-rate levels that carry the fine detail required for waveform reconstruction: CB₂–CBN at full codec rate, or multi-level codec sequences. The two differ not just in information content but in volume — semantic tokens run at a manageable rate; acoustic tokens at full codec rate add hundreds of tokens per second of audio. Most practical generation architectures try to decouple the two: let the LLM handle the semantic pass autoregressively, and use faster specialized machinery for acoustic fill-in.
Three architectures for audio output
Cascade (LLM → text → TTS) is the simplest approach and still the most common in production. The text bottleneck is the core problem: prosody, emotion, speaker identity, and timing don't survive it. The TTS must reinvent them from context, which it does imperfectly. Cascade also has a latency floor: even with streaming TTS, the LLM must commit to words before any audio can start (~300–500ms time-to-first-audio).
Hybrid (LLM → CB₁ → acoustic model) routes around the text bottleneck by having the LLM generate only the first codebook level. CB₁ tokens are audio-derived, so paralinguistic information that text discards — speaker identity, prosodic contour — can survive in them. A separate, faster model fills in CB₂–CBN in parallel. The cost: CB₁ is not a clean semantic representation (the RVQ factorization misconception from Section 2 applies here), and the fill-in model has no access to the LLM's internal state. Latency improves over cascade on the acoustic side, but the autoregressive CB₁ pass still dominates time-to-first-audio.
Native generation extends the LLM's vocabulary to include audio token IDs directly — text and audio tokens appear in the same sequence, with no representational boundary. The model can learn to control prosody and voice the same way it learned to control rhetoric. This is also where the mismatch reappears in reverse: the LLM must generate not a sequence but a matrix (N codebook levels × T frames), and its native next-token regime does not map cleanly onto that structure. Latency is in principle lowest — audio tokens stream directly, ceiling at codec decode — but only with efficient generation strategies that avoid a full CB₁ pass before CB₂ starts.
The three architectures also differ in what feeds back during autoregressive generation. In cascade systems only text tokens are in the AR loop: audio is produced by a separate TTS step after the LLM finishes. In native and hybrid systems, audio token indices feed back as input at each step — the LLM attends to its own audio output when predicting the next token. This distinction matters for coherence: a native model can condition future tokens on the acoustic details of what it already generated; a cascade model cannot.
The structural bottleneck for hybrid and native is the multi-codebook generation problem. At 25 fps with 4 codebook levels, one second of audio is 100 tokens: 25× the text token rate for equivalent content. Three strategies address it: hierarchical (AR for CB₁, then a parallel pass fills CB₂–CBN — VALL-E style), delay patterns (codebooks staggered by one frame each, so a single AR pass generates all levels simultaneously — MusicGen), and masked parallel decoding (all tokens predicted in parallel via iterative unmasking in K passes — SoundStorm). Each is a different answer to the same question: how do you serialize a time×codebook matrix into a sequence?
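The delay-pattern option is the easiest of the three to show in code. In this sketch codebook level k is shifted right by k frames, so column t of the delayed matrix holds level 0 of frame t, level 1 of frame t-1, and so on. The `PAD` value and the toy code matrix are assumptions, and a real model would predict the N entries of each column jointly at one autoregressive step.

```python
import numpy as np

PAD = -1  # placeholder for positions that have no code yet (assumed value)

def apply_delay(codes):
    """codes: (N, T) codebook indices -> (N, T + N - 1), level k shifted right by k."""
    N, T = codes.shape
    out = np.full((N, T + N - 1), PAD, dtype=codes.dtype)
    for k in range(N):
        out[k, k:k + T] = codes[k]
    return out

def undo_delay(delayed, T):
    """Invert apply_delay to recover the aligned (N, T) matrix for the codec decoder."""
    N = delayed.shape[0]
    return np.stack([delayed[k, k:k + T] for k in range(N)])

N, T = 4, 25                                   # 4 codebooks, 1 s at 25 fps
codes = np.arange(N * T).reshape(N, T)         # stand-in for real codec indices
delayed = apply_delay(codes)

print(delayed.shape)                           # (4, 28)
print("flattened AR steps:", N * T)            # 100
print("delay-pattern AR steps:", T + N - 1)    # 28
```

Flattening the matrix costs N × T sequential steps. The delay pattern costs T + N - 1, at the price of predicting each level without seeing the same frame's higher levels.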
Where These Patterns Appear
The input paths and output architectures in this post are not taxonomic abstractions. Four archetypes cover the design space well; each makes a specific bet about which mismatch matters most to solve explicitly.
Ultravox. Whisper encoder features flow through a 2-layer MLP directly into Mistral's embedding space, with no temporal compression. The full ~500-frame audio sequence enters the LLM. This relies on growing context windows and enough training to absorb the rate mismatch without explicit compression machinery. What it loses: the LLM must learn the 10–50× rate mismatch from scratch. Long audio pushes context limits. Output voice is entirely the TTS system's: the LLM never touches audio on the output side.
SALMONN. A dual-encoder input (Whisper + BEATs) feeds 32 learned query tokens that cross-attend to the full encoder sequence via Q-Former (from BLIP-2). The 32-token bottleneck is fixed regardless of audio duration: a 1-second clip and a 30-second clip both produce 32 tokens. What it loses: fine-grained temporal detail. Tasks requiring precise word timing, speaker diarization, or sub-second event detection are challenging when the bottleneck discards the frame-level sequence. A 32-slot fixed budget can't represent structure it didn't learn to preserve.
VALL-E. This is a TTS model, not an audio understanding system. Its primary input is text, plus a 3-second audio prompt for speaker conditioning. It represents the discrete codec output path: the LLM autoregressively generates EnCodec CB₁ tokens conditioned on text and speaker prompt. A separate non-autoregressive model then fills CB₂–CBN in parallel. Voice cloning is possible because CB₁ captures enough speaker identity to condition the acoustic fill-in. What it loses: the LLM generates CB₁ at full codec rate (~75 fps × 5 s = 375 tokens), so latency is dominated by the AR pass. CB₁ is not a clean semantic representation: the RVQ residual factorization doesn't cleanly separate content from timbre.
GPT-4o. Architecture unpublished. The evidence: end-to-end response latency as low as 232ms (average ~320ms per the GPT-4o System Card; this is time-to-first-audio-token, not full-response latency), demonstrated ability to mirror emotion, accent, and paralinguistic cues, and the ability to interrupt and be interrupted mid-generation. These are inconsistent with a cascade architecture where a text bottleneck necessarily discards paralinguistic information before TTS can regenerate it. What remains unknown: the tokenization scheme, codec frame rate, training data scale, and whether audio input and output share the same token space.
Full system table: eight systems mapped to input and output patterns
| System | Input | Output | Key choice |
|---|---|---|---|
| Ultravox (Fixie AI, open) | Continuous: Whisper encoder + 2-layer MLP | Text output (separate TTS for audio) | Linear adapter: no temporal compression. Relies on growing context windows to absorb the full 500-frame audio sequence. |
| Qwen-Audio (Alibaba, open) | Continuous: strided/pooled adapter | Text output (no audio generation path) | Adjacent-frame stacking before projection; manageable sequence length at fixed compression ratio. |
| SALMONN (Tsinghua / ByteDance, open) | Continuous: Q-Former (32 learned queries) | Text output (separate TTS for audio) | 32-query bottleneck collapses the full encoder sequence regardless of audio length. Q-Former mechanism from BLIP-2; Flamingo used a Perceiver Resampler instead. |
| SpeechVerse (Amazon) | Continuous: 1D conv adapter (strided) | Cascade | 1D convolutional adapter with LoRA fine-tuning of the LLM. Fixed-rate compression; no CTC-based segmentation. |
| VALL-E (Microsoft) | Text + 3s audio prompt (CB₁ speaker conditioning) | Hybrid: LLM generates CB₁ AR, separate non-AR model fills CB₂–CBN | Hierarchical multi-codebook generation. Voice cloning from 3 seconds of reference audio. |
| MusicGen (Meta, open; generation only) | Text conditioning | Native: single AR pass with delay patterns across codebooks | Delay-pattern serialization: one AR pass generates all codebook levels with time offsets. No separate non-AR stage. |
| SoundStorm (Google DeepMind; generation only) | Semantic token conditioning | Native: masked parallel decoding | K iterative unmasking passes instead of \(T \times N\) sequential steps. Non-autoregressive; fast inference. |
| GPT-4o (OpenAI, closed) | Inferred: native or continuous | Inferred: native | Architecture unpublished. Response latency (as low as 232ms, average ~320ms per the GPT-4o System Card) and demonstrated paralinguistic preservation suggest native or near-native integration. |
MusicGen and SoundStorm are generation-only systems; their input column reflects conditioning input, not an audio comprehension path. GPT-4o latency from the GPT-4o System Card (OpenAI, 2024).
What Remains Open
Audio poses two fundamental problems for LLMs: content is implicit (buried under speaker identity, acoustics, prosody), and tokenization must be invented from scratch. Audio has no pre-existing symbolic interface to approximate, only a continuous signal that must be given structure by the system itself. The classical HMM pipeline solved the variable-length alignment problem with structure, not scale. Modern systems replaced that structure with learned expressiveness, gaining vastly better results but losing the principled alignment mechanism.
Three questions remain genuinely open:
Can a single unified architecture handle both directions, with explicit structure rather than scale alone, the way HMMs handled alignment? SpiritLM (interleaved text/HuBERT tokens) and Moshi (Kyutai, 2024) are published steps toward this: SpiritLM demonstrates cross-modal continuation in a single vocabulary; Moshi demonstrates real-time full-duplex speech-to-speech with an inner monologue stream. Fidelity, latency, and voice control are all under active work in both industry and academic labs.
Will the field converge on discrete codecs, continuous latent generation, or hybrids that blur the boundary? Recent work on flow matching and diffusion over token spaces suggests the hard discrete/continuous line is already softening. What counts as a "token" may become a more flexible concept than current codec-based systems imply.
Can we close the rate mismatch? Ultra-low frame rate codecs (5–12.5 Hz) are pushing audio token rates closer to text. If audio can be represented at near-text rates without quality loss, the entire adapter problem simplifies dramatically, and the LLM sees much less of the many-to-one frame-to-phoneme mapping.
Further Reading
Key papers underlying the ideas in this post, grouped by theme.
- HMM: Rabiner, L. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proc. IEEE, 1989. The HMM reference.
- CTC: Graves et al. "Connectionist Temporal Classification." ICML, 2006.
- Attn: Bahdanau et al. "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR, 2015.
- Whisper: Radford et al. "Robust Speech Recognition via Large-Scale Weak Supervision." ICML, 2023.
- HuBERT: Hsu et al. "HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units." IEEE/ACM TASLP, 2021.
- w2v2: Baevski et al. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." NeurIPS, 2020.
- EnCodec: Défossez et al. "High Fidelity Neural Audio Compression." TMLR, 2023.
- DAC: Kumar et al. "High-Fidelity Audio Compression with Improved RVQGAN." NeurIPS, 2023.
- Flamingo: Alayrac et al. "Flamingo: a Visual Language Model for Few-Shot Learning." NeurIPS, 2022. Perceiver Resampler with gated cross-attention (distinct from Q-Former).
- BLIP-2: Li et al. "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models." ICML, 2023. Q-Former origin.
- SALMONN: Tang et al. "SALMONN: Towards Generic Hearing Abilities for Large Language Models." ICLR, 2024.
- SpiritLM: Nguyen et al. "SpiritLM: Interleaved Spoken and Written Language Model." arXiv, 2024. Interleaved text + HuBERT speech token LM; published native audio architecture.
- Moshi: Défossez et al. "Moshi: a speech-text foundation model for real-time dialogue." Kyutai, 2024. Full-duplex speech-to-speech at 12.5 Hz with inner monologue stream.
- WavLLM: Hu et al. "WavLLM: Towards Robust and Adaptive Speech Large Language Model." EMNLP Findings, 2024.
- Qwen-Audio: Chu et al. "Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models." arXiv, 2023.
- VALL-E: Wang et al. "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers." arXiv, 2023. Hierarchical multi-codebook generation.
- MusicGen: Copet et al. "Simple and Controllable Music Generation." NeurIPS, 2023. Delay pattern interleaving.
- SoundStorm: Borsos et al. "SoundStorm: Efficient Parallel Audio Generation." arXiv, 2023. Masked parallel decoding.
- InstructTTS: Yang et al. "InstructTTS: Modelling Expressive TTS in Discrete Latent Space With Natural Language Style Prompt." IEEE/ACM Trans. Audio Speech Lang. Process. 32: 2913–2925, 2024.