How Language Models Handle Audio

April 2026

Text arrives at a language model already segmented into discrete symbols — a few tokens per second, each with a stable vocabulary index. Speech is different. Speech arrives as a dense continuous signal, with no token boundaries, no unit corresponding to a wordpiece, and 10–50× more frames per second than text tokens. An audio-language model is therefore not just "an LLM with a microphone." It is a system for reconciling two incompatible computational paradigms. Classical speech recognition handled this with explicit alignment models — HMMs that mapped variable-duration audio frames to discrete linguistic units, absorbing the mismatch structurally. Modern audio-language models have had to develop their own approaches to the same problem. The alignment is learned now, not engineered, but it still has to be solved.

This post is organized around four questions worth asking of any audio-language model: What is its audio representation? How does it reduce the frame rate? Where does audio enter the LLM? And how, if at all, does audio come back out?

Contents

Prologue — The Four Mismatches — the incompatibilities every audio-LLM must resolve
§1 — What Classical ASR Made Explicit — HMMs, self-loops, the variable-length alignment problem
§2 — The Two-Fold Problem — content is implicit, tokenization must be invented
§3 — Discrete vs. Continuous: Where the Fork Actually Happens — adapters, what the LLM actually receives
§4 — How Audio Comes Out — the reverse mismatch: cascade, hybrid, native, multi-codebook generation
In Practice — Where These Patterns Appear — four archetypes mapped to the four mismatches
Conclusion — What Remains Open
Further Reading

This post focuses on speech and speech-adjacent audio. Non-speech audio — music, environmental sound — shares some of the same representational challenges but has a different alignment structure and is not the primary focus here.

The Four Mismatches

Introduces → ① Representation ② Rate ③ Alignment ④ Output

Most audio-LLM designs are solving one of three visible problems: what vectors should represent the sound, how many of them the LLM must process, and whether the output should be text or sound again. Alignment is the latent problem tying the first two together: how variable-duration acoustic structure maps to discrete linguistic units. Less visible once explicit alignment machinery is removed, but still there.

Together these are the four mismatches between the LLM's expectations and what audio actually is. Most design choices in this post are ways of resolving one of them.

The four mismatches — and where in the pipeline each is resolved

The four mismatches. Left: the signal path from raw waveform to model output. Right: the mismatch each stage must resolve. Every architectural decision in this post lives in one of these four boxes. The input side (top three) is covered first; the output mismatch last.

The easiest way to see the problem is to look at the same utterance under different compression strategies.

Interactive: rate mismatch — "the cat sat on the mat" (5 seconds, simulated)

Rate mismatch, made concrete. Each row shows the same 5-second utterance under a different compression strategy. Text (7 tokens) and raw audio (250 frames) differ by about 35×. Pooling preserves time-alignment but blurs within-cell content. CTC compression is content-adaptive: silence collapses to one blank regardless of duration, and coarse speech boundaries are preserved under a monotonic alignment assumption. Q-Former maps the entire sequence to 32 learned queries, which are not time-aligned and not guaranteed to preserve fine temporal structure; the cells are equally spaced in query space, not in audio time. Codec tokens at full quality (4 levels × 25 fps = 100 tok/sec) produce 500 tokens for 5 seconds, about 70× more than the same utterance as text tokens. Cell colors mark consonants, vowels, and silence.

Mental model. Audio → (representation) → frame vectors → (alignment) → latent linguistic structure → (rate reduction) → manageable sequence → (LLM interface) → vectors in \(\mathbb{R}^D\) → (generation) → text or audio. Every system in this post is a specific answer to one or more of those arrows.

We'll start by looking at how older systems handled these mismatches explicitly, then follow the same structure into the modern era, where the same problems persist but the machinery handling them is less visible.

Section 1

What Classical ASR Made Explicit

The alignment and rate mismatches are not new problems. Classical ASR — roughly 1985 to 2015 — had explicit, inspectable solutions to both. Understanding them makes it easier to see what modern systems are doing differently, and what they gave up. Most production ASR systems in that era followed the same skeleton: the audio waveform was chopped into 25ms frames (with a 10ms hop), each frame was transformed through an FFT, passed through a mel-scale filterbank, and compressed via the discrete cosine transform into a set of cepstral coefficients (MFCCs) — commonly 13, though the number varied by system and era. Adding first and second derivatives (deltas and double-deltas) typically gave 39 numbers per frame. That was your feature vector.
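As a rough check on the numbers above, the sketch below counts analysis frames for a 25 ms window with a 10 ms hop and assembles the 39-dimensional feature size. The helper `num_frames` is a hypothetical name for this post, and it assumes no padding at the signal edges, which real toolkits handle in different ways.

```python
def num_frames(duration_ms: int, win_ms: int = 25, hop_ms: int = 10) -> int:
    """Count the full analysis windows that fit in the signal (no edge padding)."""
    if duration_ms < win_ms:
        return 0
    return 1 + (duration_ms - win_ms) // hop_ms

N_MFCC = 13                    # static cepstral coefficients per frame
FEATURE_DIM = N_MFCC * 3       # static + delta + delta-delta

frames_in_one_second = num_frames(1000)
print(frames_in_one_second, "frames per second,", FEATURE_DIM, "numbers per frame")
```

A 10 ms hop gives close to 100 frames per second, which is why classical front ends already ran well above the text token rate.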

The classical speech pipeline

Figure 1a. The waveform-to-transcript pipeline that dominated ASR for 30 years. The shape (raw audio → fixed-dimensional vectors per frame → a model that must handle the fact that different speakers say the same word at different speeds) stayed constant throughout. Only the tools at each stage changed.

The feature extraction stage (MFCCs) occupies the same role in the pipeline as neural audio encoders do today: take a variable-length waveform and produce a sequence of fixed-dimensional vectors, one per frame. The key difference is that MFCCs are fixed mathematical transforms. The same mel filterbank and DCT applied to the same audio always give the same coefficients, and each coefficient is the weight of one DCT basis function over the log-mel spectrum. Whisper or HuBERT features are learned from data and opaque: two systems trained on the same data may produce completely different representations. The output shape is the same; the transform is deterministic in one case and learned in the other.

The HMM acoustic model had to deal with a problem that every audio-LLM still faces: the same word spoken fast and spoken slowly produces a different number of frames. "Cat" from a fast speaker might be 4 frames. "Caaaaat" from a slow speaker might be 12. Both should decode to the same output. This is the variable-length alignment problem.

The HMM's trick: self-loops and marginalization

The HMM solved this with a specific mechanism. Each phoneme was modeled as a short sequence of hidden states (typically 3: onset, middle, release), and each state had a self-loop — a probability of staying in the same state for the next frame. If the self-loop probability is 0.6, the expected state duration is 2.5 frames, but the model accommodates anything from 1 to many frames in that state.
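The duration claim can be written out directly. Here \(p\) denotes the self-loop probability and \(d\) the number of frames spent in a state; both symbols are introduced only for this worked example.

\[
P(d) = p^{\,d-1}(1-p), \qquad \mathbb{E}[d] = \sum_{d=1}^{\infty} d\,p^{\,d-1}(1-p) = \frac{1}{1-p}.
\]

With \(p = 0.6\) this gives \(\mathbb{E}[d] = 1/0.4 = 2.5\) frames, while \(P(d \ge 6) = p^{5} \approx 0.078\): long stays remain possible but become geometrically less likely.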

HMM for phoneme /k/ — same phoneme, two speaking rates

Figure 1b. Left: the 3-state HMM topology for phoneme /k/. Self-loops (arcs back to the same state) absorb duration: when someone speaks slowly, the model stays in each state longer. Right: the same /k/ from two speakers at different speeds. Both decode to the same state topology — s₁, s₂, s₃ for /k/ — with different numbers of self-loop repetitions. The HMM didn't compress, pad, or attend. It marginalized.

The forward algorithm computes the total probability of the observed frame sequence by summing over all valid state sequences (every possible alignment of frames to states). It does this in \(O(T \times S^2)\) time, where \(T\) is the number of frames and \(S\) is the number of states. \(S\) is typically small, and for a left-to-right topology the transition matrix is extremely sparse, so in practice the inner loop touches only 2–3 states per frame. The Viterbi algorithm finds the single best alignment. In both cases, duration variation is handled by the math, not by an architectural choice.
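A minimal sketch of both recursions for a left-to-right topology follows. It makes several assumptions: a single shared self-loop probability, paths that start in the first state and end in the last, and made-up emission values from the hypothetical helper `toy_emissions`. It works with raw probabilities for readability; a real implementation would use log-space to avoid underflow.

```python
def forward_and_viterbi(emissions, self_loop=0.6):
    """emissions[t][s] = P(frame t | state s) for a left-to-right HMM.

    Each state either stays (self_loop) or advances by one (1 - self_loop).
    Returns (total probability over all alignments, best state path).
    """
    T, S = len(emissions), len(emissions[0])
    stay, move = self_loop, 1.0 - self_loop

    alpha = [0.0] * S            # forward: sum over alignments
    delta = [0.0] * S            # Viterbi: best single alignment
    alpha[0] = delta[0] = emissions[0][0]
    backptrs = []

    for t in range(1, T):
        new_alpha, new_delta, ptr = [0.0] * S, [0.0] * S, [0] * S
        for s in range(S):
            # Only two predecessors exist: s (self-loop) or s - 1 (advance).
            a = alpha[s] * stay
            d, arg = delta[s] * stay, s
            if s > 0:
                a += alpha[s - 1] * move
                if delta[s - 1] * move > d:
                    d, arg = delta[s - 1] * move, s - 1
            new_alpha[s] = a * emissions[t][s]
            new_delta[s] = d * emissions[t][s]
            ptr[s] = arg
        alpha, delta = new_alpha, new_delta
        backptrs.append(ptr)

    # Trace the best alignment back from the final state.
    path = [S - 1]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return alpha[S - 1], path


def toy_emissions(durations, hit=0.9, miss=0.05):
    """Frames that favor state s for durations[s] frames each (made-up values)."""
    rows = []
    for s, n in enumerate(durations):
        rows += [[hit if j == s else miss for j in range(len(durations))]] * n
    return rows


_, fast_path = forward_and_viterbi(toy_emissions([1, 2, 1]))   # 4 frames
_, slow_path = forward_and_viterbi(toy_emissions([4, 5, 3]))   # 12 frames
print(fast_path)
print(slow_path)
```

Both utterances pass through the same three states in the same order; only the number of self-loop repetitions differs, which is the sense in which duration is absorbed by the model rather than by an architectural component.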

What the HMM brought that attention and CTC don't

Self-attention and CTC both handle variable-length input. What separates the three mechanisms is not capability but what structural prior each encodes:

The HMM has three things attention doesn't.

First, built-in monotonicity: the left-to-right topology means state s₂ can only be entered from s₁, encoding the prior that speech unfolds sequentially. Attention must learn this from data.

Second, an implicit duration model: self-loop probabilities induce a geometric distribution over how long each state lasts. The model never explicitly says "phoneme /k/ lasts 80ms," but the probability of staying in a state versus leaving it encodes that information. (This geometric prior was recognized even in the HMM era as a poor fit for actual phoneme durations, which are roughly log-normal; hidden semi-Markov models addressed it with explicit duration distributions, at higher cost.) Attention has no explicit duration model of this kind. The mapping of 50 frames to one linguistic unit is buried in the weights, learned implicitly from data rather than encoded structurally.

Third, principled marginalization: the forward algorithm sums over all valid paths to each state, integrating out the uncertainty about exactly when each phoneme occurred. Attention does not perform explicit global marginalization over monotone alignments. It learns one set of weights per head, finding weights that work, but alignment structure is implicit rather than structurally guaranteed.

Three mechanisms for variable-length alignment

Figure 1c. Side-by-side comparison of the three alignment mechanisms. HMMs have the most structural priors — monotonicity and duration are built in, marginalization is exact. CTC inherits alignment marginalization but drops the explicit duration model. Attention leaves alignment structure to be learned from data. More expressiveness at each step; less structural guarantee.

"But doesn't causal masking make attention monotonic?" No — causal masking prevents a position from attending to later positions in the sequence. It does not impose a monotone many-frames-to-one-label alignment. When the LLM generates "cat" at position 520, it can attend to any of the 500 audio frames before it. There's no constraint that it should focus on frames 200–210 where /kæt/ was spoken. Attention by itself does not collapse many acoustic frames into one linguistic unit. There is no built-in mechanism equivalent to the HMM self-loop.

CTC sits between HMMs and attention on this spectrum. It enforces monotonic alignment and marginalizes over alignments via its own forward algorithm with blanks and collapses, but drops the explicit duration model and, in its vanilla form, assumes conditional independence between output labels given the input: each label is chosen independently at the loss level, so vanilla CTC cannot model output label-to-label dependencies. The encoder (a BLSTM or Conformer) does enrich the input representations with frame context, so the independence assumption applies to labels given a richly encoded input — not raw frames — but output label dependencies remain unmodeled at the loss level. CTC inherits HMM alignment structure and grafts it onto neural features.
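The "blanks and collapses" rule can be sketched in a few lines. The blank symbol `_` and the per-frame label strings are illustrative choices, not taken from any particular toolkit.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol for this sketch


def ctc_collapse(frame_labels):
    """Apply the CTC collapse rule: merge adjacent repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != BLANK]


fast = list("_kk_æ_t_")                   # 8 frames
slow = list("__kkkk__ææææææ__ttt__")      # 21 frames
print(ctc_collapse(fast), ctc_collapse(slow))
```

Both frame sequences collapse to the same three labels, so duration disappears from the output. A blank between two identical labels keeps them distinct: `ctc_collapse(list("ll_l"))` yields two `l` labels rather than one.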

CTC-guided adapters (Section 3) do something specific: they run CTC compression before the LLM ever sees the audio, normalizing the frame rate from ~50 frames/sec (typical for SSL encoders like HuBERT/wav2vec 2.0 at 20ms stride; EnCodec/SoundStream commonly operate at 50–75 fps; some newer codecs at 12.5–25 fps) down to roughly the text token rate (~4–8 tokens/sec). The LLM then operates in its native regime, having to learn much less of the rate mismatch because CTC already removed most of it. But CTC carries its own limitations: conditional independence between labels, dependence on a pre-trained CTC model, noisy segment boundaries, monotonic alignment only, and speech-specificity (it won't generalize to music or environmental audio). Most teams choose simpler adapters, betting that attention plus enough data can learn what CTC gives you structurally.

CTC as preprocessing — normalizing to the LLM's native rate

Figure 1d. CTC compression as preprocessing. Top: raw audio at ~50 frames/sec with repeated phoneme labels. Middle: text tokens at ~4/sec for comparison (a 10–50× rate mismatch). Bottom: after CTC merges frames for the same phoneme, the audio is now at ~4–8 tokens/sec. The LLM sees much less of the rate mismatch: CTC removed most of it before the sequence enters the model.

Classical (HMM era)

Features: Hand-designed (MFCCs). 13–39 dims/frame. Each coefficient ≈ spectral envelope shape. Interpretable.

Variable length: HMM self-loops. Duration modeled as a geometric distribution. Marginalized out via forward algorithm.

Alignment: Viterbi decoding produces a frame-to-phoneme alignment as a byproduct — inspectable and exact under the model's assumptions.

Limitation: Conditional independence between frames given state. Fixed feature design — can't learn what the task needs.

→

Modern (Audio-LLM era)

Features: Learned (Whisper, HuBERT, codec). 256–4096 dims/frame. Vastly more powerful. But opaque — what is dimension 847?

Variable length: Adapters compress (pooling, Q-Former). Attention handles long sequences. CTC merges. Multiple ad-hoc mechanisms.

Alignment: Implicit in attention weights — not given as a first-class exact object the way Viterbi or CTC alignments are. Or CTC-based (monotonic only).

Advantage: End-to-end learning. Features optimize for the task. Vastly more expressive models. Far better results.

End-to-end systems massively outperform HMMs. The tradeoff is conceptual. The HMM era had a clean answer to the alignment and rate mismatches: treat duration as a hidden variable, marginalize over it, and read off exact frame-to-phoneme assignments. The modern era replaced that with a collection of engineering choices (adapters, compression ratios, attention over long sequences). Each step gained expressiveness and lost structural guarantee. Modern models removed the explicit alignment machinery. They did not remove the problem. They pushed it into learned representations.

Section 2

The Two-Fold Problem

A text LLM eats and emits discrete tokens from a fixed vocabulary of ~32K–100K symbols. Each token is a discrete symbolic unit with a dedicated embedding: "cat" maps to token ID 3857, which maps to a row in the embedding table, which the transformer processes. You can read the token and know what it represents. Audio has neither property. The entire field of audio-language models is an answer to two entangled problems.

The bridge problem

Audio input → ? → LLM vocabulary ("The cat sat …")

Figure 2a. Audio is a continuous waveform; LLMs are trained on discrete tokens from a finite vocabulary. The "?" is the entire challenge: every audio-language model is an answer to what goes in that gap.

Problem 1: Content is implicit

When someone says "the cat sat on the mat," the waveform that reaches the microphone contains not just the words but the speaker's identity (male, female, child), their emotional state (calm, angry, sarcastic), the room acoustics (reverb, echo), background noise (traffic, music), prosody (which words are stressed, the intonation contour), and breathing patterns. All of these are superimposed in the same signal. The linguistic content — the actual words — is implicit, buried under everything else.

Five signals superimposed in one waveform

Figure 2b. The audio waveform contains linguistic content, prosody, speaker identity, emotion, and room acoustics, all mixed together in the raw signal. The entire audio encoder exists because content never sits on the surface: you must extract it.

Text tokens carry none of this. The token "cat" doesn't encode who typed it, what font they used, or what room they were in. Audio carries all of it, which is why the audio encoder (Whisper, HuBERT, WavLM, or a neural codec) exists: it must separate or at least organize these superimposed signals into a representation the LLM can use.

Problem 2: Tokenization must be invented

Text arrives at a language model already symbolized. Writing systems — alphabets, syllabaries, logographs — are pre-existing symbolic interfaces that humans developed over millennia. BPE tokenization isn't discovering some natural ontology of language; it's compressing a stream of symbols that were already symbols. The tokenizer approximates a pre-existing structure. You can read a BPE token because the structure it represents was already readable before the tokenizer ran.

Audio has no such pre-existing symbolic interface. No writing system encodes waveforms at the level of acoustic detail a model needs. When you apply k-means clustering or residual vector quantization to audio features, you're not approximating something that was already there. You're inventing a symbolic structure from scratch, or choosing to avoid one entirely. What does EnCodec codebook entry 847 "mean"? Nothing stable. It's a point in a learned latent space that reconstructs a particular acoustic pattern. The token doesn't contain content the way text tokens do; it's a compression coordinate that was never anything else first. There was no pre-existing symbolic layer to approximate.

The rate mismatch compounds everything. "The cat sat on the mat" is 6 text tokens. The same sentence spoken at a single-codebook semantic level produces 50–100 audio tokens depending on the encoder and speaking rate, already roughly 10–20× more than text. With a multi-codebook codec (the norm for high-quality audio), multiply by the number of levels: 8 codebooks × 75 fps for 2 seconds gives 1,200 tokens for the same utterance, about 200× the text length. The LLM's attention must therefore span sequences far longer than the equivalent text would ever produce.

The discrete tokenization landscape

Text tokens vs. audio tokens — same sentence, very different representations

Text BPE — 6 tokens for "the cat sat on the mat"

Audio — same content, dozens of opaque indices

Same sentence. Six transparent tokens vs. dozens of opaque indices. The discrete tokenization methods below all produce the right column.

Figure 2c. Text BPE tokens are inspectable — you can read them. Audio token indices are compression coordinates into a learned codebook — you can't read index 847 and know what sound it represents. The tokenizer invented this symbolic structure from scratch; there was no pre-existing one to approximate.

The shared skeleton — every method follows this

Waveform (always starts here) → Neural Encoder (architecture varies) → Quantizer (mechanism varies) → Discrete Indices (structure varies) → LLM (always the consumer)

Figure 2d. Green = shared across all methods. Yellow dashed = where they diverge. The waveform always enters, the LLM always consumes. What varies is the middle.

Method	Training Objective	Quantizer	Codebooks	Optimized For	Per-Frame Output
Text BPE (reference)	N/A (statistical compression)	Byte-pair merges	1 flat	Compress text corpus	"cat" (readable)
HuBERT (2021)	Masked prediction	k-means on learned features	1 flat	Linguistic / phonetic structure	412 (single cluster ID)
wav2vec 2.0 (2020)	Contrastive	Gumbel-softmax product quantization	2 groups × 320 entries each	Discriminative features	G₁=58, G₂=214
SoundStream (Google, 2021)	Reconstruction	Residual VQ (progressive residuals)	Variable N_q via structured dropout	Faithful audio reconstruction (16 kHz)	CB₁=312, CB₂=54, CB₃=91, …
EnCodec (Meta AI, 2022)	Reconstruction	Residual VQ (progressive residuals)	2–32 hierarchical	Faithful audio reconstruction (24 kHz)	CB₁=847, CB₂=92, CB₃=6, …
SpeechTokenizer (2023)	Reconstruction + semantic distillation	RVQ + HuBERT loss on CB₁	8 hierarchical	Content/acoustic separation	CB₁=503, CB₂=71, …

Two patterns stand out in this table. First, every audio method produces opaque tokens: integers that index into a learned codebook. You cannot read an audio token and know what it means. Second, the training objective shapes what the tokens capture: HuBERT (masked prediction) and wav2vec 2.0 (contrastive) produce tokens more concentrated on linguistic structure, because that's what their objectives reward. SoundStream and EnCodec (both reconstruction-trained) faithfully preserve the entire signal, including speaker identity, room reverb, and breathing that the LLM doesn't need.

RVQ codebooks are not a clean factorization. CB₁ is often labeled "semantic content," CB₂ "prosody," CB₃ "timbre" in diagrams. Each codebook is actually trained to minimize reconstruction error on the residual from all previous codebooks — CB₁ captures whatever has the most energy (which loosely correlates with content), but there is no principled separation into interpretable factors. SpeechTokenizer's explicit semantic distillation loss on CB₁ is the clearest evidence: if RVQ gave you clean factorization for free, no one would need to engineer it in. The CB₁/CB₂/CB₃ labeling in diagrams is a useful simplification, not a reliable description of what those codebooks actually encode.
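The residual structure is easier to see in code. The sketch below is a toy residual quantizer with two hand-picked 2-D codebooks; in a real codec the codebooks are learned and much larger, and the function names here are hypothetical.

```python
def nearest(codebook, x):
    """Index of the codebook vector closest to x (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(x, codebooks):
    """Each level quantizes whatever the previous levels failed to explain."""
    residual, indices = list(x), []
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the selected vectors across levels."""
    out = [0.0] * len(codebooks[0][0])
    for i, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[i])]
    return out


# Toy codebooks (not learned): a coarse grid, then small corrections.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],
]
indices, residual = rvq_encode([0.9, 0.3], codebooks)
print(indices, rvq_decode(indices, codebooks), residual)
```

Nothing in the loop assigns a role such as "content" or "timbre" to a level: each codebook simply codes the leftover error, which is why any semantic separation has to be added through the training objective.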

What every method shares:

Waveform in → integer indices from a learned codebook out. No "audio alphabet" exists.
Tokens are opaque — you can't inspect them and read their meaning.
Substantially more tokens/sec than text. Single-codebook semantic systems run at 25–75 frames/sec (~6–18× text rate). Multi-codebook codecs multiply by N levels: 8 codebooks × 75 fps = 600 tokens/sec vs. ~4 text tokens/sec.
The codebook is always finite (256–16,384 entries) but entangled — what each entry "means" isn't decomposable.
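The rate figures in this list follow from simple multiplication. The sketch below reproduces them, assuming every codebook level is flattened into the token stream (one token per level per frame) and using the rough 4 tokens/sec text rate quoted in this post.

```python
TEXT_TOKENS_PER_SEC = 4  # rough text rate used in this post


def audio_tokens_per_sec(frame_rate_hz, n_codebooks=1):
    """Tokens per second when every codebook level emits one token per frame."""
    return frame_rate_hz * n_codebooks


def ratio_to_text(frame_rate_hz, n_codebooks=1):
    """How many times faster than text the audio token stream runs."""
    return audio_tokens_per_sec(frame_rate_hz, n_codebooks) / TEXT_TOKENS_PER_SEC


for name, fps, levels in [("semantic, 25 fps", 25, 1),
                          ("semantic, 75 fps", 75, 1),
                          ("codec, 8 x 75 fps", 75, 8)]:
    print(f"{name}: {audio_tokens_per_sec(fps, levels)} tok/s, "
          f"{ratio_to_text(fps, levels):.2f}x text")
```

Systems that predict several codebook levels in parallel per step see a shorter sequence than this flattened count, at the cost of a more complex output head.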

This is the discrete path. A parallel approach skips discretization entirely, projecting continuous encoder features directly into LLM embedding space via a learned adapter. Section 3 covers where the fork actually happens and what each branch gives up: discrete tokens enable next-token generation but lose information at quantization and explode sequence length; continuous projection is richer but requires different generation machinery.

◆ ◆ ◆

Section 3

Discrete vs. Continuous: Where the Fork Actually Happens

Most audio encoders — Whisper, HuBERT, WavLM, wav2vec 2.0 — produce continuous feature vectors. The discrete/continuous split is not about which encoder you use; it is about what happens after the encoder: do you quantize those features into codebook indices, or project them directly into the LLM's embedding space?

The fork happens after the encoder — not at the encoder

Figure 3a. The same HuBERT model can feed either path — apply k-means for discrete cluster IDs, or project via an adapter for soft embeddings. The encoder doesn't determine which path you take. The fork is downstream.

What "discrete" and "continuous" actually mean

This distinction is subtler than it first appears. Discrete means selecting an index from a finite set of learned centroids: the input feature gets "snapped" to the nearest codebook entry. Information is lost at the snap. Training is also complicated: argmin is not differentiable, so gradients cannot flow back through the selection step directly. The common workaround (the straight-through estimator) simply pretends the gradient of the discrete selection equals that of the continuous input; good enough in practice, but an approximation. Continuous means multiplying the feature vector by a learned weight matrix W (written W×) to project it into the LLM's embedding space. No quantization, no information loss, and the whole operation is fully differentiable, so standard backprop works end-to-end.

The subtle case: even in the discrete path, some systems use the codebook vector (not just the index) downstream. The values are continuous, but constrained to a finite set of learned points. The distinction that matters: is the representation space unconstrained (any point in \(\mathbb{R}^D\)) or restricted to \(K\) learned centroids?
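The two operations can be compared on a pair of nearby feature vectors. This is a minimal sketch with toy values: the codebook, the matrix `W`, and the function names are all illustrative, and the straight-through estimator used during training is not shown.

```python
def quantize(x, codebook):
    """Discrete path: snap x to its nearest centroid, return (index, centroid)."""
    def sq_dist(c):
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    index = min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))
    return index, codebook[index]


def project(x, W):
    """Continuous path: y = W x, with W of shape (D_out, D_in)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]   # K = 3 toy centroids
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]           # toy map from R^2 to R^3

a, b = [0.9, 1.2], [1.1, 0.8]                      # two distinct input features
print(quantize(a, codebook), quantize(b, codebook))
print(project(a, W), project(b, W))
```

Both inputs snap to the same centroid, so the difference between them is gone after quantization, while the projected vectors stay distinct. That is the information loss at the snap in miniature.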

Discrete vs. continuous — two different geometries

Discrete: select an index

Continuous: project the vector

Figure 3b. The geometric difference between discrete and continuous paths. In the discrete case (left), an input feature × is snapped to the nearest codebook centroid — information outside that Voronoi region is discarded, and gradients require a straight-through estimator to flow back. In the continuous case (right), a weight matrix W projects the feature to any point in \(\mathbb{R}^D\) — no quantization, no information loss, fully differentiable end-to-end. The discrete path produces a shorter description (one integer); the continuous path preserves more signal.

What actually enters the transformer

Even for text, tokens are immediately mapped to continuous vectors. A discrete token is just a lookup key; the model never sees the integer. The real interface to the transformer is therefore always a sequence of vectors in \(\mathbb{R}^D\), whichever path produced them. The difference is how each vector was obtained and whether a discrete bottleneck sits in the middle. Tracing each path end-to-end makes this concrete:

Four paths to the same place — all converge on a vector in \(\mathbb{R}^D\)

Figure 3c. Four ways audio reaches the transformer, traced end-to-end. Text and discrete-vocab-extension follow the same mechanism: index → embedding table lookup → vector. The codebook-vector path retrieves from the VQ codebook (not the LLM table) and projects. The continuous path has no index at all. All four arrive at the same destination: a vector in \(\mathbb{R}^D\).
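The four paths can be traced in a toy setting. Everything below is an assumption made for illustration: the dimensions, the vocabulary sizes, the random weights standing in for trained ones, and the convention of placing added audio rows after the text rows in the LLM table.

```python
import random

rng = random.Random(0)
D_LLM, D_ENC = 8, 4                     # toy dimensions
TEXT_VOCAB, AUDIO_VOCAB = 5000, 1024    # toy vocabulary sizes


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


llm_table = rand_matrix(TEXT_VOCAB + AUDIO_VOCAB, D_LLM)  # text rows, then audio rows
vq_codebook = rand_matrix(AUDIO_VOCAB, D_ENC)             # lives outside the LLM
W = rand_matrix(D_LLM, D_ENC)                             # adapter projection


def project(x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def text_path(token_id):
    """Text: index -> row of the LLM embedding table."""
    return llm_table[token_id]


def vocab_extension_path(audio_index):
    """Discrete audio as extra vocabulary: index -> row of the same table."""
    return llm_table[TEXT_VOCAB + audio_index]


def codebook_vector_path(audio_index):
    """Discrete audio via the VQ codebook: index -> codebook vector -> projection."""
    return project(vq_codebook[audio_index])


def continuous_path(encoder_feature):
    """Continuous audio: no index at all, just a projection."""
    return project(encoder_feature)


sequence = [
    text_path(3857),
    vocab_extension_path(847),
    codebook_vector_path(847),
    continuous_path([0.3, -1.2, 0.7, 0.1]),
]
print([len(v) for v in sequence])
```

Every entry in `sequence` has length `D_LLM`, so the transformer cannot tell from shape alone which path produced a given position.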

From spectrograms to encoder features

Encoder features are the latest step in a lineage, not a new kind of object. A mel spectrogram is a matrix: mel frequency bins × time frames. The encoder output has the same shape: learned feature dimensions × time frames. It is a virtual spectrogram in a learned feature space — one whose dimensions are trained to surface structure the downstream task needs. The adapter then projects this into a virtual spectrogram in LLM embedding space. Same matrix shape at every stage, different coordinate system.

The LLM and the encoder operate in different geometric spaces. The LLM expects \(D_{\text{llm}}\)-dimensional vectors trained on a particular distributional geometry; the encoder produces features in a geometry learned from audio. The adapter maps between them, with the encoder's training objective determining what structure the features carry into that translation.

The assumption is that a pretrained audio encoder has already extracted the right structure: the information the LLM needs is accessible in the encoder's feature space, even if not in the right geometry yet. This is often true, but it breaks when the encoder's pretraining task diverges from what the LLM task needs. An encoder trained for reconstruction (EnCodec) faithfully preserves reverb and breathing; an encoder trained for masked prediction (HuBERT) concentrates on phonetic structure. Neither is universally right. The adapter can partially compensate, but it can't recover information the encoder discarded. What gets thrown away at the encoder stage is gone.

The representation lineage — same matrix shape, different coordinate systems

Mel spectrogram (80 mels × T; time axis 0 s → 10 s). Fixed filterbank. Hand-designed features: the "tokenization" of classical speech systems.

Encoder features (1280 dims × T; time axis frame 0 → frame T). Learned features. Same shape, but the axis is now a learned feature axis optimized by the encoder's training objective.

After adapter: projected features (\(D_{\text{llm}} \times T'\); time axis frame 0 → frame T', compressed). The adapter re-projects into LLM-space and (often) compresses the time axis. A virtual spectrogram in LLM embedding space.

Figure 3d. Three representations in the lineage from waveform to LLM input — all are matrices of (feature dims × time frames). The mel spectrogram uses a hand-designed frequency axis. The encoder output uses a learned feature axis. After the adapter, the time axis is compressed and features are projected into LLM embedding space. The shape is the same throughout; only the coordinate system changes.

The adapter design space

After the encoder runs, you have a matrix of shape \(T \times D_{\text{enc}}\): \(T\) time frames, each represented as a \(D_{\text{enc}}\)-dimensional vector. The problem is that this is the wrong shape in two ways. The LLM expects \(D_{\text{llm}}\)-dimensional vectors (typically 4096 for a 7B model, not the encoder's 512 or 1024). And \(T\) is too large: 10 seconds of audio at 50 frames/sec gives 500 encoder frames, far more than the ~40 text tokens a person would use to transcribe the same utterance.

The adapter is the learned module that solves both problems: it maps from encoder space (\(D_{\text{enc}}\)) to LLM embedding space (\(D_{\text{llm}}\)), and usually compresses the time axis from \(T\) frames to \(T'\) tokens. The encoder and LLM are typically pretrained separately and frozen; the adapter is what gets trained to bridge them. How aggressively it compresses, and whether that compression is uniform or content-sensitive, defines the four main families below.

Linear / MLP applies a per-frame projection: each frame independently mapped to LLM dimension. No temporal compression: \(T' = T\). The LLM gets the full 500-frame sequence for 10 seconds of audio. This solves the dimensionality mismatch (\(D_{\text{enc}} \to D_{\text{llm}}\)) without touching the temporal mismatch. The assumption is that the LLM's attention can handle a 500-token audio prefix without degradation, which is increasingly true as context windows grow, but comes at a real compute cost. What it throws away: nothing at the adapter stage. What it defers: the temporal alignment problem, which the LLM must now learn to handle implicitly. It breaks in memory-constrained settings or when audio is long enough to crowd out the text context. Ultravox uses this approach: a two-layer MLP on Whisper encoder frames, relying on growing context windows to absorb the cost.
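A per-frame MLP adapter can be sketched as below. The sizes are toy values, the weights are random rather than trained, and the ReLU activation is an assumption made for this sketch, not a statement about any deployed system.

```python
import random


def make_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mlp_adapter(frames, W1, W2):
    """Apply the same two-layer MLP to every frame independently, so T' = T."""
    out = []
    for x in frames:
        hidden = [max(0.0, h) for h in matvec(W1, x)]   # ReLU (assumed)
        out.append(matvec(W2, hidden))
    return out


rng = random.Random(0)
D_ENC, D_HID, D_LLM, T = 8, 16, 12, 500    # toy sizes; real D_llm is in the thousands
frames = [[rng.gauss(0.0, 1.0) for _ in range(D_ENC)] for _ in range(T)]
W1 = make_matrix(D_HID, D_ENC, rng)
W2 = make_matrix(D_LLM, D_HID, rng)

projected = mlp_adapter(frames, W1, W2)
print(len(projected), len(projected[0]))
```

The output has one vector per input frame: the feature dimension changes, the sequence length does not, and no frame's output depends on its neighbors.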

Strided / Pooled groups k frames and averages (or strides) them into one embedding. \(T' = T/k\). At k=5, 500 frames become 100. This solves both mismatches: dimensionality and sequence length. The assumption is that temporal compression is uniform: every k-frame window has equal informational density. That assumption is wrong for speech: silence compresses fine; dense consonant clusters lose detail. It breaks on fast speech, accented speech, or any audio with rapid acoustic change, where exactly the frames that need the most resolution get pooled together. Qwen-Audio uses adjacent-frame stacking / stride-style compression before projection: simple architecture, manageable sequence length, fixed-rate resolution cost.
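The uniform-compression assumption is visible in a small mean-pooling sketch. It covers only the temporal step (the projection to \(D_{\text{llm}}\) is omitted), and `mean_pool` is a hypothetical helper name.

```python
def mean_pool(frames, k):
    """Average each group of k consecutive frames; a short final group is averaged as-is."""
    pooled = []
    for start in range(0, len(frames), k):
        group = frames[start:start + k]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


silence = [[0.0]] * 500                          # 10 s of 1-D "features" at 50 fps
print(len(mean_pool(silence, 5)))                # sequence length after k = 5

burst = [[0.0], [0.0], [1.0], [0.0], [0.0]]      # one-frame event in a 5-frame window
print(mean_pool(burst, 5))
```

The one-frame burst survives only as a diluted average, while a window of pure silence costs exactly as many output tokens as a window of dense speech.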

Q-Former / Perceiver uses a small set of learned query vectors that cross-attend to all encoder frames. The number of queries determines the output length: \(T' = Q\) (e.g. 32). This solves the sequence-length problem most aggressively — 500 frames → 32 tokens — while letting attention learn what to keep. The assumption is that 32 learned queries are sufficient to capture everything the LLM task needs; that the information bottleneck forces useful compression rather than information loss. It breaks when the task requires fine-grained temporal detail (speaker diarization, precise word timing) that a 32-slot bottleneck discards. The queries are trained, not hardcoded, so what survives depends heavily on the downstream loss. SALMONN uses 32 query tokens per window, borrowing the cross-attention mechanism from BLIP-2 (Li et al., 2023). Flamingo used a different approach — a Perceiver Resampler with gated cross-attention, not Q-Former.
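The fixed-length property comes from the cross-attention shape alone. The sketch below is heavily simplified: a single head, no key/value projections, no feed-forward layers, and random vectors standing in for the learned queries.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frames):
    """Each query attends over all frames; output length equals the number of queries."""
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in frames]
        weights = softmax(scores)
        outputs.append([sum(w * f[d] for w, f in zip(weights, frames))
                        for d in range(dim)])
    return outputs


rng = random.Random(0)
DIM, N_QUERIES = 8, 32
queries = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_QUERIES)]
short_clip = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(100)]
long_clip = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(500)]

print(len(cross_attend(queries, short_clip)), len(cross_attend(queries, long_clip)))
```

Both clips come out as 32 vectors. Each output is a weighted mixture over all frames, so there is no guarantee that output position corresponds to any position in time.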

CTC-guided uses a pre-trained CTC model to identify which frames correspond to the same linguistic unit, then merges them. Content-adaptive: silence compresses heavily, dense speech less. This is the mechanism closest to the HMM's approach from Section 1: it removes duration variability before the LLM, normalizing audio to roughly the text token rate. The assumption is that the CTC model's segmentation is reliable enough to guide compression, and that the right granularity for merging is the CTC boundary, not a fixed window. It breaks when CTC alignment is wrong (noisy audio, accents, out-of-vocabulary words), only works for speech, and the merge strategy (average? last frame? attention-weighted?) is itself an unresolved choice that can further degrade quality when CTC makes errors. Despite being conceptually appealing — it is the approach most directly analogous to what classical ASR did explicitly — it has not become a standard building block in deployed audio-LLMs. There is no single clean reason: the dependency on a separate CTC model, the speech-only constraint, and the unresolved merge strategy each add friction, and simpler adapters turn out to work well enough that the added complexity is hard to justify.
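One possible merge strategy is sketched below. It assumes per-frame CTC labels are already available, averages the frames in each run of identical labels, and offers a `drop_blanks` switch; all of these are choices made for this sketch, since the merge strategy is exactly the part the text describes as unresolved.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol for this sketch


def ctc_guided_merge(frames, labels, drop_blanks=False):
    """Average the frames inside each run of identical CTC labels.

    With drop_blanks=False every run, including silence, becomes one vector.
    """
    merged, start = [], 0
    for label, run in groupby(labels):
        n = len(list(run))
        if not (drop_blanks and label == BLANK):
            segment = frames[start:start + n]
            dim = len(segment[0])
            merged.append([sum(f[d] for f in segment) / n for d in range(dim)])
        start += n
    return merged


labels = list("__kkkk__ææææææ__ttt__")            # 21 frames of toy CTC output
frames = [[float(i)] for i in range(len(labels))]  # 1-D stand-in features

print(len(ctc_guided_merge(frames, labels)))
print(len(ctc_guided_merge(frames, labels, drop_blanks=True)))
```

Unlike fixed-ratio pooling, the output length depends on the label sequence rather than on duration. It also inherits every error in `labels`, which is the fragility described above.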

Three adapter types used in practice: how they transform \(T\) encoder frames into \(T'\) LLM tokens

Adapter	Mechanism	Compression	Tradeoff	Used by
Linear / MLP	Per-frame projection. Each frame is independently mapped to the LLM dimension. No temporal compression.	None: \(T' = T\)	Preserves all detail; the LLM gets a long sequence	Ultravox, LLaSA
Strided / Pooled	Groups of k frames are averaged or strided into one embedding. Simple, effective, fixed ratio.	Fixed k×: \(T' = T/k\)	Shorter sequence, but uniform: silence is compressed at the same ratio as speech	Qwen-Audio, WavLLM, SpeechVerse (1D conv)
Q-Former / Perceiver	Learned query vectors cross-attend to all encoder frames. The number of queries determines the output length.	Learned: \(T' = Q\) (e.g. 32)	Aggressive compression; attention learns what to keep	SALMONN, Pengi

Figure 3e. Three deployed adapter architectures. All transform \(T\) encoder frames into \(T'\) LLM tokens; the difference is the compression strategy and what information survives. Linear preserves all frames at full sequence length. Strided applies a fixed compression ratio. Q-Former compresses to a fixed number of learned queries.
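
The first two adapters are simple enough to sketch in a few lines of plain Python. This is an illustration of the shape arithmetic only, with placeholder weights and hypothetical function names, and it uses mean pooling where a real system might use a strided convolution or frame stacking.

```python
def linear_adapter(frames, weight):
    """Project each frame independently: T frames in, T tokens out.

    weight has one row per output dimension (d_llm rows of length d_enc).
    """
    return [[sum(w * x for w, x in zip(row, frame)) for row in weight] for frame in frames]

def strided_pool(frames, k):
    """Average non-overlapping groups of k frames: T frames in, ceil(T / k) out."""
    groups = [frames[i:i + k] for i in range(0, len(frames), k)]
    return [[sum(col) / len(group) for col in zip(*group)] for group in groups]

T, d_enc, d_llm = 500, 4, 6
frames = [[float(t)] * d_enc for t in range(T)]
weight = [[0.1] * d_enc for _ in range(d_llm)]  # placeholder values, not trained weights

projected = linear_adapter(frames, weight)                   # no compression
pooled = linear_adapter(strided_pool(frames, k=5), weight)   # pool, then project
print(len(projected), len(pooled))  # 500 100
```

Pooling treats every group of k frames the same way, so a stretch of silence costs exactly as many LLM positions as a stretch of dense speech.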

What the LLM sees — a flat sequence of embedding vectors

Text token embeddings: row looked up from the embedding matrix by token ID

Audio embeddings: encoder features projected by the adapter

Neither tokens nor waveforms enter the LLM — only embeddings do. A text token (e.g. ID 3857) is first mapped to a continuous vector by the embedding table; the LLM never sees the integer. An audio frame is mapped to a continuous vector by the encoder and adapter. Both paths produce a D-dimensional vector in the same embedding space. The transformer attends over positions in a shared embedding space — there is no formally separate audio-processing pathway at the transformer level, though learned representations may carry modality information implicitly.

Figure 3f. The LLM input is always a flat sequence of embedding vectors. Text tokens arrive via the embedding lookup table; audio arrives via encoder + adapter projection. The two types sit at the same level (both are \(D_{\text{llm}}\)-dimensional vectors), even though they started in completely different representation spaces.
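
A toy sketch of that flat sequence, under obviously simplified assumptions: a two-entry embedding table, a 2-dimensional encoder, and a single linear adapter. The token IDs, table values, and weights are all made up for illustration.

```python
D_LLM = 4

# Hypothetical embedding table: token ID -> D_LLM-dimensional row.
embedding_table = {
    11: [0.0, 0.5, 0.0, 0.5],
    3857: [0.1, 0.2, 0.3, 0.4],
}

def embed_text(token_ids):
    """Text path: look up one row per token ID."""
    return [embedding_table[t] for t in token_ids]

def embed_audio(encoder_frames, adapter_weight):
    """Audio path: project each encoder frame into the LLM dimension."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in adapter_weight]
            for frame in encoder_frames]

encoder_frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]               # 3 frames, encoder dim 2
adapter_weight = [[0.2, 0.1], [0.0, 0.3], [0.4, 0.4], [0.1, 0.0]]   # D_LLM rows of length 2

# The transformer receives one list of vectors; nothing marks which path a vector took.
llm_input = embed_text([11]) + embed_audio(encoder_frames, adapter_weight) + embed_text([3857])
print(len(llm_input), {len(v) for v in llm_input})  # 5 {4}
```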

What each approach loses: discrete tokenization gives a unified vocabulary and supports next-token generation, but quantization loses fine acoustic detail and each codebook level adds tokens the sequence must absorb. Continuous projection preserves richer features on input; for audio output it does not fit the standard next-token sampling regime, so generating audio requires separate machinery (diffusion, flow matching). Adapters that compress aggressively (Q-Former, pooling) discard timing detail that tasks like diarization need. CTC-guided adapters handle speech well but break on music or environmental audio. There is no lossless path. Section 4 is about the output side of the same problem.

Section 4

How Audio Comes Out

This section → ① Representation ② Rate ③ Alignment ④ Output

Getting audio into an LLM and getting it back out are not symmetric problems. For many language-understanding tasks, you can afford to be lossy on input: a continuous projection that drops some acoustic detail still lets the LLM understand the words. On output, the LLM must produce something that decodes all the way back to a waveform, which demands much more specificity. Voice agents, speech synthesis, and audio continuation all require it.

The core asymmetry: On the input side, continuous projections work well — you don't need discrete tokens to understand audio. But on the output side, the LLM needs to produce something that can be decoded back to a waveform. An LLM generates outputs by sampling from a probability distribution over a discrete vocabulary. It can predict "the next token is index 847 with probability 0.3." It does not natively predict "the next token is the continuous vector [0.41, −0.59, 0.73, …]" under standard next-token autoregressive training. This is why the discrete/continuous distinction matters more for generation than for understanding.
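
The point can be sketched in a few lines of plain Python: the model's output step yields an integer index, and a continuous vector only appears afterwards, through a codebook lookup. The logits and codebook below are random stand-ins, not values from any real model or codec.

```python
import math
import random

def sample_index(logits, rng):
    """Softmax over the vocabulary, then draw one index from that distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs[index]

rng = random.Random(0)
vocab_size, dim = 1024, 4
logits = [rng.gauss(0, 1) for _ in range(vocab_size)]                           # stand-in for LLM output
codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]   # stand-in codec codebook

index, prob = sample_index(logits, rng)
vector = codebook[index]  # the continuous vector comes from the lookup, not from the LLM
print(f"sampled index {index} with probability {prob:.4f}")
```

With a continuous projection on the output side there is no such table to index into, which is why those systems need separate generation machinery.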

Generation approaches split along a second axis that is worth naming explicitly. Semantic tokens are low-rate representations — HuBERT discrete units, or (with the RVQ caveats from Section 2) loosely the lower codebook levels — that capture linguistic and prosodic content with enough compression to fit inside an LLM context window. Acoustic tokens are the higher-rate levels that carry the fine detail required for waveform reconstruction: CB₂–CB_N at full codec rate, or multi-level codec sequences. The two differ not just in information content but in volume — semantic tokens run at a manageable rate; acoustic tokens at full codec rate add hundreds of tokens per second of audio. Most practical generation architectures try to decouple the two: let the LLM handle the semantic pass autoregressively, and use faster specialized machinery for acoustic fill-in.

Three architectures for audio output

Three ways to get audio out of the LLM

A. Cascade: LLM generates text → separate TTS produces audio

Still the most common in production. Simple, reliable, debuggable. But: all paralinguistic information (prosody, emotion, timing) is lost at the text boundary. The TTS must re-invent it. Latency = LLM generation + TTS generation.

B. Hybrid: LLM produces semantic tokens → separate model fills acoustic detail

The hierarchical generation pattern. The LLM handles "what to say" (semantic/CB₁) autoregressively. A separate, faster model handles "how it sounds" (CB₂–CB_N) in parallel or with fewer steps. Semantic tokens preserve prosody and timing that text doesn't.

C. Native audio generation: LLM directly predicts all audio tokens, fully end-to-end

SpiritLM (Nguyen et al., INRIA/Meta, 2024) is the clearest published example. It interleaves text tokens and HuBERT discrete speech tokens in a single vocabulary, trained on aligned speech-text pairs. The model learns to continue freely across modalities: given text, it can continue as speech; given speech, as text. An "expressive" variant adds pitch tokens alongside speech tokens to capture prosody more explicitly. Because the LLM treats audio tokens as first-class vocabulary items, prosody, emotion, and timing can stay inside the model's own sequence instead of being dropped at a text boundary. This is architecturally closest in spirit to what public evidence suggests for GPT-4o and Gemini, though neither has published its approach. SpiritLM sidesteps the multi-codebook generation problem (below) by operating in a single flat token space: HuBERT discrete units rather than waveform-quality codec tokens. The tradeoff is lower audio fidelity: HuBERT units carry enough content for cross-modal continuation but not enough acoustic detail for high-fidelity reconstruction. Native architectures that target waveform-quality output do face the multi-codebook challenge. Both the flat-token and the multi-codebook variants need substantial training data pairing text and speech.

Figure 4a. Three architectures for audio output. A (cascade) never exposes audio tokens to the LLM. B (hybrid) uses the LLM for semantic tokens only, outsourcing acoustic detail to a separate model. C (native) puts text and audio in a unified token space — the most capable and most challenging approach.

Cascade (LLM → text → TTS) is the simplest approach and still the most common in production. The text bottleneck is the core problem: prosody, emotion, speaker identity, and timing don't survive it. The TTS must reinvent them from context, which it does imperfectly. Cascade also has a latency floor: even with streaming TTS, the LLM must commit to words before any audio can start (~300–500ms time-to-first-audio).

Hybrid (LLM → CB₁ → acoustic model) routes around the text bottleneck by having the LLM generate only the first codebook level. CB₁ tokens are audio-derived, so paralinguistic information that text discards — speaker identity, prosodic contour — can survive in them. A separate, faster model fills in CB₂–CB_N in parallel. The cost: CB₁ is not a clean semantic representation (the RVQ factorization misconception from Section 2 applies here), and the fill-in model has no access to the LLM's internal state. Latency improves over cascade on the acoustic side, but the autoregressive CB₁ pass still dominates time-to-first-audio.

Native generation extends the LLM's vocabulary to include audio token IDs directly: text and audio tokens appear in the same sequence, with no representational boundary. The model can learn to control prosody and voice the same way it learned to control rhetoric. This is also where the mismatch reappears in reverse: the LLM must generate not a sequence but a matrix (N codebook levels × T frames), and its native next-token regime does not map cleanly onto that structure. Latency is in principle lowest, since audio tokens can stream straight to the codec decoder and the only remaining cost is codec decode time, but only with efficient generation strategies that avoid a full CB₁ pass before CB₂ starts.

The three architectures also differ in what feeds back during autoregressive generation. In cascade systems only text tokens are in the AR loop: audio is produced by a separate TTS step after the LLM finishes. In native and hybrid systems, audio token indices feed back as input at each step — the LLM attends to its own audio output when predicting the next token. This distinction matters for coherence: a native model can condition future tokens on the acoustic details of what it already generated; a cascade model cannot.

What each output path loses. Cascade discards all paralinguistic information — emotion, hesitation, speaker voice — at the text bottleneck. It is the most debuggable and robust option, but the output voice is entirely the TTS system's, not the model's. Hybrid avoids the text bottleneck for prosody but adds a second model and has no way to backpropagate from the final waveform into the LLM. Native audio generation is the most expressive path, but training is harder (the loss must operate over audio tokens directly, not readable text), generation is slower (the matrix serialization problem below), and there is no established evaluation protocol for open-ended voice quality.

The structural bottleneck for hybrid and native is the multi-codebook generation problem. At 25 fps with 4 codebook levels, one second of audio is 100 tokens: 25× the text token rate for equivalent content. Three strategies address it: hierarchical (AR for CB₁, then a parallel pass fills CB₂–CB_N — VALL-E style), delay patterns (codebooks staggered by one frame each, so a single AR pass generates all levels simultaneously — MusicGen), and masked parallel decoding (all tokens predicted in parallel via iterative unmasking in K passes — SoundStorm). Each is a different answer to the same question: how do you serialize a time×codebook matrix into a sequence?

The generation problem: one second of audio = a matrix of tokens (25 frames × 4 codebook levels)

Figure 4b. One second of audio at 25 fps with 4 RVQ codebook levels produces a 25×4 matrix of tokens — 100 tokens per second vs. ~4 text tokens for equivalent linguistic content. This is the generation problem in concrete form: how do you serialize a matrix into a sequence the LLM can predict?
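
One of the serialization strategies, the delay pattern, is easy to sketch. The version below is a simplified illustration in the style described for MusicGen, not its actual implementation: codebook level k is shifted right by k frames, so each autoregressive step emits one token per level, and `PAD` is a hypothetical placeholder for positions that have no token yet.

```python
PAD = -1  # hypothetical placeholder for positions outside the matrix

def apply_delay_pattern(codes):
    """codes[k][t] is the token of codebook level k at frame t.

    Returns a list of steps; each step holds one entry per codebook level,
    with level k delayed by k frames.
    """
    num_levels, num_frames = len(codes), len(codes[0])
    steps = []
    for s in range(num_frames + num_levels - 1):
        steps.append([codes[k][s - k] if 0 <= s - k < num_frames else PAD
                      for k in range(num_levels)])
    return steps

def undo_delay_pattern(steps, num_levels):
    """Invert apply_delay_pattern to recover the levels x frames matrix."""
    num_frames = len(steps) - num_levels + 1
    return [[steps[t + k][k] for t in range(num_frames)] for k in range(num_levels)]

T, N = 25, 4  # one second at 25 fps with 4 codebook levels, as in Figure 4b
codes = [[100 * k + t for t in range(T)] for k in range(N)]  # dummy token values

steps = apply_delay_pattern(codes)
print(len(steps), "steps instead of", T * N)  # 28 steps instead of 100
print(steps[:3])  # [[0, -1, -1, -1], [1, 100, -1, -1], [2, 101, 200, -1]]
```

The matrix is still 100 tokens, but the number of sequential steps drops from T × N to T + N − 1. The cost is that the tokens emitted within one step are predicted in parallel, so level k at frame t is conditioned on earlier steps rather than on its sibling levels in the same step.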

The asymmetric pattern in practice: Most deployed systems use different representations for input and output: continuous projection (Whisper encoder → adapter → LLM) for understanding, separate codec or TTS for generation. This isn't a design flaw; it reflects the different requirements. The codec decoder itself — turning audio tokens back to a waveform — is the one largely solved step: fast, high-quality, real-time. The bottleneck is everything upstream. The fully native model, where a single LLM handles both audio input and output in a unified token space, is the aspiration. The systems inferred to use something like it (GPT-4o, Gemini) have not published their architectures, and the training data requirements are enormous.

In Practice

Where These Patterns Appear

The input paths and output architectures in this post are not taxonomic abstractions. Four archetypes cover the design space well; each makes a specific bet about which mismatch matters most to solve explicitly.

Ultravox — linear adapter, text output (Fixie AI, open)
Whisper encoder features flow through a 2-layer MLP directly into Mistral's embedding space, with no temporal compression. The full ~500-frame audio sequence enters the LLM. This relies on growing context windows and enough training to absorb the rate mismatch without explicit compression machinery. What it loses: the LLM must learn the 10–50× rate mismatch from scratch. Long audio pushes context limits. Output voice is entirely the TTS system's — the LLM never touches audio on the output side.

SALMONN — Q-Former adapter, text output (Tsinghua / ByteDance, open)
A dual-encoder input (Whisper + BEATs) feeds 32 learned query tokens that cross-attend to the full encoder sequence via Q-Former (from BLIP-2). The 32-token bottleneck is fixed regardless of audio duration — a 1-second clip and a 30-second clip both produce 32 tokens. What it loses: fine-grained temporal detail. Tasks requiring precise word timing, speaker diarization, or sub-second event detection are challenging when the bottleneck discards the frame-level sequence. A 32-slot fixed budget can't represent structure it didn't learn to preserve.

VALL-E — discrete codec generation (TTS / voice cloning) (Microsoft)
VALL-E is a TTS model, not an audio understanding system. Its primary input is text, plus a 3-second audio prompt for speaker conditioning. It represents the discrete codec output path: the LLM autoregressively generates EnCodec CB₁ tokens conditioned on text and speaker prompt. A separate non-autoregressive model then fills CB₂–CB_N in parallel. Voice cloning is possible because CB₁ captures enough speaker identity to condition the acoustic fill-in. What it loses: the LLM generates CB₁ at full codec rate (~75 fps, so a 5-second utterance is 375 tokens), and latency is dominated by the AR pass. CB₁ is also not a clean semantic representation: the RVQ residual factorization doesn't cleanly separate content from timbre.

GPT-4o — inferred native or near-native (OpenAI, closed)
Architecture unpublished. The evidence: end-to-end response latency as low as 232ms (average ~320ms per the GPT-4o System Card; this is time-to-first-audio-token, not full-response latency), demonstrated ability to mirror emotion, accent, and paralinguistic cues, and the ability to interrupt and be interrupted mid-generation. Taken together, these are hard to reconcile with a cascade architecture, where a text bottleneck discards paralinguistic information before TTS can regenerate it. What remains unknown: the tokenization scheme, codec frame rate, training data scale, and whether audio input and output share the same token space.

Full system table: eight systems mapped to input and output patterns

System	Input	Output	Key choice
Ultravox (Fixie AI, open)	Continuous: Whisper encoder + 2-layer MLP	Text output (separate TTS for audio)	Linear adapter: no temporal compression. Relies on growing context windows to absorb the full 500-frame audio sequence.
Qwen-Audio (Alibaba, open)	Continuous: strided/pooled adapter	Text output (no audio generation path)	Adjacent-frame stacking before projection; manageable sequence length at fixed compression ratio.
SALMONN (Tsinghua / ByteDance, open)	Continuous: Q-Former (32 learned queries)	Text output (separate TTS for audio)	32-query bottleneck collapses the full encoder sequence regardless of audio length. Q-Former mechanism from BLIP-2; Flamingo used a Perceiver Resampler instead.
SpeechVerse (Amazon)	Continuous: 1D conv adapter (strided)	Cascade	1D convolutional adapter with LoRA fine-tuning of the LLM. Fixed-rate compression; no CTC-based segmentation.
VALL-E (Microsoft)	Text + 3s audio prompt (CB₁ speaker conditioning)	Hybrid: LLM generates CB₁ AR, separate non-AR model fills CB₂–CB_N	Hierarchical multi-codebook generation. Voice cloning from 3 seconds of reference audio.
MusicGen (Meta, open; generation only)	Text conditioning	Native: single AR pass with delay patterns across codebooks	Delay-pattern serialization: one AR pass generates all codebook levels with time offsets. No separate non-AR stage.
SoundStorm (Google DeepMind; generation only)	Semantic token conditioning	Native: masked parallel decoding	K iterative unmasking passes instead of \(T \times N\) sequential steps. Non-autoregressive; fast inference.
GPT-4o (OpenAI, closed)	Inferred: native or continuous	Inferred: native	Architecture unpublished. Response latency (as low as 232ms, average ~320ms per the GPT-4o System Card) and demonstrated paralinguistic preservation suggest native or near-native integration.

MusicGen and SoundStorm are generation-only systems; their input column reflects conditioning input, not an audio comprehension path. GPT-4o latency from the GPT-4o System Card (OpenAI, 2024).

Conclusion

What Remains Open

Audio poses two fundamental problems for LLMs: content is implicit (buried under speaker identity, acoustics, prosody), and tokenization must be invented from scratch. Audio has no pre-existing symbolic interface to approximate, only a continuous signal that must be given structure by the system itself. The classical HMM pipeline solved the variable-length alignment problem with structure, not scale. Modern systems replaced that structure with learned expressiveness, gaining vastly better results but losing the principled alignment mechanism.

Three questions remain genuinely open:

Can a single unified architecture handle both directions, and do so with explicit structure rather than scale alone, the way HMMs handled alignment? SpiritLM (interleaved text/HuBERT tokens) and Moshi (Kyutai, 2024) are published steps toward this. Both are genuine advances: SpiritLM demonstrates cross-modal continuation in a single vocabulary; Moshi demonstrates real-time full-duplex speech-to-speech with an inner monologue stream. The field is moving quickly, and fidelity, latency, and voice control are all under active work in both industry and academic labs.

Will the field converge on discrete codecs, continuous latent generation, or hybrids that blur the boundary? Recent work on flow matching and diffusion over token spaces suggests the hard discrete/continuous line is already softening. What counts as a "token" may become a more flexible concept than current codec-based systems imply.

Can we close the rate mismatch? Ultra-low frame rate codecs (5–12.5 Hz) are pushing audio token rates closer to text. If audio can be represented at near-text rates without quality loss, the entire adapter problem simplifies dramatically, and the LLM sees much less of the many-to-one frame-to-phoneme mapping.
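
The arithmetic behind the rate question is simple enough to tabulate. The sketch below uses the post's rough figure of ~4 text tokens per second; the 25 Hz / 4-level and 75 Hz rows match numbers used earlier, while the single-level 12.5 Hz and 5 Hz rows are illustrative assumptions, not descriptions of specific codecs.

```python
TEXT_TOKENS_PER_SECOND = 4  # rough figure used in this post for equivalent linguistic content

def audio_tokens_per_second(frame_rate_hz, codebook_levels=1):
    """Tokens the sequence must absorb for one second of audio."""
    return frame_rate_hz * codebook_levels

configs = [
    ("75 Hz, CB1 only", 75, 1),
    ("25 Hz, 4 levels", 25, 4),
    ("12.5 Hz, 1 level (assumed)", 12.5, 1),
    ("5 Hz, 1 level (assumed)", 5, 1),
]

for name, rate, levels in configs:
    tokens = audio_tokens_per_second(rate, levels)
    ratio = tokens / TEXT_TOKENS_PER_SECOND
    print(f"{name}: {tokens:g} tokens/s, {ratio:.2f}x the text rate")
```

Under these assumptions a single-level 5 Hz representation would sit close to the text rate, which is the regime in which the adapter problem largely dissolves. Whether quality survives at that rate is the open part.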

Some of these questions may already have answers. The underlying structure — the mismatch between dense continuous audio and the token-trained LLM regime, the tension between structural alignment machinery and learned expressiveness — will remain the right framework for understanding these systems regardless of which answers arrive first.

Audio-language models are systems for forcing a continuous, overcomplete, rate-mismatched signal into the discrete token-and-context regime that language models operate in, then optionally inverting that transformation on the way out. Each architectural choice in this post is a specific answer to one part of that forcing problem.

An audio-language model is not defined by its architecture. It is defined by how it answers four questions: what represents the sound, how much of it survives to the LLM, how acoustic structure aligns to language, and what form the output takes.

Glossary

Key terms used in this post.

HMM Hidden Markov Model. A probabilistic model where phoneme states are hidden variables and acoustic frames are observations. Alignment solved via self-loops and the forward algorithm; best path via Viterbi decoding.

CTC Connectionist Temporal Classification. A loss function and alignment mechanism that marginalizes over all valid frame-to-label paths. Inherits HMM monotonicity; requires no explicit state topology.

MFCCs Mel-Frequency Cepstral Coefficients. Hand-designed audio features: FFT → mel-scale filterbank → log → DCT. 13–39 numbers per frame. Standard input to classical ASR systems from ~1985–2015.

Mel spectrogram A time-frequency representation of audio where frequency is on a mel scale (approximately perceptually uniform). Computed as short-time FFT → mel filterbank, usually followed by a log. Input to many modern audio encoders such as Whisper; HuBERT, by contrast, consumes the raw waveform through a convolutional front end.

BPE Byte Pair Encoding. Subword tokenization for text: iteratively merges the most frequent adjacent character pairs. Text LLMs use BPE vocabularies of 32k–100k tokens. Tokens are human-readable because text was already symbolic.

RVQ Residual Vector Quantization. Multi-stage codebook compression: CB₁ quantizes the encoder output, CB₂ quantizes the residual from CB₁, and so on. Each frame gets one integer index per codebook level. Used by EnCodec, SoundStream, DAC.

Codebook A learned lookup table mapping integer indices to embedding vectors. In RVQ codecs, each codebook level has a fixed number of entries (typically 1024). The "code" is the index; the "book" is the set of embedding vectors it indexes.

EnCodec Meta's neural audio codec. Convolutional encoder → RVQ bottleneck → convolutional decoder. Operates at 50–75 fps with 4–8 codebook levels. Widely used as a speech tokenizer for audio-LLM integration.

Q-Former / Perceiver An adapter module with a fixed set of learned query vectors that cross-attend to all encoder frames. Output length equals the number of queries (e.g. 32), regardless of input length. Q-Former was introduced in BLIP-2 (Li et al., 2023); Flamingo used the related but distinct Perceiver Resampler with gated cross-attention.

Viterbi algorithm Dynamic programming algorithm that finds the single most probable state sequence in an HMM. Produces an exact frame-to-phoneme alignment as a byproduct — inspectable and deterministic under the model's assumptions.

HuBERT Hidden-Unit BERT. A self-supervised speech encoder pre-trained to predict offline k-means cluster assignments. Produces continuous encoder features that can be discretized (k-means) or projected directly into LLM space.

Straight-through estimator A gradient approximation for discrete operations: on the forward pass, use the quantized value; on the backward pass, pass gradients through as if no quantization occurred. Makes codebook learning end-to-end trainable.
