How Audio Codecs Turn Sound Into Tokens
Traditional spoken dialogue systems rely on a cascaded architecture. This setup chains multiple independent components together: Voice Activity Detection (VAD) identifies when a user speaks, Speech-to-Text (STT) transcribes the audio waveform into text, a text-based Large Language Model (LLM) generates a textual response, and Text-to-Speech (TTS) synthesizes that output back into an audio signal.
This chain causes two big headaches. First, stacking independent components creates a noticeable, awkward lag, because each stage has to wait for the previous one to finish. Second, converting a human voice into plain text strips away the soul of the conversation: you lose the emotional nuance, the sarcasm, the sighs, and the background environment.
To overcome these limitations, modern architectures integrate audio processing directly into the transformer trunk. By treating brief snippets of audio waveforms as discrete tokens, a single model can process text and sound simultaneously.
The Audio Tokenization Pipeline: Quantizing the Waveform
Text is easily tokenized using deterministic algorithms. Audio, by contrast, is a continuous signal whose amplitude and frequency content vary over time. To process it, we must convert it into discrete integer tokens that fit a model's vocabulary.
This transformation is handled by a neural audio codec, such as Mimi or EnCodec. The codec contains three core components:
The Encoder: This network takes the raw audio waveform and compresses it into a much lower-rate sequence of continuous vectors, each representing a small time frame (such as an 80ms interval).
The Quantizer: This layer maps each continuous vector to its closest entry in a discrete codebook and outputs that entry's integer index, which becomes the token.
The Decoder: During generation, the decoder takes the discrete tokens and reconstructs the waveform back into continuous audio.
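To get a feel for how much the encoder compresses time, here is a small arithmetic sketch. The 80ms frame comes from the example above; the 24 kHz sample rate and the 10-second clip are assumptions chosen for illustration, not values stated for any particular codec.

```python
# Assumed values for illustration only.
SAMPLE_RATE_HZ = 24_000   # assumed input sample rate
FRAME_MS = 80             # frame length used as the example in the text


def samples_per_frame(sample_rate_hz: int, frame_ms: int) -> int:
    """Number of raw waveform samples summarized by one encoder frame."""
    return sample_rate_hz * frame_ms // 1000


def frames_for_duration(duration_s: float, frame_ms: int) -> int:
    """Number of whole frames produced for a clip of the given length."""
    return int(duration_s * 1000) // frame_ms


hop = samples_per_frame(SAMPLE_RATE_HZ, FRAME_MS)
frame_rate_hz = 1000 / FRAME_MS
frames = frames_for_duration(10.0, FRAME_MS)

print(f"{hop} samples per frame")
print(f"{frame_rate_hz} frames per second")
print(f"{frames} frames for a 10 second clip")
```

Under these assumptions, 240,000 raw samples shrink to 125 frames, which is a sequence length a transformer can comfortably handle.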
The critical step in this process is the quantizer, which turns continuous values into discrete indices.
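A minimal sketch of that step, assuming a toy two-dimensional codebook with four hand-picked entries. Real codebooks are learned during training and have many more dimensions and entries; the names `nearest_index` and `CODEBOOK` are made up for this example.

```python
import math


def nearest_index(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))


# Toy codebook (assumption): four entries in two dimensions.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Three made-up encoder outputs, one per time frame.
frames = [(0.1, 0.2), (0.9, 0.1), (0.8, 0.7)]

# Quantize: each continuous vector becomes one integer token.
tokens = [nearest_index(frame, CODEBOOK) for frame in frames]

# Dequantize: the decoder side only ever sees the codebook entries.
reconstructed = [CODEBOOK[token] for token in tokens]

print(tokens)
print(reconstructed)
```

The integers in `tokens` are what the language model sees. The gap between `frames` and `reconstructed` is the quantization error, and that gap is what makes a single small codebook too coarse for speech.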
Residual Vector Quantization (RVQ): The Portrait Analogy
Simple vector quantization maps a continuous vector to its nearest neighbor in a single, massive dictionary of sounds (a codebook). To capture the full richness of human speech at useful fidelity, that single dictionary would need far more entries than anyone could realistically store or search, which makes it computationally impractical.
Codecs bypass this hurdle using something called Residual Vector Quantization (RVQ). Instead of relying on one impossible dictionary, RVQ stacks a bunch of small, simple ones.
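The payoff of stacking can be checked with some quick arithmetic. The numbers below (8 codebooks of 1024 entries each) are assumptions picked for illustration, not the configuration of a specific codec.

```python
import math

# Assumed configuration for illustration only.
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 1024

# Vectors the codec actually has to store and search through.
stored_vectors = NUM_CODEBOOKS * CODEBOOK_SIZE

# Distinct combinations of indices the stack can express for one frame.
distinct_codes = CODEBOOK_SIZE ** NUM_CODEBOOKS

# Bits needed to write down one frame's list of indices.
bits_per_frame = NUM_CODEBOOKS * math.log2(CODEBOOK_SIZE)

print(f"stored vectors:  {stored_vectors}")
print(f"distinct codes:  {distinct_codes:.3e}")
print(f"bits per frame:  {bits_per_frame}")
```

With these assumed sizes, about eight thousand stored vectors can express as many distinct codes as one flat codebook with roughly 10^24 entries.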
Think of it like a digital artist painting a portrait in successive passes, where each pass adds only what the previous ones missed:
Pass 1 (The Rough Sketch): The artist sketches the broad silhouette of the face. This gets the basic shape right, but it lacks any detail.
Pass 2 (The Features): The artist looks at what is missing from the sketch (the remaining error, or "residual") and paints the eyes, nose, and mouth.
Pass 3 (The Textures): The artist targets the remaining imperfections, adding the fine skin textures, individual hairs, and light reflections.
By stacking these small, simple passes, the artist creates a highly detailed portrait without needing a single, impossible brush stroke that does everything at once.
RVQ does the same thing with audio. The first layer captures the coarsest structure of the sound, the second layer quantizes what the first one missed (the residual), and subsequent layers refine the subtle textures and background noise. The final representation of a single audio frame is simply a short list of indices, one per codebook. This approach lets the codec represent complex audio with high precision while keeping each codebook small.
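The layered process can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: the three codebooks are hand-picked rather than learned, vectors are two-dimensional, and the function names `rvq_encode` and `rvq_decode` are invented for this example.

```python
import math


def nearest_index(vector, codebook):
    """Index of the codebook entry closest to `vector` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize `vector` layer by layer, passing the leftover error down the stack."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        indices.append(idx)
        # Subtract what this layer explained; the next layer sees only the rest.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Rebuild the vector by summing the chosen entry from each layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hand-picked toy codebooks (assumption), ordered from coarse to fine.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    [(-0.25, -0.25), (-0.25, 0.25), (0.25, -0.25), (0.25, 0.25)],
    [(-0.05, -0.05), (-0.05, 0.05), (0.05, -0.05), (0.05, 0.05)],
]

frame = (0.82, 0.31)  # one made-up encoder output
indices = rvq_encode(frame, CODEBOOKS)

# Reconstruction error when decoding with only the first k layers.
errors = [
    math.dist(frame, rvq_decode(indices[:k], CODEBOOKS[:k]))
    for k in range(1, len(CODEBOOKS) + 1)
]

print(indices)
print([round(e, 3) for e in errors])
```

Each extra layer shrinks the reconstruction error, which is the "rough sketch, then features, then textures" progression in numeric form. The list in `indices` is the full token representation of the frame.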
Reconciling Semantic and Acoustic Tokens
Audio signals contain two distinct layers of information. There is linguistic meaning (what words were said), and there is acoustic style (how the words were said, the speaker's voice, and the background noise).
Semantic Tokens: These capture phonetic structures and linguistic data, but they lack the fine details needed to reconstruct high-fidelity audio.
Acoustic Tokens: These are produced by the neural audio codec's quantizers. They excel at capturing rich textures, pitches, and environmental sounds.
To achieve fluid, natural conversations, modern systems combine these streams. Some architectures model parallel streams of tokens, allowing the transformer to process semantic information alongside style parameters. For example, at each time step the model can predict a text token as a prefix to the acoustic codec tokens. This "inner monologue" layout guides the acoustic generation, improving linguistic accuracy while still letting the model produce speech as a continuous stream.
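One way to picture the text-as-prefix layout is as a per-frame list of token ids. The sketch below is a hypothetical layout, not the exact scheme of any specific model: the vocabulary sizes, the padding id, and the helpers `audio_token_id` and `build_step` are all assumptions made for illustration.

```python
# Assumed sizes and ids, for illustration only.
TEXT_VOCAB_SIZE = 32_000
CODEBOOK_SIZE = 1024
NUM_CODEBOOKS = 3     # kept small for readability
TEXT_PAD = 0          # assumed id for frames that carry no new text


def audio_token_id(layer: int, index: int) -> int:
    """Map (codebook layer, index) to a unique id placed after the text vocabulary."""
    return TEXT_VOCAB_SIZE + layer * CODEBOOK_SIZE + index


def build_step(text_token: int, codes: list) -> list:
    """One time step: the text token first (the prefix), then one id per codebook."""
    if len(codes) != NUM_CODEBOOKS:
        raise ValueError("expected one index per codebook")
    return [text_token] + [audio_token_id(layer, c) for layer, c in enumerate(codes)]


# Two made-up frames: the first carries a text token, the second only padding.
text_tokens = [1201, TEXT_PAD]
codec_codes = [[5, 17, 900], [6, 17, 12]]

steps = [build_step(t, c) for t, c in zip(text_tokens, codec_codes)]
flat_sequence = [token for step in steps for token in step]

for step in steps:
    print(step)
```

Offsetting each codebook into its own id range keeps text ids and audio ids from colliding in a shared vocabulary. Because the text token comes first in every step, the audio tokens that follow can be conditioned on it.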
Full-Duplex Spoken Dialogue
Because audio is processed directly as tokens, models can support full-duplex communication. A multi-stream architecture can process the user's incoming audio tokens and the model's outgoing audio tokens in parallel.
This multi-stream layout enables natural conversational behaviors:
Interruptions: The model can immediately detect when a user starts speaking over its output and stop generating its own tokens.
Background Processing: The transformer can analyze non-linguistic cues, like laughter or deep breaths, without waiting for a text transcript.
Low Latency: Removing the speech-to-text and text-to-speech stages allows the system to reach an interactive loop latency of under 200ms, which is in the range of the gaps between turns in natural human conversation.
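The interruption behavior can be illustrated with a toy frame-by-frame loop over two parallel streams. This is a sketch under loud assumptions: in a real system the transformer learns when to yield rather than following a hand-written rule, and the `SILENCE` token id, the `patience` threshold, and the `run_duplex` helper are all invented for this example.

```python
from dataclasses import dataclass

SILENCE = 0  # assumed token id meaning "no speech in this frame"


@dataclass
class Frame:
    user_token: int    # what the model hears in this frame
    model_token: int   # what the model says in the same frame


def run_duplex(user_stream, planned_reply, patience=2):
    """Step both streams together; stop talking once the user has spoken
    for `patience` consecutive frames (a hand-written stand-in for learned behavior)."""
    frames = []
    user_active = 0
    yielded = False
    for t, user_token in enumerate(user_stream):
        user_active = user_active + 1 if user_token != SILENCE else 0
        if user_active >= patience:
            yielded = True
        if yielded or t >= len(planned_reply):
            model_token = SILENCE
        else:
            model_token = planned_reply[t]
        frames.append(Frame(user_token, model_token))
    return frames


# The user starts talking over the model at frame 3.
user_stream = [0, 0, 0, 7, 8, 9, 0, 0]
planned_reply = [11, 12, 13, 14, 15, 16, 17, 18]

frames = run_duplex(user_stream, planned_reply)
model_tokens = [f.model_token for f in frames]
print(model_tokens)
```

Because both streams advance one frame at a time, the model reacts within a couple of frames instead of waiting for a full transcript. With 80ms frames, each frame of `patience` adds 80ms to that reaction time.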
By ditching the middleman and treating sound exactly like words, voice AI turns into a single, fluid guessing game. The result? A conversational loop that feels as fast and natural as talking to a real person.
