Scaling LLM Throughput via Speculative Decoding

Inference speed for large language models is rarely limited by raw computing power. A major bottleneck is memory bandwidth. During the token generation phase, the GPU must stream the model's entire set of weights from high-bandwidth memory into on-chip memory just to predict a single token. A 70-billion-parameter model running in 16-bit precision therefore transfers roughly 140 gigabytes of weights per token (70 billion parameters at 2 bytes each), leaving the processor idle while it waits on memory.
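The arithmetic behind that figure can be sketched in a few lines. This is a rough lower bound only: it assumes a single request, counts weight traffic alone (no KV cache or activations), and uses a hypothetical 2 TB/s memory bandwidth that is not taken from any particular GPU.

```python
def weight_bytes(num_params, bytes_per_param):
    """Bytes of weights that must be read once per generated token."""
    return num_params * bytes_per_param


def min_seconds_per_token(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Latency floor if weight reads were the only cost (an assumption)."""
    return weight_bytes(num_params, bytes_per_param) / bandwidth_bytes_per_s


params = 70e9       # 70-billion-parameter model
fp16_bytes = 2      # 16-bit precision
bandwidth = 2e12    # hypothetical 2 TB/s of memory bandwidth

gb_per_token = weight_bytes(params, fp16_bytes) / 1e9
latency_floor = min_seconds_per_token(params, fp16_bytes, bandwidth)

print(f"{gb_per_token:.0f} GB of weights read per token")
print(f"{latency_floor * 1e3:.0f} ms per token at best, "
      f"about {1 / latency_floor:.1f} tokens per second")
```

Under these assumptions the ceiling is set entirely by how fast the weights can be read, which is why producing more than one token per weight read is attractive.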

Speculative decoding alters this dynamic by verifying several tokens in parallel. A significantly smaller draft model proposes a short sequence of candidate tokens, and the larger target model then checks all of them in a single forward pass.

The Core Speculative Framework

The process pairs a small, computationally inexpensive draft model with a large, high-capacity target model. Crucially, both models must share the exact same tokenizer vocabulary so that their probability distributions can be compared token by token.

The execution sequence operates through a loop of speculation and verification. First, the draft model generates a sequence of speculative tokens using standard sequential generation. Because the draft model contains significantly fewer parameters, it executes these steps rapidly, moving a fraction of the data across the memory bus.

Next, the target model receives the original prompt plus the speculated tokens. It runs a single parallel forward pass across this entire block. This single pass takes roughly the same amount of time as generating a single token normally, because the weights are read from memory once and applied to every position in the block at the same time. Finally, the system compares the probability distributions output by both models to determine how many of the guessed tokens are statistically acceptable.
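The data flow of one speculation round can be sketched with toy stand-ins for the two models. Everything here is hypothetical: `toy_distribution`, `draft_propose`, and `target_score` are illustrative names, the "models" are just seeded random distributions, and the loop inside `target_score` only emulates what a real transformer would compute in one forward pass.

```python
import math
import random

VOCAB_SIZE = 5  # toy vocabulary shared by both models


def toy_distribution(context, model_id):
    """Hypothetical stand-in for a model: a fixed next-token distribution per context."""
    seed = model_id * 1_000_003 + sum((i + 1) * (tok + 1) for i, tok in enumerate(context))
    rng = random.Random(seed)
    weights = [math.exp(rng.gauss(0.0, 1.0)) for _ in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]


def draft_propose(context, k, rng):
    """Draft phase: k sequential steps, each sampling one token from q."""
    context = list(context)
    tokens, q_dists = [], []
    for _ in range(k):
        q = toy_distribution(context, model_id=1)
        tok = rng.choices(range(VOCAB_SIZE), weights=q)[0]
        tokens.append(tok)
        q_dists.append(q)
        context.append(tok)
    return tokens, q_dists


def target_score(context, proposed):
    """Target phase: one distribution p per drafted position.

    A real transformer yields all of these from a single forward pass;
    this loop only reproduces that result for the toy model.
    """
    context = list(context)
    p_dists = []
    for tok in proposed:
        p_dists.append(toy_distribution(context, model_id=2))
        context.append(tok)
    return p_dists


rng = random.Random(0)
prompt = [1, 2, 3]
proposed, q_dists = draft_propose(prompt, k=4, rng=rng)
p_dists = target_score(prompt, proposed)
print("drafted tokens:", proposed)
print("distributions to compare:", len(q_dists), "draft vs", len(p_dists), "target")
```

The point of the sketch is the shape of the exchange: the draft model pays for k small sequential steps, and the target model returns one distribution per drafted position for the comparison step.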

Mathematical Verification and Sample Preservation

To ensure that speculative decoding does not degrade the quality of the target model's output or alter its output distribution, systems use a modified rejection sampling algorithm.

Let p(x) be the probability the target model assigns to a token x, and q(x) be the probability the draft model assigns to it. If the draft model suggests a token x, the target model accepts it with a probability defined by dividing p(x) by q(x), capped at a maximum of 1, written as min(1, p(x) / q(x)).
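As a minimal sketch, the acceptance rule fits in two small functions. The names `acceptance_probability` and `accept` are illustrative, and the code assumes q(x) is greater than zero, which holds whenever the token was actually sampled from the draft distribution.

```python
import random


def acceptance_probability(p_x, q_x):
    """Probability of keeping a drafted token: p(x) / q(x), capped at 1."""
    return min(1.0, p_x / q_x)


def accept(p_x, q_x, rng):
    """Draw one accept/reject decision for a drafted token."""
    return rng.random() < acceptance_probability(p_x, q_x)


rng = random.Random(0)

# Draft was overconfident (q > p): the token is kept only part of the time.
print(acceptance_probability(0.2, 0.4))

# Target likes the token at least as much as the draft did: always kept.
print(acceptance_probability(0.5, 0.2))
print(accept(0.5, 0.2, rng))
```

A token the target model rates at least as highly as the draft model did is never rejected; only the draft model's overconfident guesses are at risk.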

If the token is accepted, the system moves to evaluate the next speculated token in the sequence. If the target model rejects a token at a specific position, the generation loop halts for that block. The system discards any speculated tokens after the rejection point, and the target model samples a fresh, corrected token from an adjusted distribution.

This adjusted distribution is calculated by taking the excess probability mass wherever the target model assigns a token more probability than the draft model did, written as max(0, p(x) - q(x)), and renormalizing it so that it sums to 1 over the vocabulary. This correction guarantees that the final output matches the exact probability distribution of the target model running completely on its own.
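The accept, reject, and resample steps can be put together in a short sketch, followed by an empirical check on a three-token toy vocabulary. The function names and the example distributions are assumptions made for illustration. The sketch also leaves out one detail of the usual formulation: when every drafted token is accepted, the target model typically samples one additional token from its own distribution.

```python
import random


def residual_distribution(p, q):
    """Normalize max(0, p(x) - q(x)) so it sums to 1 over the vocabulary."""
    excess = [max(0.0, p_x - q_x) for p_x, q_x in zip(p, q)]
    total = sum(excess)
    if total == 0.0:
        # p and q are identical, so a rejection cannot occur; fall back to p.
        return list(p)
    return [e / total for e in excess]


def verify_block(proposed, q_dists, p_dists, rng):
    """Keep drafted tokens until the first rejection, then emit one corrected token."""
    output = []
    for tok, q, p in zip(proposed, q_dists, p_dists):
        if rng.random() < min(1.0, p[tok] / q[tok]):
            output.append(tok)
            continue
        corrected = residual_distribution(p, q)
        output.append(rng.choices(range(len(p)), weights=corrected)[0])
        return output  # everything drafted after the rejection is discarded
    return output


# Empirical check with a single drafted position: outputs should follow p, not q.
p = [0.5, 0.3, 0.2]  # toy target distribution
q = [0.2, 0.3, 0.5]  # toy draft distribution
rng = random.Random(0)
trials = 100_000
counts = [0, 0, 0]
for _ in range(trials):
    drafted = rng.choices(range(3), weights=q)[0]
    counts[verify_block([drafted], [q], [p], rng)[0]] += 1
empirical = [c / trials for c in counts]
print("target p :", p)
print("empirical:", [round(e, 3) for e in empirical])
```

In this toy case all the excess mass sits on the first token, so every rejection is corrected to that token, and the observed frequencies should land close to p even though every proposal came from q.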

Performance Realities and Hardware Constraints

The actual speedup achieved by speculative decoding is highly variable and depends on the alignment between the draft and target models. If the draft model is too simple, the target model will frequently reject its guesses, causing the system to waste compute cycles on rejected tokens and fall back to, or even below, baseline speeds. Conversely, if the draft model is too large, the overhead of running it negates the time saved by the parallel target pass.

In favorable conditions, where roughly 70% to 80% of the draft model's tokens are accepted, speculative decoding can deliver a 2x to 3x increase in token generation speed.
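A simplified cost model shows how acceptance rate, draft length, and draft cost interact. It rests on assumptions that real systems only approximate: each drafted token is accepted independently with the same probability `alpha`, one draft step costs a fixed fraction `cost_ratio` of a target pass, and every target pass contributes one token of its own (the corrected token after a rejection, or an extra token when the whole block is accepted). The parameter names and the example values are illustrative, not measurements.

```python
def expected_tokens_per_round(alpha, k):
    """Expected tokens produced per target pass: 1 + alpha + ... + alpha**k."""
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, k, cost_ratio):
    """Tokens per round divided by the round's cost in units of one target pass."""
    round_cost = k * cost_ratio + 1.0  # k draft steps plus one target pass
    return expected_tokens_per_round(alpha, k) / round_cost


# Hypothetical setting: 4 drafted tokens, draft step at 5% of a target pass.
for alpha in (0.5, 0.7, 0.8):
    tokens = expected_tokens_per_round(alpha, k=4)
    speedup = estimated_speedup(alpha, k=4, cost_ratio=0.05)
    print(f"alpha={alpha:.1f}  tokens/round={tokens:.2f}  speedup={speedup:.2f}x")

# A draft model that is too expensive erodes the gain.
print(f"{estimated_speedup(0.8, k=4, cost_ratio=0.5):.2f}x with a costly draft model")
```

Under this model, acceptance rates of 70% to 80% with a cheap draft model fall in the 2x to 3x range, while a low acceptance rate or an expensive draft model pulls the estimate toward or below 1x.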

This approach shifts the optimization problem away from raw model compression. Instead of shrinking a single model and risking a loss in reasoning capabilities, engineers can deploy the full, uncompromised target model while leveraging a smaller, specialized draft network to manage the memory bandwidth tax.
