What Is Speculative Decoding? Definition, How It Works & Examples (2026)

Speculative decoding is an inference technique that accelerates autoregressive language model generation by letting a small draft model propose several tokens ahead and verifying them in parallel with a target model.

By Meo Advisors Editorial, Editorial Team
9 min read·Published Jun 2026

Video transcript

Have you ever wondered how we can make large language models generate text even faster? Speculative decoding is a clever technique that speeds up inference without losing any model quality. It uses two different models working together. First, a tiny draft model quickly guesses several tokens ahead, one after another. The draft model makes fast predictions. Then, the large target model checks those guesses all at once to ensure they are correct. The big model verifies every token. If the draft is right, we keep it. If not, the big model fixes the mistake. You get high quality output with a large boost in total speed. Read the full article below to master this powerful optimization strategy today.

Speculative decoding is an inference acceleration technique that speeds up autoregressive text generation from large language models (LLMs) without altering the model’s output distribution. It works by pairing a fast, lightweight draft model with the original, more powerful target model. The draft model generates a short sequence of candidate tokens one at a time, and the target model validates all of them in a single forward pass, accepting those that are consistent with its own distribution. This method can deliver 2× to 4× wall‑clock speedups in generation latency while sampling from exactly the same distribution as the target model alone (and producing identical tokens under greedy decoding).

As of 2026, speculative decoding has evolved from a research curiosity into a standard component of production LLM serving stacks, with implementations baked into frameworks like vLLM, TensorRT‑LLM, and Hugging Face TGI. Innovations such as self‑speculative decoding, dynamic draft‑length scheduling, and hardware‑aware tree‑based verification have pushed throughput gains toward the theoretical limits of memory bandwidth on modern GPU clusters.

What Exactly Is Speculative Decoding?

At its core, speculative decoding is a lossless acceleration algorithm for autoregressive sequence models. The autoregressive bottleneck (generating one token at a time) leaves GPU compute severely underutilized, especially during memory‑bound decoding, where each forward pass must read all model weights and the entire key‑value (KV) cache to produce a single token. Speculative decoding converts this sequential bottleneck into a speculate‑then‑verify process in which one pass of the target model can yield several tokens.
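To make the memory‑bound argument concrete, here is a rough back‑of‑the‑envelope sketch. It assumes a decoding step is limited purely by memory traffic; the function name and all hardware and model numbers are illustrative assumptions, not measurements from this article.

```python
def decode_step_lower_bound_ms(param_count: float,
                               bytes_per_param: float,
                               kv_cache_bytes: float,
                               bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decoding step if it is limited only by memory reads."""
    bytes_read = param_count * bytes_per_param + kv_cache_bytes
    return 1000.0 * bytes_read / bandwidth_bytes_per_s


# Assumed values for illustration: 70e9 parameters stored in 2 bytes each,
# a 2 GB KV cache, and 2 TB/s of memory bandwidth.
step_ms = decode_step_lower_bound_ms(70e9, 2, 2e9, 2e12)

# The same bytes are read whether the pass scores 1 position or several,
# so verifying a handful of drafted positions costs about one step.
print(f"at least {step_ms:.0f} ms per target-model pass")
```

Under these assumptions the per‑pass cost is dominated by reading the weights, which is why getting more than one token out of each pass matters so much.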

The target model (e.g., a 70B‑parameter Llama variant) retains full quality, while a much smaller draft model (e.g., an 8B‑parameter Llama model from the same family, or even a distilled variant of the target itself) proposes candidate tokens at a fraction of the cost. The target model then scores all candidates in one batched pass, accepting some and rejecting others. Crucially, under greedy decoding the accepted tokens are exactly what the target model would have generated step by step, and under sampling they follow the same probability distribution, so output fidelity is preserved. The only thing that changes is how many tokens are produced per forward pass of the expensive model.
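For the greedy case, the accept/reject decision reduces to a prefix match against the target’s own next‑token choice. The sketch below illustrates that idea; `greedy_verify` and `target_argmax` are hypothetical names, and the per‑position calls stand in for what a real system computes in one batched forward pass.

```python
from typing import Callable, List, Sequence


def greedy_verify(prefix: Sequence[int],
                  draft: Sequence[int],
                  target_argmax: Callable[[Sequence[int]], int]) -> List[int]:
    """Return the tokens emitted by one verify step under greedy decoding."""
    emitted: List[int] = []
    context = list(prefix)
    for tok in draft:
        expected = target_argmax(context)  # what the target would emit here
        if tok != expected:
            emitted.append(expected)       # correction replaces the first mismatch
            return emitted
        emitted.append(tok)
        context.append(tok)
    # Every draft token matched: the target's next choice comes for free.
    emitted.append(target_argmax(context))
    return emitted


# Toy stand-in for a target model: always predicts (last token + 1) mod 10.
def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] + 1) % 10


partial = greedy_verify([1, 2], [3, 4, 9], toy_target)  # third guess is wrong
full = greedy_verify([1, 2], [3, 4, 5], toy_target)     # all guesses are right
```

In both cases the emitted tokens are exactly the ones the toy target would have produced on its own; only the number of tokens per verify step differs.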

The technique is often described as "free lunch for LLM inference" because it exploits idle compute during memory‑bound decoding. Where traditional token‑by‑token generation might achieve 20‑30% GPU utilization, speculative decoding can push utilization above 80% by batching the target model’s verification step.

How Does Speculative Decoding Actually Work?

The algorithm operates in a continuous speculative‑verify loop:

  1. Drafting Phase (Speculation): From the current sequence, the draft model autoregressively generates γ (gamma) candidate tokens (typically 3–8) in a standard sequential fashion. Because the draft model is small, this step is fast—often 10–50× faster than a single target‑model step.

  2. Verification Phase: The target model processes the input prefix plus the γ drafted tokens in a single, batched forward pass. This produces γ + 1 target probability distributions: one for each drafted position, plus one for the position after the last drafted token. During memory‑bound decoding, modern GPUs execute this parallel verification at roughly the same cost as a single‑token forward pass, so the additional overhead is small compared to the speedup gained.

  3. Acceptance via Rejection Sampling: For each candidate token, in order, the algorithm compares the draft model’s probability against the target model’s probability using a modified rejection sampling scheme. A token is accepted with probability min(1, p_target / p_draft). At the first rejection, the remaining drafted tokens are discarded and a corrected token is sampled from a residual distribution, the normalized max(0, p_target − p_draft), which guarantees that the emitted token follows the target’s distribution. If all γ tokens are accepted, one extra token is sampled from the target’s final distribution. The process then starts a new speculative batch from the last emitted position.

  4. KV Cache Management: Accepted tokens’ KV cache entries are appended; rejected branches are discarded. Efficient implementations manage speculative KV cache trees where the draft model proposes multiple alternative completions (tree‑based speculation), allowing the verifier to select the best‑matching path.
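The four steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not a production implementation: models are represented by hypothetical callables (`draft_dist`, `target_dist`) that map a context to a token‑probability dictionary, the batched verification pass is simulated with a list comprehension, and KV cache handling is omitted.

```python
import random
from typing import Callable, Dict, List, Sequence

Dist = Dict[int, float]  # token id -> probability


def sample(dist: Dist, rng: random.Random) -> int:
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def residual(p: Dist, q: Dist) -> Dist:
    """Normalized max(0, p - q): the distribution used after a rejection."""
    raw = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    total = sum(raw.values())
    return {t: v / total for t, v in raw.items() if v > 0.0}


def speculative_step(prefix: Sequence[int],
                     draft_dist: Callable[[Sequence[int]], Dist],
                     target_dist: Callable[[Sequence[int]], Dist],
                     gamma: int,
                     rng: random.Random) -> List[int]:
    # 1. Drafting: gamma sequential samples from the small model.
    ctx = list(prefix)
    drafted: List[int] = []
    q_dists: List[Dist] = []
    for _ in range(gamma):
        q = draft_dist(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. Verification: gamma + 1 target distributions
    #    (a single batched forward pass in a real system).
    p_dists = [target_dist(list(prefix) + drafted[:i]) for i in range(gamma + 1)]

    # 3. Acceptance: keep tokens until the first rejection.
    out: List[int] = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)
        else:
            out.append(sample(residual(p, q), rng))  # corrected token
            return out

    # All drafts accepted: sample one bonus token from the target.
    out.append(sample(p_dists[gamma], rng))
    return out
```

Each call emits between 1 and γ + 1 tokens. Calling it in a loop, appending the result to the prefix each time, gives the full generation loop.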

This procedure is strictly lossless: the sequence of tokens produced is statistically indistinguishable from the target model generating token by token, as proven by the rejection sampling guarantee in the original paper by Leviathan et al. (2023) [1].
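The guarantee can be checked for a single position with a short calculation. The notation here is an assumption for illustration: p is the target distribution (p_target), q is the draft distribution (p_draft), and β is the overall probability that a drafted token is accepted.

```latex
\begin{aligned}
\beta &= \sum_{y} \min\bigl(p(y),\, q(y)\bigr), \qquad
\sum_{y} \max\bigl(0,\, p(y) - q(y)\bigr) = 1 - \beta, \\
\Pr[\text{emit } x]
  &= q(x)\,\min\!\Bigl(1,\, \tfrac{p(x)}{q(x)}\Bigr)
   + (1 - \beta)\,\frac{\max\bigl(0,\, p(x) - q(x)\bigr)}{1 - \beta} \\
  &= \min\bigl(p(x),\, q(x)\bigr) + \max\bigl(0,\, p(x) - q(x)\bigr) \\
  &= p(x).
\end{aligned}
```

The first term covers the case where x is drafted and accepted; the second covers a rejection (probability 1 − β) followed by sampling x from the residual distribution. Their sum is the target probability regardless of how good or bad the draft distribution is, which is why draft quality affects speed but not output.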

What Are the Key Types or Variants of Speculative Decoding?

Several distinct approaches have emerged since the technique’s introduction:

Draft‑Model‑Based Speculative Decoding

The classic two‑model setup where an external small model serves as the drafter. Common drafters include:

  • Smaller models from the same family (e.g., Llama 3.2 1B drafting for Llama 3.1 70B).
  • Distilled versions of the target model.
  • N‑gram or statistical models, such as the PLD (Prompt Lookup Decoding) method that simply copies matching n‑grams from the prompt or earlier generated text.
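As an illustration of the n‑gram option, the following sketch proposes draft tokens by finding the most recent earlier occurrence of the trailing n‑gram and copying what followed it. The function name and parameters (`prompt_lookup_draft`, `ngram_size`, `num_draft`) are hypothetical and chosen for this example only.

```python
from typing import List, Sequence


def prompt_lookup_draft(tokens: Sequence[int],
                        ngram_size: int = 3,
                        num_draft: int = 5) -> List[int]:
    """Copy the tokens that followed the latest earlier match of the trailing n-gram."""
    if len(tokens) < ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan candidate start positions from most recent to oldest,
    # skipping the position of the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(tokens[follow:follow + num_draft])
    return []  # no match: fall back to ordinary decoding for this step


history = [1, 2, 3, 4, 5, 6, 1, 2, 3]
draft = prompt_lookup_draft(history, ngram_size=3, num_draft=3)
```

The proposed tokens are then verified by the target model exactly like tokens from a neural drafter, so a bad guess costs time but never changes the output.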

Self‑Speculative Decoding

Introduced to eliminate the need for a separate draft model, this approach drafts with the target model itself, for example by exiting early from (or skipping) intermediate transformer layers during drafting and then verifying with the full model. A closely related design, Medusa (2023), attaches several lightweight prediction heads to the target model’s final hidden state, with each head predicting a token one position further ahead [2]. The resulting speedups are typically lower (1.5–2×) but require no additional model loading, reducing memory pressure.

Tree‑Based and Multi‑Candidate Speculation

Rather than a single greedy draft sequence, the draft model proposes a tree of possible continuations at each step. The target model verifies all paths simultaneously using a specially constructed attention mask. This increases the expected acceptance length per verification step. Frameworks like SpecInfer and Sequoia have demonstrated this technique, achieving up to 5× speedups on consumer GPUs.
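The "specially constructed attention mask" can be illustrated with a small sketch. It assumes the candidate tree is given as a list of parent indices, where -1 marks a node attached directly to the prefix and every parent index is smaller than its child's; the function name is hypothetical. Attention to the shared prefix is omitted, since every node can see all of it.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when draft node i may attend to draft node j,
    i.e. when j is node i itself or one of its ancestors in the tree."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:        # walk up to the root of this branch
            mask[i][j] = True
            j = parents[j]
    return mask


# Two candidates for the first position (nodes 0 and 1),
# two continuations of node 0 (nodes 2 and 3), one of node 1 (node 4).
mask = tree_attention_mask([-1, -1, 0, 0, 1])
```

Because each node only sees its own branch, all branches can be scored in one forward pass without contaminating one another.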

Draft‑Model‑Free Methods with Jacobi Iteration

Jacobi decoding (the basis of Lookahead Decoding) repurposes the target model itself as the drafter by feeding it guesses for multiple future token positions and iteratively refining them in parallel, in the style of Jacobi iterative methods for systems of equations. This entirely sidesteps the need for a separate draft model, but it spends extra compute per step and typically yields smaller speedups.
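A minimal sketch of the Jacobi idea under greedy decoding follows. The names (`jacobi_decode`, `next_token`, `pad`) are hypothetical, and the per‑position calls stand in for what a real model would compute for all positions in one forward pass.

```python
from typing import Callable, List, Sequence, Tuple


def jacobi_decode(prefix: Sequence[int],
                  next_token: Callable[[Sequence[int]], int],
                  n: int,
                  pad: int = 0) -> Tuple[List[int], int]:
    """Refine a guess for the next n tokens until it stops changing.
    Returns the tokens and the number of parallel iterations used."""
    guess = [pad] * n
    for iteration in range(1, n + 1):
        # Every position is recomputed from the previous iterate only,
        # so all n updates are independent of each other.
        new = [next_token(list(prefix) + guess[:i]) for i in range(n)]
        if new == guess:          # fixed point reached early
            return guess, iteration
        guess = new
    return guess, n               # after n iterations all n tokens are final


# Toy greedy "model": predicts (last token + 1) mod 10.
def toy_next(ctx: Sequence[int]) -> int:
    return (ctx[-1] + 1) % 10


tokens, iterations = jacobi_decode([1], toy_next, 4)
```

After k iterations the first k tokens are guaranteed to match ordinary greedy decoding, so the loop never needs more than n iterations; any speedup comes from cases where several positions settle in the same iteration.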

What Are Some Named Real‑World Examples and Implementations?

  • vLLM Speculative Decoding: vLLM v0.6 and later (2024–2026) include built‑in support for draft‑model speculative decoding via a pluggable speculative_model argument. It supports both n‑gram drafting and external draft models, with tree attention for multi‑candidate verification, and is widely deployed in production.
  • TensorRT‑LLM Speculative Decoding: NVIDIA’s inference framework offers optimized kernels for the Medusa head approach and draft‑model‑based speculation, with Medusa achieving up to 2.2× speedup on H100 GPUs for Llama 2 70B according to NVIDIA benchmarks.
  • Hugging Face Transformers: Assisted generation and prompt lookup decoding are available directly through the generate() API. Users can pass assistant_model to draft with any compatible model, or set prompt_lookup_num_tokens to draft by n‑gram lookup without a second model.
  • Google Cloud TPU v5e with Speculative Decoding: Google’s Cloud TPU serving stack integrates speculative decoding for its Gemma and Gemini‑Nano models, reporting 50–70% latency reduction in API serving using a shared embedding‑space draft model.
  • Apple MLX Speculative Decoding: Apple’s MLX framework for Apple Silicon includes mlx_lm.generate with speculative decoding support, achieving >3× tokens/second improvements for the Llama 3.1 8B model on M3 Max chips via a 124M‑parameter draft model.
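As one concrete example from the list above, the sketch below shows assisted generation with Hugging Face Transformers via the assistant_model argument of generate(). The wrapper function and its parameter names are hypothetical, the model identifiers are left to the caller, and the two models are assumed to share a tokenizer.

```python
def generate_with_assistant(target_name: str,
                            draft_name: str,
                            prompt: str,
                            max_new_tokens: int = 64) -> str:
    """Generate text from the target model, using a smaller model as drafter."""
    # Imported here so that defining the function does not require the library.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(target_name)
    target = AutoModelForCausalLM.from_pretrained(target_name)
    draft = AutoModelForCausalLM.from_pretrained(draft_name)

    inputs = tokenizer(prompt, return_tensors="pt")
    # assistant_model switches generate() to assisted (speculative) decoding.
    # Passing prompt_lookup_num_tokens=<int> instead of assistant_model
    # would draft by n-gram lookup and needs no second model.
    output_ids = target.generate(
        **inputs,
        assistant_model=draft,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call might look like `generate_with_assistant("<target-model-id>", "<draft-model-id>", "Explain KV caching.")`, with the placeholders replaced by a large and a small checkpoint from the same model family.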

What Are Practical Use Cases for Speculative Decoding?

  • Real‑Time Chatbots and AI Assistants: Reducing per‑token latency during generation directly improves user experience (time‑to‑first‑token is governed by prompt processing and is largely unaffected). Speculative decoding can cut end‑to‑end response times from 3–5 seconds to 1–2 seconds on consumer hardware.
  • Batch Processing and Offline Inference: In high‑throughput scenarios (e.g., document summarization, synthetic data generation), speculative decoding increases tokens‑per‑second across large batches, lowering cost per token.
  • On‑Device and Edge Inference: When running LLMs locally on laptops or smartphones, memory bandwidth is severely limited. Speculative decoding with an ultra‑small drafter (e.g., a 15‑million‑parameter model that shares the target’s tokenizer) can bring 7B‑model generation to acceptable interactive speeds without cloud offload.
  • Code Completion and Assisted Programming: Code LLMs like CodeLlama benefit from high acceptance rates on repetitive, predictable syntax. Speculative decoding can double the speed of inline suggestions in IDEs.
  • Speculative RAG Pipelines: In retrieval‑augmented generation, the drafted text can be verified against retrieved factual snippets, combining factual accuracy checks with generation acceleration—an emerging 2026 practice sometimes called speculative grounded decoding.

What Are the Benefits and Limitations of Speculative Decoding?

Benefits

| Benefit | Description |
|---|---|
| Lossless Output Quality | Rejection sampling ensures the output distribution matches the target model exactly, with no degradation. |
| Significant Latency Reduction | Typical wall‑clock speedups of 2–4×; up to 5× with tree‑based methods on memory‑bound workloads. |
| No Target Model Modification | Does not require retraining, fine‑tuning, or quantization of the large model. |
| Composability | Can be combined with quantization, pruning, and other inference optimizations (e.g., FlashAttention, continuous batching). |
| Flexible Deployments | Can use any available smaller model; dynamic draft‑length algorithms adapt to changing acceptance rates across prompts. |

Limitations and Trade‑offs

| Limitation | Description |
|---|---|
| Draft‑Model Memory Overhead | Loading a separate draft model consumes additional GPU memory, which is problematic for tightly provisioned deployments. |
| Dependence on Draft Quality | Speedup collapses if the draft and target model distributions diverge heavily (e.g., fine‑tuned target with a base‑model drafter). |
| Batched Verification Overhead | On compute‑bound large‑batch serving, the parallel verification step may not yield throughput improvements and can even degrade throughput. |
| Non‑Deterministic Speedups | Speedup varies by prompt, domain, and sampling parameters; achieving consistent acceleration requires careful infrastructure tuning. |
| KV Cache Complexity | Tree‑based and multi‑candidate methods introduce complex KV cache structures that challenge existing memory allocation and paging strategies. |

How Does Speculative Decoding Differ from Other Inference Acceleration Methods?

Speculative decoding (also called speculative sampling) occupies a unique position in the inference optimization landscape. Unlike quantization (e.g., GPTQ, AWQ) or pruning, it does not reduce model capacity or risk quality degradation. Unlike knowledge distillation, which trains a smaller student model to stand in for the original, speculative decoding is a runtime‑only technique that leaves the production model untouched.

Compared to multi‑query attention (MQA) or grouped‑query attention (GQA), which address the KV cache memory bottleneck at the architecture level, speculative decoding addresses the sequential dependency bottleneck directly. The two are complementary and commonly used together.

Medusa‑style head‑based speculation blurs the line between speculative decoding and architectural change: it adds trainable parameters to the target model but avoids loading a second full model. It sits midway between pure runtime speculation and model modification.

Frequently Asked Questions

Does speculative decoding change the text my model generates?

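No. The rejection sampling algorithm provably guarantees that the generated sequence is drawn from the exact same probability distribution as the target model’s standard autoregressive generation. With greedy decoding the tokens are identical; with sampling, individual runs may differ as they would between any two sampled runs, but the distribution is unchanged. You get the same quality, just faster.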

Can I use any small model as a draft model?

In principle, yes, but the draft and target models should share the same tokenizer and have reasonably aligned output distributions. The best draft models are small versions of the target model trained on similar data. Using a draft model with a different vocabulary or heavily divergent training can drop acceptance rates to near zero, eliminating speedups.
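A quick compatibility check before pairing two models is to compare their vocabularies. The sketch below assumes each vocabulary is available as a token‑to‑id dictionary (Hugging Face tokenizers, for instance, expose one through get_vocab()); the function name is hypothetical.

```python
from typing import Dict, Optional, Tuple


def vocab_mismatches(target_vocab: Dict[str, int],
                     draft_vocab: Dict[str, int]
                     ) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return tokens whose ids differ, or that exist in only one vocabulary.
    Values are (target_id, draft_id), with None for a missing token."""
    diffs: Dict[str, Tuple[Optional[int], Optional[int]]] = {}
    for tok in target_vocab.keys() | draft_vocab.keys():
        t_id, d_id = target_vocab.get(tok), draft_vocab.get(tok)
        if t_id != d_id:
            diffs[tok] = (t_id, d_id)
    return diffs


target_vocab = {"<s>": 0, "hello": 1, "world": 2}
draft_vocab = {"<s>": 0, "hello": 1, "world": 3, "extra": 4}
problems = vocab_mismatches(target_vocab, draft_vocab)
```

An empty result means token ids can be passed between the two models directly; any mismatch means drafted ids would be misread by the target.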

How much speedup can I realistically expect?

As of 2026, most production deployments report 2–4× wall‑clock speedups on single‑sequence, memory‑bound generation. Large‑batch offline inference may see smaller gains. The speedup depends on the draft‑to‑target size ratio, prompt domain, and hardware memory bandwidth. Overly aggressive speculation (very long draft sequences) can hurt performance.
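The dependence on acceptance rate and draft length can be estimated with the standard analysis from Leviathan et al. [1], which assumes each drafted token is accepted independently with probability α. In the sketch below, `alpha`, `gamma`, and `cost_ratio` (the cost of one draft step relative to one target step) are inputs you would have to measure yourself; the example values are assumptions.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass: (1 - a^(g+1)) / (1 - a)."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected wall-clock improvement over plain decoding.
    One speculative step costs gamma draft passes plus one target pass."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


# Assumed example: 80% acceptance, 4 drafted tokens, drafter 20x cheaper.
good_case = expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.05)

# Longer drafts are not always better when acceptance is mediocre.
short_draft = expected_speedup(alpha=0.6, gamma=3, cost_ratio=0.1)
long_draft = expected_speedup(alpha=0.6, gamma=10, cost_ratio=0.1)
```

With these assumed inputs the first case lands inside the 2–4× range, while the last two show the long draft performing worse than the short one, which is the "overly aggressive speculation" effect described above.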

Does speculative decoding work with batching and continuous batching?

Yes. Modern serving systems interleave speculative steps with continuous batching schedulers. For example, vLLM’s iteration‑level scheduler can decide whether to run a speculative or standard step based on the current batch composition and available memory.

What is the difference between speculative decoding and prompt lookup decoding?

Prompt lookup decoding is a special case of speculative decoding where the draft model is replaced by simple n‑gram matching from the existing prompt or context. It requires no additional model but works well only when input text contains repetitive patterns (e.g., code, documentation). For open‑ended creative text, it provides minimal speedups compared to a trained draft model.

Is speculative decoding useful for fine‑tuned or domain‑specific models?

Yes, but the draft model should ideally be fine‑tuned on the same domain. Using a generic draft model with a highly specialized target (e.g., a medical QA fine‑tune) can result in low acceptance rates. Techniques like dynamic draft length and draft‑model fine‑tuning on the target domain have partially addressed this in 2025–2026 production pipelines.
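Dynamic draft length can be as simple as a feedback rule on how many drafted tokens were accepted in the previous step. The following is one hypothetical heuristic, shown only to illustrate the idea; the function name, bounds, and step sizes are assumptions rather than any framework’s documented policy.

```python
def next_draft_length(current: int,
                      num_accepted: int,
                      min_len: int = 1,
                      max_len: int = 16,
                      grow: int = 2,
                      shrink: int = 1) -> int:
    """Lengthen the draft after a fully accepted step, shorten it otherwise."""
    if num_accepted >= current:
        return min(max_len, current + grow)
    return max(min_len, current - shrink)


# Simulated acceptance counts for five consecutive steps.
length = 4
history = []
for accepted in [4, 6, 3, 2, 0]:
    length = next_draft_length(length, accepted)
    history.append(length)
```

On text the drafter predicts well, the draft length climbs and each target pass yields more tokens; on a poorly matched domain it falls toward the minimum, limiting wasted draft work.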

Footnotes

  1. Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. Proceedings of the 40th International Conference on Machine Learning (ICML). arXiv:2211.17192

  2. Cai, T., Li, Y., Geng, Z., et al. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Proceedings of the 41st International Conference on Machine Learning (ICML). arXiv:2401.10774
