What is KV Cache? Definition, How It Works & Examples (2026)
A KV cache is a memory-optimization technique used in autoregressive transformer models to store precomputed Key (K) and Value (V) tensor projections from previous generation steps, eliminating redundant computation during text generation. By caching these intermediate representations rather than recomputing the entire sequence context for every new token, an LLM reduces the time complexity of generating each new token from quadratic to linear relative to sequence length, making real-time conversational AI and high-throughput serving practical.
What exactly is a KV cache?
In transformer-based large language models, the attention mechanism computes three projections for each token in a sequence: Query (Q), Key (K), and Value (V). During autoregressive generation, predicting token t requires attending to all previous tokens (1 through t-1). Without caching, the model would recompute the K and V projections for every preceding token at each step, so a single step costs O(n²) attention work for an n-token context. A KV cache removes this redundancy: at each generation step, Q, K, and V are computed only for the newest token, and its K and V are appended to a running cache in GPU memory. The attention operation then uses the single new Q against the full cached K and V history to produce the output. This reduces per-step compute to O(n) and makes generation latency primarily a memory-bandwidth problem rather than a compute-bound one.
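The equivalence described above can be checked numerically. The following is a minimal sketch, assuming a single attention head, a single layer, random projection matrices, and a fixed input sequence `X` (in a real model each step's input depends on the previously sampled token). All names here are illustrative and not taken from any library.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    """Attention output for one query vector q against all rows of K and V."""
    scores = K @ q / np.sqrt(q.shape[-1])    # shape (t,)
    return softmax(scores) @ V               # shape (d_k,)

def decode_without_cache(X, Wq, Wk, Wv):
    """Re-project K and V for the whole prefix at every step."""
    outputs = []
    for t in range(1, X.shape[0] + 1):
        K = X[:t] @ Wk                       # recomputed from scratch each step
        V = X[:t] @ Wv
        q = X[t - 1] @ Wq
        outputs.append(attend(q, K, V))
    return np.stack(outputs)

def decode_with_cache(X, Wq, Wk, Wv):
    """Project only the newest token and append its K and V to the cache."""
    K_cache = np.empty((0, Wk.shape[1]))
    V_cache = np.empty((0, Wv.shape[1]))
    outputs = []
    for x in X:
        q, k, v = x @ Wq, x @ Wk, x @ Wv     # one token's projections
        K_cache = np.vstack([K_cache, k])    # cache grows by one row
        V_cache = np.vstack([V_cache, v])
        outputs.append(attend(q, K_cache, V_cache))
    return np.stack(outputs)

rng = np.random.default_rng(0)
d_model, d_k, n_tokens = 16, 8, 6
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out_a = decode_without_cache(X, Wq, Wk, Wv)
out_b = decode_with_cache(X, Wq, Wk, Wv)
same = np.allclose(out_a, out_b)             # both paths give the same outputs
```

Both functions produce the same attention outputs; the cached version simply avoids redoing the K and V projections for tokens it has already seen.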
How does a KV cache work under the hood?
To understand the mechanism concretely, consider a single multi-head attention layer in a typical decoder-only transformer such as GPT-3 or LLaMA, with hidden dimension d, number of heads h, and per-head dimension d_k = d / h.
- Prefill Phase: When the user prompt of length L is processed, the model computes Q, K, and V projections for all L tokens in parallel, producing tensors of shape (batch_size, num_heads, L, d_k). The attention weights are computed as softmax(QK^T / √d_k) under a causal mask, and the layer output is those weights multiplied by V. Crucially, the K and V tensors for all L prompt tokens are saved into a cache structure (one cache per layer) as part of this forward pass.
- Decode Phase (Token Generation): For the first new token (position L+1), the model computes only its Q, K, and V vectors. The new K and V are concatenated to the existing cache, expanding its sequence-length dimension to L+1. The model then computes attention using the single new Q vector against the entire cached K and V, and the next token is sampled from the result. That token becomes the input to the following step, where its own K and V are computed and appended in turn. This loop repeats until a stop condition is met.
- Memory Layout: In high-performance implementations such as PagedAttention (at the heart of vLLM), often paired with FlashAttention-style kernels, the KV cache is not stored as a single contiguous tensor. Instead, it is partitioned into blocks or pages in GPU VRAM. vLLM’s PagedAttention, introduced in 2023 and widely adopted as of 2026, manages the cache in fixed-size blocks (e.g., 16 or 32 tokens per block), analogous to virtual memory paging in operating systems. This removes external fragmentation and limits waste to the last partially filled block of each sequence, enables memory sharing across sequences (e.g., for beam search or parallel sampling), and allows sequences of dramatically different lengths to coexist efficiently in a single batch.
- Precision and Quantization: The dominant trend as of 2026 is low-precision KV caching. Instead of storing K and V tensors in FP16 or BF16, serving frameworks increasingly adopt KV cache quantization, compressing keys and values to INT8, FP8, or even 4-bit formats. For example, serving frameworks such as vLLM and TensorRT-LLM offer FP8 KV caches to reduce memory pressure, and the HQQ (Half-Quadratic Quantization) method provides an efficient calibration-free path to 4-bit cache storage while largely preserving perplexity. Quantizing the KV cache is distinct from weight quantization and directly addresses the memory-capacity bottleneck for long-context inference.
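The block-based layout described under Memory Layout can be illustrated with a toy allocator. This is a minimal sketch, assuming a shared pool of fixed-size blocks and a per-sequence block table; the class and method names are hypothetical and this is not vLLM's implementation, which also handles block sharing, copy-on-write, and the actual tensor storage.

```python
class PagedKVAllocator:
    """Toy block allocator: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:           # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("KV block pool exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1], n % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):                            # 17 tokens need two 16-token blocks
    slot = alloc.append_token("a")
alloc.append_token("b")                        # a second sequence shares the pool
blocks_a = len(alloc.block_tables["a"])
```

Because blocks are allocated on demand, a sequence only ever wastes the unused tail of its last block, and sequences of very different lengths draw from the same pool.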
What are the key types or variants of KV cache management?
KV cache implementations differ primarily in their memory-allocation strategy and eviction policies, as long-context sequences can exhaust even high-memory GPUs:
| Variant | Strategy | Representative System | Primary Advantage |
|---|---|---|---|
| Contiguous Cache | Pre-allocates maximum-length tensors for each sequence; static memory footprint. | Hugging Face Transformers (static cache option) | Simplicity; easy to implement. |
| Paged / Block Cache | Allocates non-contiguous blocks in a virtual memory pool; dynamic growth. | vLLM (Kwon et al., 2023); TensorRT-LLM | Near-zero memory waste; flexible batching. |
| Eviction-Based Cache | Discards KV entries judged to be low utility (e.g., tokens outside a sliding window, or with low accumulated attention scores). | StreamingLLM; H2O (Heavy Hitter Oracle) | Enables theoretically infinite context on fixed memory. |
| Multi-Query / Grouped-Query Cache | Reduces cache size architecturally by sharing one K/V head across all Q heads (Multi-Query Attention) or one K/V head per group of Q heads (Grouped-Query Attention). | PaLM, Falcon 7B (MQA); Llama 2 70B, Llama 3, Mistral 7B (GQA with 8 KV heads) | Dramatically smaller cache footprint without major quality degradation. |
| Prefix-Aware Cache | Detects and reuses KV computation for repeated shared prefixes (e.g., system prompts, few-shot exemplars). | SGLang (RadixAttention); LLM serving with "prompt caching" APIs | Avoids redundant prefill time when many requests share a common prefix. |
Sliding Window Attention deserves special mention: used by Mistral 7B (4096-token window) and adopted widely, it enforces a hard limit where each token can attend only to the most recent W tokens, making the cache automatically bounded. When combined with a global attention sink (as in StreamingLLM), this can be softened to preserve a few initial tokens that anchor attention scores, preventing perplexity collapse.
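The retention rule can be sketched in a few lines. This is a minimal illustration, assuming the cache is indexed by token position and that eviction simply keeps the first few "sink" positions plus the most recent W positions. The function name and parameters are hypothetical, and the sketch only decides which entries survive; it does not model how positional encodings are handled for the retained entries.

```python
def positions_to_keep(seq_len, window, num_sinks=0):
    """Return the token positions retained in a bounded KV cache.

    Keeps the first `num_sinks` positions (attention sinks) plus the most
    recent `window` positions; everything in between is evicted.
    """
    recent_start = max(seq_len - window, 0)
    sinks = range(min(num_sinks, recent_start))   # never overlap the window
    return list(sinks) + list(range(recent_start, seq_len))

pure_window = positions_to_keep(seq_len=10, window=4)               # [6, 7, 8, 9]
with_sinks = positions_to_keep(seq_len=10, window=4, num_sinks=2)   # [0, 1, 6, 7, 8, 9]
short_seq = positions_to_keep(seq_len=3, window=4, num_sinks=2)     # [0, 1, 2]
```

In every case the cache holds at most `window + num_sinks` entries, which is what makes its size independent of how long generation runs.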
Which real-world systems and libraries implement KV caching?
Virtually every modern LLM inference engine incorporates an optimized KV cache. Prominent examples include:
- vLLM (UC Berkeley / Anyscale): Popularized PagedAttention and remains the reference implementation for high-throughput, memory-efficient KV caching. As of 2026, it supports prefix caching, chunked prefill, and optional FP8 KV cache quantization.
- NVIDIA TensorRT-LLM: An open-source library, built on NVIDIA's proprietary TensorRT runtime, that provides a highly optimized KV cache layer with support for in-flight batching, paged KV cache, and INT8/FP8 quantization on NVIDIA hardware. Its Graph-Runtime integration allows dynamic cache reallocation across requests.
- Hugging Face Transformers: The `generate()` function with `use_cache=True` (the default) maintains a contiguous cache per sequence (grown dynamically by default, with a pre-allocated static cache as an option). While not memory-optimal for production, it remains the de facto reference for research and fine-tuning.
- SGLang (Stanford / LMSys): Uses RadixAttention, which organizes the KV cache in a radix tree (a compressed trie) keyed on token sequences. This automatically deduplicates cache entries for shared prefixes across different requests, a major win for multi-turn conversations and API-based systems with large system prompts.
- llama.cpp: Implements a quantized KV cache in C/C++ for consumer hardware. As of 2026, its Q4_0 and Q8_0 cache formats let users run 128K-context models on a single consumer GPU (e.g., 24 GB RTX 4090) by compressing the cache aggressively, a feat impossible with FP16 storage.
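The prefix-sharing idea behind radix-tree caches can be sketched with a plain per-token trie. This is a simplified illustration, assuming token ids as keys and nested dictionaries as nodes. It is not SGLang's implementation: a real radix tree compresses runs of tokens into single edges, stores references to KV blocks, and evicts unused branches. The class and method names are hypothetical.

```python
class PrefixCache:
    """Toy token-level trie; each node stands for the cached KV of one token."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, caching the rest."""
        node, hits = self.root, 0
        for tok in tokens:
            if tok in node:
                hits += 1            # KV for this token can be reused
            else:
                node[tok] = {}       # first miss: everything after is new
            node = node[tok]
        return hits

cache = PrefixCache()
system_prompt = [1, 2, 3, 4]
first = cache.match_and_insert(system_prompt + [10, 11])    # nothing cached yet
second = cache.match_and_insert(system_prompt + [20])       # reuses the prompt
third = cache.match_and_insert(system_prompt + [10, 12])    # reuses prompt + 10
```

Only the tokens after the matched prefix need a prefill pass, which is why a long shared system prompt costs almost nothing after the first request.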
What are the practical use cases for a KV cache?
The KV cache is not optional in autoregressive LLM inference; it is a necessary condition for latency-sensitive and high-throughput applications:
- Real-Time Chatbots and Copilots: Without caching, the work to generate token n grows roughly as n², making conversations with more than a few hundred tokens perceptibly sluggish. With KV caching, per-token work grows only linearly with context length, and at short to moderate lengths it is dominated by the roughly constant cost of reading the model weights from memory, so each new token takes close to constant time.
- High-Throughput API Serving: Platforms like Anthropic’s Claude, OpenAI’s GPT-4o, and open-source deployments on Together AI must serve thousands of concurrent requests. A well-managed paged KV cache lets the scheduler pack sequences of mixed lengths into a single GPU batch, maximizing utilization.
- Long-Document Summarization and RAG: Processing a 128,000-token document for retrieval-augmented generation stores a large KV cache per request in FP16: roughly 40 GB for a 70B-class model with grouped-query attention, and well over 100 GB without it. Quantized and eviction-based caches make this economically feasible, allowing providers to offer long-context windows without prohibitive VRAM costs.
- Speculative Decoding: In this latency-reduction technique, a small draft model proposes several candidate tokens, and the large target model verifies them in parallel. The KV cache must be branched—shared from the original sequence but extended speculatively—then rolled back if verification fails. Tree-based KV cache structures (e.g., in TensorRT-LLM and Medusa) support efficient branching and rollback.
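The branch-and-rollback step in speculative decoding can be sketched as a cache truncation. This is a minimal illustration, assuming a single linear chain of draft tokens (not the tree-structured caches mentioned above) and a cache modelled as a plain Python list with one entry per token. The function name and arguments are hypothetical.

```python
def commit_speculation(cache, draft_entries, num_accepted):
    """Speculatively extend `cache`, then truncate away rejected draft entries.

    cache         -- list with one KV entry per committed token (mutated in place)
    draft_entries -- KV entries computed for the proposed draft tokens
    num_accepted  -- how many leading draft tokens the target model accepted
    """
    if not 0 <= num_accepted <= len(draft_entries):
        raise ValueError("num_accepted out of range")
    committed_len = len(cache)
    cache.extend(draft_entries)                 # branch: extend speculatively
    del cache[committed_len + num_accepted:]    # rollback: drop the rejected tail
    return cache

cache = ["kv0", "kv1", "kv2"]
commit_speculation(cache, ["d0", "d1", "d2", "d3"], num_accepted=2)
# cache is now ["kv0", "kv1", "kv2", "d0", "d1"]
```

Entries for accepted tokens stay in place, so verification work is not repeated; only the rejected suffix is discarded.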
What are the benefits and limitations of the KV cache?
Benefits
- Latency Reduction: Reduces per-token generation latency from O(n²) to approximately O(n), enabling real-time, streaming text generation.
- Compute Efficiency: The heavy matrix multiplications for K and V projections are amortized over the life of a sequence. Prefill costs are one-time.
- Architectural Compatibility: Works across all standard attention variants (multi-head, multi-query, grouped-query) and is transparent to the model architecture, requiring only inference-code support.
- Memory Sharing: Advanced allocators allow multiple requests to share identical prefix caches (e.g., system prompts), effectively multiplying the number of concurrent users a single GPU can serve.
Limitations
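- VRAM Hunger: An uncompressed FP16 KV cache requires approximately 2 (for K and V) × num_layers × num_kv_heads × d_k × sequence_length × 2 bytes per sequence. For Llama 3 70B (80 layers, 8 KV heads, 128 dimensions each) at 128K (131,072) tokens, this is about 43 GB per sequence, so four concurrent long-context sequences already exceed the model weights (~140 GB). With full multi-head attention (64 KV heads), the same cache would be roughly 340 GB. At long context and even modest concurrency, the cache, not the weights, becomes the VRAM bottleneck.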
- Batching Complexity: Without paged memory, sequences of different lengths in a batch waste VRAM because every sequence’s cache tensor must be padded to the length of the longest sequence in the batch.
- First-Token Latency (Time-to-First-Token): The prefill phase, which computes K and V for the entire prompt, is compute-bound and can be slow for long prompts. Prefix caching mitigates this for shared prefixes, but unique long prompts still incur significant startup cost.
- Quantization Accuracy Trade-off: Aggressively quantizing the KV cache (below 4 bits) can introduce attention-score errors that accumulate over long contexts, leading to perplexity spikes or factual errors. Calibration datasets and adaptive precision schemes are active areas of research as of 2026.
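The memory and accuracy trade-off of cache quantization can be seen in a small round-trip experiment. This is a minimal sketch, assuming symmetric INT8 quantization with one scale per token row, applied to random data standing in for a key tensor. It is one simple scheme chosen for illustration, not the method used by any particular framework, and the function names are hypothetical.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-token INT8 quantization of a (tokens, d_k) float array."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_int8(q, scale):
    """Recover an approximate float32 tensor from INT8 values and row scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
K = rng.normal(size=(1024, 128)).astype(np.float16)   # stand-in for cached keys
K_q, K_scale = quantize_int8(K.astype(np.float32))
K_hat = dequantize_int8(K_q, K_scale)

max_err = float(np.abs(K_hat - K.astype(np.float32)).max())
bytes_fp16 = K.nbytes                                  # 2 bytes per element
bytes_int8 = K_q.nbytes + K_scale.nbytes               # 1 byte per element + scales
```

The INT8 copy takes a little over half the FP16 footprint, and the per-element error is bounded by half a quantization step for that row; lower bit widths widen the step and therefore the error.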
How does a KV cache differ from a model weight cache?
A KV cache and a weight cache serve fundamentally different roles in inference, despite both using GPU memory:
- Weight Cache: Holds the static, read-only model parameters (weights and biases) that define the neural network’s trained function. These are accessed for every token in every request. They are constant across all sequences and typically loaded once at server startup. No eviction or growth logic applies.
- KV Cache: Holds per-sequence, dynamic state—the intermediate attention keys and values that grow with each generated token. It is request-local, mutable (append-only), and must be allocated and freed per sequence. Its size scales linearly with (batch_size × sequence_length).
In essence, model weights are “the program,” shared globally; the KV cache is “the execution state” of each ongoing generation. The two are often jointly considered in memory planning: a production serving system must balance weight memory (which grows with model size and is split across GPUs under tensor parallelism) against cache memory (which grows with concurrency and context length). As of 2026, it is common to see split strategies where weights reside on high-bandwidth memory (HBM) with bit-precise formats, while the KV cache is sharded, offloaded, or stored at low precision in a separate memory pool.
Frequently Asked Questions
Does every transformer model use a KV cache? No. Only autoregressive decoder models (GPT-style) require a KV cache during generation. Encoder-only models like BERT process the entire input in a single forward pass and do not generate sequentially. Encoder-decoder models like T5 or Whisper use a KV cache only in the decoder half (for cross-attention, the encoder outputs are fixed and can be cached as a static matrix).
Can the KV cache be traded for recomputation, and is that ever faster? Yes. Some inference libraries support “recomputation” strategies where older KV entries are discarded and recomputed on demand when a sequence exceeds a certain length, trading memory for compute. This is rarely faster at inference time because recomputation is compute-bound and hurts latency, but it is essential for on-device or edge scenarios where VRAM is severely constrained (e.g., running a 7B model on a phone).
Why does the KV cache use so much memory, and can I share it between different requests? The memory grows quickly because K and V tensors exist for every layer, every head, and every token in a sequence. A 32-layer model with 32 heads of dimension 128, for a 2048-token sequence, stores 32 × 32 × 2048 × 128 × 2 (for K and V) × 2 bytes (FP16) = 1 GiB (about 1.07 GB) per sequence, just for the cache. Sharing caches across requests is safe only when the prompt prefix is identical token for token, which is what prefix-aware caches exploit (SGLang, prompt-caching APIs). Full sequence caches of different requests cannot be merged because each token's K and V depend on its position and on every token that precedes it.
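The arithmetic in this answer generalizes to a one-line formula. The helper below is a small sketch for estimating cache size; the function name and parameter names are illustrative, and `num_kv_heads` is the number of K/V heads actually stored (equal to the number of query heads under standard multi-head attention, smaller under grouped-query attention).

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to hold K and V for every layer, KV head, and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)

# The 32-layer, 32-head, 2048-token FP16 example from the answer above.
mha_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=2048)
# Same shape but with 8 KV heads (grouped-query attention): 4x smaller.
gqa_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=2048)

mha_gib = mha_bytes / 2**30   # 1.0
gqa_gib = gqa_bytes / 2**30   # 0.25
```

Every factor is multiplicative, so halving the precision, the number of KV heads, or the retained sequence length each halves the cache.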
Is the KV cache used during training? No. During training, teacher forcing provides the entire target sequence at once, and the model computes attention in parallel across all positions using a causal mask. The computational graph is stateless, and no incremental state needs to be preserved. The KV cache is a pure inference-time optimization.
What happens if the KV cache runs out of memory mid-generation? The behavior depends on the inference runtime. vLLM and TensorRT-LLM typically preempt requests (swap them to CPU memory) or reject new sequences when the shared block pool is exhausted. Naive implementations may trigger an out-of-memory (OOM) CUDA error, crashing the server. This is why accurate memory planning and dynamic batching are critical in production systems. As of 2026, unified memory architectures (e.g., Grace-Hopper) and KV cache offloading to CPU RAM via high-speed NVLink-C2C are making OOM conditions rarer, even at extreme context lengths.
How has KV cache management changed in 2026 compared to 2023? As of 2026, three shifts have become standard: (1) KV cache quantization to FP8 or INT8 is the norm in serving frameworks, not a research novelty; (2) prefix-aware caching (RadixAttention-style) is integrated into most commercial APIs and open-source servers, not just academic prototypes; and (3) disaggregated prefill and decode architectures (e.g., DistServe) split the prefill and decode phases across separate GPU pools, each with its own cache strategy optimized for compute or memory bandwidth. This trend addresses the growing gap between prompt-processing and token-generation demands.