What is PagedAttention? Definition, How It Works & Examples (2026)
PagedAttention is a memory management algorithm for large language model (LLM) inference that organizes the key-value (KV) cache into fixed-size, non-contiguous memory pages — borrowing the virtual memory paging concept from operating systems — to dramatically reduce GPU memory waste and increase serving throughput.
What is PagedAttention?
PagedAttention was introduced in the 2023 paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" by Woosuk Kwon et al. from UC Berkeley and Stanford, and it forms the core innovation behind the open-source inference engine vLLM. Traditional LLM serving systems pre-allocate a large, contiguous block of GPU memory for each request's KV cache. Because sequence lengths vary unpredictably, this approach leads to severe memory fragmentation, both internal (unused space within a reserved block) and external (gaps between blocks that cannot be reused). PagedAttention solves this by dividing the KV cache into small, fixed-size logical blocks (pages), each holding the keys and values for a fixed number of tokens, and mapping those logical blocks to physical memory blocks that need not be contiguous.
The result is near-zero memory waste: physical blocks are allocated on demand and released immediately when a sequence finishes, much like how a modern OS manages RAM for running processes. This allows a single GPU to serve far more concurrent requests than was previously possible.
How Does PagedAttention Work?
PagedAttention operates through three interlocking mechanisms:
- Block table per sequence. Each request maintains a block table: a mapping from logical block indices to physical block addresses in GPU memory. When the attention kernel needs to read or write keys and values for token position t, it consults the block table to find the correct physical page.
- On-demand physical block allocation. The inference engine's block manager allocates a new physical block only when the current logical block is full. Because block size is fixed (commonly 16 tokens per block), the maximum internal fragmentation per sequence is just one partially filled block at the tail, a tiny overhead compared with the megabytes to gigabytes that contiguous pre-allocation can leave unused per request.
- Copy-on-write for parallel sampling. When a request uses beam search or generates multiple output samples from the same prompt, PagedAttention shares the prompt's physical blocks across all beams via reference counting. A physical block is copied to a new location only when one beam needs to diverge and write different tokens, the classic copy-on-write (CoW) pattern. This makes parallel decoding strategies dramatically more memory-efficient.
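The three mechanisms above can be sketched in a few dozen lines of Python. This is a simplified illustration, not vLLM's actual code: the names BlockManager, Sequence, append_token, and fork are hypothetical, the sketch tracks only block ids (no tensor data), and it assumes the free pool never runs dry.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per block, the common default mentioned above


class BlockManager:
    """Hands out physical block ids and tracks how many sequences use each."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def allocate(self) -> int:
        block = self.free.pop()          # raises IndexError if the pool is empty
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)      # block is reusable immediately


@dataclass
class Sequence:
    manager: BlockManager
    block_table: list = field(default_factory=list)  # logical index -> physical id
    num_tokens: int = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # Tail block is full (or there is none yet): allocate on demand.
            self.block_table.append(self.manager.allocate())
        else:
            tail = self.block_table[-1]
            if self.manager.refcount[tail] > 1:
                # Copy-on-write: the tail is shared, so take a private block.
                # A real engine would also copy the block's KV data here.
                self.manager.release(tail)
                self.block_table[-1] = self.manager.allocate()
        self.num_tokens += 1

    def fork(self) -> "Sequence":
        # The new sequence shares every existing block via reference counting.
        shared = [self.manager.share(b) for b in self.block_table]
        return Sequence(self.manager, shared, self.num_tokens)

    def free(self) -> None:
        for block in self.block_table:
            self.manager.release(block)
        self.block_table.clear()
        self.num_tokens = 0


manager = BlockManager(num_blocks=8)
parent = Sequence(manager)
for _ in range(20):           # a 20-token prompt fills one block and part of a second
    parent.append_token()
child = parent.fork()         # shares both blocks with the parent
child.append_token()          # the shared tail block triggers copy-on-write
print(parent.block_table, child.block_table)
```

After the fork, both sequences still point at the same first block, while the child holds a private copy of the tail block it wrote to.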
The attention computation itself is modified to accept a block table argument. Instead of a single contiguous tensor, the kernel gathers key and value vectors from scattered physical blocks. Modern GPU kernels (including FlashAttention-style fused kernels) have been adapted to support this gather pattern with minimal latency overhead. You can read the original research at the arXiv preprint: https://arxiv.org/abs/2309.06180.
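A minimal pure-Python sketch of the gather pattern follows. It is an illustration under simplifying assumptions: a single attention head, plain lists instead of GPU tensors, and hypothetical names (paged_attention, key_pool, value_pool). Real kernels read the blocks in place rather than building gathered lists first.

```python
import math

BLOCK_SIZE = 4  # small block size so the example stays readable


def attention(query, keys, values):
    """Single-head softmax attention over plain Python lists."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)                               # subtract max for stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def paged_attention(query, key_pool, value_pool, block_table, num_tokens):
    """Gather K/V through the block table, then attend as usual."""
    keys, values = [], []
    for pos in range(num_tokens):
        physical = block_table[pos // BLOCK_SIZE]    # logical block -> physical block
        offset = pos % BLOCK_SIZE                    # slot inside that block
        keys.append(key_pool[physical][offset])
        values.append(value_pool[physical][offset])
    return attention(query, keys, values)


# Six tokens stored contiguously, for reference.
keys = [[0.1 * t, 1.0 - 0.1 * t] for t in range(6)]
values = [[float(t), float(-t)] for t in range(6)]

# The same six tokens scattered into physical blocks 3 and 0 of a 5-block pool.
block_table = [3, 0]
key_pool = [[None] * BLOCK_SIZE for _ in range(5)]
value_pool = [[None] * BLOCK_SIZE for _ in range(5)]
for pos in range(6):
    physical, offset = block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
    key_pool[physical][offset] = keys[pos]
    value_pool[physical][offset] = values[pos]

query = [0.5, -0.25]
paged = paged_attention(query, key_pool, value_pool, block_table, num_tokens=6)
dense = attention(query, keys, values)
print(paged == dense)
```

The paged and contiguous paths produce the same output because the block table only changes where each key and value lives, not which ones are attended to.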
Why Does PagedAttention Matter for LLM Inference?
Memory is the primary bottleneck in production LLM serving. A 70-billion-parameter model needs roughly 140 GB for FP16 weights alone, more than a single 80 GB A100 provides, so it must be sharded across several GPUs or quantized; whatever memory remains on each GPU must be shared among all concurrent requests' KV caches. Under traditional contiguous allocation:
- Peak memory is reserved upfront based on the maximum possible sequence length, even if most requests are short.
- Reserved memory cannot be reclaimed mid-sequence, so a long-running request holds its entire reservation until it finishes, including slots it never fills.
- Batching is severely limited, forcing low GPU utilization.
PagedAttention addresses all three problems. The vLLM team reported up to 24× higher throughput compared to Hugging Face Transformers and up to 3.5× higher throughput than text-generation-inference (TGI) on equivalent hardware in their original benchmarks, with negligible latency increase per token.
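Beyond raw throughput, PagedAttention pairs naturally with continuous batching (also called iteration-level scheduling, introduced by the earlier Orca serving system), where new requests can join a running batch at any decode step rather than waiting for the entire batch to finish. Because memory is allocated and freed one block at a time, a newly admitted request needs only enough free blocks for its prompt. This is now the standard approach for high-throughput LLM APIs.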
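The interaction between iteration-level scheduling and block-granular memory can be illustrated with a toy simulation. This is a sketch under stated assumptions, not any engine's real scheduler: simulate and blocks_needed are hypothetical names, every running request produces exactly one token per step, and the sketch raises an error where a real engine would preempt or swap.

```python
from collections import deque

BLOCK_SIZE = 16


def blocks_needed(num_tokens: int) -> int:
    return -(-num_tokens // BLOCK_SIZE)  # ceiling division


def simulate(requests, total_blocks):
    """requests: list of (prompt_len, output_len) with output_len >= 1.

    Returns {request id: decode step at which it finished}.
    """
    waiting = deque(enumerate(requests))
    running = {}                 # request id -> [tokens stored, tokens remaining]
    free_blocks = total_blocks
    finished_at = {}
    step = 0
    while waiting or running:
        step += 1
        # Admit waiting requests at this iteration if their prompt fits.
        while waiting:
            rid, (prompt_len, output_len) = waiting[0]
            need = blocks_needed(prompt_len)
            if need > free_blocks:
                if not running:
                    raise RuntimeError("prompt does not fit in the block pool")
                break
            waiting.popleft()
            free_blocks -= need
            running[rid] = [prompt_len, output_len]
        # One decode step for every running request.
        for rid in list(running):
            state = running[rid]
            if state[0] % BLOCK_SIZE == 0:       # tail block is full
                if free_blocks == 0:
                    raise RuntimeError("out of blocks; a real engine would preempt")
                free_blocks -= 1
            state[0] += 1
            state[1] -= 1
            if state[1] == 0:                    # finished: release all blocks now
                free_blocks += blocks_needed(state[0])
                finished_at[rid] = step
                del running[rid]
    return finished_at


# Three requests, but only two blocks: request 2 must wait for memory.
print(simulate([(10, 3), (10, 8), (10, 2)], total_blocks=2))
# {0: 3, 2: 5, 1: 8}
```

In this run, request 2 joins at step 4, as soon as request 0 releases its block, instead of waiting for request 1 to finish at step 8.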
As of 2026, PagedAttention has become the de facto memory management standard for open-source LLM inference, with implementations in vLLM, LMDeploy, TensorRT-LLM, and several cloud providers' proprietary stacks. The concept has also influenced hardware memory controller designs for next-generation AI accelerators.
What Are Real-World Examples and Implementations of PagedAttention?
vLLM is the reference implementation and the most widely deployed open-source LLM serving framework using PagedAttention. It supports open-weight model families such as Llama, Mistral, Gemma, and Falcon, along with dozens of other architectures available on Hugging Face.
LMDeploy (from Shanghai AI Lab) independently implements a paged KV cache in its TurboMind inference engine; the project targets NVIDIA and Ascend hardware.
TensorRT-LLM from NVIDIA incorporates paged KV cache management as a first-class feature, exposing it through its C++ and Python APIs for production deployment.
OpenAI-compatible API servers built on vLLM are now a standard deployment pattern for self-hosted models, meaning PagedAttention indirectly powers a large fraction of enterprise LLM traffic.
A minimal conceptual example: when serving a batch of 64 requests with sequence lengths ranging from 50 to 2,000 tokens, a contiguous allocator must reserve 2,000-token blocks for all 64 requests (128,000 token-slots total). PagedAttention allocates only the blocks actually consumed — perhaps 30,000 token-slots — freeing the rest for additional requests.
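The arithmetic can be checked with a short script. The length distribution below is invented for illustration (mostly short requests with a few long ones), and token_slots is a hypothetical helper; only the batch size, the 50 to 2,000 token range, and the 16-token block size come from the text.

```python
def token_slots(lengths, block_size=16, max_len=2000):
    """Return (slots reserved contiguously, slots allocated with paging)."""
    contiguous = len(lengths) * max_len
    # Paging rounds each sequence up to a whole number of blocks.
    paged = sum(-(-n // block_size) * block_size for n in lengths)
    return contiguous, paged


# 64 requests: assumed mix of short, medium, and long sequences.
lengths = [50] * 8 + [200] * 40 + [1000] * 12 + [2000] * 4

contiguous, paged = token_slots(lengths)
print(contiguous, paged, sum(lengths))  # 128000 28928 28400
```

With this mix, paging allocates 28,928 slots for 28,400 tokens actually stored, so only 528 slots sit unused in partially filled tail blocks, against 99,600 unused slots under contiguous reservation.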
For further background on virtual memory paging, the foundational OS concept PagedAttention borrows from, see the Wikipedia article on demand paging.
Frequently Asked Questions
Does PagedAttention increase per-token latency?
In practice, the latency overhead is minimal. The block-table lookup and scattered memory gather add a small number of memory-access operations per attention layer, but modern GPU kernels amortize this cost across large batches. Benchmarks consistently show that the throughput gains far outweigh any marginal latency increase for typical serving workloads.
Is PagedAttention the same as FlashAttention?
No. FlashAttention is an algorithm that tiles and reorders the attention computation to minimize reads and writes to GPU high-bandwidth memory (HBM). This cuts wall-clock time and memory footprint while producing the same result; it does not reduce the asymptotic compute cost of attention. PagedAttention is a memory management strategy for the KV cache across multiple requests and time steps. The two are complementary: vLLM uses FlashAttention-style kernels internally while managing KV cache storage with PagedAttention.
What block size should I use with PagedAttention?
The default block size in vLLM is 16 tokens. Smaller blocks reduce internal fragmentation but increase the size of block tables and the overhead of block-table lookups. Larger blocks improve kernel efficiency but waste more memory on partially-filled tail blocks. For most workloads, 16–32 tokens per block is the practical sweet spot, and most frameworks expose this as a tunable parameter.
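One side of this trade-off is easy to quantify. The sketch below counts block-table entries and wasted tail slots for a few block sizes; block_stats and the four sequence lengths are hypothetical, and the script does not model kernel efficiency, which has to be measured on real hardware.

```python
def block_stats(lengths, block_size):
    """Return (total block-table entries, token slots wasted in tail blocks)."""
    entries = sum(-(-n // block_size) for n in lengths)   # ceiling division
    wasted = entries * block_size - sum(lengths)
    return entries, wasted


lengths = [50, 200, 1000, 2000]   # assumed sequence lengths, in tokens
for block_size in (8, 16, 32, 128):
    entries, wasted = block_stats(lengths, block_size)
    print(f"block_size={block_size:>3}  table_entries={entries:>3}  wasted_slots={wasted}")
# block_size=  8  table_entries=407  wasted_slots=6
# block_size= 16  table_entries=205  wasted_slots=30
# block_size= 32  table_entries=104  wasted_slots=78
# block_size=128  table_entries= 27  wasted_slots=206
```

Doubling the block size roughly halves the number of table entries while the tail waste grows, which is the tension the recommended 16 to 32 token range balances.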
Does PagedAttention work with multi-GPU tensor parallelism?
Yes. vLLM and other frameworks extend PagedAttention to multi-GPU setups by partitioning the KV cache across devices in alignment with the tensor-parallel sharding of the model's attention heads. In the design described in the vLLM paper, a single centralized scheduler and block manager maintain one logical-to-physical mapping that all GPU workers share; each worker uses the same physical block IDs but stores only the slice of the KV cache belonging to its own attention heads.
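A toy model of this arrangement is sketched below, assuming the scheduler hands every worker the same block table and each worker keeps only the entries for its own heads. The Worker class, the head count, and the string placeholders standing in for KV tensors are all hypothetical.

```python
BLOCK_SIZE = 16
NUM_HEADS = 8      # hypothetical model with 8 attention heads
TP_SIZE = 2        # two tensor-parallel workers


class Worker:
    """One GPU worker: stores KV entries only for its own attention heads."""

    def __init__(self, rank: int, num_blocks: int) -> None:
        per_rank = NUM_HEADS // TP_SIZE
        self.heads = list(range(rank * per_rank, (rank + 1) * per_rank))
        # pool[physical_block][head] -> BLOCK_SIZE slots for that head's entries
        self.pool = [
            {head: [None] * BLOCK_SIZE for head in self.heads}
            for _ in range(num_blocks)
        ]

    def write(self, block_table, pos, kv_by_head) -> None:
        physical = block_table[pos // BLOCK_SIZE]   # same lookup on every worker
        offset = pos % BLOCK_SIZE
        for head in self.heads:                     # keep only this worker's heads
            self.pool[physical][head][offset] = kv_by_head[head]


workers = [Worker(rank, num_blocks=4) for rank in range(TP_SIZE)]
block_table = [2, 0]    # one mapping, sent to every worker
kv_for_token = {head: f"kv-h{head}" for head in range(NUM_HEADS)}
for worker in workers:
    worker.write(block_table, pos=17, kv_by_head=kv_for_token)

print(workers[0].pool[0][0][1], workers[1].pool[0][4][1])  # kv-h0 kv-h4
```

Token position 17 lands in physical block 0 at offset 1 on both workers, but worker 0 holds heads 0 to 3 and worker 1 holds heads 4 to 7.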
Can PagedAttention be used with quantized models?
Yes. KV cache quantization (storing keys and values in INT8 or FP8 rather than FP16) is orthogonal to PagedAttention and is commonly combined with it. Quantizing the KV cache further reduces the memory footprint of each physical block, allowing even more concurrent requests. As of 2026, FP8 KV cache with PagedAttention is a standard configuration for high-throughput production deployments on Hopper-generation GPUs.
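The effect on block capacity can be estimated with simple arithmetic. The sketch below assumes a hypothetical 13B-class model shape (40 layers, 40 KV heads, head dimension 128) and a 10 GiB KV cache budget; the helper names are illustrative, and the calculation ignores any per-block scaling factors a real quantization scheme may store.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    # Factor 2: each layer stores one key vector and one value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def blocks_in_budget(budget_bytes, block_size, bytes_per_token):
    return budget_bytes // (block_size * bytes_per_token)


# Assumed model shape: 40 layers, 40 KV heads, head dimension 128.
fp16 = kv_bytes_per_token(40, 40, 128, bytes_per_value=2)
fp8 = kv_bytes_per_token(40, 40, 128, bytes_per_value=1)
budget = 10 * 1024**3   # assume 10 GiB of GPU memory is left for the KV cache

print(fp16, blocks_in_budget(budget, 16, fp16))  # 819200 819
print(fp8, blocks_in_budget(budget, 16, fp8))    # 409600 1638
```

Under these assumptions, halving the bytes per value doubles the number of 16-token blocks that fit in the same budget, from 819 to 1,638.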