What is LLM Inference? Definition, How It Works & Examples…

LLM inference is the computational process by which a trained large language model (LLM) generates text, code, or other outputs in response to a user prompt, predicting one token at a time based on learned statistical patterns. Unlike the massive, one-time cost of training an LLM from scratch, inference is an ongoing operational expense that dominates the total cost of ownership for AI-powered applications. As of 2026, inference has become the primary bottleneck for scaling generative AI, driving innovations in speculative decoding, quantization, and custom silicon that improve the trade-off among latency, throughput, and model intelligence.

What exactly is LLM inference?

LLM inference is the forward pass of a neural network in autoregressive generation mode. When a user submits a prompt, the model does not "think" or "reason" in a human sense; it performs a series of matrix multiplications across billions of parameters to produce a probability distribution over its vocabulary for the next token. That token is then fed back into the model as part of the new input sequence, and the cycle repeats until a stop condition is met. This token-by-token generation loop is what distinguishes inference from a simple classification forward pass: each new token depends on all previously generated tokens, making the process inherently sequential and memory-intensive.
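The generation loop described above can be sketched in a few lines of Python. This is only an illustration: `TOY_MODEL`, `next_token_distribution`, and `generate` are hypothetical names, and the lookup table stands in for a real transformer forward pass, which would condition on the whole sequence rather than only the last token.

```python
import random

# Hypothetical toy "model": maps the last token to a next-token distribution.
# A real LLM conditions on the entire sequence via its transformer layers.
TOY_MODEL = {
    "<bos>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "dog": {"sat": 0.8, "<eos>": 0.2},
    "sat": {"<eos>": 1.0},
}

def next_token_distribution(tokens):
    """Stand-in for one forward pass: returns P(next token | tokens)."""
    return TOY_MODEL[tokens[-1]]

def generate(prompt, max_new_tokens=10, greedy=True, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        if greedy:
            token = max(dist, key=dist.get)  # most probable token
        else:
            token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if token == "<eos>":  # stop condition
            break
        tokens.append(token)  # feed the new token back in as input
    return tokens

print(generate(["<bos>"]))  # ['<bos>', 'the', 'cat', 'sat']
```

Each pass through the loop needs the token produced by the previous pass, which is the sequential dependency the paragraph describes.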

The term "inference" in deep learning historically meant making predictions with a trained model. For LLMs, it specifically refers to decoding: sampling from the model's output distribution using strategies like greedy decoding, top-k sampling, top-p (nucleus) sampling, or beam search. The key insight is that token-by-token generation is typically memory-bound, not compute-bound: at small batch sizes, the primary bottleneck is moving model weights from high-bandwidth memory (HBM) to the compute units, not the floating-point operations themselves. This has profound implications for hardware design and optimization strategies.
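As a rough illustration of the decoding strategies named above, the following sketch implements temperature softmax, top-k filtering, and top-p (nucleus) filtering in plain Python. The function names and the example probabilities are assumptions made for this example; production libraries operate on tensors and handle ties and edge cases more carefully.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}

def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids])[0]

probs = [0.5, 0.3, 0.15, 0.05]  # illustrative next-token probabilities
greedy_choice = max(range(len(probs)), key=probs.__getitem__)
nucleus = top_p_filter(probs, 0.9)  # keeps token ids 0, 1, and 2
token = sample(top_k_filter(probs, 2), random.Random(0))
```

Greedy decoding is the deterministic special case; top-k and top-p trade some determinism for diversity by sampling from a truncated distribution.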

How does LLM inference work under the hood?

Under the hood, LLM inference is a two-phase process: prefill and decode.

Prefill phase

When a prompt is received, the entire input sequence is processed in parallel through the transformer layers. This phase is compute-bound because the model can leverage the full parallelism of GPUs or TPUs to compute key-value (KV) pairs for every input token simultaneously. The output of the prefill phase is the first generated token and a KV cache — a stored representation of the attention keys and values for every token in the input sequence.

Decode phase

This is the autoregressive loop. For each new token, the model must (with the attention and feed-forward steps repeated in every transformer layer):

Compute the query, key, and value vectors for the new token.
Append the new token's key and value entries to the KV cache.
Attend to all previous tokens by loading their KV cache entries from memory.
Perform the attention computation.
Run through the feed-forward network (FFN) layers.
Project to vocabulary space and sample the next token.

Because each step processes only one new token per sequence, the decode phase performs very little arithmetic per byte of weights loaded, which makes it memory-bandwidth-bound. The model must read all of its weights and the growing KV cache from memory for every single token generated. For a 70-billion-parameter model running at 4-bit precision, each decode step reads roughly 35 GB of weights, so generating 100 tokens at batch size 1 moves about 3.5 TB in total, plus a KV cache that can balloon to gigabytes for long contexts. This is why inference speed is measured in tokens per second rather than inferences per second.

Attention mechanism and KV cache

The scaled dot-product attention at the heart of transformers computes attention scores between every pair of tokens. Without caching, this would be O(n²) in compute for every new token. The KV cache stores precomputed key and value projections, reducing the attention computation for each new token to O(n) — but the cache itself grows linearly with sequence length, creating a memory wall for long-context inference. As of 2026, techniques like multi-query attention (MQA), grouped-query attention (GQA), and sliding window attention are standard architectural choices that shrink the KV cache footprint, enabling longer contexts without proportional memory growth.¹
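To make the caching idea concrete, here is a minimal single-head sketch in plain Python. The `project` function is a hypothetical stand-in for the learned query, key, and value projections, and the two-dimensional embeddings are made up for the example. The point is only that a decode step with a cache gives the same result as re-projecting every earlier token.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for j, vj in enumerate(v):
            out[j] += (w / total) * vj
    return out

class KVCache:
    """Grows by one (key, value) pair per token: memory is linear in length."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def project(x, scale):
    # Hypothetical stand-in for the learned W_q / W_k / W_v projections.
    return [scale * xi for xi in x]

def decode_step(x, cache):
    """One decode step: project only the new token, reuse cached K and V."""
    q, k, v = project(x, 1.0), project(x, 0.5), project(x, 2.0)
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)

def recompute_last(xs):
    """No cache: re-project every earlier token before attending."""
    keys = [project(x, 0.5) for x in xs]
    values = [project(x, 2.0) for x in xs]
    return attend(project(xs[-1], 1.0), keys, values)

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
outputs = [decode_step(x, cache) for x in embeddings]
```

With the cache, each step does work proportional to the number of cached tokens, at the cost of storing one key and one value vector per token (per layer and per head in a real model).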

What are the key types and variants of LLM inference?

LLM inference strategies can be categorized along several axes:

Dimension	Variants	Description
Decoding strategy	Greedy, Top-k, Top-p, Beam search, Contrastive search	Controls how tokens are sampled from the output distribution; affects creativity vs. determinism
Precision format	FP32, FP16, BF16, INT8, INT4, FP8, NF4	Lower precision reduces memory bandwidth pressure at the cost of potential quality degradation
Parallelism strategy	Tensor parallelism, Pipeline parallelism, Data parallelism, Expert parallelism (for MoE)	Distributes model weights and computation across multiple accelerators
Scheduling	Continuous batching, Dynamic batching, PagedAttention	How multiple requests are interleaved to maximize accelerator utilization
Speculative execution	Speculative decoding, Medusa, Eagle, Lookahead decoding	Proposes several tokens cheaply (with a smaller draft model, extra prediction heads, or a similar lightweight mechanism), then verifies them in parallel with the large model
Hardware target	GPU (NVIDIA, AMD), TPU, Inferentia, Gaudi, LPU (Groq), custom ASIC	Different silicon architectures optimized for memory bandwidth vs. compute

Speculative decoding deserves special mention as one of the most impactful inference optimizations of the mid-2020s. It works by running a small, fast "draft" model to generate multiple candidate tokens, then feeding them as a batch into the large target model for parallel verification. When the large model accepts most of the draft tokens, the system can achieve a 2-3x speedup without any quality loss: the output distribution is mathematically identical to standard autoregressive decoding. Rejected tokens are discarded and replaced with the target model's own prediction, so a poorly matched draft model reduces the speedup but not the output quality. Medusa and Meta's Multi-Token Prediction extend this idea by adding multiple prediction heads directly to the target model.²
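The draft-then-verify loop can be sketched for the simplest case, greedy decoding, where a draft token is accepted only if it equals the target model's own choice. Everything here is a toy assumption: `target_next` and `draft_next` are hypothetical arithmetic stand-ins for real models, and the published method uses a rejection-sampling rule to preserve the full sampling distribution, which this sketch does not implement.

```python
def target_next(tokens):
    """Hypothetical expensive model: next token is (sum of tokens) mod 7."""
    return sum(tokens) % 7

def draft_next(tokens):
    """Hypothetical cheap model: disagrees with the target when sum % 5 == 0."""
    s = sum(tokens)
    return (s + 1) % 7 if s % 5 == 0 else s % 7

def greedy_decode(tokens, n):
    """Baseline: one target-model pass per generated token."""
    tokens = list(tokens)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(tokens, n, k=4):
    tokens = list(tokens)
    goal = len(tokens) + n
    target_passes = 0
    while len(tokens) < goal:
        # 1. Draft k tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Verify every draft position with the target model. A real system
        #    scores all positions in ONE batched pass; we loop for clarity.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # target's correction ends the run
                break
        else:
            # All k drafts accepted: the same pass yields one bonus token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[:goal], target_passes

output, passes = speculative_decode([1, 2], n=20)
```

In this toy setup the output matches plain greedy decoding token for token, while the number of target-model passes is smaller than the number of generated tokens whenever the draft model guesses correctly.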

What are real-world examples of LLM inference systems?

Several production inference engines dominate the landscape as of 2026:

vLLM: An open-source inference engine developed at UC Berkeley that introduced PagedAttention, a virtual memory management technique for KV caches that eliminates fragmentation and enables near-optimal memory utilization. vLLM supports continuous batching and has become the default serving engine for many open-weight models like Llama and Mistral.³
TensorRT-LLM: NVIDIA's optimized inference library that fuses transformer layers into highly optimized CUDA kernels, supports quantization (FP8, INT4, INT8), and integrates with NVIDIA's Triton Inference Server for production deployment. It leverages FlashAttention-3 and FP8 tensor cores on Hopper and Blackwell architectures.
SGLang: An emerging framework from Stanford and UC Berkeley that combines a high-performance inference runtime with a structured generation language, enabling complex multi-step reasoning workflows (chain-of-thought, tree-of-thought) to be expressed declaratively and executed efficiently.
Groq LPU: A radically different architecture using deterministic, software-scheduled SRAM-based processors that achieve extremely low latency (often sub-100ms time-to-first-token) by eliminating the memory bandwidth bottleneck of GPU-based inference entirely.
Apple Intelligence on-device inference: Apple's deployment of ~3B parameter models running locally on A17 Pro and M-series chips, using neural engine acceleration and low-bit palettization to achieve usable latency within the thermal and memory constraints of mobile devices.

How is LLM inference used in practice?

LLM inference powers a rapidly expanding set of applications:

Conversational AI and chatbots: Customer support, virtual assistants, and therapeutic chatbots that require low latency for natural turn-taking. Time-to-first-token (TTFT) under 200ms is the standard for conversational feel.
Code generation and completion: Tools like GitHub Copilot and Cursor rely on inference that must feel instantaneous — often using fill-in-the-middle (FIM) inference where the model predicts code given both prefix and suffix context.
Retrieval-augmented generation (RAG): Inference combined with document retrieval, where the model processes retrieved context chunks. Long-context inference (128K+ tokens) is critical here, and the KV cache memory cost becomes the dominant constraint.
Agentic workflows: Multi-step reasoning where a single user request triggers dozens or hundreds of inference calls as the model plans, uses tools, reflects, and revises. Inference cost and latency compound dramatically in these scenarios.
Batch processing and data synthesis: High-throughput, offline inference for generating synthetic training data, labeling datasets, or processing large document corpora. Here, throughput (tokens per second per dollar) matters more than latency.
Streaming and real-time translation: Low-latency inference pipelined with audio processing for simultaneous translation and live captioning.

What are the benefits and limitations of LLM inference?

Benefits

Generality: A single inference endpoint can handle translation, summarization, coding, and creative writing without task-specific fine-tuning.
Scalability of intelligence: Larger models trained on more data consistently produce more capable outputs, and inference is the mechanism that delivers that intelligence to users.
In-context learning: Inference enables few-shot prompting, where the model adapts to new tasks from examples provided in the prompt without any weight updates.
Continuous improvement: Model weights can be updated (via fine-tuning or full retraining) and the same inference infrastructure serves the improved model immediately.

Limitations

Latency and interactivity: The autoregressive nature of decoding imposes a fundamental latency floor. Even with optimizations, generating a 500-token response from a 70B model typically takes 2-10 seconds on datacenter-class GPUs, and considerably longer on consumer hardware, where a model of that size often does not fit in GPU memory.
Memory wall: KV cache growth with sequence length creates a hard memory constraint. A 128K context with a 70B model can require over 30 GB of KV cache alone, limiting batch sizes and increasing cost.
Cost at scale: Inference dominates the operational cost of LLM-powered products. At $0.50-$2.00 per million output tokens for frontier models (as of early 2026), high-volume applications face significant infrastructure bills.
Non-determinism: Sampling-based inference produces different outputs for the same input, complicating testing, evaluation, and compliance in regulated industries.
Hallucination and factual reliability: Inference has no inherent truth-checking mechanism; the model generates statistically plausible text that may be factually incorrect, requiring external verification layers.

How does LLM inference differ from LLM training?

Training and inference run the same model architecture, but training performs both forward and backward passes while inference performs only forward passes. As a result, they are fundamentally different computational workloads:

Aspect	LLM Training	LLM Inference
Primary bottleneck	Compute (FLOPs)	Memory bandwidth
Data flow	Processes full sequences in large batches, in parallel across thousands of accelerators	Processes one token at a time per sequence during decode
Precision	Typically BF16/FP16 mixed precision; requires stable gradients	Can use INT4/INT8/FP8; no gradients needed
Hardware utilization	Aims for near-100% compute utilization over days/weeks	Often 1-5% compute utilization due to memory stalls
Batch size	Massive (millions of tokens globally)	Small (1-256 sequences per accelerator)
Latency sensitivity	Hours to months per run; throughput is everything	Milliseconds to seconds; interactivity matters
KV cache	Not used (teacher forcing processes all tokens in parallel)	Essential for avoiding recomputation
Cost profile	Large upfront capital expenditure	Ongoing operational expenditure; 80-90% of lifetime model cost

This distinction explains why inference-specific hardware (Groq LPU, Inferentia, custom ASICs) and inference-specific optimizations (speculative decoding, PagedAttention, quantization) have become their own thriving subfield, separate from training infrastructure.

Frequently Asked Questions

Why is LLM inference so slow on consumer hardware?

The primary bottleneck is memory bandwidth. A modern GPU like an RTX 4090 has ~1 TB/s of memory bandwidth. A 7B parameter model at 4-bit precision is ~3.5 GB. Reading those weights for every token generated limits theoretical maximum throughput to roughly (1 TB/s) / (3.5 GB) ≈ 285 tokens per second — and real-world performance is lower due to KV cache overhead and kernel launch overhead. Larger models exceed GPU memory entirely, requiring slower CPU offloading.
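The same back-of-the-envelope bound can be written as a small helper for trying other model sizes and bandwidths. This is a simplified sketch: it assumes batch size 1, counts only weight traffic, and ignores the KV cache, kernel overheads, and whether the weights fit in GPU memory at all. The function names are made up for the example.

```python
def weight_bytes(n_params, bits_per_weight):
    """Size of the model weights in bytes."""
    return n_params * bits_per_weight / 8

def max_decode_tokens_per_second(n_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound at batch size 1: every token re-reads all weights once."""
    return bandwidth_bytes_per_s / weight_bytes(n_params, bits_per_weight)

BANDWIDTH = 1.0e12  # ~1 TB/s, the figure used in the text

# 7B parameters at 4-bit precision (3.5 GB of weights).
print(f"{max_decode_tokens_per_second(7e9, 4, BANDWIDTH):.1f}")   # 285.7
# 70B parameters at 4-bit precision (35 GB), ignoring that it may not fit.
print(f"{max_decode_tokens_per_second(70e9, 4, BANDWIDTH):.1f}")  # 28.6
```

Because the bound scales inversely with weight size, halving the bits per weight roughly doubles the ceiling, which is one reason quantization is so common for local inference.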

What is the difference between time-to-first-token (TTFT) and tokens-per-second (TPS)?

TTFT measures the latency from prompt submission to the first generated token appearing — it is dominated by the prefill phase, which processes the entire input prompt in parallel. TPS measures the rate of token generation during the decode phase. A system might have excellent TPS but poor TTFT if the prefill is not optimized, or vice versa. Streaming applications care about TTFT for perceived responsiveness; batch applications care about overall TPS.
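Both metrics are easy to derive from per-token arrival timestamps, for example those recorded by a streaming client. The sketch below uses a hypothetical helper, `latency_metrics`, and a made-up trace; real measurements would come from wall-clock times captured as each token arrives.

```python
def latency_metrics(request_time, token_times):
    """Compute TTFT and decode TPS from token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_time
    # Decode rate excludes the first token, which belongs to the prefill phase.
    decode_tokens = len(token_times) - 1
    decode_time = token_times[-1] - token_times[0]
    tps = decode_tokens / decode_time if decode_time > 0 else float("nan")
    return ttft, tps

# Hypothetical trace: first token after 0.25 s, then one token every 20 ms.
times = [0.25 + 0.02 * i for i in range(101)]
ttft, tps = latency_metrics(0.0, times)
print(f"TTFT = {ttft * 1000:.0f} ms, decode TPS = {tps:.0f}")
```

Separating the first token from the rest keeps a slow prefill from being averaged into the decode rate, which matches the distinction drawn above.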

Does quantization hurt inference quality?

It depends on the model size and quantization method. For models above ~30B parameters, 4-bit quantization (e.g., GPTQ, AWQ) typically causes negligible quality degradation on standard benchmarks. For smaller models (3B-7B), the impact can be more noticeable. FP8 quantization (available on Hopper and newer NVIDIA GPUs) preserves quality extremely well because it maintains a wider dynamic range than integer formats. The field has largely converged on 4-bit weight quantization with 8-bit or 16-bit activations as the sweet spot for quality-per-bit.
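For intuition about where quantization error comes from, here is a minimal sketch of symmetric round-to-nearest 4-bit quantization applied to one small group of weights. This is deliberately the simplest possible scheme, not GPTQ or AWQ, which choose quantized values more carefully; the weight values and function names are invented for the example.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization of one group to [-7, 7]."""
    # One scale per group; fall back to 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floating-point weights."""
    return [qi * scale for qi in q]

group = [0.70, -0.35, 0.10, 0.02, -0.61, 0.44, -0.07, 0.29]
q, scale = quantize_int4(group)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(q, f"scale={scale:.3f}", f"max_error={max_error:.3f}")
```

The worst-case error per weight is half the scale, so a group containing one large outlier forces a coarse grid on every other weight in that group. Smaller group sizes and outlier-aware methods exist largely to limit that effect.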

Can LLM inference be done entirely on-device?

Yes, as of 2026, models in the 1B-7B parameter range run effectively on flagship smartphones and laptops. Apple's on-device models, Google's Gemini Nano, and Qualcomm's AI Engine all demonstrate usable inference for summarization, smart reply, and basic reasoning tasks. The key enablers are 4-bit quantization, neural engine/tensor accelerator hardware, and architectural innovations like grouped-query attention that reduce memory footprint. However, frontier-level intelligence still requires cloud-based models with hundreds of billions of parameters.

What is continuous batching and why does it matter?

Continuous batching (also called in-flight batching or iteration-level scheduling) is a scheduling technique where new requests can join a running batch without waiting for all existing requests to complete. Traditional static batching requires all sequences in a batch to finish before a new batch starts, wasting compute when sequences have different lengths. Continuous batching, as implemented in vLLM and TensorRT-LLM, can improve throughput by 2-10x in production serving environments with varied request lengths.
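A small simulation shows why letting requests join mid-flight helps when output lengths vary. It is a simplified model under stated assumptions: every decode iteration costs one time step regardless of how many slots are filled, prefill cost is ignored, and the request lengths are made up. The function names are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """A finished sequence frees its slot for the next queued request."""
    queue = deque(lengths)
    running = []  # remaining output tokens for each in-flight sequence
    steps = 0
    while queue or running:
        # Fill any free slots before the next iteration.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1  # one decode iteration for every running sequence
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [100, 10, 10, 10, 100, 10, 10, 10]  # output tokens per request
print(static_batching_steps(lengths, 4))      # 200
print(continuous_batching_steps(lengths, 4))  # 110
```

In this example the short requests no longer wait for the long ones, so the same work finishes in 110 iterations instead of 200. The actual gain in a serving system depends on the length distribution and on how iteration cost grows with batch size.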

How does the KV cache cause out-of-memory errors?

The KV cache stores key and value tensors for every layer and every key-value head for every token in the sequence. For a 70B model with 80 layers, 64 attention heads, and a 128-dimensional head, a single 128K-token sequence with full multi-head attention requires approximately 80 × 64 × 128 × 128,000 × 2 (K and V) × 2 bytes (FP16) ≈ 335.5 GB. With grouped-query attention and 8 key-value heads, a common configuration for 70B-class models, the same sequence needs about 41.9 GB. Serving multiple long-context requests simultaneously can therefore exhaust even the 80 GB of an H100 GPU, causing out-of-memory failures. PagedAttention and KV cache quantization are direct responses to this problem.¹
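A short helper makes it easy to rerun this kind of estimate for other configurations. It is a sketch that assumes a dense cache with one key and one value vector per token, per layer, and per key-value head; the function name and the choice of 1 GB = 10^9 bytes are assumptions of the example.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """Bytes for keys and values across all layers (leading 2 = K plus V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value * batch

GB = 1e9

# Full multi-head attention: every one of the 64 heads stores its own K and V.
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=128_000)
# Grouped-query attention: 8 key-value heads shared across the query heads.
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=128_000)

print(f"64 KV heads: {mha / GB:.1f} GB")  # 335.5 GB
print(f" 8 KV heads: {gqa / GB:.1f} GB")  # 41.9 GB
```

The cache scales linearly with every argument, so halving the precision (for example, an 8-bit KV cache) or the number of key-value heads halves the footprint, while doubling the batch size doubles it.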

1. Dao, T., Fu, D., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems, 35. https://arxiv.org/abs/2205.14135
2. Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. Proceedings of the 40th International Conference on Machine Learning. https://arxiv.org/abs/2211.17192
3. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23). https://arxiv.org/abs/2309.06180
