What Are Tokens? Definition, How It Works & Examples (2026)

Tokens are the fundamental atomic units of text—such as words, subwords, or characters—that large language models (LLMs) and other natural language processing (NLP) systems process, serving as the essential bridge between raw human language and the numerical vectors a model can compute. In modern transformer-based architectures, a sequence of text is decomposed into a sequence of tokens, each of which is mapped to a unique integer ID from a fixed vocabulary, enabling the model to ingest, generate, and reason about language in a structured, mathematically tractable form.

What exactly is a token in artificial intelligence?

In the context of large language models, a token is the smallest unit into which a piece of text is segmented before being processed by the neural network. Tokens are produced by a preprocessing step known as tokenization, which transforms a raw string of Unicode characters into a list of discrete elements. Contrary to common intuition, tokens do not always correspond to complete English words: common words are often single tokens, while rarer words are split. Modern tokenizers, typically based on the Byte-Pair Encoding (BPE) algorithm or its variants, break words into frequently occurring subword fragments. For example, the word "unbelievable" might be tokenized into three tokens: un, believ, and able. This subword strategy balances vocabulary size against the ability to represent rare or previously unseen words: it avoids the out-of-vocabulary problem that plagued earlier whole-word models without producing the very long sequences of character-level models.
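
The merge procedure behind BPE can be sketched in a few lines. This is a toy, character-level version under stated assumptions: the corpus, the function names (`learn_merges`, `encode`, `merge_pair`), and the tie-breaking rule are all illustrative choices, not the behavior of any production tokenizer.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken lexicographically (an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


toy_corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_merges(toy_corpus, num_merges=4)
pieces = encode("lowest", merges)
```

With this toy corpus, the unseen word "lowest" is split into the two learned fragments `low` and `est`, which is the sense in which subword vocabularies handle words that never appeared during tokenizer training.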

Each token in a model's vocabulary is assigned a unique integer index. During inference, text is converted into a 1D sequence of these integer IDs, which is then passed through an embedding layer—a learned lookup table that maps each token ID to a high-dimensional, dense floating-point vector (often 768 to 12,288 dimensions, depending on the model). These token embeddings form the initial hidden states that flow through the transformer's attention and feed-forward layers. Critically, the choice of tokenization directly impacts a model's multilingual capabilities, its efficiency on code or numerical data, and even its ability to perform arithmetic.
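
The ID-to-vector lookup described above is, mechanically, just row indexing into a table. The sketch below assumes a four-entry vocabulary and a 4-dimensional embedding filled with random numbers; in a real model the table is learned during training and is far larger. All names here are hypothetical.

```python
import random

# Hypothetical miniature vocabulary: token string -> integer ID.
vocab = {"<unk>": 0, "un": 1, "believ": 2, "able": 3}
d_model = 4  # embedding width; real models use hundreds to thousands

# Random stand-in for a learned embedding matrix: one row per token ID.
rng = random.Random(0)
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(d_model)]
                   for _ in range(len(vocab))]


def embed(token_ids):
    """Look up one embedding row per token ID."""
    return [embedding_table[token_id] for token_id in token_ids]


token_ids = [vocab[piece] for piece in ["un", "believ", "able"]]
vectors = embed(token_ids)  # 3 vectors, each of length d_model
```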

How do tokens work inside a large language model?

The lifecycle of a token in a transformer-based LLM proceeds through several precise stages:

Tokenization: The input string is split by a pre-trained tokenizer. For BPE, the vocabulary is built ahead of time by iteratively merging the most frequent adjacent symbol pairs (bytes, in byte-level BPE) until a target vocabulary size (e.g., 50,000, 100,000, or 200,000) is reached; at inference time the learned merges are applied to the input. Common implementations in 2026 include SentencePiece, a language-agnostic library that supports both BPE and unigram models, and tiktoken, OpenAI's open-source BPE library, which also handles special control tokens.
Index Mapping: Each recognized subword unit is replaced by its integer ID from the vocabulary. Depending on the model, the tokenizer may also prepend a special beginning-of-sequence token (e.g., <BOS>). An end-of-sequence token (<EOS>) marks document boundaries during training, and decoder-only models emit it at generation time to signal that the output is complete.
Embedding: The integer IDs pass through an embedding matrix learned during training. The resulting vectors capture semantic similarity; tokens with similar meanings cluster in this high-dimensional space.
Positional Encoding: Because transformers process tokens in parallel rather than sequentially, position information must be injected explicitly. The original transformer added a positional vector to each token embedding. As of 2026, Rotary Position Embedding (RoPE) is the dominant technique; instead of adding a vector, it rotates the query and key vectors by position-dependent angles, so relative position is encoded directly in the attention computation.
Contextualization: The embedded sequence flows through layers of multi-head self-attention. In decoder-only LLMs a causal mask restricts each token to attending to itself and earlier tokens (bidirectional encoders attend to every token), building a deep contextual representation. The final hidden state vectors encode the meaning of each token within the context available to it.
Decoding: For text generation, the final hidden state of the last token is projected back onto the vocabulary space via a linear layer (the "language modeling head"), producing logits. A softmax converts logits to a probability distribution, and the next token is sampled autoregressively. This new token is appended to the sequence, and the process repeats.
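
The positional-encoding stage above can be made concrete with a small sketch of RoPE. It rotates consecutive pairs of vector components by an angle proportional to the token position, and the point of the demonstration is that the query-key dot product then depends only on the distance between positions. The interleaved pairing and the base of 10000 are assumptions taken from the common formulation; real implementations differ in layout.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    rotated = []
    for i in range(0, d, 2):
        # Lower-index pairs rotate faster than higher-index pairs.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * c - y * s, x * s + y * c])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.25]

# Same relative offset (3 positions apart) at two different absolute positions.
score_near_start = dot(rope(query, 5), rope(key, 2))
score_far_along = dot(rope(query, 13), rope(key, 10))
```

The two scores agree up to floating-point error, because rotating both vectors by the same extra angle leaves their dot product unchanged.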

The practical consequence is that an LLM's context window is measured in tokens, not in words or characters. The maximum number of tokens a model can process at once is a hard limit set by its architecture and training; by 2026, some production models advertise context windows of 1 million tokens or more.

What are the main types of tokenization and token representations?

Tokenization strategies represent a fundamental trade-off between vocabulary compactness and semantic granularity. The primary types include:

Tokenization Type	Unit of Segmentation	Vocabulary Size	Key Advantage	Key Disadvantage
Word-based	Whole words separated by whitespace	Very large (>10^6)	Intuitive, high semantic density per token	Catastrophic out-of-vocabulary (OOV) rate, huge embedding matrix
Character-based	Individual Unicode characters	Tiny (<256 for English)	Zero OOV, very small vocabulary	Extremely long sequences, poor semantic signal per token
Subword (BPE/WordPiece)	Frequent subword units	Medium (30k–200k)	Excellent OOV handling, efficient compression, multilingual	Rare-word segmentation can be arbitrary; insensitive to morphology
Morpheme-based	Linguistically meaningful sub-words	Medium-large	Better linguistic grounding, interpretable	Requires language-specific rules, complex pipeline

Byte-level BPE, as used in OpenAI's GPT-family tokenizers, operates directly on UTF-8 bytes rather than Unicode characters. This gives a base vocabulary of exactly 256 byte tokens, on top of which merges are learned, and lets the tokenizer represent any arbitrary sequence of bytes without an unknown token.
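
The 256-entry base vocabulary is easy to see directly: before any merges are applied, every string is just its UTF-8 bytes. The sketch below shows only that base layer (no merges), with hypothetical function names.

```python
def to_byte_tokens(text):
    """Map text to base byte tokens: one integer in 0..255 per UTF-8 byte."""
    return list(text.encode("utf-8"))


def from_byte_tokens(byte_ids):
    """Invert the mapping; exact for any valid UTF-8 byte sequence."""
    return bytes(byte_ids).decode("utf-8")


# "naive" with a diaeresis on the i, a space, and an emoji, written as escapes.
sample = "na\u00efve \U0001F642"
byte_ids = to_byte_tokens(sample)
restored = from_byte_tokens(byte_ids)
```

The accented letter costs two byte tokens and the emoji costs four, which hints at why text outside the tokenizer's well-covered languages tends to consume more tokens.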

Beyond the tokenization algorithm, there is a growing taxonomy of special tokens that structure model behavior: control tokens (<|im_start|>, <|system|>) define conversational roles for instruction-tuned models; tool-use tokens demarcate function-calling arguments; and modality expansion tokens in multimodal models signal the injection of image or audio embeddings into the token stream. In 2026, a critical research frontier involves "token-free" byte-level models, including byte-level state-space models, which aim to process raw byte sequences directly, bypassing the tokenizer entirely to eliminate tokenization bias.¹

What are some prominent real-world examples of token usage?

Token design decisions are not academic footnotes; they are competitive differentiators for frontier AI systems. Concrete examples include:

Claude (Anthropic): Uses a custom tokenizer, reported to be BPE-based with a vocabulary exceeding 100,000 tokens. Anthropic has published few details about it, so claims about tokenizer-level changes, such as a 2024 update said to represent individual characters as sub-tokens to improve character-level tasks like spelling a word backward, are difficult to verify.
GPT-4o (OpenAI): Employs the o200k_base encoding, a BPE tokenizer with a vocabulary of roughly 200,000 tokens that is available in the tiktoken library. Compared to GPT-4's cl100k_base, the larger vocabulary encodes many non-English languages in substantially fewer tokens (the gain varies widely by language), directly making the model cheaper and faster for multilingual users.²
Llama 3 (Meta AI): Replaced the SentencePiece tokenizer used by Llama 2 with a tiktoken-based BPE tokenizer and widened the vocabulary from 32,000 to roughly 128,000 tokens in its 2024 release. Because the tokenizer operates on bytes, any input, including malformed web text, can be encoded without an unknown token.
Gemini 2.5 (Google DeepMind): Uses its proprietary tokenizer that integrates multimodal tokens natively. Images are not merely converted to text before tokenization; raw image patches are encoded as dense vectors fed into the transformer alongside text tokens, a technique called early fusion.
Mistral (Mistral AI): The Tekken tokenizer introduced with Mistral NeMo has a vocabulary of roughly 131,000 tokens, and Mistral's tokenizers reserve dedicated control tokens for tool definitions and tool calls, which helps structured function-calling output be parsed directly from the token stream.

How are tokens used in practical AI applications?

The concept of a token translates directly into concrete resource constraints and pricing metrics in production AI systems.

API Pricing and Metering: Every major inference API (Anthropic, OpenAI, Google Cloud, Together AI) bills precisely by the token—typically input tokens and output tokens—often with differential pricing because output generation requires sequential computation. As of 2026, with prefix-caching and prompt optimization becoming standard, enterprise cost-per-million-tokens has dropped by over 80% for frontier models compared to 2024.
Context Window Management: Architectures like retrieval-augmented generation (RAG) depend on fitting relevant documents alongside a conversation history into a finite token budget. Techniques such as token-level pruning and sliding window attention manage memory overhead, directly trading token recall for latency.
Constrained Decoding and Structured Output: Libraries such as Outlines, Guidance, and LMQL manipulate generation at the token level by masking the logits of invalid tokens, guaranteeing that an LLM emits valid JSON, SQL, or even a specific programming language grammar.
Multimodal Quantization: In vision language models, an image is "tokenized" either into discrete visual tokens drawn from a learned codebook or, more commonly, into continuous patch embeddings produced by a separately trained image encoder (like CLIP or SigLIP). The efficiency of visual tokenization, meaning how many text-token-equivalents an image consumes, determines the cost and speed of visual reasoning.
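
The constrained-decoding idea in the list above, masking the logits of invalid tokens, can be sketched as follows. This is a generic illustration and not the API of Outlines, Guidance, or LMQL; the vocabulary, logits, and the rule that only an opening brace is allowed are all made up for the example.

```python
import math


def mask_logits(logits, allowed_ids):
    """Set the logits of disallowed tokens to -inf so they get zero probability."""
    allowed = set(allowed_ids)
    if not allowed:
        raise ValueError("at least one token must be allowed")
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and raw model scores for the next token.
vocab = ["{", "}", "hello", '"', "42"]
logits = [0.1, 0.3, 2.5, 0.2, 1.0]

# Suppose a JSON grammar says the output must start with an opening brace.
allowed_ids = [vocab.index("{")]

unconstrained = softmax(logits)
constrained = softmax(mask_logits(logits, allowed_ids))

greedy_free = vocab[unconstrained.index(max(unconstrained))]
greedy_constrained = vocab[constrained.index(max(constrained))]
```

Without the mask, greedy decoding would pick "hello"; with it, the only token that can be chosen is "{". A real grammar-constrained decoder recomputes the allowed set after every generated token.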

What are the key benefits and inherent limitations of tokens?

Benefits:

Universal Interface: Tokens provide a uniform mathematical representation for text, code, images, and even robotic action sequences, allowing a single transformer architecture to process any modality that can be serialized into discrete units.
Compression Ratio: Subword tokenization is far more compact than character-level input. As a rule of thumb, typical English text averages about four characters (roughly 0.75 words) per token, so the transformer processes a sequence around a quarter as long as the character count.
Generalization: Sharing subword components allows an LLM to infer the meaning of rare words by decomposing them into known roots, suffixes, and prefixes.
Controllability: Operating on logit-level token probabilities enables precise, verifiable guardrails against generating harmful content or malformed output.

Limitations and Trade-offs:

Tokenizer Bias: Token boundaries that do not align with meaning introduce systematic weaknesses. For example, whitespace handling causes "Python" and " Python" (with a leading space) to map to different token IDs with separately learned embeddings, so the model must learn from data that they refer to the same thing. The "reversal curse," in which a model trained on "A is B" fails to answer "B is A," is sometimes cited alongside these issues, but it arises from the direction in which facts appear in the training data rather than from token boundaries.
Numerical and Orthographic Fragility: Numbers are not atomic under subword tokenization. The number "380" may be tokenized as 38 and 0, losing arithmetic structure and making precise calculation difficult without external tools. Models also struggle with character-level manipulations like counting the letter 'r' in "strawberry."³
Language Disparity (The Tokenizer Tax): Because BPE optimizes for compression on its training corpus, high-resource languages like English see far more efficient tokenization than low-resource or morphologically complex languages. A sentence in Burmese can require several times as many tokens as its English semantic equivalent, and morphologically rich languages such as Finnish also tend to need noticeably more, making API calls slower and more expensive for those users.
Irreversibility and Opacity: Tokenization is a deterministic pre-step, and the model never sees the raw text, only token IDs. Byte-level tokenizers round-trip exactly, but pipelines that normalize Unicode, collapse whitespace, or map unknown characters to a single unknown token are lossy: the decoded text can differ from the input, a class of failure sometimes called "tokenizer round-tripping errors."
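
One concrete way a tokenization pipeline can fail to round-trip is Unicode normalization applied before encoding. The sketch below uses NFKC normalization from the Python standard library followed by a plain byte encoding; the `encode` and `decode` helpers are hypothetical stand-ins for a full tokenizer, chosen only to isolate the normalization effect.

```python
import unicodedata


def encode(text, normalize=True):
    """Optionally NFKC-normalize, then map to UTF-8 byte IDs."""
    if normalize:
        text = unicodedata.normalize("NFKC", text)
    return list(text.encode("utf-8"))


def decode(byte_ids):
    return bytes(byte_ids).decode("utf-8")


# "fine" spelled with the single-character ligature U+FB01 in place of "fi".
original = "\ufb01ne"

with_normalization = decode(encode(original, normalize=True))
without_normalization = decode(encode(original, normalize=False))
```

With normalization enabled, the ligature is rewritten to the two letters "fi", so the decoded string no longer equals the input; without it, the byte-level round trip is exact.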

How do tokens differ from embeddings and vectors?

There exists a common conflation between these three terms, but they represent distinctly different stages of the NLP pipeline:

A token is a discrete integer representing a textual unit. It is a symbolic entity—the input-to and output-from the model.
An embedding is a dense, continuous floating-point vector that represents the meaning of a token. An embedding is the learned internal representation of that discrete token ID and lives inside the model's latent space.
A vector is the broader mathematical object. An embedding is a vector, but not all vectors in the model are embeddings (e.g., attention query, key, and value vectors are derived intermediate states).

The sequence is: Text → Tokenizer → Token IDs → Embedding layer → Embedding vectors → Transformer → Output logits → Softmax → Next Token ID. The token is the symbolic bookend; the embedding is its semantic essence.
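
The sequence above can be traced end to end in a toy sketch. Everything here is a stand-in: the whitespace tokenizer, the random embedding table, the averaging function used in place of a real transformer, and the reuse of the embedding table as the output projection are assumptions made to keep the example short. It assumes non-empty input.

```python
import math
import random

vocab = ["<unk>", "the", "cat", "sat", "on", "mat"]
token_to_id = {token: i for i, token in enumerate(vocab)}

d_model = 8
rng = random.Random(0)
# Random stand-in for learned embeddings: one row per token ID.
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in vocab]


def tokenize(text):
    """Text -> token IDs (toy whitespace tokenizer, unknown words map to 0)."""
    return [token_to_id.get(word, 0) for word in text.lower().split()]


def toy_transformer(vectors):
    """Stand-in for the transformer stack: average the input vectors."""
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(d_model)]


def next_token_probs(text):
    token_ids = tokenize(text)                    # Text -> Token IDs
    vectors = [embeddings[i] for i in token_ids]  # Token IDs -> embedding vectors
    hidden = toy_transformer(vectors)             # -> final hidden state
    logits = [sum(h * e for h, e in zip(hidden, row)) for row in embeddings]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # softmax over the vocabulary
    total = sum(exps)
    return token_ids, [e / total for e in exps]


token_ids, probs = next_token_probs("The cat sat")
next_token_id = probs.index(max(probs))  # greedy choice of the next token ID
```

Note where the types change: token IDs are integers at both ends of the pipeline, while everything in between is floating-point vectors.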

Frequently Asked Questions

Can a token be an entire sentence? In theory, if a sentence occurred frequently enough in the training data and the tokenizer allowed merges across word boundaries, it could become a single token. In practice, most BPE tokenizers first split text at whitespace and punctuation, so merges rarely span more than one word, and single tokens are usually common words (like "The") or word fragments. Most sentences parse to many subword tokens.

Why do some languages use more tokens than others? Tokenizers like BPE optimize for statistical frequency over a given corpus, so languages that dominate that corpus (English above all) are compressed most efficiently. Agglutinative languages that add many suffixes to a root (Turkish, Finnish), and languages written in scripts that are rare in the training data, are split into many more subword or byte-level units, inflating token counts. As of 2026, this lack of tokenizer equity remains a major challenge in global AI deployment.

Does a higher token count always mean higher cost? In general, yes. Frontier APIs charge per token, and more tokens require more floating-point operations (FLOPs); in vanilla transformers the self-attention computation scales quadratically with sequence length. State-space and hybrid architectures (Mamba-2, LFM-2) and linear-attention models (RWKV-7) scale linearly with sequence length instead, which softens but does not remove the link between token count and cost.
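
A little arithmetic makes both points concrete: how a per-token bill is computed, and how quadratic and linear scaling diverge as sequences grow. The prices below are hypothetical placeholders, not any provider's actual rates, and the operation counts ignore constant factors.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      usd_per_million_input, usd_per_million_output):
    """Per-token billing: input and output tokens are priced separately."""
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000


def attention_score_count(n_tokens):
    """Full self-attention computes one score per (query, key) pair: n * n."""
    return n_tokens * n_tokens


def linear_step_count(n_tokens):
    """A linear-time sequence model does a fixed amount of work per token."""
    return n_tokens


# Hypothetical prices: 2 USD per million input tokens, 8 USD per million output.
cost = estimate_cost_usd(200_000, 1_000, 2.0, 8.0)

# Doubling the sequence length from 10,000 to 20,000 tokens.
quadratic_growth = attention_score_count(20_000) / attention_score_count(10_000)
linear_growth = linear_step_count(20_000) / linear_step_count(10_000)
```

Doubling the sequence quadruples the attention score count but only doubles the linear model's work, so cost still rises with token count in both cases, just at different rates.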

What happens to a token the model "doesn't know"? The tokenizer has a closed vocabulary. Any byte sequence that cannot be decomposed into valid subword merges is handled by a fallback mechanism. Modern tokenizers fall back to UTF-8 byte tokens, so unknown text is represented as a sequence of single-byte tokens, ensuring zero information loss and enabling the model to still process the raw bytes.

Why do models struggle with counting letters if they process tokens? Because the model does not see letters, it sees subword token IDs. The word "hello" is typically a single token; the model has no inherent, built-in "alphabet" to inspect the internal structure of that token. It must learn orthographic patterns indirectly from training data where character-by-character tasks have been explicitly demonstrated, a fundamentally brittle process.

Does punctuation count as a token? Yes. Punctuation marks such as periods, commas, and question marks are frequently independent tokens. This matters because a question mark changes how the preceding tokens should be interpreted, and the model can only learn that if the mark is represented in the sequence. Whitespace is also tokenized, most often as a prefix attached to the following word.
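
The way punctuation and leading spaces end up in separate or prefixed pieces can be illustrated with a simplified pre-tokenization regex. This pattern is an ASCII-only approximation written for this example; production tokenizers use more elaborate, Unicode-aware patterns before applying BPE merges.

```python
import re

# Simplified pattern: optional leading space + letters, digits, or punctuation;
# any remaining whitespace is kept as its own piece.
PRETOKEN_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pretokenize(text):
    """Split text into pieces that BPE merges would then operate within."""
    return PRETOKEN_PATTERN.findall(text)


pieces = pretokenize("Is Python fun?")
bare = pretokenize("Python")
spaced = pretokenize(" Python")
```

Here the question mark becomes its own piece, and " Python" with a leading space is a different piece from "Python" at the start of a string, which is why the two typically receive different token IDs.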

¹ Xue, L., et al. "ByT5: Towards a token-free future with pre-trained byte-to-byte models." Transactions of the Association for Computational Linguistics 10 (2022): 291-306.
² OpenAI Cookbook. "How to count tokens with tiktoken." OpenAI Documentation.
³ Piantadosi, S. T., & Hill, F. "Meaning without reference in large language models." arXiv preprint arXiv:2208.02957 (2022).
