What is Grouped Query Attention? Definition, How It Works & Examples (2026)
Grouped Query Attention (GQA) is an attention mechanism for transformer models that interpolates between multi-head attention (MHA) and multi-query attention (MQA) by dividing query heads into a smaller number of groups, each sharing a single key and value head. This design drastically reduces the size of the key-value (KV) cache during autoregressive inference, achieving near-MHA quality while improving throughput and memory efficiency. GQA was introduced by Ainslie et al. (2023) and has been adopted in flagship large language models (LLMs) such as LLaMA 2 and Llama 3.
What Is Grouped Query Attention?
Grouped Query Attention is an evolved form of the attention mechanism that sits between the two extremes of multi-head attention and multi-query attention. In standard transformers, MHA employs an independent set of key, value, and query projections for each attention head. Each head attends to a different representation subspace, enriching the model’s capacity. However, during autoregressive decoding, MHA requires storing a separate key and value tensor for every head at every token position—the KV cache—which balloons memory usage as batch size and sequence length grow. MQA solves this by collapsing all query heads to share a single key and value head, reducing the KV cache size by a factor of the number of heads, but at some cost to quality.
GQA partitions the query heads into a smaller number of groups and assigns a single key and value head to each group. If H is the total number of query heads and G the number of groups (with H divisible by G), then H/G query heads share one KV head. When G = H, GQA is equivalent to MHA; when G = 1, it is MQA. Thus, GQA provides a tunable trade-off: fewer groups give larger memory savings, while more groups preserve more capacity. The number of groups is a hyperparameter that can be set per model layer or globally, with typical values being 8, 4, or 2 for large models.
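The sharing rule can be made concrete with a small sketch. It assumes query heads are grouped contiguously (heads 0 to H/G - 1 form group 0, and so on), which is a common convention but not the only possible one. The function names are illustrative and do not come from any library.

```python
def kv_head_for_query_head(query_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Return the index of the KV head shared by the given query head."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads  # H / G query heads per KV head
    return query_head // group_size


def head_assignment(num_query_heads: int, num_kv_heads: int) -> list:
    """Map every query head to its KV head (contiguous grouping is an assumption)."""
    return [
        kv_head_for_query_head(h, num_query_heads, num_kv_heads)
        for h in range(num_query_heads)
    ]


print(head_assignment(8, 8))  # G = H (MHA): [0, 1, 2, 3, 4, 5, 6, 7]
print(head_assignment(8, 2))  # GQA:         [0, 0, 0, 0, 1, 1, 1, 1]
print(head_assignment(8, 1))  # G = 1 (MQA): [0, 0, 0, 0, 0, 0, 0, 0]
```

The two boundary cases show that MHA and MQA are the ends of the same scheme rather than separate mechanisms.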
How Does Grouped Query Attention Work?
The core computation of GQA follows the standard scaled dot-product attention, but with a restructured projection scheme:
- Input embeddings are linearly projected into queries (Q), keys (K), and values (V). Unlike MHA, where each of the H heads has its own Q, K, V weight matrices, GQA maintains H query projections but only G key and value projections.
- Query heads are partitioned into G groups. Each group g receives the same K_g and V_g projections derived from the group’s dedicated weight matrices.
- Attention is computed per group: For each query head in a group, the dot-product attention is performed against the shared K_g and V_g.
- Outputs are concatenated and projected as in MHA.
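The steps above can be sketched in plain Python. This is a minimal illustration, not a performant implementation: it assumes the Q, K, and V projections have already been applied, uses contiguous grouping, applies a causal mask by default, and omits the final output projection. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query vector over a list of positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def grouped_query_attention(q_heads, k_heads, v_heads, causal=True):
    """q_heads is [H][T][d]; k_heads and v_heads are [G][T][d].

    Returns [T][H * d]: per-position head outputs, concatenated in head order.
    """
    num_q, num_kv = len(q_heads), len(k_heads)
    if num_q % num_kv != 0:
        raise ValueError("H must be divisible by G")
    group_size = num_q // num_kv
    seq_len = len(q_heads[0])
    outputs = [[] for _ in range(seq_len)]
    for h in range(num_q):
        g = h // group_size  # KV head shared by this query head's group
        for t in range(seq_len):
            end = t + 1 if causal else seq_len  # causal: attend to positions 0..t
            outputs[t].extend(attend(q_heads[h][t], k_heads[g][:end], v_heads[g][:end]))
    return outputs


# H = 4 query heads, G = 2 KV heads, T = 2 positions, d = 2 (toy values).
q = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(4)]
k = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(2)]
v = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
out = grouped_query_attention(q, k, v)
print(len(out), len(out[0]))  # 2 8
print(out[0])  # [1.0, 2.0, 1.0, 2.0, 5.0, 6.0, 5.0, 6.0]
```

At position 0 each head can only see the first token, so heads 0 and 1 return the first value vector of KV head 0 and heads 2 and 3 return the first value vector of KV head 1. Only G key and value tensors are ever read, which is where the cache saving comes from.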
This arrangement keeps the number of query heads unchanged (preserving the representational diversity from multiple query spaces) while drastically reducing the memory footprint of keys and values. In practice, GQA models are often obtained by uptraining from an MHA checkpoint: the key and value projection matrices of the heads in each group are mean-pooled to initialize that group's single key and value head, and training then continues. Ainslie et al. report that uptraining with roughly 5% of the original pre-training compute recovers quality close to MHA, which is far cheaper than training a GQA model from scratch.
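The averaging step used to initialize GQA weights from an MHA checkpoint might look like the following sketch. It assumes each head's key (or value) projection is available as a separate matrix stored as nested lists, and that groups are formed from consecutive heads. The function name is hypothetical, and a real checkpoint would need its fused projection tensor split per head first.

```python
def mean_pool_kv_heads(head_weights, num_groups):
    """Average H per-head matrices down to num_groups matrices.

    head_weights: list of H matrices (lists of rows), one per MHA key or value head.
    Returns a list of num_groups matrices, each the element-wise mean of
    H / num_groups consecutive heads.
    """
    num_heads = len(head_weights)
    if num_heads % num_groups != 0:
        raise ValueError("number of heads must be divisible by num_groups")
    group_size = num_heads // num_groups
    pooled = []
    for g in range(num_groups):
        members = head_weights[g * group_size:(g + 1) * group_size]
        rows, cols = len(members[0]), len(members[0][0])
        pooled.append([
            [sum(m[r][c] for m in members) / group_size for c in range(cols)]
            for r in range(rows)
        ])
    return pooled


# Four toy 1x2 key-projection matrices pooled into two groups.
mha_key_heads = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
print(mean_pool_kv_heads(mha_key_heads, 2))  # [[[2.0, 3.0]], [[6.0, 7.0]]]
```

The same function would be applied once to the key projections and once to the value projections of every layer; the query and output projections are left untouched.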
KV cache reduction: With H heads, MHA stores 2 × H × d × L elements per layer (for keys and values, where d is head dimension and L sequence length). GQA reduces this to 2 × G × d × L, a factor of H/G savings. For a model like LLaMA 2 70B with H = 64 and G = 8, the cache shrinks by 8×, enabling longer contexts or larger batches at the same memory budget.
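As a worked example of the formula, the sketch below converts element counts into bytes for a whole model. H = 64 and G = 8 come from the text; the layer count (80), head dimension (128), sequence length (4,096), and 2-byte elements are illustrative assumptions chosen to resemble a 70B-class model.

```python
def kv_cache_bytes(num_kv_heads, head_dim, seq_len, num_layers,
                   batch_size=1, bytes_per_element=2):
    """Total KV cache size: 2 (keys and values) x heads x d x L per layer."""
    per_layer = 2 * num_kv_heads * head_dim * seq_len
    return per_layer * num_layers * batch_size * bytes_per_element


# Assumed shape: 80 layers, d = 128, L = 4096, 16-bit cache entries.
mha = kv_cache_bytes(num_kv_heads=64, head_dim=128, seq_len=4096, num_layers=80)
gqa = kv_cache_bytes(num_kv_heads=8, head_dim=128, seq_len=4096, num_layers=80)
print(mha / 2**30, gqa / 2**30, mha // gqa)  # 10.0 1.25 8
```

Under these assumptions a single 4,096-token sequence needs 10 GiB of cache with MHA and 1.25 GiB with GQA, matching the H/G = 8 reduction.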
What Are the Key Variants of Grouped Query Attention?
GQA exists on a continuum of attention sharing strategies:
| Variant | Number of KV heads | KV cache size (relative) | Quality vs MHA | Example use |
|---|---|---|---|---|
| Multi-Head (MHA) | 1 per head (H) | 1× (largest) | Baseline | Original Transformer, BERT |
| Grouped-Query (GQA) | 1 per group (G) | G/H fraction | Near-baseline | LLaMA 2 70B, Llama 3 |
| Multi-Query (MQA) | 1 shared (1) | 1/H fraction | Small drop | PaLM |
- MQA (Shazeer, 2019) shrinks the KV cache the most, but the single KV head can become a bottleneck: under tensor parallelism it cannot be split across devices and is typically replicated on each one, which wastes memory. Ainslie et al. also note that MQA can degrade quality and make training less stable.
- GQA balances the two. In the original experiments on T5 models, uptrained GQA with 8 groups reached quality close to MHA at inference speed close to MQA, with KV cache savings proportional to H/G.
- Additional variants include symmetric grouped attention (where the number of key heads equals the number of value heads but differs from the number of query heads) and GQA combined with sliding window attention (as in Mistral 7B).
How Is Grouped Query Attention Used in Real-World Models?
GQA’s adoption has been swift since its introduction. Significant implementations include:
- LLaMA 2 (Meta, 2023): The 34B and 70B parameter models use GQA with 8 groups. For the 70B model with H = 64 query heads and G = 8, this reduces the KV cache by 8×. The 7B and 13B models kept MHA, but subsequent Meta research indicated GQA is beneficial across scales.
- Llama 3 (Meta, 2024): All model sizes (8B, 70B, and 405B) adopted GQA as the default attention mechanism, confirming its viability for both small and large architectures.
- Mistral 7B (Mistral AI, 2023): Uses GQA in conjunction with sliding window attention (a 4,096-token window), which bounds the KV cache each layer needs and improves efficiency on long sequences.
- Gemma 2 (Google, 2024): Implements GQA in some configurations, illustrating the technique’s cross-organizational appeal.
These models serve diverse applications—from chat to code generation—underscoring GQA’s versatility. As of 2026, GQA is a standard building block in LLM design, and many model libraries (e.g., Hugging Face Transformers) provide native support for configuring the number of key-value heads.
What Are the Practical Use Cases for Grouped Query Attention?
GQA’s efficiency gains primarily benefit inference scenarios:
- Low-latency serving: Reducing KV cache size decreases memory pressure, allowing higher request batching without out-of-memory errors. This directly improves throughput and per‑token cost in production APIs.
- Long-context models: The ability to handle larger sequence lengths under a fixed memory budget makes GQA essential for document‑length reasoning, retrieval‑augmented generation (RAG) contexts, and whole‑codebase understanding.
- Edge and mobile deployment: On memory‑constrained devices (phones, IoT), GQA enables smaller KV caches, critical for real‑time applications like on‑device assistants.
- CPU and hybrid inference: When combined with CPU offloading techniques, the reduced cache footprint makes offloading more feasible, as the transfer volume is lower.
- Low-cost adoption: Because GQA can be uptrained from MHA checkpoints, teams can convert existing models with a small amount of additional training rather than training from scratch, making it a practical upgrade path.
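The batching benefit listed above can be estimated with simple arithmetic. The sketch below asks how many full-length sequences fit in a fixed KV cache budget. The 20 GiB budget, 80 layers, head dimension 128, 4,096-token sequences, and 2-byte elements are all hypothetical values for illustration, and real servers also need memory for weights, activations, and fragmentation overhead.

```python
def max_batch_size(memory_budget_bytes, num_kv_heads, head_dim, seq_len,
                   num_layers, bytes_per_element=2):
    """Number of full-length sequences whose KV cache fits in the budget."""
    per_sequence = (2 * num_kv_heads * head_dim * seq_len
                    * num_layers * bytes_per_element)
    return memory_budget_bytes // per_sequence


budget = 20 * 2**30  # hypothetical memory left for the KV cache
print(max_batch_size(budget, num_kv_heads=64, head_dim=128, seq_len=4096, num_layers=80))  # 2
print(max_batch_size(budget, num_kv_heads=8, head_dim=128, seq_len=4096, num_layers=80))   # 16
```

With the same budget, moving from 64 to 8 KV heads raises the number of concurrent sequences from 2 to 16 in this toy setting.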
What Are the Benefits and Limitations of Grouped Query Attention?
Benefits
- Memory efficiency: The primary advantage—KV cache reduction by H/G factor—lowers GPU memory usage, enabling larger batches or longer sequences without hardware upgrades.
- Throughput improvement: Smaller caches mean faster KV reads and writes, reducing memory bandwidth contention; this can increase tokens‑per‑second, especially on memory‑bound hardware.
- Quality retention: GQA maintains model quality very close to MHA. In the original experiments on T5 models, an uptrained GQA model with 8 groups stayed close to MHA quality while running at a speed comparable to MQA.
- Flexibility: The number of groups can be tuned per model size or layer, allowing engineers to dial in the desired trade-off.
- Uptraining efficiency: GQA checkpoints can be bootstrapped from MHA checkpoints via weight averaging, shortening the convergence time and avoiding training‑from‑scratch costs.
Limitations
- Slight quality degradation: While small, the quality drop can be notable in tasks requiring extremely fine attention, such as precise factual recall or multi‑hop reasoning. For maximum accuracy, some models may still prefer MHA.
- Hardware utilization quirks: The grouped sharing can introduce irregular memory access patterns, potentially leaving some GPU threads underutilized. Optimized kernels (like FlashAttention‑3) mitigate this but require implementation effort.
- Hyperparameter sensitivity: The choice of G is not trivial; too few groups can hurt quality, while too many diminish the efficiency gains. The optimal value depends on model size, data distribution, and use case.
- Training complexity: Uptraining still requires a fraction of original training compute, and not all models readily transfer from MHA to GQA without architectural adjustments.
How Does Grouped Query Attention Compare to Multi-Head and Multi-Query Attention?
| Aspect | Multi-Head Attention (MHA) | Multi-Query Attention (MQA) | Grouped Query Attention (GQA) |
|---|---|---|---|
| KV head count | One per query head (H) | One (shared) | G (typically 2–8) |
| KV cache size | Large | Very small | Medium |
| Memory bandwidth | High (many KV tensors to load) | Low (single KV tensor) | Moderate |
| Model quality | Highest | Slight reduction | Very close to MHA |
| Training | Standard | Often from MHA uptraining or from scratch | Uptrained from MHA or trained with grouped init |
| Inference throughput | Lower (cache‑bound) | Higher (cache‑friendly, but possible compute bound) | Best overall trade‑off |
| Use when | Quality is paramount, compute‑bound scenarios | Extreme memory constraints, very large batches | Balanced requirement for quality and efficiency |
GQA is the practical middle ground: when you cannot afford the memory cost of MHA but need better quality than MQA, GQA delivers. As of 2026, the majority of open‑source LLMs adopt GQA, reflecting this balance.
Frequently Asked Questions
What is the difference between Grouped Query Attention and Multi-Head Attention?
Multi-Head Attention assigns a unique key and value head to each query head, providing maximum representational capacity but largest KV cache. Grouped Query Attention shares key and value heads across groups of query heads, compressing the cache while preserving distinct query subspaces. The result is a memory‑quality trade‑off.
How many groups should I use in Grouped Query Attention?
The number of groups (G) is a hyperparameter. Common choices are 8 or 4 for large models (e.g., 70B parameters). With H = 64 query heads, 4 to 8 groups correspond to a 16× to 8× smaller KV cache, and the original GQA paper found that 8 groups stayed close to MHA quality in its T5 experiments. Start with a value proportional to the total number of heads; a rule of thumb is G ≈ √H.
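When exploring candidate values, note that G has to divide H evenly so that every group has the same number of query heads. The sketch below lists the valid choices for a given H with their cache reduction factor, and picks the divisor closest to √H as a starting point. Both function names are illustrative, and the √H heuristic is only the rule of thumb mentioned above, not a guarantee of quality.

```python
import math


def group_options(num_query_heads):
    """Map each valid G (a divisor of H) to its KV cache reduction factor H / G."""
    return {
        g: num_query_heads // g
        for g in range(1, num_query_heads + 1)
        if num_query_heads % g == 0
    }


def closest_to_sqrt(num_query_heads):
    """Valid G nearest to sqrt(H); a starting point to validate empirically."""
    target = math.sqrt(num_query_heads)
    return min(group_options(num_query_heads), key=lambda g: abs(g - target))


print(group_options(64))    # {1: 64, 2: 32, 4: 16, 8: 8, 16: 4, 32: 2, 64: 1}
print(closest_to_sqrt(64))  # 8
print(closest_to_sqrt(32))  # 4
```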
Does Grouped Query Attention work during training as well as inference?
GQA reduces memory footprint during training if you use the reduced KV projections, but the main motivation is inference. During training, the savings are smaller because gradients and optimizer states dominate memory. However, GQA models can be trained from scratch or uptrained; uptraining from MHA is common to avoid training a new model entirely.
Can I convert an existing MHA model to Grouped Query Attention?
Yes, through uptraining. First, define the group structure (e.g., 8 groups). Initialize the GQA key and value weights by averaging the original MHA weights within each group. Then continue training the model for a fraction of the original training budget. The GQA paper demonstrates this method preserves most of the MHA quality while gaining the inference benefits.
Is Grouped Query Attention suitable for small models (under 1B parameters)?
It can be, but the memory savings are less impactful because small models already have manageable KV caches. The quality trade‑off might be more noticeable relative to the baseline. Many small models still use MHA, as memory constraints are less critical. However, recent families like Llama 3 apply GQA even at the 8B scale with good results.
How does Grouped Query Attention affect attention score computation?
The attention mechanism is identical per group: each query in a group computes dot‑product attention with the shared key and value for that group. The only difference is the structural sharing of the K and V projections. This means existing attention kernel optimizations (e.g., FlashAttention) can be adapted with minor changes to handle group‑sharing, maintaining performance.
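One simple way to reuse an attention routine written for MHA is to expand the G shared KV heads back to H entries before calling it. The sketch below shows that expansion on placeholder values. The name `repeat_kv` and the contiguous repetition order are assumptions for illustration; fused kernels can avoid the copy by indexing the shared head directly.

```python
def repeat_kv(kv_heads, num_query_heads):
    """Expand G KV heads to H entries by repeating each one H / G times.

    The returned list references the same underlying per-head objects, so
    no key or value data is duplicated in this pure-Python sketch.
    """
    num_kv_heads = len(kv_heads)
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("H must be divisible by G")
    n_rep = num_query_heads // num_kv_heads
    return [kv_heads[g] for g in range(num_kv_heads) for _ in range(n_rep)]


expanded = repeat_kv(["K0", "K1"], num_query_heads=4)
print(expanded)  # ['K0', 'K0', 'K1', 'K1']
```

After expansion, query head h pairs with entry h of the list, exactly as in MHA, while the cache itself still stores only G heads.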
As of 2026, Grouped Query Attention has become a de facto standard in LLM architecture, integrated into frameworks like PyTorch and JAX, and supported by inference libraries vLLM and TensorRT-LLM. Research is extending the idea to adaptive grouping schemes where the number of groups varies per layer for even finer trade‑offs, and to combined approaches that also quantize the KV cache for further compression.