Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Key Points
- This paper introduces "conditional memory" as a novel sparsity axis for large language models, complementary to MoE, that addresses inefficient knowledge retrieval by separating static knowledge storage from dynamic computation.
- It proposes Engram, an O(1) lookup module that leverages hashed N-grams and context-aware gating to integrate this conditional memory into Transformer backbones.
- Experiments demonstrate Engram's superior performance over iso-parameter and iso-FLOPs MoE baselines across various benchmarks, identify a U-shaped scaling law for optimal sparsity allocation, and highlight its infrastructure efficiency due to deterministic prefetching.
The paper "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models" proposes a novel axis of sparsity, termed "conditional memory," to address the inefficiency of knowledge retrieval in Transformer models. This approach is presented as complementary to conditional computation methods like Mixture-of-Experts (MoE). The authors argue that traditional Transformers lack intrinsic primitives for knowledge retrieval, leading to inefficient computational search for information. To rectify this, they introduce the Engram module, which modernizes classical N-gram embeddings to enable O(1) lookup.
The core methodology revolves around the Engram architecture, which structurally separates static pattern storage from dynamic computation within the Transformer backbone. Given an input token sequence and the hidden states at a chosen layer, Engram processes each position through two stages: retrieval and fusion.
1. Sparse Retrieval via Hashed N-grams:
- Tokenizer Compression: To mitigate the semantic redundancy common in subword tokenizers, a vocabulary compression layer is applied first. A mapping function converts token IDs into normalized canonical IDs (e.g., via NFKC normalization and lowercasing), which reduces the effective vocabulary size and increases semantic density. These normalized IDs are then used to form suffix N-grams.
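As a sketch of this compression step, the snippet below collapses surface variants to shared canonical IDs via NFKC normalization and lowercasing; the `canonical_id` helper and its table layout are illustrative assumptions, not the paper's implementation.

```python
import unicodedata

def canonical_id(token: str, canon_table: dict) -> int:
    """Map a surface token to a canonical ID via NFKC normalization and
    lowercasing. The mapping function and table layout are illustrative
    assumptions, not the paper's exact implementation."""
    key = unicodedata.normalize("NFKC", token).lower()
    if key not in canon_table:
        canon_table[key] = len(canon_table)  # assign the next free canonical ID
    return canon_table[key]

# Surface variants (case, Unicode ligatures) collapse to the same canonical ID
table: dict = {}
ids = [canonical_id(t, table) for t in ["Apple", "apple", "APPLE", "ﬁsh", "fish"]]
```

Collapsing such variants means distinct surface N-grams that carry the same meaning share one memory slot, which is what raises semantic density.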
- Multi-Head Hashing: Because parameterizing the entire combinatorial space of N-grams is impractical, a hashing-based approach is adopted. To alleviate collisions, several distinct hash heads are employed for each N-gram order. Each head maps the compressed context to an index within its own embedding table via a lightweight multiplicative-XOR hash. The final memory vector is formed by concatenating all retrieved embeddings.
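The retrieval stage can be sketched as follows, assuming a simple 64-bit multiplicative-XOR hash with illustrative multiplier constants and a per-(order, head) table layout; the paper's exact hash function and table organization may differ.

```python
import numpy as np

# Illustrative 64-bit multiplier constants, one per hash head; the paper
# does not specify its exact constants.
MULTS = [0x9E3779B97F4A7C15, 0xC2B2AE3D27D4EB4F]

def mx_hash(ngram, mult, table_size):
    """Lightweight multiplicative-XOR hash of a token-ID tuple (a sketch)."""
    h = 0
    for tok in ngram:
        h = ((h ^ tok) * mult) & 0xFFFFFFFFFFFFFFFF  # stay within 64 bits
    return h % table_size

def retrieve(ids, t, orders, tables, table_size):
    """Look up one embedding per (N-gram order, hash head) for the suffix
    N-gram ending at position t, then concatenate. Layout is assumed."""
    parts = []
    for n in orders:
        ngram = tuple(ids[max(0, t - n + 1): t + 1])  # suffix N-gram
        for head, mult in enumerate(MULTS):           # distinct heads per order
            idx = mx_hash(ngram, mult, table_size)
            parts.append(tables[n][head][idx])
    return np.concatenate(parts)

rng = np.random.default_rng(0)
table_size, dim = 1024, 8
tables = {n: [rng.standard_normal((table_size, dim)) for _ in MULTS] for n in (2, 3)}
m = retrieve([5, 7, 7, 9], t=3, orders=(2, 3), tables=tables, table_size=table_size)
```

Because the indices depend only on the token IDs, the same context always retrieves the same slots, which is the deterministic-addressing property the systems section relies on.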
2. Context-aware Gating (Fusion):
The retrieved embedding provides context-independent prior information but can be noisy due to hash collisions or ambiguity. To refine it, a context-aware gating mechanism is used: the current hidden state serves as a dynamic Query, while the retrieved memory vector is projected through learnable matrices to produce the Key and Value. A gate is then computed from the similarity between the Query and the Key.
This gate modulates the retrieved Value vector, yielding the gated output, and can suppress noisy retrieved memory when it conflicts with the current context. To expand the receptive field and add non-linearity, a short depthwise causal convolution with SiLU activation is then applied, with dilation scaled up to the maximum N-gram order, producing the module's final output.
The Engram module is integrated into the backbone through a residual connection and is selectively placed at specific layers.
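Putting the fusion stage together, here is a minimal NumPy sketch of the gate, the depthwise causal convolution, and the residual connection; the scaled-dot-product sigmoid gate, all shapes, and the `engram_fusion` helper are assumptions for illustration, not the paper's exact formulas.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    return x * sigmoid(x)

def engram_fusion(h, m, Wk, Wv, conv_kernel, dilation):
    """Gate retrieved memory against the hidden state, then apply a
    depthwise causal convolution with SiLU and a residual connection.
    Gate form and shapes are illustrative assumptions."""
    T, d = h.shape
    k = m @ Wk                                    # Keys from retrieved memory
    v = m @ Wv                                    # Values from retrieved memory
    g = sigmoid((h * k).sum(-1, keepdims=True) / np.sqrt(d))  # per-position gate
    z = g * v                                     # suppress noisy retrievals
    K = conv_kernel.shape[0]
    out = np.zeros_like(z)
    for t in range(T):                            # depthwise causal convolution
        for j in range(K):
            s = t - j * dilation                  # only past positions contribute
            if s >= 0:
                out[t] += conv_kernel[j] * z[s]
    return h + silu(out)                          # residual into the backbone

rng = np.random.default_rng(0)
T, d_mem, d = 6, 32, 16
h = rng.standard_normal((T, d))
m = rng.standard_normal((T, d_mem))
Wk = rng.standard_normal((d_mem, d)) * 0.1
Wv = rng.standard_normal((d_mem, d)) * 0.1
ck = rng.standard_normal((3, d))
y = engram_fusion(h, m, Wk, Wv, conv_kernel=ck, dilation=2)
```

Note that causality is preserved: a position's output never depends on memory retrieved at later positions, so the module remains compatible with autoregressive decoding.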
Integration with Multi-branch Architecture:
For multi-branch architectures such as Manifold-Constrained Hyper-Connections (mHC), a parameter-sharing strategy is employed. A single sparse embedding table and Value projection matrix are shared across all branches, while each branch has its own Key projection matrix, enabling branch-specific gating. The gating signal for each branch is computed from that branch's hidden state, and these independent gates then modulate the shared Value vector.
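This sharing scheme can be sketched as below: one shared Value projection, one gate per branch via that branch's own Key matrix. The gate form mirrors the single-branch description and is an assumption, not the paper's exact formula.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multi_branch_gating(hs, m, Wks, Wv):
    """Shared value vector, branch-specific gates (illustrative sketch)."""
    v = m @ Wv                                    # Value shared across branches
    d = hs[0].shape[-1]
    outs = []
    for h, Wk in zip(hs, Wks):                    # branch-specific Key matrices
        g = sigmoid((h * (m @ Wk)).sum(-1, keepdims=True) / np.sqrt(d))
        outs.append(g * v)                        # independent gate, shared Value
    return outs

rng = np.random.default_rng(1)
T, d_mem, d, branches = 4, 32, 16, 3
m = rng.standard_normal((T, d_mem))
hs = [rng.standard_normal((T, d)) for _ in range(branches)]
Wks = [rng.standard_normal((d_mem, d)) * 0.1 for _ in range(branches)]
Wv = rng.standard_normal((d_mem, d)) * 0.1
outs = multi_branch_gating(hs, m, Wks, Wv)
```

The design keeps the large, memory-hungry components (embedding table, Value projection) at a single copy while paying only a small per-branch cost for the Key matrices.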
System Efficiency and Sparsity Allocation:
Engram's deterministic retrieval mechanism decouples parameter storage from computational resources. Unlike MoE routing, Engram's retrieval indices depend solely on the input token sequence, so embeddings can be prefetched asynchronously during inference: Engram modules placed at later layers can overlap their lookups with the computation of preceding layers, preventing GPU stalls. During training, embedding tables are sharded across multiple GPUs using model parallelism and All-to-All communication. Leveraging the Zipfian distribution of N-grams, a multi-level cache hierarchy caches frequent embeddings in faster tiers (GPU HBM, host DRAM) and stores rare patterns on slower but larger media (NVMe SSD).
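A toy two-tier version of such a cache hierarchy might look like the following, with a small dict standing in for GPU HBM and the full table for host memory; the promote-on-first-touch policy is a placeholder, since the paper's actual eviction policy is not detailed here.

```python
import numpy as np

class TieredEmbeddingStore:
    """Two-tier lookup sketch: 'hot' stands in for GPU HBM, 'backing'
    for host DRAM / NVMe. Policy is a simplifying assumption."""

    def __init__(self, backing, hot_capacity):
        self.backing = backing          # full embedding table (slow tier)
        self.hot = {}                   # small fast tier
        self.hot_capacity = hot_capacity
        self.hits = 0
        self.misses = 0

    def get(self, idx):
        if idx in self.hot:
            self.hits += 1
            return self.hot[idx]
        self.misses += 1
        vec = self.backing[idx]
        if len(self.hot) < self.hot_capacity:
            self.hot[idx] = vec         # keep Zipfian-head rows in the fast tier
        return vec

rng = np.random.default_rng(2)
table = rng.standard_normal((10_000, 8))
store = TieredEmbeddingStore(table, hot_capacity=100)
# Zipf-like access pattern: a small head of indices dominates the traffic
idxs = np.minimum(rng.zipf(1.5, size=2_000) - 1, 9_999)
for i in idxs:
    store.get(int(i))
hit_rate = store.hits / (store.hits + store.misses)
```

Under a skewed access distribution, even a fast tier holding 1% of the rows absorbs most lookups, which is the property that makes offloading the cold tail to slower media cheap.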
The paper defines a "sparsity allocation" problem: how to optimally distribute a fixed total parameter budget between MoE experts and Engram embeddings. Experiments reveal a consistent U-shaped relationship between validation loss and the allocation ratio, defined as the proportion of inactive parameters allocated to MoE experts. Optimal performance is achieved when 20-25% of the total sparse parameter budget is reallocated to Engram, demonstrating the structural complementarity of the two modules. In the infinite-memory regime, Engram exhibits predictable scaling: validation loss improves consistently and follows a linear relationship in log-space as the number of memory slots increases, indicating continued benefits from larger memory without additional computation.
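As a back-of-the-envelope illustration of the allocation ratio, the helper below splits a hypothetical sparse budget between MoE experts and Engram memory; the budget figure and ratio are round examples near the reported optimum, not the paper's configuration.

```python
def split_sparse_budget(total_sparse_params, rho):
    """rho: fraction of the inactive-parameter budget kept in MoE experts;
    the remainder goes to Engram (20-25% at the reported optimum).
    Values here are illustrative, not the paper's exact numbers."""
    moe = rho * total_sparse_params
    engram = (1.0 - rho) * total_sparse_params
    return moe, engram

# e.g., a 24B sparse budget with 22% reallocated to Engram
moe_params, engram_params = split_sparse_budget(24e9, rho=0.78)
```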
Large Scale Pre-training and Experimental Results:
Pre-training four models (Dense-4B, MoE-27B, Engram-27B, Engram-40B) on 262 billion tokens, all with identical activated parameters, confirmed Engram's efficacy. Engram-27B, which matches MoE-27B's total parameter count exactly (by reducing the MoE experts from 72 to 55 and reallocating the freed parameters to a 5.7B-parameter Engram memory), significantly outperformed MoE-27B across a wide range of benchmarks, including knowledge-intensive tasks (MMLU +3.4), general reasoning (BBH +5.0), and code/math domains (HumanEval +3.0). Engram-40B, which expanded only the Engram memory to 18.5B parameters while keeping the same backbone and computational budget, improved performance further, demonstrating Engram's robust scalability.
Mechanism Analysis and Infrastructure Efficiency:
Analyses using LogitLens and CKA suggest that Engram alleviates the burden on early layers of the backbone from reconstructing static knowledge, thereby increasing the effective depth for complex reasoning. By offloading local dependencies to lookup, Engram enables attention mechanisms to focus on global context, leading to superior performance in long-context scenarios (e.g., Multi-Query NIAH: 84.2 → 97.0). Infrastructure-wise, Engram's deterministic addressing facilitates runtime prefetching from host memory, incurring negligible overhead (<3%) even when offloading 100B-parameter tables, effectively bypassing GPU memory constraints and enabling aggressive parameter scaling.
The paper concludes by positing that conditional memory will be an indispensable modeling primitive for future sparse models.