Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

2026.01.14
by 이호민
#LLM #Sparsity #ConditionalMemory #MoE #N-gram

Key Points

  1. This paper introduces "conditional memory" as a novel sparsity axis for large language models, complementary to MoE, addressing inefficient knowledge retrieval by separating static knowledge storage from dynamic computation.
  2. It proposes Engram, an O(1) lookup module that leverages hashed N-grams and context-aware gating to integrate this conditional memory into Transformer backbones.
  3. Experiments demonstrate Engram's superior performance over iso-parameter and iso-FLOPs MoE baselines across various benchmarks, identify a U-shaped scaling law for optimal sparsity allocation, and highlight its infrastructure efficiency due to deterministic prefetching.

The paper "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models" proposes a novel axis of sparsity, termed "conditional memory," to address the inefficiency of knowledge retrieval in Transformer models. This approach is presented as complementary to conditional computation methods like Mixture-of-Experts (MoE). The authors argue that traditional Transformers lack intrinsic primitives for knowledge retrieval, leading to inefficient computational search for information. To rectify this, they introduce the Engram module, which modernizes classical N-gram embeddings to enable O(1) lookup.

The core methodology revolves around the Engram architecture, which structurally separates static pattern storage from dynamic computation within the Transformer backbone. For an input sequence $X = (x_1, \ldots, x_T)$ and hidden states $H^{(\ell)} \in \mathbb{R}^{T \times d}$ at layer $\ell$, Engram processes each position $t$ through two stages: retrieval and fusion.

1. Sparse Retrieval via Hashed N-grams:

  • Tokenizer Compression: To mitigate the semantic redundancy common in subword tokenizers, a vocabulary compression layer is applied. A mapping function $P: V \to V'$ converts token IDs $x_t$ into normalized canonical IDs $x_t' = P(x_t)$ (e.g., via NFKC normalization and lowercasing). This reduces the effective vocabulary size and increases semantic density. The normalized IDs are then used to form suffix N-grams $g_{t,n} = (x'_{t-n+1}, \ldots, x'_t)$.
  • Multi-Head Hashing: Since parameterizing the entire combinatorial space of N-grams is impractical, a hashing-based approach is adopted. To alleviate collisions, $K$ distinct hash heads are employed for each N-gram order $n$. Each head $k$ maps the compressed context to an index within an embedding table $E_{n,k}$ (of size $M_{n,k}$): $z_{t,n,k} \triangleq \varphi_{n,k}(g_{t,n})$. The function $\varphi_{n,k}$ is realized as a lightweight multiplicative-XOR hash. The final memory vector $e_t \in \mathbb{R}^{d_{\text{mem}}}$ is formed by concatenating all retrieved embeddings: $e_t \triangleq \big\Vert_{n=2}^{N} \big\Vert_{k=1}^{K} e_{t,n,k}$.
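The retrieval stage above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the table sizes, embedding widths, N-gram orders, and the exact mixing step of the multiplicative-XOR hash are all assumptions.

```python
import numpy as np

# Hypothetical sizes; the paper does not pin these exact values.
TABLE_SIZE = 1 << 20   # slots per embedding table M_{n,k}
EMB_DIM    = 64        # per-head embedding width
N_MAX, K   = 3, 2      # N-gram orders 2..N_MAX, K hash heads per order

rng = np.random.default_rng(0)
# One embedding table E_{n,k} per (order, head) pair.
tables = {(n, k): rng.standard_normal((TABLE_SIZE, EMB_DIM)).astype(np.float32)
          for n in range(2, N_MAX + 1) for k in range(K)}
# Per-head odd multipliers for the multiplicative-XOR hash (assumed form).
mults = {(n, k): int(rng.integers(1, 1 << 61)) | 1
         for n in range(2, N_MAX + 1) for k in range(K)}

def hash_ngram(gram, n, k):
    """Lightweight multiplicative-XOR hash phi_{n,k} onto a table index."""
    h = 0
    for tok in gram:
        h = ((h * mults[(n, k)]) ^ tok) & ((1 << 64) - 1)
    return h % TABLE_SIZE

def engram_lookup(token_ids):
    """Return the concatenated memory vector e_t for each position t."""
    T = len(token_ids)
    out = np.zeros((T, (N_MAX - 1) * K * EMB_DIM), dtype=np.float32)
    for t in range(T):
        parts = []
        for n in range(2, N_MAX + 1):
            gram = tuple(token_ids[max(0, t - n + 1): t + 1])  # suffix N-gram
            for k in range(K):
                parts.append(tables[(n, k)][hash_ngram(gram, n, k)])
        out[t] = np.concatenate(parts)
    return out

e = engram_lookup([17, 4, 4, 923])  # compressed token IDs x'_t
print(e.shape)  # (4, 256): T x ((N_MAX-1) * K * EMB_DIM)
```

Because the indices depend only on the token IDs, the lookup is O(1) per position and fully deterministic, which is what later enables prefetching.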

2. Context-aware Gating (Fusion):
The retrieved embedding $e_t$ provides context-independent prior information but can be noisy due to hash collisions or ambiguity. To refine it, a context-aware gating mechanism is used, in which the current hidden state $h_t$ serves as a dynamic Query while $e_t$ is the source of the Key and Value projections: $k_t = W_K e_t$ and $v_t = W_V e_t$, where $W_K, W_V$ are learnable projection matrices. A gate $\alpha_t \in (0, 1)$ is computed as:

$$\alpha_t = \sigma\left( \frac{\text{RMSNorm}(h_t)^\top \text{RMSNorm}(k_t)}{\sqrt{d}} \right)$$
This gate modulates the retrieved value vector, defining the gated output $\tilde{v}_t = \alpha_t \cdot v_t$. This design lets the gate suppress noisy retrieved memory when it conflicts with the current context. To expand the receptive field and add non-linearity, a short depthwise causal convolution with kernel size $w = 4$, dilation $\delta$ (up to the maximum N-gram order), and SiLU activation is applied. The final output $Y$ is:

$$Y = \text{SiLU}\left(\text{Conv1D}(\text{RMSNorm}(\tilde{V}))\right) + \tilde{V}$$

The Engram module is integrated into the backbone via a residual connection, $H^{(\ell)} \leftarrow H^{(\ell)} + Y$, placed selectively at specific layers.
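The gate computation can be sketched as follows (the depthwise causal convolution is omitted for brevity). The dimensions and weight scales are illustrative assumptions, not values from the paper:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """RMS normalization along the last axis."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, d_mem, T = 8, 16, 5               # hidden width, memory width, seq length
rng = np.random.default_rng(1)
W_K = rng.standard_normal((d, d_mem)) * 0.1  # learnable in practice
W_V = rng.standard_normal((d, d_mem)) * 0.1

H = rng.standard_normal((T, d))      # hidden states h_t
E = rng.standard_normal((T, d_mem))  # retrieved memory vectors e_t

K_proj = E @ W_K.T                   # k_t = W_K e_t
V_proj = E @ W_V.T                   # v_t = W_V e_t
# Gate: sigmoid of the normalized dot product, scaled by 1/sqrt(d).
alpha = sigmoid(np.sum(rmsnorm(H) * rmsnorm(K_proj), axis=-1) / np.sqrt(d))
V_tilde = alpha[:, None] * V_proj    # gated output \tilde{v}_t = alpha_t * v_t

print(alpha.min() > 0 and alpha.max() < 1)  # True: gate stays in (0, 1)
```

When $h_t$ and $k_t$ point in opposite directions, the dot product is negative and $\alpha_t$ falls below 0.5, attenuating a retrieval that conflicts with the context.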

Integration with Multi-branch Architecture:
For multi-branch architectures such as Manifold-Constrained Hyper-Connections (mHC) with $M$ branches, a parameter-sharing strategy is employed. A single sparse embedding table and value projection $W_V$ are shared across all $M$ branches, while $M$ distinct Key projection matrices $\{W_K^{(m)}\}_{m=1}^M$ enable branch-specific gating. The gating signal for the $m$-th branch's hidden state $h_t^{(m)}$ is:

$$\alpha_t^{(m)} = \sigma\left( \frac{\text{RMSNorm}(h_t^{(m)})^\top \text{RMSNorm}(W_K^{(m)} e_t)}{\sqrt{d}} \right)$$

These independent gates then modulate the shared value vector $W_V e_t$: $u_t^{(m)} = \alpha_t^{(m)} \cdot (W_V e_t)$.
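The sharing scheme can be sketched for a single position: one retrieved vector and one value projection serve all branches, while each branch owns its Key projection. Branch count and widths below are arbitrary:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

M, d, d_mem = 3, 8, 16                           # branches, widths (assumed)
rng = np.random.default_rng(2)
W_Ks = rng.standard_normal((M, d, d_mem)) * 0.1  # one W_K^{(m)} per branch
W_V  = rng.standard_normal((d, d_mem)) * 0.1     # shared value projection

e_t = rng.standard_normal(d_mem)                 # one shared retrieved vector
h_t = rng.standard_normal((M, d))                # per-branch hidden states
v_shared = W_V @ e_t                             # shared value W_V e_t

# u_t^{(m)} = alpha_t^{(m)} * (W_V e_t), with branch-specific gates.
u = np.stack([
    sigmoid(rmsnorm(h_t[m]) @ rmsnorm(W_Ks[m] @ e_t) / np.sqrt(d)) * v_shared
    for m in range(M)
])
print(u.shape)  # (3, 8): M branch-specific gatings of one shared value
```

Only the small Key projections are duplicated per branch; the large embedding table is stored once.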

System Efficiency and Sparsity Allocation:
Engram's deterministic retrieval mechanism allows parameter storage to be decoupled from computational resources. Unlike MoE routing, Engram's retrieval indices depend solely on the input token sequence, enabling asynchronous prefetching during inference: because an Engram module sits at a specific layer, the computation of the preceding layers masks the prefetch latency, preventing GPU stalls. During training, embedding tables are sharded across multiple GPUs using model parallelism and All-to-All communication. Leveraging the Zipfian distribution of N-grams, a multi-level cache hierarchy caches frequent embeddings in faster tiers (GPU HBM, host DRAM) and stores rare patterns on slower but larger media (NVMe SSD).
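The tiered-cache idea can be illustrated with a toy two-level cache. The LRU policy and the in-memory "backing store" below are stand-ins for the HBM/DRAM/NVMe hierarchy, not details from the paper:

```python
from collections import OrderedDict

class TieredEmbeddingCache:
    """Toy fast-tier cache over a slow backing store (stand-in for keeping
    hot N-gram embeddings in HBM/DRAM while cold ones live on NVMe)."""

    def __init__(self, backing, hot_slots):
        self.backing = backing      # slow tier, e.g. a memory-mapped table
        self.hot = OrderedDict()    # fast tier with LRU eviction (assumed)
        self.hot_slots = hot_slots
        self.misses = 0

    def get(self, idx):
        if idx in self.hot:         # frequent N-grams hit the fast tier
            self.hot.move_to_end(idx)
            return self.hot[idx]
        self.misses += 1
        vec = self.backing[idx]     # slow-path fetch
        self.hot[idx] = vec
        if len(self.hot) > self.hot_slots:
            self.hot.popitem(last=False)  # evict least recently used
        return vec

backing = {i: [float(i)] * 4 for i in range(1000)}   # toy backing store
cache = TieredEmbeddingCache(backing, hot_slots=8)
# Zipf-like access pattern: index 0 dominates, so it is served from the
# fast tier after the first fetch.
for idx in [0, 1, 0, 2, 0, 3, 0, 0]:
    cache.get(idx)
print(cache.misses)  # 4: one slow-path fetch per distinct index
```

Under a Zipfian access distribution, a small fast tier absorbs most lookups, which is why offloading even very large tables adds little overhead.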

The paper defines a "Sparsity Allocation problem": how to optimally distribute a fixed total parameter budget between MoE experts and Engram embeddings. Experiments reveal a consistent U-shaped relationship between validation loss and the allocation ratio $\rho$, the proportion of inactive parameters allocated to MoE experts. Optimal performance is achieved when 20-25% of the total sparse parameter budget is reallocated to Engram, demonstrating the structural complementarity of the two modules. In the infinite-memory regime, Engram shows predictable scaling: validation loss consistently improves, following a linear relationship in log-space as memory slots increase, indicating continued benefits from larger memory without additional computation.
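The reported log-linear scaling trend can be illustrated with synthetic numbers (the slot counts and coefficients below are made up for illustration, not the paper's data):

```python
import numpy as np

# Idealized, noise-free "loss vs. memory slots" curve: linear in log10(slots).
slots = np.array([1e6, 4e6, 1.6e7, 6.4e7, 2.56e8])
loss = 2.80 - 0.05 * np.log10(slots)   # hypothetical coefficients

# Fit loss = a + b * log10(slots); a log-linear trend yields b < 0, meaning
# every 10x increase in memory buys a fixed reduction in validation loss.
b, a = np.polyfit(np.log10(slots), loss, 1)
print(round(b, 3))  # -0.05
```

The practical implication is that memory slots can be scaled independently of FLOPs: the loss keeps improving at a predictable rate as the table grows.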

Large Scale Pre-training and Experimental Results:
Four models (Dense-4B, MoE-27B, Engram-27B, Engram-40B) were pre-trained on 262 billion tokens, all with identical activated parameters, confirming Engram's efficacy. Engram-27B, which matches MoE-27B's exact total parameter count (by reducing MoE experts from 72 to 55 and reallocating parameters to a 5.7B Engram memory, $\rho = 74.3\%$), significantly outperformed MoE-27B across a wide range of benchmarks, including knowledge-intensive tasks (MMLU +3.4), general reasoning (BBH +5.0), and code/math domains (HumanEval +3.0). Engram-40B, which only expanded the Engram memory to 18.5B parameters while keeping the same backbone and computational budget, improved performance further, demonstrating Engram's robust scalability.

Mechanism Analysis and Infrastructure Efficiency:
Analyses using LogitLens and CKA suggest that Engram relieves the early layers of the backbone from reconstructing static knowledge, thereby increasing the effective depth available for complex reasoning. By offloading local dependencies to lookup, Engram lets the attention mechanism focus on global context, leading to superior performance in long-context scenarios (e.g., Multi-Query NIAH: 84.2 → 97.0). Infrastructure-wise, Engram's deterministic addressing facilitates runtime prefetching from host memory, incurring negligible overhead (<3%) even when offloading 100B-parameter tables, effectively bypassing GPU memory constraints and enabling aggressive parameter scaling.

The paper concludes by positing that conditional memory will be an indispensable modeling primitive for future sparse models.