Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Xingkai Yu
2026.01.14
arXiv · by 이호민
#LLM #Sparsity #ConditionalMemory #MoE #N-gram

Key Points

  1. This paper introduces Engram, a novel conditional memory module that modernizes N-gram embeddings for $O(1)$ knowledge lookup, aiming to provide a native primitive for static information retrieval in large language models.
  2. By formulating a Sparsity Allocation problem, the authors discover a U-shaped scaling law, demonstrating that Engram acts as an essential complement to Mixture-of-Experts (MoE) by optimizing the trade-off between neural computation and static memory.
  3. Large-scale experiments show Engram achieves superior performance across diverse benchmarks, including general reasoning and long-context tasks, while its deterministic addressing enables infrastructure-aware efficiency through runtime prefetching from host memory.

The paper introduces Conditional Memory via Scalable Lookup, a new sparsity axis for Large Language Models (LLMs), instantiated by a module called Engram. The core problem addressed is that traditional Transformers, especially Mixture-of-Experts (MoE) models, primarily scale capacity through conditional computation but lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. This leads to the backbone wasting computational depth on reconstructing static patterns.

Engram proposes conditional memory as a complementary sparsity axis to conditional computation. It modernizes classic N-gram embeddings for $O(1)$ lookup, aiming to offload static knowledge retrieval from the main computational path.

Core Methodology: Engram Architecture

The Engram module augments the Transformer backbone by structurally separating static pattern storage from dynamic computation. Given an input sequence $X = (x_1, \dots, x_T)$ and hidden states $H^{(\ell)} \in \mathbb{R}^{T \times d}$ at layer $\ell$, it processes each position $t$ in two phases: retrieval and fusion.

  1. Sparse Retrieval via Hashed N-grams:
    • Tokenizer Compression: To maximize semantic density and reduce the effective vocabulary size, a surjective function $P: V \to V'$ is pre-computed. It collapses raw token IDs $x_t$ into canonical IDs $x'_t = P(x_t)$ based on normalized textual equivalence (e.g., NFKC normalization, lowercasing). Suffix N-grams $g_{t,n} = (x'_{t-n+1}, \dots, x'_t)$ are formed from these compressed IDs.
    • Multi-Head Hashing: To parameterize the combinatorial space of N-grams without storing them directly, a hashing-based approach is used. For each N-gram order $n$, $K$ distinct hash heads map the compressed N-gram $g_{t,n}$ to an index within an embedding table $E_{n,k}$ (of prime size $M_{n,k}$) via a deterministic function $\varphi_{n,k}$:

$$z_{t,n,k} \triangleq \varphi_{n,k}(g_{t,n}), \qquad e_{t,n,k} = E_{n,k}[z_{t,n,k}]$$

The function $\varphi_{n,k}$ is implemented as a lightweight multiplicative-XOR hash. The final memory vector $e_t \in \mathbb{R}^{d_{\text{mem}}}$ is formed by concatenating all retrieved embeddings:

$$e_t \triangleq \big\Vert_{n=2}^{N} \big\Vert_{k=1}^{K} e_{t,n,k}$$

  2. Context-aware Gating and Fusion:
    • The retrieved embeddings $e_t$ are static priors. To enhance expressivity and mitigate noise from hash collisions or polysemy, a context-aware gating mechanism is employed. The current hidden state $h_t$ (which has aggregated global context) acts as a dynamic Query, while the retrieved memory $e_t$ serves as the source for both the Key and Value projections:

$$k_t = W_K e_t, \qquad v_t = W_V e_t$$

where $W_K, W_V$ are learnable projection matrices.
    • A scalar gate $\alpha_t \in (0, 1)$ is computed, applying RMSNorm to the Query and Key for gradient stability:

$$\alpha_t = \sigma \left( \frac{\text{RMSNorm}(h_t)^\top \text{RMSNorm}(k_t)}{\sqrt{d}} \right)$$

The gated output is $\tilde{v}_t = \alpha_t \cdot v_t$. This gate semantically aligns the retrieved memory with the current context, suppressing irrelevant or noisy retrievals.
    • To expand the receptive field and enhance non-linearity, a short depthwise causal convolution is applied. Let $\tilde{V} \in \mathbb{R}^{T \times d}$ be the sequence of gated values. The final output $Y$ is:

$$Y = \text{SiLU} \left( \text{Conv1D}(\text{RMSNorm}(\tilde{V})) \right) + \tilde{V}$$

    • The Engram module is integrated into the backbone via a residual connection: $H^{(\ell)} \leftarrow H^{(\ell)} + Y$. It is not applied at every layer; its placement is chosen to balance modeling performance against system latency constraints.
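The gating-and-fusion step can be sketched end to end in NumPy. Dimensions, weight initializations, and the explicit causal-padding loop are illustrative assumptions, not the paper's implementation:

```python
# Hedged sketch of the context-aware gate and convolutional fusion:
# alpha_t = sigma(<RMSNorm(h_t), RMSNorm(k_t)> / sqrt(d)), then a short
# depthwise causal Conv1D + SiLU with a residual back onto the gated values.
import numpy as np

rng = np.random.default_rng(1)
T, d, d_mem, KERNEL = 6, 16, 32, 4          # toy sizes (assumed)

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

H = rng.standard_normal((T, d))             # hidden states h_t (dynamic queries)
Et = rng.standard_normal((T, d_mem))        # retrieved memory vectors e_t
W_K = rng.standard_normal((d, d_mem)) * 0.1
W_V = rng.standard_normal((d, d_mem)) * 0.1
W_conv = rng.standard_normal((d, KERNEL)) * 0.1   # depthwise causal kernel

K = Et @ W_K.T                              # k_t = W_K e_t
V = Et @ W_V.T                              # v_t = W_V e_t

# Scalar gate per position, RMSNorm on Query and Key for stability.
alpha = sigmoid((rmsnorm(H) * rmsnorm(K)).sum(-1) / np.sqrt(d))
V_tilde = alpha[:, None] * V                # gated values, shape (T, d)

# Depthwise causal Conv1D: position t only sees positions t-KERNEL+1 .. t.
Xn = rmsnorm(V_tilde)
pad = np.concatenate([np.zeros((KERNEL - 1, d)), Xn])   # causal left-padding
conv = np.stack([(pad[t:t + KERNEL].T * W_conv).sum(-1) for t in range(T)])
Y = conv * sigmoid(conv) + V_tilde          # SiLU(x) = x * sigmoid(x); residual
```

In the full model, $Y$ would then be added residually onto $H^{(\ell)}$ at the layers where Engram is placed.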
  3. Integration with Multi-branch Architectures:
For multi-branch backbones (e.g., Manifold-Constrained Hyper-Connections, mHC), a parameter-sharing strategy is used. A single sparse embedding table and one Value projection matrix $W_V$ are shared across $M$ branches, while $M$ distinct Key projection matrices $\{W_K^{(m)}\}_{m=1}^M$ provide branch-specific gating. For the $m$-th branch with hidden state $h_t^{(m)}$, the branch-specific gating signal is:

$$\alpha_t^{(m)} = \sigma \left( \frac{\text{RMSNorm}(h_t^{(m)})^\top \text{RMSNorm}(W_K^{(m)} e_t)}{\sqrt{d}} \right)$$

The retrieved memory is modulated by these independent gates applied to the shared value vector: $u_t^{(m)} = \alpha_t^{(m)} \cdot (W_V e_t)$. This design allows the linear projections to be fused into a single dense FP8 matrix multiplication.

  4. System Efficiency: Decoupling Compute and Memory:
Engram's deterministic retrieval allows for efficient hardware-algorithm co-design.
    • Training: The massive embedding tables are sharded across GPUs using standard model parallelism; All-to-All communication gathers active rows and dispatches gradients.
    • Inference: Engram tables are offloaded to host memory. Since the indices are known before the forward pass, a prefetch-and-overlap strategy is used: the system asynchronously retrieves embeddings from host memory over PCIe, overlapping this communication with on-device computation of the preceding Transformer blocks. Because Engram is active only at specific layers, those preceding blocks provide a compute window that masks the transfer latency.
    • Multi-Level Cache Hierarchy: N-gram accesses follow a Zipfian distribution, so frequently accessed embeddings are cached in faster tiers (GPU HBM/host DRAM) while rare patterns reside in slower, high-capacity media (NVMe SSD), enabling scaling to massive capacities with minimal latency impact.
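The prefetch-and-overlap idea can be illustrated with a toy host-memory table and a background thread standing in for the asynchronous PCIe gather. Everything here (table, timings, names) is a conceptual stand-in, not the paper's runtime:

```python
# Conceptual sketch of prefetch-and-overlap: because Engram indices are
# deterministic and known before the forward pass, the embedding gather can
# start immediately and run concurrently with earlier Transformer blocks.
import threading
import queue
import time

# Stand-in for an embedding table resident in host memory.
host_table = {i: [float(i)] * 4 for i in range(1000)}

def prefetch(indices, out_q):
    """Simulates an asynchronous gather of the rows needed at the Engram layer."""
    rows = [host_table[i] for i in indices]
    out_q.put(rows)

def forward_pass(engram_indices):
    fetched = queue.Queue()
    t = threading.Thread(target=prefetch, args=(engram_indices, fetched))
    t.start()                       # kick off the gather before any compute
    # ... computation of the preceding Transformer blocks runs here,
    # overlapping with the host-memory transfer ...
    time.sleep(0.01)                # stand-in for that compute window
    rows = fetched.get()            # embeddings are ready at the Engram layer
    t.join()
    return rows

rows = forward_pass([3, 17, 42])
```

The same structure extends naturally to the cache hierarchy: a miss in the fast tier simply routes the same precomputed index to a slower tier.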

Scaling Laws and Sparsity Allocation

The paper investigates the synergy between MoE (conditional computation) and Engram (conditional memory) by formulating the Sparsity Allocation problem: find the optimal distribution of sparse capacity given a fixed total parameter budget $P_{\text{tot}}$ and a fixed number of activated parameters per token $P_{\text{act}}$. The inactive parameters $P_{\text{sparse}} = P_{\text{tot}} - P_{\text{act}}$ represent the "free" parameter budget. The allocation ratio $\rho \in [0, 1]$ is defined as the fraction of $P_{\text{sparse}}$ assigned to MoE experts:

$$P_{\text{MoE}}^{\text{(sparse)}} = \rho P_{\text{sparse}}, \qquad P_{\text{Engram}} = (1 - \rho) P_{\text{sparse}}$$

Experiments at two compute budgets ($2 \times 10^{20}$ and $6 \times 10^{20}$ FLOPs) reveal a consistent U-shaped relationship between validation loss and $\rho$. Pure MoE ($\rho = 1$) is suboptimal: reallocating approximately 20-25% of the sparse parameter budget to Engram yields the best performance (optimal $\rho \approx 75\%$-$80\%$). This U-shape confirms the structural complementarity: MoE-dominated models lack dedicated memory, while Engram-dominated models lose conditional computation capacity.
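As a back-of-the-envelope check of this split, the Engram-27B configuration reported below can be plugged into the allocation formulas (the paper's exact parameter accounting may differ slightly):

```python
# Sanity check of the sparsity-allocation arithmetic for Engram-27B:
# P_tot ~ 26.7B total, P_act ~ 3.8B activated, rho ~ 74.3% to MoE experts.
P_tot, P_act = 26.7e9, 3.8e9
P_sparse = P_tot - P_act                 # "free" sparse budget, ~22.9B
rho = 0.743                              # fraction allocated to MoE experts
P_moe = rho * P_sparse                   # sparse parameters kept in experts
P_engram = (1 - rho) * P_sparse          # ~5.9B, close to the reported 5.7B module
```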

In the "infinite memory regime," where the memory budget is relaxed, scaling the number of Engram memory slots yields consistent validation-loss improvements that follow a power law. This establishes Engram as a predictable and effective scaling knob, more efficient than direct N-gram integration methods such as OverEncoding.

Large Scale Pre-training

The paper validates Engram's efficacy in real-world pre-training. Four models are trained for 262B tokens with identical data and strictly matched activated parameters (3.8B):

  • Dense-4B: Baseline dense model (4.1B total params).
  • MoE-27B: MoE with 72 routed experts (top-6 activated), 26.7B total params.
  • Engram-27B: Derived from MoE-27B by reducing the number of routed experts from 72 to 55 and reallocating the freed parameters to a 5.7B-parameter Engram module, keeping 26.7B total parameters ($\rho \approx 74.3\%$). Engram modules are placed at layers 2 and 15, with a maximum N-gram order of 3, 8 hash heads, and a 1280-dimensional memory.
  • Engram-40B: Same backbone as Engram-27B but scales the Engram module to 18.5B params (39.5B total params).

Results (Table 1) show that the sparse models (MoE-27B, Engram-27B/40B) significantly outperform the iso-FLOPs Dense-4B. Crucially, Engram-27B achieves superior performance across diverse domains compared to the strictly iso-parameter and iso-FLOPs MoE-27B baseline, e.g., MMLU +3.4, CMMLU +4.0, BBH +5.0, ARC-Challenge +3.7, HumanEval +3.0, and MATH +2.4. Engram-40B improves further, showcasing the scaling potential of conditional memory.

Mechanistic analyses reveal that Engram relieves the backbone's early layers from static reconstruction, effectively deepening the network for complex reasoning. By delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 → 97.0). Finally, Engram's deterministic addressing allows runtime prefetching from host memory, incurring negligible overhead (<3%), bypassing GPU memory constraints and facilitating aggressive parameter expansion.