Attention Residuals


Guangyu Chen
2026.03.24
· Arxiv · by 넀루
#Attention #Deep Learning #LLM #Residual Connections #Transformer

Key Points

  1. Attention Residuals (AttnRes) replaces fixed residual accumulation in LLMs with softmax attention over preceding layer outputs, enabling selective, input-dependent aggregation across depth.
  2. For scalability, Block AttnRes partitions layers into blocks and applies attention over block-level representations to reduce memory and communication overhead, supported by infrastructure optimizations such as cross-stage caching and a two-phase computation strategy.
  3. AttnRes consistently improves performance across model scales, mitigates PreNorm dilution by yielding more uniform hidden-state magnitudes and gradient distributions, and enhances downstream task performance.

The paper introduces Attention Residuals (AttnRes), a novel mechanism to address limitations of standard residual connections in deep learning models, particularly large language models (LLMs). Current PreNorm residual connections, formulated as $h_l = h_{l-1} + f_{l-1}(h_{l-1})$, accumulate all layer outputs with fixed unit weights. This uniform aggregation leads to uncontrolled hidden-state growth (magnitudes growing as $O(L)$ with depth $L$) and progressive dilution of individual layer contributions, effectively burying early-layer information.
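The dilution effect is easy to see numerically. A minimal NumPy sketch (illustrative dimensions and random stand-ins for layer outputs, not the paper's setup): with fixed unit weights, the hidden state is a sum of all layer outputs, so any single layer's share of the total magnitude shrinks as depth grows.

```python
# Minimal sketch of PreNorm residual dilution: h_l = h_{l-1} + f(h_{l-1})
# accumulates every layer output with weight 1, so the earliest
# contribution becomes a vanishing fraction of the final hidden state.
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 48                        # hidden size and depth (illustrative)

h = rng.standard_normal(d)           # token embedding h_1
first_contrib = h.copy()             # remember the earliest contribution
for _ in range(L):
    h = h + rng.standard_normal(d)   # stand-in for f_{l-1}(h_{l-1})

# Fraction of the final hidden-state magnitude explained by h_1.
share = np.linalg.norm(first_contrib) / np.linalg.norm(h)
print(f"norm of h after {L} layers: {np.linalg.norm(h):.1f}")
print(f"early-layer share of magnitude: {share:.2f}")
```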

AttnRes draws a formal duality between depth-wise accumulation and sequential recurrence in Recurrent Neural Networks (RNNs), proposing an attention-based solution analogous to how Transformers improved upon RNNs for sequence modeling. AttnRes replaces the fixed additive accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights.

The core formulation for AttnRes is given by:
$$h_l = \sum_{i=0}^{l-1} \alpha_{i \to l} \cdot v_i$$

where $\alpha_{i \to l}$ are layer-specific attention weights. These weights are computed using a kernel function $\phi(q, k) = \exp\left(q^\top \mathrm{RMSNorm}(k)\right)$ to yield softmax attention over depth:

$$\alpha_{i \to l} = \frac{\phi(q_l, k_i)}{\sum_{j=0}^{l-1} \phi(q_l, k_j)}$$

For each layer $l$, the query $q_l$ is a layer-specific learnable pseudo-query vector $w_l \in \mathbb{R}^d$. The key/value vectors $v_i$ are defined as:

$$v_i = \begin{cases} h_1 & i = 0 \\ f_i(h_i) & 1 \le i \le l-1 \end{cases}$$

where $h_1$ is the token embedding and $f_i(h_i)$ is the output of layer $i$. RMSNorm is applied to the keys $k_i$ to prevent large-magnitude outputs from dominating the attention weights. This formulation, termed Full Attention Residuals, enables selective, content-aware retrieval across depth.
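The formulation above can be sketched in a few lines of NumPy. This is a minimal, unbatched illustration of the depth-wise attention rule; the function and variable names (`attn_res_forward`, `w` for the pseudo-queries) are hypothetical, and the toy `tanh` layers stand in for real Transformer blocks.

```python
# Full Attention Residuals, depth-wise: each layer's input h_l is a
# softmax-weighted combination of the embedding and all earlier outputs.
import numpy as np

def rmsnorm(x, eps=1e-6):
    # RMSNorm applied to keys so large-magnitude outputs don't dominate.
    return x / np.sqrt(np.mean(x * x) + eps)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attn_res_forward(h1, layers, w):
    """h1: (d,) token embedding; layers: L callables f_l; w: (L, d) pseudo-queries.
    Returns the attention-aggregated inputs h_l and the value list v_i."""
    values = [h1]                          # v_0 = h_1
    hs = []
    for l, f in enumerate(layers):
        keys = np.stack([rmsnorm(v) for v in values])
        alpha = softmax(keys @ w[l])       # alpha_{i -> l} over depth
        h_l = alpha @ np.stack(values)     # h_l = sum_i alpha_{i->l} * v_i
        hs.append(h_l)
        values.append(f(h_l))              # v_l = f_l(h_l)
    return hs, values

rng = np.random.default_rng(0)
d, L = 16, 4
mats = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]
layers = [lambda x, M=M: np.tanh(M @ x) for M in mats]   # toy stand-in layers
w = rng.standard_normal((L, d)) * 0.1                    # pseudo-queries
hs, values = attn_res_forward(rng.standard_normal(d), layers, w)
print(len(hs), hs[-1].shape)
```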

While Full AttnRes introduces negligible overhead in standard training (layer outputs are already retained for backpropagation), its $O(Ld)$ memory and communication footprint becomes a bottleneck at scale with activation recomputation and pipeline parallelism. To address this, the paper proposes Block Attention Residuals (Block AttnRes). This variant partitions the $L$ layers into $N$ blocks of $S = L/N$ layers each. Within each block, layer outputs are reduced to a single representation by summation:
$$b_n = \sum_{j \in B_n} f_j(h_j)$$
Cross-block attention is then applied only over these $N$ block-level summaries and the token embedding. Specifically, for the $i$-th layer in block $n$, the value matrix $V$ for attention becomes:

$$V = \begin{cases} [b_0, b_1, \dots, b_{n-1}]^\top & \text{if } i = 1 \text{ (first layer of block } n\text{)} \\ [b_0, b_1, \dots, b_{n-1}, b_{i-1}^{n}]^\top & \text{if } i \ge 2 \text{ (subsequent layers)} \end{cases}$$

where $b_0 = h_1$ and $b_{i-1}^{n}$ is the partial sum of previous layer outputs within the current block $n$. This reduces memory and communication overhead from $O(Ld)$ to $O(Nd)$. The number of blocks $N$ interpolates between standard residuals ($N = 1$) and Full AttnRes ($N = L$), with $N \approx 8$ found to be empirically effective.
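A small sketch of how the value matrix is assembled per layer, following the case split above (the helper name `block_values` is hypothetical; real implementations would keep these as GPU tensors):

```python
# Block AttnRes value assembly: completed block summaries b_0..b_{n-1},
# plus, from the second layer of the block onward, the running partial
# sum b^n_{i-1} of outputs inside the current block.
import numpy as np

def block_values(block_summaries, partial_sum, i):
    """block_summaries: [b_0, ..., b_{n-1}] (b_0 = token embedding h_1);
    partial_sum: b^n_{i-1}, or None for the first layer; i: 1-based index."""
    if i == 1:                 # first layer of block n: no intra-block sum yet
        return np.stack(block_summaries)
    return np.stack(block_summaries + [partial_sum])

d = 8
b = [np.full(d, float(k)) for k in range(3)]   # toy b_0, b_1, b_2
partial = np.full(d, 7.0)                      # running sum inside block 3
V1 = block_values(b, None, i=1)
V2 = block_values(b, partial, i=2)
print(V1.shape, V2.shape)   # (3, 8) (4, 8)
```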

The paper also details infrastructure optimizations for Block AttnRes to ensure practical deployment at scale:

  1. Training Efficiency: For pipeline parallelism, a cross-stage caching mechanism is introduced. Instead of naïvely re-transmitting all accumulated block representations at every transition (costing $O(C(C-1) N_p d)$ communication for $C$ chunks), blocks are cached locally. This reduces communication to $O(P^2 N_p d + (V-1) P^2 N_p d)$ for $P$ physical and $V$ virtual stages, significantly improving efficiency.
  2. Inference Efficiency: A two-phase computation strategy is employed.
    • Phase 1 (Parallel Inter-Block Attention): For all $S$ layers within a block, their pseudo-queries $w_l$ are batched into a single matrix multiplication against the cached inter-block representations. This amortizes memory access from $S$ reads to one per block.
    • Phase 2 (Sequential Intra-Block Attention + Online Softmax Merge): Intra-block attention is computed sequentially using the evolving partial sum, and the results are then merged with Phase 1 outputs via online softmax.
This strategy ensures that the per-layer memory access cost remains low, making the inference latency overhead less than 2%. For long-context prefilling, block representations are sharded along the sequence dimension across tensor-parallel devices, reducing per-device memory footprint.
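The online-softmax merge in Phase 2 can be illustrated concretely. The sketch below (hypothetical helper names; the standard max-rescaling trick, as used in flash-attention-style kernels) merges two partial attention results and checks that the merge exactly reproduces a single softmax over all keys:

```python
# Online softmax merge: combine two partial attention results (e.g.
# Phase 1 inter-block and Phase 2 intra-block) without materializing
# one joint softmax over all keys at once.
import numpy as np

def partial_attn(q, K, V):
    """Return (max logit, sum of exponentials, weighted value sum) for a chunk."""
    s = K @ q
    m = s.max()
    e = np.exp(s - m)
    return m, e.sum(), e @ V

def merge(p1, p2):
    """Merge two partial softmax results, rescaling both by the joint max."""
    m1, z1, o1 = p1
    m2, z2, o2 = p2
    m = max(m1, m2)
    z = z1 * np.exp(m1 - m) + z2 * np.exp(m2 - m)
    o = o1 * np.exp(m1 - m) + o2 * np.exp(m2 - m)
    return m, z, o

rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal(d)
K, V = rng.standard_normal((6, d)), rng.standard_normal((6, d))

# Reference: one softmax over all 6 keys at once.
s = K @ q
a = np.exp(s - s.max())
ref = (a / a.sum()) @ V

# Two-phase: split keys/values into chunks, merge the partial results.
m, z, o = merge(partial_attn(q, K[:4], V[:4]), partial_attn(q, K[4:], V[4:]))
print(np.allclose(o / z, ref))  # True: merge matches the full softmax
```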

The paper argues that standard residual connections and prior recurrence-based variants can be seen as performing depth-wise linear attention. AttnRes generalizes this to depth-wise softmax attention, completing for the depth dimension the same linear-to-softmax transition that proved transformative over the sequence dimension. Initializing the pseudo-query vectors $w_l$ to zero ensures that AttnRes starts as an equal-weight average, preventing training volatility.
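A quick check of the zero-initialization claim: with $q_l = 0$, every logit $q_l^\top \mathrm{RMSNorm}(k_i)$ is zero regardless of the keys, so the softmax degenerates to uniform weights and the aggregation is an equal-weight average.

```python
# Zero-initialized pseudo-query -> all depth-wise logits are 0 ->
# uniform attention weights over the 5 available values.
import numpy as np

q = np.zeros(16)                                        # zero-init pseudo-query
keys = np.random.default_rng(0).standard_normal((5, 16))
logits = keys @ q                                       # all zeros, any keys
alpha = np.exp(logits) / np.exp(logits).sum()
print(alpha)                                            # five equal weights, 0.2 each
```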