
Attention Residuals
Key Points
- Attention Residuals (AttnRes) replace fixed residual accumulation in LLMs with softmax attention over preceding layer outputs, enabling selective, input-dependent aggregation across depth.
- For scalability, Block AttnRes partitions layers into blocks, applying attention over block-level representations to reduce memory and communication overhead, supported by infrastructure optimizations like cross-stage caching and a two-phase computation strategy.
- AttnRes consistently improves performance across model scales, mitigates PreNorm dilution by yielding more uniform hidden-state magnitudes and gradient distributions, and enhances downstream task performance.
The paper introduces Attention Residuals (AttnRes), a novel mechanism to address limitations of standard residual connections in deep learning models, particularly large language models (LLMs). Current PreNorm residual connections, formulated as $x_l = x_{l-1} + F_l(\mathrm{Norm}(x_{l-1}))$, accumulate all layer outputs with fixed unit weights: unrolled over depth, $x_L = x_0 + \sum_{l=1}^{L} F_l(\mathrm{Norm}(x_{l-1}))$. This uniform aggregation leads to uncontrolled hidden-state growth (magnitudes growing as $O(\sqrt{L})$ with depth $L$) and progressive dilution of individual layer contributions, effectively burying early-layer information.
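The dilution effect is easy to see numerically. The toy simulation below (an illustration, not the paper's code) stands in for layer outputs with independent random vectors and tracks the hidden-state norm, which grows roughly as $\sqrt{L}$ under fixed unit-weight accumulation:

```python
import numpy as np

# Toy demonstration: with fixed unit-weight residuals, the hidden state is a
# sum of roughly independent layer outputs, so its norm grows on the order of
# sqrt(L) with depth L, shrinking each individual layer's relative share.
rng = np.random.default_rng(0)
d, L = 512, 64

x = rng.standard_normal(d)          # token embedding
norms = []
for _ in range(L):
    f_out = rng.standard_normal(d)  # stand-in for a layer output F_l(Norm(x))
    x = x + f_out                   # PreNorm residual: fixed unit weight
    norms.append(np.linalg.norm(x))

# For independent outputs, the final norm is close to sqrt((L + 1) * d).
print(norms[-1], np.sqrt((L + 1) * d))
```

Each layer's output keeps unit norm while the stream it is added to keeps growing, so later additions perturb the accumulated state less and less.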
AttnRes draws a formal duality between depth-wise accumulation and sequential recurrence in Recurrent Neural Networks (RNNs), proposing an attention-based solution analogous to how Transformers improved upon RNNs for sequence modeling. Instead of fixed additive accumulation, AttnRes replaces it with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights.
The core formulation for AttnRes is given by:

$$x_l = \sum_{i=0}^{l} \alpha_{l,i}\, v_i$$

where $\alpha_{l,i}$ are layer-specific attention weights. These weights are computed using an exponential dot-product kernel to yield softmax attention over depth:

$$\alpha_{l,i} = \frac{\exp(q_l^\top k_i)}{\sum_{j=0}^{l} \exp(q_l^\top k_j)}$$

For each layer $l$, the query is a layer-specific learnable pseudo-query vector $q_l$. The key/value vectors are defined as:

$$v_0 = e, \qquad v_i = F_i(x_{i-1}) \;\; (i \ge 1), \qquad k_i = \mathrm{RMSNorm}(v_i),$$

where $e$ is the token embedding and $v_i$ is the output of layer $i$. RMSNorm is applied to keys ($k_i$) to prevent large-magnitude outputs from dominating attention weights. This formulation, termed Full Attention Residuals, enables selective, content-aware retrieval across depth.
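The depth-wise softmax with RMSNorm-ed keys can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the paper's implementation; the helper names and the unscaled dot-product score are assumptions:

```python
import numpy as np

def rms_norm(v, eps=1e-6):
    # RMSNorm on keys so large-magnitude layer outputs
    # cannot dominate the attention weights.
    return v / np.sqrt(np.mean(v * v, axis=-1, keepdims=True) + eps)

def attn_res(values, q_l):
    """Depth-wise softmax attention for one layer (Full AttnRes sketch).

    values: (l+1, d) array -- token embedding v_0 plus outputs of layers 1..l.
    q_l:    (d,) learnable pseudo-query vector for layer l.
    """
    keys = rms_norm(values)              # k_i = RMSNorm(v_i)
    scores = keys @ q_l                  # one score per depth position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over depth
    return weights @ values              # selective aggregate across layers

rng = np.random.default_rng(0)
d = 16
values = rng.standard_normal((5, d))     # embedding + 4 layer outputs
q = rng.standard_normal(d)
out = attn_res(values, q)
print(out.shape)  # (16,)
```

Because the weights form a softmax, the result is always a convex combination of the embedding and the preceding layer outputs, unlike the unbounded sum of standard residuals.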
While Full AttnRes introduces negligible overhead in standard training (as layer outputs are already retained for backpropagation), its memory and communication footprint becomes a bottleneck at scale with activation recomputation and pipeline parallelism. To address this, the paper proposes Block Attention Residuals (Block AttnRes). This variant partitions the $L$ layers into $B$ blocks of $L/B$ consecutive layers, where layers $(b-1)L/B + 1, \dots, bL/B$ form block $b$. Within each block, layer outputs are reduced to a single representation by summation:

$$s_b = \sum_{i=(b-1)L/B + 1}^{bL/B} v_i$$
Cross-block attention is then applied only over these block-level summaries and the token embedding. Specifically, for the $l$-th layer in block $b$, the value matrix for attention becomes:

$$V = \left[\, e,\; s_1,\; \dots,\; s_{b-1},\; \tilde{s}_b \,\right]$$

where $e$ is the token embedding and $\tilde{s}_b$ is the partial sum of previous layer outputs within the current block $b$. This reduces memory and communication overhead from $O(L)$ to $O(B)$. The number of blocks $B$ interpolates between standard residuals ($B = 1$) and Full AttnRes ($B = L$), with moderate values of $B$ found to be empirically effective.
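The shrinking value set under blocking can be sketched as follows; the helper name `block_values`, the 0-based indexing, and the shapes are assumptions for illustration, not the paper's code:

```python
import numpy as np

# Sketch of Block AttnRes value construction: completed blocks are collapsed
# to single summed representations, and the current block contributes only
# the running partial sum of the layers computed so far.
rng = np.random.default_rng(0)
d, L, B = 16, 8, 4                      # 8 layers grouped into 4 blocks of 2
layer_outs = rng.standard_normal((L, d))
embedding = rng.standard_normal(d)

def block_values(layer_idx):
    """Value rows seen by layer `layer_idx` (0-based): the embedding, the
    sums of completed blocks, and the partial sum inside the current block."""
    per_block = L // B
    b = layer_idx // per_block          # index of the current block
    done = [layer_outs[k * per_block:(k + 1) * per_block].sum(0)
            for k in range(b)]          # s_1 .. s_{b-1} (completed blocks)
    partial = layer_outs[b * per_block:layer_idx].sum(0)
    rows = [embedding] + done + ([partial] if layer_idx % per_block else [])
    return np.stack(rows)

# The fifth layer (index 5) sits in the third block: it attends over the
# embedding, two completed block sums, and one intra-block partial sum,
# i.e. O(B) rows instead of O(L).
print(block_values(5).shape)
```

The attention itself is unchanged; only the set of keys/values it runs over is compressed from one entry per layer to roughly one per block.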
The paper also details infrastructure optimizations for Block AttnRes to ensure practical deployment at scale:
- Training Efficiency: For pipeline parallelism, a cross-stage caching mechanism is introduced. Instead of naïvely re-transmitting all accumulated block representations at every stage transition, completed block representations are cached locally at each stage, so only newly produced block summaries need to cross stage boundaries. This sharply reduces communication volume across physical and virtual pipeline stages, significantly improving efficiency.
- Inference Efficiency: A two-phase computation strategy is employed.
- Phase 1 (Parallel Inter-Block Attention): For all layers within a block, their pseudo-queries are batched into a single matrix multiplication against the cached inter-block representations. This amortizes memory access, requiring one read of the cached representations per block instead of one per layer.
- Phase 2 (Sequential Intra-Block Attention + Online Softmax Merge): Intra-block attention is computed sequentially using the evolving partial sum, and the results are then merged with Phase 1 outputs via online softmax.
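The merge step in Phase 2 can be sketched with the standard online-softmax identity. The helper names `partial_attn` and `merge` are hypothetical, but the rescaling algebra is exactly what makes two separately normalized attention passes combine as if one softmax had covered both key sets:

```python
import numpy as np

def partial_attn(q, keys, values):
    """One attention pass that also returns its online-softmax state:
    the local max score m and the local normalizer z."""
    scores = keys @ q
    m = scores.max()
    w = np.exp(scores - m)
    z = w.sum()
    return w @ values / z, m, z

def merge(out1, m1, z1, out2, m2, z2):
    # Online-softmax merge: rescale each partial result by its share of the
    # combined normalizer, reproducing one softmax over both key sets.
    m = max(m1, m2)
    z = z1 * np.exp(m1 - m) + z2 * np.exp(m2 - m)
    return (out1 * z1 * np.exp(m1 - m) + out2 * z2 * np.exp(m2 - m)) / z

rng = np.random.default_rng(0)
d = 16
q = rng.standard_normal(d)
k1, v1 = rng.standard_normal((3, d)), rng.standard_normal((3, d))  # Phase 1: inter-block
k2, v2 = rng.standard_normal((2, d)), rng.standard_normal((2, d))  # Phase 2: intra-block

merged = merge(*partial_attn(q, k1, v1), *partial_attn(q, k2, v2))
# Reference: a single softmax over the concatenated keys/values.
ref, _, _ = partial_attn(q, np.vstack([k1, k2]), np.vstack([v1, v2]))
print(np.allclose(merged, ref))  # True
```

Keeping `(m, z)` alongside each partial output is what lets the inter-block result be computed once per block while the intra-block part evolves layer by layer.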
The paper argues that standard residual connections and prior recurrence-based variants can be seen as performing depth-wise linear attention. AttnRes generalizes this to depth-wise softmax attention, completing for the depth dimension the same linear-to-softmax transition that proved transformative over the sequence dimension. Initializing pseudo-query vectors to zero ensures that AttnRes starts as an equal-weight average, preventing training volatility.
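The zero-initialization claim is straightforward to verify: with a zero pseudo-query, every depth-wise score is equal, so the softmax is uniform and AttnRes reduces to an equal-weight average at the start of training. A minimal check:

```python
import numpy as np

# With q_l = 0, every dot-product score q_l . k_i is 0, so the softmax
# over depth is uniform and the output equals the plain average of the
# embedding and all preceding layer outputs.
d, n = 8, 5
values = np.arange(n * d, dtype=float).reshape(n, d)
scores = np.zeros(n)                    # q_l = 0 -> all scores equal
weights = np.exp(scores) / np.exp(scores).sum()
out = weights @ values
print(np.allclose(out, values.mean(axis=0)))  # True
```

This gives the model the behavior of (normalized) standard residuals at initialization, with the learned, input-dependent weighting emerging only as the pseudo-queries move away from zero.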