Attention Residuals


Guangyu Chen
2026.03.24
· Arxiv · by 넀루
#Attention #Deep Learning #LLM #Residual Connections #Transformer

Key Points

  1. Attention Residuals (AttnRes) replaces fixed residual accumulation in LLMs with softmax attention over preceding layer outputs, enabling selective, input-dependent aggregation across depth.
  2. For scalability, Block AttnRes partitions layers into blocks and applies attention over block-level representations to reduce memory and communication overhead, supported by infrastructure optimizations such as cross-stage caching and a two-phase computation strategy.
  3. AttnRes consistently improves performance across model scales, mitigates PreNorm dilution by yielding more uniform hidden-state magnitudes and gradient distributions, and enhances downstream task performance.

The paper introduces Attention Residuals (AttnRes), a novel mechanism to address limitations of standard residual connections in deep learning models, particularly large language models (LLMs). Current PreNorm residual connections, formulated as $h_l = h_{l-1} + f_{l-1}(h_{l-1})$, accumulate all layer outputs with fixed unit weights. This uniform aggregation leads to uncontrolled hidden-state growth (magnitudes growing as $O(L)$ with depth $L$) and progressive dilution of individual layer contributions, effectively burying early-layer information.
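The dilution effect is easy to see numerically. A minimal NumPy sketch (illustrative dimensions and random stand-ins for layer outputs, not the paper's setup): with fixed unit weights, the hidden state is a sum of all layer outputs, so any single layer's share of the total magnitude shrinks as depth grows.

```python
# Minimal sketch of PreNorm residual dilution: h_l = h_{l-1} + f(h_{l-1})
# accumulates every layer output with weight 1, so the earliest
# contribution becomes a vanishing fraction of the final hidden state.
import numpy as np

rng = np.random.default_rng(0)
d, L = 64, 48                        # hidden size and depth (illustrative)

h = rng.standard_normal(d)           # token embedding h_1
first_contrib = h.copy()             # remember the earliest contribution
for _ in range(L):
    h = h + rng.standard_normal(d)   # stand-in for f_{l-1}(h_{l-1})

# Fraction of the final hidden-state magnitude explained by h_1.
share = np.linalg.norm(first_contrib) / np.linalg.norm(h)
print(f"norm of h after {L} layers: {np.linalg.norm(h):.1f}")
print(f"early-layer share of magnitude: {share:.2f}")
```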

AttnRes draws a formal duality between depth-wise accumulation and sequential recurrence in Recurrent Neural Networks (RNNs), proposing an attention-based solution analogous to how Transformers improved upon RNNs for sequence modeling. AttnRes replaces the fixed additive accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights.

The core formulation for AttnRes is given by:
$$h_l = \sum_{i=0}^{l-1} \alpha_{i \to l} \cdot v_i$$

where $\alpha_{i \to l}$ are layer-specific attention weights. These weights are computed using a kernel function $\phi(q, k) = \exp\left(q^\top \mathrm{RMSNorm}(k)\right)$ to yield softmax attention over depth:

$$\alpha_{i \to l} = \frac{\phi(q_l, k_i)}{\sum_{j=0}^{l-1} \phi(q_l, k_j)}$$

For each layer $l$, the query $q_l$ is a layer-specific learnable pseudo-query vector $w_l \in \mathbb{R}^d$. The key/value vectors $v_i$ are defined as:

$$v_i = \begin{cases} h_1 & i = 0 \\ f_i(h_i) & 1 \le i \le l-1 \end{cases}$$

where $h_1$ is the token embedding and $f_i(h_i)$ is the output of layer $i$. RMSNorm is applied to the keys $k_i$ to prevent large-magnitude outputs from dominating the attention weights. This formulation, termed Full Attention Residuals, enables selective, content-aware retrieval across depth.
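The formulation above can be sketched in a few lines of NumPy. This is a minimal, unbatched illustration of the depth-wise attention rule; the function and variable names (`attn_res_forward`, `w` for the pseudo-queries) are hypothetical, and the toy `tanh` layers stand in for real Transformer blocks.

```python
# Full Attention Residuals, depth-wise: each layer's input h_l is a
# softmax-weighted combination of the embedding and all earlier outputs.
import numpy as np

def rmsnorm(x, eps=1e-6):
    # RMSNorm applied to keys so large-magnitude outputs don't dominate.
    return x / np.sqrt(np.mean(x * x) + eps)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attn_res_forward(h1, layers, w):
    """h1: (d,) token embedding; layers: L callables f_l; w: (L, d) pseudo-queries.
    Returns the attention-aggregated inputs h_l and the value list v_i."""
    values = [h1]                          # v_0 = h_1
    hs = []
    for l, f in enumerate(layers):
        keys = np.stack([rmsnorm(v) for v in values])
        alpha = softmax(keys @ w[l])       # alpha_{i -> l} over depth
        h_l = alpha @ np.stack(values)     # h_l = sum_i alpha_{i->l} * v_i
        hs.append(h_l)
        values.append(f(h_l))              # v_l = f_l(h_l)
    return hs, values

rng = np.random.default_rng(0)
d, L = 16, 4
mats = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]
layers = [lambda x, M=M: np.tanh(M @ x) for M in mats]   # toy stand-in layers
w = rng.standard_normal((L, d)) * 0.1                    # pseudo-queries
hs, values = attn_res_forward(rng.standard_normal(d), layers, w)
print(len(hs), hs[-1].shape)
```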

While Full AttnRes introduces negligible overhead in standard training (layer outputs are already retained for backpropagation), its $O(Ld)$ memory and communication footprint becomes a bottleneck at scale with activation recomputation and pipeline parallelism. To address this, the paper proposes Block Attention Residuals (Block AttnRes). This variant partitions the $L$ layers into $N$ blocks of $S = L/N$ layers each. Within each block, layer outputs are reduced to a single representation by summation:
$$b_n = \sum_{j \in B_n} f_j(h_j)$$
Cross-block attention is then applied only over these $N$ block-level summaries and the token embedding. Specifically, for the $i$-th layer in block $n$, the value matrix $V$ for attention becomes:

$$V = \begin{cases} [b_0, b_1, \dots, b_{n-1}]^\top & \text{if } i = 1 \text{ (first layer of block } n\text{)} \\ [b_0, b_1, \dots, b_{n-1}, b_{i-1}^{n}]^\top & \text{if } i \ge 2 \text{ (subsequent layers)} \end{cases}$$

where $b_0 = h_1$ and $b_{i-1}^{n}$ is the partial sum of previous layer outputs within the current block $n$. This reduces memory and communication overhead from $O(Ld)$ to $O(Nd)$. The number of blocks $N$ interpolates between standard residuals ($N = 1$) and Full AttnRes ($N = L$), with $N \approx 8$ found to be empirically effective.
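A small sketch of how the value matrix is assembled per layer, following the case split above (the helper name `block_values` is hypothetical; real implementations would keep these as GPU tensors):

```python
# Block AttnRes value assembly: completed block summaries b_0..b_{n-1},
# plus, from the second layer of the block onward, the running partial
# sum b^n_{i-1} of outputs inside the current block.
import numpy as np

def block_values(block_summaries, partial_sum, i):
    """block_summaries: [b_0, ..., b_{n-1}] (b_0 = token embedding h_1);
    partial_sum: b^n_{i-1}, or None for the first layer; i: 1-based index."""
    if i == 1:                 # first layer of block n: no intra-block sum yet
        return np.stack(block_summaries)
    return np.stack(block_summaries + [partial_sum])

d = 8
b = [np.full(d, float(k)) for k in range(3)]   # toy b_0, b_1, b_2
partial = np.full(d, 7.0)                      # running sum inside block 3
V1 = block_values(b, None, i=1)
V2 = block_values(b, partial, i=2)
print(V1.shape, V2.shape)   # (3, 8) (4, 8)
```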

The paper also details infrastructure optimizations for Block AttnRes to ensure practical deployment at scale:

  1. Training Efficiency: For pipeline parallelism, a cross-stage caching mechanism is introduced. Instead of naïvely re-transmitting all accumulated block representations at every transition (costing $O(C(C-1) N_p d)$ communication for $C$ chunks), blocks are cached locally. This reduces communication to $O(P^2 N_p d + (V-1) P^2 N_p d)$ for $P$ physical and $V$ virtual stages, significantly improving efficiency.
  2. Inference Efficiency: A two-phase computation strategy is employed.
    • Phase 1 (Parallel Inter-Block Attention): For all $S$ layers within a block, their pseudo-queries $w_l$ are batched into a single matrix multiplication against the cached inter-block representations. This amortizes memory access from $S$ reads to one per block.
    • Phase 2 (Sequential Intra-Block Attention + Online Softmax Merge): Intra-block attention is computed sequentially using the evolving partial sum, and the results are then merged with Phase 1 outputs via online softmax.
This strategy ensures that the per-layer memory access cost remains low, making the inference latency overhead less than 2%. For long-context prefilling, block representations are sharded along the sequence dimension across tensor-parallel devices, reducing per-device memory footprint.
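The online-softmax merge in Phase 2 can be illustrated concretely. The sketch below (hypothetical helper names; the standard max-rescaling trick, as used in flash-attention-style kernels) merges two partial attention results and checks that the merge exactly reproduces a single softmax over all keys:

```python
# Online softmax merge: combine two partial attention results (e.g.
# Phase 1 inter-block and Phase 2 intra-block) without materializing
# one joint softmax over all keys at once.
import numpy as np

def partial_attn(q, K, V):
    """Return (max logit, sum of exponentials, weighted value sum) for a chunk."""
    s = K @ q
    m = s.max()
    e = np.exp(s - m)
    return m, e.sum(), e @ V

def merge(p1, p2):
    """Merge two partial softmax results, rescaling both by the joint max."""
    m1, z1, o1 = p1
    m2, z2, o2 = p2
    m = max(m1, m2)
    z = z1 * np.exp(m1 - m) + z2 * np.exp(m2 - m)
    o = o1 * np.exp(m1 - m) + o2 * np.exp(m2 - m)
    return m, z, o

rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal(d)
K, V = rng.standard_normal((6, d)), rng.standard_normal((6, d))

# Reference: one softmax over all 6 keys at once.
s = K @ q
a = np.exp(s - s.max())
ref = (a / a.sum()) @ V

# Two-phase: split keys/values into chunks, merge the partial results.
m, z, o = merge(partial_attn(q, K[:4], V[:4]), partial_attn(q, K[4:], V[4:]))
print(np.allclose(o / z, ref))  # True: merge matches the full softmax
```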

The paper argues that standard residual connections and prior recurrence-based variants can be seen as performing depth-wise linear attention. AttnRes generalizes this to depth-wise softmax attention, completing for the depth dimension the same linear-to-softmax transition that proved transformative over the sequence dimension. Initializing the pseudo-query vectors $w_l$ to zero ensures that AttnRes starts as an equal-weight average, preventing training volatility.
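A quick check of the zero-initialization claim: with $q_l = 0$, every logit $q_l^\top \mathrm{RMSNorm}(k_i)$ is zero regardless of the keys, so the softmax degenerates to uniform weights and the aggregation is an equal-weight average.

```python
# Zero-initialized pseudo-query -> all depth-wise logits are 0 ->
# uniform attention weights over the 5 available values.
import numpy as np

q = np.zeros(16)                                        # zero-init pseudo-query
keys = np.random.default_rng(0).standard_normal((5, 16))
logits = keys @ q                                       # all zeros, any keys
alpha = np.exp(logits) / np.exp(logits).sum()
print(alpha)                                            # five equal weights, 0.2 each
```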