Paper

Attention Residuals

Guangyu Chen

2026.03.24

· arXiv · by 네루

#Attention#Deep Learning#LLM#Residual Connections#Transformer

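Key Points

1. Attention Residuals (AttnRes) addresses the fixed unit-weight accumulation of the residual connections in existing LLMs. Each layer selectively aggregates the outputs of earlier layers using learned, input-dependent weights obtained by softmax attention over those outputs.
2. To reduce the memory and communication overhead that arises when training large models, Block AttnRes partitions the layers into blocks and applies attention only over block-level representations, reducing the cost from O(Ld) for L layers to O(Nd) for N blocks.
3. AttnRes mitigates PreNorm dilution and makes the gradient distribution more uniform, giving consistent improvements over the baseline in scaling-law experiments and in pre-training a 48B model.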
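To make the first two points concrete, here is a minimal NumPy sketch for a single token. It is not the paper's implementation: the learned per-layer `query` vector, the use of the earlier outputs themselves as both keys and values, the 1/sqrt(d) scaling, and summing each block's outputs to form its block-level representation are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def unit_residual(outputs):
    """Standard residual stream: every earlier output gets weight 1."""
    return outputs.sum(axis=0)

def attn_residual(outputs, query):
    """Softmax-weighted mix of earlier outputs.

    outputs: (n, d) array of earlier layer outputs for one token.
    query:   (d,) vector standing in for a learned parameter (assumption).
    Returns the mixed state (d,) and the weights (n,).
    """
    d = outputs.shape[-1]
    scores = outputs @ query / np.sqrt(d)  # one score per earlier output
    weights = softmax(scores)              # input-dependent, sums to 1
    return weights @ outputs, weights

def block_attn_residual(outputs, query, block_size):
    """Attend over N block summaries instead of all L layer outputs.

    Each block summary is the sum of its layers' outputs (assumed choice),
    so the attended set shrinks from (L, d) to (N, d).
    """
    n = outputs.shape[0]
    n_blocks = -(-n // block_size)  # ceiling division
    blocks = np.stack([
        outputs[b * block_size:(b + 1) * block_size].sum(axis=0)
        for b in range(n_blocks)
    ])
    return attn_residual(blocks, query)

# Example: 8 layer outputs of width 16, grouped into 2 blocks of 4.
rng = np.random.default_rng(0)
layer_outputs = rng.standard_normal((8, 16))
learned_query = rng.standard_normal(16)
h_full, w_full = attn_residual(layer_outputs, learned_query)
h_block, w_block = block_attn_residual(layer_outputs, learned_query, block_size=4)
```

With an all-zero `query` the weights are uniform, so the result is the unit-weight sum divided by the number of sources. Any other query shifts weight toward the outputs that align with it, which is where the input dependence comes from. The block variant scores N block summaries rather than L layer outputs.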

\( h_l = \sum_{i} v_i \)
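The unit-weight sum above can be examined numerically. The sketch below assumes the sum runs over all layers up to and including l, and uses random vectors of roughly unit norm as stand-ins for layer outputs. Both are illustrative assumptions, as is reading the shrinking per-layer share as the PreNorm dilution named in the key points.

```python
import numpy as np

def unit_weight_stream(outputs):
    """Accumulated states h_l = sum of v_i for i <= l (assumed index range).

    outputs: (L, d) array, one row per layer output v_i.
    """
    return np.cumsum(outputs, axis=0)

def layer_share(outputs):
    """Norm of each layer's own output relative to the state it is added into."""
    h = unit_weight_stream(outputs)
    return np.linalg.norm(outputs, axis=1) / np.linalg.norm(h, axis=1)

rng = np.random.default_rng(0)
num_layers, width = 64, 256
# Random stand-ins for layer outputs, each with norm close to 1.
v = rng.standard_normal((num_layers, width)) / np.sqrt(width)

h_norm = np.linalg.norm(unit_weight_stream(v), axis=1)
share = layer_share(v)

print("state norm at first / last layer:", h_norm[0], h_norm[-1])
print("layer share at first / last layer:", share[0], share[-1])
```

Under these assumptions the accumulated state keeps growing with depth while each new output makes up a smaller fraction of it. A softmax-weighted mix is a convex combination, so its norm cannot exceed that of the largest source it attends over.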
