mHC: Manifold-Constrained Hyper-Connections

Yixuan Wei
2026.01.10
· Arxiv · by 네루
#LLM #Deep Learning #Neural Network Architecture #Residual Connection #Foundational Models

Key Points

  1. Hyper-Connections (HC) enhance model performance by widening the residual stream, but their unconstrained nature compromises the identity mapping property, leading to training instability and significant memory access overhead.
  2. Manifold-Constrained Hyper-Connections (mHC) address this by projecting HC's residual mapping onto the manifold of doubly stochastic matrices, restoring the identity mapping property and ensuring stable signal propagation through norm preservation.
  3. Furthermore, mHC incorporates rigorous infrastructure optimizations such as kernel fusion and selective recomputation, enabling efficient and stable large-scale training with tangible performance improvements and superior scalability.

Manifold-Constrained Hyper-Connections (mHC) addresses the numerical instability and high memory access overhead encountered in Hyper-Connections (HC) by restoring the identity mapping property inherent to residual connections.

The standard residual connection is formulated as $\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l, \mathbf{W}_l)$, which yields an identity mapping $\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, \mathbf{W}_i)$ across multiple layers, ensuring stable signal propagation. HC extends this paradigm by widening the residual stream from $C$ to $n \times C$ and introducing learnable mappings: $\mathbf{x}_{l+1} = \mathbf{H}^{\text{res}}_l \mathbf{x}_l + (\mathbf{H}^{\text{post}}_l)^\top \mathcal{F}(\mathbf{H}^{\text{pre}}_l \mathbf{x}_l, \mathbf{W}_l)$. While this enhances topological complexity and performance, HC's unconstrained nature compromises the identity mapping property. Recursively, signal propagation from layer $l$ to $L$ is governed by the composite mapping $\prod_{i=l}^{L-1} \mathbf{H}^{\text{res}}_{L-i}$. Since $\mathbf{H}^{\text{res}}_l$ is unconstrained, this product can amplify or attenuate signals without bound, causing training instability and exploding/vanishing gradients, particularly in large-scale models. HC also introduces significant memory access overhead proportional to $n$, due to the widened residual stream and the storage of intermediate activations.
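As a concrete reference, the HC update and the depth-wise composite product can be sketched in NumPy. The layer function `F`, the random initializations, and all shapes below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, L = 4, 8, 32  # expansion rate, hidden width, depth (hypothetical values)

x = rng.standard_normal((n, C))       # widened residual stream: n copies of width C
H_res = rng.standard_normal((n, n))   # residual mapping, unconstrained in plain HC
H_pre = rng.standard_normal((1, n))   # reads the layer input out of the stream
H_post = rng.standard_normal((1, n))  # writes the layer output back into the stream

def F(h):
    # Stand-in for the layer function F(., W_l), e.g. an attention or MLP block.
    return np.tanh(h)

# HC update: x_{l+1} = H_res x_l + H_post^T F(H_pre x_l)
x_next = H_res @ x + H_post.T @ F(H_pre @ x)
print(x_next.shape)  # (4, 8)

# The instability argument: the composite mapping prod_i H_res_i is a product of
# unconstrained matrices, so its spectral norm can drift away from 1 with depth.
M = np.eye(n)
for _ in range(L):
    M = (np.eye(n) + 0.05 * rng.standard_normal((n, n))) @ M
print(np.linalg.norm(M, 2))  # typically != 1: amplification or attenuation
```

The second print makes the depth argument tangible: each factor is close to identity, yet the 32-fold product is free to drift, which is exactly what the doubly stochastic constraint below rules out.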

mHC proposes to project the residual connection space of HC onto a specific manifold. The core methodology constrains the residual mapping $\mathbf{H}^{\text{res}}_l$ to be a doubly stochastic matrix: $\mathbf{H}^{\text{res}}_l \in \mathcal{M}^{\text{res}}$, where $\mathcal{M}^{\text{res}} \coloneqq \{\mathbf{H} \in \mathbb{R}^{n \times n} \mid \mathbf{H}\mathbf{1}_n = \mathbf{1}_n,\ \mathbf{1}_n^\top \mathbf{H} = \mathbf{1}_n^\top,\ \mathbf{H} \ge 0\}$. This manifold constraint ensures:

  1. Norm Preservation: the spectral norm of a doubly stochastic matrix is bounded by 1 ($\|\mathbf{H}^{\text{res}}_l\|_2 \le 1$), mitigating gradient explosion.
  2. Compositional Closure: the product of doubly stochastic matrices is itself doubly stochastic, so the composite mapping $\prod_{i=l}^{L-1} \mathbf{H}^{\text{res}}_{L-i}$ remains stable at arbitrary depth.
  3. Feature Fusion: doubly stochastic matrices are convex combinations of permutation matrices, promoting robust information mixing across residual streams.
Additionally, mHC imposes non-negativity constraints on the input mapping $\mathbf{H}^{\text{pre}}_l$ and output mapping $\mathbf{H}^{\text{post}}_l$ to prevent signal cancellation.
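The three properties above are easy to check numerically. The helper below is a hypothetical Sinkhorn-style sampler (not the paper's operator) that draws approximately doubly stochastic matrices and verifies closure and the norm bound:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

def random_doubly_stochastic(n, iters=200):
    # Alternating row/column normalization of a positive matrix converges
    # to a doubly stochastic matrix (rows and columns each sum to 1).
    M = np.exp(rng.standard_normal((n, n)))
    for _ in range(iters):
        M /= M.sum(axis=1, keepdims=True)  # row normalization
        M /= M.sum(axis=0, keepdims=True)  # column normalization
    return M

A = random_doubly_stochastic(n)
B = random_doubly_stochastic(n)
P = A @ B  # compositional closure: the product stays on the manifold

# Rows and columns of the product still sum to 1 ...
print(P.sum(axis=1), P.sum(axis=0))
# ... and the spectral norm stays bounded by 1, so arbitrarily deep
# compositions can neither explode nor vanish.
print(np.linalg.norm(A, 2), np.linalg.norm(P, 2))
```

The norm bound follows because a convex combination of permutation matrices (each with spectral norm 1) has spectral norm at most 1.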

For parameterization and manifold projection, given the flattened input $\mathbf{x}_l' = \text{RMSNorm}(\mathbf{x}_l)$, the intermediate dynamic mappings are first computed:
$\begin{cases}
\tilde{H}^{\text{pre}}_l = \alpha^{\text{pre}}_l \cdot (\mathbf{x}_l' \phi^{\text{pre}}_l) + b^{\text{pre}}_l \\
\tilde{H}^{\text{post}}_l = \alpha^{\text{post}}_l \cdot (\mathbf{x}_l' \phi^{\text{post}}_l) + b^{\text{post}}_l \\
\tilde{H}^{\text{res}}_l = \alpha^{\text{res}}_l \cdot \text{mat}(\mathbf{x}_l' \phi^{\text{res}}_l) + b^{\text{res}}_l
\end{cases}$
where $\phi$ are linear projections, $\alpha$ are learnable scaling factors, and $b$ are learnable biases. The final constrained mappings are then obtained via:
$\begin{cases}
H^{\text{pre}}_l = \sigma(\tilde{H}^{\text{pre}}_l) \\
H^{\text{post}}_l = 2\sigma(\tilde{H}^{\text{post}}_l) \\
H^{\text{res}}_l = \text{Sinkhorn-Knopp}(\tilde{H}^{\text{res}}_l)
\end{cases}$
The Sinkhorn-Knopp operator first applies an element-wise exponentiation $\mathbf{M}^{(0)} = \exp(\tilde{H}^{\text{res}}_l)$, followed by an iterative normalization process $\mathbf{M}^{(t)} = \mathbf{T}_r(\mathbf{T}_c(\mathbf{M}^{(t-1)}))$, where $\mathbf{T}_r$ and $\mathbf{T}_c$ denote row and column normalization respectively, ensuring the resulting matrix is doubly stochastic.
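A minimal NumPy sketch of this projection step follows. The random pre-activations stand in for the dynamic $\alpha \cdot (\mathbf{x}_l' \phi) + b$ mappings above; only the nonlinearities and the Sinkhorn-Knopp loop mirror the formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sinkhorn_knopp(H_tilde, iters=20):
    # M^(0) = exp(H~_res): element-wise exponentiation guarantees positivity.
    M = np.exp(H_tilde)
    for _ in range(iters):
        M = M / M.sum(axis=0, keepdims=True)  # T_c: column normalization
        M = M / M.sum(axis=1, keepdims=True)  # T_r: row normalization
    return M

rng = np.random.default_rng(0)
n = 4

# Hypothetical pre-activations standing in for H~_pre, H~_post, H~_res.
H_pre  = sigmoid(rng.standard_normal((1, n)))      # non-negative, entries in (0, 1)
H_post = 2 * sigmoid(rng.standard_normal((1, n)))  # non-negative, entries in (0, 2)
H_res  = sinkhorn_knopp(rng.standard_normal((n, n)))

print(H_res.sum(axis=1))  # rows sum to 1 exactly after the final T_r step
print(H_res.sum(axis=0))  # columns sum to ~1 after enough iterations
```

Note that a finite number of iterations only approximates double stochasticity: whichever normalization runs last is exact, while the other axis converges geometrically.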

To ensure efficiency, mHC incorporates rigorous infrastructure optimizations:

  1. Kernel Fusion: RMSNorm is reordered to follow the matrix multiplication for efficiency, and mixed-precision strategies are employed. Operations that share memory access are fused into unified compute kernels to reduce memory bandwidth bottlenecks: for instance, two scans over $\mathbf{x}_l$ are fused into one kernel, and the backward pass is similarly consolidated. Lightweight operations on small coefficients are fused to reduce kernel launch overhead, and the Sinkhorn-Knopp iteration is implemented as a single kernel with a custom backward kernel that recomputes intermediate results on-chip.
  2. Recomputing: intermediate activations of the mHC kernels are discarded after the forward pass and recomputed on the fly during the backward pass. This significantly reduces the GPU memory footprint by avoiding storage of the heavy layer function $\mathcal{F}$ activations: for a block of $L_r$ consecutive layers, only the block input $\mathbf{x}_{l_0}$ needs to be stored.
  3. Overlapping Communication: communication is carefully overlapped within the DualPipe schedule to mitigate the $n$-fold increase in communication cost under pipeline parallelism.
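The recomputation strategy can be illustrated with a toy block of layers. This plain-Python sketch (hypothetical `layer` function, no autograd or CUDA kernels) only shows the store-one-input, recompute-the-rest trade-off:

```python
import numpy as np

def layer(x, W):
    # Hypothetical stand-in for one layer's forward computation.
    return np.tanh(x @ W)

def block_forward(x, Ws, keep_intermediates):
    # Run L_r consecutive layers; optionally keep every layer input.
    acts = []
    for W in Ws:
        if keep_intermediates:
            acts.append(x)
        x = layer(x, W)
    return x, acts

rng = np.random.default_rng(0)
L_r = 4                                  # layers per recompute block
Ws = [0.1 * rng.standard_normal((8, 8)) for _ in range(L_r)]
x0 = rng.standard_normal((2, 8))         # x_{l0}: the only tensor kept in memory

# Forward pass: discard intermediates, store only the block input x0.
y, _ = block_forward(x0, Ws, keep_intermediates=False)

# Backward pass: recompute the L_r intermediate activations from x0 on the fly,
# trading extra FLOPs for an L_r-fold smaller activation footprint.
y2, acts = block_forward(x0, Ws, keep_intermediates=True)
assert np.allclose(y, y2) and len(acts) == L_r
```

Because the forward computation is deterministic, the recomputed activations match the originals exactly, so gradients computed from them are unchanged.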

Empirical experiments demonstrate that mHC offers exceptional stability and scalability, preserving HC's performance advantages with only 6.7% additional time overhead at an expansion rate of $n=4$.