
mHC: Manifold-Constrained Hyper-Connections
Key Points
- Hyper-Connections (HC) enhance model performance by expanding the residual stream, but their unconstrained nature compromises identity mapping, leading to training instability and significant memory access overhead.
- Manifold-Constrained Hyper-Connections (mHC) address this by projecting HC's residual mapping onto the manifold of doubly stochastic matrices, restoring the identity mapping property and ensuring stable signal propagation through norm preservation.
- In addition, mHC incorporates rigorous infrastructure optimizations such as kernel fusion and selective recomputation, enabling efficient and stable large-scale training with tangible performance improvements and superior scalability.
Manifold-Constrained Hyper-Connections (mHC) addresses the numerical instability and high memory access overhead encountered in Hyper-Connections (HC) by restoring the identity mapping property inherent to residual connections.
The standard residual connection is formulated as $\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}_l(\mathbf{x}_l)$, which offers an identity mapping over multiple layers, ensuring stable signal propagation. HC extends this paradigm by expanding the residual stream width from $C$ to $n \times C$ and introducing learnable mappings: $\mathbf{x}_{l+1} = H^{\text{res}}_l \mathbf{x}_l + (H^{\text{post}}_l)^{\top} \mathcal{F}_l(H^{\text{pre}}_l \mathbf{x}_l)$, where $\mathbf{x}_l \in \mathbb{R}^{n \times C}$, $H^{\text{res}}_l \in \mathbb{R}^{n \times n}$, and $H^{\text{pre}}_l, H^{\text{post}}_l \in \mathbb{R}^{1 \times n}$. While this enhances topological complexity and performance, HC's unconstrained nature compromises the identity mapping property. Recursively, the signal propagation from layer $l_1$ to layer $l_2$ is governed by the composite mapping $\prod_{l=l_1}^{l_2-1} H^{\text{res}}_l$. Since $H^{\text{res}}_l$ is unconstrained, this composite mapping can lead to unbounded signal amplification or attenuation, causing training instability and exploding/vanishing gradients, particularly in large-scale models. HC also introduces significant memory access overhead proportional to the expansion rate $n$, due to the widened residual stream and storage of intermediate activations.
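To make the instability concrete, here is a minimal NumPy sketch (values are illustrative, not the paper's setup) of how even a mild, consistent per-layer gain in an unconstrained residual mapping compounds exponentially with depth:

```python
import numpy as np

n, depth = 4, 64  # expansion rate n and number of layers (both illustrative)

# With unconstrained residual mappings, signal propagation across layers
# is governed by the product of the per-layer matrices H_res. Even a
# mild, consistent per-layer gain compounds exponentially with depth.
H_res = 1.05 * np.eye(n)                      # each layer amplifies by 5%
composite = np.linalg.matrix_power(H_res, depth)

gain = np.linalg.norm(composite, 2)           # spectral norm of the composite
print(gain)                                   # 1.05 ** 64, roughly 22.7
```

A per-layer gain of 0.95 would instead attenuate the signal by a comparable factor, which is the vanishing-gradient side of the same problem.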
mHC proposes to project the residual connection space of HC onto a specific manifold. The core methodology involves constraining the residual mapping to be a doubly stochastic matrix. Formally, $H^{\text{res}}_l \in \mathcal{M}_{\text{DS}}$, where $\mathcal{M}_{\text{DS}} = \{ M \in \mathbb{R}^{n \times n} \mid M_{ij} \ge 0,\; M\mathbf{1} = \mathbf{1},\; M^{\top}\mathbf{1} = \mathbf{1} \}$. This manifold constraint ensures:
- Norm Preservation: The spectral norm of a doubly stochastic matrix is bounded by 1 ($\|H^{\text{res}}_l\|_2 \le 1$), mitigating gradient explosion.
- Compositional Closure: The product of doubly stochastic matrices is also doubly stochastic, ensuring that the composite mapping remains stable across arbitrary depths.
- Feature Fusion: By the Birkhoff–von Neumann theorem, doubly stochastic matrices are convex combinations of permutation matrices, promoting robust information mixing across residual streams.
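These properties are easy to verify numerically. The NumPy sketch below (illustrative values) builds doubly stochastic matrices as convex mixes of permutation matrices and checks closure under composition and the spectral-norm bound:

```python
import numpy as np

n = 4
I = np.eye(n)
P = np.roll(I, 1, axis=0)        # a cyclic permutation matrix

# Convex combinations of permutation matrices are doubly stochastic
# (Birkhoff-von Neumann); A and B stand in for two layers' H_res.
A = 0.7 * I + 0.3 * P
B = 0.5 * I + 0.5 * P

C = A @ B                        # composite mapping over two layers

print(C.sum(axis=0), C.sum(axis=1))   # closure: all row/column sums are 1
print(np.linalg.norm(C, 2))           # norm preservation: spectral norm <= 1
```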
For parameterization and manifold projection, given the flattened input $\mathbf{x}_l' = \text{vec}(\mathbf{x}_l) \in \mathbb{R}^{nC}$, intermediate dynamic mappings are first computed:
$\begin{cases}
\tilde{H}^{\text{pre}}_l = \alpha^{\text{pre}}_l \cdot (\mathbf{x}_l' \phi^{\text{pre}}_l) + b^{\text{pre}}_l \\
\tilde{H}^{\text{post}}_l = \alpha^{\text{post}}_l \cdot (\mathbf{x}_l' \phi^{\text{post}}_l) + b^{\text{post}}_l \\
\tilde{H}^{\text{res}}_l = \alpha^{\text{res}}_l \cdot \text{mat}(\mathbf{x}_l' \phi^{\text{res}}_l) + b^{\text{res}}_l
\end{cases}$
where $\phi^{\text{pre}}_l, \phi^{\text{post}}_l, \phi^{\text{res}}_l$ are linear projections, $\alpha^{\text{pre}}_l, \alpha^{\text{post}}_l, \alpha^{\text{res}}_l$ are learnable scaling factors, and $b^{\text{pre}}_l, b^{\text{post}}_l, b^{\text{res}}_l$ are learnable biases. The final constrained mappings are then obtained via:
$\begin{cases}
H^{\text{pre}}_l = \sigma(\tilde{H}^{\text{pre}}_l) \\
H^{\text{post}}_l = 2\sigma(\tilde{H}^{\text{post}}_l) \\
H^{\text{res}}_l = \text{Sinkhorn-Knopp}(\tilde{H}^{\text{res}}_l)
\end{cases}$
Here $\sigma$ denotes the element-wise sigmoid. The Sinkhorn-Knopp operator first applies an element-wise exponentiation $M^{(0)} = \exp(\tilde{H}^{\text{res}}_l)$, followed by an iterative normalization process $M^{(t)} = \mathcal{T}_c(\mathcal{T}_r(M^{(t-1)}))$, where $\mathcal{T}_r$ and $\mathcal{T}_c$ denote row and column normalization, respectively, ensuring the resulting matrix is doubly stochastic.
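A minimal NumPy sketch of this projection (the function name and iteration count are illustrative; the paper implements the iteration as a fused kernel):

```python
import numpy as np

def sinkhorn_knopp(H_tilde, num_iters=20):
    """Map an unconstrained n x n matrix to an (approximately) doubly
    stochastic one via exponentiation + alternating normalization."""
    M = np.exp(H_tilde)                       # element-wise exp: positive entries
    for _ in range(num_iters):
        M = M / M.sum(axis=1, keepdims=True)  # T_r: row normalization
        M = M / M.sum(axis=0, keepdims=True)  # T_c: column normalization
    return M

rng = np.random.default_rng(0)
H_res = sinkhorn_knopp(rng.standard_normal((4, 4)))
print(H_res.sum(axis=1))  # row sums converge toward 1
print(H_res.sum(axis=0))  # column sums are exactly 1 after the last step
```

Because the last step is a column normalization, column sums are exact and row sums converge geometrically with the number of iterations.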
To ensure efficiency, mHC incorporates rigorous infrastructure optimizations:
- Kernel Fusion: RMSNorm is reordered to follow the matrix multiplication for efficiency, and mixed-precision strategies are employed. Multiple operations that share memory access are fused into unified compute kernels to reduce memory-bandwidth bottlenecks; for instance, two successive scans are fused into one kernel, and the backward pass is similarly consolidated. Lightweight operations on small coefficients are fused to reduce kernel launch overhead, and the Sinkhorn-Knopp iteration is implemented within a single kernel, with a custom backward kernel that recomputes intermediate results on-chip.
- Recomputing: Intermediate activations of mHC kernels are discarded after the forward pass and recomputed on-the-fly during the backward pass. This significantly reduces GPU memory footprint by avoiding storage of heavy layer function activations, requiring storage only of the input for a block of consecutive layers.
- Overlapping Communication: Communication is carefully overlapped within the DualPipe schedule to mitigate the $n$-fold increase in communication cost under pipeline parallelism.
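The recomputing strategy above can be sketched in plain NumPy (a toy ReLU(Wx) layer with manual gradients, purely illustrative; the actual system applies this to fused mHC kernels): store only the block input, then rebuild the intermediate activations during the backward pass.

```python
import numpy as np

def layer(x, W):
    # Toy stand-in for a heavy layer function: ReLU(W @ x).
    return np.maximum(W @ x, 0.0)

def block_forward(x0, Ws):
    # Forward through a block of consecutive layers, discarding all
    # intermediate activations: only the block input x0 is retained.
    x = x0
    for W in Ws:
        x = layer(x, W)
    return x

def block_backward(x0, Ws, grad_out):
    # Recompute the intermediate activations on the fly, then
    # backpropagate through the block as usual.
    acts = [x0]
    for W in Ws:
        acts.append(layer(acts[-1], W))
    g = grad_out
    grads_W = []
    for W, x_in, x_out in zip(reversed(Ws), reversed(acts[:-1]), reversed(acts[1:])):
        g = g * (x_out > 0)          # ReLU derivative
        grads_W.append(np.outer(g, x_in))
        g = W.T @ g                  # gradient w.r.t. the layer input
    return g, grads_W[::-1]

rng = np.random.default_rng(1)
Ws = [0.5 * rng.standard_normal((5, 5)) for _ in range(3)]
x0 = rng.standard_normal(5)
out = block_forward(x0, Ws)
grad_x0, grads_W = block_backward(x0, Ws, np.ones_like(out))
```

Only `x0` must be kept between the forward and backward passes; the `acts` list exists transiently inside `block_backward`, trading extra FLOPs for a smaller memory footprint.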
Empirical experiments demonstrate that mHC offers exceptional stability and scalability, preserving HC's performance advantages with only a 6.7% additional training-time overhead at an expansion rate of $n = 4$.