Deep Delta Learning

Mengdi Wang
2026.01.10
Web · by 네루
#Deep Learning · #Neural Networks · #Residual Networks · #Delta Learning · #Computer Vision

Key Points

  1. Deep Delta Learning (DDL) generalizes the standard residual connection by introducing a learnable, data-dependent geometric transformation called the Delta Operator.
  2. This Delta Operator, a rank-1 perturbation of the identity matrix, uses a gate $\beta$ to dynamically interpolate between identity mapping, orthogonal projection, and geometric reflection.
  3. The architecture synchronously couples information erasure and writing, enabling explicit control over layer-wise feature transformations and applying the classic Delta Rule over network depth.

Deep Delta Learning (DDL) introduces a novel architecture that generalizes the standard identity shortcut connection in deep residual networks, addressing its limitation of imposing a strictly additive inductive bias that restricts the modeling of complex state transitions. While traditional ResNets approximate an ODE $\dot{\mathbf{X}} = \mathbf{F}(\mathbf{X})$ via the additive update $\mathbf{X}_{l+1} = \mathbf{X}_l + \mathbf{F}(\mathbf{X}_l)$, DDL proposes a more flexible layer-wise transformation.
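To make the baseline concrete, here is a minimal numpy sketch of the additive residual update viewed as one explicit-Euler step of the ODE; the residual function `F` and the state dimensions are toy assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def resnet_step(X, F):
    """Standard additive residual update: X_{l+1} = X_l + F(X_l)."""
    return X + F(X)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))          # toy hidden state (assumed shape)
F = lambda X: 0.1 * np.tanh(X)           # toy residual branch (assumption)

X_next = resnet_step(X, F)               # one Euler step of dX/dt = F(X)
print(X_next.shape)                      # state shape is preserved
```

The point of the sketch is the strictly additive form: whatever `F` computes is simply added on top of the unchanged state, which is exactly the inductive bias DDL relaxes.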

The core of DDL is the Deep Delta Residual Block, which updates the hidden state matrix $\mathbf{X} \in \mathbb{R}^{d \times d_v}$ using the rule:

$$\mathbf{X}_{l+1} = \underbrace{(\mathbf{I} - \beta_l \mathbf{k}_l \mathbf{k}_l^\top)}_{\text{Delta Operator } \mathbf{A}(\mathbf{X})} \mathbf{X}_l + \beta_l \mathbf{k}_l \mathbf{v}_l^\top$$

Here, $\mathbf{X}_l$ is the hidden state at layer $l$, $\mathbf{I}$ is the identity matrix, and the parameters $\beta_l \in \mathbb{R}$, $\mathbf{k}_l \in \mathbb{R}^d$, and $\mathbf{v}_l \in \mathbb{R}^{d_v}$ are learnable and data-dependent. Specifically, $\beta_l$ is a scalar gate, $\mathbf{k}_l$ is a reflection direction vector, and $\mathbf{v}_l$ is a value vector. The first term, $(\mathbf{I} - \beta_l \mathbf{k}_l \mathbf{k}_l^\top)\mathbf{X}_l$, applies a rank-1 perturbation of the identity matrix, termed the Delta Operator $\mathbf{A}(\mathbf{X})$, to the current state; this operator modulates the identity shortcut with a learnable, data-dependent geometric transformation. The second term, $\beta_l \mathbf{k}_l \mathbf{v}_l^\top$, is an injection term. This formulation couples the "erasure" of old information (via projection onto $\mathbf{k}_l$) with the "writing" of new features (via injection of $\mathbf{v}_l$), both scaled synchronously by the gate $\beta_l$.
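The update above can be sketched directly in numpy. This is a minimal illustration, assuming a unit-norm direction $\mathbf{k}_l$ and toy dimensions; in the actual architecture $\beta_l$, $\mathbf{k}_l$, and $\mathbf{v}_l$ would be produced by learned, data-dependent functions.

```python
import numpy as np

def ddl_block(X, beta, k, v):
    """One Deep Delta residual update:
    X_{l+1} = (I - beta * k k^T) X + beta * k v^T
    X: (d, d_v) state, k: (d,) unit direction, v: (d_v,) value, beta: scalar gate.
    """
    d = X.shape[0]
    A = np.eye(d) - beta * np.outer(k, k)   # Delta Operator: rank-1 perturbation of I
    return A @ X + beta * np.outer(k, v)    # erase along k, then write v, both gated

rng = np.random.default_rng(0)
d, d_v = 4, 3
X = rng.standard_normal((d, d_v))
k = rng.standard_normal(d)
k /= np.linalg.norm(k)                      # unit-norm reflection direction (assumed)
v = rng.standard_normal(d_v)

X1 = ddl_block(X, beta=1.0, k=k, v=v)
# With beta = 1, the state's component along k is fully replaced by v:
print(np.allclose(k @ X1, v))               # True
```

At $\beta = 1$ the erase and write are perfectly coupled: projecting the new state onto $\mathbf{k}$ recovers exactly $\mathbf{v}^\top$, which is the "synchronous erasure and writing" behavior described above.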

The expressive power of DDL is rooted in the spectral properties of the Delta Operator $\mathbf{A}(\mathbf{X})$. With $\mathbf{k}$ unit-norm, its eigenvalues are $\{1, \dots, 1, 1-\beta\}$, so the scalar gate $\beta$ deterministically controls the operator's spectrum and thus its geometric interpretation. This allows the network to dynamically interpolate between three fundamental linear transformations:

  • Identity Mapping ($\beta \to 0$): As $\beta$ approaches $0$, all eigenvalues approach $1$. This corresponds to a skip connection, preserving the signal for deep propagation, such that $\mathbf{X}_{l+1} \approx \mathbf{X}_l$.
  • Orthogonal Projection ($\beta \to 1$): As $\beta$ approaches $1$, the eigenvalues become $\{0, 1, \dots, 1\}$. The operator is an orthogonal projection onto the hyperplane $\mathbf{k}^\perp$, effectively "forgetting" or erasing the component of the state parallel to $\mathbf{k}$. In this regime, $\det(\mathbf{A}) \to 0$.
  • Householder Reflection ($\beta \to 2$): As $\beta$ approaches $2$, the eigenvalues become $\{-1, 1, \dots, 1\}$. This corresponds to a Householder reflection, which inverts the state along the direction $\mathbf{k}$. The negative eigenvalue enables the modeling of oscillatory or oppositional dynamics within the network's state transitions, with $\det(\mathbf{A}) \to -1$.

Furthermore, DDL establishes a theoretical link to efficient sequence models through its "Depth-Wise Delta Rule." Expanding the DDL update reveals a classic Delta Rule structure:

$$\mathbf{X}_{l+1} = \mathbf{X}_l + \beta_l \mathbf{k}_l (\mathbf{v}_l^\top - \mathbf{k}_l^\top \mathbf{X}_l)$$

This is a "target-minus-current-projection" update: $\mathbf{v}_l^\top$ is the target value and $\mathbf{k}_l^\top \mathbf{X}_l$ is the current projection of the state onto the direction $\mathbf{k}_l$, making explicit how the network controls the spectrum of its layer-wise transition operator. Unlike architectures such as DeltaNet, which apply this Delta Rule over the time dimension, Deep Delta Learning applies it over the depth dimension, allowing the network to selectively "clean" or "rewrite" specific feature subspaces layer by layer. This mechanism prevents the accumulation of interference, a common issue in standard additive ResNets, while preserving the stable training characteristics of gated residual architectures.
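The equivalence between the operator form and the depth-wise Delta Rule form is a short algebraic identity, and it can be verified term by term in numpy; the dimensions, gate value, and random vectors below are assumptions for the check.

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_v = 5, 3
X = rng.standard_normal((d, d_v))
k = rng.standard_normal(d)
k /= np.linalg.norm(k)                        # unit direction (assumed)
v = rng.standard_normal(d_v)
beta = 0.7                                    # arbitrary gate value (assumed)

# Operator form: (I - beta k k^T) X + beta k v^T
operator_form = (np.eye(d) - beta * np.outer(k, k)) @ X + beta * np.outer(k, v)

# Delta Rule form: X + beta k (v^T - k^T X)
delta_rule_form = X + beta * np.outer(k, v - k @ X)

print(np.allclose(operator_form, delta_rule_form))   # True
```

Distributing $\beta_l \mathbf{k}_l$ over $(\mathbf{v}_l^\top - \mathbf{k}_l^\top \mathbf{X}_l)$ recovers the erase term $-\beta_l \mathbf{k}_l \mathbf{k}_l^\top \mathbf{X}_l$ and the write term $\beta_l \mathbf{k}_l \mathbf{v}_l^\top$, which is why the two forms are the same update.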