Deep Delta Learning
Paper


Mengdi Wang
2026.01.10
· Web · by 넀루
#Deep Learning · #Neural Networks · #Residual Networks · #Delta Learning · #Computer Vision

Key Points

  • Deep Delta Learning (DDL) generalizes the standard residual connection by introducing a learnable, data-dependent geometric transformation called the Delta Operator.
  • This Delta Operator, a rank-1 perturbation of the identity matrix, uses a gate $\beta$ to dynamically interpolate between identity mapping, orthogonal projection, and geometric reflection.
  • The architecture synchronously couples information erasure and writing, enabling explicit control over layer-wise feature transformations and applying the classic Delta Rule over network depth.

Deep Delta Learning (DDL) introduces a novel architecture that generalizes the standard identity shortcut connection in deep residual networks, addressing its limitation of imposing a strictly additive inductive bias that restricts the modeling of complex state transitions. While traditional ResNets approximate an ODE $\dot{X} = F(X)$ via the additive update $X_{l+1} = X_l + F(X_l)$, DDL proposes a more flexible layer-wise transformation.

The core of DDL is the Deep Delta Residual Block, which updates the hidden state matrix $X \in \mathbb{R}^{d \times d_v}$ using the rule:

$$X_{l+1} = \underbrace{(I - \beta_l k_l k_l^\top)}_{\text{Delta Operator } A(X)} X_l + \beta_l k_l v_l^\top$$
Here, $X_l$ is the hidden state at layer $l$, $I$ is the identity matrix, and the parameters $\beta_l \in \mathbb{R}$, $k_l \in \mathbb{R}^d$, and $v_l \in \mathbb{R}^{d_v}$ are learnable and data-dependent: $\beta_l$ is a scalar gate, $k_l$ is a reflection direction vector, and $v_l$ is a value vector. The first term, $(I - \beta_l k_l k_l^\top) X_l$, applies a rank-1 perturbation of the identity matrix to the current state; this perturbation is the Delta Operator $A(X)$, which modulates the identity shortcut with a learnable, data-dependent geometric transformation. The second term, $\beta_l k_l v_l^\top$, is an injection term. The formulation couples the "erasure" of old information (via projection onto $k_l$) with the "writing" of new features (via injection of $v_l$), both scaled synchronously by the gate $\beta_l$.
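A minimal NumPy sketch of this update may help make the shapes concrete. This is our own illustration, not the paper's implementation: in DDL, $\beta_l$, $k_l$, and $v_l$ are produced by learnable, data-dependent maps, whereas here they are passed in directly, and $k$ is normalized so that $k k^\top$ acts as a projector.

```python
import numpy as np

def delta_block(X, k, v, beta):
    """One Deep Delta residual update: X <- (I - beta k k^T) X + beta k v^T.

    X    : (d, d_v) hidden state matrix at the current layer
    k    : (d,)     erasure/reflection direction (normalized below)
    v    : (d_v,)   value vector to write
    beta : scalar gate, typically in [0, 2]
    """
    k = k / np.linalg.norm(k)              # unit norm so k k^T is a projector
    erase = X - beta * np.outer(k, k) @ X  # (I - beta k k^T) X: erase along k
    write = beta * np.outer(k, v)          # beta k v^T: write v along k
    return erase + write
```

With $\beta = 0$ the block reduces to the identity map, and with $\beta = 1$ the projection of the new state onto $k$ equals $v$ exactly, matching the erase-and-write reading above.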

The expressive power of DDL is rooted in the spectral properties of the Delta Operator $A(X)$. For unit-norm $k$, its eigenvalues are $\{1, \dots, 1, 1-\beta\}$, so the scalar gate $\beta$ deterministically controls the operator's spectrum and hence its geometric interpretation. This allows the network to dynamically interpolate between three fundamental linear transformations:

  • Identity Mapping ($\beta \to 0$): As $\beta$ approaches $0$, all eigenvalues approach $1$. The block reduces to a skip connection, preserving the signal for deep propagation: $X_{l+1} \approx X_l$.
  • Orthogonal Projection ($\beta \to 1$): As $\beta$ approaches $1$, the eigenvalues become $\{0, 1\}$. The operator becomes an orthogonal projection onto the hyperplane $k^\perp$, effectively "forgetting" or erasing the component of the state parallel to $k$; in this regime, $\det(A) \to 0$.
  • Householder Reflection ($\beta \to 2$): As $\beta$ approaches $2$, the eigenvalues become $\{-1, 1\}$. The operator becomes a Householder reflection, inverting the state along the direction $k$. The negative eigenvalue enables oscillatory or oppositional dynamics in the network's state transitions, with $\det(A) \to -1$.

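These three regimes are easy to verify numerically. The sketch below (plain NumPy, our own naming) materializes the operator $A = I - \beta k k^\top$ for a unit-normalized $k$ and inspects its spectrum and determinant at the three limiting gate values:

```python
import numpy as np

def delta_operator(k, beta):
    """Materialize A = I - beta * k k^T for a unit-normalized k."""
    k = k / np.linalg.norm(k)
    return np.eye(len(k)) - beta * np.outer(k, k)

k = np.array([1.0, 2.0, 2.0])
for beta in (0.0, 1.0, 2.0):
    A = delta_operator(k, beta)
    eigs = np.sort(np.linalg.eigvalsh(A))  # A is symmetric
    print(beta, eigs, np.linalg.det(A))
# beta = 0.0 -> eigenvalues {1, 1, 1},  det  1  (identity)
# beta = 1.0 -> eigenvalues {0, 1, 1},  det  0  (projection onto k-perp)
# beta = 2.0 -> eigenvalues {-1, 1, 1}, det -1  (Householder reflection)
```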
Furthermore, DDL establishes a theoretical link to efficient sequence models through its "Depth-Wise Delta Rule." Expanding the DDL update rule reveals a classic Delta Rule structure:
$$X_{l+1} = X_l + \beta_l k_l (v_l^\top - k_l^\top X_l)$$
This formulation makes explicit how the network controls the spectrum of its layer-wise transition operator. It acts as a "target-minus-current-projection" update, where $v_l^\top$ is the target value and $k_l^\top X_l$ is the current projection of the state onto the direction $k_l$. Unlike architectures such as DeltaNet, which apply the Delta Rule over the time dimension, Deep Delta Learning applies it over the depth dimension, allowing the network to selectively "clean" or "rewrite" specific feature subspaces layer by layer. This mechanism prevents the accumulation of interference, a common issue in standard additive ResNets, while preserving the stable training characteristics of gated residual architectures.
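The equivalence of the expanded Delta Rule form and the factored erase-and-write form is a one-line algebraic identity, and it can be sanity-checked numerically. The sketch below (plain NumPy; shapes and names are illustrative, not from the paper) compares the two on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_v, beta = 5, 3, 0.7
X = rng.standard_normal((d, d_v))   # hidden state
k = rng.standard_normal(d)
k /= np.linalg.norm(k)              # unit-norm direction
v = rng.standard_normal(d_v)        # target value

# Factored form: erase along k, then write v along k
factored = (np.eye(d) - beta * np.outer(k, k)) @ X + beta * np.outer(k, v)
# Delta-rule form: correct the current projection k^T X toward the target v^T
delta_rule = X + beta * np.outer(k, v - k @ X)

assert np.allclose(factored, delta_rule)
```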