
Deep Delta Learning
Key Points
- Deep Delta Learning (DDL) generalizes the standard residual connection by introducing a learnable, data-dependent geometric transformation called the Delta Operator.
- This Delta Operator, a rank-1 perturbation of the identity matrix, uses a gate $\beta$ to dynamically interpolate between identity mapping, orthogonal projection, and geometric reflection.
- The architecture synchronously couples information erasure and writing, enabling explicit control over layer-wise feature transformations and applying the classic Delta Rule over network depth.
Deep Delta Learning (DDL) introduces a novel architecture that generalizes the standard identity shortcut connection in deep residual networks, addressing the limitation that its strictly additive inductive bias restricts the modeling of complex state transitions. While traditional ResNets approximate an ODE via the additive update $H_{l+1} = H_l + F(H_l)$, DDL proposes a more flexible layer-wise transformation.
The core of DDL is the Deep Delta Residual Block, which updates the hidden state matrix $H_l$ using the rule:

$$H_{l+1} = \left(I - \beta_l\, k_l k_l^\top\right) H_l + \beta_l\, k_l v_l^\top$$
Here, $H_l$ is the hidden state at layer $l$, $I$ is the identity matrix, and the parameters $\beta_l$, $k_l$, and $v_l$ are learnable and data-dependent. Specifically, $\beta_l$ is a scalar gate, $k_l$ is a unit-norm reflection direction vector, and $v_l$ is a value vector. The first term, $(I - \beta_l k_l k_l^\top) H_l$, represents a rank-1 perturbation of the identity matrix applied to the current state; the matrix $I - \beta_l k_l k_l^\top$ is termed the Delta Operator. This operator modulates the identity shortcut with a learnable, data-dependent geometric transformation. The second term, $\beta_l k_l v_l^\top$, is an injection term. This formulation innovatively couples the "erasure" of old information (via projection onto $k_l$) with the "writing" of new features (via injection of $v_l$), both scaled synchronously by the gate $\beta_l$.
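As a concrete illustration, here is a minimal NumPy sketch of one Deep Delta residual update. The function name, array shapes, and the explicit unit-normalization of the direction vector are assumptions for the sake of a runnable example, not details taken from the paper:

```python
import numpy as np

def delta_residual_block(H, beta, k, v):
    """One Deep Delta residual update: (I - beta k k^T) H + beta k v^T.

    H    : (d, n) hidden state matrix at layer l
    beta : scalar gate, expected in [0, 2]
    k    : (d,) reflection direction (normalized to unit length here)
    v    : (n,) value vector
    """
    k = k / np.linalg.norm(k)            # the Delta Operator assumes a unit direction
    erase = beta * np.outer(k, k) @ H    # erase the component of H along k
    write = beta * np.outer(k, v)        # write the new value into that direction
    return H - erase + write
```

With `beta = 0` the block reduces to the identity shortcut, while with `beta = 1` the state's component along `k` is replaced wholesale by `v`, reflecting the synchronous erase-and-write coupling described above.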
The expressive power of DDL is rooted in the spectral properties of the Delta Operator $I - \beta_l k_l k_l^\top$. Its eigenvalues are $\{1 - \beta_l, 1, \dots, 1\}$, meaning the scalar gate deterministically controls the operator's spectrum and thus its geometric interpretation. This allows the network to dynamically interpolate between three fundamental linear transformations:
- Identity Mapping ($\beta_l \to 0$): When $\beta_l$ approaches $0$, the eigenvalues are all $1$. This corresponds to a pure skip connection, preserving the signal for deep propagation, such that $H_{l+1} \to H_l$.
- Orthogonal Projection ($\beta_l \to 1$): When $\beta_l$ approaches $1$, the eigenvalues become $\{0, 1, \dots, 1\}$. This operation represents an orthogonal projection onto the hyperplane $\{x : k_l^\top x = 0\}$, effectively "forgetting" or erasing components of the state parallel to $k_l$. In this regime, $H_{l+1} = (I - k_l k_l^\top) H_l + k_l v_l^\top$.
- Householder Reflection ($\beta_l \to 2$): When $\beta_l$ approaches $2$, the eigenvalues become $\{-1, 1, \dots, 1\}$. This corresponds to a Householder reflection, which inverts the state along the direction $k_l$. This introduces negative eigenvalues, enabling the modeling of oscillatory or oppositional dynamics within the network's state transitions, with $H_{l+1} = (I - 2 k_l k_l^\top) H_l + 2 k_l v_l^\top$.
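The three regimes above can be checked numerically. The following sketch (variable names are illustrative) builds the Delta Operator for a random unit direction and verifies its spectrum, along with the idempotence of the projection and the involutivity of the reflection:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
k = rng.standard_normal(d)
k /= np.linalg.norm(k)                       # unit reflection direction

def delta_operator(beta, k):
    # Rank-1 perturbation of the identity: I - beta * k k^T
    return np.eye(len(k)) - beta * np.outer(k, k)

for beta in (0.0, 1.0, 2.0):
    T = delta_operator(beta, k)
    eigs = np.sort(np.linalg.eigvalsh(T))
    # spectrum: 1 - beta on span(k), 1 on its orthogonal complement
    print(f"beta={beta}: eigenvalues {np.round(eigs, 6)}")

P = delta_operator(1.0, k)                   # orthogonal projection: P @ P == P
R = delta_operator(2.0, k)                   # Householder reflection: R @ R == I
```

The printed spectra show a single eigenvalue sweeping from $1$ through $0$ to $-1$ as the gate moves from $0$ to $2$, while the remaining eigenvalues stay fixed at $1$.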
Furthermore, DDL establishes a theoretical link to efficient sequence models through its "Depth-Wise Delta Rule." Expanding the DDL update rule reveals a classic Delta Rule structure:

$$H_{l+1} = H_l + \beta_l\, k_l \left(v_l^\top - k_l^\top H_l\right)$$
This formulation highlights how the network explicitly controls the spectrum of its layer-wise transition operator. It acts like a "target-minus-current-projection" update, where $v_l^\top$ is the target value and $k_l^\top H_l$ is the current projection of the state onto the direction $k_l$. Unlike architectures such as DeltaNet, which apply this Delta Rule over the time dimension, Deep Delta Learning applies it over the depth dimension, allowing the network to selectively "clean" or "rewrite" specific feature subspaces layer by layer. This mechanism prevents the accumulation of interference, a common issue in standard additive ResNets, while preserving the stable training characteristics of gated residual architectures.
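The algebraic equivalence between the operator form of the update and its expanded depth-wise Delta Rule form is easy to verify numerically. The shapes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 3
H = rng.standard_normal((d, n))              # hidden state at layer l
k = rng.standard_normal(d)
k /= np.linalg.norm(k)                       # unit reflection direction
v = rng.standard_normal(n)                   # value vector
beta = 0.7                                   # scalar gate

# Operator form: (I - beta k k^T) H + beta k v^T
out_operator = (np.eye(d) - beta * np.outer(k, k)) @ H + beta * np.outer(k, v)

# Delta Rule form: H + beta k (v^T - k^T H), i.e. target minus current projection
out_delta_rule = H + beta * np.outer(k, v - H.T @ k)
```

Both expressions produce the same next-layer state, since $\beta k k^\top H$ distributes out of the operator form to give the residual correction $\beta k (v^\top - k^\top H)$.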