
Mamba-3: Improved Sequence Modeling using State Space Principles
Key Points
- Mamba-3 significantly advances sub-quadratic sequence modeling by introducing three core methodological improvements to enhance model quality, state-tracking capabilities, and inference efficiency.
- Its key innovations include an expressive exponential-trapezoidal discretization, a complex-valued state update rule enabling richer state tracking, and a Multi-Input, Multi-Output (MIMO) formulation for improved hardware utilization during decoding.
- Empirically, Mamba-3 achieves notable gains in downstream language modeling accuracy and solves synthetic state-tracking tasks previously out of reach for linear models, all while maintaining efficient inference.
The paper introduces Mamba-3, an improved sequence modeling architecture based on State Space Models (SSMs), addressing limitations of prior sub-quadratic models like Mamba-2, particularly in model quality, state-tracking capabilities, and hardware-inefficient inference. Mamba-3 is guided by an inference-first paradigm, integrating three core methodological improvements rooted in an SSM-centric viewpoint: Exponential-Trapezoidal Discretization, Complex-valued State Space Models, and Multi-Input, Multi-Output (MIMO) SSMs.
1. Exponential-Trapezoidal Discretization
This method provides a more expressive and theoretically justified discretization for Linear Time-Varying (LTV) SSMs than the heuristic approximation used in Mamba-1 and Mamba-2. The core idea originates from analyzing an "exponential-adjusted" system $\bar{h}(t) = e^{-At} h(t)$, whose dynamics $\dot{\bar{h}}(t) = e^{-At} B(t)\,x(t)$ allow for separate approximation of the state-transition and state-input integrals.
The Mamba-1/2 heuristic is formalized as "exponential-Euler" discretization, derived by approximating the state-input integral with Euler's rule, holding the right endpoint constant:

$$h_t = e^{\Delta_t A_t}\, h_{t-1} + \Delta_t B_t x_t.$$
Mamba-3 introduces "exponential-trapezoidal" discretization, which uses a generalized trapezoidal rule for the state-input integral, offering second-order accuracy. The recurrence relation is given by:

$$h_t = e^{\Delta_t A_t}\left(h_{t-1} + \lambda_t \Delta_t B_{t-1} x_{t-1}\right) + (1 - \lambda_t)\,\Delta_t B_t x_t.$$

This can be denoted as $h_t = \alpha_t h_{t-1} + \beta_t B_{t-1} x_{t-1} + \gamma_t B_t x_t$, where $\alpha_t = e^{\Delta_t A_t}$, $\beta_t = \lambda_t \Delta_t\, e^{\Delta_t A_t}$, and $\gamma_t = (1 - \lambda_t)\Delta_t$. Here, $\lambda_t$ is a data-dependent scalar. This formulation generalizes the classical trapezoidal rule ($\lambda_t = \tfrac{1}{2}$) and exponential-Euler ($\lambda_t = 0$).
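The recurrence above can be sketched numerically in NumPy. This is an illustrative check, not the paper's implementation: the parameter values are random stand-ins for the data-dependent $\Delta_t$, $A_t$, $\lambda_t$, and precomputed $B_t x_t$ terms. Setting $\lambda_t = 0$ recovers the exponential-Euler update of Mamba-1/2:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                        # sequence length, state dimension

# Hypothetical data-dependent parameters (random stand-ins).
dt  = rng.uniform(0.1, 0.5, T)     # step sizes Delta_t
a   = -rng.uniform(0.5, 2.0, T)    # scalar state matrix A_t (stable: a < 0)
lam = rng.uniform(0.0, 1.0, T)     # lambda_t, the trapezoidal weight
Bx  = rng.standard_normal((T, N))  # precomputed B_t x_t terms

def run(weights):
    """h_t = alpha_t h_{t-1} + beta_t B_{t-1}x_{t-1} + gamma_t B_t x_t."""
    h = np.zeros(N)
    for t in range(T):
        alpha = np.exp(dt[t] * a[t])
        beta  = weights[t] * dt[t] * alpha
        gamma = (1.0 - weights[t]) * dt[t]
        prev_Bx = Bx[t - 1] if t > 0 else np.zeros(N)
        h = alpha * h + beta * prev_Bx + gamma * Bx[t]
    return h

h_trap  = run(lam)          # generalized trapezoidal update
h_euler = run(np.zeros(T))  # lambda_t = 0: exponential-Euler

# Reference exponential-Euler recurrence, written out directly.
h_ref = np.zeros(N)
for t in range(T):
    h_ref = np.exp(dt[t] * a[t]) * h_ref + dt[t] * Bx[t]
assert np.allclose(h_euler, h_ref)
```

The single `weights` knob makes the special cases explicit: `np.full(T, 0.5)` would give the classical trapezoidal weighting, while zeros give the Mamba-1/2 heuristic.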
This discretization effectively applies a data-dependent, width-2 convolution on the state-input *within* the core recurrence, distinct from external short convolutions. In the structured masked attention view of the State Space Duality (SSD) framework, Mamba-3's exponential-trapezoidal recurrence corresponds to a structured mask that is the product of a 1-semiseparable matrix (the decay terms) and a 2-band matrix (the convolutional weights), enabling parallel computation during training.
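The mask factorization can be checked on a toy scalar example. In this sketch (with random stand-in coefficients, not the paper's kernel), `L` is the 1-semiseparable matrix of cumulative decays and `W` the 2-band matrix of convolution weights; their product applied to the inputs reproduces the sequential recurrence:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5
alpha = rng.uniform(0.5, 0.99, T)   # decay terms exp(Delta_t * A_t)
beta  = rng.uniform(0.0, 0.3, T)    # lambda_t * Delta_t * alpha_t
gamma = rng.uniform(0.0, 0.3, T)    # (1 - lambda_t) * Delta_t
u     = rng.standard_normal(T)      # scalar inputs B_t x_t

# Sequential form: h_t = alpha_t h_{t-1} + beta_t u_{t-1} + gamma_t u_t
h_seq = np.zeros(T)
h = 0.0
for t in range(T):
    h = alpha[t] * h + (beta[t] * u[t - 1] if t > 0 else 0.0) + gamma[t] * u[t]
    h_seq[t] = h

# Matrix form: h = L @ W @ u.
# L is 1-semiseparable: L[t, s] = prod(alpha[s+1 .. t]) for s <= t.
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(alpha[s + 1:t + 1])  # empty product = 1
# W is 2-band: gamma on the diagonal, beta on the first subdiagonal.
W = np.diag(gamma) + np.diag(beta[1:], k=-1)

assert np.allclose(h_seq, L @ W @ u)
```

Because both factors are structured (semiseparable and banded), the product can be materialized blockwise for parallel training, in the same spirit as the SSD algorithm of Mamba-2.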
2. Complex-valued State Space Model
This innovation addresses the inability of prior real-valued SSMs to track state accurately in tasks requiring "rotational" hidden state dynamics (e.g., parity functions). The paper shows that introducing complex-valued states enables these dynamics without significant computational overhead.
Starting with a continuous-time complex-valued SSM:

$$h'(t) = a(t)\,h(t) + B(t)\,x(t), \qquad y(t) = \operatorname{Re}\!\left[C(t)^{*}\,h(t)\right],$$

where $a(t) \in \mathbb{C}$ and $h(t), B(t), C(t) \in \mathbb{C}^{N}$.

Under exponential-Euler discretization, this complex SSM is shown to be equivalent to a real-valued SSM with a doubled state dimension ($2N$). The transition matrix of this real-valued equivalent is a scalar-decayed block-diagonal matrix composed of data-dependent rotation matrices $R_t$:

$$h_t = e^{\Delta_t \operatorname{Re}(a_t)}\,(I_N \otimes R_t)\,h_{t-1} + \Delta_t \tilde{B}_t\,x_t, \qquad R_t = \begin{pmatrix} \cos\theta_t & -\sin\theta_t \\ \sin\theta_t & \cos\theta_t \end{pmatrix},$$

where $\theta_t = \Delta_t \operatorname{Im}(a_t)$ and $e^{\Delta_t \operatorname{Re}(a_t)}$ is the scalar decay. The projections $\tilde{B}_t$ and $\tilde{C}_t$ are formed by concatenating the real and imaginary parts of the original complex projections.
Crucially, this real-valued formulation is further proven to be equivalent to applying data-dependent rotary embeddings (RoPE) to the $B_t$ and $C_t$ projections of a vanilla SSM with a scalar transition matrix. This "RoPE trick" allows for an efficient implementation with minimal overhead, enabling Mamba-3 to solve synthetic state-tracking tasks that previous linear models could not.
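The complex-to-real equivalence can be verified numerically for a state dimension of 1. This is an illustrative sketch with random stand-in parameters, not the paper's implementation: the complex scalar recurrence and the real recurrence with a scalar decay times a data-dependent 2x2 rotation produce the same outputs:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 8
dt   = rng.uniform(0.1, 0.5, T)
a_re = -rng.uniform(0.2, 1.0, T)      # decay: Re(a_t)
a_im = rng.uniform(-np.pi, np.pi, T)  # rotation: Im(a_t)
b    = rng.standard_normal(T) + 1j * rng.standard_normal(T)
c    = rng.standard_normal(T) + 1j * rng.standard_normal(T)
x    = rng.standard_normal(T)

# Complex-valued recurrence (exponential-Euler, state dimension 1).
h = 0.0 + 0.0j
y_complex = np.zeros(T)
for t in range(T):
    h = np.exp(dt[t] * (a_re[t] + 1j * a_im[t])) * h + dt[t] * b[t] * x[t]
    y_complex[t] = np.real(np.conj(c[t]) * h)

# Equivalent real-valued recurrence with doubled state (dimension 2):
# a scalar decay times a data-dependent rotation -- the "RoPE trick".
h2 = np.zeros(2)
y_real = np.zeros(T)
for t in range(T):
    theta = dt[t] * a_im[t]
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    b2 = np.array([b[t].real, b[t].imag])  # stacked real/imag parts
    c2 = np.array([c[t].real, c[t].imag])
    h2 = np.exp(dt[t] * a_re[t]) * (R @ h2) + dt[t] * b2 * x[t]
    y_real[t] = c2 @ h2

assert np.allclose(y_complex, y_real)
```

Multiplying by $e^{i\theta}$ in the complex plane is exactly a 2x2 rotation of the stacked real and imaginary parts, which is why the rotation angle can be absorbed into RoPE-style transforms of the projections with negligible cost.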
3. Multi-Input, Multi-Output (MIMO) SSM
To improve hardware utilization and FLOP efficiency during decoding, Mamba-3 transitions from an outer-product-based state update to a matrix-multiplication-based state update. This generalizes the signal-processing foundation from Single-Input, Single-Output (SISO) dynamics to Multi-Input, Multi-Output (MIMO) dynamics. MIMO permits more computation inside the memory-bound state update at decode time without increasing the state size or compromising speed. The result is higher decoding FLOP throughput relative to Mamba-2 at a fixed state size, similar wall-clock decode latency, and improved perplexity and downstream performance.
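The SISO-versus-MIMO distinction can be sketched as follows; the sizes `N` (state size), `P` (head dimension), and the MIMO rank `r` are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P, r = 64, 64, 4  # state size, head dim, MIMO rank (illustrative)

H = np.zeros((N, P))  # recurrent state carried across decode steps

# SISO (Mamba-2 style): rank-1 outer-product update per decode step.
b = rng.standard_normal(N)       # input projection for this step
x = rng.standard_normal(P)       # single input channel
H_siso = 0.9 * H + np.outer(b, x)   # ~N*P multiply-adds

# MIMO (Mamba-3 style): rank-r matmul update, same state size N x P.
B = rng.standard_normal((N, r))  # multi-input projection
X = rng.standard_normal((r, P))  # r input channels per step
H_mimo = 0.9 * H + B @ X            # ~N*r*P multiply-adds

# Same state memory traffic either way, but roughly r times more
# compute per step -- useful when decoding is memory-bound.
assert H_siso.shape == H_mimo.shape == (N, P)
```

Because autoregressive decoding is dominated by reading and writing the state, spending extra multiply-adds on a rank-`r` update raises arithmetic intensity without changing the memory footprint, which is the mechanism behind the improved hardware utilization described above.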
These three methodological advancements are combined within the Mamba-3 layer. Empirically, Mamba-3 (MIMO) achieves significant gains in downstream language modeling accuracy, improving by +2.2 points over Transformers and +1.9 points over Mamba-2 at the 1.5B scale. Mamba-3 (MIMO) with a state size of 64 matches the perplexity of Mamba-2 with a state size of 128, effectively halving latency for equivalent performance. The complex-valued state enables Mamba-3 to solve synthetic arithmetic state-tracking tasks that Mamba-2 and Mamba-3 without RoPE cannot. Overall, Mamba-3 advances the performance-efficiency Pareto frontier for sequence models.