Mamba-3: Improved Sequence Modeling using State Space Principles

Albert Gu
2026.03.19
arXiv · by 이호민
#Inference Efficiency #LLM #Mamba #Sequence Modeling #State Space Models

Key Points

  • Mamba-3 significantly advances sub-quadratic sequence modeling by introducing three core methodological improvements that enhance model quality, state-tracking capability, and inference efficiency.
  • Its key innovations are an expressive exponential-trapezoidal discretization, a complex-valued state update rule enabling richer state tracking, and a Multi-Input, Multi-Output (MIMO) formulation for improved hardware utilization during decoding.
  • Empirically, Mamba-3 achieves notable gains in downstream language modeling accuracy and solves synthetic state-tracking tasks that were previously out of reach for linear models, all while maintaining efficient inference.

The paper introduces Mamba-3, an improved sequence modeling architecture based on State Space Models (SSMs). It addresses limitations of prior sub-quadratic models such as Mamba-2 in model quality, state-tracking capability, and hardware efficiency during inference. Mamba-3 is guided by an inference-first design paradigm and integrates three core methodological improvements rooted in an SSM-centric viewpoint: exponential-trapezoidal discretization, complex-valued state space models, and Multi-Input, Multi-Output (MIMO) SSMs.

1. Exponential-Trapezoidal Discretization
This method provides a more expressive and theoretically justified discretization for Linear Time-Varying (LTV) SSMs than the heuristic approximation used in Mamba-1 and Mamba-2. The core idea comes from analyzing the "exponential-adjusted" system $\exp(-At)\,h(t)$, which allows the state-transition and state-input integrals to be approximated separately.
The Mamba-1/2 heuristic is formalized as "exponential-Euler" discretization, derived by approximating the state-input integral with a right-endpoint Euler rule: $\mathbf{h}_t = e^{\Delta t \mathbf{A}_t} \mathbf{h}_{t-1} + \Delta t\, \mathbf{B}_t \mathbf{x}_t$.
Mamba-3 introduces "exponential-trapezoidal" discretization, which uses a generalized trapezoidal rule for the state-input integral, offering second-order accuracy. The recurrence relation is:
$$\mathbf{h}_t = e^{\Delta t \mathbf{A}_t} \mathbf{h}_{t-1} + (1 - \lambda_t)\,\Delta t\, e^{\Delta t \mathbf{A}_t}\, \mathbf{B}_{t-1}\mathbf{x}_{t-1} + \lambda_t\, \Delta t\, \mathbf{B}_t\mathbf{x}_t$$
This can be written compactly as $\mathbf{h}_t = \alpha_t \mathbf{h}_{t-1} + \beta_t \mathbf{B}_{t-1}\mathbf{x}_{t-1} + \gamma_t \mathbf{B}_t\mathbf{x}_t$, where $\alpha_t \triangleq e^{\Delta t \mathbf{A}_t}$, $\beta_t \triangleq (1 - \lambda_t)\,\Delta t\, e^{\Delta t \mathbf{A}_t}$, and $\gamma_t \triangleq \lambda_t \Delta t$. Here, $\lambda_t \in [0, 1]$ is a data-dependent scalar. This formulation generalizes both the classical trapezoidal rule ($\lambda_t = 1/2$) and exponential-Euler ($\lambda_t = 1$).
This discretization effectively applies a data-dependent, width-2 convolution on the state-input $\mathbf{B}_t \mathbf{x}_t$ *within* the core recurrence, distinct from external short convolutions. In the structured state space duality (SSD) framework, Mamba-3's exponential-trapezoidal recurrence corresponds to a structured mask $\mathbf{L}$ that is the product of a 1-semiseparable matrix (the decay terms $\alpha$) and a 2-band matrix (the convolutional weights $\beta, \gamma$), enabling parallel computation during training.
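The recurrence above can be sketched as a sequential scan. The NumPy code below is an illustrative, simplified implementation (diagonal $\mathbf{A}$, a single scalar input channel); all array shapes and the function name are assumptions for exposition, not the paper's actual kernel, which uses a parallel chunked formulation for training.

```python
import numpy as np

def exp_trapezoidal_scan(A, B, x, dt, lam):
    """Sequential exponential-trapezoidal scan (illustrative sketch).

    A:   (T, N) diagonal entries of the (negative) state matrix
    B:   (T, N) input projections
    x:   (T,)   scalar input channel
    dt:  (T,)   data-dependent step sizes Δt
    lam: (T,)   data-dependent interpolation weights λ_t in [0, 1]
    Returns the stacked hidden states, shape (T, N).
    """
    T, N = B.shape
    h = np.zeros(N)
    hs = []
    for t in range(T):
        alpha = np.exp(dt[t] * A[t])            # decay  α_t = e^{Δt A_t}
        beta = (1.0 - lam[t]) * dt[t] * alpha   # weight β_t on B_{t-1} x_{t-1}
        gamma = lam[t] * dt[t]                  # weight γ_t on B_t x_t
        prev = B[t - 1] * x[t - 1] if t > 0 else np.zeros(N)
        h = alpha * h + beta * prev + gamma * B[t] * x[t]
        hs.append(h.copy())
    return np.stack(hs)
```

Setting `lam` to all ones makes every $\beta_t$ vanish, recovering the exponential-Euler recurrence of Mamba-1/2 as a special case.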

2. Complex-valued State Space Model
This innovation addresses the inability of prior real-valued SSMs to track state accurately in tasks requiring "rotational" hidden state dynamics (e.g., parity functions). The paper shows that introducing complex-valued states enables these dynamics without significant computational overhead.
Starting from a continuous-time complex-valued SSM:
$$\dot{\mathbf{h}}(t) = \mathrm{Diag}(\mathbf{A}(t) + i\theta(t))\, \mathbf{h}(t) + (\mathbf{B}(t) + i\hat{\mathbf{B}}(t))\, x(t)$$
$$y(t) = \mathrm{Re}\!\left( (\mathbf{C}(t) + i\hat{\mathbf{C}}(t))^\top \mathbf{h}(t) \right)$$
where $\mathbf{h}(t) \in \mathbb{C}^{N/2}$.
Under exponential-Euler discretization, this complex SSM is shown to be equivalent to a real-valued SSM with doubled state dimension ($\mathbf{h}_t \in \mathbb{R}^N$). The transition matrix of the real-valued equivalent is a scalar-decayed block-diagonal matrix composed of $2 \times 2$ data-dependent rotation matrices $\mathbf{R}_t$:
$$\mathbf{h}_t = e^{\Delta t \mathbf{A}_t}\, \mathbf{R}_t \mathbf{h}_{t-1} + \Delta t\, \mathbf{B}_t \mathbf{x}_t$$
$$y_t = \mathbf{C}_t^\top \mathbf{h}_t$$
where $\mathbf{R}_t = \mathrm{Block}(\{R(\Delta t\, \theta_t[i])\}_{i=1}^{N/2})$ and $R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$. The projections $\mathbf{B}_t$ and $\mathbf{C}_t$ are formed by concatenating the real and imaginary parts of the original complex projections.
Crucially, this real-valued formulation is further proven equivalent to applying data-dependent rotary embeddings (RoPE) to the $\mathbf{B}$ and $\mathbf{C}$ projections of a vanilla SSM with a scalar transition matrix. This "RoPE trick" allows an efficient implementation with minimal overhead, enabling Mamba-3 to solve synthetic state-tracking tasks that previous linear models could not.

3. Multi-Input, Multi-Output (MIMO) SSM
To improve hardware utilization and FLOP efficiency during decoding, Mamba-3 replaces the outer-product state update with a matrix-multiplication-based one. In signal processing terms, this generalizes the sequence dynamics from Single-Input, Single-Output (SISO) to Multi-Input, Multi-Output (MIMO). MIMO packs more computation into the memory-bound state update at decode time without increasing the state size or compromising speed: decoding FLOPs rise (up to $4\times$ relative to Mamba-2 at fixed state size) while wall-clock decode latency stays similar, concurrently improving perplexity and downstream performance.
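The shift from outer-product to matmul updates can be sketched as below. The sizes `N`, `P`, `r` and variable names are illustrative assumptions: the state is viewed as an $N \times P$ matrix, and the MIMO update injects a rank-$r$ term instead of a rank-1 term at identical state size.

```python
import numpy as np

N, P, r = 64, 64, 4            # state rows, head dim, MIMO input rank (illustrative)
rng = np.random.default_rng(2)
H = rng.normal(size=(N, P))    # fixed-size matrix state carried across steps
a = 0.9                        # scalar decay for this step

# SISO-style (Mamba-2) update: rank-1 outer product, ~N*P multiply-adds
B1 = rng.normal(size=N)
x1 = rng.normal(size=P)
H_siso = a * H + np.outer(B1, x1)

# MIMO update: rank-r matmul, ~N*r*P multiply-adds at the same state size
Br = rng.normal(size=(N, r))
Xr = rng.normal(size=(r, P))
H_mimo = a * H + Br @ Xr
```

With `r = 4` the update does roughly 4x the arithmetic of the rank-1 case while reading and writing the same $N \times P$ state, which is why the extra FLOPs are nearly free in the memory-bound decode regime.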

These three methodological advancements are combined within the Mamba-3 layer. Empirically, Mamba-3 (MIMO) achieves significant gains in downstream language modeling accuracy, improving by +2.2 points over Transformers and +1.9 points over Mamba-2 at the 1.5B scale. Mamba-3 (MIMO) with a state size of 64 matches the perplexity of Mamba-2 with a state size of 128, effectively halving latency for equivalent performance. The complex-valued state enables Mamba-3 to solve synthetic arithmetic state-tracking tasks that Mamba-2 and Mamba-3 without RoPE cannot. Overall, Mamba-3 advances the performance-efficiency Pareto frontier for sequence models.