Structured State Space Duality: Unifying SSMs and Attention with Mamba-2 for 2–8× Acceleration
TL;DR
Structured State-Space Duality (SSD) reveals that SSMs and masked attention compute the same function with different contraction orders. It introduces a matmul-dominant kernel based on block decomposition and low-rank properties, achieving 2–8× speedup over Mamba-scan and outperforming FlashAttention-2 for sequences longer than 2k tokens. In a 2.7B / 300B tokens pre-training setup, it matches the quality of Transformer++ and shows further improvement to a PPL of 5.95 when mixed with 6 attention layers (Source: Abstract, Fig. 10, Tab. 3, §9.2–9.3).
Core Ideas
- Theoretical Unification: By formalizing SSMs as semiseparable matrices and masked attention as structured masked attention (SMA), the paper proves a duality between the two. In essence, they are just different ways of computing the same linear transformation: quadratic (attention) vs. linear (SSM recurrence) (Source: §5.1–§5.3, Fig. 1).
- Hardware-Friendly Algorithm: A diagonal/off-diagonal low-rank block decomposition maps computations onto GEMMs, ensuring $O(TN^2)$ training, $O(TN)$ inference, and $O(TN)$ memory complexities (for a typical setup where $P=N=Q$) (Source: §6, Thm 6.1).
- Architectural Realization (Mamba-2): Input-dependent parallel projections + GroupNorm reduce all-reduce operations to once per block (halving TP communication) and improve PPL from 11.76 to 11.49 (−0.27) (Source: Fig. 7/§8, Tab. 4).
Background: The Problem They Solved
- Transformer Bottlenecks: Training compute scales quadratically with sequence length, $O(T^2)$, and generation requires a KV-cache that grows linearly, $O(T)$, so costs balloon on long sequences (Source: §1).
- SSM’s Strengths and Isolation: SSMs are advantageous for long contexts with $O(T)$ training and an $O(1)$ state size, but their theoretical and system stacks were fragmented, lacking portability and standardization (Source: §1).
- What Was Needed: A common mathematical framework to bridge the two paradigms, a hardware-friendly kernel, and alignment with standard parallelization schemes (TP/SP/CP) for large-scale training (Source: §1, §8).
How It Works: A Concrete Example
1) “Same Function, Different Contraction Orders”
Flatten a 3×3 image $x$ in row-major order: $x_{1:9}=[1,2,\dots,9]$. Scalar SSM (1-SS): $h_t = a\,h_{t-1} + x_t,\ y_t = c\,h_t$ with $a=0.5,\ c=1$.
- Recursive computation yields $y_3=4.25$, …, $y_9\approx 16.0039$ (linear time). With the same 1-SS SMA mask $L_{j,i}=a^{j-i}$, computing $y_j=\sum_{i\le j} a^{j-i}x_i$ produces the same results (quadratic time). → The same linear transform can be implemented both as a scan (linear) and a masked matrix multiply (quadratic) (evidence: §5.1–§5.3).
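A minimal NumPy sketch of this equivalence (toy values $a=0.5$, $c=1$ as above; variable names are ours, not the paper's):

```python
import numpy as np

a, T = 0.5, 9
x = np.arange(1, T + 1, dtype=float)    # flattened 3×3 image: [1, ..., 9]

# Linear path: scan the recurrence h_t = a*h_{t-1} + x_t, y_t = h_t
h, y_scan = 0.0, []
for x_t in x:
    h = a * h + x_t
    y_scan.append(h)
y_scan = np.array(y_scan)

# Quadratic path: materialize the 1-semiseparable mask L[j, i] = a**(j-i) for j >= i
j, i = np.indices((T, T))
L = np.where(j >= i, a ** (j - i), 0.0)
y_mask = L @ x

print(y_scan[2], y_mask[2])              # both 4.25
print(np.allclose(y_scan, y_mask))       # True: same function, different contraction order
```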
2) Block–Low-Rank at a Glance
Off-diagonal block entries $L_{j,i}=a^{j-i}=u_j v_i$ factor into rank-1 form, so each off-diagonal block is handled by one outer product + a short scan → this aligns with the GEMM-favored path (evidence: §6).
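A quick NumPy check (our notation) that an off-diagonal block of this mask is exactly rank-1:

```python
import numpy as np

a = 0.5
j = np.arange(4, 7)[:, None]                # target positions 4..6 (second block)
i = np.arange(1, 4)[None, :]                # source positions 1..3 (first block)
block = a ** (j - i)                        # off-diagonal block of the mask L

u = a ** (j[:, 0] - 4)                      # u_j = a**(j-4)
v = a ** (4 - i[0, :])                      # v_i = a**(4-i)
print(np.allclose(block, np.outer(u, v)))   # True: one outer product covers the whole block
```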
3) Formula Snippet (Complexity)
$$ \text{Attention (Training)}=\Theta(T^2),\quad \text{SSD (Training)}=\Theta(TN^2),\quad \text{SSD (Inference)}=\Theta(TN) $$ (evidence: §6, §6.3)
Performance Validation: Key Results
1) Kernel / System Efficiency
- On A100-80GB (PCIe): 2–8× faster than the Mamba fused scan, and faster than FlashAttention-2 when $T \ge 2\text{k}$ (evidence: Fig.10, §9.3).
- Memory scaling: SSD has no KV-cache; the state is $O(N)$, hence favorable for long contexts (evidence: §6.3).
$$ \text{KV-cache (GB)} \approx \frac{2 \cdot L \cdot H \cdot d_\text{head} \cdot T \cdot \text{bytes}}{10^9} $$ (with $L$ layers, $H$ heads per layer, and one K and one V tensor each). SSD maintains only an $O(N)$ state instead of a KV-cache (evidence: §6.3).
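As a hedged illustration of the formula (the 2.7B-class shapes below are our assumptions, not values taken from the paper's tables):

```python
def kv_cache_bytes(n_layers, n_heads, d_head, seq_len, bytes_per_elem=2):
    """K and V tensors (hence the factor 2) per layer, each of shape [n_heads, seq_len, d_head]."""
    return 2 * n_layers * n_heads * d_head * seq_len * bytes_per_elem

def ssd_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
    """One recurrent state of shape [d_inner, d_state] per layer, independent of seq_len."""
    return n_layers * d_inner * d_state * bytes_per_elem

# Assumed shapes: 32-layer Transformer (32 heads x 80 dims) vs. 64-layer Mamba-2 (d_inner=5120, N=128)
for T in (4_096, 16_384, 65_536):
    kv = kv_cache_bytes(32, 32, 80, T)
    st = ssd_state_bytes(64, 5120, 128)
    print(f"T={T:>6}: KV-cache {kv/1e9:5.2f} GB  vs.  SSD state {st/1e6:5.1f} MB (constant in T)")
```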
2) Language Modeling & Scaling
- Medium (≈1.3–1.4B): Mamba-2-1.3B reaches ppl 6.66, improving over Mamba-1.4B's 6.80; the zero-shot average is tied (56.4) (evidence: Tab.1).
- Large (2.7B / 300B tokens): Mamba-2 ≈ Transformer++ (same recipe). With 6 attention layers mixed in, achieves ppl 5.95, Avg 61.0, surpassing both (evidence: Tab.3).
- Scaling laws: From 125M to 1.3B, SSD sits on the perplexity–theoretical-FLOPs–wallclock Pareto frontier, dominating Transformer++ (evidence: Fig.9).
3) Memory Tasks (MQAR)
- Even with N=16, Mamba-2 ≫ Mamba-1, and as N=16→64→256, accuracy keeps rising (state capacity effect) (evidence: Fig.8, §9.1).
Our Perspective: Strengths, Limitations, and Why This Work Matters
Strengths
- Unified theory + practical kernel: The semiseparable-matrix view unifies linear/quadratic algorithms into one framework, concretized into TensorCore-friendly kernels (evidence: §5–§6).
- System design alignment: Input-parallel projection + GroupNorm reduces TP communication from 2→1 per block, and naturally extends to SP/CP (evidence: Fig.7, §8).
- Long-context efficiency proven: Cross-over point with FA-2 is at 2k tokens, with a large gap at 16k (evidence: Fig.10).
Limitations (explicit + potential)
- Training efficiency at short sequence lengths (≈2k) may be weaker than the Transformer's, so an SSD + MLP hybrid is recommended (evidence: §9.3).
- Ceiling on pure Mamba-2 quality: Mixing in 6 attention layers consistently yields further gains (evidence: Tab.3).
- Evaluation scope: Reported scale is ≤2.8B, benchmarks focus on Pile/Zero-shot, with limited math, code, multimodal coverage (evidence: Tab.1/§9.2).
Why is this important?
- Long-context fundamentals: Handles long sequences with state (O(N)) instead of KV-cache (O(T)), fundamentally altering training/serving costs (evidence: §6.3).
- Bridge between SSM ↔ Attention: Frames them as interchangeable components, opening the door to hybrid architecture optimization (evidence: §5, §9.2.3).
What’s Next?: Future Directions
- SMA generalization: Explore other structured masks for designing new efficient attention mechanisms (finite feature maps $\psi$; evidence: §10).
- Hybrid optimization: Build a Pareto map of attention-layer ratio (≈5–20%) × placement × head width for optimizing ICL/copy/MQAR efficiency (evidence: §9.2.3, Tab.2).
- Expressivity trade-off: Quantify the tradeoff between state expansion $N$ and a global cache for copy/ICL performance (evidence: §10).
- Transfer interpretability: Verify if phenomena like attention sinks also exist in Mamba/SSM (evidence: §10).
- Scaling & benchmark diversity: Expand to ≥7B models, covering instruction-following, math, coding, multimodal.
Appendix: Quick Numerical Notes (Training / Inference · Memory)
- Pretraining: 2.7B parameters trained on Pile 300B tokens, batch size 1.0M tokens (≈300k steps by simple arithmetic) (evidence: §9.2.3, Appx §D.3).
- Kernel throughput (relative): 2–8× faster than Mamba-scan, superior to FA-2 beyond 2k tokens (A100-80GB PCIe; evidence: Fig.10, §9.3).
- Memory: SSD requires no KV-cache during inference, retaining only state (O(N)), which can be tens of times smaller than Transformer KV (evidence: §6.3).
Every number and claim is backed with evidence from the text, tagged with §/Fig./Tab. references. For derivations, block design, and parallelization schemes, see §5–§8, Fig.7–10, Tab.1–4.
The following is a detailed LLM Q&A about the paper.
Prompt 1.1.1 (Research Gap)
Analyze the ‘Introduction’ and ‘Related Work’ sections of the paper to explain the core research gap that this work explicitly aims to solve, the decisive limitations of prior work, or the remaining open questions. Summarize what the authors describe as the ‘state of the art’ at the time of publication.

Compressed conclusion (one-line summary): This paper formalizes the duality between SSM ↔ Attention (SSD) to (1) explain the complexity mismatch—Transformer training $O(T^2)$ / inference with an $O(T)$ KV-cache vs. SSM training $O(T)$ / $O(1)$ state size—within a unified framework, (2) provide a common language to transplant Transformer-side system/parallelization optimizations (e.g., TP/Seq-Parallel) into SSMs, and (3) deliver Mamba-2 (core layer improvements) that are 2–8× faster. (Evidence: §1, §5, §7–§9)
1) Core Research Gap
- Separated Evolution: Recent SSMs (S4, Mamba, etc.) demonstrate advantages in training ($O(T)$), inference state ($O(1)$), and long-range dependencies, matching or surpassing Transformers in small/medium scale tasks. However, they evolved separately from the Transformer community’s theoretical/system-level optimizations, making them harder to analyze, reproduce, and train efficiently. (Evidence: §1)
- Lack of a Unifying Theory: Partial connections (e.g., Linear Attention) existed, but no coherent, general framework existed to view SSMs (linear/recurrent) and masked Attention (quadratic) as mathematically equivalent (semiseparable structure) and mutually transformable. (Evidence: §1, §4–§5, Fig.1)
- Gap in System Transference: Optimizations like TP/Sequence Parallel/Variable-length training, well-developed for Transformers, lacked a design language or algorithmic blueprint for direct application to SSMs. (Evidence: §8 overview)
2) Decisive Limitations of Prior Work
- Complexity/Memory Bottlenecks (Transformer): Training $O(T^2)$, inference KV-cache $O(T)$, causing costs to grow sharply with sequence length. (Evidence: §1)
- Heterogeneous Computation Modes (SSM vs. Attention): SSM = linear recurrence/convolution, Attention = quadratic tensor contraction—making direct transfer across paradigms difficult. (Evidence: §3–§4)
- Limits of Partial Connections: The “Linear Attention = functionally same / algorithmically different” view did not fully generalize to SSM (linear/recurrent) ↔ Attention (quadratic/kernel) as merely different contraction orders of the same semiseparable matrix computation. (Evidence: §4.3.1)
3) This Paper’s Solution (Contributions)
- SSD (Structured State-Space Duality): Proves SSM ≡ semiseparable matrix and generalizes masked kernel Attention into Structured Masked Attention (SMA), establishing their duality. ⇒ Shows SSM’s linear-time algorithms and kernel Attention’s quadratic algorithms are just different contraction orders of the same function. (Evidence: §3–§5, Fig.1)
- Efficient Algorithm (Block Decomposition): Proposes a diagonal/low-rank block decomposition of semiseparable structures for hardware-friendly computation paths (e.g., GEMM-dominant). Provides a simple SSD implementation. (Evidence: §6)
- Architecture/System Transfer: Formalizes SSM analogues of Transformer optimizations: multi-head analogy (SSM-Heads ≈ MHA heads), GVA/MES designs, TP-friendly blocks (half the syncs), Sequence Parallel, variable-length training without padding. (Evidence: §7–§8)
- Empirical Results (Representative Numbers): Mamba-2 core layers achieve 2–8× speedup over Mamba under identical settings, while maintaining LM competitiveness. (Evidence: Abstract)
4) State of the Art at Time of Publication (per Authors)
| Axis | Transformer Family | SSM Family | Summary |
|---|---|---|---|
| Training Complexity | $O(T^2)$ (masked Attention) (Evidence: §1) | $O(T)$ (linear recurrence) (Evidence: §1) | SSM better in scaling with length |
| Generation Memory | KV-cache $O(T)$ (Evidence: §1) | State size $O(1)$ (Evidence: §1) | SSM better for long-context inference |
| Performance Level | Dominates SOTA (GPT/LLaMA etc.) (Evidence: §1) | Competitive/equal in small–medium scale (Mamba) (Evidence: §1) | Competitive depending on scale/domain |
| Theory/System Stack | Rich (FlashAttention, TP/Seq-Parallel, etc.) (Evidence: §8) | Fragmented, unstandardized (Evidence: §1) | Lack of common language as bottleneck |
| Connecting Work | Linear Attention, kernel view (Evidence: §1, §4) | Selective SSM, state expansion, etc. (Evidence: §10.1) | Need for full duality formalization |
Summary: As of May 2024, SOTA was Transformer-dominated, while SSMs showed structural advantages in long-sequence efficiency. But lack of theory/system standardization hindered scaling. This paper resolves it via SSD, proving speed & transferability (Mamba-2 2–8× faster). (Evidence: §1, Abstract, §8–§10)
5) Positioning in Related Work
- SSM Lineage: Categorized by time-varying (Selective), dimension, and structural axes (diagonal/low-rank). Describes SSD as a Selective SSM (SISO, scalar-identity). (Evidence: §10.1)
- Broader Model Spectrum: Covers GateLoop, GLA, HGRN, Griffin, RecurrentGemma, xLSTM, RWKV, unifying gating/state-expansion/local-attention trends under SSD’s framework. (Evidence: §10 Related Models)
- Meta-Claim: “Every fast recurrent-form kernel Attention is a type of SSM,” effectively erasing boundaries. (Evidence: §5)
Reference Formula (Complexity)
$$ \text{Attention (Training)}=\Theta(T^2),\quad \text{SSM (Training)}=\Theta(T),\quad \text{KV-Cache (Generation)}=\Theta(T),\quad \text{SSM State Size}=\Theta(1) $$ (Evidence: §1)
This section is organized from a compute/serving perspective, centered on evidence tags and quantitative indicators, and links naturally to the later sections (compute/cost, serving metrics, SOTA tables).
Prompt 1.1.2 (Central Hypothesis)
What is the central hypothesis or core claim of this paper? State it clearly in one sentence in the form: “The authors hypothesize that by using [proposed method], they can overcome [limitation] and achieve [specific result].”

The authors hypothesize that by using Structured State Space Duality (SSD) to unify and interconvert SSMs and masked Attention under the lens of semiseparable matrices, they can reduce the efficiency bottlenecks of long-context scaling, and that their implementation, Mamba-2, can achieve 2–8× faster core layers while maintaining language-modeling competitiveness. (Evidence: §Abstract)
Prompt 1.2.1 (Identifying Originality)
Based on the entire paper, identify the 1–3 most important and original contributions. Clearly state whether each is a new architecture component, a new training method, a new theoretical insight, a new dataset, or a novel application of existing methods.

Compressed conclusion: The originality lies in (1) the SSD theory formalizing the SSM↔Attention duality (linear/quadratic forms of the same function, with proofs), (2) hardware-efficient SSD algorithms (training $O(TN^2)$, inference $O(TN)$, 2–8× faster than Mamba, surpassing FlashAttention-2 beyond 2k tokens, ~6× at 16k), and (3) the Mamba-2 architecture/system design (parallel blocks, GVA heads, TP compatibility, trained at 2.7B / 300B tokens, outperforming peers). (Evidence: §3–§6, §7, §9; Fig.1/6/10)
1) SSD Theoretical Framework — New Theoretical Insight
- Core: Formalizes SSM as semiseparable matrix, masked kernel Attention as 1-semiseparable SMA, proving duality: same function but linear (recurrent) vs. quadratic (kernel) contraction orders. Extends to claim: all fast recurrent kernel Attentions are SSMs. (Evidence: §5.1–§5.3, Fig.1)
- Quantitative Significance: Explains complexity branching within one framework—SSM training $O(TN^2)$ / inference $O(TN)$, Attention training $O(T^2N)$ / state size $O(T)$. (Evidence: §6 summary)
2) Hardware-Efficient SSD Algorithm — New Efficient Algorithm
- Core: Proposes diagonal/low-rank block decomposition of semiseparable matrices, mapping to matmul-dominant computation (TensorCore-friendly). Algorithm guarantees: training FLOPs $O(TN^2)$, inference FLOPs $O(TN)$, inference memory $O(N^2)$ (Theorem 6.1). (Evidence: §6)
- Performance (Quantitative): 2–8× faster than the Mamba fused scan, crossover with FlashAttention-2 at 2k tokens, ~6× faster at 16k tokens; scaling state size $N$ up to 8× with minimal slowdown. (Evidence: §9.3 Fig.10, §6)
3) Mamba-2 Architecture & System Transfer — New Architecture / Systematic Application of Existing Methods
- Core Changes: Introduces parallel Mamba blocks (input-dependent projection moved to block start, plus normalization), and GVA (Grouped-Value Attention)-like heads, enabling Tensor Parallelism. (Evidence: §7.1, Fig.6)
- Effects (Quantitative): Under Chinchilla scaling, Mamba-2 (2.7B Params) trained on Pile 300B tokens achieves Pareto advantage (perplexity–time) over Mamba-2.8B/Pythia-2.8B/6.9B and comparable/better performance than Transformer++ baselines. (Evidence: §7, §9.2)
Summary: (Theory) Unified duality of SSM vs. Attention via SSD → (Algorithm) matmul-dominant SSD achieving 2–8× speedup → (Architecture/System) Mamba-2 with TP/head design improvements and scaling advantages. (Evidence: §5–§7, §9)
Prompt 1.2.2 (Strengths from Authors’ Perspective)
From the authors’ perspective, why is their approach superior to previous methods? Cite or clearly explain their key arguments supporting originality and strengths.

Compressed conclusion: The authors argue that SSD (Structured State-Space Duality) provides (i) theoretical unification by framing SSM↔Attention in one formalism, (ii) hardware-friendly algorithms with guarantees (training $O(TN^2)$, inference $O(TN)$, memory $O(N^2)$), and (iii) empirical proof of being 2–8× faster than Mamba and surpassing FlashAttention-2 beyond 2k tokens. (Evidence: §5–§6, §9.3/Fig.10)
Authors’ Key Arguments (Strengths)
Theoretical Unification & Generality — “SSM↔Attention Duality”
- 1-semiseparable SMA ≡ diagonal SSM special case (Cor. 5.1).
- Efficient autoregressive Attention ⇔ semiseparable mask required (Thm. 5.2).
- Frames linear/quadratic forms as contraction-order differences of the same function. (Evidence: §5.2–§5.3)
- Provides common intersection (SSD) unifying SSMs and SMAs, enabling comparative analysis across sequence models. (Evidence: Fig.4)
Hardware-Friendly SSD Algorithms — “Matmul-Dominant with Complexity Guarantees”
- Theorem 6.1: Guarantees training FLOPs $O(TN^2)$, inference FLOPs $O(TN)$, inference memory $O(N^2)$, dominated by GEMM ops. (Evidence: §6)
- Diagonal/low-rank block decomposition allows compromise paths—simpler than pure scan and efficient with just PyTorch implementations. (Evidence: §6)
Empirical Performance — “2–8×, 2k crossover, ~6× at 16k”
- On A100-80GB PCIe, SSD is 2–8× faster than Mamba fused scan, surpasses FlashAttn-2 beyond 2k tokens, and ~6× faster at 16k. (Evidence: §9.3/Fig.10)
- Scaling state size N (64→256) barely affects SSD, while Mamba scan degrades linearly—showing stable scaling. (Evidence: Fig.10)
Language Modeling & Pareto Efficiency
- 2.7B params / 300B tokens (Pile): Mamba-2 vs. Transformer++ shows comparable perplexity/zero-shot, with Mamba-2-Attention hybrid outperforming both. (Evidence: §9.2/Table 3)
- Claims Pareto dominance over Transformer baselines in perplexity–FLOPs–wallclock. (Evidence: §9)
Challenging Memory Tasks (MQAR) — “Scaling N Improves Accuracy”
- Mamba-2 ≫ Mamba-1/vanilla Attention across tasks, accuracy increases with N=16→64→256 (larger memory capacity effect). (Evidence: Fig.8, §9.1)
System-Level Transferability — “TP/SP/Variable-Length”
- TP-Friendly: Only one sync per block, GroupNorm to avoid extra comm. (Evidence: §8)
- SP/Context Parallel: Distributes long-context training by passing only recurrent states. (Evidence: §8.2)
- Variable-length training/inference without padding/realignment. (Evidence: §8)
Compatibility with Attention Approximations
- Absorbs kernel Attention approximations (Performer, etc.) and normalization into Mamba-2 hyperparameters, removing bottlenecks from softmax/position masks. (Evidence: §4.1.3, §7)
Summary (numbers-focused): On A100-80GB, 2–8× faster than Mamba-scan, overtakes FlashAttn-2 at ≥2k tokens, ~6× faster at 16k, while 2.7B/300B training shows Transformer++-level quality and Pareto superiority (ppl/FLOPs/wallclock). (Evidence: §9.3/Fig.10, §9.2)
Prompt 1.3.1 (Step-by-Step Algorithm Explanation)
"Explain the core algorithm, model architecture, or main methodology step by step. Assume the reader is a graduate-level student in AI. Use very simple examples (toy examples) such as a 3×3 pixel image, a small state space, or short sequences. Walk through each step to show how input transforms into output. Define all key terms and variables immediately when they appear."Prompt 1.3.1 — Step-by-Step Algorithm Explanation (SSD ↔ Mamba-2)
Compressed Conclusion: SSD unifies SSMs and masked Attention as two algorithms computing the same function with different contraction orders (linear vs. quadratic). With block decomposition + low-rank structure, it achieves $O(TN^2)$ training FLOPs / $O(TN)$ inference FLOPs / $O(N^2)$ memory using a matmul-dominant path. This computation core is wrapped inside the Mamba-2 block (Conv → SSM → gating → normalization → output), which supports 1 all-reduce/layer in TP and sequence parallelism as-is. (Evidence: §5 summary; Thm 6.1; Fig.7)
A. Core Objects and Notation (Definitions)
- Sequence length $T$ (tokens), head/feature dimension $P$, state dimension $N$, block size $Q$. (Evidence: §6)
- SSM parameters: $A_t\in\mathbb{R}^{N\times N}$, $B_t,C_t\in\mathbb{R}^N$ (possibly time-dependent). (Evidence: Def.3.2)
- SSS (Sequentially Semiseparable) representation: $$ M_{ji} = C_j^\top A_j\cdots A_{i+1} B_i,\quad y = Mx $$ —thus SSM ≡ semiseparable matrix representation. (Evidence: Eq.(3), Def.3.1/3.2)
- SMA (Structured Masked Attention): contraction $M = QK^\top \circ L$; has both a quadratic mode (standard Attention) and a linear mode (kernel/scan). (Evidence: §4.3.1)
- SSD (Structured State-Space Duality): establishes equivalence between scalar-identity SSM and 1-SS SMA, proving that SSM’s linear-time recurrence and SMA’s quadratic-time kernel contraction compute the same function (duality). (Evidence: §5.1–5.3)
Key: Semiseparable matrices guarantee that every sub-block below the diagonal has rank ≤ N. Off-diagonal blocks are low-rank → enabling fast algorithms. (Evidence: Def.3.1; Lemma/Prop.3.x)
B. Two Paths for the Same Function
B-1) Quadratic Path (Attention) — Standard SMA Contraction
- Steps: $G = QK^\top \;\to\; M = G \circ L \;\to\; Y = MV$. (Evidence: Eq.(13))
- Complexity: Training $O(T^2P)$, state size $O(T)$ (KV-cache). (Evidence: comparison table)
B-2) Linear Path (SSM) — Scan/Recurrence
- Scalar SSM (1-SS): recurrence in linear time: $$ h_t = a_t h_{t-1} + b_t x_t,\quad y_t = c_t h_t $$ (Evidence: §3.2.2)
- Equivalence: The naive quadratic matrix multiplication of 1-SS SSM is identical to masked kernel Attention. (Evidence: §5.1 Eq.(16))
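A minimal NumPy sketch (our notation; scalar-identity $A_t = a_t I$ as in SSD) checking that the linear recurrence and the masked contraction $(CB^\top \circ L)x$ compute the same outputs:

```python
import numpy as np
rng = np.random.default_rng(0)

T, N = 6, 4                                   # sequence length, state dimension
a = rng.uniform(0.5, 1.0, T)                  # per-step scalar decay (A_t = a_t * I)
B = rng.standard_normal((T, N))               # input projections ("keys")
C = rng.standard_normal((T, N))               # output projections ("queries")
x = rng.standard_normal(T)                    # one scalar input channel

# Linear path: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = <C_t, h_t>
h, y_lin = np.zeros(N), np.zeros(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_lin[t] = C[t] @ h

# Quadratic path: M = (C B^T) * L, with L[j, i] = a_{i+1} * ... * a_j for j >= i
L = np.zeros((T, T))
for j in range(T):
    for i in range(j + 1):
        L[j, i] = np.prod(a[i + 1 : j + 1])   # empty product on the diagonal = 1
M = (C @ B.T) * L
y_quad = M @ x

print(np.allclose(y_lin, y_quad))             # True
```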
C. SSD Block Algorithm (Core Steps)
Goal: Partition the semiseparable matrix $M$ into $Q \times Q$ blocks; compute diagonal blocks with matmul, off-diagonal blocks with low-rank outer products + short scans. (Evidence: Thm 6.1)
Block decomposition: Split $M \in \mathbb{R}^{T\times T}$ into $(T/Q)\times(T/Q)$ blocks.
Diagonal blocks $M_{bb}$:
- Compute using the local mask $L$ and Gram matrices → implemented with batched matmul (BMM).
Off-diagonal blocks $M_{bb'}$ ($b > b'$): low-rank → 3-step pipeline:
- Right: factor multiplication (BMM).
- Center: short 1-SS scan (length $T/Q$).
- Left: factor multiplication (BMM).
Total cost (when $N=P=Q$): training $O(TN^2)$ FLOPs, inference $O(TN)$, dominated by matmul.
Theorem 6.1: Guarantees training $O(TN^2)$, inference $O(TN)$, inference memory $O(N^2)$, matmul-dominant.
Comparison: Attention $O(T^2N)$ vs. SSD/SSM $O(TN^2)$ FLOPs; state size $T$ vs. $N$.
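A minimal NumPy sketch of this block algorithm, continuing the scalar-decay setup from the sketch above (our variable names; the paper's kernel batches steps 1–3 as GEMMs over heads and channels, which this single-channel version does not attempt):

```python
import numpy as np
rng = np.random.default_rng(0)

T, N, Q = 12, 4, 4                        # sequence length, state dim, block size (T % Q == 0)
a = rng.uniform(0.5, 1.0, T)              # per-step scalar decay
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
x = rng.standard_normal(T)

def decay_matrix(a_blk):
    """L[j, i] = a_{i+1} * ... * a_j (1 on the diagonal, 0 above it)."""
    q = len(a_blk)
    L = np.zeros((q, q))
    for j in range(q):
        for i in range(j + 1):
            L[j, i] = np.prod(a_blk[i + 1 : j + 1])
    return L

y, H = np.zeros(T), np.zeros(N)           # outputs and the carried inter-block state
for s in range(0, T, Q):                  # one iteration per block row of the matrix
    a_b, B_b, C_b, x_b = a[s:s+Q], B[s:s+Q], C[s:s+Q], x[s:s+Q]
    L = decay_matrix(a_b)

    # 1) Diagonal block: local masked "attention", a small matmul
    y_intra = ((C_b @ B_b.T) * L) @ x_b

    # 2) Off-diagonal (low-rank) part: the carried state H, decayed up to each position
    decay_in = np.cumprod(a_b)            # prod of a from block start up to position j
    y_inter = decay_in * (C_b @ H)
    y[s:s+Q] = y_intra + y_inter

    # 3) Short scan step: update the carried state for the next block
    decay_out = L[-1]                     # prod of a from position i+1 to the block end
    H = decay_in[-1] * H + (decay_out[:, None] * B_b * x_b[:, None]).sum(axis=0)

# Reference: the plain linear recurrence
h, y_ref = np.zeros(N), np.zeros(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_ref[t] = C[t] @ h
print(np.allclose(y, y_ref))              # True
```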
D. Toy Example — 3×3 Image → 9 Tokens, 1-SS SSM vs. SMA
D-0) Setup
- Input image (intensity) $3\times3$ → row-major flatten $x_{1:9} = [1,\dots,9]$.
- Scalar SSM: $a=0.5$, $b_t=c_t=1$, hidden state $h_t \in \mathbb{R}$.
- SMA mask: $L_{j,i} = a^{j-i}$ if $j\ge i$, else 0.
D-1) Linear Path (SSM Recurrence) — One Token at a Time
$$ h_t = 0.5\, h_{t-1} + x_t,\quad y_t = h_t,\quad h_0 = 0 $$
- $t=1$: $h_1 = 1 \to y_1=1.0$
- $t=2$: $h_2=2.5 \to y_2=2.5$
- $t=3$: $h_3=4.25 \to y_3=4.25$
- … → $y_9=16.0039$
D-2) Quadratic Path (SMA Matrix Multiplication)
$$ y_j = \sum_{i=1}^{j} a^{j-i} x_i $$
- Example: $j=3$: $[a^2,a,1]\cdot[1,2,3] = 0.25+1+3=4.25$
- Matches recurrence results exactly.
D-3) Block/Low-Rank Illustration (Q=3)
- Example (off-diagonal block mapping inputs 1–3 to outputs 4–6): $u=[1,0.5,0.25]$, $v=[0.125,0.25,0.5]$.
- Contribution = $u \cdot (v^\top x_{1:3}) = [2.125,1.0625,0.53125]$.
- Shows off-diagonals handled by low-rank outer products, diagonals by local matmul → SSD block algorithm.
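A quick numerical check (our notation) that this rank-1 contribution equals the first block's carried state, decayed forward:

```python
import numpy as np

a = 0.5
x = np.arange(1.0, 10.0)                               # tokens 1..9
h3 = sum(a ** (3 - i) * x[i - 1] for i in (1, 2, 3))   # state after block 1: 4.25

u = np.array([1.0, 0.5, 0.25])                         # u_j = a**(j-4), j = 4..6
v = np.array([0.125, 0.25, 0.5])                       # v_i = a**(4-i), i = 1..3
print(u * (v @ x[:3]))                                 # [2.125  1.0625 0.53125]
print(a ** np.arange(1, 4) * h3)                       # same values: h_3 decayed forward to y_4..y_6
```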
E. Mamba-2 Block — SSD Core in a Training/Serving-Friendly Layer
One head in Mamba-2:
- Projection: $x^{(i)}=uW^{(x)}_i$, $z^{(i)}=uW^{(z)}_i$.
- Local Conv1D: $x_c^{(i)}=\mathrm{conv1d}(x^{(i)})$.
- SSD/SSM transform: $y^{(i)}=\mathrm{SSM}(x_c^{(i)})$.
- Gating: $y_g^{(i)} = y^{(i)} \cdot \phi(z^{(i)})$.
- GroupNorm: TP-friendly normalization.
- Output projection + all-reduce once per layer.
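A heavily simplified PyTorch sketch of this data flow (single device, single head group; the SSD core, the Δ/B/C projections, and the exact normalization variant are abstracted away, and all sizes are our choices rather than the paper's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mamba2BlockSketch(nn.Module):
    """Simplified sketch of the per-head data flow listed above; shapes are illustrative."""
    def __init__(self, d_model=256, d_inner=512, d_conv=4, n_groups=8):
        super().__init__()
        self.in_proj_x = nn.Linear(d_model, d_inner, bias=False)   # W^(x)
        self.in_proj_z = nn.Linear(d_model, d_inner, bias=False)   # W^(z): gate branch
        self.conv = nn.Conv1d(d_inner, d_inner, d_conv, groups=d_inner,
                              padding=d_conv - 1)                   # causal depthwise conv
        self.norm = nn.GroupNorm(n_groups, d_inner)                 # TP-friendly extra norm
        self.out_proj = nn.Linear(d_inner, d_model, bias=False)     # 1 all-reduce/layer in TP

    def ssd(self, x):
        # Stand-in for the SSD core (see the NumPy block sketch above); identity here.
        return x

    def forward(self, u):                                  # u: (batch, seq_len, d_model)
        b, t, _ = u.shape
        x, z = self.in_proj_x(u), self.in_proj_z(u)        # parallel input-dependent projections
        x = self.conv(x.transpose(1, 2))[..., :t].transpose(1, 2)   # trim the causal padding
        y = self.ssd(x)                                    # SSD / SSM transform
        y = y * F.silu(z)                                  # gating
        y = self.norm(y.reshape(b * t, -1)).reshape(b, t, -1)       # per-token grouped norm
        return self.out_proj(y)

print(Mamba2BlockSketch()(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```

The sketch only fixes the ordering: both projections read the raw block input $u$ in parallel (rather than one depending on the other's output, as in Mamba-1), and in a TP setting the only cross-device reduction would follow `out_proj`.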
Parallelism:
- TP: input-parallel projection + GN → only 1 all-reduce/layer.
- SP/CP: sequence split, only pass recurrent states across boundaries.
F. Summary (Complexity & Paths)
| Algorithm | State size | Training FLOPs | Inference FLOPs (per step) | Memory (naive) | Main Ops |
|---|---|---|---|---|---|
| Attention (SMA) | T | T²N | TN | T² | Matmul |
| Linear SSM | N | TN² | N² | TN² | Scan + Matmul |
| SSD (Block) | N | TN² | N² | TN | Matmul (+short scan) |
Prompt 1.3.2 (Secret Ingredient)
"Pick one key component. Show Δ(metric) when removed/replaced/rescaled, in a table, and explain the mechanism behind the change."Prompt 1.3.2 — Secret Ingredient: Mamba-2 Block’s “Parallel Projection + Extra Normalization (GroupNorm)”
Compressed conclusion: Computing the Δ, B, C projections directly from the input $u$ (in parallel) and inserting GroupNorm lets Mamba-2 improve ppl from 11.76 to 11.49 (−0.27) with 2.2% fewer params, while reducing TP communication per block from 2 all-reduces to 1. This removes a structural bottleneck present in Mamba-1. (Evidence: §9.4 Tab.4, Fig.7/§8.1)
Ablation Table — Δ(metric)
| Block | Projections | Extra Norm | Params | Perplexity | Δ vs Baseline |
|---|---|---|---|---|---|
| Mamba-1 | Sequential | ✗ | 129.3 M | 11.76 | — |
| Mamba-1 | Sequential | ✓ | 129.3 M | 11.54 | −0.22 |
| Mamba-2 var | Parallel | ✗ | 126.5 M | 11.66 | −0.10 |
| Mamba-2 | Parallel | ✓ | 126.5 M | 11.49 | −0.27 |
(Evidence: Tab.4)
Mechanism
- Projection move (Sequential→Parallel): avoids extra all-reduce.
- GroupNorm: TP-friendly normalization that stabilizes the state scale and improves convergence (ppl −0.22 to −0.27).
- Practical effect: reduces TP communication cost; supports SP/CP with state passing.
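A toy single-process sketch (our construction, with two channel chunks standing in for two TP ranks) of why grouped normalization needs no extra all-reduce:

```python
import torch
import torch.nn.functional as F

# Toy setting: d_inner = 8 channels, sharded over 2 "ranks"; 4 norm groups (2 per rank).
y = torch.randn(5, 8)                            # (tokens, channels), conceptually sharded

def local_groupnorm(shard, n_groups, eps=1e-5):
    # Each group lies entirely inside this shard, so mean/var need no cross-rank all-reduce.
    t, c = shard.shape
    g = shard.reshape(t, n_groups, c // n_groups)
    mu = g.mean(-1, keepdim=True)
    var = g.var(-1, unbiased=False, keepdim=True)
    return ((g - mu) / (var + eps).sqrt()).reshape(t, c)

out = torch.cat([local_groupnorm(s, n_groups=2) for s in y.chunk(2, dim=-1)], dim=-1)
ref = F.group_norm(y, num_groups=4)              # same result computed without sharding
print(torch.allclose(out, ref, atol=1e-5))       # True

# A LayerNorm over all 8 channels would instead need statistics spanning both shards,
# i.e. an extra all-reduce before the output projection.
```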
One-line summary
Mamba-2’s parallel input-based projection + GroupNorm = better perplexity (−0.27), fewer params (−2.2%), 1 all-reduce instead of 2 → a true secret weapon.
Prompt 1.4.1 (Key Results Analysis)
Compressed conclusion: (i) Language modeling: Mamba-2 improves perplexity (6.66 vs 6.80) and matches Transformer++ at 2.7B/300B; the hybrid (6 attention layers) further improves ppl to 5.95. (ii) Efficiency: the SSD core is 2–8× faster than the Mamba scan and faster than FA-2 for T≥2k. (iii) Memory tasks (MQAR): Mamba-2 ≫ Mamba-1 and scales with N. (iv) Scaling laws: Pareto dominance over Transformer++ in ppl–FLOPs–wallclock.
Benchmarks: Pile, zero-shot suite (LAMBADA, HellaSwag, PIQA, ARC-E/C, WinoGrande, OBQA), MQAR, efficiency on A100-80GB.
Prompt 1.4.2 (Critical Comparison)
Compressed conclusion:
- Efficiency: SSD beats Mamba scan (2–8×) and FA-2 at ≥2k tokens.
- Quality: Mamba-2 ≈ Transformer++ (2.7B/300B); hybrid (M2+Attn) surpasses both.
- MQAR: strong gains over Mamba-1.
- Weak spots: equal performance vs Transformer++ at 2.7B (needs hybrid), limited gains at short context (~2k).
Prompt 1.5.1 (Limitations)
- Short context inefficiency (~2k): SSD-only worse than Transformer (needs SSD+MLP mix).
- Pure Mamba-2 ceiling: hybrid with Attention improves.
- Speed advantage depends on long context (≥2k).
- Evaluation limited: ≤2.8B params, mostly Pile/zero-shot.
Potential: struggles with copy/retrieval tasks, workloads dominated by short contexts, generalization to math/code/multimodal not shown.
Prompt 1.5.2 (Future Work)
- Generalize SMA masks to design new efficient attentions.
- Quantify hybrid ratios (SSM vs. Attention).
- Interpretability transfer (attention sinks in SSMs).
- Trade-off maps (state N vs. cache).
- Identify drivers of MQAR gains.
- Explore larger models (≥7B), more diverse benchmarks.
Additional Question: Compute, Memory, Dependencies
Conclusion: SSD enables matmul-dominant kernels; on A100-80GB PCIe, achieves 2–8× faster kernel time vs. Mamba scan, surpasses FA-2 for T≥2k. Pretraining: 2.7B params, Pile 300B tokens, batch 1M tokens (~300k steps). Exact tokens/s, CUDA version not given.
Dependencies: PyTorch, EleutherAI LM harness, GPT-NeoX/GPT-2 tokenizer. CUDA/Triton versions not specified.
Memory: SSD O(TN) vs Attention O(T²). Example: SSD state ~10 MB vs. Transformer KV-cache ~1.34 GB (4k ctx).
Throughput: kernel-level speedups (2–8× vs Mamba, faster than FA-2 beyond 2k). Model-level tokens/s not reported.
Training cost: 300B tokens, batch 1M = ~300k steps. FLOPs formula O(TN²), PF-days not reported.
![[Paper Review] Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality](https://icml.cc/media/PosterPDFs/ICML%202024/32613.png)