[Paper Review] Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Paper Link

Structured State Space Duality: Unifying SSMs and Attention with Mamba-2 for 2–8× Acceleration

TL;DR

Structured State-Space Duality (SSD) reveals that SSMs and masked attention compute the same function with different contraction orders. It introduces a matmul-dominant kernel based on block decomposition and low-rank properties, achieving 2–8× speedup over Mamba-scan and outperforming FlashAttention-2 for sequences longer than 2k tokens. In a 2.7B / 300B tokens pre-training setup, it matches the quality of Transformer++ and shows further improvement to a PPL of 5.95 when mixed with 6 attention layers (Source: Abstract, Fig. 10, Tab. 3, §9.2–9.3).


Core Ideas

Background: The Problem They Solved

How It Works: A Concrete Example

1) “Same Function, Different Contraction Orders”

Flatten a 3×3 image $x$ in row-major order: $x_{1:9}=[1,2,\dots,9]$. Scalar SSM (1-SS): $h_t = a\,h_{t-1} + x_t,\ y_t = c\,h_t$ with $a=0.5,\ c=1$.
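A minimal NumPy sketch (an illustration, not code from the paper) that checks the two contraction orders agree on this toy input:

```python
import numpy as np

a, T = 0.5, 9
x = np.arange(1, 10, dtype=float)        # flattened 3x3 image: [1, 2, ..., 9]

# Linear path: recurrence h_t = a*h_{t-1} + x_t, y_t = h_t
h, y_linear = 0.0, []
for t in range(T):
    h = a * h + x[t]
    y_linear.append(h)

# Quadratic path: y = L @ x with the 1-SS mask L[j, i] = a**(j-i) for j >= i
L = np.tril(a ** (np.arange(T)[:, None] - np.arange(T)[None, :]))
y_quadratic = L @ x

assert np.allclose(y_linear, y_quadratic)  # same function, different contraction order
```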

2) Block–Low-Rank at a Glance

Off-diagonal block terms $L_{j,i}=a^{j-i}=u_j v_i$ can be factorized into rank-1 form, handled by one outer product + short scan → aligns with the GEMM-favored path (evidence: §6).
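Concretely, choosing any split index $k$ between the two blocks ($i \le k \le j$) factors the decay through the block boundary:

$$ L_{j,i} = a^{j-i} = a^{j-k}\cdot a^{k-i} = u_j v_i,\qquad u_j := a^{j-k},\quad v_i := a^{k-i} $$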

3) Formula Snippet (Complexity)

$$ \text{Attention (Training)}=\Theta(T^2),\quad \text{SSD (Training)}=\Theta(TN^2),\quad \text{SSD (Inference)}=\Theta(TN) $$ (evidence: §6, §6.3)


Performance Validation: Key Results

1) Kernel / System Efficiency

$$ \text{KV-Cache (GB)} \approx \frac{2 \cdot L \cdot H \cdot d_\text{head} \cdot T \cdot \text{bytes}}{10^9} $$ SSD maintains only a fixed-size state of dimension $N$ instead of a KV-cache that grows with $T$ (evidence: §6.3).
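As a worked example (assuming a 2.7B-class configuration: $L=32$ layers, $H=32$ heads, $d_\text{head}=80$, $T=4096$, 2 bytes per value in fp16), the formula reproduces the ≈1.34 GB figure quoted in the appendix below:

$$ \text{KV-Cache} \approx \frac{2 \cdot 32 \cdot 32 \cdot 80 \cdot 4096 \cdot 2}{10^9} \approx 1.34\ \text{GB} $$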

2) Language Modeling & Scaling

3) Memory Tasks (MQAR)


Our Perspective: Strengths, Limitations, and Why This Work Matters

Strengths

Limitations (explicit + potential)

Why is this important?


What’s Next?: Future Directions


Appendix: Quick Numerical Notes (Training / Inference · Memory)

Every number and claim is backed with evidence from the text, tagged with §/Fig./Tab. references. For derivations, block design, and parallelization schemes, see §5–§8, Fig.7–10, Tab.1–4.

Below is a detailed LLM Q&A about the paper.


Prompt 1.1.1 (Research Gap)

Analyze the ‘Introduction’ and ‘Related Work’ sections of the paper to explain the core research gap that this work explicitly aims to solve, the decisive limitations of prior work, or the remaining open questions. Summarize what the authors describe as the ‘state of the art’ at the time of publication.

Compressed conclusion (one-line summary): This paper formalizes the duality between SSM ↔ Attention (SSD) to (1) explain the complexity mismatch—Transformer training $O(T^2)$ / inference KV-cache $O(T)$ vs. SSM training $O(T)$ / state size $O(1)$—within a unified framework, (2) provide a common language to transplant Transformer-side system/parallelization optimizations (e.g., TP/Seq-Parallel) into SSMs, and (3) deliver Mamba-2, whose core layers are 2–8× faster. (Evidence: §1, §5, §7–§9)


1) Core Research Gap

  1. Separated Evolution: Recent SSMs (S4, Mamba, etc.) demonstrate advantages in training $O(T)$, inference state $O(1)$, and long-range dependencies, matching or surpassing Transformers in small/medium-scale tasks. However, they evolved separately from the Transformer community’s theoretical/system-level optimizations, making them harder to analyze, reproduce, and train efficiently. (Evidence: §1)
  2. Lack of a Unifying Theory: Partial connections (e.g., Linear Attention) existed, but no coherent, general framework existed to view SSMs (linear/recurrent) and masked Attention (quadratic) as mathematically equivalent (semiseparable structure) and mutually transformable. (Evidence: §1, §4–§5, Fig.1)
  3. Gap in System Transference: Optimizations like TP/Sequence Parallel/Variable-length training, well-developed for Transformers, lacked a design language or algorithmic blueprint for direct application to SSMs. (Evidence: §8 overview)

2) Decisive Limitations of Prior Work

  • Complexity/Memory Bottlenecks (Transformer): training $O(T^2)$, inference KV-cache $O(T)$, causing costs to grow sharply with sequence length. (Evidence: §1)
  • Heterogeneous Computation Modes (SSM vs. Attention): SSM = linear recurrence/convolution, Attention = quadratic tensor contraction—making direct transfer across paradigms difficult. (Evidence: §3–§4)
  • Limits of Partial Connections: The “Linear Attention = functionally same / algorithmically different” view did not fully generalize to SSM (linear/recurrent) ↔ Attention (quadratic/kernel) as merely different contraction orders of the same semiseparable matrix computation. (Evidence: §4.3.1)

3) This Paper’s Solution (Contributions)

  • SSD (Structured State-Space Duality): Proves SSM ≡ semiseparable matrix and generalizes masked kernel Attention into Structured Masked Attention (SMA), establishing their duality. ⇒ Shows SSM’s linear-time algorithms and kernel Attention’s quadratic algorithms are just different contraction orders of the same function. (Evidence: §3–§5, Fig.1)
  • Efficient Algorithm (Block Decomposition): Proposes a diagonal/low-rank block decomposition of semiseparable structures for hardware-friendly computation paths (e.g., GEMM-dominant). Provides a simple SSD implementation. (Evidence: §6)
  • Architecture/System Transfer: Formalizes SSM analogues of Transformer optimizations: multi-head analogy (SSM-Heads ≈ MHA heads), GVA/MES designs, TP-friendly blocks (half the syncs), Sequence Parallel, variable-length training without padding. (Evidence: §7–§8)
  • Empirical Results (Representative Numbers): Mamba-2 core layers achieve 2–8× speedup over Mamba under identical settings, while maintaining LM competitiveness. (Evidence: Abstract)

4) State of the Art at Time of Publication (per Authors)

| Axis | Transformer Family | SSM Family | Summary |
|---|---|---|---|
| Training Complexity | $O(T^2)$ (masked Attention) (Evidence: §1) | $O(T)$ (linear recurrence) (Evidence: §1) | SSM scales better with length |
| Generation Memory | KV-cache $O(T)$ (Evidence: §1) | State size $O(1)$ (Evidence: §1) | SSM better for long-context inference |
| Performance Level | Dominates SOTA (GPT/LLaMA etc.) (Evidence: §1) | Competitive/equal at small–medium scale (Mamba) (Evidence: §1) | Competitive depending on scale/domain |
| Theory/System Stack | Rich (FlashAttention, TP/Seq-Parallel, etc.) (Evidence: §8) | Fragmented, unstandardized (Evidence: §1) | Lack of a common language is the bottleneck |
| Connecting Work | Linear Attention, kernel view (Evidence: §1, §4) | Selective SSM, state expansion, etc. (Evidence: §10.1) | Need for full duality formalization |

Summary: As of May 2024, SOTA was Transformer-dominated, while SSMs showed structural advantages in long-sequence efficiency. But lack of theory/system standardization hindered scaling. This paper resolves it via SSD, proving speed & transferability (Mamba-2 2–8× faster). (Evidence: §1, Abstract, §8–§10)


  • SSM Lineage: Categorized by time-varying (Selective), dimension, and structural axes (diagonal/low-rank). Describes SSD as a Selective SSM (SISO, scalar-identity). (Evidence: §10.1)
  • Broader Model Spectrum: Covers GateLoop, GLA, HGRN, Griffin, RecurrentGemma, xLSTM, RWKV, unifying gating/state-expansion/local-attention trends under SSD’s framework. (Evidence: §10 Related Models)
  • Meta-Claim: “Every fast recurrent-form kernel Attention is a type of SSM,” effectively erasing boundaries. (Evidence: §5)

Reference Formula (Complexity)

$$ \text{Attention (Training)}=\Theta(T^2),\quad \text{SSM (Training)}=\Theta(T),\quad \text{KV-Cache (Generation)}=\Theta(T),\quad \text{SSM State Size}=\Theta(1) $$ (Evidence: §1)


This section is organized from a compiler/serving perspective, centered on evidence tags and quantitative indicators, and links naturally to the later sections (Compute/Cost, Serving Metrics, SOTA tables).


Prompt 1.1.2 (Central Hypothesis)

What is the central hypothesis or core claim of this paper? State it clearly in one sentence in the form: “The authors hypothesize that by using [proposed method], they can overcome [limitation] and achieve [specific result].”

The authors hypothesize that by using Structured State Space Duality (SSD) to unify and interconvert SSMs and masked Attention under the lens of semiseparable matrices, they can reduce efficiency bottlenecks in long-context scaling, and that their implementation, Mamba-2, can achieve 2–8× faster core layers while maintaining language modeling competitiveness. (Evidence: Abstract)


Prompt 1.2.1 (Identifying Originality)

Based on the entire paper, identify the 1–3 most important and original contributions. Clearly state whether each is a new architecture component, a new training method, a new theoretical insight, a new dataset, or a novel application of existing methods.

Compressed conclusion: The originality lies in (1) SSD theory formalizing SSM↔Attention duality (linear/quadratic forms of the same function, with proofs), (2) hardware-efficient SSD algorithms (training $O(TN^2)$, inference $O(TN)$, 2–8× faster than Mamba, surpassing FlashAttn-2 beyond 2k tokens, ~6× at 16k), and (3) Mamba-2 architecture/system design (parallel blocks, GVA heads, TP compatibility, trained at 2.7B / 300B tokens, outperforming peers). (Evidence: §3–§6, §7, §9; Fig.1/6/10)


1) SSD Theoretical Framework — New Theoretical Insight

  • Core: Formalizes SSM as semiseparable matrix, masked kernel Attention as 1-semiseparable SMA, proving duality: same function but linear (recurrent) vs. quadratic (kernel) contraction orders. Extends to claim: all fast recurrent kernel Attentions are SSMs. (Evidence: §5.1–§5.3, Fig.1)
  • Quantitative Significance: Explains complexity branching within one framework—SSM training $O(TN^2)$ / inference $O(TN)$, Attention training $O(T^2N)$ / state size $O(T)$. (Evidence: §6 summary)

2) Hardware-Efficient SSD Algorithm — New Efficient Algorithm

  • Core: Proposes diagonal/low-rank block decomposition of semiseparable matrices, mapping to matmul-dominant computation (TensorCore-friendly). Algorithm guarantees: training FLOPs $O(TN^2)$, inference FLOPs $O(TN)$, inference memory $O(N^2)$ (Theorem 6.1). (Evidence: §6)
  • Performance (Quantitative): 2–8× faster than the Mamba fused scan, crossover with FlashAttn-2 at 2k tokens, ~6× faster at 16k tokens; scaling the state size $N$ up to 8× causes minimal slowdown. (Evidence: §9.3 Fig.10, §6)

3) Mamba-2 Architecture & System Transfer — New Architecture / Systematic Application of Existing Methods

  • Core Changes: Introduces parallel Mamba blocks (input-dependent projection moved to block start, plus normalization), and GVA (Grouped-Value Attention)-like heads, enabling Tensor Parallelism. (Evidence: §7.1, Fig.6)
  • Effects (Quantitative): Under Chinchilla scaling, Mamba-2 (2.7B Params) trained on Pile 300B tokens achieves Pareto advantage (perplexity–time) over Mamba-2.8B/Pythia-2.8B/6.9B and comparable/better performance than Transformer++ baselines. (Evidence: §7, §9.2)

Summary: (Theory) Unified duality of SSM vs. Attention via SSD → (Algorithm) matmul-dominant SSD achieving 2–8× speedup → (Architecture/System) Mamba-2 with TP/head design improvements and scaling advantages. (Evidence: §5–§7, §9)


Prompt 1.2.2 (Strengths from Authors’ Perspective)

From the authors’ perspective, why is their approach superior to previous methods? Cite or clearly explain their key arguments supporting originality and strengths.

Compressed conclusion: The authors argue that SSD (Structured State-Space Duality) provides (i) theoretical unification by framing SSM↔Attention in one formalism, (ii) hardware-friendly algorithms with guarantees (training $O(TN^2)$, inference $O(TN)$, memory $O(N^2)$), and (iii) empirical proof of a 2–8× speedup over Mamba and of surpassing FlashAttn-2 beyond 2k tokens. (Evidence: §5–§6, §9.3/Fig.10)


Authors’ Key Arguments (Strengths)

  1. Theoretical Unification & Generality — “SSM↔Attention Duality”

    • 1-semiseparable SMA ≡ diagonal SSM special case (Cor. 5.1).
    • Efficient autoregressive Attention ⇔ semiseparable mask required (Thm. 5.2).
    • Frames linear/quadratic forms as contraction-order differences of the same function. (Evidence: §5.2–§5.3)
    • Provides common intersection (SSD) unifying SSMs and SMAs, enabling comparative analysis across sequence models. (Evidence: Fig.4)
  2. Hardware-Friendly SSD Algorithms — “Matmul-Dominant with Complexity Guarantees”

    • Theorem 6.1: guarantees training FLOPs $O(TN^2)$, inference FLOPs $O(TN)$, and inference memory $O(N^2)$, dominated by GEMM ops. (Evidence: §6)
    • The diagonal/low-rank block decomposition allows compromise paths—simpler than a pure scan, and efficient even in a plain PyTorch implementation. (Evidence: §6)
  3. Empirical Performance — “2–8×, 2k crossover, ~6× at 16k”

    • On A100-80GB PCIe, SSD is 2–8× faster than Mamba fused scan, surpasses FlashAttn-2 beyond 2k tokens, and ~6× faster at 16k. (Evidence: §9.3/Fig.10)
    • Scaling state size N (64→256) barely affects SSD, while Mamba scan degrades linearly—showing stable scaling. (Evidence: Fig.10)
  4. Language Modeling & Pareto Efficiency

    • 2.7B params / 300B tokens (Pile): Mamba-2 vs. Transformer++ shows comparable perplexity/zero-shot, with Mamba-2-Attention hybrid outperforming both. (Evidence: §9.2/Table 3)
    • Claims Pareto dominance over Transformer baselines in perplexity–FLOPs–wallclock. (Evidence: §9)
  5. Challenging Memory Tasks (MQAR) — “Scaling N Improves Accuracy”

    • Mamba-2 ≫ Mamba-1/vanilla Attention across tasks, accuracy increases with N=16→64→256 (larger memory capacity effect). (Evidence: Fig.8, §9.1)
  6. System-Level Transferability — “TP/SP/Variable-Length”

    • TP-Friendly: Only one sync per block, GroupNorm to avoid extra comm. (Evidence: §8)
    • SP/Context Parallel: Distributes long-context training by passing only recurrent states. (Evidence: §8.2)
    • Variable-length training/inference without padding/realignment. (Evidence: §8)
  7. Compatibility with Attention Approximations

    • Absorbs kernel Attention approximations (Performer, etc.) and normalization into Mamba-2 hyperparameters, removing bottlenecks from softmax/position masks. (Evidence: §4.1.3, §7)

Summary (numbers-focused): On A100-80GB, 2–8× faster than Mamba-scan, overtakes FlashAttn-2 at ≥2k tokens, ~6× faster at 16k, while 2.7B/300B training shows Transformer++-level quality and Pareto superiority (ppl/FLOPs/wallclock). (Evidence: §9.3/Fig.10, §9.2)

Prompt 1.3.1 (Step-by-Step Algorithm Explanation)

"Explain the core algorithm, model architecture, or main methodology step by step. Assume the reader is a graduate-level student in AI. Use very simple examples (toy examples) such as a 3×3 pixel image, a small state space, or short sequences. Walk through each step to show how input transforms into output. Define all key terms and variables immediately when they appear."

Prompt 1.3.1 — Step-by-Step Algorithm Explanation (SSD ↔ Mamba-2)

Compressed Conclusion: SSD unifies SSMs and masked Attention as two algorithms computing the same function with different contraction orders (linear vs. quadratic). With block decomposition + low-rank structure, it achieves training $O(TN^2)$ FLOPs / inference $O(TN)$ FLOPs / memory $O(N^2)$ on a matmul-dominant path. This computation core is wrapped inside the Mamba-2 block (Conv → SSM → gating → normalization → output), which supports one all-reduce per layer under TP and sequence parallelism as-is. (Evidence: §5 summary; Thm 6.1; Fig.7)


A. Core Objects and Notation (Definitions)

  • Sequence length $T$ (tokens), head/feature dimension $P$, state dimension $N$, block size $Q$. (Evidence: §6)
  • SSM parameters: $A_t\in\mathbb{R}^{N\times N}$, $B_t,C_t\in\mathbb{R}^N$ (possibly time-dependent). (Evidence: Def.3.2)
  • SSS (Sequentially Semiseparable) representation: $$ M_{ji} = C_j^\top A_j\cdots A_{i+1} B_i,\quad y = Mx $$ —thus SSM ≡ semiseparable matrix representation. (Evidence: Eq.(3), Def.3.1/3.2)
  • SMA (Structured Masked Attention): contraction $M = QK^\top \circ L$; has both a quadratic mode (standard Attention) and a linear mode (kernel/scan). (Evidence: §4.3.1)
  • SSD (Structured State-Space Duality): establishes equivalence between scalar-identity SSM and 1-SS SMA, proving that SSM’s linear-time recurrence and SMA’s quadratic-time kernel contraction compute the same function (duality). (Evidence: §5.1–5.3)

Key: Semiseparable matrices guarantee that every sub-matrix contained in the lower-triangular (on-or-below-diagonal) part has rank ≤ N, so off-diagonal blocks are low-rank → enabling fast algorithms. (Evidence: Def.3.1; Lemma/Prop.3.x)
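To make the rank bound concrete, here is a small numerical check (a sketch, not paper code) on the 1-SS mask of the toy example, where the bound is $N=1$:

```python
import numpy as np

# 1-SS mask from the toy example: L[j, i] = a**(j-i) for j >= i, else 0
a, T = 0.5, 9
L = np.tril(a ** (np.arange(T)[:, None] - np.arange(T)[None, :]))

# Every block taken from on-or-below the diagonal has rank <= 1 (i.e. <= N with N = 1)
for j0 in range(T):                 # top row of the block
    for i1 in range(j0 + 1):        # rightmost column of the block (i1 <= j0)
        block = L[j0:, : i1 + 1]
        assert np.linalg.matrix_rank(block) <= 1
```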


B. Two Paths for the Same Function

B-1) Quadratic Path (Attention) — Standard SMA Contraction

  • Steps: $G = QK^\top$ → $M = G \circ L$ → $Y = MV$. (Evidence: Eq.(13))
  • Complexity: training $O(T^2P)$, state size $O(T)$ (KV-cache). (Evidence: comparison table)

B-2) Linear Path (SSM) — Scan/Recurrence

  • Scalar SSM (1-SS): recurrence in linear time: $$ h_t = a_t h_{t-1} + b_t x_t,\quad y_t = c_t h_t $$ (Evidence: §3.2.2)
  • Equivalence: The naive quadratic matrix multiplication of 1-SS SSM is identical to masked kernel Attention. (Evidence: §5.1 Eq.(16))
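A minimal NumPy sketch (random parameters chosen here for illustration; not the paper's implementation) of this equivalence for the SSD case $A_t = a_t I$ with vector-valued $B_t, C_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, P = 12, 4, 3                        # sequence length, state dim, head dim
a = rng.uniform(0.5, 1.0, size=T)         # per-step scalar decay (A_t = a_t * I)
B, C = rng.standard_normal((T, N)), rng.standard_normal((T, N))
X = rng.standard_normal((T, P))

# Linear path (recurrence): h_t = a_t h_{t-1} + B_t x_t^T,  y_t = C_t^T h_t
h = np.zeros((N, P))
Y_linear = np.zeros((T, P))
for t in range(T):
    h = a[t] * h + np.outer(B[t], X[t])
    Y_linear[t] = C[t] @ h

# Quadratic path (SMA): Y = (C B^T ∘ L) X, with L[t, s] = a_{s+1} ... a_t for t >= s
log_cum = np.cumsum(np.log(a))
L = np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))
Y_quadratic = ((C @ B.T) * L) @ X

assert np.allclose(Y_linear, Y_quadratic)
```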

C. SSD Block Algorithm (Core Steps)

Goal: Partition the semiseparable matrix $M$ into $Q \times Q$ blocks; compute diagonal blocks with matmul, off-diagonal blocks with low-rank outer products + short scans. (Evidence: Thm 6.1)

  1. Block decomposition: Split $M \in \mathbb{R}^{T\times T}$ into $(T/Q)\times(T/Q)$ blocks.

  2. Diagonal blocks $M_{bb}$:

    • Compute using the local mask $L$ and Gram matrices → implemented with batched matmul (BMM).
  3. Off-diagonal blocks $M_{b>b'}$: low-rank → 3-step pipeline:

    • Right: factor multiplication (BMM).
    • Center: short 1-SS scan (length $T/Q$).
    • Left: factor multiplication (BMM).
  4. Total cost (when $N=P=Q$): training $O(TN^2)$ FLOPs, inference $O(TN)$, dominated by matmul.

  5. Theorem 6.1: guarantees training $O(TN^2)$, inference $O(TN)$, inference memory $O(N^2)$, matmul-dominant.

Comparison: Attention $O(T^2N)$ vs. SSD/SSM $O(TN^2)$ FLOPs; state size $T$ vs. $N$.
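Below is a minimal sketch of this blocked computation for the scalar 1-SS case (NumPy; assumes $a_t > 0$). The paper's SSD kernel applies the same diagonal-block / low-rank-carry split to the full case with vector-valued $B_t, C_t$ and $N$-dimensional states:

```python
import numpy as np

def ssd_1ss_blocked(x, a, Q):
    # Blocked evaluation of y_t = sum_{s<=t} (a_{s+1}...a_t) x_s for a scalar 1-SS
    # operator: diagonal blocks via small masked matmuls, off-diagonal blocks via a
    # rank-1 carry propagated by a short scan over chunk-level states.
    T = len(x)
    assert T % Q == 0
    y = np.empty(T)
    h = 0.0                                        # state carried across chunk boundaries
    for c in range(T // Q):
        sl = slice(c * Q, (c + 1) * Q)
        xc, ac = x[sl], a[sl]
        lc = np.cumsum(np.log(ac))                 # log(a_{cQ} ... a_{cQ+k})
        L_local = np.tril(np.exp(lc[:, None] - lc[None, :]))     # intra-chunk 1-SS mask
        y[sl] = L_local @ xc + np.exp(lc) * h      # diagonal block + carried state
        h = np.exp(lc[-1]) * h + np.exp(lc[-1] - lc) @ xc        # chunk-level state update
    return y

# Check against the naive quadratic form
rng = np.random.default_rng(0)
T, Q = 16, 4
a = rng.uniform(0.5, 1.0, size=T)
x = rng.standard_normal(T)
lcum = np.cumsum(np.log(a))
L = np.tril(np.exp(lcum[:, None] - lcum[None, :]))
assert np.allclose(ssd_1ss_blocked(x, a, Q), L @ x)
```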


D. Toy Example — 3×3 Image → 9 Tokens, 1-SS SSM vs. SMA

D-0) Setup

  • Input image (intensity) $3\times3$ → row-major flatten: $x_{1:9} = [1,\dots,9]$.
  • Scalar SSM: $a=0.5$, $b_t=c_t=1$, hidden state $h_t \in \mathbb{R}$.
  • SMA mask: $L_{j,i} = a^{j-i}$ if $j\ge i$, else 0.

D-1) Linear Path (SSM Recurrence) — One Token at a Time

$$ h_t = 0.5\,h_{t-1} + x_t,\quad y_t = h_t,\quad h_0 = 0 $$

  • $t=1$: $h_1 = 1 \to y_1=1.0$
  • $t=2$: $h_2=2.5 \to y_2=2.5$
  • $t=3$: $h_3=4.25 \to y_3=4.25$
  • … → $y_9=16.0039$

D-2) Quadratic Path (SMA Matrix Multiplication)

$$ y_j = \sum_{i=1}^j a^{j-i} x_i $$

  • Example: $j=3$: $[a^2,a,1]\cdot[1,2,3] = 0.25+1+3=4.25$
  • Matches recurrence results exactly.

D-3) Block/Low-Rank Illustration (Q=3)

  • Example (off-diagonal block coupling inputs 1–3 to outputs 4–6): $u=[1,0.5,0.25]$ (i.e., $a^{j-4}$ for $j=4,5,6$), $v=[0.125,0.25,0.5]$ (i.e., $a^{4-i}$ for $i=1,2,3$).
  • Contribution = $u \cdot (v^\top x_{1:3}) = [2.125,1.0625,0.53125]$.
  • Shows off-diagonals handled by low-rank outer products, diagonals by local matmul → SSD block algorithm.
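The toy numbers above can be reproduced in a few lines (sketch only):

```python
import numpy as np

a = 0.5
x = np.arange(1.0, 10.0)                      # [1, 2, ..., 9]

# D-1 / D-2: both paths give the same outputs
L = np.tril(a ** (np.arange(9)[:, None] - np.arange(9)[None, :]))
y = L @ x
print(np.round(y, 4))                         # [1.  2.5  4.25  ...  16.0039]

# D-3: off-diagonal block (outputs 4-6 from inputs 1-3) as a rank-1 outer product
u = np.array([1.0, 0.5, 0.25])                # a**(j-4) for j = 4, 5, 6
v = np.array([0.125, 0.25, 0.5])              # a**(4-i) for i = 1, 2, 3
contrib = u * (v @ x[:3])                     # -> [2.125, 1.0625, 0.53125]
assert np.allclose(contrib, L[3:6, :3] @ x[:3])
```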

E. Mamba-2 Block — SSD Core in a Training/Serving-Friendly Layer

One head in Mamba-2:

  1. Projection: $x^{(i)}=uW^{(x)}_i$, $z^{(i)}=uW^{(z)}_i$.
  2. Local Conv1D: $x_c^{(i)}=\mathrm{conv1d}(x^{(i)})$.
  3. SSD/SSM transform: $y^{(i)}=\mathrm{SSM}(x_c^{(i)})$.
  4. Gating: $y_g^{(i)} = y^{(i)} \cdot \phi(z^{(i)})$.
  5. GroupNorm: TP-friendly normalization.
  6. Output projection + all-reduce once per layer.

Parallelism:

  • TP: input-parallel projection + GN → only 1 all-reduce/layer.
  • SP/CP: sequence split, only pass recurrent states across boundaries.
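A compact PyTorch sketch of one head following these steps. The projection shapes, the Δ → decay mapping, and the sequential reference recurrence used for the SSM step are illustrative assumptions, not the official Mamba-2 implementation (which uses the blocked SSD computation from section C above):

```python
import torch
import torch.nn.functional as F
from torch import nn

class Mamba2HeadSketch(nn.Module):
    """Single-head sketch: projection -> conv1d -> SSD -> gating -> GroupNorm -> out proj."""
    def __init__(self, d_model, d_head, d_state, d_conv=4):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_head + 2 * d_state + 1)   # x, z, B, C, dt
        self.conv = nn.Conv1d(d_head, d_head, d_conv, groups=d_head, padding=d_conv - 1)
        self.norm = nn.GroupNorm(1, d_head)        # TP-friendly extra normalization
        self.out_proj = nn.Linear(d_head, d_model)
        self.d_head, self.d_state = d_head, d_state

    def forward(self, u):                          # u: (batch, seqlen, d_model)
        b, t, _ = u.shape
        x, z, B, C, dt = self.in_proj(u).split(
            [self.d_head, self.d_head, self.d_state, self.d_state, 1], dim=-1)
        x = self.conv(x.transpose(1, 2))[..., :t].transpose(1, 2)   # causal local conv
        a = torch.exp(-F.softplus(dt)).squeeze(-1)                  # decay in (0, 1]; assumed mapping
        h = x.new_zeros(b, self.d_state, self.d_head)
        ys = []
        for s in range(t):                         # reference linear-time recurrence (A_t = a_t I)
            h = a[:, s, None, None] * h + B[:, s, :, None] * x[:, s, None, :]
            ys.append(torch.einsum('bn,bnp->bp', C[:, s], h))
        y = torch.stack(ys, dim=1) * F.silu(z)     # gating
        y = self.norm(y.transpose(1, 2)).transpose(1, 2)
        return self.out_proj(y)

layer = Mamba2HeadSketch(d_model=64, d_head=32, d_state=16)
out = layer(torch.randn(2, 10, 64))                # -> (2, 10, 64)
```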

F. Summary (Complexity & Paths)

| Algorithm | State | Training FLOPs | Inference FLOPs | Memory (naive) | Main Ops |
|---|---|---|---|---|---|
| Attention (SMA) | $T$ | $T^2N$ | $TN$ |  | Matmul |
| Linear SSM | $N$ | $TN^2$ | $TN^2$ |  | Scan + Matmul |
| SSD (Block) | $N$ | $TN^2$ | $TN$ |  | Matmul (+ short scan) |

Prompt 1.3.2 (Secret Ingredient)

"Pick one key component. Show Δ(metric) when removed/replaced/rescaled, in a table, and explain the mechanism behind the change."

Prompt 1.3.2 — Secret Ingredient: Mamba-2 Block’s “Parallel Projection + Extra Normalization (GroupNorm)”

Compressed conclusion: Computing the Δ, B, C projections in parallel directly from the block input u and inserting an extra GroupNorm takes Mamba-2 from ppl 11.76 to 11.49 (−0.27) with ~2.2% fewer params, while reducing TP communication from 2 all-reduces to 1 per block. This removes a structural bottleneck present in Mamba-1. (Evidence: §9.4 Tab.4, Fig.7/§8.1)


Ablation Table — Δ(metric)

| Block | Projections | Extra Norm | Params | Perplexity | Δ vs Baseline |
|---|---|---|---|---|---|
| Mamba-1 | Sequential | ✗ | 129.3 M | 11.76 | — |
| Mamba-1 | Sequential | ✓ | 129.3 M | 11.54 | −0.22 |
| Mamba-2 (var) | Parallel | ✗ | 126.5 M | 11.66 | −0.10 |
| Mamba-2 | Parallel | ✓ | 126.5 M | 11.49 | −0.27 |

(Evidence: Tab.4)


Mechanism

  1. Projection move (Sequential→Parallel): avoids extra all-reduce.
  2. GroupNorm: TP-friendly normalization, stabilizes state scaling, improves convergence (ppl −0.22~−0.27).
  3. Practical effect: reduces TP communication cost; supports SP/CP with state passing.
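A minimal sketch contrasting the two projection layouts (module names and dimensions are illustrative, not from the released code); the point is that the Mamba-2 layout produces every SSM parameter from one fused matmul on the block input u:

```python
import torch
from torch import nn

d_model, d_inner, d_state = 64, 128, 16
u = torch.randn(2, 10, d_model)

# Mamba-1 style (sequential): B, C, dt are projected from the inner activation x,
# i.e. they depend on the output of the first projection.
in_proj_x = nn.Linear(d_model, d_inner)
bcdt_proj = nn.Linear(d_inner, 2 * d_state + 1)
x_seq = in_proj_x(u)
B_seq, C_seq, dt_seq = bcdt_proj(x_seq).split([d_state, d_state, 1], dim=-1)

# Mamba-2 style (parallel): x, z, B, C, dt all come from one fused projection of the
# block input u, so one (shardable) matmul yields every SSM parameter at block start.
fused_proj = nn.Linear(d_model, 2 * d_inner + 2 * d_state + 1)
x_par, z_par, B_par, C_par, dt_par = fused_proj(u).split(
    [d_inner, d_inner, d_state, d_state, 1], dim=-1)
```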

One-line summary

Mamba-2’s parallel input-based projection + GroupNorm = better perplexity (−0.27), fewer params (−2.2%), 1 all-reduce instead of 2 → a true secret weapon.


Prompt 1.4.1 (Key Results Analysis)

Compressed conclusion: (i) Language modeling: Mamba-2 improves perplexity (6.66 vs 6.80) and matches Transformer++ at 2.7B/300B; hybrid (6 attention layers) further improves ppl 5.95. (ii) Efficiency: SSD core is 2–8× faster than Mamba scan, faster than FA-2 for T≥2k. (iii) Memory tasks (MQAR): Mamba-2 ≫ Mamba-1, scales with N. (iv) Scaling laws: Pareto dominance over Transformer++ in ppl–FLOPs–wallclock.

Benchmarks: Pile, zero-shot suite (LAMBADA, HellaSwag, PIQA, ARC-E/C, WinoGrande, OBQA), MQAR, efficiency on A100-80GB.


Prompt 1.4.2 (Critical Comparison)

Compressed conclusion:

  • Efficiency: SSD beats Mamba scan (2–8×) and FA-2 at ≥2k tokens.
  • Quality: Mamba-2 ≈ Transformer++ (2.7B/300B); hybrid (M2+Attn) surpasses both.
  • MQAR: strong gains over Mamba-1.
  • Weak spots: at 2.7B it only matches Transformer++ rather than beating it (a hybrid is needed to surpass it), and speed gains are limited at short contexts (~2k).

Prompt 1.5.1 (Limitations)

  • Short context inefficiency (~2k): SSD-only worse than Transformer (needs SSD+MLP mix).
  • Pure Mamba-2 ceiling: hybrid with Attention improves.
  • Speed advantage depends on long context (≥2k).
  • Evaluation limited: ≤2.8B params, mostly Pile/zero-shot.

Potential: struggles with copy/retrieval tasks, workloads dominated by short contexts, generalization to math/code/multimodal not shown.


Prompt 1.5.2 (Future Work)

  • Generalize SMA masks to design new efficient attentions.
  • Quantify hybrid ratios (SSM vs. Attention).
  • Interpretability transfer (attention sinks in SSMs).
  • Trade-off maps (state N vs. cache).
  • Identify drivers of MQAR gains.
  • Explore larger models (≥7B), more diverse benchmarks.

Additional Question: Compute, Memory, Dependencies

Conclusion: SSD enables matmul-dominant kernels; on A100-80GB PCIe, achieves 2–8× faster kernel time vs. Mamba scan, surpasses FA-2 for T≥2k. Pretraining: 2.7B params, Pile 300B tokens, batch 1M tokens (~300k steps). Exact tokens/s, CUDA version not given.

Dependencies: PyTorch, EleutherAI LM harness, GPT-NeoX/GPT-2 tokenizer. CUDA/Triton versions not specified.

Memory: SSD O(TN) vs Attention O(T²). Example: SSD state ~10 MB vs. Transformer KV-cache ~1.34 GB (4k ctx).

Throughput: kernel-level speedups (2–8× vs Mamba, faster than FA-2 beyond 2k). Model-level tokens/s not reported.

Training cost: 300B tokens, batch 1M = ~300k steps. FLOPs formula O(TN²), PF-days not reported.

Copyright Notice

Author: Jaehun Ryu

Link: https://jaehun.me/en/posts/paper-review-transformers-are-ssms-generalized-models-and-efficient-algorithms-through-structured-state-space-duality/

License: CC BY 4.0

This work is licensed under the Creative Commons Attribution 4.0 International License. You are free to use it for any purpose, including commercial use, as long as you provide proper attribution.
