[Paper Review] Marconi: Prefix Caching for the Era of Hybrid LLMs

Paper Link

Marconi: Prefix Caching for the Era of Hybrid LLMs

TL;DR

Marconi introduces a prefix-caching framework for hybrid LLM inference (Attention + SSM).
It selectively admits only highly reusable prefixes and applies FLOP-per-byte-aware eviction, yielding up to 34.4× higher token hit rates and up to 71.1 % (or 617 ms) lower P95 TTFT compared with enhanced baselines (vLLM+, SGLang+).
(Evidence: Abstract, Fig. 7–9, § 5.2)


Core Idea


Background — What Problem Does It Solve?

Transformers suffer severe serving inefficiency for long contexts due to
prefill cost O(L²) and KV memory O(L).
Hybrid models that interleave attention with State Space Model (SSM) layers, typically at ratios around Attn:SSM ≈ 1:6 to 1:10, mitigate this.
However, SSM states are fixed-size and updated in place, so a cached state can only be reused when the new prefix matches it exactly,
and traditional prefix-caching schemes (mostly LRU-based and KV-centric) lose their effectiveness (Fig. 1, § 2.1–2.2).

Example of the problem: In a 7 B hybrid model, a single 10 K-token input already occupies 17.4 GB of state memory,
3.3× larger than a Transformer baseline, causing cache thrashing and low hit rates (§ 3).


A New Approach — Marconi

Marconi is the first prefix-caching system designed for hybrid LLMs, combining
(i) reuse-potential–based admission and (ii) FLOP-efficiency–based eviction.
All states (KV and SSM) are managed within a single radix tree, ensuring precise and consistent state reuse (§ 4, Fig. 4).

How It Works — A Step-by-Step Example

Let’s illustrate the mechanism with a simple example.
Assume two requests arrive in order: “NYC is a busy city”, followed by “NYC is very huge”.
(Tokens are segmented by spaces; see Fig. 4. A minimal code sketch of this flow follows the list below.)

  1. Request 1 — Prefill & Decode

    • The radix tree creates a path for NYC is a busy city.
      Only the SSM state at the final token of each edge is checkpointed. (§ 4 / Fig. 4)
  2. Request 2 — Speculative Insertion

    • The new sequence NYC is very huge is virtually inserted into the tree.
      If a branch point appears, that node’s SSM state is checkpointed (Fig. 4 (b)).
  3. Exact-Match Hit & Cache Reuse

    • Because hybrid models permit only exact-match reuse (due to in-place SSM updates),
      if the prefix up to the branch point matches, the node’s (SSM, KV) pair is reused,
      skipping the redundant prefill (§ 1, Fig. 2).
  4. Eviction Example

    • When the cache fills, compute each node’s utility (S(n)).
      Nodes with low recency but high FLOP efficiency (typically long sequences) are retained,
      while short, low-value entries are evicted (§ 4.2, Fig. 5).
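To make the tree mechanics concrete, here is a minimal Python sketch of steps 1–3 (speculative insertion with a branch-point checkpoint). This is not the authors' implementation: the names (`RadixNode`, `ToyRadixTree`, `speculative_insert`) are hypothetical, real SSM/KV states are replaced by string placeholders, and a per-token trie stands in for a compressed radix tree.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RadixNode:
    children: Dict[str, "RadixNode"] = field(default_factory=dict)
    ssm_state: Optional[str] = None   # placeholder for a checkpointed SSM state
    kv_cache: Optional[str] = None    # placeholder for the corresponding KV entries
    prefix_len: int = 0               # number of tokens on the path to this node


class ToyRadixTree:
    """Per-token trie standing in for Marconi's compressed radix tree."""

    def __init__(self) -> None:
        self.root = RadixNode()

    def speculative_insert(self, tokens):
        """Walk the tree along `tokens`; the first divergence is the branch point.

        Returns (matched_prefix_len, leaf). The SSM state is checkpointed at the
        branch point, because that is the longest prefix both sequences share
        exactly (in-place SSM updates forbid partial reuse).
        """
        node, matched, diverged = self.root, 0, False
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                if not diverged and node is not self.root and node.ssm_state is None:
                    # Branch point: checkpoint the state of the shared prefix.
                    node.ssm_state = f"ssm_checkpoint@{node.prefix_len}"
                diverged = True
                child = RadixNode(prefix_len=node.prefix_len + 1)
                node.children[tok] = child
            else:
                matched = child.prefix_len   # exact-match hit: prefill for this prefix is skipped
            node = child
        return matched, node


# Request 1 builds the path; Request 2 reuses "NYC is" and triggers a checkpoint at "is".
tree = ToyRadixTree()
tree.speculative_insert("NYC is a busy city".split())
hit_len, _ = tree.speculative_insert("NYC is very huge".split())
print(hit_len)  # -> 2 tokens ("NYC is") reusable without recomputing their prefill
```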

Performance Verification — Key Results

Experimental Setup (Summary)
Hybrid 7 B model with layers {Attn = 4, SSM = 24, MLP = 28}, FP16 precision,
running on AWS p4d.24xlarge (A100-40 GB × 8).
For TTFT analysis, they also used Jamba-1.5-Mini (12 B active / 52 B total, state dim = 128, A100-40 GB × 4) (§ 5.1).

A. Headline Numbers


B. Summary Table — Equal Settings Comparison

| Benchmark | Metric | Baseline | Marconi | Δ vs Baseline |
|-----------|--------|----------|---------|---------------|
| LMSys | Token Hit Rate | vLLM+ | 4.5× | +4.5× (Fig. 7) |
| ShareGPT | Token Hit Rate | vLLM+ | 7.3× | +7.3× (Fig. 7) |
| SWEBench | Token Hit Rate | vLLM+ | 34.4× | +34.4× (Fig. 7) |
| LMSys | P95 TTFT | vLLM+ | 36.1 % ↓ (275.4 ms) | −36.1 % (Fig. 9) |
| ShareGPT | P95 TTFT | vLLM+ | 71.1 % ↓ (103.3 ms) | −71.1 % (Fig. 9) |
| SWEBench | P95 TTFT | vLLM+ | 46.8 % ↓ (617.0 ms) | −46.8 % (Fig. 9) |
| LMSys | P95 TTFT | SGLang+ | 17.2 % ↓ (131.1 ms) | −17.2 % (Fig. 9) |
| ShareGPT | P95 TTFT | SGLang+ | 12.8 % ↓ (18.5 ms) | −12.8 % (Fig. 9) |
| SWEBench | P95 TTFT | SGLang+ | 24.7 % ↓ (325.7 ms) | −24.7 % (Fig. 9) |

C. Detailed Observations


Our View — Strengths, Limitations, and Why It Matters

Strengths

  1. Accurate Problem Formulation — Marconi quantifies how the “exact-match restriction” and fixed-size SSM states break the assumptions behind LRU and size-based policies, formalizing this via Eqs. (1)–(2) and micro-benchmarks (§ 3–4, Fig. 5).

  2. Empirically Validated Policy Design — Combining Admission and Eviction policies boosts hit rate by up to 34.4× and reduces P95 TTFT by 617 ms on long-sequence workloads (Fig. 7–9).

  3. Trend Alignment — The higher the SSM ratio and state dimension, the larger the gain — well aligned with the direction of next-gen hybrid architectures (Fig. 12).


Limitations (Recognized and Observed)


Why It Matters

Hybrid (Attention + SSM) architectures are a practical path to supporting long contexts.
At its core, the caching problem is “which prefixes should be kept?”
By introducing FLOPs per byte as a first-principles metric, Marconi grounds cache decisions in the model’s physical workload properties.
This provides a bridge to future stack-level optimizations such as KV paging and prefix-driven attention acceleration (e.g., Hydragen) (§ 4–6).

What Comes Next? — The Road Ahead


Appendix — Experimental Notes (for Reproducibility)

In one sentence:
By redefining the twin axes of Admission (reuse probability) and Eviction (FLOPs per byte),
Marconi consistently achieves higher token hit rates and lower TTFT
in realistic hybrid serving environments dominated by long-context SSM workloads (Fig. 7–9, § 5.4).


Detailed LLM Q&A Analysis

Prompt 1.1.1 — Research Gap

Summary:
Due to the in-place updates of SSM states, conventional prefix caches can only reuse exact matches,
causing cache bloat and low reuse rates (§ 1).
Naive fine-grained checkpointing further fragments the cache: with 128-token blocks, reuse drops to 3.3 % while states occupy 17.4 GB per 10 K tokens in a 7 B hybrid (§ 3).
Marconi introduces the first hybrid-aware caching system combining
reuse-aware admission and FLOP-efficient eviction,
achieving up to 34.4× higher token hit rates and up to 71.1 % (or 617 ms) lower TTFT (Abstract, § 5).


1) State of the Art at Publication

Transformers suffer from O(L²) prefill costs and O(L) KV memory;
SSM layers offer O(L) compute and O(1) memory (§ 1, Fig. 1 c).
Hybrid architectures (Attn:SSM ≈ 1:6–1:10) emerged as an efficiency trade-off (Fig. 1 a).
However, existing prefix-sharing optimizations (e.g., Hydragen) target Transformers only,
and LRU-based caches (vLLM, SGLang) assume constant FLOP efficiency per KV state.
These fail for hybrids where SSM updates are in-place and states are massive (§ 1, § 3).


2) Defined Research Gaps

| Gap | Details | Why It’s Hard |
|-----|---------|----------------|
| Partial-reuse absence | SSM states update in place → only exact hits allowed | Transformer-centric cache assumptions break down (§ 1) |
| Over-checkpointing | Huge entry explosion → 3.3 % reuse / 17.4 GB per 10 K tokens | SSM dominates memory over KV (§ 3) |
| Policy criteria gap | LRU ignores FLOP/byte → short sequences kept, long ones evicted | SSM and KV have different FLOP/memory ratios (§ 4.2) |
| Heterogeneous state management | KV and SSM tracked separately → prefix consistency breaks | All layers must share identical prefix tokens (§ 4, Fig. 4) |

3) Core Idea to Bridge the Gap

  • Admission: Estimate reuse probability via two prefix types — Purely-Input vs Input+Output (§ 4.1).
  • Eviction: Score each entry by utility S(n) = recency(n) + α · FLOP-efficiency(n) (§ 4.2); a toy scoring sketch follows this list.
  • Unified Radix Tree: Store KV and SSM jointly and checkpoint only branch points (Fig. 4).
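As a complement, here is a minimal, hedged sketch of what such a FLOP-aware eviction pass could look like. The names (`CacheEntry`, `flops_saved`, `evict`), the FLOP proxy, and the normalization are assumptions made for illustration; § 4.2 of the paper defines the actual FLOP accounting and the weighting α.

```python
import time
from dataclasses import dataclass


@dataclass
class CacheEntry:
    prefix_len: int      # tokens covered by this cached prefix
    state_bytes: int     # bytes held by its SSM states + KV cache
    last_hit: float      # timestamp of the most recent reuse


def flops_saved(entry: CacheEntry) -> float:
    # Toy proxy: a hit skips the prefill for `prefix_len` tokens; attention prefill
    # grows roughly quadratically in prefix length. The paper accounts for the
    # actual per-layer attention and SSM FLOPs instead of this stand-in.
    return float(entry.prefix_len) ** 2


def evict(entries, bytes_needed, alpha=0.5, horizon=300.0):
    """Free at least `bytes_needed` bytes by dropping the lowest-utility entries."""
    if not entries:
        return []
    now = time.time()
    max_eff = max(flops_saved(e) / e.state_bytes for e in entries) or 1.0

    def utility(e: CacheEntry) -> float:
        recency = max(0.0, 1.0 - (now - e.last_hit) / horizon)   # 1 = just reused, 0 = stale
        flop_eff = (flops_saved(e) / e.state_bytes) / max_eff     # normalized FLOPs saved per byte
        return recency + alpha * flop_eff                         # S(n) from the list above

    freed, victims = 0, []
    for e in sorted(entries, key=utility):                        # lowest utility evicted first
        if freed >= bytes_needed:
            break
        victims.append(e)
        freed += e.state_bytes
    return victims
```

The contrast with plain LRU is visible in `utility()`: a long, compute-dense prefix keeps a high score even when it has not been touched recently, which is exactly the behavior described in the eviction example above.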

4) Empirical Validation

  • Token hit rate: vLLM+ baseline → 4.5× / 7.3× / 34.4× gain (Fig. 7).
  • P95 TTFT: up to 71.1 % (or 617 ms) lower vs vLLM+ and up to 24.7 % (325.7 ms) lower vs SGLang+ (Fig. 9).
  • LRU → FLOP-aware: +19 – 219 % hit rate, +90 % FLOP savings (Fig. 8, 10).

Hydragen accelerates prefix reuse in Transformers via matrix-matrix decomposition but ignores SSM’s in-place states.
Traditional LRU caches assume constant FLOP/byte, making them inefficient for hybrids (§ 4.2).
Marconi fills this gap by jointly optimizing admission and eviction across KV + SSM.


TL;DR of Gap Analysis

Hybrid LLMs’ in-place SSM updates invalidate partial-reuse and LRU assumptions.
Marconi’s reuse-aware admission and FLOP-aware eviction restore efficiency, achieving up to 34.4× higher hit rates and up to 71 % lower TTFT (§ 1, 4, 5).


Prompt 1.1.2 — Central Hypothesis

By combining reuse-based admission, FLOP/byte-efficient eviction, and unified KV + SSM management, Marconi overcomes the in-place SSM update limitation that prevents partial prefix reuse — achieving up to 34.4× higher hit rate and 71.1 % (617 ms) lower TTFT (Abstract).


Prompt 1.5.2 — Future Research Trajectory

The authors suggest several next steps (§ 6, § 7, Fig. 12–13):

  1. Generalization to All Recurrent Hybrids — Extend Marconi beyond Mamba-style SSMs to any recurrent layer with large, cyclic states.
  2. Integration with KV Paging — Combine non-paged SSM states with paged attention memory using a shared FLOP-aware utility score.
  3. Specialized Chunked-Prefill Kernels — Develop custom kernels to avoid two-pass prefill overheads.
  4. Load-Sensitive Policy Scheduling — Adapt admission/eviction weights α(t) online to workload intensity.
  5. Synergy with Hydragen — Combine Marconi’s “what to keep” policy with Hydragen’s “how to reuse efficiently,” which maintains < 15 % throughput loss for 16 K prefixes.

Reviewer-Proposed Extensions

| Observed Limitation | Future Direction | Expected Metric Impact |
|---------------------|------------------|------------------------|
| Heterogeneous paging (SSM vs KV) | Unify Marconi × PagedAttention using a global FLOP-aware utility | TTFT ↓, Hit Rate ↑, Page Faults ↓ |
| Immature chunked-prefill kernels | Develop dedicated one-pass prefill kernels | Prefill Latency ↓, Memory Δ ≈ 0 |
| Load contention sensitivity | Scheduler-linked adaptive α(t) | SLA P95 ↓, Recompute FLOPs ↓ |
| Unaccelerated attention costs | Combine with Hydragen (prefix-reuse acceleration) | Throughput ↑, $/1 M tok ↓ |
| Single-GPU scope | Cluster-level routing via “longest-prefix-first” | Cross-GPU Hit ↑, Network Traffic ↓ |
| Heuristic policy | Learning-based admission/eviction (bandit-style) | Hit ↑, Over-admit ↓ |

Key Quotes

“Chunked prefill … requires specialized kernels under development” (§ 6).
“We only evaluate Mamba/SSMs … but it can be extended” (§ 6).


In Short

The next research frontier lies in co-evolving three axes:
(1) Policy — what to retain, (2) Kernel / Paging — how to store, and (3) Attention Acceleration — how to reuse fast.
Together, these can push hybrid LLM serving toward truly compute-normalized efficiency (§ 4–6; Hydragen § 3–4).

Copyright Notice

Author: Jaehun Ryu

Link: https://jaehun.me/en/posts/paper-review-marconi-prefix-caching-for-the-era-of-hybrid-llms/

License: CC BY 4.0

This work is licensed under the Creative Commons Attribution 4.0 International License. You are free to use it for any purpose, including commercial use, as long as you provide proper attribution.
