Marconi: Rethinking Prefix Caching for the Hybrid LLM Era
TL;DR
Marconi introduces a prefix-caching framework for hybrid LLM inference (Attention + SSM).
It selectively admits only highly reusable prefixes and applies FLOP-per-byte–aware eviction, yielding up to 34.4× higher token hit rates and up to 71.1 % lower P95 TTFT (savings of up to 617 ms) compared with enhanced baselines (vLLM+, SGLang+).
(Evidence: Abstract, Fig. 7–9, § 5.2)
Core Idea
Underlying Problem — “Exact-Match Only” Limitation:
In hybrid LLMs, the SSM states are updated in place, so partial prefix reuse is impossible.
Only exact matches can hit the cache, causing state explosion and poor reuse rates (§ 1).
Solution Strategy:
- Admission Policy: Classify prefix reuse scenarios into (a) Purely Input and (b) Input + Output, and checkpoint only prefixes with high reuse potential (§ 4.1).
- Eviction Policy: Score each entry by FLOP efficiency (saved FLOPs / state memory), then combine it with recency as S = recency + α · FLOP efficiency to determine which entries to evict (§ 4.2, Eq. 1–2).
Background — What Problem Does It Solve?
Transformers suffer severe serving inefficiency for long contexts due to
prefill cost O(L²) and KV memory O(L).
Hybrid models that interleave attention with State Space Model (SSM) layers, typically at ratios around Attn:SSM ≈ 1:6–1:10, mitigate this;
but because SSM states are fixed-size and reusable only on an exact prefix match,
traditional prefix-caching schemes (mostly LRU- and KV-centric) lose their effectiveness (Fig. 1, § 2.1–2.2).
Example of the problem: In a 7 B hybrid model, a single 10 K-token input already occupies 17.4 GB of state memory,
3.3× larger than a Transformer baseline, causing cache thrashing and low hit rates (§ 3).
A New Approach — Marconi
Marconi is the first prefix-caching system designed for hybrid LLMs, combining
(i) reuse-potential–based admission and (ii) FLOP-efficiency–based eviction.
All states (KV and SSM) are managed within a single radix tree, ensuring precise and consistent state reuse (§ 4, Fig. 4).
Admission Policy
- Classification: Distinguish Purely-Input prefixes (e.g., system prompts or few-shot examples) from Input + Output prefixes (e.g., dialog histories) (§ 4.1).
- Rule: Checkpoint only the SSM state at the final token of each edge (the minimal coverage for maximal reuse) and, for interactive sessions, preserve the state after the last decoded token (Fig. 4, § 4.1); a minimal sketch follows below.
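To make the rule concrete, here is a minimal Python sketch of the admission decision. The names (PrefixKind, plan_ssm_checkpoints) and the argument shapes are illustrative assumptions, not Marconi's actual API.

```python
from enum import Enum, auto

class PrefixKind(Enum):
    PURELY_INPUT = auto()   # e.g., shared system prompts or few-shot examples
    INPUT_OUTPUT = auto()   # e.g., a dialog history that the same session will extend

def plan_ssm_checkpoints(edge_final_positions, last_decoded_position, kind):
    """Return the token positions whose SSM state should be checkpointed.

    edge_final_positions: final-token positions of the radix-tree edges the request
        traverses (including any branch point revealed by speculative insertion).
    last_decoded_position: index of the last decoded token, or None for non-interactive jobs.
    """
    positions = set(edge_final_positions)            # minimal coverage for maximal reuse
    if kind is PrefixKind.INPUT_OUTPUT and last_decoded_position is not None:
        positions.add(last_decoded_position)         # the follow-up turn resumes from here
    return sorted(positions)

# Example: a dialog request whose traversed edges end at tokens 5 and 12,
# with decoding finishing at token 40.
print(plan_ssm_checkpoints([5, 12], 40, PrefixKind.INPUT_OUTPUT))   # [5, 12, 40]
```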
Eviction Policy
- FLOP Efficiency (Eq. 1):

$$
\text{FLOP-eff}(n) = \frac{\text{Saved FLOPs (across all layers)}}{\text{State Memory (SSM + KV)}}
$$

- Utility Score (Eq. 2):

$$
S(n) = \text{recency}(n) + \alpha \cdot \text{FLOP-eff}(n)
$$

When α = 0, Marconi reduces to plain LRU.
The authors replay workload snapshots and use a grid search to select α automatically (§ 4.2).
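Below is a minimal Python sketch of this scoring rule (Eqs. 1–2), not the authors' implementation; the CacheNode fields and the assumption that recency is pre-normalized to a scale comparable with FLOP efficiency are mine.

```python
from dataclasses import dataclass

@dataclass
class CacheNode:
    saved_flops: float   # FLOPs skipped (across all layers) when this prefix hits
    state_bytes: float   # size of the checkpointed SSM + KV state
    recency: float       # assumed normalized recency score (1.0 = most recently used)

def flop_efficiency(node: CacheNode) -> float:
    """Eq. (1): saved FLOPs per byte of cached state."""
    return node.saved_flops / node.state_bytes

def utility(node: CacheNode, alpha: float) -> float:
    """Eq. (2): alpha = 0 reduces to plain LRU; alpha is tuned offline by replaying traces."""
    return node.recency + alpha * flop_efficiency(node)

def eviction_order(nodes, alpha: float):
    """When the cache is full, nodes with the lowest utility are evicted first."""
    return sorted(nodes, key=lambda n: utility(n, alpha))
```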
How It Works — A Step-by-Step Example
Let’s illustrate the mechanism using a simple example.
Assume two requests arrive in sequence:
NYC is a busy city, followed by NYC is very huge.
(Tokens are segmented by spaces; see Fig. 4.)
Request 1 — Prefill & Decode
- The radix tree creates a path for NYC is a busy city. Only the SSM state at the final token of each edge is checkpointed (§ 4, Fig. 4).
Request 2 — Speculative Insertion
- The new sequence NYC is very huge is virtually inserted into the tree. If a branch point appears, that node's SSM state is checkpointed (Fig. 4b).
Exact-Match Hit & Cache Reuse
- Because hybrid models permit only exact-match reuse (due to in-place SSM updates), if the prefix up to the branch point matches, the node's (SSM, KV) pair is reused, skipping the redundant prefill (§ 1, Fig. 2); a toy sketch follows below.
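Putting the three steps together, here is a toy word-level radix tree in Python. The class layout, the checkpoint comments, and the hypothetical third request used for the lookup ("NYC is quite small") are illustrative assumptions, not Marconi's actual data structures.

```python
class Node:
    def __init__(self):
        self.children = {}      # first token of an outgoing edge -> (edge tokens, child Node)
        self.ssm_state = None   # checkpointed SSM state at the edge's final token
        self.kv_state = None    # cached KV entries covering the edge's tokens

class RadixTree:
    def __init__(self):
        self.root = Node()

    def insert(self, tokens):
        """Insert a sequence, splitting an edge wherever it diverges (a branch point)."""
        node, i = self.root, 0
        while i < len(tokens):
            key = tokens[i]
            if key not in node.children:
                leaf = Node()
                node.children[key] = (tokens[i:], leaf)
                return leaf                                    # new edge: checkpoint its final token
            edge, child = node.children[key]
            j = 0
            while j < len(edge) and i + j < len(tokens) and edge[j] == tokens[i + j]:
                j += 1
            if j < len(edge):                                  # diverged (or ended) mid-edge: split it
                mid = Node()
                node.children[key] = (edge[:j], mid)
                mid.children[edge[j]] = (edge[j:], child)
                if i + j < len(tokens):
                    leaf = Node()
                    mid.children[tokens[i + j]] = (tokens[i + j:], leaf)
                return mid                                     # branch point: checkpoint its SSM state
            node, i = child, i + j
        return node

    def longest_prefix(self, tokens):
        """Deepest node whose path is an exact prefix of `tokens` (hybrid reuse is exact-match only)."""
        node, matched = self.root, 0
        while matched < len(tokens) and tokens[matched] in node.children:
            edge, child = node.children[tokens[matched]]
            if tokens[matched:matched + len(edge)] != edge:
                break                                          # partial edge overlap cannot be reused
            node, matched = child, matched + len(edge)
        return node, matched                                   # reuse node.ssm_state / node.kv_state

tree = RadixTree()
tree.insert("NYC is a busy city".split())                      # Request 1: prefill, then checkpoint
branch = tree.insert("NYC is very huge".split())               # Request 2: splits the edge after "NYC is"
node, matched = tree.longest_prefix("NYC is quite small".split())  # hypothetical third request
print(matched)  # 2 -> the state checkpointed at the "NYC is" branch node can skip that part of prefill
```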
Eviction Example
- When the cache fills, compute each node's utility S(n). Nodes with low recency but high FLOP efficiency (typically long sequences) are retained, while short, low-value entries are evicted (§ 4.2, Fig. 5); a numeric illustration follows below.
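As a hypothetical numeric illustration of this trade-off (the values below are made up and assumed normalized to [0, 1]; they are not taken from the paper):

```python
# Hypothetical, normalized values: recency and FLOP efficiency both in [0, 1].
short_recent = {"recency": 0.9, "flop_eff": 0.1}   # small prefix, touched moments ago
long_stale   = {"recency": 0.4, "flop_eff": 0.8}   # long prefix, not seen for a while

for alpha in (0.0, 1.0):                           # Eq. (2): S = recency + alpha * FLOP-eff
    for name, node in (("short_recent", short_recent), ("long_stale", long_stale)):
        print(alpha, name, node["recency"] + alpha * node["flop_eff"])
# alpha = 0.0 -> 0.9 vs 0.4: pure LRU would evict the long, stale prefix first
# alpha = 1.0 -> 1.0 vs 1.2: FLOP-aware scoring keeps the long prefix and evicts the short one
```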
Performance Verification — Key Results
Experimental Setup (Summary)
Hybrid 7 B model with layers {Attn = 4, SSM = 24, MLP = 28}, FP16 precision,
running on AWS p4d.24xlarge (A100-40 GB × 8).
For TTFT analysis, they also used Jamba-1.5-Mini (12 B active / 52 B total, state dim = 128, A100-40 GB × 4) (§ 5.1).
A. Headline Numbers
- Token Hit Rate: Compared with vLLM+, Marconi achieves 4.5× / 7.3× / 34.4× higher rates on LMSys / ShareGPT / SWEBench (Fig. 7).
- P95 TTFT Reduction (vs vLLM+): 36.1 % / 71.1 % / 46.8 %, corresponding to −275.4 ms / −103.3 ms / −617.0 ms (Fig. 9).
- Token Hit Rate Gain (vs SGLang+): +45.6 % / +19.0 % / +219.7 % on LMSys / ShareGPT / SWEBench (Fig. 8).
B. Summary Table — Equal Settings Comparison
| Benchmark | Metric | Baseline | Marconi vs Baseline |
|---|---|---|---|
| LMSys | Token Hit Rate | vLLM+ | 4.5× higher (Fig. 7) |
| ShareGPT | Token Hit Rate | vLLM+ | 7.3× higher (Fig. 7) |
| SWEBench | Token Hit Rate | vLLM+ | 34.4× higher (Fig. 7) |
| LMSys | P95 TTFT | vLLM+ | −36.1 % (−275.4 ms) (Fig. 9) |
| ShareGPT | P95 TTFT | vLLM+ | −71.1 % (−103.3 ms) (Fig. 9) |
| SWEBench | P95 TTFT | vLLM+ | −46.8 % (−617.0 ms) (Fig. 9) |
| LMSys | P95 TTFT | SGLang+ | −17.2 % (−131.1 ms) (Fig. 9) |
| ShareGPT | P95 TTFT | SGLang+ | −12.8 % (−18.5 ms) (Fig. 9) |
| SWEBench | P95 TTFT | SGLang+ | −24.7 % (−325.7 ms) (Fig. 9) |
C. Detailed Observations
Length-Dependent Gains:
For short requests (< 7 K tokens), the hit rate drops slightly (up to 3.0 pp) while P5 TTFT rises by only +2.1 ms.
For long requests (≥ 7 K tokens), the hit rate improves by +25.5 pp and FLOP savings reach +90.3 % (Fig. 10).
Cache Contention (60 → 140 GB):
Marconi's advantage peaks under moderate contention, with improvements of 24.3 / 51.5 / 68.3 / 30.0 / 10.0 % across cache sizes (Fig. 11).
Architecture Scaling:
As the Attn:SSM ratio grows from 1:2 to 1:8, Marconi's gain over vLLM+ / SGLang+ increases from 13.5 % / 5.8 % to 2.6× / 59.7 %.
In pure Transformer models, all systems perform identically (Fig. 12a).
Our View — Strengths, Limitations, and Why It Matters
Strengths
- Accurate Problem Formulation: Marconi quantifies how the “exact-match restriction” and fixed-size SSM states break the assumptions behind LRU and size-based policies, formalizing this via Eqs. (1)–(2) and micro-benchmarks (§ 3–4, Fig. 5).
- Empirically Validated Policy Design: Combining the admission and eviction policies boosts the hit rate by up to 34.4× and reduces P95 TTFT by up to 617 ms on long-sequence workloads (Fig. 7–9).
- Trend Alignment: The higher the SSM ratio and state dimension, the larger the gain, which aligns well with the direction of next-gen hybrid architectures (Fig. 12).
Limitations (Recognized and Observed)
- Short Requests (< 7 K): the hit rate drops by up to 3 pp and P5 TTFT rises by +2.1 ms, though P50/P95 still improve (Fig. 10).
- Contention Extremes: At high (60 GB) or low (140 GB) cache sizes, Marconi’s advantage shrinks since either useful prefixes cannot fit or eviction decisions rarely matter (Fig. 11).
- Pure Transformers: All methods perform identically (Fig. 12a).
- System Dependency: Chunked prefill for hybrid models still requires specialized kernels under development in most frameworks, and KV attention still depends on paging management (§ 6).
Why It Matters
Hybrid architectures are a practical path to supporting long contexts (Attention + SSM).
At its core, the problem is “which prefixes should be kept?”
By introducing FLOP-per-byte as a first-principles metric, Marconi normalizes cache decisions in terms of the model’s physical workload properties.
This provides a bridge to future stack-level optimizations such as KV paging and prefix-driven attention acceleration (e.g., Hydragen) (§ 4–6).
What Comes Next? — The Road Ahead
KV Paging × FLOP-aware Integration
Currently, SSM states are non-paged, while KV attention still requires paging.
By extending Marconi's FLOP-efficiency scoring to page replacement, both can share a unified global utility function for consistent optimization (§ 6).
Dissemination of Chunked-Prefill Kernels
To eliminate the overhead of 2-pass or approximate checkpointing, the authors call for widespread adoption of specialized kernels in mainstream frameworks (§ 6).
Load-Adaptive Policy Tuning
Because Marconi's benefit peaks under medium contention, online adaptation of α(t) based on the session arrival rate could maximize hit-rate efficiency (Fig. 11 & 13).
Synergy with Attention Accelerators
Prefix-sharing systems like Hydragen keep throughput losses below 15 % even for 16 K-token prefixes, suggesting a natural synergy:
Marconi decides what to keep; Hydragen decides how to use it fast (Hydragen Abstract & § 4).
Appendix — Experimental Notes (for Reproducibility)
- Datasets / Traces: LMSys, ShareGPT, SWEBench — each comprising multi-turn sessions with variable request intervals to mimic real-world latency (§ 5.1).
- Models / Hardware: 7 B Hybrid {Attn 4 | SSM 24 | MLP 28}, Jamba-1.5-Mini (12 B active / 52 B total, state dim = 128), FP16 on A100-40 GB (§ 5.1).
- Baselines: vLLM and SGLang lack native hybrid support,
so the authors built hybrid-compatible versions (vLLM+, SGLang+) for fair comparison (§ 5.1).
In one sentence:
By redefining the twin axes of Admission (reuse probability) and Eviction (FLOPs per byte),
Marconi consistently achieves higher hit rates and lower TTFT
in realistic hybrid serving environments dominated by long-context SSM workloads (Fig. 7–9, § 5.4).
Detailed LLM Q&A Analysis
Prompt 1.1.1 — Research Gap
Summary:
Due to the in-place updates of SSM states, conventional prefix caches can only reuse exact matches,
causing cache bloat and low reuse rates (§ 1).
Fine-grained checkpointing further fragments the cache: reuse drops to 3.3 % (block = 128) and state memory grows to 17.4 GB per 10 K tokens in 7 B hybrids (§ 3).
Marconi introduces the first hybrid-aware caching system combining
reuse-aware admission and FLOP-efficient eviction,
achieving up to 34.4× higher token hit rates and up to 71.1 % (617 ms) lower P95 TTFT (Abstract, § 5).
1) State of the Art at Publication
Transformers suffer from O(L²) prefill costs and O(L) KV memory;
SSM layers offer O(L) compute and O(1) memory (§ 1, Fig. 1 c).
Hybrid architectures (Attn:SSM ≈ 1:6–1:10) emerged as an efficiency trade-off (Fig. 1 a).
However, existing prefix-sharing optimizations (e.g., Hydragen) target Transformers only,
and LRU-based caches (vLLM, SGLang) assume constant FLOP efficiency per KV state.
These fail for hybrids where SSM updates are in-place and states are massive (§ 1, § 3).
2) Defined Research Gaps
| Gap | Details | Why It’s Hard |
|---|---|---|
| Partial-reuse absence | SSM states update in place → only exact hits allowed | Transformer-centric cache assumptions break down (§ 1) |
| Over-checkpointing | Huge entry explosion → 3.3 % reuse / 17.4 GB per 10 K tokens | SSM dominates memory over KV (§ 3) |
| Policy criteria gap | LRU ignores FLOP/byte → short sequences kept, long ones evicted | SSM and KV have different FLOP/memory ratios (§ 4.2) |
| Heterogeneous state management | KV and SSM tracked separately → prefix consistency breaks | All layers must share identical prefix tokens (§ 4, Fig. 4) |
3) Core Idea to Bridge the Gap
- Admission: Estimate reuse probability via two prefix types — Purely-Input vs Input+Output (§ 4.1).
- Eviction: Score each entry by the utility $S(n) = \text{recency}(n) + \alpha \cdot \text{FLOP-eff}(n)$ (§ 4.2).
- Unified Radix Tree: Store KV and SSM jointly and checkpoint only branch points (Fig. 4).
4) Empirical Validation
- Token hit rate: vLLM+ baseline → 4.5× / 7.3× / 34.4× gain (Fig. 7).
- P95 TTFT: up to −71.1 % (617 ms) vs vLLM+, −24.7 % (325 ms) vs SGLang+ (Fig. 9).
- LRU → FLOP-aware: +19 – 219 % hit rate, +90 % FLOP savings (Fig. 8, 10).
5) Related Work Summary
Hydragen accelerates prefix reuse in Transformers via matrix-matrix decomposition but ignores SSM’s in-place states.
Traditional LRU caches assume constant FLOP/byte, making them inefficient for hybrids (§ 4.2).
Marconi fills this gap by jointly optimizing admission and eviction across KV + SSM.
TL;DR of Gap Analysis
Hybrid LLMs’ in-place SSM updates invalidate partial-reuse and LRU assumptions.
Marconi’s reuse-aware admission and FLOP-based eviction restore efficiency, achieving up to 34.4× hit-rate and 71 % TTFT reduction (§ 1, 4, 5).
Prompt 1.1.2 — Central Hypothesis
By combining reuse-based admission, FLOP/byte-efficient eviction, and unified KV + SSM management, Marconi overcomes the in-place SSM update limitation that prevents partial prefix reuse — achieving up to 34.4× higher hit rate and 71.1 % (617 ms) lower TTFT (Abstract).
Prompt 1.5.2 — Future Research Trajectory
The authors suggest several next steps (§ 6, § 7, Fig. 12–13):
- Generalization to All Recurrent Hybrids — Extend Marconi beyond Mamba-style SSMs to any recurrent layer with large, cyclic states.
- Integration with KV Paging — Combine non-paged SSM states with paged attention memory using a shared FLOP-aware utility score.
- Specialized Chunked-Prefill Kernels — Develop custom kernels to avoid two-pass prefill overheads.
- Load-Sensitive Policy Scheduling — Adapt admission/eviction weights α(t) online to workload intensity.
- Synergy with Hydragen — Combine Marconi’s “what to keep” policy with Hydragen’s “how to reuse efficiently,” which maintains < 15 % throughput loss for 16 K prefixes.
Reviewer-Proposed Extensions
| Observed Limitation | Future Direction | Expected Metric Impact |
|---|---|---|
| Heterogeneous paging (SSM vs KV) | Unify Marconi × PagedAttention using global FLOP-aware utility | TTFT↓, Hit Rate↑, Page Faults↓ |
| Immature chunked prefill kernels | Develop dedicated 1-pass prefill kernels | Prefill Latency↓, Memory Δ≈0 |
| Load contention sensitivity | Scheduler-linked adaptive α(t) | SLA P95↓, Recompute FLOPs↓ |
| Unaccelerated Attention costs | Combine with Hydragen (prefix reuse acceleration) | Throughput↑, $/1 M tok ↓ |
| Single-GPU scope | Cluster-level routing via “longest-prefix-first” | Cross-GPU Hit ↑, Network Traffic ↓ |
| Heuristic policy | Learning-based admission/eviction (bandit style) | Hit ↑, Over-admit ↓ |
Key Quotes
“Chunked prefill … requires specialized kernels under development” (§ 6).
“We only evaluate Mamba/SSMs … but it can be extended” (§ 6).
In Short
The next research frontier lies in co-evolving three axes:
(1) Policy — what to retain, (2) Kernel / Paging — how to store, and (3) Attention Acceleration — how to reuse fast.
Together, these can push hybrid LLM serving toward truly compute-normalized efficiency (§ 4–6; Hydragen § 3–4).
![[Paper Review] Marconi: Prefix Caching for the Era of Hybrid LLMs](https://pbs.twimg.com/media/GdyLXO9W4AADox0.jpg)