Marconi: Rethinking Prefix Caching for the Hybrid LLM Era
TL;DR
Marconi introduces a prefix-caching framework for hybrid LLM inference (Attention + SSM).
It selectively admits only highly reusable prefixes and applies FLOP-per-byte–aware eviction, yielding up to 34.4× higher token hit rates and up to 71.1 % lower P95 TTFT (savings of up to 617 ms) compared with enhanced baselines (vLLM+, SGLang+).
(Evidence: Abstract, Fig. 7–9, § 5.2)
Core Idea
Underlying Problem — “Exact-Match Only” Limitation:
In hybrid LLMs, the SSM states are updated in place, so partial prefix reuse is impossible.
Only exact matches can hit the cache, causing state explosion and poor reuse rates (§ 1).
Solution Strategy:
- Admission Policy: Classify prefix reuse scenarios into (a) Purely Input and (b) Input + Output, and checkpoint only prefixes with high reuse potential (§ 4.1).
- Eviction Policy: Score each entry by FLOP efficiency (saved FLOPs / state memory), then combine it with recency as S = recency + α · FLOP efficiency to determine which entries to evict (§ 4.2, Eq. 1–2).
Background — What Problem Does It Solve?
Transformers suffer severe serving inefficiency for long contexts due to
prefill cost O(L²) and KV memory O(L).
Hybrid models that interleave attention with State Space Model (SSM) layers, typically at ratios around Attn:SSM ≈ 1:6–1:10, mitigate this;
but because SSM states are fixed-size and reusable only on an exact prefix match,
traditional prefix-caching schemes (mostly LRU- and KV-centric) lose their effectiveness (Fig. 1, § 2.1–2.2).
Example of the problem: In a 7 B hybrid model, a single 10 K-token input already occupies 17.4 GB of state memory,
3.3× larger than a Transformer baseline, causing cache thrashing and low hit rates (§ 3).
A New Approach — Marconi
Marconi is the first prefix-caching system designed for hybrid LLMs, combining
(i) reuse-potential–based admission and (ii) FLOP-efficiency–based eviction.
All states (KV and SSM) are managed within a single radix tree, ensuring precise and consistent state reuse (§ 4, Fig. 4).
Admission Policy
- Classification: Distinguish Purely-Input prefixes (e.g., system prompts or few-shot examples) from Input + Output prefixes (e.g., dialog histories) (§ 4.1).
- Rule: Checkpoint only the SSM state at the final token of each edge (the minimal coverage for maximal reuse) and, for interactive sessions, preserve the state after the last decoded token (Fig. 4, § 4.1); a minimal sketch follows below.
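To make the rule concrete, here is a minimal Python sketch of the admission decision. The names (PrefixKind, plan_ssm_checkpoints) and the argument shapes are illustrative assumptions, not Marconi's actual API.

```python
from enum import Enum, auto

class PrefixKind(Enum):
    PURELY_INPUT = auto()   # e.g., shared system prompts or few-shot examples
    INPUT_OUTPUT = auto()   # e.g., a dialog history that the same session will extend

def plan_ssm_checkpoints(edge_final_positions, last_decoded_position, kind):
    """Return the token positions whose SSM state should be checkpointed.

    edge_final_positions: final-token positions of the radix-tree edges the request
        traverses (including any branch point revealed by speculative insertion).
    last_decoded_position: index of the last decoded token, or None for non-interactive jobs.
    """
    positions = set(edge_final_positions)            # minimal coverage for maximal reuse
    if kind is PrefixKind.INPUT_OUTPUT and last_decoded_position is not None:
        positions.add(last_decoded_position)         # the follow-up turn resumes from here
    return sorted(positions)

# Example: a dialog request whose traversed edges end at tokens 5 and 12,
# with decoding finishing at token 40.
print(plan_ssm_checkpoints([5, 12], 40, PrefixKind.INPUT_OUTPUT))   # [5, 12, 40]
```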
Eviction Policy
- FLOP Efficiency (Eq. 1):

$$
\text{FLOP-eff}(n) = \frac{\text{Saved FLOPs (across all layers)}}{\text{State Memory (SSM + KV)}}
$$

- Utility Score (Eq. 2):

$$
S(n) = \text{recency}(n) + \alpha \cdot \text{FLOP-eff}(n)
$$

When α = 0, Marconi reduces to plain LRU.
The authors replay workload snapshots and use a grid search to select α automatically (§ 4.2).
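Below is a minimal Python sketch of this scoring rule (Eqs. 1–2), not the authors' implementation; the CacheNode fields and the assumption that recency is pre-normalized to a scale comparable with FLOP efficiency are mine.

```python
from dataclasses import dataclass

@dataclass
class CacheNode:
    saved_flops: float   # FLOPs skipped (across all layers) when this prefix hits
    state_bytes: float   # size of the checkpointed SSM + KV state
    recency: float       # assumed normalized recency score (1.0 = most recently used)

def flop_efficiency(node: CacheNode) -> float:
    """Eq. (1): saved FLOPs per byte of cached state."""
    return node.saved_flops / node.state_bytes

def utility(node: CacheNode, alpha: float) -> float:
    """Eq. (2): alpha = 0 reduces to plain LRU; alpha is tuned offline by replaying traces."""
    return node.recency + alpha * flop_efficiency(node)

def eviction_order(nodes, alpha: float):
    """When the cache is full, nodes with the lowest utility are evicted first."""
    return sorted(nodes, key=lambda n: utility(n, alpha))
```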
How It Works — A Step-by-Step Example
Let’s illustrate the mechanism using a simple example.
Assume two requests arrive in sequence:
NYC is a busy city, followed by NYC is very huge.
(Tokens are segmented by spaces; see Fig. 4.)
Request 1 — Prefill & Decode
- The radix tree creates a path for NYC is a busy city. Only the SSM state at the final token of each edge is checkpointed (§ 4, Fig. 4).
Request 2 — Speculative Insertion
- The new sequence NYC is very huge is virtually inserted into the tree. If a branch point appears, that node's SSM state is checkpointed (Fig. 4b).
Exact-Match Hit & Cache Reuse
- Because hybrid models permit only exact-match reuse (due to in-place SSM updates), if the prefix up to the branch point matches, the node's (SSM, KV) pair is reused, skipping the redundant prefill (§ 1, Fig. 2); a toy sketch follows below.
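Putting the three steps together, here is a toy word-level radix tree in Python. The class layout, the checkpoint comments, and the hypothetical third request used for the lookup ("NYC is quite small") are illustrative assumptions, not Marconi's actual data structures.

```python
class Node:
    def __init__(self):
        self.children = {}      # first token of an outgoing edge -> (edge tokens, child Node)
        self.ssm_state = None   # checkpointed SSM state at the edge's final token
        self.kv_state = None    # cached KV entries covering the edge's tokens

class RadixTree:
    def __init__(self):
        self.root = Node()

    def insert(self, tokens):
        """Insert a sequence, splitting an edge wherever it diverges (a branch point)."""
        node, i = self.root, 0
        while i < len(tokens):
            key = tokens[i]
            if key not in node.children:
                leaf = Node()
                node.children[key] = (tokens[i:], leaf)
                return leaf                                    # new edge: checkpoint its final token
            edge, child = node.children[key]
            j = 0
            while j < len(edge) and i + j < len(tokens) and edge[j] == tokens[i + j]:
                j += 1
            if j < len(edge):                                  # diverged (or ended) mid-edge: split it
                mid = Node()
                node.children[key] = (edge[:j], mid)
                mid.children[edge[j]] = (edge[j:], child)
                if i + j < len(tokens):
                    leaf = Node()
                    mid.children[tokens[i + j]] = (tokens[i + j:], leaf)
                return mid                                     # branch point: checkpoint its SSM state
            node, i = child, i + j
        return node

    def longest_prefix(self, tokens):
        """Deepest node whose path is an exact prefix of `tokens` (hybrid reuse is exact-match only)."""
        node, matched = self.root, 0
        while matched < len(tokens) and tokens[matched] in node.children:
            edge, child = node.children[tokens[matched]]
            if tokens[matched:matched + len(edge)] != edge:
                break                                          # partial edge overlap cannot be reused
            node, matched = child, matched + len(edge)
        return node, matched                                   # reuse node.ssm_state / node.kv_state

tree = RadixTree()
tree.insert("NYC is a busy city".split())                      # Request 1: prefill, then checkpoint
branch = tree.insert("NYC is very huge".split())               # Request 2: splits the edge after "NYC is"
node, matched = tree.longest_prefix("NYC is quite small".split())  # hypothetical third request
print(matched)  # 2 -> the state checkpointed at the "NYC is" branch node can skip that part of prefill
```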
Eviction Example
- When the cache fills, compute each node's utility S(n). Nodes with low recency but high FLOP efficiency (typically long sequences) are retained, while short, low-value entries are evicted (§ 4.2, Fig. 5); a numeric illustration follows below.
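As a hypothetical numeric illustration of this trade-off (the values below are made up and assumed normalized to [0, 1]; they are not taken from the paper):

```python
# Hypothetical, normalized values: recency and FLOP efficiency both in [0, 1].
short_recent = {"recency": 0.9, "flop_eff": 0.1}   # small prefix, touched moments ago
long_stale   = {"recency": 0.4, "flop_eff": 0.8}   # long prefix, not seen for a while

for alpha in (0.0, 1.0):                           # Eq. (2): S = recency + alpha * FLOP-eff
    for name, node in (("short_recent", short_recent), ("long_stale", long_stale)):
        print(alpha, name, node["recency"] + alpha * node["flop_eff"])
# alpha = 0.0 -> 0.9 vs 0.4: pure LRU would evict the long, stale prefix first
# alpha = 1.0 -> 1.0 vs 1.2: FLOP-aware scoring keeps the long prefix and evicts the short one
```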
Performance Verification — Key Results
Experimental Setup (Summary)
Hybrid 7 B model with layers {Attn = 4, SSM = 24, MLP = 28}, FP16 precision,
running on AWS p4d.24xlarge (A100-40 GB × 8).
For TTFT analysis, they also used Jamba-1.5-Mini (12 B active / 52 B total, state dim = 128, A100-40 GB × 4) (§ 5.1).
A. Headline Numbers
- Token Hit Rate: Compared with vLLM+, Marconi achieves 4.5× / 7.3× / 34.4× higher rates on LMSys / ShareGPT / SWEBench (Fig. 7).
- P95 TTFT Reduction (vs vLLM+): 36.1 % / 71.1 % / 46.8 %, corresponding to −275.4 ms / −103.3 ms / −617.0 ms (Fig. 9).
- Token Hit Rate Gain (vs SGLang+): +45.6 % / +19.0 % / +219.7 % on LMSys / ShareGPT / SWEBench (Fig. 8).
B. Summary Table — Equal Settings Comparison
| Benchmark | Metric | Baseline | Marconi vs Baseline |
|---|---|---|---|
| LMSys | Token Hit Rate | vLLM+ | 4.5× higher (Fig. 7) |
| ShareGPT | Token Hit Rate | vLLM+ | 7.3× higher (Fig. 7) |
| SWEBench | Token Hit Rate | vLLM+ | 34.4× higher (Fig. 7) |
| LMSys | P95 TTFT | vLLM+ | −36.1 % (−275.4 ms) (Fig. 9) |
| ShareGPT | P95 TTFT | vLLM+ | −71.1 % (−103.3 ms) (Fig. 9) |
| SWEBench | P95 TTFT | vLLM+ | −46.8 % (−617.0 ms) (Fig. 9) |
| LMSys | P95 TTFT | SGLang+ | −17.2 % (−131.1 ms) (Fig. 9) |
| ShareGPT | P95 TTFT | SGLang+ | −12.8 % (−18.5 ms) (Fig. 9) |
| SWEBench | P95 TTFT | SGLang+ | −24.7 % (−325.7 ms) (Fig. 9) |
C. Detailed Observations
Length-Dependent Gains:
For short requests (< 7 K tokens), the hit rate drops slightly (up to 3.0 pp) while P5 TTFT rises by only +2.1 ms.
For long requests (≥ 7 K tokens), the hit rate improves by +25.5 pp and FLOP savings reach +90.3 % (Fig. 10).
Cache Contention (60 → 140 GB):
Marconi's advantage peaks under moderate contention, with improvements of 24.3 / 51.5 / 68.3 / 30.0 / 10.0 % across cache sizes (Fig. 11).
Architecture Scaling:
As the Attn:SSM ratio grows from 1:2 to 1:8, Marconi's gain over vLLM+ / SGLang+ increases from 13.5 % / 5.8 % to 2.6× / 59.7 %.
In pure Transformer models, all systems perform identically (Fig. 12a).
Our View — Strengths, Limitations, and Why It Matters
Strengths
- Accurate Problem Formulation: Marconi quantifies how the “exact-match restriction” and fixed-size SSM states break the assumptions behind LRU and size-based policies, formalizing this via Eqs. (1)–(2) and micro-benchmarks (§ 3–4, Fig. 5).
- Empirically Validated Policy Design: Combining the admission and eviction policies boosts the hit rate by up to 34.4× and reduces P95 TTFT by up to 617 ms on long-sequence workloads (Fig. 7–9).
- Trend Alignment: The higher the SSM ratio and state dimension, the larger the gain, which aligns well with the direction of next-gen hybrid architectures (Fig. 12).
Limitations (Recognized and Observed)
- Short Requests (< 7 K): the hit rate drops by up to 3 pp and P5 TTFT rises by +2.1 ms, though P50/P95 still improve (Fig. 10).
- Contention Extremes: At high (60 GB) or low (140 GB) cache sizes, Marconi’s advantage shrinks since either useful prefixes cannot fit or eviction decisions rarely matter (Fig. 11).
- Pure Transformers: All methods perform identically (Fig. 12a).
- System Dependency: Chunked prefill for hybrid models still requires specialized kernels under development in most frameworks, and KV attention still depends on paging management (§ 6).
Why It Matters
Hybrid architectures are a practical path to supporting long contexts (Attention + SSM).
At its core, the problem is “which prefixes should be kept?”
By introducing FLOP-per-byte as a first-principles metric, Marconi normalizes cache decisions in terms of the model’s physical workload properties.
This provides a bridge to future stack-level optimizations such as KV paging and prefix-driven attention acceleration (e.g., Hydragen) (§ 4–6).
What Comes Next? — The Road Ahead
KV Paging × FLOP-aware Integration
Currently, SSM states are non-paged, while KV attention still requires paging.
By extending Marconi's FLOP-efficiency scoring to page replacement, both can share a unified global utility function for consistent optimization (§ 6).
Dissemination of Chunked-Prefill Kernels
To eliminate the overhead of 2-pass or approximate checkpointing, the authors call for widespread adoption of specialized kernels in mainstream frameworks (§ 6).
Load-Adaptive Policy Tuning
Because Marconi's benefit peaks under medium contention, online adaptation of α(t) based on the session arrival rate could maximize hit-rate efficiency (Fig. 11 & 13).
Synergy with Attention Accelerators
Prefix-sharing systems like Hydragen keep throughput losses below 15 % even for 16 K-token prefixes, suggesting a natural synergy:
Marconi decides what to keep; Hydragen decides how to use it fast (Hydragen Abstract & § 4).
Appendix — Experimental Notes (for Reproducibility)
- Datasets / Traces: LMSys, ShareGPT, SWEBench — each comprising multi-turn sessions with variable request intervals to mimic real-world latency (§ 5.1).
- Models / Hardware: 7 B Hybrid {Attn 4 | SSM 24 | MLP 28}, Jamba-1.5-Mini (12 B active / 52 B total, state dim = 128), FP16 on A100-40 GB (§ 5.1).
- Baselines: vLLM and SGLang lack native hybrid support,
so the authors built hybrid-compatible versions (vLLM+, SGLang+) for fair comparison (§ 5.1).
In one sentence:
By redefining the twin axes of Admission (reuse probability) and Eviction (FLOPs per byte),
Marconi consistently achieves higher hit rates and lower TTFT
in realistic hybrid serving environments dominated by long-context SSM workloads (Fig. 7–9, § 5.4).
Detailed LLM Q&A Analysis
Prompt 1.1.1 — Research Gap
Summary:
Due to the in-place updates of SSM states, conventional prefix caches can only reuse exact matches,
causing cache bloat and low reuse rates (§ 1).
Fine-grained checkpointing further fragments the cache: reuse drops to 3.3 % (block = 128) and state memory grows to 17.4 GB per 10 K tokens in 7 B hybrids (§ 3).
Marconi introduces the first hybrid-aware caching system combining
reuse-aware admission and FLOP-efficient eviction,
achieving up to 34.4× higher token hit rates and up to 71.1 % (617 ms) lower P95 TTFT (Abstract, § 5).
1) State of the Art at Publication
Transformers suffer from O(L²) prefill costs and O(L) KV memory;
SSM layers offer O(L) compute and O(1) memory (§ 1, Fig. 1 c).
Hybrid architectures (Attn:SSM ≈ 1:6–1:10) emerged as an efficiency trade-off (Fig. 1 a).
However, existing prefix-sharing optimizations (e.g., Hydragen) target Transformers only,
and LRU-based caches (vLLM, SGLang) assume constant FLOP efficiency per KV state.
These fail for hybrids where SSM updates are in-place and states are massive (§ 1, § 3).
2) Defined Research Gaps
| Gap | Details | Why It’s Hard |
|---|---|---|
| Partial-reuse absence | SSM states update in place → only exact hits allowed | Transformer-centric cache assumptions break down (§ 1) |
| Over-checkpointing | Huge entry explosion → 3.3 % reuse / 17.4 GB per 10 K tokens | SSM dominates memory over KV (§ 3) |
| Policy criteria gap | LRU ignores FLOP/byte → short sequences kept, long ones evicted | SSM and KV have different FLOP/memory ratios (§ 4.2) |
| Heterogeneous state management | KV and SSM tracked separately → prefix consistency breaks | All layers must share identical prefix tokens (§ 4, Fig. 4) |
3) Core Idea to Bridge the Gap
- Admission: Estimate reuse probability via two prefix types — Purely-Input vs Input+Output (§ 4.1).
- Eviction: Score each entry by the utility $S(n) = \text{recency}(n) + \alpha \cdot \text{FLOP-eff}(n)$ (§ 4.2).
- Unified Radix Tree: Store KV and SSM jointly and checkpoint only branch points (Fig. 4).
4) Empirical Validation
- Token hit rate: vLLM+ baseline → 4.5× / 7.3× / 34.4× gain (Fig. 7).
- P95 TTFT: up to −71.1 % (617 ms) vs vLLM+, −24.7 % (325 ms) vs SGLang+ (Fig. 9).
- LRU → FLOP-aware: +19 – 219 % hit rate, +90 % FLOP savings (Fig. 8, 10).
5) Related Work Summary
Hydragen accelerates prefix reuse in Transformers via matrix-matrix decomposition but ignores SSM’s in-place states.
Traditional LRU caches assume constant FLOP/byte, making them inefficient for hybrids (§ 4.2).
Marconi fills this gap by jointly optimizing admission and eviction across KV + SSM.
TL;DR of Gap Analysis
Hybrid LLMs’ in-place SSM updates invalidate partial-reuse and LRU assumptions.
Marconi’s reuse-aware admission and FLOP-based eviction restore efficiency, achieving up to 34.4× hit-rate and 71 % TTFT reduction (§ 1, 4, 5).
Prompt 1.1.2 — Central Hypothesis
By combining reuse-based admission, FLOP/byte-efficient eviction, and unified KV + SSM management, Marconi overcomes the in-place SSM update limitation that prevents partial prefix reuse — achieving up to 34.4× higher hit rate and 71.1 % (617 ms) lower TTFT (Abstract).
Prompt 1.5.2 — Future Research Trajectory
The authors suggest several next steps (§ 6, § 7, Fig. 12–13):
- Generalization to All Recurrent Hybrids — Extend Marconi beyond Mamba-style SSMs to any recurrent layer with large, cyclic states.
- Integration with KV Paging — Combine non-paged SSM states with paged attention memory using a shared FLOP-aware utility score.
- Specialized Chunked-Prefill Kernels — Develop custom kernels to avoid two-pass prefill overheads.
- Load-Sensitive Policy Scheduling — Adapt admission/eviction weights α(t) online to workload intensity.
- Synergy with Hydragen — Combine Marconi’s “what to keep” policy with Hydragen’s “how to reuse efficiently,” which maintains < 15 % throughput loss for 16 K prefixes.
Reviewer-Proposed Extensions
| Observed Limitation | Future Direction | Expected Metric Impact |
|---|---|---|
| Heterogeneous paging (SSM vs KV) | Unify Marconi × PagedAttention using global FLOP-aware utility | TTFT↓, Hit Rate↑, Page Faults↓ |
| Immature chunked prefill kernels | Develop dedicated 1-pass prefill kernels | Prefill Latency↓, Memory Δ≈0 |
| Load contention sensitivity | Scheduler-linked adaptive α(t) | SLA P95↓, Recompute FLOPs↓ |
| Unaccelerated Attention costs | Combine with Hydragen (prefix reuse acceleration) | Throughput↑, $/1 M tok ↓ |
| Single-GPU scope | Cluster-level routing via “longest-prefix-first” | Cross-GPU Hit ↑, Network Traffic ↓ |
| Heuristic policy | Learning-based admission/eviction (bandit style) | Hit ↑, Over-admit ↓ |
Key Quotes
“Chunked prefill … requires specialized kernels under development” (§ 6).
“We only evaluate Mamba/SSMs … but it can be extended” (§ 6).
In Short
The next research frontier lies in co-evolving three axes:
(1) Policy — what to retain, (2) Kernel / Paging — how to store, and (3) Attention Acceleration — how to reuse fast.
Together, these can push hybrid LLM serving toward truly compute-normalized efficiency (§ 4–6; Hydragen § 3–4).
![[Paper Review] Marconi: Prefix Caching for the Era of Hybrid LLMs](https://pbs.twimg.com/media/GdyLXO9W4AADox0.jpg)