Paper Link

DroidSpeak: Reducing Prefill Latency by 1.7–3.1× through Cross-LLM Prefix-KV Reuse

TL;DR

When multiple LLMs share the same architecture but different weights, DroidSpeak allows the receiving model to reuse the sender model’s prefix KV cache by performing contiguous-layer partial recomputation and E-first pipelining.
This achieves 1.7–3.1× lower TTFT (prefill latency), up to 4× higher online throughput, and negligible quality loss (F1/ROUGE/CodeSim), assuming an offline O(L²) profiling stage (see §4.2–§5.3, Fig.13–15).


Core Idea


Background — What Problem Does It Solve?

In production settings, multiple LLMs (same architecture, different weights) often process similar prefixes repeatedly.
Existing systems (e.g., vLLM, PagedAttention) are optimized for intra-model prefix sharing and KV management but fail when reusing KV caches across different models, causing severe quality collapse.
This paper targets that gap — answering the question:

“Can cross-LLM KV reuse be achieved without sacrificing output quality?” (§1, §Related Work)


New Approach — DroidSpeak

Definition:
Given two models (S, R) with identical architectures and a prefix input x₁:ₙ, DroidSpeak lets receiver R reuse sender S’s KV/E-cache while recomputing only a contiguous subset of layers and reusing the rest.
E-cache transmission and layer recomputation are overlapped via CUDA streams, reducing TTFT (§4.1–§4.4).

How It Works — A Concrete Example

Toy Setup

KV Cache Memory (reference formula)

$$ \text{KV(GB)} \approx \frac{2 \cdot L \cdot H \cdot d_\text{head} \cdot \text{seq} \cdot \text{bs} \cdot \text{bytes/elt}}{10^9} $$

Substituting the toy values gives 1.28×10⁻⁷ GB, negligible in scale — but real workloads (large L·H·d_head·seq·bs) scale this to several GBs (§2).
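As a quick sanity check, here is a minimal Python sketch (our own, assuming the toy setup uses H = 1 head and batch = 1 alongside L = 4, d_head = 2, seq = 4, fp16) that evaluates the formula for the toy values and for a larger, purely illustrative configuration:

```python
def kv_cache_gb(layers, heads, d_head, seq_len, batch, bytes_per_elt=2):
    """KV-cache size in GB: 2 (K and V) * layers * heads * d_head * seq * batch * bytes/elt."""
    return 2 * layers * heads * d_head * seq_len * batch * bytes_per_elt / 1e9

# Toy values (fp16 => 2 bytes/element): 128 bytes = 1.28e-7 GB.
print(kv_cache_gb(layers=4, heads=1, d_head=2, seq_len=4, batch=1))

# Illustrative realistic setting: 32 layers, 32 KV heads of dim 128,
# an 8K-token prefix, batch 1, fp16 -> roughly 4.3 GB per request.
print(kv_cache_gb(layers=32, heads=32, d_head=128, seq_len=8192, batch=1))
```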


Pipeline Illustration

```mermaid
sequenceDiagram
  participant S as Sender (S)
  participant R as Receiver (R)
  Note over S,R: Transition layer = 3 (recompute block: 3–4)
  S->>R: send E_cache[L3] (transmit E first)
  activate R
  R->>R: recompute L3–L4 (start recomputation, compute stream)
  par
    S-->>R: send KV[L1–L2, L5–L6] (KV loading, transfer stream)
  and
    R->>R: continue recompute
  end
  R->>R: assemble KV (reuse + recompute) → prefill done → decode
  deactivate R
```

This strategy cuts latency from (a) load all, then compute: 47 units, to (b) KV preloading only: 30 units, to (c) full pipeline overlap: 17 units, a ≈2× speed-up in relative time units (§4.3, Fig.13).
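As a simplified timing sketch (our own model, not the paper's exact cost breakdown): let $T_E$ be the transition-layer E-cache transfer time, $T_{KV}$ the reused-KV transfer time, and $T_C$ the recomputation time. Then

$$
T_{\text{sequential}} \approx T_E + T_{KV} + T_C,
\qquad
T_{\text{pipelined}} \approx T_E + \max\left(T_{KV},\, T_C\right),
$$

so overlapping hides whichever of the transfer or compute terms is smaller; the reported 47 → 30 → 17 units follow this pattern, with the exact values depending on the schedule detailed in Fig.13.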


Performance Evaluation — Key Results

Most persuasive comparison: “At equal latency, +5–33 pp quality (avg +16 pp) over CacheBlend.” This shows contiguous-layer recomputation preserves accuracy far better than token-level corrections (§5.2).


Our Perspective — Strengths, Limitations, and Significance

Strengths

Limitations

Why It Matters

In multi-model production (different weights, same architecture), redundant prefix prefill dominates serving time. DroidSpeak is the first practical framework to exploit this redundancy across models, maintaining output quality while substantially cutting compute and latency (§1, §5).


Next Steps — Future Directions


Appendix Notes (for Reproducibility)


One-line summary: “Recompute only the critical contiguous layers, send E first, and overlap compute with transfer.” These three simple principles yield measurable improvements in TTFT/TBT/E2E, making cross-LLM prefix reuse a deployable reality (§4–§5).

<details markdown="block">
<summary>▶️<strong>Click to expand detailed Q&A analysis</strong></summary>

---

## Prompt 1.1.1 — Research Gap

**Summary:**  
Previous optimization studies focused only on **intra-LLM prefix sharing**: reusing and managing KV caches *within* a single model (e.g., **vLLM, PagedAttention, LMCache, Hydragen**).  
However, **cross-LLM KV reuse**, i.e., sharing KV caches **across models** with the *same architecture but different weights*, remained unsolved.  
DroidSpeak bridges this gap by selectively **recomputing only ~10% of critical contiguous layers** while **reusing KV caches** from another model, and overlaps **transmission and recomputation** to achieve  
**1.7–3.1× TTFT reduction** and **up to 4× throughput increase**, with **minimal F1/ROUGE-L/code similarity loss** (§Abstract/§1/§4–5).

---

### Key Numbers (Summary)

* **Scope:** Cross-LLM KV reuse for models with identical **architecture**, differing **weights** (§1)  
* **Recomputation ratio:** Only **~10% of layers** are critical and need recomputation (§3.2)  
* **Serving:** **TTFT 1.7–3.1× faster**, **throughput up to 4× higher**, **avg. prefill speed-up 2.1×** (§5)  
* **Quality:** F1/ROUGE-L/CodeSim preserved (negligible loss) (§Abstract, §5.2)  
* **Runtime pipeline:** **~2× TTFT reduction** from E-first pipelining (§4.3–4.4, Fig.13)  
* **Profiling Cost:** **O(L²)**; Llama-3-8B (32 layers) ≈ **3 h on an A100**, reducible by **~3×** via 2-layer grouping (§4.2)  

> **TPOT = ms/token.** The paper focuses mainly on **prefill latency (TTFT)** and **throughput** (§5).

---

### Research Gap Filled by This Work

1. **From Single-LLM to Multi-LLM Sharing**  
   - Prior optimizations improved prefix sharing **within one model** (vLLM's *PagedAttention*, LMCache, SGLang, etc.), boosting memory efficiency and cache hit rate.  
   - None extended **computation reuse** to *different* models.  
   - The authors explicitly pose: *"Can KV caches from one LLM be reused by another?"* (§1, §Related Work)

2. **No Framework for Accuracy-Preserving KV Translation**  
   - Naively reusing KV across models breaks accuracy due to *representation mismatch*.  
   - Layer-wise **sensitivity differences** suggest partial recomputation, but no method existed to identify the critical layers or integrate this into a system (§1–3).

3. **Lack of Transmission/Compute Overlap in Distributed Serving**  
   - Prior works focused on single-model cache compression/offloading but ignored overlapping **remote KV/E transfer** with **partial prefill recomputation** (§4.3–4.4).

---

### SOTA Summary at Publication Time

| Axis | Method | Key Idea | Limitation (from this paper's view) |
|------|---------|-----------|------------------------------------|
| **KV Management** | vLLM / PagedAttention, LMCache | Paginated KV management to reduce memory | Does not address compute cost; single-LLM only (§1, §RW) |
| **Prefix Sharing Acceleration** | Hydragen | Prefix/suffix attention factorization and inter-sequence batching | Limited to single-model attention ops (§Hydragen Intro, §RW) |
| **KV Quality Correction** | CacheBlend | Token-wise selective recomputation | Assumes same model, token-level only (§1, §4) |

> In short: abundant **memory/cache optimization**, but no **cross-LLM KV reuse**.  
> DroidSpeak fills this gap with **contiguous-layer recomputation + pipelining** (§Abstract/§1/§4).

---

### Quantitative Contributions

1. **Empirical Insight:** In **8 model pairs**, only ~**10% of layers** are critical, and they are consistent across inputs → reusable via **offline profiling** (§3.2, §4.2)  
2. **Algorithm:** **Recompute contiguous layer blocks** to minimize transition errors → Pareto trade-off between accuracy and latency (§4.1–4.2, Fig.10–11)  
3. **System:** Overlap **remote KV/E transfer** and recomputation via CUDA streams → **~2× TTFT reduction** (§4.3–4.4, Fig.13)  
4. **Effect:** **TTFT 1.7–3.1× faster**, **throughput up to 4× higher**, **+5–33 pp quality vs. CacheBlend at equal latency** (avg +16 pp) (§5.2–5.3)

---

### Why Contiguous Recompute?

Non-contiguous (spot) recomputation introduces multiple *transition boundaries*, each amplifying mismatch errors in the E-cache → large quality loss.  
Contiguous recomputation minimizes transitions, reducing cumulative error and stabilizing quality (§4.1, Fig.10–11).  
Profiling produces a **recompute-layers vs. F1 trade-off frontier**, allowing SLO-based selection (§4.2, Fig.11–12).
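A minimal sketch (our own, with made-up profiling numbers) of how such an SLO-driven pick over the frontier could look; the `profile` points and the 5 pp budget are hypothetical:

```python
# Hypothetical (num_recompute_layers, F1) points for one (sender, receiver)
# pair, as the offline profiling stage would produce (§4.2).
profile = [(0, 0.31), (2, 0.55), (4, 0.68), (6, 0.74), (8, 0.77), (32, 0.78)]

def pick_recompute_budget(profile, full_quality, max_drop_pp=5.0):
    """Smallest contiguous-recompute budget whose quality stays within
    max_drop_pp percentage points of full recomputation."""
    for n_layers, f1 in sorted(profile):
        if (full_quality - f1) * 100 <= max_drop_pp:
            return n_layers
    return max(n for n, _ in profile)  # fall back to full recomputation

print(pick_recompute_budget(profile, full_quality=0.78))  # -> 6 with these toy numbers
```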

---

### Relation to Prior Work

* **KV management/offloading/compression:** single-model only (vLLM, LMCache) → not directly applicable to cross-LLM reuse (§Related Work).  
* **Prefix attention acceleration (Hydragen):** single-model optimization (§Hydragen, §RW).  
* **CacheBlend:** token-level correction within the same model vs. DroidSpeak's layer-group recomputation (§1, §4).

---

### Explicit and Potential Limitations

* **Profiling Overhead:** O(L²), e.g., 32 layers ≈ 3 h on an A100 (one-time, reducible by grouping) (§4.2).  
* **Data Drift Sensitivity:** profiling mismatch may degrade quality → periodic re-profiling suggested (§6).  
* **Bandwidth Sensitivity:** performance gains shrink as network bandwidth increases (§6).  

---

**In essence:**  
DroidSpeak reframes prefix reuse from a **memory-only problem** to a **compute-sharing problem across models**, introducing layer sensitivity analysis and practical distributed pipelining.

---

## Prompt 1.1.2 — Central Hypothesis

The authors hypothesize that **for LLMs sharing the same architecture**, the *receiver model* can safely **reuse prefix KV caches** from a *sender model* by combining **contiguous-layer selective recomputation** and **E-first pipelined KV loading (DroidSpeak)**.  
This overcomes two limitations, (1) **single-model-only caching** and (2) **accuracy collapse under cross-model KV reuse**, achieving **1.7–3.1× TTFT reduction** and **up to 4× throughput increase** with **negligible loss in F1/ROUGE-L/CodeSim** (§Abstract, §1).

---

## Prompt 1.2.1 — Novel Contributions

**Summary:**  
DroidSpeak's originality lies in (1) identifying **layer-wise sensitivity in cross-LLM KV reuse (~10% critical layers)**, (2) designing a **system architecture** that couples contiguous-layer recomputation with E/KV pipelining, and (3) demonstrating **substantial serving gains (TTFT 1.7–3.1× faster, throughput up to 4× higher)** with quality retention (§3.2, §4, §5.2, Fig.9–15).

---

### 1) Layer Sensitivity Discovery (Analytical Insight)

* **Type:** Theoretical & empirical insight.  
* **Finding:** Across 8 model pairs and 6 datasets, reusing all layers causes >50 pp accuracy loss (e.g., on HotpotQA), but only ~10% of layers are critical and stable across inputs ("critical layers") (§3.1–3.2, Fig.7).  
* **Significance:** Defines a new optimization axis between *full reuse (poor quality)* and *full recomputation (high latency)* (§3).

---

### 2) Contiguous-Layer Recompute + Pipelining (System Innovation)

* **Type:** New system/runtime architecture with algorithmic design.  
* **Key Ideas:**  
  - Recomputing scattered layers increases transition mismatches; grouping **contiguous layers** minimizes error (§4.1–4.2, Fig.10–11).  
  - **E-first pipelining:** start recomputation upon receiving the transition-layer E-cache while loading the other KVs in parallel → TTFT reduced from 47 to 17 units (~2×) (§4.3, Fig.13).  
  - **Offline profiling:** O(L²) search; 32-layer Llama-3-8B ≈ 3 h on an A100, reducible ~3× with 2-layer grouping (§4.2).  
  - **Implementation:** ~3K LoC, PyTorch 2.0/CUDA 12.0, integrated into vLLM/LMCache with new APIs (`store`, `fetch`, `partial_prefill`); a usage sketch follows below (§4.4).
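A hedged usage sketch of how these three APIs could be wired together at serving time; the cache object, the control flow, and the `recompute_cfg` fields are our assumptions, not the paper's reference implementation:

```python
def serve_request(context, sender, receiver, recompute_cfg, cache):
    # 1. Ensure the sender's prefix KV/E caches exist (done once per prefix).
    if not cache.contains(context, sender):
        cache.store(context, sender)                        # store(context, LLM)

    # 2. Fetch the transition-layer E-cache first, then the reusable KV
    #    layers; we assume fetch() places them in the receiver's local store.
    cache.fetch(context, sender, recompute_cfg.transition_layer)
    for layer in recompute_cfg.reuse_layers:
        cache.fetch(context, sender, layer)                 # fetch(context, LLM, layer)

    # 3. Recompute only the contiguous block; reuse everything fetched above,
    #    then continue into normal decoding on the receiver.
    return receiver.partial_prefill(recompute_cfg, context)  # partial_prefill(cfg, context)
```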

---

### 3) Real Serving Results (Empirical Validation)

* **Type:** Application of existing methods to a new cross-LLM scenario.  
* **Metrics:**  
  - TTFT 1.7–3.1× faster (avg 2.1×), throughput up to 4× higher (8 model pairs, 3 datasets) (§5.2–5.3).  
  - Quality preserved within negligible F1/ROUGE-L/code similarity loss (§Abstract, §5.2).  
  - Outperforms CacheBlend by +5–33 pp (avg +16 pp) at equal latency (§5.2).  

> Note: the E-cache may be 2–4× larger than the KV cache, especially in GQA models (§4.1).

---

**In short:**  
DroidSpeak establishes the first systematic **cross-LLM KV reuse framework** with  
layer-sensitivity insight → contiguous recomputation + pipelining → measurable serving gains.

---

## Prompt 1.2.2 — Claimed Strengths (Author’s View)

The authors claim superiority by demonstrating that their **cross-LLM selective recomputation + pipelined KV loading** achieves **1.7–3.1× TTFT reduction** and **up to 4× throughput increase** with **negligible quality loss**, outperforming all **single-LLM-only caching** methods (§1, §5, Fig.13).

---

### Strength 1 — Empirical Basis for Selective Recomputation

* **Layer Sensitivity Evidence:** Only ~11% of layers are critical; recomputing just these preserves accuracy (§3.2, Fig.7).  
* **Consistency Across Inputs:** Critical-layer patterns remain stable, enabling one-time **offline profiling** at O(L²) cost; a worked candidate count is sketched below (§3.2, §4.2).
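As a quick worked count (our own arithmetic, assuming the search enumerates every contiguous layer range): with $L$ layers there are $L(L+1)/2$ candidate groups, which is where the $O(L^2)$ cost comes from; coarsening to 2-layer granularity shrinks the candidate set by roughly 4×, broadly consistent with the ~3× profiling-time reduction reported in §4.2.

$$
\frac{L(L+1)}{2} = 528 \ \text{candidates for } L = 32,
\qquad
\frac{16 \cdot 17}{2} = 136 \ \text{candidates at 2-layer granularity.}
$$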

### Strength 2 — Contiguous Layer Grouping to Reduce Error Propagation

* **Why Contiguous:** Non-contiguous recomputation increases transition mismatches, amplifying errors (§4.1, Fig.10–11).  
* **Benefit:** Pareto frontier between recompute count and F1 loss allows precise SLO tuning (§4.2, Fig.11).

### Strength 3 — Transmission–Compute Pipelining for Distributed Efficiency

* **Problem:** Remote KV/E transfer latency grows with node distance (§4.3).  
* **Solution:** Transmit the E-cache first and overlap recomputation with KV loading → TTFT 30 → 17 units (~2× improvement) (§4.3, Fig.13).  
* **Integration:** Unified API within vLLM/LMCache, lightweight (~3K LoC) (§4.4).

---

### Comparison Summary

| Axis | Prior Methods | Limitation | DroidSpeak's Advantage |
|------|----------------|-------------|------------------------|
| Cache Scope | Intra-LLM only | Cross-LLM reuse breaks quality | Enables reuse across models via partial recomputation (§1) |
| Recomputation Unit | Token/scattered layers (CacheBlend) | Multiple transitions → error amplification | Contiguous layer groups minimize propagation (§4.1, Fig.10–11) |
| Distributed Transfer | Sequential KV loading | Idle latency | Pipelined E-first loading hides latency (~2× TTFT, §4.3) |
| End-to-End Metrics | | | Prefill 1.7–3.1× faster, throughput up to 4× higher, negligible quality loss (§5, Fig.1) |

> In summary: the simple, consistent principle of **"recompute contiguous layers + pipeline transfers"** achieves joint improvements in latency, throughput, and quality (§3–§5).

---

## Prompt 1.3.1 — Step-by-Step Algorithm Explanation

**Summary:**  
DroidSpeak enables **partial prefill reuse** between LLM pairs (sender S, receiver R) sharing architecture but differing weights.  
It (1) profiles **which contiguous layer groups** to recompute (O(L²)), (2) pipelines **E-cache-first transfer and recomputation**, and (3) achieves **1.7–3.1× lower TTFT** and **up to 4× higher throughput** (§4.2, Fig.11; §4.3, Fig.13; §5.2–5.3).

---

### 0) Background & Terms

* **KV cache:** Per-layer key/value tensors; **E-cache** is per-layer embedding input (§2, Fig.2).  
* **Prefill vs Decode:** Prefill processes the entire input at once (heavy), while decode generates token-by-token (§2, Fig.3).  
* **Motivation:** Repeated prefixes across models waste computation; cross-LLM KV reuse was unsolved (§Abstract, §1).

---

### 1) Offline Stage — Finding Contiguous Recompute Groups

* **Input:** Model pair (S,R) and profiling dataset (HotpotQA, 50 contexts) (§4.2, §5).  
* **Procedure:**
  1. Generate all contiguous layer-group candidates (optionally grouped by 2 for efficiency).  
  2. Measure quality (F1/ROUGE-L/code similarity) vs recompute layers.  
  3. Extract Pareto-optimal points within ≤5 pp quality loss (§4.2, Fig.11).  
  4. Complexity: O(L²); a 32-layer model ≈ 3 h on an A100, reducible ~3× with 2-layer grouping (§4.2).  
* **Rationale:** Each transition adds mismatch; recomputing a contiguous block (e.g., layers 16–27) minimizes cumulative error (§4.1, Fig.10). A sketch of this offline search loop follows below.
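The following sketch is our own pseudocode for the offline O(L²) stage, under the assumption that the search enumerates every contiguous block at a chosen granularity; `evaluate_quality` stands in for running the receiver with that block recomputed and scoring it on the profiling set:

```python
def profile_pair(num_layers, evaluate_quality, group=1):
    """Enumerate contiguous recompute blocks [start, end) at `group`-layer
    granularity and record (block, block_size, quality) points."""
    points = []
    for start in range(0, num_layers, group):
        for end in range(start + group, num_layers + 1, group):
            quality = evaluate_quality(start, end)   # recompute layers [start, end)
            points.append(((start, end), end - start, quality))
    return points

def pareto_frontier(points):
    """Keep configs where no other config recomputes fewer layers with
    equal-or-better quality."""
    best = {}
    for block, size, q in points:
        if size not in best or q > best[size][1]:
            best[size] = (block, q)
    frontier, best_q = [], -1.0
    for size in sorted(best):
        block, q = best[size]
        if q > best_q:
            frontier.append((block, size, q))
            best_q = q
    return frontier
```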

---

### 2) Online Stage — Partial Prefill & Smart Loading

* **Inputs:** Chosen recompute group (e.g., L4–L10), SLO-based Pareto selection (§4.3).  
* **Core Operations:**
  - Transmit the **E-cache** of the transition layers first (2–4× larger than the KV).  
  - Overlap recomputation and KV loading using separate CUDA streams (§4.3, Fig.13).  
  - Achieves a TTFT reduction from 30 to 17 units (~2×).  
* **Runtime API:**  
  `store(context, LLM)`, `fetch(context, LLM, layer)`, `partial_prefill(recompute_config, context)` integrated with vLLM/LMCache via `torch.distributed` (§4.4); a stream-overlap sketch follows below.
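An illustrative PyTorch sketch (our own, not the paper's code) of the E-first overlap: reused-KV host-to-device copies run on a transfer stream while the contiguous block is recomputed on a compute stream; `model.layers[i](hidden)` returning `(hidden, kv)` is an assumed interface:

```python
import torch

def pipelined_partial_prefill(model, e_cache, reused_kv_cpu, recompute_layers):
    transfer_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()
    reused_kv_gpu = {}

    with torch.cuda.stream(transfer_stream):
        # Asynchronously copy the reusable KV tensors host -> device
        # (non_blocking copies assume pinned host memory).
        for layer, (k, v) in reused_kv_cpu.items():
            reused_kv_gpu[layer] = (k.cuda(non_blocking=True),
                                    v.cuda(non_blocking=True))

    with torch.cuda.stream(compute_stream):
        # Start recomputation from the transition layer's E-cache
        # while the transfer stream is still running.
        hidden, recomputed_kv = e_cache, {}
        for layer in recompute_layers:
            hidden, kv = model.layers[layer](hidden)   # assumed layer interface
            recomputed_kv[layer] = kv

    # Both streams must finish before the merged KV cache is used for decode.
    torch.cuda.synchronize()
    return {**reused_kv_gpu, **recomputed_kv}
```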

---

### 3) Toy Example — L = 4, seq = 4, d_head = 2, fp16

* **Setup:**  
  - Models S and R share architecture but differ in weights.  
  - Profiling yields recompute block {L3,L4}, reuse {L1,L2} (§4.2, Fig.11).  
  - KV size formula:
    $$
    \text{KV(GB)} \approx \frac{2 \cdot L \cdot H \cdot d_\text{head} \cdot \text{seq} \cdot \text{batch} \cdot \text{bytes/elt}}{10^9}.
    $$

* **Process:**  
  1. **Lookup & Load:** R checks cache from S; if absent, S generates and sends (§4.4).  
  2. **E-cache Transfer:** R receives E[L3] and starts recomputation (L3–L4) while loading the reused KVs (L1–L2) concurrently (§4.3).  
  3. **Assemble:** Merge recomputed and reused KV, finish prefill, begin decoding (§2, Fig.3).  
  4. **Output:** Quality ≈ R's baseline (≤5 pp difference), TTFT 1.7–3.1× faster (§5.2).

---

### 4) Pipeline Timeline (Concept)

```mermaid
sequenceDiagram
  participant S as Sender (S)
  participant R as Receiver (R)

  Note over S,R: Transition at L3 (reuse L1–L2 KV, recompute L3–L4)
  S->>R: send E_cache[L3] (start recompute)
  activate R
  R->>R: recompute L3–L4 (compute stream)
  par
    S-->>R: send KV[L1–L2] (transfer stream)
  and
    R->>R: continue recompute
  end
  R->>R: assemble {KV[L1–L2], KV[L3–L4]} → decode
  deactivate R
```

Three pipeline strategies: (a) sequential (load all, then compute): 47 units; (b) KV preloading: 30 units; (c) pipelined overlap: 17 units (≈2× faster) (§4.3, Fig.13).


### 5) Accuracy & Performance Summary

| Metric | DroidSpeak Improvement |
|--------|------------------------|
| Prefill | 1.7–3.1× faster (avg 2.1×, 8 model pairs × 3 datasets) (§5.2, Fig.14) |
| Throughput | Up to 4× higher in online serving (§5.3, Fig.15–16) |
| Quality | F1/ROUGE-L/code similarity stable, +5–33 pp vs CacheBlend (§5.2) |

### 6) Why E-cache First and Contiguous Groups?

Sending the transition-layer E-cache first lets recomputation start immediately while hiding transfer latency, and keeping the recompute block contiguous minimizes transition boundaries and the error they accumulate (§4.1, Fig.9).


### 7) Implementation Notes

Summary: Offline: find contiguous recompute groups → Online: send E first, overlap recompute + KV load → Improved prefill & throughput (§4–5).


## Prompt 1.3.2 — The “Secret Weapon”

Key Component: Smart KV loading pipeline — sending transition-layer E-cache first, overlapping recomputation and KV transfer.

| Variant | Mechanism | TTFT (units) | Δ vs Pipeline | Note |
|---------|-----------|--------------|---------------|------|
| Proposed (E-first) | Start recompute immediately, overlap KV loading | 17 | baseline | ~2× faster (§4.3, Fig.13c) |
| KV-preload only | Load KVs sequentially, no overlap | 30 | +76% | Slower (§4.3, Fig.13b) |
| No pipeline (naive) | Load all (E+KV) then compute | 47 | +176% | Worst (§4.3, Fig.13a) |

Mechanism:

  1. E-first → immediate compute start.
  2. Overlap transfer and compute to hide latency → TTFT ↓30→17.
  3. Larger gain when recompute ratio is smaller (more KV to transfer) (§4.3).
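A quick arithmetic check of the table's Δ column from the reported units (our own calculation):

```python
# Verify the "Δ vs Pipeline" percentages from the reported TTFT units.
ttft = {"pipelined (E-first)": 17, "KV-preload only": 30, "no pipeline": 47}
baseline = ttft["pipelined (E-first)"]
for name, t in ttft.items():
    print(f"{name}: {t} units, +{(t - baseline) / baseline:.0%} vs pipeline")
# -> +0%, +76%, +176%, matching the table above (§4.3, Fig.13).
```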

## Prompt 1.4.1 — Key Results

Benchmarks include HotpotQA, 2wikimQA, MultiNews, LCC, RepoBench (§5). Baselines: Full prefill (vLLM), full KV reuse, CacheBlend (§5.1).

Figures: Fig.14: Prefill–Quality Pareto; Fig.15: online throughput curves; Fig.16: code-agent latency.


## Prompt 1.4.2 — Critical Comparison

Conclusion: DroidSpeak achieves practical SOTA on latency–quality balance. Full reuse is faster but unusable (quality collapse); CacheBlend corrects tokens, not structure; DroidSpeak corrects layers.


## Prompt 1.5.1 — Acknowledged & Potential Limitations

Explicitly stated limitations (§6):

  1. No support for cross-foundation models (different KV shapes).
  2. No bandwidth-aware adaptive recomputation.
  3. Vulnerable to data drift (profiling mismatch).
  4. Diminished absolute gain under high-bandwidth networks (§5.7, Fig.20).
  5. E-cache overhead (2–4× KV size, esp. in 70B models) (§4.1).
  6. Profiling cost O(L²) ≈ 3h@A100 for 32 layers (§4.2).

Potential issues: E-cache transfer spikes, drift-triggered quality drop, engineering complexity (~3K LoC, multi-node sync). Yet relative improvements remain stable (§4.1–§6).


## Prompt 1.5.2 — Future Research Trajectories

Next Steps (§6):

  1. Cross-foundation KV Alignment: Normalize head count / hidden size mismatch via projection or low-rank adapters (e.g., RoPE-scale alignment).

  2. Bandwidth-Aware Scheduler: Optimize the recompute ratio r* so that TTFT is minimized under link bandwidth B, RTT, and GPU load; an illustrative formulation is sketched after this list (§6, Fig.20).

  3. Online Reprofiling: Sliding-window drift detection (entropy/self-consistency) to refresh profiling every T minutes (§6).
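One illustrative way to write this scheduling objective (our own formulation, not from the paper): pick the recompute ratio $r$ that minimizes TTFT given bandwidth $B$ and round-trip time, subject to the profiled quality budget $\epsilon$:

$$
r^{*} = \arg\min_{r \in [0,1]} \left[ \mathrm{RTT} + \max\!\left( \frac{\mathrm{bytes}_{\mathrm{KV+E}}(1-r)}{B},\; T_{\mathrm{recompute}}(r) \right) \right]
\quad \text{s.t.} \quad \Delta\mathrm{quality}(r) \le \epsilon .
$$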

Proposed Extensions:

One-line takeaway: Future work aims to generalize KV reuse across heterogeneous models and variable-bandwidth systems—making cross-LLM caching adaptive, robust, and real-time.


</details>

Copyright Notice

Author: Jaehun Ryu

Link: https://jaehun.me/en/posts/paper-review-droidspeak-kv-cache-sharing-for-cross-llm-communication-and-multi-llm-serving/

License: CC BY 4.0

This work is licensed under the Creative Commons Attribution 4.0 International License. You are free to use it for any purpose, including commercial use, as long as you provide proper attribution.
