Here are all published articles, sorted by date in descending order.

[Paper Review] DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving
Paper Link DroidSpeak: Reducing Prefill Latency by 1.7–3.1× through Cross-LLM Prefix-KV Reuse TL;DR When multiple LLMs share the same …
![[Paper Review] Marconi: Prefix Caching for the Era of Hybrid LLMs](https://pbs.twimg.com/media/GdyLXO9W4AADox0.jpg)
Paper Link Marconi: Rethinking Prefix Caching for the Hybrid LLM Era TL;DR Marconi introduces a prefix-caching framework for hybrid LLM …
![[Paper Review] SGLang: Efficient Execution of Structured Language Model Programs](https://cdn.bytez.com/mobilePapers/v2/neurips/94872/images/20-0.png)
Paper Link SGLang & RadixAttention: How Execution Optimization for “LM Programs” Achieved a 6.4x Speedup TL;DR By combining …
![[Paper Review] Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality](https://icml.cc/media/PosterPDFs/ICML%202024/32613.png)
Paper Link Structured State Space Duality: Unifying SSMs and Attention with Mamba-2 for 2–8× Acceleration TL;DR Structured State-Space …
Paper Link Dynamic Memory Sparsification (DMS): Making LLM Hyper-Scaling a Reality with 8× KV Cache Compression One-Line Summary (TL;DR) …
Paper Link Hydragen: The Secret Weapon for Decoding Large Batches with Shared Prefixes up to 32× Faster TL;DR By decomposing the prefix and …