All posts under category "Paper-Review"

[Paper Review] DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving
Paper Link: DroidSpeak: Reducing Prefill Latency by 1.7–3.1× through Cross-LLM Prefix-KV Reuse. TL;DR: When multiple LLMs share the same …
![[Paper Review] Marconi: Prefix Caching for the Era of Hybrid LLMs](https://pbs.twimg.com/media/GdyLXO9W4AADox0.jpg)
Paper Link: Marconi: Rethinking Prefix Caching for the Hybrid LLM Era. TL;DR: Marconi introduces a prefix-caching framework for hybrid LLM …
![[Paper Review] SGLang: Efficient Execution of Structured Language Model Programs](https://cdn.bytez.com/mobilePapers/v2/neurips/94872/images/20-0.png)
Paper Link: SGLang & RadixAttention: How Execution Optimization for “LM Programs” Achieved a 6.4× Speedup. TL;DR: By combining …
![[Paper Review] Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality](https://icml.cc/media/PosterPDFs/ICML%202024/32613.png)
Paper Link: Structured State Space Duality: Unifying SSMs and Attention with Mamba-2 for 2–8× Acceleration. TL;DR: Structured State-Space …
Link to Paper: Dynamic Memory Sparsification (DMS): Making LLM Hyper-Scaling a Reality with 8× KV Cache Compression. One-Line Summary (TL;DR): …
Paper Link: Hydragen: The Secret Weapon for Decoding Large Batches with Shared Prefixes up to 32× Faster. TL;DR: By decomposing the prefix and …
![[Paper Review] KIMI K2: OPEN AGENTIC INTELLIGENCE](https://github.com/MoonshotAI/Kimi-K2/raw/main/figures/kimi-logo.png)
Paper Link: Kimi K2: An Open-Source LLM’s Leap Toward Agentic Intelligence. TL;DR: With a 3-stage pipeline consisting of MuonClip pretraining + …
Paper Link: Qwen 3: The Evolution of a Giant MoE Language Model with Adjustable Reasoning Depth. TL;DR (in one line): Qwen 3 couples a …
![[Paper Review] Massive Activations in Large Language Models](https://eric-mingjie.github.io/massive-activations/assets/main_teaser_final.png)
Paper Link: Massive Activations, Hidden Biases: A Reinterpretation of Self-Attention’s Secrets. TL;DR: Just 4–10 extreme scalar values …
Paper Link: Peri-LayerNorm: A Third Option Beyond Post-LN and Pre-LN. TL;DR: By simply adding another LayerNorm right after the residual …
![[Paper Review] SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-bit Training](https://cdn-uploads.huggingface.co/production/uploads/66c0a08bac74db25de8427ec/Tb20E3IJSV6PjcD9Nkvfg.png)
[Paper Review] SageAttention 3 & SageBwd — FP4-Powered Inference and 8-bit Training. Paper link: https://arxiv.org/abs/2505.11594v1 📝 …
![[Paper Review] Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding](https://www.storagereview.com/wp-content/uploads/2025/07/image2-2-png-e1752234784623.webp)
Paper Link: Helix Parallelism: Breaking the Latency-Throughput Wall of Ultra-Long LLM Decoding. TL;DR: Helix Parallelism schedules Attention …