Here are all published articles, sorted by date in descending order.

[Paper Review] DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving
Paper Link DroidSpeak: Reducing Prefill Latency by 1.7–3.1× through Cross-LLM Prefix-KV Reuse TL;DR When multiple LLMs share the same …
![[Paper Review] Marconi: Prefix Caching for the Era of Hybrid LLMs](https://pbs.twimg.com/media/GdyLXO9W4AADox0.jpg)
Paper Link Marconi: Rethinking Prefix Caching for the Hybrid LLM Era TL;DR Marconi introduces a prefix-caching framework for hybrid LLM …
![[Paper Review] SGLang: Efficient Execution of Structured Language Model Programs](https://cdn.bytez.com/mobilePapers/v2/neurips/94872/images/20-0.png)
Paper Link SGLang & RadixAttention: How Execution Optimization for “LM Programs” Achieved a 6.4x Speedup TL;DR By combining …
![[Paper Review] Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality](https://icml.cc/media/PosterPDFs/ICML%202024/32613.png)
Paper Link Structured State Space Duality: Unifying SSMs and Attention with Mamba-2 for 2–8× Acceleration TL;DR Structured State-Space …
Paper Link Dynamic Memory Sparsification (DMS): Making LLM Hyper-Scaling a Reality with 8× KV Cache Compression One-Line Summary (TL;DR) …
Paper Link Hydragen: The Secret Weapon for Decoding Large Batches with Shared Prefixes up to 32× Faster TL;DR By decomposing the prefix and …