[Paper Review] Inference-Time Hyper-Scaling with KV Cache Compression
Link to Paper Dynamic Memory Sparsification (DMS): Making LLM Hyper-Scaling a Reality with 8× KV Cache Compression One-Line Summary (TL;DR) DMS, combining a …
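The teaser is cut off, but the headline mechanism is compressing the KV cache roughly 8× by evicting cached entries. As a rough orientation, here is a minimal, generic sketch of score-based KV eviction; the top-k scoring rule and the fixed budget are illustrative assumptions of mine, not DMS's learned eviction mechanism.

```python
import torch

def evict_kv(k_cache, v_cache, scores, compression=8):
    """Keep only the top 1/compression fraction of cached entries by an
    importance score. Purely illustrative: DMS learns its eviction
    decisions in a short retrofitting phase rather than applying a fixed
    heuristic like this."""
    n = k_cache.shape[0]
    budget = max(1, n // compression)                        # e.g. 8x smaller
    keep = torch.topk(scores, budget).indices.sort().values  # keep token order
    return k_cache[keep], v_cache[keep]

# Toy usage with stand-in importance scores (e.g. accumulated attention mass).
k, v = torch.randn(64, 16), torch.randn(64, 16)
k_small, v_small = evict_kv(k, v, torch.rand(64))
print(k_small.shape)  # torch.Size([8, 16])
```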
All posts under the category "With-Gpt"
Paper Link Hydragen: The Secret Weapon for Decoding Large Batches with Shared Prefixes up to 32× Faster TL;DR By decomposing the prefix and suffix using softmax …
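The "decomposing the prefix and suffix using softmax" in the teaser refers to an exact identity: attention over a concatenated KV cache equals a softmax-weighted combination of attention computed separately over the shared prefix and over each sequence's suffix, with weights given by the two log-sum-exp terms. A minimal single-query sketch (shapes and names are mine, not the paper's); the shared-prefix pass can then be batched across sequences, which is where the speedup comes from.

```python
import torch

def attn_with_lse(q, k, v):
    # q: (d,), k and v: (n, d). Returns attention output and log-sum-exp.
    scores = k @ q / q.shape[-1] ** 0.5           # (n,)
    return torch.softmax(scores, -1) @ v, torch.logsumexp(scores, -1)

def combined_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
    # Rebuild attention over the concatenated cache from the two partial
    # results: softmax over the two LSEs gives the exact mixing weights.
    o1, l1 = attn_with_lse(q, k_prefix, v_prefix)
    o2, l2 = attn_with_lse(q, k_suffix, v_suffix)
    w = torch.softmax(torch.stack([l1, l2]), dim=0)
    return w[0] * o1 + w[1] * o2

# Sanity check: matches attention over the full concatenated KV cache.
q = torch.randn(8)
kp, vp = torch.randn(16, 8), torch.randn(16, 8)   # shared prefix
ks, vs = torch.randn(4, 8), torch.randn(4, 8)     # per-sequence suffix
full, _ = attn_with_lse(q, torch.cat([kp, ks]), torch.cat([vp, vs]))
assert torch.allclose(combined_attention(q, kp, vp, ks, vs), full, atol=1e-5)
```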
Paper Link Kimi K2: An Open-Source LLM’s Leap Toward Agentic Intelligence TL;DR With a 3-stage pipeline consisting of MuonClip pretraining + large-scale agentic …
Paper Link Qwen 3: The Evolution of a Giant MoE Language Model with Adjustable Reasoning Depth TL;DR (in one line) Qwen 3 couples a user-controllable Thinking …
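For the "user-controllable Thinking mode", Qwen3's Hugging Face chat template exposes an enable_thinking switch; a minimal sketch below. I am going from memory of the Qwen3 model card here, so treat the flag name and default as assumptions to verify.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "Why is the sky blue?"}]

# enable_thinking toggles whether the template opens a <think> block before
# the answer; the kwarg name follows the Qwen3 model card (verify against
# your installed version).
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(prompt)
```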
Paper Link Massive Activations, Hidden Biases: A Reinterpretation of Self-Attention’s Secrets TL;DR Just 4–10 extreme scalar values (×10,000) out of tens of …
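Since the claim is that only a handful of scalar entries sit roughly four orders of magnitude above the rest, they are easy to surface by comparing each activation's magnitude to the layer's median magnitude. A small detection sketch; the 1000× cutoff and the tensor shapes are my illustrative choices, not the paper's.

```python
import torch

def find_massive_activations(hidden, ratio=1000.0):
    """Return (token, dim) indices whose |activation| exceeds `ratio` times
    the median magnitude of the layer. The 1000x threshold is an arbitrary
    illustrative choice; the paper reports outliers around 10,000x."""
    mags = hidden.abs()
    cutoff = ratio * mags.median()
    return (mags > cutoff).nonzero(as_tuple=False)

# Toy usage: plant two extreme scalars in an otherwise ordinary tensor.
h = torch.randn(128, 4096)           # (tokens, hidden_dim)
h[0, 7] = 20000.0
h[3, 42] = -15000.0
print(find_massive_activations(h))   # tensor([[ 0,  7], [ 3, 42]])
```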
Paper Link Peri-LayerNorm: A Third Option Beyond Post-LN and Pre-LN TL;DR By simply adding another LayerNorm right after the residual …
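For orientation, here is how the three placements differ on a single residual block. The teaser is truncated, so the Peri-LN variant below (Pre-LN plus an extra LayerNorm on the sublayer output before it rejoins the residual stream) is my reading of "adding another LayerNorm right after the residual" and should be checked against the paper.

```python
import torch
import torch.nn as nn

def post_ln(x, sublayer, norm):
    # Post-LN (original Transformer): normalize after the residual add.
    return norm(x + sublayer(x))

def pre_ln(x, sublayer, norm):
    # Pre-LN (GPT-style): normalize only the sublayer input.
    return x + sublayer(norm(x))

def peri_ln(x, sublayer, norm_in, norm_out):
    # Assumed Peri-LN placement: normalize both the sublayer input and its
    # output before the output is added back to the residual stream.
    return x + norm_out(sublayer(norm_in(x)))

# Toy usage on an FFN sublayer.
d = 64
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
x = torch.randn(2, 10, d)
y = peri_ln(x, ffn, nn.LayerNorm(d), nn.LayerNorm(d))
print(y.shape)  # torch.Size([2, 10, 64])
```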
[Paper Review] SageAttention 3 & SageBwd — FP4-Powered Inference and 8-bit Training Paper link: https://arxiv.org/abs/2505.11594v1 📝 TL;DR The SageAttention …
Paper Link Helix Parallelism: Breaking the Latency-Throughput Wall of Ultra-Long LLM Decoding TL;DR Helix Parallelism schedules Attention and FFN with different …