All posts under tag "FlashAttention"

[Paper Review] Llama-Nemotron: Efficient Reasoning Models
23 minute read
Hydragen: The Secret Weapon for Decoding Large Batches with Shared Prefixes up to 32× Faster
TL;DR: By decomposing the prefix and suffix using softmax …
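The decomposition this TL;DR alludes to can be sketched in a few lines: attention over the shared prefix and the per-sequence suffix is computed separately, and the two partial outputs are merged exactly using their log-sum-exp (softmax normalizer) weights. Below is a minimal NumPy sketch of that idea; the function names and shapes are illustrative assumptions, not Hydragen's actual API.

```python
import numpy as np

def chunk_attn(q, K, V):
    # Attention over one chunk of keys/values; also return the chunk's
    # log-sum-exp so several chunks can be merged exactly later.
    s = K @ q / np.sqrt(len(q))          # scores over this chunk
    m = s.max()
    p = np.exp(s - m)
    z = p.sum()
    return (p / z) @ V, m + np.log(z)    # normalized output, LSE

def merged_attn(q, K_prefix, V_prefix, K_suffix, V_suffix):
    # Attend to the shared prefix and the per-sequence suffix
    # separately, then recombine with softmax (LSE) weights.
    o1, lse1 = chunk_attn(q, K_prefix, V_prefix)
    o2, lse2 = chunk_attn(q, K_suffix, V_suffix)
    m = max(lse1, lse2)
    w1, w2 = np.exp(lse1 - m), np.exp(lse2 - m)
    return (w1 * o1 + w2 * o2) / (w1 + w2)

# Sanity check: the split computation matches attention over the
# concatenated prefix + suffix.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
Kp, Vp = rng.standard_normal((128, 64)), rng.standard_normal((128, 64))
Ks, Vs = rng.standard_normal((8, 64)), rng.standard_normal((8, 64))
o_full, _ = chunk_attn(q, np.vstack([Kp, Ks]), np.vstack([Vp, Vs]))
assert np.allclose(o_full, merged_attn(q, Kp, Vp, Ks, Vs))
```

Because the prefix term depends only on the shared prefix, it can be computed once and reused across every sequence in the batch, which is where the claimed speedup comes from.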
Helix Parallelism: Breaking the Latency-Throughput Wall of Ultra-Long LLM Decoding
TL;DR: Helix Parallelism schedules Attention and FFN with different …