All posts under tag "FlashAttention"

[Paper Review] Llama-Nemotron: Efficient Reasoning Models
23 minute read
Hydragen: The Secret Weapon for Decoding Large Batches with Shared Prefixes up to 32× Faster
TL;DR: By decomposing the prefix and suffix using softmax …
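The decomposition this TL;DR alludes to can be sketched in a few lines: attention over the shared prefix and the per-sequence suffix is computed separately, and the two partial outputs are merged exactly using their log-sum-exp (softmax normalizer) weights. Below is a minimal NumPy sketch of that idea; the function names and shapes are illustrative assumptions, not Hydragen's actual API.

```python
import numpy as np

def chunk_attn(q, K, V):
    # Attention over one chunk of keys/values; also return the chunk's
    # log-sum-exp so several chunks can be merged exactly later.
    s = K @ q / np.sqrt(len(q))          # scores over this chunk
    m = s.max()
    p = np.exp(s - m)
    z = p.sum()
    return (p / z) @ V, m + np.log(z)    # normalized output, LSE

def merged_attn(q, K_prefix, V_prefix, K_suffix, V_suffix):
    # Attend to the shared prefix and the per-sequence suffix
    # separately, then recombine with softmax (LSE) weights.
    o1, lse1 = chunk_attn(q, K_prefix, V_prefix)
    o2, lse2 = chunk_attn(q, K_suffix, V_suffix)
    m = max(lse1, lse2)
    w1, w2 = np.exp(lse1 - m), np.exp(lse2 - m)
    return (w1 * o1 + w2 * o2) / (w1 + w2)

# Sanity check: the split computation matches attention over the
# concatenated prefix + suffix.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
Kp, Vp = rng.standard_normal((128, 64)), rng.standard_normal((128, 64))
Ks, Vs = rng.standard_normal((8, 64)), rng.standard_normal((8, 64))
o_full, _ = chunk_attn(q, np.vstack([Kp, Ks]), np.vstack([Vp, Vs]))
assert np.allclose(o_full, merged_attn(q, Kp, Vp, Ks, Vs))
```

Because the prefix term depends only on the shared prefix, it can be computed once and reused across every sequence in the batch, which is where the claimed speedup comes from.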
Helix Parallelism: Breaking the Latency-Throughput Wall of Ultra-Long LLM Decoding
TL;DR: Helix Parallelism schedules Attention and FFN with different …