![[Paper Review] Hardware-Efficient Attention for Fast Decoding](https://moonlight-paper-snapshot.s3.ap-northeast-2.amazonaws.com/arxiv/hardware-efficient-attention-for-fast-decoding-1.png)
[Paper Review] Hardware-Efficient Attention for Fast Decoding
Paper link. GTA & GLA: hardware-efficient attention that breaks the 'memory-bound' regime of decoding. TL;DR: GTA (key-value tying) and GLA (latent head sharding) raise arithmetic intensity (FLOPs/byte) and shrink the per-GPU KV cache, …
33 min
2505.21487v1
Attention Optimization
Arithmetic Intensity
Inference Acceleration
GPU Memory Bottleneck
Grouped-Tied Attention (GTA)
Grouped-Latent Attention (GLA)
FlashMLA
KV-Cache Optimization
Tensor Parallelism
Long-Context Decoding
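The TL;DR's claim that tying and grouping shrink the per-GPU KV cache can be illustrated with simple arithmetic. The sketch below uses hypothetical model dimensions (not taken from the paper) to compare per-token KV-cache bytes for standard multi-head attention, grouped-query attention, and a GTA-style grouped-and-tied variant that stores a single shared K/V buffer per group.

```python
# Hypothetical model dimensions, chosen for illustration only.
n_layers = 32
n_heads = 32
head_dim = 128
bytes_per_elem = 2  # fp16 / bf16

def kv_cache_bytes_per_token(n_kv_heads, tied_kv=False):
    """Per-token KV-cache size across all layers.

    Standard attention caches two tensors per layer (K and V);
    a GTA-style tied variant stores one shared buffer instead.
    """
    tensors = 1 if tied_kv else 2
    return n_layers * n_kv_heads * head_dim * bytes_per_elem * tensors

mha = kv_cache_bytes_per_token(n_kv_heads=n_heads)          # full multi-head
gqa = kv_cache_bytes_per_token(n_kv_heads=8)                # grouped-query, 8 KV heads
gta = kv_cache_bytes_per_token(n_kv_heads=8, tied_kv=True)  # grouped + tied K/V

for name, b in [("MHA", mha), ("GQA", gqa), ("GTA-style", gta)]:
    print(f"{name}: {b / 1024:.0f} KiB per token")
```

Since decode-time attention loads the whole KV cache per generated token, every byte saved here directly raises FLOPs/byte and moves decoding away from the memory-bandwidth wall.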