Inference Speedup

'Inference Speedup' 태그의 모든 글

총 1개의 글

시간순 정렬

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

논문 링크 Native Sparse Attention (NSA) — 64 k 토큰도 11× 빠르게, 정확도는 그대로 한 줄 요약 (TL;DR) NSA는 ‘압축 → 선택 → 슬라이딩’ 3 분기 희소 어텐션과 GQA/MQA-친화 커널을 결합해 64 k 컨 …

2025년 07월 07일

2502.11089v2 Sparse Attention Long Context Transformer Optimization Efficient LLM GPU Acceleration FlashAttention Memory Efficiency Inference Speedup Trainable Sparsity Triton Kernel Deep Learning Language Models DeepSeek