Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
Paper link
DeepSeek-V3: What It Means to Run a 405B-Class LLM on 2,048 H800s
TL;DR (one-line summary): Multi-Head Latent Attention (MLA) + FP8 MoE + DualPipe + two-layer MPFT …
26 min read
2505.09343v1
Large Language Models
Mixture of Experts
FP8 Training
Transformer Optimization
Memory Efficiency
Distributed Training
Inference Acceleration
DeepSeek