![[Paper Review] Massive Activations in Large Language Models](https://eric-mingjie.github.io/massive-activations/assets/main_teaser_final.png)
[Paper Review] Massive Activations in Large Language Models
Paper link: [arXiv:2402.17762v2](https://arxiv.org/abs/2402.17762v2)

Massive Activations, Hidden Biases: A Reinterpretation of Self-Attention's Secrets

TL;DR: Just 4–10 extreme scalar values …
20 minute read
Tags: 2402.17762v2 · Transformer · SelfAttention · BiasMechanism · RepresentationLearning · Interpretability · NeuralMechanisms · Massive Activations · Explicit Attention Bias
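The TL;DR points at the paper's core observation: a handful of scalar activations whose magnitudes dwarf everything else in the hidden state. Below is a minimal sketch of how one might surface such outliers; the model choice (`gpt2`), the prompt, and the max-to-median ratio diagnostic are illustrative assumptions, not the paper's exact measurement protocol.

```python
# A minimal sketch: look for "massive activations" by comparing the single
# largest |activation| in each layer's hidden state to the median magnitude.
# Assumptions: the model ("gpt2"), the prompt, and the ratio diagnostic are
# illustrative, not the paper's exact procedure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper also studies larger LLM families
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("Massive activations are rare but enormous.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq, hidden]
for layer, h in enumerate(outputs.hidden_states):
    flat = h.abs().flatten()
    top = flat.max().item()
    median = flat.median().item()
    print(f"layer {layer:2d}: max |act| = {top:9.2f}  "
          f"median |act| = {median:6.3f}  ratio = {top / median:8.1f}")
```

If the paper's finding holds for the model under test, a few layers should show ratios orders of magnitude above the rest, driven by a small, fixed set of feature dimensions.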