[Paper Review] Inference-Time Hyper-Scaling with KV Cache Compression
Link to Paper Dynamic Memory Sparsification (DMS): Making LLM Hyper-Scaling a Reality with 8× KV Cache Compression One-Line Summary (TL;DR) DMS, combining a …
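The teaser is cut off, but the headline mechanism is compressing the KV cache roughly 8× by evicting cached entries. As a rough orientation, here is a minimal, generic sketch of score-based KV eviction; the top-k scoring rule and the fixed budget are illustrative assumptions of mine, not DMS's learned eviction mechanism.

```python
import torch

def evict_kv(k_cache, v_cache, scores, compression=8):
    """Keep only the top 1/compression fraction of cached entries by an
    importance score. Purely illustrative: DMS learns its eviction
    decisions in a short retrofitting phase rather than applying a fixed
    heuristic like this."""
    n = k_cache.shape[0]
    budget = max(1, n // compression)                        # e.g. 8x smaller
    keep = torch.topk(scores, budget).indices.sort().values  # keep token order
    return k_cache[keep], v_cache[keep]

# Toy usage with stand-in importance scores (e.g. accumulated attention mass).
k, v = torch.randn(64, 16), torch.randn(64, 16)
k_small, v_small = evict_kv(k, v, torch.rand(64))
print(k_small.shape)  # torch.Size([8, 16])
```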
All posts under the category "With-Gpt"
Paper Link Hydragen: The Secret Weapon for Decoding Large Batches with Shared Prefixes up to 32× Faster TL;DR By decomposing the prefix and suffix using softmax …
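The "decomposing the prefix and suffix using softmax" in the teaser refers to an exact identity: attention over a concatenated KV cache equals a softmax-weighted combination of attention computed separately over the shared prefix and over each sequence's suffix, with weights given by the two log-sum-exp terms. A minimal single-query sketch (shapes and names are mine, not the paper's); the shared-prefix pass can then be batched across sequences, which is where the speedup comes from.

```python
import torch

def attn_with_lse(q, k, v):
    # q: (d,), k and v: (n, d). Returns attention output and log-sum-exp.
    scores = k @ q / q.shape[-1] ** 0.5           # (n,)
    return torch.softmax(scores, -1) @ v, torch.logsumexp(scores, -1)

def combined_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
    # Rebuild attention over the concatenated cache from the two partial
    # results: softmax over the two LSEs gives the exact mixing weights.
    o1, l1 = attn_with_lse(q, k_prefix, v_prefix)
    o2, l2 = attn_with_lse(q, k_suffix, v_suffix)
    w = torch.softmax(torch.stack([l1, l2]), dim=0)
    return w[0] * o1 + w[1] * o2

# Sanity check: matches attention over the full concatenated KV cache.
q = torch.randn(8)
kp, vp = torch.randn(16, 8), torch.randn(16, 8)   # shared prefix
ks, vs = torch.randn(4, 8), torch.randn(4, 8)     # per-sequence suffix
full, _ = attn_with_lse(q, torch.cat([kp, ks]), torch.cat([vp, vs]))
assert torch.allclose(combined_attention(q, kp, vp, ks, vs), full, atol=1e-5)
```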
Paper Link Kimi K2: An Open-Source LLM’s Leap Toward Agentic Intelligence TL;DR With a 3-stage pipeline consisting of MuonClip pretraining + large-scale agentic …
Paper Link Qwen 3: The Evolution of a Giant MoE Language Model with Adjustable Reasoning Depth TL;DR (in one line) Qwen 3 couples a user-controllable Thinking …
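For the "user-controllable Thinking mode", Qwen3's Hugging Face chat template exposes an enable_thinking switch; a minimal sketch below. I am going from memory of the Qwen3 model card here, so treat the flag name and default as assumptions to verify.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "Why is the sky blue?"}]

# enable_thinking toggles whether the template opens a <think> block before
# the answer; the kwarg name follows the Qwen3 model card (verify against
# your installed version).
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(prompt)
```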
Paper Link Massive Activations, Hidden Biases: A Reinterpretation of Self-Attention’s Secrets TL;DR Just 4–10 extreme scalar values (×10,000) out of tens of …
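Since the claim is that only a handful of scalar entries sit roughly four orders of magnitude above the rest, they are easy to surface by comparing each activation's magnitude to the layer's median magnitude. A small detection sketch; the 1000× cutoff and the tensor shapes are my illustrative choices, not the paper's.

```python
import torch

def find_massive_activations(hidden, ratio=1000.0):
    """Return (token, dim) indices whose |activation| exceeds `ratio` times
    the median magnitude of the layer. The 1000x threshold is an arbitrary
    illustrative choice; the paper reports outliers around 10,000x."""
    mags = hidden.abs()
    cutoff = ratio * mags.median()
    return (mags > cutoff).nonzero(as_tuple=False)

# Toy usage: plant two extreme scalars in an otherwise ordinary tensor.
h = torch.randn(128, 4096)           # (tokens, hidden_dim)
h[0, 7] = 20000.0
h[3, 42] = -15000.0
print(find_massive_activations(h))   # tensor([[ 0,  7], [ 3, 42]])
```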
Paper Link Peri-LayerNorm: A Third Option Beyond Post-LN and Pre-LN TL;DR By simply adding another LayerNorm right after the residual …
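For orientation, here is how the three placements differ on a single residual block. The teaser is truncated, so the Peri-LN variant below (Pre-LN plus an extra LayerNorm on the sublayer output before it rejoins the residual stream) is my reading of "adding another LayerNorm right after the residual" and should be checked against the paper.

```python
import torch
import torch.nn as nn

def post_ln(x, sublayer, norm):
    # Post-LN (original Transformer): normalize after the residual add.
    return norm(x + sublayer(x))

def pre_ln(x, sublayer, norm):
    # Pre-LN (GPT-style): normalize only the sublayer input.
    return x + sublayer(norm(x))

def peri_ln(x, sublayer, norm_in, norm_out):
    # Assumed Peri-LN placement: normalize both the sublayer input and its
    # output before the output is added back to the residual stream.
    return x + norm_out(sublayer(norm_in(x)))

# Toy usage on an FFN sublayer.
d = 64
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
x = torch.randn(2, 10, d)
y = peri_ln(x, ffn, nn.LayerNorm(d), nn.LayerNorm(d))
print(y.shape)  # torch.Size([2, 10, 64])
```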
[Paper Review] SageAttention 3 & SageBwd — FP4-Powered Inference and 8-bit Training Paper link: https://arxiv.org/abs/2505.11594v1 📝 TL;DR The SageAttention …
Paper Link Helix Parallelism: Breaking the Latency-Throughput Wall of Ultra-Long LLM Decoding TL;DR Helix Parallelism schedules Attention and FFN with different …