[Paper Review] Llama-Nemotron: Efficient Reasoning Models 07-29 paper-review, with-gpt, efficient-llm, system-optimization, inference-acceleration 26 min
[Paper Review] KIMI K2: OPEN AGENTIC INTELLIGENCE 07-26 paper-review, with-gpt, open-source, agentic-intelligence, RL-alignment, foundation-models 28 min
[Paper Review] Peri-LN: Revisiting Normalization Layer in the Transformer Architecture 07-09 paper-review, with-gpt 24 min
[Paper Review] SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-bit Training 07-09 paper-review, with-gpt 26 min
[Paper Review] Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding 07-08 paper-review, with-gpt 31 min
DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition 07-08 paper-review, with-gpt 23 min
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures 07-08 paper-review, with-gpt 26 min
Code I/O: Condensing Reasoning Patterns via Code Input-Output Prediction 07-07 paper-review, with-gpt 31 min
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention 07-07 paper-review, with-gpt 31 min
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning 07-06 paper-review, with-gpt, DeepSeek 30 min
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling 07-06 paper-review, with-gpt, Janus, DeepSeek 31 min
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding 07-05 paper-review, with-gpt 31 min
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation 07-02 paper-review, with-gpt 30 min
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation 07-02 paper-review, with-gpt 26 min
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search 07-01 paper-review, with-gpt 29 min
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning 07-01 paper-review, with-gpt 27 min
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence 06-30 paper-review, with-gpt 30 min
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence 06-30 paper-review, with-gpt 28 min
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models 06-30 paper-review, with-gpt 26 min
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism 06-29 paper-review, with-gpt, DeepSeek 25 min
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models 06-29 paper-review, with-gpt 36 min
DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion Prior 06-29 paper-review, with-gpt, 3D, Diffusion 28 min
Accelerated Test-Time Scaling with Model-Free Speculative Sampling 06-26 paper-review, with-gpt-o3 27 min
KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction 06-26 paper-review, with-gpt-o3 24 min
Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers 06-24 paper-review, with-gpt-o3 26 min
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities 06-23 paper-review, with-gpt-o3 37 min
Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching 06-19 paper-review, with-gemini-2.5-pro(preview) 39 min
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention 06-19 paper-review, with-gemini-2.5-pro(preview) 38 min
MMInference: Accelerating Pre-filling for Long-Context Visual Language Models via Modality-Aware Permutation Sparse Attention 06-19 paper-review, with-gemini-2.5-pro(preview) 42 min
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters 06-19 paper-review, with-gemini-2.5-pro(preview) 41 min
Slim attention: cut your context memory in half without loss – K-cache is all you need for MHA 06-16 paper-review, with-gemini-2.5-pro(preview) 30 min
Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs 06-16 paper-review, with-gemini-2.5-pro(preview) 35 min
TransMLA: Multi-Head Latent Attention Is All You Need 06-16 paper-review, with-gemini-2.5-pro(preview) 33 min
X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression 06-16 paper-review, with-gemini-2.5-pro(preview) 32 min
A Bring-Your-Own-Model Approach for ML-Driven Storage Placement in Warehouse-Scale Computers 06-10 paper-review, with-gemini-2.5-pro(preview), MLSYS2025 32 min
Know Where You're Uncertain When Planning with Multimodal Foundation Models: A Formal Framework 06-10 paper-review, with-gemini-2.5-pro(preview), MLSYS2025 35 min
ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation 06-10 paper-review, with-gemini-2.5-pro(preview), MLSYS2025 30 min
Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling 06-10 paper-review, with-gemini-2.5-pro(preview), MLSYS2025 27 min
Supply-Chain Attacks in Machine Learning Frameworks 06-10 paper-review, with-gemini-2.5-pro(preview), MLSYS2025 27 min
Accelerating MoE Model Inference with Expert Sharding 06-05 paper-review, with-gemini-2.5-pro(preview) 34 min
FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference 06-05 paper-review, with-gemini-2.5-pro(preview), MLSYS2025 35 min
ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation 06-05 paper-review, with-gemini-2.5-pro(preview), MLSYS2025 41 min
SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling 06-05 paper-review, with-gemini-2.5-pro(preview), MLSYS2025 38 min
XGRAMMAR: FLEXIBLE AND EFFICIENT STRUCTURED GENERATION ENGINE FOR LARGE LANGUAGE MODELS 06-02 paper-review, with-gemini-2.5-pro(preview), MLSYS2025 28 min
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures 05-17 paper-review, with-gemini-2.5-pro(preview) 69 min
RODIMUS*: BREAKING THE ACCURACY-EFFICIENCY TRADE-OFF WITH EFFICIENT ATTENTIONS 05-17 paper-review, with-gemini-2.5-pro(preview) 46 min
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model 04-16 paper-review, with-gpt 15 min
Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts 04-14 paper-review, with-gpt 23 min
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism 04-14 paper-review, with-gpt 21 min
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention 04-14 paper-review, with-gpt 23 min
Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching 04-13 paper-review, with-gpt 22 min
FLEX ATTENTION: A PROGRAMMING MODEL FOR GENERATING OPTIMIZED ATTENTION KERNELS 04-07 paper-review, with-gpt 28 min
LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers 04-07 paper-review, with-gpt 26 min
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism 04-07 paper-review, with-gpt 24 min
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds 04-02 paper-review, with-gpt 15 min
SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations 04-02 paper-review, with-gpt 15 min
Context Parallelism for Scalable Million-Token Inference 03-31 paper-review, with-gpt, MLSYS2025 18 min
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference 03-31 paper-review, with-gpt, MLSYS2025 25 min
PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training 03-25 paper-review, with-gpt 25 min
SELF-DATA DISTILLATION FOR RECOVERING QUALITY IN PRUNED LARGE LANGUAGE MODELS 03-25 paper-review, with-gpt 21 min
On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions 03-24 paper-review, with-gpt 22 min
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention 03-24 paper-review, with-gpt 12 min
TRAINING ULTRA LONG CONTEXT LANGUAGE MODEL WITH FULLY PIPELINED DISTRIBUTED TRANSFORMER 03-24 paper-review, with-gpt 19 min
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving 03-18 paper-review, with-gpt 29 min
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution 03-17 paper-review, with-gpt, MLSYS2025 19 min
DIFFSERVE: EFFICIENTLY SERVING TEXT-TO-IMAGE DIFFUSION MODELS WITH QUERY-AWARE MODEL SCALING 03-17 paper-review, with-gpt, MLSYS2025 31 min
EFFICIENT LLM INFERENCE USING DYNAMIC INPUT PRUNING AND CACHE-AWARE MASKING 03-12 paper-review, with-gpt, MLSYS2025 37 min
LAVA: LIFETIME-AWARE VM ALLOCATION WITH LEARNED DISTRIBUTIONS AND ADAPTATION TO MISPREDICTIONS 03-11 paper-review, with-gpt, MLSYS2025 39 min
TurboAttention: Efficient Attention Approximation for High Throughputs LLMs 03-11 paper-review, with-gpt, MLSYS2025 35 min
A PRACTICAL CROSS-LAYER APPROACH FOR ML-DRIVEN STORAGE PLACEMENT IN WAREHOUSE-SCALE COMPUTERS 03-10 paper-review, with-gpt, MLSYS2025 36 min
Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts 03-10 paper-review, with-gpt, MLSYS2025 35 min
Scaling Deep Learning Training with MPMD Pipeline Parallelism 03-10 paper-review, with-gpt, MLSYS2025 28 min
LSERVE: EFFICIENT LONG-SEQUENCE LLM SERVING WITH UNIFIED SPARSE ATTENTION 03-06 paper-review, with-gpt, MLSYS2025 29 min
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments 03-06 paper-review, with-gpt, MLSYS2025 24 min
VOLUT: EFFICIENT VOLUMETRIC STREAMING ENHANCED BY LUT-BASED SUPER-RESOLUTION 03-06 paper-review, with-gpt, MLSYS2025 27 min
Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences 03-04 paper-review, with-gpt 19 min
HEXGEN-2: DISAGGREGATED GENERATIVE INFERENCE OF LLMS IN HETEROGENEOUS ENVIRONMENT 02-25 paper-review, with-gpt, ICLR2025 39 min
Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding 02-25 paper-review, with-gpt, ICLR2025 38 min
You Only Prune Once: DESIGNING CALIBRATION-FREE MODEL COMPRESSION WITH POLICY LEARNING 02-25 paper-review, with-gpt, ICLR2025 36 min
FlashMask: Efficient and Rich Mask Extension of FlashAttention 02-24 paper-review, with-gpt, ICLR2025 37 min
TypedThinker: Typed Thinking Improves Large Language Model Reasoning 02-24 paper-review, with-gpt, ICLR2025 41 min
LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid 02-17 paper-review, with-gpt 26 min
Robust and Secure Code Watermarking for Large Language Models via ML/Crypto Codesign 02-13 paper-review, with-gpt 35 min
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model 02-13 paper-review, with-gpt 38 min
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning 02-12 paper-review, with-gpt 36 min
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding 02-12 paper-review, with-gpt 39 min
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence 02-11 paper-review, with-gpt 39 min
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation 02-11 paper-review, with-gpt 37 min
Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution 02-11 paper-review, with-gpt, Qwen 41 min
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search 02-10 paper-review, with-gpt, DeepSeek 43 min
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models 02-10 paper-review, with-gpt 37 min
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data 02-09 paper-review, with-gpt, DeepSeek 39 min
DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence 02-07 paper-review, with-gpt 37 min
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models 02-07 paper-review, with-gpt 38 min
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models 02-05 paper-review, with-gpt 39 min
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models 02-05 paper-review, with-gpt 38 min
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond 02-04 paper-review, with-gpt 34 min
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation 02-03 paper-review, with-gpt 40 min
DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs 01-20 paper-review, with-gpt 23 min
TAIPAN: EFFICIENT AND EXPRESSIVE STATE SPACE LANGUAGE MODELS WITH SELECTIVE ATTENTION 01-20 paper-review, with-gpt 34 min
TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication 01-20 paper-review, with-gpt 21 min
SANA: EFFICIENT HIGH-RESOLUTION IMAGE SYNTHESIS WITH LINEAR DIFFUSION TRANSFORMERS 01-15 paper-review, with-gpt 25 min
Block Transformer: Global-to-Local Language Modeling for Fast Inference 01-15 paper-review, with-gpt 29 min
Distributed Inference and Fine-tuning of Large Language Models Over The Internet 01-02 paper-review, with-gpt 22 min
EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism 01-02 paper-review, with-gpt 21 min
Gated Linear Attention Transformers with Hardware-Efficient Training 01-02 paper-review, with-gpt 28 min
LLM in a flash: Efficient Large Language Model Inference with Limited Memory 01-02 paper-review, with-gpt 24 min
Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge 12-31 paper-review, with-gpt 23 min
Improving alignment of dialogue agents via targeted human judgements 12-31 paper-review, with-gpt 26 min
SCCA: Shifted Cross Chunk Attention for long contextual semantic expansion 12-30 paper-review, with-gpt 22 min
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection 12-26 paper-review, with-gpt 22 min
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context 12-26 paper-review, with-gpt 28 min
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding 12-26 paper-review, with-gpt 22 min
The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction 12-26 paper-review, with-gpt 25 min
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving 12-24 paper-review, with-gpt 24 min
SAGEATTENTION: ACCURATE 8-BIT ATTENTION FOR PLUG-AND-PLAY INFERENCE ACCELERATION 12-24 paper-review, with-gpt 24 min
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality 12-24 paper-review, with-gpt 34 min
FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs 12-23 paper-review, with-gpt 22 min
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression 12-20 paper-review, with-gpt 25 min
Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation 12-20 paper-review, with-gpt 24 min
Large Concept Models: Language Modeling in a Sentence Representation Space 12-20 paper-review, with-gpt 26 min
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration 12-20 paper-review, with-gpt 25 min
SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference 12-20 paper-review, with-gpt 21 min
Efficient Memory Management for Large Language Model Serving with PagedAttention 12-19 paper-review, with-gpt 28 min
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding 12-19 paper-review, with-gpt 30 min
GSPMD: General and Scalable Parallelization for ML Computation Graphs 12-19 paper-review, with-gpt 24 min
Orca: Progressive Learning from Complex Explanation Traces of GPT-4 12-19 paper-review, with-gpt 29 min
Fast Inference of Mixture-of-Experts Language Models with Offloading 12-18 paper-review, with-gpt 25 min
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning 12-18 paper-review, with-gpt 28 min
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision 12-18 paper-review, with-gpt 26 min
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness 12-18 paper-review, with-gpt 26 min
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache 12-18 paper-review, with-gpt 25 min
SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads 12-18 paper-review, with-gpt 9 min
FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference 12-17 paper-review, with-gpt 23 min
FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs 12-17 paper-review, with-gpt 26 min
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models 12-17 paper-review, with-gpt 21 min
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts 12-17 paper-review, with-gpt 26 min
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference 12-16 paper-review, with-gpt 29 min
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving 12-16 paper-review, with-gpt 36 min
Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads 12-16 paper-review, with-gpt 28 min
INFERFLOW: AN EFFICIENT AND HIGHLY CONFIGURABLE INFERENCE ENGINE FOR LARGE LANGUAGE MODELS 12-16 paper-review, with-gpt 30 min
MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads 12-16 paper-review, with-gpt 23 min
Break the Sequential Dependency of LLM Inference Using LOOKAHEAD DECODING 12-15 paper-review, with-gpt 33 min
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design 12-15 paper-review, with-gpt 50 min
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference 12-14 paper-review, with-gpt 17 min
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference 12-14 paper-review, with-gpt 40 min
RelayAttention for Efficient Large Language Model Serving with Long System Prompts 12-14 paper-review, with-gpt 52 min
TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models 12-14 paper-review, with-gpt 19 min
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers 12-13 paper-review, with-gpt 14 min
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints 12-13 paper-review, with-gpt 17 min
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models 12-13 paper-review, with-gpt 17 min
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time 12-13 paper-review, with-gpt 16 min
FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning 12-12 paper-review, with-gpt 15 min
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models 12-12 paper-review, with-gpt 19 min
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs 12-12 paper-review, with-gpt 13 min
No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization 12-12 paper-review, with-gpt 14 min
Benchmarks as Limits to Arbitrage: Understanding the Low-Volatility Anomaly 12-10 paper-review, with-gpt, finance 12 min
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model 12-10 paper-review, with-gpt, LLM 18 min
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving 12-10 paper-review, with-gpt, LLM-Inference 21 min
CHAI: Clustered Head Attention for Efficient LLM Inference 12-10 paper-review, with-gpt, LLM-Inference 19 min
Compressed Context Memory For Online Language Model Interaction 12-10 paper-review, with-gpt, LLM-Inference 14 min
High Idiosyncratic Volatility and Low Returns: International and Further U.S. Evidence 12-10 paper-review, with-gpt, finance 12 min
LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression 12-10 paper-review, with-gpt, LLM-Inference 20 min
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference 12-09 paper-review, with-gpt, LLM-Inference 19 min
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache 12-09 paper-review, with-gpt, LLM-Inference 14 min
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization 12-09 paper-review, with-gpt, LLM-Inference 14 min
WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More 12-09 paper-review, with-gpt, LLM-Inference 16 min
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU 12-08 paper-review, with-gpt, LLM-Inference 13 min
ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching 12-06 paper-review, with-gpt, LLM-Inference 16 min
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 12-06 paper-review, with-gpt, LLM 19 min
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference 12-06 paper-review, with-gpt, Dynamic Memory Compression, LLM-Inference 14 min
Fast Inference from Transformers via Speculative Decoding 12-06 paper-review, with-gpt, LLM-Inference, Speculative Decoding 21 min
FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines 12-06 paper-review, with-gpt, FASTDECODE, LLM-Inference 15 min
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression 12-06 paper-review, with-gpt, LLMLingua-2, LLM-Inference 15 min
ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching 12-05 paper-review, with-gpt 15 min
Direct Preference Optimization: Your Language Model is Secretly a Reward Model 12-05 paper-review, with-gpt 13 min
HIERARCHICAL CONTEXT MERGING: BETTER LONG CONTEXT UNDERSTANDING FOR PRE-TRAINED LLMS 12-05 paper-review, with-gpt 12 min
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression 12-05 paper-review, with-gpt 16 min
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning 12-05 paper-review, with-gpt 22 min
Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs 12-05 paper-review, with-gpt 19 min
Tree of Thoughts: Deliberate Problem Solving with Large Language Models 12-05 paper-review, with-gpt 22 min
CORM: Cache Optimization with Recent Message for Large Language Model Inference 12-04 paper-review, with-gpt 17 min
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention 12-04 paper-review, with-gpt 21 min
SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget 12-04 paper-review, with-gpt 24 min
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model 12-03 paper-review, with-gpt 18 min
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration 12-03 paper-review, with-gpt 22 min
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation 12-03 paper-review, with-gpt 16 min
Layer-Condensed KV Cache for Efficient Inference of Large Language Models 12-02 paper-review, with-gpt 16 min
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference 12-02 paper-review, with-gpt 16 min
SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models 12-02 paper-review, with-gpt 22 min
You Only Cache Once: Decoder-Decoder Architectures for Language Models 12-02 paper-review, with-gpt 16 min
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models 11-29 paper-review, with-gpt 19 min
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling 11-29 paper-review, with-gpt 17 min
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention 11-29 paper-review, with-gpt 21 min
A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression 11-28 paper-review, with-gpt 14 min
Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters 11-28 paper-review, with-gpt 18 min
CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling 11-28 paper-review, with-gpt 14 min
MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding 11-28 paper-review, with-gpt 15 min
Dynamic Discriminative Operations (D2O) for Efficient Generative Inference of Large Language Models 11-27 paper-review, with-gpt 14 min
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference 11-27 paper-review, with-gpt 16 min
MODEL TELLS YOU WHERE TO MERGE: ADAPTIVE KV CACHE MERGING FOR LLMS ON LONG-CONTEXT TASKS 11-27 paper-review, with-gpt 18 min
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference 11-26 paper-review, with-gpt 21 min
Keep the Cost Down: A Review on Methods to Optimize LLM’s KV Cache Consumption 11-26 paper-review, with-gpt 6 min
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference 11-26 paper-review, with-gpt 21 min
PQCache: Product Quantization-based KVCache for Long Context LLM Inference 11-26 paper-review, with-gpt 20 min
NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time 11-25 paper-review, with-gpt 21 min
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval 11-25 paper-review, with-gpt 18 min
Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction 11-21 paper-review, with-gpt 25 min
KV-COMPRESS: Paged KV-Cache Compression with Variable Compression Rates per Attention Head 11-21 paper-review, with-gpt 24 min
Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads 11-21 paper-review, with-gpt 22 min
TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning 11-21 paper-review, with-gpt 23 min
DUOATTENTION: EFFICIENT LONG-CONTEXT LLM INFERENCE WITH RETRIEVAL AND STREAMING HEADS 11-20 paper-review, with-gpt 20 min
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy 11-20 paper-review, with-gpt 15 min
SPARSEVLM: VISUAL TOKEN SPARSIFICATION FOR EFFICIENT VISION-LANGUAGE MODEL INFERENCE 11-20 paper-review, with-gpt 19 min
SWIFTKV: FAST PREFILL-OPTIMIZED INFERENCE WITH KNOWLEDGE-PRESERVING MODEL TRANSFORMATION 11-20 paper-review, with-gpt 19 min
TIDALDECODE: FAST AND ACCURATE LLM DECODING WITH POSITION PERSISTENT SPARSE ATTENTION 11-20 paper-review, with-gpt 16 min
TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training 11-20 paper-review, with-gpt 16 min
A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference 11-19 paper-review, with-gpt 19 min
Recycled Attention: Efficient inference for long-context language models 11-18 paper-review, with-gpt 22 min
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection 11-18 paper-review, with-gpt 34 min
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration 11-18 paper-review, with-gpt 29 min
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models 11-14 paper-review, with-gpt 15 min
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models 11-14 paper-review, with-gpt 22 min
HART Efficient Visual Generation with Hybrid Autoregressive Transformer 11-14 paper-review, with-gpt 21 min
Learning Transferable Visual Models From Natural Language Supervision 11-14 paper-review, with-gpt 26 min
The CoT Collection Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning 11-14 paper-review, with-gpt 22 min
DistriFusion Distributed Parallel Inference for High-Resolution Diffusion Models 11-13 paper-review, with-gpt 32 min
FastComposer Tuning-Free Multi-Subject Image Generation with Localized Attention 11-13 paper-review, with-gpt 25 min
VILA-U a Unified Foundation Model Integrating Visual Understanding and Generation 11-13 paper-review, with-gpt 23 min
Batch Calibration Rethinking Calibration for In-Context Learning and Prompt Engineering 11-12 paper-review, with-gpt 19 min
LiteMoE Customizing On-device LLM Serving via Proxy Submodel Tuning 11-12 paper-review, with-gpt 37 min
ShadowKV KV Cache in Shadows for High-Throughput Long-Context LLM Inference 11-12 paper-review, with-gpt 18 min
EPIC Efficient Position-Independent Context Caching for Serving Large Language Models 11-11 paper-review, with-gpt 11 min
RAG4ITOps A Supervised Fine-Tunable and Comprehensive RAG Framework for IT Operations and Maintenance 11-11 paper-review, with-gpt 15 min
Scientific Beta Multi-Beta Multi-Strategy Indices Implementing Multi-Factor Equity Portfolios with Smart Factor Indices 11-11 paper-review, with-gpt, finance 13 min
ALPINE Unveiling the Planning Capability of Autoregressive Learning in Language Models 11-10 paper-review, with-gpt 10 min
Capital asset prices A theory of market equilibrium under conditions of risk 11-10 paper-review, with-gpt, finance 10 min
DynamoLLM Designing LLM Inference Clusters for Performance and Energy Efficiency 11-10 paper-review, with-gpt 12 min
HYSYNTH Context-Free LLM Approximation for Guiding Program Synthesis 11-10 paper-review, with-gpt 11 min
MInference 1.0 Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention 11-10 paper-review, with-gpt 21 min
Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning 11-07 paper-review, with-gpt 3 min
Meta Large Language Model Compiler Foundation Models of Compiler Optimization 11-07 paper-review, with-gpt 4 min
Model Tells You What to Discard Adaptive KV Cache Compression for LLMs 11-07 paper-review, with-gpt 7 min
BUZZ Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference 11-06 paper-review, with-gpt 5 min
CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs 11-06 paper-review, with-gpt 5 min
Don't Look Twice Faster Video Transformers with Run-Length Tokenization 11-06 paper-review, with-gpt 3 min
FLUX Fast Software-based Communication Overlap On GPUs Through Kernel Fusion 11-06 paper-review, with-gpt 7 min
KVSharer Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing 11-06 paper-review, with-gpt 7 min
SpotServe Serving Generative Large Language Models on Preemptible Instances 11-05 paper-review, with-gpt 13 min
GraphPipe Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism 11-04 paper-review, with-gpt 15 min
Helix Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs 11-04 paper-review, with-gpt 4 min
SpecExec Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices 11-04 paper-review, with-gpt 7 min
MEGABYTE Predicting Million-byte Sequences with Multiscale Transformers 11-03 paper-review, with-gpt 10 min
FlexGen High-Throughput Generative Inference of Large Language Models with a Single GPU 11-01 paper-review, with-gpt 17 min
KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches 11-01 paper-review, with-gpt 7 min
CacheBlend Fast Large Language Model Serving for RAG with Cached Knowledge Fusion 10-31 paper-review, with-gpt 9 min
Keyformer KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference 10-31 paper-review, with-gpt 15 min
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve 10-31 paper-review, with-gpt 19 min
Brief paper summary: End-to-End Deep Learning of Optimization Heuristics (PACT 17) 02-12 compiler, ML, paper-review 1 min
Brief paper summary: Fast and Effective Orchestration of Compiler Optimizations (Zhelong Pan, Rudolf Eigenmann; Purdue University; CGO ’06) 02-12 compiler, ML, paper-review 1 min
Brief paper summary: TVM An Automated End-to-End Optimizing Compiler for Deep Learning (OSDI 18) 02-12 compiler, ML, paper-review 1 min
Paper summary: Chameleon Adaptive Code Optimization for Expedited Deep Neural Network Compilation (ICLR 2020) 02-12 compiler, ML, paper-review 2 min
Paper summary: LLVM A Compilation Framework for Lifelong Program Analysis & Transformation (CGO 04) 02-12 compiler, paper-review 2 min
Paper summary: NeuroVectorizer End-to-End Vectorization with Deep Reinforcement Learning (CGO 20) 02-12 compiler, ML, paper-review 1 min