류재훈

Hello! This is a blog where I record my thoughts on technology and everyday life.

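I'm an engineer developing deep learning compilers at Rebellions. I started this blog to study the compiler techniques and optimizations that let AI models, and more recently LLMs, run more efficiently on hardware. I also plan to post occasionally about my hobbies and investing.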

GitHub LinkedIn Email

Recent Posts

[Paper Review] Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin

Paper link. A single narrative built by massive activations: bridging Attention Sink and Compression Valley. TL;DR: Massive activations in the residual stream (especially BOS) …

October 23, 2025

34 min

2510.06477v1 Massive Activations Attention Sink Compression Valley Residual Stream Representation Geometry Anisotropy Information Bottleneck Mix–Compress–Refine MLP Ablation Layerwise Analysis LLaMA3 Qwen2 Pythia LogitLens TunedLens LLM Internals Activation Dynamics Paper Review

[Paper Review] Hardware-Efficient Attention for Fast Decoding

Paper link. GTA & GLA: hardware-efficient attention that breaks the "memory-dominated" regime of decoding. TL;DR: GTA (key-value tying) and GLA (latent head sharding) raise FLOPs/byte and shrink the per-GPU KV cache, …

October 23, 2025

33 min

2505.21487v1 Attention Optimization Arithmetic Intensity Inference Acceleration GPU Memory Bottleneck Grouped-Tied Attention (GTA) Grouped-Latent Attention (GLA) FlashMLA KV-Cache Optimization Tensor Parallelism Long-Context Decoding

[Paper Review] Pretraining Large Language Models with NVFP4

Paper link. Bringing 4-bit pretraining into practice with NVFP4: a 12B model up to 10T tokens, effectively on par with FP8. TL;DR: When a 12B hybrid Mamba-Transformer is pretrained on 10T tokens in NVFP4 (4-bit), the stable-phase …

October 9, 2025

37 min

[Paper Review] Towards Efficient and Practical GPU Multitasking in the Era of LLM

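Paper link. "Let's put an OS on the GPU": a proposal for a GPU multitasking OS layer for the LLM era. One-line summary (TL;DR): In LLM serving, GPU utilization often stays in the ≤10% range (field observation), and load swings by 3× on a timescale of minutes. The authors, kernel-…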

October 9, 2025

39 min

2508.08448v1 GPU-OS Multitasking CUDA Virtual Memory GPU Scheduling Multi-Tenancy Kernel-Level Sharing Memory Multiplexing Guaranteed & Preemptible Utility Scheduling Kubernetes DRA NCCL Virtualization LLM Inference System Design Resource Isolation

[Paper Review] DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving

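Paper link. DroidSpeak: cutting prefill latency by 1.7–3.1× through cross-LLM prefix-KV reuse. TL;DR (one-line summary): Between LLMs that share an architecture but differ in weights, the receiving model, given the sending model's prefix KV, contiguous-layer …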

October 8, 2025

38 min

2411.02820v4 droidspeak cross-LLM-KV-reuse (cross-llm-kv-reuse) prefix-kv / e-cache contiguous-layer-partial-recompute (contiguous-layer-recompute)

View all posts
