VLLM

All posts tagged 'VLLM'

2 posts in total

Sorted chronologically

[Paper Review] Marconi: Prefix Caching for the Era of Hybrid LLMs

Paper link. Marconi: Redesigning Prefix Caching for the Era of Hybrid LLMs. One-line summary (TL;DR): When serving hybrid LLMs (Attention+SSM), Marconi selectively admits only the prefixes that are likely to be reused, and …

October 8, 2025

2411.19379v3 Marconi Hybrid LLM Prefix Caching Inference Optimization FLOP-aware Scheduling SSM vLLM Serving Efficiency

[Paper Review] Llama-Nemotron: Efficient Reasoning Models

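Paper link. Hydragen: The Secret Weapon for Decoding Large Shared-Prefix Batches 32× Faster. One-line summary (TL;DR): By rescaling the softmax denominator to decompose attention into prefix and suffix parts, it runs up to 32× faster than vLLM on Code-Llama-13B …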

July 29, 2025

2505.00949v4 Hydragen Prefix Caching Shared Prefix Decoding Efficient Inference Softmax Decomposition LLM Serving Attention Optimization FlashAttention vLLM Batch Inference Matrix-Matrix GEMM TensorCore Optimization