Paper: https://arxiv.org/abs/2407.00326
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Paper: https://arxiv.org/abs/2406.10774
What Matters in Transformers? Not All Attention is Needed
Paper: https://arxiv.org/abs/2406.15786v1
KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches
Paper: https://arxiv.org/abs/2407.01527v1
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Paper: https://arxiv.org/abs/2303.06865
Prompt Cache: Modular Attention Reuse for Low-Latency Inference
Paper: https://arxiv.org/abs/2311.04934
Better & Faster Large Language Models via Multi-token Prediction
Paper: https://arxiv.org/abs/2404.19737
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
Paper: https://arxiv.org/abs/2403.09054
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
Paper: https://arxiv.org/abs/2405.16444
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Paper: https://arxiv.org/abs/2403.02310