Inference-Time Scaling for Generalist Reward Modeling
Paper link. Inference-Time Scaling: How DeepSeek-GRM Outperformed Far Larger Models. One-line summary (TL;DR): "27B model × 32× sampling": Generative Reward Model (GRM) and k-Vote …
22 min
2504.02495v2
Reward Modeling
Generative Reward Model
LLM Evaluation
Preference Modeling
Reinforcement Learning from Human Feedback (RLHF)
DeepSeek