arXiv cs.LG June 29, 2026 · Papers

Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

arXiv:2604.26360v2
Announce Type: replace

Abstract: Reinforcement learning from human feedback (RLHF) systems face a compounding alignment challenge: not only are learned reward models uncertain about unseen state-action pairs, but the human preference annotations they are trained on are themselves inconsistent, contex
