arXiv cs.LG · Papers
Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
arXiv:2604.26360v2 Announce Type: replace
Abstract: Reinforcement learning from human feedback (RLHF) systems face a compounding alignment challenge: not only are learned reward models uncertain about unseen state-action pairs, but the human preference annotations they are trained on are themselves inconsistent, contex