LessWrong AI June 24, 2026 · Communities

Reward Hacking Without Egregious Misalignment in an RL-Only Setting

This work was done as part of the MATS fellowship by Joey Yudelson and Vladimir Ivanov. It was mentored by Ryan Greenblatt. Thanks to Aghyad Deeb and Anders Woodruff for comments on this post. Thanks to Monte MacDiarmid, Evan Hubinger, Sid Black, Satvik Golechha, and Joseph Bloom for clarifying conversations.TL;DRWe tr

Read original