
Research note on negated reward hacking

This work was done as part of BlueDot's Technical AI Safety Project Sprint and should be treated as an informal report of preliminary results obtained over a couple of days. The code is available on GitHub, the negated dataset and model checkpoints are available on HuggingFace, and the hack rollouts are available at th