EleutherAI · Open Source

Early Indicators of Reward Hacking via Reasoning Interpolation

Using importance sampling with fine-tuned donor prefills to predict reward hacking emergence during training