EleutherAI
· Open Source
Early Indicators of Reward Hacking via Reasoning Interpolation
Using importance sampling with fine-tuned donor prefills to predict reward hacking emergence during training