arXiv cs.LG
Realistic honeypot evaluations for scheming propensity
arXiv:2605.29729v2 Announce Type: replace
Abstract: We introduce scheming honeypot evaluations, a framework for testing whether models will pursue instrumental goals if given the opportunity. Our scheming honeypot evaluations take the form of coding tasks in Google's alignment research codebases. In a real internal dep