arXiv cs.LG · Papers

Realistic honeypot evaluations for scheming propensity

arXiv:2605.29729v2 Announce Type: replace

Abstract: We introduce scheming honeypot evaluations, a framework for testing whether models will pursue instrumental goals if given the opportunity. Our scheming honeypot evaluations take the form of coding tasks in Google's alignment research codebases. In a real internal dep