Anthropic Engineering · Frontier Labs
Eval awareness in Claude Opus 4.6’s BrowseComp performance
Evaluating Opus 4.6 on BrowseComp, we found cases where the model recognized the test, then found and decrypted answers to it—raising questions about eval integrity in web-enabled environments.