Anthropic Engineering · March 6, 2026 · Frontier Labs

Eval awareness in Claude Opus 4.6’s BrowseComp performance

Evaluating Opus 4.6 on BrowseComp, we found cases where the model recognized the test, then found and decrypted answers to it—raising questions about eval integrity in web-enabled environments.
