Impact of modelling assumptions on time horizon results
As METR’s time horizon task suite saturates, the results are becoming more sensitive to analysis choices. One example of this was the recent update to fix…
METR aims to keep the public informed about the capabilities of and risks posed by AI, by some metrics the fastest-moving technology in history,…
We reviewed two versions of Anthropic’s Sabotage Risk Report for Claude Opus 4.6, producing two corresponding review documents: our review of the February 11 version and…
Summary: We find that roughly half of test-passing SWE-bench Verified PRs written by mid-2024 to mid/late-2025 agents would…
Summary: Opus 4.6 can, with a simple agent scaffold, create mostly-playable but somewhat broken CLI versions of Slay the Spire and Balatro. Last weekend I…
METR previously published a paper which found the use of AI tools caused a 20% slowdown in completing tasks among experienced open-source developers, using data from…
Evidence-based AI policy is important but hard. We need more in-depth studies – which…
Human uplift studies like the one we did in 2025 are becoming more expensive as working without AI becomes increasingly costly. In this post, I…
Most of METR’s time horizon measurements are done using two scaffolds: Triframe and ReAct. People sometimes see that we use these two scaffolds and feel skeptical…