METR
| Formation | 2022 |
|---|---|
| Founder | Beth Barnes |
| Type | Nonprofit research institute |
| Legal status | 501(c)(3) tax exempt charity |
| Purpose | AI safety research and model evaluation |
| Location | Berkeley, California |
| Website | metr |
Model Evaluation and Threat Research (METR) (MEE-tər) is a nonprofit research institute based in Berkeley, California,[1] that evaluates the ability of frontier AI models to carry out long-horizon, agentic tasks that some researchers argue could pose catastrophic risks to society.[2][3] It has worked with leading AI companies to conduct pre-deployment model evaluations and to contribute to system cards, including those for OpenAI's o3, o4-mini, GPT-4o and GPT-4.5, and for Anthropic's Claude models.[3][4][5][6][7]
METR's CEO and founder is Beth Barnes, a former alignment researcher at OpenAI who left in 2022 to form ARC Evals, the evaluation division of Paul Christiano's Alignment Research Center. In December 2023, ARC Evals was spun off into an independent 501(c)(3) nonprofit and renamed METR.[8][9][10]
Research
Much of METR's research evaluates whether AI systems can themselves carry out AI research and development. This work includes RE-Bench, a benchmark designed to test whether AIs can "solve research engineering tasks and accelerate AI R&D".[11][12]
Doubling time estimates
In March 2025, METR published a paper reporting that the length of software engineering tasks that leading AI models could complete had doubled roughly every 7 months between 2019 and 2024.[14]
In January 2026, METR released a new version of its time horizon estimation model (Time Horizon 1.1). According to the new model, the rate of progress in AI capabilities has increased since 2023: the estimated post-2023 doubling time is 130.8 days (about 4.3 months), which METR describes as progress roughly 20% more rapid than previously estimated.[15]
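The two doubling times quoted above can be compared with some simple arithmetic. The following is a minimal sketch, not METR's methodology: it assumes steady exponential growth and an average month of 365.25/12 days, and the function names are hypothetical.

```python
import math

DAYS_PER_YEAR = 365.25
DAYS_PER_MONTH = DAYS_PER_YEAR / 12  # average month length (an assumption)


def annual_growth_factor(doubling_days: float) -> float:
    """Factor by which the time horizon multiplies in one year."""
    return 2 ** (DAYS_PER_YEAR / doubling_days)


def days_to_grow(factor: float, doubling_days: float) -> float:
    """Days needed for the horizon to grow by `factor` under steady doubling."""
    return doubling_days * math.log2(factor)


seven_months = 7 * DAYS_PER_MONTH  # the 2019-2024 estimate, in days
for label, days in [("7 months", seven_months), ("130.8 days", 130.8)]:
    print(
        f"{label}: x{annual_growth_factor(days):.1f} per year, "
        f"10x in {days_to_grow(10, days):.0f} days"
    )
```

A shorter doubling time compounds quickly: under these assumptions, the yearly growth factor for a 130.8-day doubling time is roughly twice that for a 7-month doubling time.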
Time horizon measurements
METR publishes a "task-completion time horizon" for each AI model it analyses. This is the "task duration (measured by human expert completion time) at which an AI agent is predicted to succeed with a given level of reliability."[16] It is reported in two variants: the 50%-time horizon, the task duration at which a model is estimated to succeed 50% of the time, and the 80%-time horizon, the task duration at which a model is estimated to succeed 80% of the time.[16] There are two versions of the horizon estimates: Time Horizon 1.1, introduced in January 2026, and the original Time Horizon 1.0.[16]
As of 21 February 2026, the best-performing model is Claude Opus 4.6, with a 50%-time horizon of 14 hours 30 minutes and an 80%-time horizon of 1 hour 3 minutes.[16] The following table lists the time horizon estimates, ordered by each model's release date:[16]
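The relationship between the 50% and 80% horizons can be illustrated with a small sketch. It assumes that predicted success follows a logistic curve in the base-2 logarithm of human completion time; the article does not specify the fitting procedure, so this functional form, the parameter values, and the function names are all assumptions made for illustration.

```python
import math


def success_probability(minutes: float, intercept: float, slope: float) -> float:
    """Assumed model: logistic curve in log2 of human completion time."""
    z = intercept - slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(reliability: float, intercept: float, slope: float) -> float:
    """Task duration (minutes) at which predicted success equals `reliability`."""
    logit = math.log(reliability / (1.0 - reliability))
    return 2 ** ((intercept - logit) / slope)


# Hypothetical fitted parameters, chosen only for illustration.
INTERCEPT, SLOPE = 3.0, 0.6

h50 = time_horizon(0.5, INTERCEPT, SLOPE)
h80 = time_horizon(0.8, INTERCEPT, SLOPE)
print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min")
```

With a positive slope, success becomes less likely as tasks get longer, so under this model the 80% horizon is always shorter than the 50% horizon for the same model.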
| Model | Release date | Time Horizon 1.1 (50%) | Time Horizon 1.1 (80%) | Time Horizon 1.0 (50%) | Time Horizon 1.0 (80%) |
|---|---|---|---|---|---|
| GPT-2 | February 2019 | — | — | 2 seconds | 0 seconds |
| GPT-3 | May 2020 | — | — | 9 seconds | 2 seconds |
| GPT-3.5 | March 2022 | — | — | 36 seconds | 10 seconds |
| GPT-4 | March 2023 | 4 minutes | 37 seconds | 5 minutes | 1 minute |
| GPT-4 (November 2023) | November 2023 | 4 minutes | 34 seconds | 9 minutes | 1 minute |
| Claude 3 Opus | March 2024 | 4 minutes | 29 seconds | 6 minutes | 1 minute |
| GPT-4 Turbo | April 2024 | 3 minutes | 37 seconds | 7 minutes | 2 minutes |
| GPT-4o | May 2024 | 6 minutes | 57 seconds | 9 minutes | 2 minutes |
| Qwen2-72B | June 2024 | — | — | 2 minutes | 25 seconds |
| Claude 3.5 Sonnet (Old) | June 2024 | 11 minutes | 1 minute | 19 minutes | 3 minutes |
| Qwen2.5-72B | September 2024 | — | — | 5 minutes | 56 seconds |
| o1-preview | September 2024 | 19 minutes | 3 minutes | 22 minutes | 5 minutes |
| Claude 3.5 Sonnet (New) | October 2024 | 20 minutes | 2 minutes | 30 minutes | 5 minutes |
| DeepSeek-V3 | December 2024 | — | — | 18 minutes | 4 minutes |
| o1 | December 2024 | 38 minutes | 6 minutes | 41 minutes | 6 minutes |
| Claude 3.7 Sonnet | February 2025 | 1 hour | 10 minutes | 56 minutes | 15 minutes |
| o3 | April 2025 | 2 hours 1 minute | 24 minutes | 1 hour 34 minutes | 21 minutes |
| o4-mini | April 2025 | — | — | 1 hour 19 minutes | 16 minutes |
| Claude Opus 4 | May 2025 | 1 hour 41 minutes | 17 minutes | 1 hour 26 minutes | 21 minutes |
| DeepSeek-R1-0528 | May 2025 | — | — | 32 minutes | 4 minutes |
| Gemini 2.5 Pro Preview | June 2025 | — | — | 40 minutes | 9 minutes |
| Grok 4 | July 2025 | — | — | 1 hour 49 minutes | 15 minutes |
| Claude Opus 4.1 | August 2025 | 1 hour 41 minutes | 19 minutes | — | — |
| GPT-5 | August 2025 | 3 hours 34 minutes | 32 minutes | 2 hours 18 minutes | 27 minutes |
| gpt-oss-120b | August 2025 | — | — | 45 minutes | 7 minutes |
| Claude Sonnet 4.5 | September 2025 | — | — | 2 hours 2 minutes | 21 minutes |
| Gemini 3 Pro | November 2025 | 3 hours 57 minutes | 43 minutes | — | — |
| Claude Opus 4.5 | November 2025 | 5 hours 20 minutes | 42 minutes | 4 hours 49 minutes | 27 minutes |
| GPT-5.1-Codex-Max | November 2025 | 3 hours 57 minutes | 41 minutes | 2 hours 53 minutes | 32 minutes |
| Kimi K2 Thinking (inference via Novita AI) | November 2025 | — | — | 58 minutes | 12 minutes |
| GPT-5.2 (high) | December 2025 | 6 hours 34 minutes | 55 minutes | — | — |
| Claude Opus 4.6 | February 2026 | 11 hours 59 minutes | 1 hour 10 minutes | — | — |
| GPT-5.3-Codex (high) | February 2026 | 6 hours 30 minutes | 47 minutes | — | — |
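A rough doubling time can be read off any two rows of the table. The sketch below is a crude two-point estimate, not the regression METR uses for its published figures. It parses the duration strings used in the table and applies the estimate to the Time Horizon 1.1 50% values for GPT-4 and Claude Opus 4.6; the helper names are hypothetical.

```python
import math
import re


def parse_minutes(text: str) -> float:
    """Convert strings like '11 hours 59 minutes' or '37 seconds' to minutes."""
    per_unit = {"hour": 60.0, "minute": 1.0, "second": 1.0 / 60.0}
    total = 0.0
    for value, unit in re.findall(r"(\d+)\s+(hour|minute|second)s?", text):
        total += int(value) * per_unit[unit]
    return total


def two_point_doubling_months(start: str, end: str, months_apart: float) -> float:
    """Doubling time implied by two horizons measured `months_apart` apart."""
    doublings = math.log2(parse_minutes(end) / parse_minutes(start))
    return months_apart / doublings


# Time Horizon 1.1, 50% values: GPT-4 (March 2023) and
# Claude Opus 4.6 (February 2026), 35 months apart.
estimate = two_point_doubling_months("4 minutes", "11 hours 59 minutes", 35)
print(f"about {estimate:.1f} months per doubling")
```

Because it uses only two data points, this estimate is sensitive to which rows are chosen and should be read only as a sanity check on the order of magnitude.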
References
- ^ Witt, Stephen (10 October 2025). "The A.I. Prompt That Could End the World". The New York Times. Archived from the original on 29 October 2025. Retrieved 29 October 2025.
- ^ "About METR". METR. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
- ^ a b "OpenAI o3 and o4-mini System Card". OpenAI. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
- ^ "GPT-4.5 system card". OpenAI. Retrieved 15 June 2025.
- ^ "Introducing Claude 3.5 Sonnet". Anthropic. Archived from the original on 6 February 2025. Retrieved 15 June 2025.
- ^ "Details about METR's preliminary evaluation of Claude 3.7". METR's Autonomy Evaluation Resources. 4 April 2025. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
- ^ Robison, Kylie (8 August 2024). "OpenAI says its latest GPT-4o model is 'medium' risk". The Verge. Archived from the original on 6 February 2026. Retrieved 29 October 2025.
- ^ "ARC Evals is now METR". METR Blog. 4 December 2023. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
- ^ Booth, Harry (5 September 2024). "TIME100 AI 2024: Beth Barnes". TIME. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
- ^ Henshall, Will (21 March 2024). "Nobody Knows How to Safety-Test AI". TIME. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
- ^ "Claude 3.7 Sonnet System Card". Anthropic. 24 February 2025. Retrieved 15 June 2025.
- ^ "Gemini 2.5 Pro Preview Model Card". Google. 6 June 2025. Archived from the original on 28 May 2025. Retrieved 15 June 2025.
- ^ "Measuring AI Ability to Complete Long Tasks". METR Blog. 19 March 2025. Archived from the original on 15 June 2025. Retrieved 15 June 2025.
- ^ Lovely, Garrison (19 March 2025). "AI could soon tackle projects that take humans weeks". Nature. doi:10.1038/d41586-025-00831-8. ISSN 1476-4687. Archived from the original on 1 July 2025. Retrieved 15 June 2025.
- ^ "Time Horizon 1.1". METR Blog. 29 January 2026. Archived from the original on 12 February 2026. Retrieved 14 February 2026.
- ^ a b c d e "Task-Completion Time Horizons of Frontier AI Models". METR. February 2026. Retrieved 20 February 2026.