1d ago

METR estimates a 50% time horizon of at least 16 hours for Anthropic's Claude Mythos Preview


METR evaluated an early version of Anthropic's Claude Mythos Preview for risk assessment during a limited window in March 2026 and estimated a 50% time horizon of at least 16 hours (95% CI: 8.5 to 55 hours) on its task suite. METR describes that estimate as sitting at the upper end of what it can measure without new tasks, using its current software engineering and agentic tasks. Per @ALEXALBERT__, the early snapshot's time horizon is more than twice that of the next-best model on METR's 80% success-rate benchmark.
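To make the 50% and 80% figures concrete, here is a minimal Python sketch of how a time horizon at one success rate relates to the horizon at another. It assumes the logistic form described in METR's time-horizon methodology, where success probability falls with the log of task length. The `slope` value and all function names are hypothetical, chosen only for illustration; they are not METR's fitted parameters or code.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def success_probability(task_hours: float, h50_hours: float, slope: float) -> float:
    """Modeled chance of success on a task that takes a human `task_hours`.

    Assumed model: sigmoid(slope * (log2(h50_hours) - log2(task_hours))),
    so the probability is exactly 0.5 when task_hours == h50_hours.
    """
    z = slope * (math.log2(h50_hours) - math.log2(task_hours))
    return 1.0 / (1.0 + math.exp(-z))


def horizon_hours(p: float, h50_hours: float, slope: float) -> float:
    """Task length at which the modeled success probability equals `p`."""
    # Invert the sigmoid: log2(t) = log2(h50) - logit(p) / slope.
    return h50_hours * 2.0 ** (-logit(p) / slope)


if __name__ == "__main__":
    H50 = 16.0   # lower-bound point estimate quoted in the post
    SLOPE = 0.6  # hypothetical slope, NOT a value reported by METR
    for p in (0.5, 0.8, 0.95):
        print(f"{p:.0%} horizon: {horizon_hours(p, H50, SLOPE):.2f} h")
```

Under this assumed model, demanding a higher success rate always shortens the horizon. How much it shortens depends on the slope, which the post does not report.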

Original post

AC #288 @AJEYA_COTRA @METR_EVALS

METR #45 @METR_EVALS

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

4:41 PM · May 8, 2026

Cluster engagement

124 snapshots

AI 1000 · 21 actions

POST ME #45 @METR_EVALS We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks. https://x.com/METR_Evals/status/2052896621760004602/photo/1
POST EM #114 @EMOLLICK https://x.com/emollick/status/2052939864492925180/photo/1
POST SA #988 @SAMUELALBANIE gg mythos metr v1.1, it's been real https://x.com/SamuelAlbanie/status/2053164058543554641/photo/1
QUOTE MB #27 @MILES_BRUNDAGE @METR_EVALS Nice way of visualizing the eval breaking down
QUOTE EM #114 @EMOLLICK @METR_EVALS Huh.
QUOTE RA #130 @_AROHAN_ @METR_EVALS March march march! It’s only march captain
QUOTE GM #143 @GARYMARCUS @METR_EVALS Hot take on METR’s new graph that so many people are flipping about today.
• Claude Code is a real advance; Mythos probably builds on some of what is learned there. But…
• If you read the graph carefully, it is about achieving *50%* success. Not 100 or 99 or even 90. The key problem with GenAI has been reliability; this graph does not address reliable performance. At all.
• If you read carefully, it is only about software tasks. Not general intelligence.
• It certainly doesn’t tell you that *most* (let alone) all things that humans can do in 16 hours can be done in Mythos, let alone reliably
• Aside from this, the graph doesn’t show you *how* the improvements have been made. As noted in my newsletter a lot of the advance in recent months is likely from the incorporation of symbolic tools (like code interpreters, verification, and harnesses) rather than from model scaling per se. As such this a vindication of neurosymbolic AI – but not a proof that LLMs themselves can be perpetually scaled. As such it’s not a proof that another trillion dollars will continue the graph.
• Per @ramez, Mythos is not actually off trend on the ECI benchmark, which is a broader measure.
Quoting METR @METR_EVALS
QUOTE GM #143 @GARYMARCUS @GARYMARCUS @emollick Some cautions:
Quoting Gary Marcus @GARYMARCUS
QUOTE GM #143 @GARYMARCUS @GARYMARCUS @peterwildeford That wall would not apply at 95% reliability. Probably not even close. Accepting a fair amount of error lowers the bar. Some cautions:
Quoting Gary Marcus @GARYMARCUS
QUOTE GM #143 @GARYMARCUS @PETERWILDEFORD Sorry, @peterwildeford, but this is wrong. Please don’t play along. The measurement “wall” you mention is hit ONLY if you don’t insist on reliability. If you demanded 95% accuracy on the task, the systems wouldn’t be close to the measurement wall. The measurement problem you allude to is an artifact of artificially lowered expectations.
Quoting Peter Wildeford🇺🇸🚀 @PETERWILDEFORD
Deep learning is hitting a wall (the wall being our ability to measure AI capabilities) https://twitter.com/peterwildeford/status/2052908883388084442
QUOTE GM #143 @GARYMARCUS @GARYMARCUS PLEASE DO NOT PANIC about the Mythos/METR graph that everyone is panicking about. Progress is being made but people are totally overreacting. Here’s some context that is being left out from nearly every comment on that graph.
Quoting Gary Marcus @GARYMARCUS
QUOTE NI #323 @NICKCAMMARATA @METR_EVALS we’ve hit the “our best charts just say it’s um, above this” part of the singularity
QUOTE AA #561 @ALEXALBERT__ @METR_EVALS An early Claude Mythos Preview snapshot we provided METR has a time horizon of more than 2x the next best model on their 80% success rate benchmark https://x.com/alexalbert__/status/2052899864493830590/photo/1
Quoting METR @METR_EVALS
QUOTE AC #851 @ANDREWCURRAN_ @ALEXALBERT__ An early snapshot. https://x.com/AndrewCurran_/status/2052934706128429523/photo/1
QUOTE SA #988 @SAMUELALBANIE @SAMUELALBANIE ofc, mythos also goes hard at 80% https://x.com/SamuelAlbanie/status/2053165154611626387/photo/1
REPLY NI #323 @NICKCAMMARATA @NICKCAMMARATA epistemic status of the single most important event in history https://x.com/nickcammarata/status/2052913826069455040/photo/1
REPLY NI #323 @NICKCAMMARATA @NICKCAMMARATA my alignment hope is we can at least keep the dot on the chart as it shoots upward and transforms everything and does whatever it wants
REPLY SA #988 @SAMUELALBANIE @SAMUELALBANIE link to the data, kindly shared by @METR_Evals: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
REPLY SA #988 @SAMUELALBANIE @SAMUELALBANIE data link: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
REPOST GM #143 @GARYMARCUS @SYNABUNAI @GaryMarcus The symbolic tools point is underrated. If harnesses and verification are doing the heavy lifting, that changes the scaling story entirely.
REPOST AC #288 @AJEYA_COTRA @METR_EVALS We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks. https://x.com/METR_Evals/status/2052896621760004602/photo/1