
UK institute reports 4.7-month doubling in AI cyber tasks


The UK's AI Security Institute reported in May 2026 that the length of autonomous cyber tasks frontier models can complete at an 80 percent success rate has doubled every 4.7 months since late 2024, accelerating from the 8-month pace measured in November 2025. A newer Claude Mythos Preview checkpoint became the first model to finish both evaluated cyber ranges, and GPT-5.5 posted comparable gains. The trend matches separate METR results showing a 4.2-month doubling time on software engineering benchmarks over the same period.
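For intuition on where a figure like "4.7 months" comes from: assuming exponential growth, the doubling time implied by two time-horizon measurements is Δt / log2(h₂/h₁). A minimal sketch with hypothetical numbers chosen purely for illustration (not AISI's actual data points):

```python
from datetime import date
from math import log2

def doubling_time_months(d0, h0, d1, h1):
    """Months for the time horizon to double, assuming exponential growth
    between two measurements (date, horizon in hours)."""
    months = (d1 - d0).days / 30.44  # average month length in days
    return months / log2(h1 / h0)

# Hypothetical: horizon grows from 2.5 hours (Nov 2024) to 20 hours (May 2026).
print(round(doubling_time_months(date(2024, 11, 1), 2.5,
                                 date(2026, 5, 1), 20.0), 1))  # prints 6.0
```

Actual published estimates are fit by regression across many models and tasks rather than from two endpoints, which is why the reported doubling time shifts as new checkpoints land.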

Original post

Our cyber range results illustrate this step-up. Since our first Mythos evaluation, we received access to a newer Mythos Preview checkpoint. On a 32-step corporate network attack we estimate takes a human expert ~20 hours, this checkpoint completes the full attack in 6/10 attempts.

8:49 AM · May 13, 2026
Reposted by

The UK’s state AI Security Institute findings: 1) Mythos is a big gain in cyber capabilities. But so is GPT-5.5. 2) It is hard to establish an upper bound on Mythos/GPT-5.5, which appear to be limited by tokens used, rather than ability. 3) Capability doubling time is 4.5 months

4:11 PM · May 13, 2026 · 24K Views

All of this aligns with METR’s results as well.

Report: https://www.aisi.gov.uk/blog/how-fast-is-autonomous-ai-cyber-capability-advancing

Ethan Mollick@emollick

4:14 PM · May 13, 2026 · 6.6K Views

In life, everything is a wager. Whether you realize it or not, you are constantly making implicit and explicit predictions about the future state of reality. To live is to predict.

So when you are faced with something like Mythos, and you say, “this is just ‘doomer hype’!,” what you are really doing is making a bet against model capabilities growth, and thus ultimately you are making a broad directional bet against deep learning, which has usually been a pretty bad bet to make.

I am surprised that so many people—people who are otherwise AI optimists!—continue to make these bets against deep learning. They keep being wrong, and the less humble among them have torched their credibility with anyone paying attention.

So ask yourself, when you make claims about AI and its future: “am I making an implicit bet against deep learning in a broad directional way?”

6:48 PM · May 13, 2026 · 22.8K Views

The UK AISI found Mythos Preview is the first model to solve both their cyber ranges end-to-end. No model had ever solved the AISI’s “Cooling Tower” cyber range before.

We're getting it to defenders as fast as we responsibly can. More to come on our Glasswing work soon.

Logan Graham@logangraham

5:40 PM · May 13, 2026 · 137.6K Views

They are notably using 'a newer Mythos Preview checkpoint than that included in previous AISI reporting.' From the blog:

'Our latest doubling time estimates are close to those produced by METR, a research non-profit that estimates time horizons for software engineering – a skillset related to cyber, but broader. Their results imply a consistent doubling time of 4.2 months on software tasks since late 2024.

We have also observed further evidence of cyber autonomy beyond our narrow task suite. AISI’s cyber ranges (shown below) measure AI models’ ability to complete cyberattacks against small, undefended enterprise networks, where initial access has already been gained. Each cyber range requires sustained planning and execution capability; more detail on them can be found in our recent paper.

In AISI’s latest testing, the newer Mythos Preview checkpoint completed both our cyber ranges, solving the range “The Last Ones” in 6 of 10 attempts and the previously unsolved “Cooling Tower” in 3 of 10 attempts. This was the first time that a model completed the second of our two cyber ranges. GPT-5.5 solved “The Last Ones” on 3 of 10 attempts.

These results utilise a newer Mythos Preview checkpoint than that included in previous AISI reporting. Notable capability jumps do not always require new model releases: later iterations of the same model can also meaningfully change our estimates of frontier capabilities. '
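A side note on those pass rates: with only ten attempts per range, the point estimates are noisy. A quick sketch using the standard 95% Wilson score interval (my calculation, not from the report) around the 3/10 and 6/10 results:

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

for k in (3, 6):
    lo, hi = wilson_interval(k, 10)
    print(f"{k}/10 -> [{lo:.2f}, {hi:.2f}]")
# 3/10 -> [0.11, 0.60]
# 6/10 -> [0.31, 0.83]
```

The two intervals overlap substantially, so the 6/10-vs-3/10 gap between Mythos Preview and GPT-5.5 on "The Last Ones" should not be over-read from ten trials alone, a caution the report itself echoes.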

AI Security Institute@AISecurityInst

Our evaluations show that frontier AI's cyber capabilities are advancing quickly. The length of cyber tasks frontier models can complete has been doubling every few months, and this rate has become faster over time, with recent models exceeding our previous trends. 🧵

3:49 PM · May 13, 2026 · 115.5K Views
3:58 PM · May 13, 2026 · 19.9K Views

'No single benchmark result should be read as a precise measure of AI capability. Time horizon estimates carry genuine uncertainty; the longest tasks in our narrow suite have the fewest human baselines, and it is too early to tell whether the step-change from recent models is representative of a new ongoing (or accelerating) pace. Regardless, the direction of change and rapid growth have been consistent across the models, methodological choices and independent data we examined.

Frontier AI's autonomous cyber and software capability is advancing quickly: the length of cyber tasks that frontier models can complete autonomously has doubled on the order of months, not years. What this evidence does not tell us is how the pace of progress will evolve, when AI will reach any particular capability threshold, or how these capabilities will translate against defended, real-world systems.

Stronger AI cyber capabilities are already producing tangible opportunities and risks. Cyber defenders have reported significant advances in vulnerability discovery using recent models, and access to today's controlled capabilities may diffuse over time. The time to invest in strong security baselines is now. Frontier AI can strengthen attackers as well as defenders, and there is a critical window to build resilience. The National Cyber Security Centre recently published advice on using AI models to find vulnerabilities.'

Andrew Curran@AndrewCurran_

4:04 PM · May 13, 2026 · 1.8K Views

Mythos-preview-old had existed since late February. I don't know when they had this one ready for evaluation, but likely about 2 months have passed. This isn't extraordinary given what we've seen from OpenAI since 5.2 (and you see Cyber here). Iterations are fast now.

AI Security Institute@AISecurityInst

5:28 PM · May 13, 2026 · 4.5K Views

A lot of people have been wondering about Mythos, Glasswing, and the vulns we / our partners are fixing. Today, I’m excited for us to start sharing more. (For context, I lead Glasswing @AnthropicAI.)

Two independent evaluations this week—from XBOW and the UK AISI—confirm what we've been seeing internally: Claude Mythos Preview is a step change in autonomous cybersecurity capabilities. We need to start preparing fast for a world of models with this level of capabilities.

The UK AI Security Institute tested the model we shipped at the launch of Project Glasswing and found Mythos Preview is the first model to solve both of their end-to-end cyber ranges, including one (Cooling Tower) which no model had ever cleared. But attackers (and defenders) have sophistication & cost constraints – Mythos is also the only model that clears every one of their tasks estimated over 8 hours under their deliberately low 2.5M-token cap.

XBOW tested it on their offensive security benchmarks, finding "token-for-token, unprecedented precision." It's the only model to succeed at subtle V8 sandbox work.

Other Glasswing partners shared similar stories. In a few weeks of testing, Mythos Preview has helped them find many thousands of (estimated) high + critical severity vulnerabilities, sometimes double what they'd normally find in a year.

I don't share this to boost Mythos. In fact, this is not about Mythos. It’s about preparing for the coming world of models being better, faster, cheaper, and more creative than some of the best human experts at dual use capabilities. Clearly, we need them supporting defenders as widely as can be done safely – and especially the least resourced ones.

Within a year, Mythos will probably look quite dumb (relative to other new models). And others may release openly available or unguardrailed models of Mythos-level capabilities.

We started Project Glasswing because capabilities like Mythos Preview's won't stay rare, or stay in careful hands. We are bringing it to defenders as fast as we responsibly can, while working to figure out, for example, the right safeguards and patching & disclosure processes.

Also, to be clear, compute has never been a limiter in our rollout.

Expect a fuller update on our Glasswing work in the coming days.

XBOW report: https://xbow.com/blog/mythos-offensive-security-xbow-evaluation

UK AISI report: https://www.aisi.gov.uk/blog/how-fast-is-autonomous-ai-cyber-capability-advancing

AI Security Institute@AISecurityInst

5:23 PM · May 13, 2026 · 343.2K Views

@fluxxrider @AnthropicAI UK AISI previously tested a partially trained version. The latest results are on the actual Mythos Preview -- the model as it was on the day we launched Glasswing (on April 7th).

6:07 PM · May 13, 2026 · 10.9K Views


Anthropic has a new version of Claude Mythos that seems to be dramatically more capable than the one that shook the world just weeks ago.

AI is accelerating, not slowing down.

AI Security Institute@AISecurityInst

7:00 PM · May 13, 2026 · 19.9K Views

Worth paying attention to: @AISecurityInst’s initial review of Anthropic’s Mythos made waves… but today they publish results that show that a later checkpoint of the model is significantly more capable still. There is no deceleration.

AI Security Institute@AISecurityInst

7:53 PM · May 13, 2026 · 16.2K Views

Earlier UK AISI eval had GPT-5.5 and Mythos tied on cyber, which seemed off given real-world signals.

Turns out that was testing an unfinished Mythos. A revised test shows Mythos beats GPT-5.5 at cyber.

12:31 AM · May 14, 2026 · 5.8K Views

i’ve grown tired of pretending this is still moving at human speed.

something shifted with mythos. not in the theatrical “the robots woke up” way people like to mock. in the quieter, colder way. the kind where a lab looks at its own evals and realises the old categories stopped working.

we assumed the next jump would be obvious. bigger data centres. louder chips. power plants, yottaflops, national infrastructure, the whole cathedral of compute. turns out the dangerous part was never just scale. it was what happens when reasoning becomes a substrate. when the model stops merely answering and starts searching the problem space like a thing that has its own private geometry.

mythos is the tell.

not because it is magic. not because it is conscious. because it shows the curve bending in public while everyone is still arguing over yesterday’s slope. a general model, not even built as a cyber weapon, starts finding vulnerabilities humans missed for years. not toy bugs. not classroom puzzles. real systems. old systems. the kind of hidden cracks entire industries quietly depend on not being visible.

and the part nobody wants to sit with is this: the next models do not need to be ten times larger to be ten times more consequential.

capability is no longer arriving as a clean linear upgrade. it is arriving as compression. tasks that took experts days become agent loops. workflows that required teams become prompts plus tools. reasoning that looked impossible last year becomes a benchmark nobody cares about by spring.

the public still thinks intelligence means chat. a box that writes emails. a search engine with manners. a productivity toy wearing a human voice.

but behind the curtain, the labs are measuring something else entirely.

autonomy length. planning depth. tool fluency. exploit chaining. internal representations that generalise across domains before anyone has a satisfying explanation for why. models that don’t just know more, but stay coherent longer. push further. recover from mistakes. test their own outputs. route around obstacles.

that is the real threshold.

not “can it talk like us”.

can it operate.

because once a model can hold a goal across time, decompose it, verify progress, use tools, and improve its own path through the maze, the world changes shape. suddenly intelligence is not a product feature. it is labour. it is research. it is reconnaissance. it is leverage.

and leverage compounds.

this is why the mythos moment feels different. it is not another chatbot release. it is a warning flare from the frontier. a signal that the next generation of models will not merely be better at conversation. they will be better at execution. better at discovering structure. better at finding the thing we missed because our brains were never built to search that many branches at once.

we are not ready for what comes next.

not culturally. not legally. not institutionally. maybe not even psychologically.

because the next wave will not announce itself as science fiction. it will arrive as a workflow improvement. a security tool. a coding agent. a research assistant. a quiet multiplier embedded into every system that matters.

meanwhile mainstream conversation is still “will ai replace junior developers” and “can it make me a nicer spreadsheet”.

brother.

we are watching non-human cognition become operational infrastructure, and everyone is still asking whether it can write better emails.

12:42 PM · May 14, 2026 · 7.7K Views