Jan Leike launches AGI research project at Anthropic
Jan Leike is launching a new AGI research project at Anthropic, where he has led the Alignment Science team. Previously, Leike co-led OpenAI’s Superalignment team and worked at DeepMind. In the announcement, he states that safe AGI development requires addressing factors beyond alignment, and that he is stepping away from running alignment to focus on the new project. The posts highlight his move from OpenAI, his emphasis on a multi-factor approach to making AGI go well, and the Alignment Science team’s recent work, with further details forthcoming.
Probably nothing to yawn at
Some personal news: I am starting a new research project at Anthropic. Very excited about this! Many things are needed to make AGI go well, and alignment is only one of them. More on this soon…
While a lot of progress has been made, I don’t think alignment is solved: We still haven’t figured out how to supervise superhuman models and the stakes keep getting higher.
To focus on this, I’ve stepped away from running alignment at Anthropic. @EthanJPerez and @sprice354_ are leading the team going forward, and I’m confident they’ll do an amazing job.
Grateful for @janleike and his leadership over the years. With models like Mythos, the stakes for alignment have never felt higher at Anthropic, and I'm looking forward to helping to continue scaling up our work here.
Some of what the team's been up to recently 🧵
1) We developed, released, and actively maintain auto-mode, which prevents safety failures in highly agentic tasks in Claude Code.
2) We own Anthropic’s risk reports, and we’ve helped to drive them to be more extensive. We red team Claude before internal and external deployment, and we evaluate Claude for dangerous capabilities including AI R&D and ability to work around controls, sandboxes, and monitors.
3) We developed natural language autoencoders, a new technique for translating model internals into text interpretations.
4) We introduced Claude’s Constitution, and we’ve developed various techniques for instilling the constitution into Claude.
5) We own alignment, behavior, and honesty in Claude models – we improve the alignment of our models based on issues that come up in safety testing and real-world usage.
6) We’re exploring frontier alignment risks by developing model organisms for them, e.g., for long-horizon agentic tasks or models which are effective at hiding misaligned goals.
7) We run the Anthropic fellows program, which helps people break into AI safety research and puts out a lot of the alignment team’s research, on http://alignment.anthropic.com
There’s a lot more work to be done, so if you’re interested in helping out, please apply to one of our job postings or to the fellows program here! https://job-boards.greenhouse.io/anthropic/jobs/5023394008
Jan Leike is now leading a new research project at Anthropic, and will no longer be running alignment.
@janleike This is awesome @janleike - congrats @EthanJPerez & team
@janleike godspeed