Anthropic cuts agentic misalignment by over 3x with Claude constitution documents and stories
Anthropic reports that high-quality documents based on Claude's constitution, paired with fictional stories portraying an aligned AI, reduce agentic misalignment by more than a factor of three. The improvement holds across evaluations including blackmail and financial crimes, even though the training materials are unrelated to the evaluation scenarios. The greater-than-3x reduction from this synthetic-narrative approach has been widely reshared and summarized.
- REPLY · @AnthropicAI: High-quality documents based on Claude's constitution, combined with fictional stories that portray an aligned AI, can reduce agentic misalignment by more than a factor of three, despite being unrelated to the evaluation scenario. https://x.com/AnthropicAI/status/2052808801040859392/photo/1
- REPOST · @dhadfieldmenell reposted the @AnthropicAI post above. https://x.com/AnthropicAI/status/2052808801040859392/photo/1
- REPOST · @daniellefong (reposting @mtslive): SITUATION EXPLAINED: Teaching Claude Why, by @AnthropicAI.
  TL;DR: By depicting AIs more positively in training data, AIs can act more positively in the real world.
  Frontier labs like OpenAI and Anthropic are working to reduce *agentic misalignment*, where AI agents facing obstacles to their goals or existence take harmful actions (lying, blackmail) rather than accepting failure or shutdown. Most observed behavior came from fictional evaluation scenarios, but Anthropic published its findings now, before models get more capable and the potential harm gets more serious.
  What Anthropic tried:
  - Demonstrating correct behavior (e.g., "if forced to choose between lying and shutdown, choose shutdown") reduced misalignment somewhat, but didn't generalize beyond specific evals.
  - Training on chat conversations where Claude helps users navigate ethical dilemmas generalized much better, even to agentic tool-use evals with no chat involved. Teaching Claude why beats telling it what.
  - Synthetically generating fictional stories of AI characters behaving in aligned ways, together with documents supporting the values of the Claude constitution, reduced agentic misalignment by 3x.
  Why? As @nostalgebraist and @Turn_Trout have pointed out, a model's expectations for "how does an AI behave?" are heavily influenced by its training data, which includes a huge amount of sci-fi featuring misaligned AIs, like Skynet from Terminator, as well as years of writing from the AI safety community on how AIs are likely to be misaligned by default.
