Tilde Research launches Aurora optimizer for frontier-scale AI models
Tilde Research introduced Aurora, an optimizer for training frontier-scale AI models. Aurora-1.1B achieves 100x data efficiency on open-source internet data, using 25% fewer parameters and 100x fewer training tokens than baselines, and matches Qwen3-1.7B on several benchmarks. Aurora avoids the neuron death observed under the Muon optimizer. Results use standard open-source data without proprietary corpora or distillation.
Cool work!
Muon (Shampoo with b2 = 0.0) has the following pathology, and thus likely toasted some big-model performance: the polar factor can lead to updates of the type

U = [[1,   0],
     [0, eps]]

which can send the second row's updates into a death spiral, so neurons selectively die. An update like

[[0.5, 0.0],
 [0.0, 0.5]]

would be a bit better. OG Shampoo has less of this problem but doesn't fix it. This is why E-Shampoo is superior in this respect, but you can do better. All roads lead to Adam/AdaGrad. Or do they …
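A minimal numpy sketch of the failure mode described above (an assumed setup, not Tilde's or any official Muon implementation): Muon replaces the update with an approximate polar factor computed by a few Newton-Schulz iterations, and when the momentum matrix is nearly rank-deficient those iterations leave the small singular value far below 1, so the corresponding row barely moves.

```python
# Sketch (assumed setup): approximate polar factor via the quintic Newton-Schulz
# iteration used in public Muon implementations, applied to a nearly rank-deficient
# "gradient". The weak direction is not pulled up to 1, matching the
# [[1, 0], [0, eps]]-shaped update described above.
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315      # standard NS5 coefficients
    X = G / (np.linalg.norm(G) + eps)      # scale so singular values are <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

G = np.diag([1.0, 1e-4])                   # second direction is almost dead
U = newton_schulz(G)
print(np.round(np.linalg.svd(U, compute_uv=False), 3))
# -> approximately [0.70, 0.05]: the weak row still receives ~14x smaller updates,
#    so it keeps falling behind and can effectively die.
```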
Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks. Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity. What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.
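The announcement doesn't say how neuron death is measured, but a simple proxy is easy to sketch. The snippet below is a hypothetical probe (not Tilde's methodology), using a ReLU MLP as a stand-in for a transformer FFN block: it counts hidden units that never activate on a calibration batch, and tracking that fraction over training is one way to watch the loss of effective capacity the announcement describes.

```python
# Hypothetical dead-neuron probe: a hidden unit is counted as "dead" if its
# post-activation output is (near) zero for every token in a calibration batch.
import torch
import torch.nn as nn

def dead_neuron_fraction(model: nn.Module, layer: nn.Module,
                         batch: torch.Tensor, tol: float = 1e-6) -> float:
    acts = []
    handle = layer.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
    with torch.no_grad():
        model(batch)
    handle.remove()
    a = acts[0].reshape(-1, acts[0].shape[-1])   # (tokens, hidden)
    dead = a.abs().amax(dim=0) < tol             # unit never fires on the batch
    return dead.float().mean().item()

# Toy usage: a ReLU MLP standing in for one FFN block.
mlp = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
print(dead_neuron_fraction(mlp, mlp[1], torch.randn(32, 64)))
```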
I know I left it at a cliffhanger.
Enjoy this vague post too
Neurons don't die; they are just in cryo sleep. The melody of optimization can wake them up, ready to increase representation strength.
@andrew_n_carr Appendix B in the Scalable Second Order Optimization paper.
@_arohan_ E-shampoo?
@andrew_n_carr We should have named it better!
@_arohan_ Ah! Kronecker-approximated and scaled eigenbasis to get the Fisher matrix. E stands for eigen in this case, I guess.
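For readers who haven't seen that appendix, here is a rough sketch of the eigenbasis idea as described in the reply above (an assumed formulation, not the paper's actual algorithm): keep Kronecker-factored gradient statistics as in Shampoo, but instead of applying matrix inverse roots, rotate the gradient into the factors' eigenbasis and do an AdaGrad/Adam-style per-coordinate scaling there.

```python
# Rough sketch (assumed formulation, not the paper's code) of a scaled-eigenbasis
# preconditioner: Kronecker factors L ~ sum(G G^T) and R ~ sum(G^T G) approximate
# the Fisher/AdaGrad matrix; the gradient is rotated into their eigenbases,
# scaled per-coordinate by a running second moment, then rotated back.
import numpy as np

def eigenbasis_step(G, L, R, V, eps=1e-8):
    _, QL = np.linalg.eigh(L)            # eigenbasis of the left factor
    _, QR = np.linalg.eigh(R)            # eigenbasis of the right factor
    Gr = QL.T @ G @ QR                   # gradient in the joint eigenbasis
    V = V + Gr**2                        # AdaGrad-style second moment there
    update = QL @ (Gr / (np.sqrt(V) + eps)) @ QR.T
    return update, V

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 4))
L, R = G @ G.T, G.T @ G                  # Kronecker statistics (single step here)
update, V = eigenbasis_step(G, L, R, V=np.zeros_like(G))
```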
@recurseparadox At what point in training you do the sparsification probably matters: early training vs. later training.
@_arohan_ Isn't the point of optimization to kill some neurons, though? Distributed representations need sparse graphs.
More proof from one of our companies that innovation around LLMs continues unabated...
I wanted to say "I don't believe in this free-lunch bs," but their hypothesis is interesting. In effect, fine-grained sparsity (going to the extreme of tiny experts) might have been allowing a failure mode of Muon to stay hidden.

Architecture 🤝 Optimizer