Tilde Research launches Aurora optimizer for frontier-scale AI models
Tilde Research introduced Aurora, an optimizer for training frontier-scale AI models. Aurora-1.1B achieves 100x data efficiency on open-source internet data, using 25% fewer parameters and 100x fewer training tokens than baselines, and matches Qwen3-1.7B on several benchmarks. Aurora avoids the neuron death observed under the Muon optimizer. Results use standard open-source data without proprietary corpora or distillation.
Cool work!
Muon (Shampoo with b2 = 0.0) has the following pathology, and thus likely toasted some big-model performance: the polar factor can lead to updates of the type

U = [[1,   0],
     [0, eps]]

which can send the second row's updates into a death spiral, so neurons selectively die. An update like

[[0.5, 0.0],
 [0.0, 0.5]]

would be a bit better. OG Shampoo has less of this problem but doesn't fix it. This is why E-Shampoo is superior in this respect, but you can do better. All roads lead to Adam/AdaGrad. Or do they …
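A minimal numpy sketch of the failure mode described above (an assumed setup, not Tilde's or any official Muon implementation): Muon replaces the update with an approximate polar factor computed by a few Newton-Schulz iterations, and when the momentum matrix is nearly rank-deficient those iterations leave the small singular value far below 1, so the corresponding row barely moves.

```python
# Sketch (assumed setup): approximate polar factor via the quintic Newton-Schulz
# iteration used in public Muon implementations, applied to a nearly rank-deficient
# "gradient". The weak direction is not pulled up to 1, matching the
# [[1, 0], [0, eps]]-shaped update described above.
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315      # standard NS5 coefficients
    X = G / (np.linalg.norm(G) + eps)      # scale so singular values are <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

G = np.diag([1.0, 1e-4])                   # second direction is almost dead
U = newton_schulz(G)
print(np.round(np.linalg.svd(U, compute_uv=False), 3))
# -> approximately [0.70, 0.05]: the weak row still receives ~14x smaller updates,
#    so it keeps falling behind and can effectively die.
```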
Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks. Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity. What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.
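The announcement doesn't say how neuron death is measured, but a simple proxy is easy to sketch. The snippet below is a hypothetical probe (not Tilde's methodology), using a ReLU MLP as a stand-in for a transformer FFN block: it counts hidden units that never activate on a calibration batch, and tracking that fraction over training is one way to watch the loss of effective capacity the announcement describes.

```python
# Hypothetical dead-neuron probe: a hidden unit is counted as "dead" if its
# post-activation output is (near) zero for every token in a calibration batch.
import torch
import torch.nn as nn

def dead_neuron_fraction(model: nn.Module, layer: nn.Module,
                         batch: torch.Tensor, tol: float = 1e-6) -> float:
    acts = []
    handle = layer.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
    with torch.no_grad():
        model(batch)
    handle.remove()
    a = acts[0].reshape(-1, acts[0].shape[-1])   # (tokens, hidden)
    dead = a.abs().amax(dim=0) < tol             # unit never fires on the batch
    return dead.float().mean().item()

# Toy usage: a ReLU MLP standing in for one FFN block.
mlp = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
print(dead_neuron_fraction(mlp, mlp[1], torch.randn(32, 64)))
```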
I know I left it at a cliffhanger.
Enjoy this vague post too
Neurons don't die; they are just in cryo sleep. The melody of optimization can wake them up, ready to increase representation strength.
@andrew_n_carr Appendix B in the Scalable Second Order Optimization paper.
@_arohan_ E-shampoo?
@andrew_n_carr We should have named it better!
@_arohan_ Ah! Kronecker-approximated and scaled eigenbasis to get the Fisher matrix. E stands for eigen in this case, I guess.
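For readers who haven't seen that appendix, here is a rough sketch of the eigenbasis idea as described in the reply above (an assumed formulation, not the paper's actual algorithm): keep Kronecker-factored gradient statistics as in Shampoo, but instead of applying matrix inverse roots, rotate the gradient into the factors' eigenbasis and do an AdaGrad/Adam-style per-coordinate scaling there.

```python
# Rough sketch (assumed formulation, not the paper's code) of a scaled-eigenbasis
# preconditioner: Kronecker factors L ~ sum(G G^T) and R ~ sum(G^T G) approximate
# the Fisher/AdaGrad matrix; the gradient is rotated into their eigenbases,
# scaled per-coordinate by a running second moment, then rotated back.
import numpy as np

def eigenbasis_step(G, L, R, V, eps=1e-8):
    _, QL = np.linalg.eigh(L)            # eigenbasis of the left factor
    _, QR = np.linalg.eigh(R)            # eigenbasis of the right factor
    Gr = QL.T @ G @ QR                   # gradient in the joint eigenbasis
    V = V + Gr**2                        # AdaGrad-style second moment there
    update = QL @ (Gr / (np.sqrt(V) + eps)) @ QR.T
    return update, V

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 4))
L, R = G @ G.T, G.T @ G                  # Kronecker statistics (single step here)
update, V = eigenbasis_step(G, L, R, V=np.zeros_like(G))
```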
@recurseparadox At what point in training you do the sparsification probably matters: early training vs. later training.
@_arohan_ Isn't the point of optimization to kill some neurons, though? Distributed representations need sparse graphs.
More proof from one of our companies that innovation around LLMs continues unabated...
I wanted to say "I don't believe in this free-lunch bs," but their hypothesis is interesting. In effect, fine-grained sparsity (going to the extreme of tiny experts) might have been allowing a failure mode of Muon to stay hidden.

Architecture 🤝 Optimizer