
Tilde Research launches Aurora optimizer for frontier-scale AI models


Tilde Research introduced Aurora, an optimizer for training frontier-scale AI models. Aurora-1.1B achieves 100x data efficiency on open-source internet data, using 25% fewer parameters and 100x fewer training tokens than baselines, and matches Qwen3-1.7B on several benchmarks. Aurora avoids the neuron-death failure mode observed under the Muon optimizer. Results use standard open-source data without proprietary corpora or distillation.

Original post

Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks. Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity. What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.

10:10 AM · May 8, 2026
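The post doesn't spell out Aurora's update rule, but the stated idea — spreading update energy more uniformly across output neurons (rows of a weight matrix) — can be sketched in a few lines. The function below is a hypothetical illustration of that idea, not the actual algorithm:

```python
import numpy as np

def equalize_row_energy(update, eps=1e-12):
    # Hypothetical sketch of "redistributing update energy more uniformly
    # across neurons": rescale each output-neuron row of an update matrix
    # so every row carries an equal share of the total update energy.
    # NOT Aurora's published algorithm -- just the idea stated in the post.
    row_norms = np.linalg.norm(update, axis=1, keepdims=True)
    target = np.linalg.norm(update) / np.sqrt(update.shape[0])
    return update * (target / (row_norms + eps))

# An update whose second neuron (row) receives almost no energy:
U = np.array([[1.0, 0.0],
              [0.0, 1e-7]])
balanced = equalize_row_energy(U)
print(np.linalg.norm(balanced, axis=1))  # both rows now share energy equally
```

Note the total update magnitude is preserved; only its distribution across rows changes, which is one way to keep a Muon-like scale while preventing individual rows from starving.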

Cool work!

Muon (shampoo with b2=0.0) had the following pathology, and thus likely toasted some big-model performance: the polar factor can lead to updates of the type

U being

[[1, 0],
 [0, eps]]

which can lead to a death spiral for the second row's updates, so neurons selectively die.

The following update would be a bit better:

[[0.5, 0],
 [0, 0.5]]

OG shampoo has less of this problem but doesn't fix it. This is why E-shampoo is superior in this respect, but you can do better.

All roads lead to Adam/AdaGrad. Or do they …
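The pathology is easy to reproduce numerically. A minimal sketch, assuming the standard Newton-Schulz polar iteration that Muon-style optimizers use (textbook 1.5/-0.5 coefficients, not Muon's tuned ones): a gradient with one tiny singular value comes out of the iteration with that direction still near zero, so the corresponding row keeps getting near-zero updates.

```python
import numpy as np

def newton_schulz_polar(G, steps=5):
    # Odd-polynomial iteration X <- 1.5*X - 0.5*(X X^T X), which pushes
    # every singular value of X toward 1 (i.e., toward the polar factor
    # of G). Muon uses a tuned variant of these coefficients.
    X = G / np.linalg.norm(G)  # normalize so singular values are < sqrt(3)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X

# A gradient whose second row direction is nearly dead (singular value 1e-8):
G = np.array([[1.0, 0.0],
              [0.0, 1e-8]])
U = newton_schulz_polar(G)
# The tiny singular value grows only ~1.5x per step, so after 5 steps the
# second row's update is still ~1e-7: the [[1, 0], [0, eps]] shape described
# here, and a death spiral if that row's gradient stays small.
```

With infinitely many iterations every nonzero singular value would reach 1, but practical step budgets leave near-zero directions near zero.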

Tilde@tilderesearch

Introducing Aurora, a new optimizer for training frontier-scale models. […]

5:10 PM · May 8, 2026 · 284.1K Views
5:53 PM · May 8, 2026 · 18.5K Views

I know I left it at a cliffhanger.

rohan anil@_arohan_

Cool work! Muon (shampoo with b2=0.0) had the following pathology […]

5:53 PM · May 8, 2026 · 18.5K Views
5:55 PM · May 8, 2026 · 2.1K Views

Enjoy this vague post too

rohan anil@_arohan_

Neurons don’t die, they’re just cryo-sleeping. The melody of optimization can wake them up, ready to increase the representation strength.

5:36 PM · Apr 19, 2026 · 2.4K Views
6:00 PM · May 8, 2026 · 2.8K Views

@andrew_n_carr Appendix B in Scalable second order optimization paper.

Andrew Carr 🤸@andrew_n_carr

@_arohan_ E-shampoo?

6:21 PM · May 8, 2026 · 1K Views
6:24 PM · May 8, 2026 · 797 Views

@andrew_n_carr We should have named it better!

Andrew Carr 🤸@andrew_n_carr

@_arohan_ Ah! Kronecker-approximated and scaled eigenbasis to get the Fisher matrix. E stands for eigen in this case, I guess.

6:36 PM · May 8, 2026 · 252 Views
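For context on what working in the eigenbasis looks like, here is a rough sketch — my reading of the exchange, not the paper's exact algorithm. Shampoo preconditions a gradient G with Kronecker factors as L^(-1/4) G R^(-1/4), which can be computed by rotating G into the factors' eigenbases and scaling each eigendirection:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))            # gradient of a 4x3 weight matrix
# Kronecker factors (accumulated over many steps in real Shampoo; one-shot
# here, with a small ridge so all eigenvalues are positive):
L = G @ G.T + 1e-6 * np.eye(4)
R = G.T @ G + 1e-6 * np.eye(3)
wL, QL = np.linalg.eigh(L)
wR, QR = np.linalg.eigh(R)
# Rotate into the eigenbasis, scale direction (i, j) by (wL_i * wR_j)^(-1/4),
# rotate back -- algebraically identical to L^(-1/4) @ G @ R^(-1/4):
scaled = (QL.T @ G @ QR) / np.outer(wL, wR) ** 0.25
precond = QL @ scaled @ QR.T
```

The eigenbasis form makes the per-direction scaling explicit, which is what lets eigen-variants replace the fixed -1/4 power with adaptive per-direction statistics.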
7:10 PM · May 8, 2026 · 193 Views

@recurseparadox What % of the way through training you do sparsification probably matters: initial training vs. later training.

Pranav Shyam@recurseparadox

@_arohan_ Isn’t the point of optimization to kill some neurons, though? Distributed representations need sparse graphs.

11:42 PM · May 8, 2026 · 50 Views
11:48 PM · May 8, 2026 · 64 Views

More proof from one of our companies that innovation continues unabated around LLMs…

Tilde@tilderesearch

Introducing Aurora, a new optimizer for training frontier-scale models. […]

5:10 PM · May 8, 2026 · 284.1K Views
6:18 PM · May 8, 2026 · 34.7K Views

@_arohan_ E-shampoo?

rohan anil@_arohan_

Cool work! Muon (shampoo with b2=0.0) had the following pathology […]

5:53 PM · May 8, 2026 · 18.5K Views
6:21 PM · May 8, 2026 · 1K Views

@_arohan_ Ah! Kronecker-approximated and scaled eigenbasis […]

rohan anil@_arohan_

@andrew_n_carr Appendix B in Scalable second order optimization paper.

6:24 PM · May 8, 2026 · 797 Views
6:36 PM · May 8, 2026 · 252 Views

I wanted to say "I don't believe in this free-lunch BS," but their hypothesis is interesting. In effect, fine-grained sparsity (going to the extreme of tiny experts) might have been allowing a failure mode of Muon to stay hidden.

Tilde@tilderesearch

Introducing Aurora, a new optimizer for training frontier-scale models. […]

5:10 PM · May 8, 2026 · 284.1K Views
5:50 PM · May 8, 2026 · 5.8K Views

Architecture 🤝 Optimizer

Tilde@tilderesearch

Introducing Aurora, a new optimizer for training frontier-scale models. […]

5:10 PM · May 8, 2026 · 284.1K Views
8:53 PM · May 8, 2026 · 412 Views
