Nous Research releases Token Superposition Training with 2-3× speedup
Nous Research released Token Superposition Training (TST), a modification to the standard LLM pretraining loop. The method processes contiguous bags of tokens by averaging their embeddings for the first third of training, then switches to conventional next-token prediction. It produces a 2-3× wall-clock speedup at matched FLOPs without changes to model architecture, optimizer, tokenizer, or training data. The announcement was shared directly from the company's account.
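For intuition, here is a minimal sketch of the input-side averaging described above, in PyTorch. The bag size, dimensions, and names such as `bag_inputs` are illustrative assumptions, not code from the release.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only; the release does not specify these.
vocab_size, d_model, bag_size = 32_000, 512, 4
embed = nn.Embedding(vocab_size, d_model)

def bag_inputs(token_ids: torch.Tensor, k: int = bag_size) -> torch.Tensor:
    """Average the embeddings of each contiguous bag of k tokens.

    token_ids: [batch, seq_len], with seq_len divisible by k.
    Returns:   [batch, seq_len // k, d_model], a k-times shorter sequence
               that the transformer would consume during the bag phase.
    """
    b, t = token_ids.shape
    emb = embed(token_ids)             # [b, t, d_model]
    emb = emb.view(b, t // k, k, -1)   # group into contiguous bags of k tokens
    return emb.mean(dim=2)             # superpose each bag by mean-pooling

# A batch of 2 sequences of 16 tokens collapses to 4 bag embeddings each.
x = torch.randint(0, vocab_size, (2, 16))
print(bag_inputs(x).shape)  # torch.Size([2, 4, 512])
```

Because each bag collapses several positions into one, the bag phase processes a much shorter sequence per batch, which is presumably where the wall-clock savings come from.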
@itsclivetime basically better initial weights? feels like it's basically a warmup run of one query/key epoch over the whole dataset
this is quite remarkable
Check out our researchers' latest paper that introduces Superposition, a potential path to multiplying training speed during pre-training.
Today we release Token Superposition Training (TST), a modification to the standard LLM pretraining loop that produces a 2-3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data. During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, it trains normally on next-token prediction. The inference-time model is identical to one produced by conventional pretraining. Validated at 270M, 600M, and 3B dense scales, and at 10B-A1B MoE. The work on TST was led by @bloc97_, @gigant_theo, and @theemozilla.
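The post does not spell out the form of the "modified cross-entropy" on the output side. One plausible reading, sketched below, scores the prediction at each bag position against every token of the next bag by averaging their negative log-likelihoods; the names and shapes are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def next_bag_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """One guess at a bag-level cross-entropy.

    logits:     [batch, num_bags, vocab]  -- one prediction per bag position.
    target_ids: [batch, num_bags, k]      -- the k token ids of the next bag.
    Averages the negative log-likelihood over the k tokens of the target bag,
    i.e. cross-entropy against a uniform soft target over the bag's members.
    """
    logp = F.log_softmax(logits, dim=-1)             # [b, n, vocab]
    tok_logp = torch.gather(logp, -1, target_ids)    # [b, n, k]
    return -tok_logp.mean()

# Toy usage: 2 sequences, 4 bag positions, bags of 4 tokens, vocab of 1000.
logits = torch.randn(2, 4, 1000)
bags = torch.randint(0, 1000, (2, 4, 4))
print(next_bag_loss(logits, bags))
```

After the first third of the run this objective is dropped in favor of ordinary next-token cross-entropy, which is why the finished checkpoint is indistinguishable at inference time from a conventionally pretrained model.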
Awesome. LLM mixup weirdness.
Someone in the comments pointed out a similarity to this prior work, which I hadn't heard of before:
tl;dr: 3× fewer steps iso-data, by pre-pre-training on a new objective:
- a segment of 8 tokens (mean-pooled embeddings) ==> the next segment of 8 tokens (multi-hot cross-entropy)
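For comparison with the bag-level sketch above, this is roughly what a "multi-hot cross-entropy" over an 8-token segment could look like; it is a guess at the cited objective, not code from that paper.

```python
import torch
import torch.nn.functional as F

def multi_hot_ce(logits: torch.Tensor, segment_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a normalized multi-hot target built from the next
    8-token segment (duplicate tokens in the segment collapse to a single hot).

    logits:      [batch, vocab]
    segment_ids: [batch, 8]
    """
    target = torch.zeros_like(logits)
    target.scatter_(1, segment_ids, 1.0)                # mark the segment's tokens
    target = target / target.sum(dim=1, keepdim=True)   # normalize to a distribution
    return F.cross_entropy(logits, target)              # soft-label cross-entropy

# Toy usage with a vocab of 1000.
print(multi_hot_ce(torch.randn(2, 1000), torch.randint(0, 1000, (2, 8))))
```

Up to how duplicate tokens are handled, this is essentially the same computation as the next-bag loss sketched earlier, which is presumably the resemblance the commenter is pointing at.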