Sakana AI and NVIDIA release TwELL sparse format for ICML 2026
Sakana AI and NVIDIA released an ICML 2026 paper introducing TwELL, a tile-wise ELLPACK sparse format with fused CUDA kernels optimized for NVIDIA GPUs. TwELL targets the natural sparsity in transformer feedforward layers, routing highly sparse tokens through a fast path and falling back to a dense matmul for the rare heavy tokens. The approach yields over 20% faster LLM training and inference on H100 GPUs, plus savings in peak memory and energy, without changing model architecture.
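The natural sparsity behind these numbers is easy to picture. Below is a minimal NumPy sketch of how per-token activation sparsity can be measured; the negative shift on the pre-activations is an illustrative stand-in for the L1-regularized training that produces the effect in the paper:

```python
import numpy as np

# Hypothetical pre-activations for 1024 tokens and 4096 hidden units.
# The -2.0 shift is illustrative only: in the paper, ReLU plus mild L1
# regularization is what drives most activations to zero.
rng = np.random.default_rng(0)
pre_acts = rng.normal(loc=-2.0, scale=1.0, size=(1024, 4096))
hidden = np.maximum(pre_acts, 0.0)             # ReLU
sparsity = (hidden == 0).mean()                # fraction of zeroed units
print(f"activation sparsity: {sparsity:.1%}")  # ~97.7% with this shift
```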
AI 1000 · 10 actions
- POSTSA#58@SAKANAAILABS How do we make LLMs faster and lighter? Don’t force the GPU to adapt to sparsity. Reshape the sparsity to fit the GPU! ⚡️
  Excited to share our new #ICML2026 paper in collaboration with @NVIDIA: "Sparser, Faster, Lighter Transformer Language Models". This work introduces new open-source GPU kernels and data formats for faster inference and training of sparse transformer language models.
  Paper: https://arxiv.org/abs/2603.23198
  Blog: https://pub.sakana.ai/sparser-faster-llms/
  Code: https://github.com/SakanaAI/sparser-faster-llms
  While LLMs are undoubtedly powerful, they are increasingly expensive to train and deploy, with a large part of this cost coming from their feedforward layers. Yet an interesting phenomenon occurs inside these layers: for any given token, only a small fraction of the hidden activations actually matter. The rest are approximately zero, wasting computation. With ReLU and very mild L1 regularization, this sparsity can exceed 95% with little to no impact on downstream performance.
  So, can we leverage this sparsity to make LLMs faster? The challenge is hardware. Modern GPUs are optimized for dense matrix multiplications, and traditional sparse formats introduce irregular memory access and overheads that cancel out their theoretical savings for GEMM operations.
  Our contribution is twofold:
  1/ We introduce TwELL (Tile-wise ELLPACK), a new sparse packing format designed to integrate directly into the same optimized tiled matmul kernels used for dense GEMM, without disrupting their execution.
  2/ We develop custom CUDA kernels that fuse multiple sparse matmuls to maximize throughput, and compress TwELL into a hybrid representation that minimizes activation sizes.
  We used our kernels to train and benchmark sparse LLMs at billion-parameter scale, demonstrating >20% speedups and even higher savings in peak memory and energy. This work will be presented at #ICML2026. Please check out our blog and technical paper for a deep dive!
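The post doesn't spell out TwELL's memory layout, but the name suggests the core idea: classic ELLPACK pads every row to the global maximum nonzero count, whereas a tile-wise variant pads only to each tile's local maximum, keeping accesses regular at the granularity tiled matmul kernels actually work in. A minimal NumPy sketch of that packing idea, with hypothetical names (`pack_twell`, `tile_rows`), not the released API:

```python
import numpy as np

def pack_twell(acts: np.ndarray, tile_rows: int = 16):
    """Sketch of tile-wise ELLPACK packing (hypothetical layout).

    Each tile of `tile_rows` rows is padded to that tile's maximum
    nonzero count, so padding cost is local to a tile rather than set
    by the densest row in the whole matrix, as in classic ELLPACK.
    Returns one (values, column_indices) pair per tile.
    """
    tiles = []
    for start in range(0, acts.shape[0], tile_rows):
        tile = acts[start:start + tile_rows]
        width = int((tile != 0).sum(axis=1).max())  # tile-local max nnz
        vals = np.zeros((tile.shape[0], width), dtype=tile.dtype)
        cols = np.zeros((tile.shape[0], width), dtype=np.int32)
        for r, row in enumerate(tile):
            idx = np.nonzero(row)[0]
            vals[r, :len(idx)] = row[idx]
            cols[r, :len(idx)] = idx
        tiles.append((vals, cols))
    return tiles
```

Padding then scales with per-tile variation in nonzero counts instead of the global worst case, which is what would let the format slot into tiled GEMM without irregular gathers across tiles.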
- QUOTEHA#19@HARDMARU@SAKANAAILABS The human brain🧠 is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LLMs naturally try to do this too (>95% of neurons in feedforward layers stay silent for any given word), but our hardware punishes them for it.
  One of the most frustrating paradoxes in deep learning: making a model do less math often makes it run slower. Why? Because unstructured sparsity introduces irregular memory access, and GPUs are built for predictable, dense blocks of math.
  We teamed up with @NVIDIA to try to fix this hardware mismatch. Instead of forcing the GPU to adapt to the sparsity, we built a "Hybrid" format that reshapes the sparsity to fit the GPU. Our sparsity format (TwELL) dynamically routes the 99% of highly sparse tokens through a fast path, and uses a dense backup matrix as a safety valve for the rare, heavy tokens.
  Through TwELL and a new set of custom CUDA kernels for both LLM inference and training, we translated theoretical sparsity into actual wall-clock speedups: >20% faster training and inference on H100 GPUs, while also cutting energy consumption and memory requirements.
  Paper: https://arxiv.org/abs/2603.23198
  Blog: https://pub.sakana.ai/sparser-faster-llms/
  Code: https://github.com/SakanaAI/sparser-faster-llms ⚡️
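The "safety valve" routing described above fits in a few lines. This is a hypothetical NumPy illustration of the control flow, not the released kernels; `hybrid_ffn` and the `max_nnz` threshold are made up for the example:

```python
import numpy as np

def hybrid_ffn(x: np.ndarray, w: np.ndarray, max_nnz: int = 64):
    """Sketch of hybrid routing: sparse fast path + dense backup.

    Tokens whose activation rows have at most `max_nnz` nonzeros take
    a gather-based fast path; the rare heavy tokens fall back to a
    plain dense matmul. Threshold and names are illustrative.
    """
    nnz = (x != 0).sum(axis=1)
    sparse_mask = nnz <= max_nnz
    out = np.empty((x.shape[0], w.shape[1]), dtype=x.dtype)
    # Fast path: read only the rows of w that each token activates.
    for i in np.nonzero(sparse_mask)[0]:
        idx = np.nonzero(x[i])[0]
        out[i] = x[i, idx] @ w[idx]
    # Dense backup matrix multiply for the heavy tokens.
    heavy = ~sparse_mask
    if heavy.any():
        out[heavy] = x[heavy] @ w
    return out
```

Because almost all tokens take the fast path, the dense fallback costs little on average while bounding the worst case, matching the 99%/1% split described in the post.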
- QUOTESA#58@SAKANAAILABS@SAKANAAILABS In joint research with @NVIDIA, Sakana AI has developed new GPU kernels and data formats that accelerate inference and training of sparse Transformer language models.
  Blog: https://pub.sakana.ai/sparser-faster-llms/
  In the feedforward layers, which account for a large share of an LLM's cost, most activations for any given token are in fact nearly zero, wasting computation. Combining ReLU with mild L1 regularization pushes sparsity above 95% with almost no loss in performance. Modern GPUs, however, are optimized for dense matrix multiplication, and conventional sparse formats suffer irregular memory access that cancels out the theoretical speedup.
  We therefore devised:
  ① TwELL (Tile-wise ELLPACK), a new sparse storage format that plugs directly into optimized tiled matmul kernels, and
  ② custom CUDA kernels that fuse multiple sparse matmuls to maximize throughput.
  Training and evaluating sparse LLMs at billion-parameter scale, we achieved speedups of over 20% together with large reductions in peak memory and power consumption. This work will be presented at #ICML2026. Please see the blog and paper for details.
  Paper: https://arxiv.org/abs/2603.23198
  GitHub: https://github.com/SakanaAI/sparser-faster-llms
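Both announcements mention fusing multiple sparse matmuls into one kernel. A rough NumPy sketch of the arithmetic being fused, for intuition only (the real implementation is fused CUDA; `fused_sparse_ffn` is a hypothetical name):

```python
import numpy as np

def fused_sparse_ffn(x: np.ndarray, w_in: np.ndarray, w_out: np.ndarray):
    """Sketch of why fusing the two FFN matmuls pays off under sparsity.

    ReLU zeroes most hidden units, so the down-projection only needs
    the rows of w_out matching each token's active units; a fused
    kernel can share that active-index set across both matmuls.
    """
    h = np.maximum(x @ w_in, 0.0)            # up-projection + ReLU
    out = np.zeros((x.shape[0], w_out.shape[1]))
    for i in range(x.shape[0]):
        idx = np.nonzero(h[i])[0]            # active hidden units only
        out[i] = h[i, idx] @ w_out[idx]      # touch a fraction of w_out
    return out
```

Only the rows of w_out for active units are ever read, which is where the memory-traffic and energy savings the posts cite would come from.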
- REPLYHA#19@HARDMARU@HARDMARUIf you want to look under the hood at the actual custom CUDA kernels and see exactly how we implemented the TwELL format for H100 GPUs, we’ve released the reference code. GitHub: https://github.com/SakanaAI/sparser-faster-llms Blog: https://pub.sakana.ai/sparser-faster-llms/ 🐟
- REPOSTHA#19@HARDMARU@SAKANAAILABS (repost of the @SakanaAILabs announcement above)
- REPOSTHA#19@HARDMARU@NVIDIAAI Great collab with @SakanaAILabs on an #ICML26 paper about sparse transformer kernels + formats optimized for modern NVIDIA GPU execution.
  • TwELL sparse packing
  • Fused CUDA kernels
  • 20%+ inference/training speedups at scale
  Paper + code below 👇 https://twitter.com/hardmaru/status/2052787980344099293
- REPOSTHA#19@HARDMARU@SAKANAAILABS (repost of the @SakanaAILabs Japanese-language announcement above)
- REPOSTEG#97@ELADGIL@HARDMARU (repost of @hardmaru's quote post above)
