
Sakana AI and NVIDIA release TwELL sparse format for ICML 2026


Sakana AI and NVIDIA released an ICML 2026 paper introducing TwELL, a tile-wise ELLPACK sparse format with fused CUDA kernels optimized for NVIDIA GPUs. TwELL targets the natural sparsity in transformer feedforward layers, packing sparse activations into a regular, GPU-friendly layout rather than forcing irregular sparse access patterns onto dense hardware. The approach yields more than 20% faster LLM training on NVIDIA GPUs and improves inference speed without changing the model architecture.
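The sparsity phenomenon the paper exploits can be illustrated with a minimal NumPy sketch (an illustration only, not the authors' code; all sizes and names here are made up): after a ReLU, many feedforward hidden activations are exactly zero. Even random weights zero out roughly half; the paper reports that ReLU plus mild L1 regularization pushes this past 95% in trained models.

```python
import numpy as np

# Toy feedforward layer: count how many hidden activations a ReLU
# zeroes out. Dimensions are arbitrary placeholders, not from the paper.
rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 64, 256, 8

x = rng.standard_normal((n_tokens, d_model))
w1 = rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model)

h = np.maximum(x @ w1, 0.0)   # ReLU hidden activations
sparsity = np.mean(h == 0.0)  # fraction of exact zeros (~50% here)
print(f"activation sparsity: {sparsity:.0%}")
```

Every one of those zeros is a multiply-add a dense GEMM still performs; the paper's formats and kernels aim to skip that wasted work without losing GPU efficiency.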

Original post

How do we make LLMs faster and lighter? Don’t force the GPU to adapt to sparsity. Reshape the sparsity to fit the GPU! ⚡️

Excited to share our new #ICML2026 paper in collaboration with @NVIDIA: "Sparser, Faster, Lighter Transformer Language Models". This work introduces new open-source GPU kernels and data formats for faster inference and training of sparse transformer language models:

Paper: https://arxiv.org/abs/2603.23198
Blog: https://pub.sakana.ai/sparser-faster-llms/
Code: https://github.com/SakanaAI/sparser-faster-llms

While LLMs are undoubtedly powerful, they are increasingly expensive to train and deploy, with a large part of this cost coming from their feedforward layers. Yet an interesting phenomenon occurs inside these layers: for any given token, only a small fraction of the hidden activations actually matter. The rest approximate zero, wasting computation. With ReLU and very mild L1 regularization, this sparsity can exceed 95% with little to no impact on downstream performance.

So, can we leverage this sparsity to make LLMs faster? The challenge is hardware. Modern GPUs are optimized for dense matrix multiplications, and traditional sparse formats introduce irregular memory access and overheads that cancel out their theoretical savings for GEMM operations.

Our contribution is twofold:

1/ We introduce TwELL (Tile-wise ELLPACK), a new sparse packing format designed to integrate directly into the same optimized tiled matmul kernels without disrupting execution.

2/ We develop custom CUDA kernels that fuse multiple sparse matmuls to maximize throughput, and compress TwELL into a hybrid representation that minimizes activation sizes.

We used our kernels to train and benchmark sparse LLMs at billion-parameter scales, demonstrating >20% speedups and even higher savings in peak memory and energy.

This work will be presented at #ICML2026. Please check out our blog and technical paper for a deep dive!
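The ELLPACK idea underlying TwELL can be sketched in a few lines of NumPy (a toy illustration of generic ELLPACK packing, not the actual TwELL tile layout or CUDA kernels; the function names are hypothetical): each row keeps a fixed number of value/column-index slots, padded with zeros, so the data stays rectangular and regular enough to feed a tiled matmul kernel.

```python
import numpy as np

def ellpack_pack(m: np.ndarray):
    """Pack a sparse matrix into ELLPACK-style (values, column-index) arrays.

    Every row gets the same number of slots (the max row nnz), padded
    with zero values and index 0, so the result is a dense rectangle.
    """
    k = int((m != 0).sum(axis=1).max())          # slots per row
    vals = np.zeros((m.shape[0], k), dtype=m.dtype)
    cols = np.zeros((m.shape[0], k), dtype=np.int64)
    for i, row in enumerate(m):
        nz = np.flatnonzero(row)
        vals[i, :nz.size] = row[nz]
        cols[i, :nz.size] = nz
    return vals, cols

def ellpack_matvec(vals, cols, x):
    # y[i] = sum_j vals[i, j] * x[cols[i, j]]; padded slots contribute 0.
    return (vals * x[cols]).sum(axis=1)

m = np.array([[0., 2., 0., 0.],
              [1., 0., 0., 3.],
              [0., 0., 0., 0.]])
x = np.array([1., 1., 1., 1.])
vals, cols = ellpack_pack(m)
y = ellpack_matvec(vals, cols, x)   # matches m @ x
```

The regularity is the point: because every row has the same slot count, memory accesses are coalesced and the packed arrays drop into tiled kernels. Applying the packing per tile rather than per whole matrix (the "tile-wise" part of TwELL, as the post describes it) bounds the padding waste to each tile's local maximum.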

9:26 AM · May 8, 2026

