Perplexity AI deploys Qwen3 235B model on NVIDIA GB200 racks
Perplexity AI published research on a disaggregated prefill-decode architecture for serving its post-trained Qwen3 235B Mixture-of-Experts model on NVIDIA GB200 NVL72 Blackwell racks. Separate GPU nodes handle the prefill and decode stages, connected by NVLink within nodes and InfiniBand between nodes. The company reports higher inference throughput than the prior Hopper generation at equivalent accuracy and has placed the model into production on the new racks, citing the throughput gains and lower cost.
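To make the prefill/decode split concrete, here is a minimal, illustrative sketch in plain Python. The names (PrefillWorker, DecodeWorker, KVCache, transfer) are hypothetical stand-ins for the GPU-node roles described above, not Perplexity's implementation: one worker runs the prompt in a single pass and emits a KV cache, the cache is handed off (in the real system over NVLink or InfiniBand), and a separate worker generates tokens against it.

```python
# Illustrative sketch of prefill/decode disaggregation with a toy "model".
# All class and function names here are hypothetical, not Perplexity's stack.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Per-request KV cache produced by prefill and consumed by decode.

    In the real system this lives in GPU memory and moves over NVLink /
    InfiniBand; here it is just a list of per-token states.
    """
    request_id: int
    states: list = field(default_factory=list)


class PrefillWorker:
    """Runs the full prompt in one pass and emits the KV cache."""

    def process(self, request_id: int, prompt_tokens: list) -> KVCache:
        cache = KVCache(request_id=request_id)
        for tok in prompt_tokens:
            # Stand-in for attention layers writing K/V for this token.
            cache.states.append(hash(tok) % 997)
        return cache


class DecodeWorker:
    """Generates tokens one at a time against a transferred KV cache."""

    def generate(self, cache: KVCache, max_new_tokens: int) -> list:
        output = []
        for _ in range(max_new_tokens):
            # Stand-in for one decode step that reads the whole cache.
            next_state = sum(cache.states) % 997
            cache.states.append(next_state)  # decode extends the cache
            output.append(next_state)
        return output


def transfer(cache: KVCache) -> KVCache:
    """Stand-in for moving the KV cache from the prefill node to the decode
    node (NVLink within a node, InfiniBand between nodes)."""
    return KVCache(request_id=cache.request_id, states=list(cache.states))


if __name__ == "__main__":
    prefill, decode = PrefillWorker(), DecodeWorker()
    cache = prefill.process(request_id=1, prompt_tokens=["Explain", "GB200"])
    tokens = decode.generate(transfer(cache), max_new_tokens=4)
    print(f"request 1 generated {len(tokens)} tokens")
```

The point of the split is that prefill is compute-bound and batch-friendly while decode is latency- and memory-bandwidth-bound, so dedicating separate nodes to each lets both be scaled and scheduled independently.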
GB200s change how prefill and decode disaggregation is done when serving large MoEs like Qwen. We've published details of our stack quantifying the throughput benefits compared to serving on Hoppers.
We published new research on how we serve post-trained Qwen3 235B models on NVIDIA GB200 NVL72 Blackwell racks. GB200 is a major step up over Hopper for high-throughput inference on large MoE models, not just a training platform.
Perplexity should keep publishing like this. Good move.