
Perplexity AI deploys Qwen3 235B model on NVIDIA GB200 racks


Perplexity AI has published research on a disaggregated prefill-decode architecture for serving its post-trained Qwen3 235B Mixture-of-Experts model on NVIDIA GB200 NVL72 Blackwell racks. Separate GPU nodes handle the prefill and decode stages, connected by NVLink within a node and InfiniBand across nodes. The company reports higher inference throughput than on the prior Hopper generation at equivalent accuracy, and the model is now in production on the new racks at increased throughput and lower cost.
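The disaggregated pattern described above can be illustrated with a toy Python sketch, assuming the usual split: a prefill worker runs one pass over the full prompt and produces the KV cache, which is then handed to a separate decode worker that generates tokens one at a time. All class and function names here are hypothetical illustrations, not Perplexity's actual stack, and the cache handoff stands in for the real NVLink/InfiniBand transfer.

```python
# Toy sketch of disaggregated prefill/decode serving.
# Illustrative only: all names are hypothetical, not Perplexity's code.
from dataclasses import dataclass


@dataclass
class KVCache:
    # Stands in for the per-layer key/value tensors built during prefill.
    tokens: list


class PrefillWorker:
    # Processes the whole prompt in one pass and emits the KV cache.
    def prefill(self, prompt: list) -> KVCache:
        return KVCache(tokens=list(prompt))


class DecodeWorker:
    # Generates tokens incrementally, reusing the transferred KV cache.
    def decode(self, cache: KVCache, max_new_tokens: int) -> list:
        out = []
        for i in range(max_new_tokens):
            tok = f"<tok{i}>"         # placeholder for real sampling
            cache.tokens.append(tok)  # cache grows as decode proceeds
            out.append(tok)
        return out


def serve(prompt, max_new_tokens=4):
    # In a real deployment the cache would move between GPU nodes over
    # NVLink/InfiniBand; here it is simply passed between objects.
    cache = PrefillWorker().prefill(prompt)
    return DecodeWorker().decode(cache, max_new_tokens)


print(serve(["Hello", "world"]))
```

The point of the split is that prefill is compute-bound (one large batched pass) while decode is memory-bandwidth-bound (many small steps), so running them on separate nodes lets each pool be sized and batched independently.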

Original post

We published new research on how we serve post-trained Qwen3 235B models on NVIDIA GB200 NVL72 Blackwell racks. GB200 is a major step up over Hopper for high-throughput inference on large MoE models, not just a training platform.

7:17 AM · May 12, 2026
Reposted by

GB200s change how one does prefill and decode disaggregation when serving large MoEs like Qwen. We've published details of our stack quantifying the throughput benefits compared to serving on Hopper.

2:27 PM · May 12, 2026 · 28.7K Views

Perplexity should keep publishing like this. Good move.

9:43 PM · May 12, 2026 · 18.4K Views