Why some GPUs always lag in expert models—and how to fix it
Seokjin Go, Marko Scrbak, Ephrem Wu, Srilatha Manne, Divya Mahajan
May 30, 2026
Distributed mixture-of-experts (MoE) inference suffers from stragglers: GPUs slowed by manufacturing variation or thermal throttling hold up every synchronized step. Prior work balances token counts across GPUs but overlooks that nominally identical hardware performs differently. ViBE profiles per-GPU throughput and per-expert load, then assigns compute-heavy experts to faster devices, reducing execution-time imbalance by 14% and tail latency by 45% without modifying the models.
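To make the placement idea concrete, here is a minimal sketch of speed-aware expert assignment. It is not ViBE's actual algorithm, which this summary does not spell out. It is a simple greedy heuristic under assumed inputs: `expert_load` (profiled tokens per expert), `gpu_speed` (relative throughput per GPU), and `experts_per_gpu` (a slot limit) are all hypothetical names chosen for illustration.

```python
def assign_experts(expert_load, gpu_speed, experts_per_gpu):
    """Greedy, speed-aware expert placement (illustrative heuristic only).

    expert_load     -- profiled work per expert, e.g. tokens routed to it
    gpu_speed       -- relative throughput per GPU (1.0 = nominal speed)
    experts_per_gpu -- maximum number of experts a single GPU may host
    """
    n_gpus = len(gpu_speed)
    if len(expert_load) > n_gpus * experts_per_gpu:
        raise ValueError("not enough expert slots for all experts")

    placement = {g: [] for g in range(n_gpus)}
    work = [0.0] * n_gpus  # total load assigned to each GPU so far

    # Place the heaviest experts first; each goes to the GPU with a free
    # slot whose estimated finish time after adding it is smallest.
    order = sorted(range(len(expert_load)), key=lambda i: -expert_load[i])
    for e in order:
        free = [g for g in range(n_gpus) if len(placement[g]) < experts_per_gpu]
        best = min(free, key=lambda g: (work[g] + expert_load[e]) / gpu_speed[g])
        placement[best].append(e)
        work[best] += expert_load[e]

    # Estimated per-GPU execution time = assigned load / throughput.
    est_time = [work[g] / gpu_speed[g] for g in range(n_gpus)]
    return placement, est_time


# Hypothetical profile: four experts, two GPUs, the second one 20% slower.
placement, est_time = assign_experts(
    expert_load=[400, 300, 200, 100],
    gpu_speed=[1.0, 0.8],
    experts_per_gpu=2,
)
print(placement)
print(est_time)
```

With this made-up profile, the sketch places experts 0 and 2 on the faster GPU and experts 1 and 3 on the slower one, for estimated times of about 600 and 500 units. A token-balanced split of 500 tokens per GPU would instead finish in 500 and 625 units, so the slower GPU would become the straggler.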