cs.LG

Why some GPUs always lag in expert models—and how to fix it

Seokjin Go, Marko Scrbak, Ephrem Wu, Srilatha Manne, Divya Mahajan

May 30, 2026

Distributed Mixture-of-Experts (MoE) inference suffers from stragglers when slower GPUs, whether slowed by manufacturing variation or thermal throttling, hold up synchronized execution. Prior work balances token counts across GPUs but ignores that nominally identical hardware performs differently. ViBE profiles per-GPU throughput and per-expert load, then assigns compute-heavy experts to faster devices, reducing execution-time imbalance by 14% and tail latency by 45% without modifying the model.
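To make the placement idea concrete, here is a minimal Python sketch of a variability-aware assignment. It is not ViBE's actual algorithm, which this summary does not spell out. It assumes a simple greedy heuristic: place the heaviest experts first, each on the GPU that would finish its assigned work soonest given its profiled speed. The names `place_experts`, `step_times`, `expert_load`, `gpu_throughput`, and `slots_per_gpu` are hypothetical, as are the units.

```python
from typing import Dict, List


def place_experts(
    expert_load: Dict[int, float],
    gpu_throughput: List[float],
    slots_per_gpu: int,
) -> Dict[int, int]:
    """Greedy, speed-aware expert placement (illustrative assumption only).

    expert_load:    profiled work per expert, e.g. tokens routed per step.
    gpu_throughput: profiled speed of each GPU, in work units per millisecond.
    slots_per_gpu:  maximum number of experts one GPU may host.
    Returns a mapping from expert id to GPU index.
    """
    num_gpus = len(gpu_throughput)
    if len(expert_load) > slots_per_gpu * num_gpus:
        raise ValueError("not enough expert slots for all experts")

    assigned_work = [0.0] * num_gpus
    used_slots = [0] * num_gpus
    placement: Dict[int, int] = {}

    # Heaviest experts first, so the large items go where they hurt least.
    by_load = sorted(expert_load.items(), key=lambda kv: kv[1], reverse=True)
    for expert, load in by_load:
        candidates = [g for g in range(num_gpus) if used_slots[g] < slots_per_gpu]
        # Pick the GPU whose estimated finish time stays lowest after adding
        # this expert; dividing by throughput is what makes it speed-aware.
        best = min(
            candidates,
            key=lambda g: (assigned_work[g] + load) / gpu_throughput[g],
        )
        placement[expert] = best
        assigned_work[best] += load
        used_slots[best] += 1
    return placement


def step_times(
    placement: Dict[int, int],
    expert_load: Dict[int, float],
    gpu_throughput: List[float],
) -> List[float]:
    """Estimated per-GPU time for one synchronized step, in milliseconds."""
    times = [0.0] * len(gpu_throughput)
    for expert, gpu in placement.items():
        times[gpu] += expert_load[expert] / gpu_throughput[gpu]
    return times


if __name__ == "__main__":
    loads = {0: 400.0, 1: 300.0, 2: 200.0, 3: 100.0}
    speeds = [1.0, 0.8]  # second GPU assumed 20% slower
    plan = place_experts(loads, speeds, slots_per_gpu=2)
    print(plan, step_times(plan, loads, speeds))
```

In this toy setup, splitting the token counts evenly (500 units per GPU) leaves the slower GPU as the straggler at about 625 ms, while the speed-aware plan gives the faster GPU 600 units and finishes the step in about 600 ms. A real system would also have to account for communication cost and memory limits, which the sketch ignores.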
Published as "ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving", arXiv:2606.00735