Why some GPUs always lag in mixture-of-experts models, and how to fix it

Distributed mixture-of-experts (MoE) inference suffers from stragglers: GPUs slowed by manufacturing variation or thermal throttling hold up synchronized execution, because every device must wait for the slowest one. Prior work balances token counts across GPUs but ignores that nominally identical hardware performs differently. ViBE profiles per-GPU throughput and per-expert load, then assigns compute-heavy experts to faster devices, reducing execution-time imbalance by 14% and tail latency by 45% without modifying the model.
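
The placement idea can be illustrated with a small sketch. This is not ViBE's actual algorithm, which the summary above does not spell out. It is a generic greedy heuristic (heaviest expert first, placed on the GPU that would finish soonest) under assumed inputs. The names `place_experts`, `expert_load`, `gpu_throughput`, and `slots_per_gpu` are hypothetical, as are the example numbers.

```python
def place_experts(expert_load, gpu_throughput, slots_per_gpu):
    """Greedy, throughput-aware expert placement (illustrative sketch only).

    expert_load:    profiled work per expert, e.g. tokens routed per step (assumed input)
    gpu_throughput: profiled relative speed per GPU, higher is faster (assumed input)
    slots_per_gpu:  maximum number of experts a single GPU can host (assumed limit)
    Returns (placement, step_time), where placement maps GPU index -> expert indices
    and step_time is the projected time of the slowest GPU.
    """
    n_gpus = len(gpu_throughput)
    if len(expert_load) > n_gpus * slots_per_gpu:
        raise ValueError("more experts than available slots")

    placement = {g: [] for g in range(n_gpus)}
    assigned = [0.0] * n_gpus  # total load currently assigned to each GPU

    # Visit experts from heaviest to lightest.
    by_load = sorted(range(len(expert_load)),
                     key=lambda e: expert_load[e], reverse=True)
    for e in by_load:
        # Only GPUs with a free slot are candidates.
        open_gpus = [g for g in range(n_gpus)
                     if len(placement[g]) < slots_per_gpu]
        # Pick the GPU whose projected time (load / throughput) stays lowest.
        best = min(open_gpus,
                   key=lambda g: (assigned[g] + expert_load[e]) / gpu_throughput[g])
        placement[best].append(e)
        assigned[best] += expert_load[e]

    # Synchronized execution finishes only when the slowest GPU finishes.
    step_time = max(assigned[g] / gpu_throughput[g] for g in range(n_gpus))
    return placement, step_time


# Hypothetical profile: 8 experts, 4 nominally identical GPUs that differ in speed.
expert_load = [900, 500, 300, 300, 200, 100, 100, 50]
gpu_throughput = [1.0, 0.9, 1.1, 0.8]
placement, step_time = place_experts(expert_load, gpu_throughput, slots_per_gpu=2)
print(placement, round(step_time, 1))
```

In this example the heaviest expert lands on the fastest GPU, and the projected step time is lower than a speed-blind round-robin placement of the same experts would give. A real system would also have to account for communication cost and for loads that shift over time, which this sketch ignores.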