Cutting MoE inference costs in half without retraining from scratch

Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou

Large language models using mixture-of-experts (MoE) activate different expert networks for different tokens, a neat efficiency trick. The problem: existing methods for making MoE models dynamic require retraining from scratch. This work introduces ZEDA, which takes a fully trained MoE model and converts it into a dynamic variant by adding dummy experts and fine-tuning via self-distillation. Result: a 50% reduction in expert computation across Qwen and GLM models, with a negligible accuracy drop and a 1.2× end-to-end speedup, beating prior dynamic methods by 4–6 points.
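To make the dummy-expert idea concrete, here is a minimal sketch of top-k routing over a pool that mixes real and dummy experts. It is an illustration under stated assumptions, not ZEDA's implementation: the function names (`route_token`, `moe_forward`), the convention that dummy experts occupy the trailing router slots, the treatment of a dummy expert as contributing zero output, and the choice not to renormalize the remaining weights are all hypothetical. The summary above does not specify the router, the number of dummy experts, or the distillation procedure.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(logits, num_real, top_k):
    """Select top_k slots by router probability and keep only the real experts.

    Assumption: the first `num_real` logits belong to real experts and the
    remaining ones to dummy experts. A selected dummy slot is simply dropped,
    so it costs no expert computation.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    return [(i, probs[i]) for i in chosen if i < num_real]

def moe_forward(x, logits, experts, top_k):
    """Weighted sum of the selected real experts' outputs for one token.

    Assumption: weights are the raw router probabilities, not renormalized
    after dummy slots are dropped.
    Returns the output vector and the number of real experts actually run.
    """
    active = route_token(logits, len(experts), top_k)
    out = [0.0] * len(x)
    for idx, weight in active:
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, len(active)
```

In this sketch, a token whose top-k selection includes dummy slots runs fewer real experts than a token whose selection is all real, which is the sense in which the per-token expert computation becomes dynamic.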