Cutting MoE inference costs in half without retraining from scratch
Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou
May 18, 2026
Mixture-of-experts (MoE) large language models activate different expert networks for different tokens, which keeps per-token compute well below that of the full model. The problem is that existing dynamic MoE methods require retraining from scratch. This work introduces ZEDA, which takes a fully trained MoE model and converts it into a dynamic variant by adding dummy experts and fine-tuning via self-distillation. The result is a 50% reduction in expert computation across Qwen and GLM models with a negligible accuracy drop and a 1.2× end-to-end speedup, outperforming prior dynamic methods by 4–6 points.
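To make the dummy-expert idea concrete, here is a minimal sketch of how routing over real and dummy experts could reduce expert computation. It is not ZEDA's implementation: the summary does not say how dummy experts behave, so the sketch assumes a dummy expert outputs zero and can be skipped entirely. The names `route_token` and `real_expert_fraction`, the top-k selection rule, and the example logits are all hypothetical, and the self-distillation fine-tuning step is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(logits, num_real, top_k):
    """Select top_k experts among real + dummy ones for a single token.

    Indices below num_real are real experts; the rest are dummies.
    Assumption (not from the source): a dummy expert contributes zero
    output, so selecting it costs no expert computation.
    Returns the selected real expert ids and their router weights.
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    real = [i for i in chosen if i < num_real]
    return real, [probs[i] for i in real]


def real_expert_fraction(all_logits, num_real, top_k):
    """Fraction of the top_k slots, over all tokens, that hit a real expert."""
    used = sum(len(route_token(l, num_real, top_k)[0]) for l in all_logits)
    return used / (len(all_logits) * top_k)


# Two tokens, 2 real experts (ids 0-1) and 2 dummy experts (ids 2-3), top-2.
token_logits = [
    [5.0, 1.0, 4.0, 0.0],  # picks expert 0 and dummy 2: one real expert runs
    [5.0, 4.0, 1.0, 0.0],  # picks experts 0 and 1: two real experts run
]
fraction = real_expert_fraction(token_logits, num_real=2, top_k=2)  # 0.75
```

In this toy case three of the four routing slots land on real experts, so expert computation falls to 75% of the static top-2 cost. Under the same assumption, the reported 50% reduction would correspond to about half of the slots landing on dummy experts on average.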