cs.AI

Cutting MoE inference costs in half without retraining from scratch

Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou

May 18, 2026

Large language models built on mixture-of-experts (MoE) activate different expert networks for different tokens, so only part of the model runs for each token. The problem: existing dynamic MoE methods require retraining from scratch. This work introduces ZEDA, which takes a fully trained MoE model and converts it into a dynamic variant by adding dummy experts and fine-tuning via self-distillation. Result: a 50% reduction in expert computation across Qwen and GLM models, with a negligible accuracy drop and a 1.2× end-to-end speedup, beating prior dynamic methods by 4–6 points.
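To make the dummy-expert idea concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation. It assumes that a dummy expert is a router slot that outputs zero and costs no compute, and that routing weights are normalized over all selected slots, dummies included. The names top_k_slots and moe_token_forward are hypothetical.

```python
import math

def top_k_slots(logits, k):
    """Return the k highest-scoring slot indices and softmax weights over them."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in idx)
    exps = [math.exp(logits[i] - m) for i in idx]
    total = sum(exps)
    return idx, [e / total for e in exps]

def moe_token_forward(x, router_rows, experts, top_k):
    """Run one token through a toy MoE layer.

    router_rows[i] is the router weight vector for slot i. Slots with
    index >= len(experts) are dummy experts (assumption: zero output).
    Returns the layer output and the number of real experts executed.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_rows]
    idx, weights = top_k_slots(logits, top_k)
    out = [0.0] * len(x)
    real_calls = 0
    for slot, w in zip(idx, weights):
        if slot >= len(experts):
            continue  # dummy slot: contributes nothing, so the expert call is skipped
        y = experts[slot](x)
        out = [o + w * yi for o, yi in zip(out, y)]
        real_calls += 1
    return out, real_calls

# Two real experts plus two dummy slots, top-2 routing.
experts = [lambda v: [2.0 * a for a in v], lambda v: [-a for a in v]]
router_rows = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.0], [0.5, 0.0]]
out, real_calls = moe_token_forward([1.0, 1.0], router_rows, experts, top_k=2)
```

In this toy setup, every selected slot that lands on a dummy is one expert call saved, so the share of selections going to dummies is the share of expert computation skipped. How ZEDA trains the router to reach that share through self-distillation is not specified in this summary.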
Published as "Post-Trained MoE Can Skip Half Experts via Self-Distillation" (arXiv:2605.18643)