Teaching language models new languages without forgetting old ones

Zeli Su, Ziyin Zhang, Zhou Liu, Xuexian Song, Zhankai Xu, Longfei Zheng, Xiaolu Zhang, Rong Fu, Guixian Xu, Wentao Zhang

Extending LLMs to low-resource languages usually causes catastrophic forgetting of general skills—the "alignment tax." This work replaces supervised fine-tuning with Group Relative Policy Optimization (GRPO), using embedding-level semantic rewards rather than likelihood maximization. The approach encourages meaning preservation through flexible surface realizations, reducing interference with pretrained knowledge. Evaluated on Tibetan-Chinese translation and generation, the method acquires new language capabilities while better preserving general competence than standard fine-tuning, with improved semantic quality and few-shot transfer performance.