Teaching billion-parameter models new domains without forgetting old ones
Anurup Ganguli
May 14, 2026
Continually training LLMs on new data domains typically degrades performance on previously learned ones, a problem known as catastrophic forgetting. Existing fixes require storing old data, knowing task boundaries, or applying expensive regularization that breaks down at scale. TFGP introduces a transformer overlay that routes parameter updates so that learning a new domain writes only to orthogonal subspaces, leaving prior-domain representations intact. Across six domains (prose, Python, math, biomedical, Chinese, JavaScript) at 1B tokens per phase, the method achieves ≥99.59% gradient separation between domains. Python training alone lowers held-out JavaScript perplexity by 26.8%, indicating positive forward transfer with no replay. The method was tested at ~398M, ~739M, and ~9B parameter scales in both from-scratch and retrofit settings; code availability is not mentioned.
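To make the idea of writing only to orthogonal subspaces concrete, here is a minimal sketch of generic orthogonal gradient projection on a toy 4-dimensional parameter vector. It is not the TFGP overlay, whose routing mechanism is not described here; the function names, the toy vectors, and the use of Gram-Schmidt are all assumptions chosen for illustration.

```python
# Illustrative sketch only: generic orthogonal gradient projection.
# This is NOT the TFGP overlay; all names and values are hypothetical.

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt: return an orthonormal basis for span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def project_out(grad, basis):
    """Remove from grad every component that lies in span(basis)."""
    g = list(grad)
    for b in basis:
        c = dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g

# Hypothetical directions that a previously learned domain depends on.
prior_directions = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
basis = orthonormal_basis(prior_directions)

# Hypothetical gradient from the new domain.
new_grad = [0.5, -2.0, 3.0, 1.0]

# The update actually applied has no component along the prior subspace.
safe_grad = project_out(new_grad, basis)
```

In this toy setting, `safe_grad` keeps only the last two coordinates of `new_grad`, so a step along it leaves the prior domain's directions unchanged. How a real model at the scales above would identify those directions is exactly what the paper's method addresses and is not captured by this sketch.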