cs.AI

Teaching billion-parameter models new domains without forgetting old ones

Anurup Ganguli

May 14, 2026

Continually training LLMs on new data domains typically degrades performance on previously learned ones, a problem called catastrophic forgetting. Existing fixes require storing old data, knowing task boundaries, or applying expensive regularization that breaks down at scale. TFGN introduces a transformer overlay that routes parameter updates so that learning a new domain writes only to orthogonal subspaces, leaving prior-domain representations intact. Across six domains (prose, Python, math, biomedical, Chinese, JavaScript) at 1B tokens per phase, the method achieves ≥99.59% gradient separation between domains. Python training alone yields a 26.8% drop in held-out JavaScript perplexity, which is positive forward transfer with no replay. The method is tested at ~398M, ~739M, and ~9B parameter scales in both from-scratch and retrofit settings. Code availability is not mentioned.
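To make the idea of writing only to orthogonal subspaces concrete, here is a minimal NumPy sketch of generic orthogonal gradient projection. It is not the paper's overlay, whose routing mechanism is not described in this summary. The names `orthonormal_basis`, `project_out`, and `overlap` are hypothetical, the "protected" directions are random stand-ins for prior-domain directions, and `overlap` is an illustrative measure, not necessarily the gradient-separation metric the paper reports.

```python
import numpy as np


def orthonormal_basis(directions: np.ndarray) -> np.ndarray:
    """Orthonormal basis (as columns) for the span of the given column vectors."""
    q, _ = np.linalg.qr(directions)  # reduced QR: q has orthonormal columns
    return q


def project_out(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove the component of grad that lies inside the protected subspace."""
    return grad - basis @ (basis.T @ grad)


def overlap(grad: np.ndarray, basis: np.ndarray) -> float:
    """Fraction of the gradient norm that falls inside the protected subspace."""
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return 0.0
    return float(np.linalg.norm(basis.T @ grad) / norm)


# Assumption: random vectors stand in for directions used by an earlier domain.
rng = np.random.default_rng(0)
dim, k = 64, 8
protected = orthonormal_basis(rng.standard_normal((dim, k)))

# A raw gradient from the new domain generally overlaps the protected subspace.
raw_grad = rng.standard_normal(dim)

# After projection, the update is orthogonal to every protected direction.
safe_grad = project_out(raw_grad, protected)

print("overlap before:", overlap(raw_grad, protected))
print("overlap after: ", overlap(safe_grad, protected))
```

In this toy setting the projected update cannot move the parameters along any protected direction, which is the intuition behind leaving prior-domain representations intact. How the directions are chosen and how updates are routed at LLM scale is specific to the paper and not reproduced here.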
Published as "TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale" (arXiv:2605.15053)