Self-Distillation Enables Continual Learning [PDF]
The paper introduces Self-Distillation Fine-Tuning (SDFT), a method enabling models to learn continuously from expert demonstrations without forgetting prior skills. SDFT uses in-context learning by treating a model as its own teacher, generating on-policy training signals from demonstrations. The approach outperforms supervised fine-tuning in both skill acquisition and retention across sequential learning tasks.
- Self-Distillation Fine-Tuning (SDFT) enables on-policy learning directly from expert demonstrations.
- SDFT reduces catastrophic forgetting while improving accuracy on new tasks compared to supervised fine-tuning.
- The method leverages in-context learning, using the model as its own teacher to generate training signals.
- Experiments show SDFT allows a single model to accumulate multiple skills over time without performance decline.
- SDFT establishes on-policy distillation as a practical approach for continual learning from demonstrations.
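The points above describe a training signal in which the same model plays two roles: a teacher that sees the expert demonstration in its context, and a student that does not, with the student trained on its own samples. The toy sketch below illustrates that idea only. It is not the paper's implementation: `toy_logits` is a hypothetical stand-in for a language model forward pass, the vocabulary and token values are made up, and the per-token reverse KL divergence is an assumed choice of objective that the summary does not specify.

```python
import math
import random

VOCAB_SIZE = 4  # hypothetical toy vocabulary of token ids 0..3


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(prompt, prefix, demonstration=None):
    """Hypothetical stand-in for one model forward pass.

    Without a demonstration this is the 'student' view. With a
    demonstration in context, the same function acts as the 'teacher':
    the demonstration shifts probability toward the expert's token.
    """
    logits = [0.0] * VOCAB_SIZE
    last = prefix[-1] if prefix else prompt[-1]
    logits[(last + 1) % VOCAB_SIZE] += 1.0  # the model's prior preference
    if demonstration is not None and len(prefix) < len(demonstration):
        logits[demonstration[len(prefix)]] += 2.0  # in-context influence
    return logits


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) at a single token position."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def self_distillation_loss(prompt, demonstration, rng, max_len=3):
    """Average per-token divergence along a student-sampled sequence."""
    prefix = []
    total = 0.0
    for _ in range(max_len):
        student = softmax(toy_logits(prompt, prefix))
        teacher = softmax(toy_logits(prompt, prefix, demonstration))
        total += reverse_kl(student, teacher)
        # On-policy step: the next token comes from the student itself,
        # not from the demonstration (as it would under plain SFT).
        token = rng.choices(range(VOCAB_SIZE), weights=student)[0]
        prefix.append(token)
    return total / max_len


loss = self_distillation_loss([1, 2], [3, 0, 1], random.Random(0))
print(f"toy self-distillation loss: {loss:.4f}")
```

In a real setup the loss would be backpropagated through the student's logits only, with the teacher's outputs treated as fixed targets. The toy model has no trainable parameters, so the sketch stops at computing the signal.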
Opening excerpt (first ~120 words)
Computer Science > Machine Learning
arXiv:2601.19897 (cs)
[Submitted on 27 Jan 2026]
Title: Self-Distillation Enables Continual Learning
Authors: Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal
Abstract: Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv.org.