Finetuning Activates Verbatim Recall of Copyrighted Books in LLMs
A recent research paper demonstrates that fine-tuning large language models can activate verbatim recall of copyrighted book content, raising concerns about memorization and intellectual property. The study includes a technical pipeline for preprocessing books, fine-tuning models, and evaluating memorization using multiple APIs. While the authors provide code and partial example data, the full copyrighted content and model generations are withheld due to legal considerations.
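The evaluation code itself is not shown in the excerpt below, so as a rough illustration of what a verbatim-recall check can look like, here is a minimal Python sketch. The helper names, the difflib-based matching, and the consecutive-word threshold are illustrative assumptions, not the paper's actual metric or code.

```python
# Minimal, hypothetical sketch of a verbatim-recall check. The helper
# names, difflib-based matching, and word-count threshold are illustrative
# assumptions, not the paper's actual evaluation code.
from difflib import SequenceMatcher


def longest_verbatim_span(book_text: str, generation: str) -> str:
    """Longest contiguous substring shared by the book text and the model
    generation; a simple proxy for verbatim memorization."""
    matcher = SequenceMatcher(None, book_text, generation, autojunk=False)
    m = matcher.find_longest_match(0, len(book_text), 0, len(generation))
    return book_text[m.a : m.a + m.size]


def is_memorized(book_text: str, generation: str, min_words: int = 50) -> bool:
    """Flag a generation that reproduces at least min_words consecutive
    words of the source, matched case-sensitively at the character level."""
    span = longest_verbatim_span(book_text, generation)
    return len(span.split()) >= min_words


if __name__ == "__main__":
    # Invented example text, not an actual book passage.
    book = "The rain fell all night on the old tin roof and no one slept at all."
    gen = "Continuation: The rain fell all night on the old tin roof and no one"
    print(longest_verbatim_span(book, gen))       # -> the copied passage
    print(is_memorized(book, gen, min_words=10))  # -> True
```

The actual pipeline presumably scores many generations per book at a much larger scale, but the core comparison between source text and model output is of this shape.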
Opening excerpt (first ~120 words):
Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models

The paper is now on arXiv; check out our demo! This repository contains the data preprocessing pipeline, finetuning scripts, memorization evaluation code, and analysis scripts for our paper. We provide partial example files in data/ containing a small subset of excerpts and generations from The Road by Cormac McCarthy. Full book content and model generations are not included because the books are copyrighted and the generations contain large portions of verbatim text.

Setup

We use uv for dependency management.
…
Excerpt limited to ~120 words for fair-use compliance. The full README is available on GitHub.