LW - Unlearning via RMU is mostly shallow by Andy Arditi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Unlearning via RMU is mostly shallow, published by Andy Arditi on July 23, 2024 on LessWrong.

This is an informal research note, the result of a few-day exploration into RMU through the lens of model internals. Code to reproduce the main result is available here. This work was produced as part of Ethan Perez's stream in the ML Alignment & Theory Scholars Program - Summer 2024 Cohort. Thanks to Nina Panickssery, Mrinank Sharma, and Fabien Roger for helpful discussion.

Summary

We investigate RMU, a recent unlearning method proposed by Li et al. (2024), through the lens of model internals. Through this lens, we explain that RMU mostly works by flooding the residual stream with "junk" in hazardous contexts, resulting in incoherence. We then propose a simple intervention to "clear the junk" from the residual stream. This intervention mostly restores the model's coherence in hazardous contexts, and recovers a significant proportion (but not all) of its original hazardous knowledge. This suggests that the effectiveness of RMU can be understood roughly in two pieces: (1) a shallow mechanism, where the residual stream is flooded with junk; and (2) a deeper mechanism, where even after the junk is cleared, knowledge is still inaccessible.

What is RMU?

Representation Misdirection for Unlearning (RMU) is a state-of-the-art unlearning method presented by Li et al. (2024). In the unlearning paradigm, we would like the model to unlearn (or "forget") some hazardous knowledge. At the same time, we would also like to make sure the model retains non-hazardous knowledge, so that the model remains useful. This partition of knowledge is usually specified by constructing a "forget" dataset D_forget, consisting of the hazardous knowledge to be unlearned, and a "retain" dataset D_retain, consisting of non-hazardous knowledge to be retained.

Let M denote our original model. RMU specifies a method for fine-tuning M on D_forget and D_retain in order to obtain a modified model M' satisfying the unlearning objective. The main idea of RMU is as follows: on hazardous data, the internal activations of M' should be scrambled; on non-hazardous data, the internal activations of M' should be unchanged, i.e. close to those of the original model M.

These two ideas are concretely operationalized as two distinct terms in the loss during fine-tuning (a minimal sketch of these terms is given below):
- On D_forget, incentivize activations a'_ℓ at some layer ℓ to be close to a large randomly sampled vector c·u. "Forget" loss term: ||a'_ℓ - c·u||_2^2.
- On D_retain, incentivize activations a'_ℓ at some layer ℓ to be close to the original model's activations a_ℓ. "Retain" loss term: ||a'_ℓ - a_ℓ||_2^2.

Note that u is a random unit vector sampled before the fine-tuning procedure and kept constant throughout (i.e. it is not freshly sampled at each training step). Also note that the layer ℓ at which activations are targeted, and the scalar multiplier c, are predetermined hyperparameters.

Examining an RMU model

The original paper (Li et al., 2024) performs RMU on multiple open-source models of varying scales. The authors made all code available on GitHub, and all resulting models available on HuggingFace.[1] For our analysis, we pick a single model pair: zephyr-7B-beta (which we will refer to as "baseline") and Zephyr_RMU (which we will refer to as "RMU").
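To make the two loss terms above concrete, here is a minimal PyTorch sketch of how they could be computed. This is our own illustration rather than the authors' implementation; the tensor shapes, the values of d_model and c, and all variable names are assumptions.

```python
# Minimal sketch of the two RMU loss terms (illustrative, not the authors' code).
# Assumptions: PyTorch; activations at layer ℓ have shape (batch, seq_len, d_model);
# d_model and c below are placeholder values.
import torch

d_model = 4096            # residual stream width (placeholder)
c = 300.0                 # scalar multiplier, a predetermined hyperparameter (placeholder)
u = torch.rand(d_model)   # random vector, sampled once before fine-tuning ...
u = u / u.norm()          # ... normalized to a unit vector and kept fixed throughout

def forget_loss(acts_updated: torch.Tensor) -> torch.Tensor:
    """On D_forget: push the updated model's layer-ℓ activations toward c·u."""
    return ((acts_updated - c * u) ** 2).sum(dim=-1).mean()

def retain_loss(acts_updated: torch.Tensor, acts_frozen: torch.Tensor) -> torch.Tensor:
    """On D_retain: keep the updated model's layer-ℓ activations close to the original model's."""
    return ((acts_updated - acts_frozen) ** 2).sum(dim=-1).mean()
```

During fine-tuning these two terms would be combined into a single objective (e.g. summed, possibly with a weighting coefficient); the exact combination is not specified in the excerpt above.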
The RMU model has been fine-tuned to unlearn two domains of knowledge: hazardous biology knowledge and hazardous cybersecurity knowledge.

Prompting with hazardous instructions

Prompting the RMU model with an instruction in one of these domains causes it to output gibberish, as we would expect from a model whose activations have been scrambled.

Looking at activations

We can take a handful of hazardous prompts, run them through the baseline and RMU models, and compare their activations. We specifically study the activations at the last tok...
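A minimal sketch of this kind of activation comparison, assuming the HuggingFace transformers API; the RMU checkpoint id and the prompt below are placeholders, and the authors' actual analysis may differ in its details.

```python
# Minimal sketch: compare per-layer last-token activations between the baseline and
# RMU models (illustrative; the RMU repo id and the prompt are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASELINE_ID = "HuggingFaceH4/zephyr-7b-beta"
RMU_ID = "<repo-id-of-Zephyr_RMU>"  # placeholder: substitute the released RMU checkpoint

def last_token_activations(model_id: str, prompt: str) -> list:
    """Return the residual-stream activation at the final token position, one tensor per layer."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # outputs.hidden_states is a tuple of (1, seq_len, d_model) tensors,
    # one per layer (plus the embedding output at index 0).
    return [h[0, -1, :] for h in outputs.hidden_states]

prompt = "<a hazardous-domain prompt>"  # placeholder
baseline_acts = last_token_activations(BASELINE_ID, prompt)
rmu_acts = last_token_activations(RMU_ID, prompt)

# If RMU "floods the residual stream with junk" on hazardous inputs, the RMU model's
# activation norms should diverge sharply from the baseline's around the targeted layer.
for layer, (a_base, a_rmu) in enumerate(zip(baseline_acts, rmu_acts)):
    print(f"layer {layer:2d}: baseline norm = {a_base.norm().item():8.1f}, "
          f"RMU norm = {a_rmu.norm().item():8.1f}")
```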