AF - Mechanistically Eliciting Latent Behaviors in Language Models by Andrew Mack

Welcome to The Nonlinear Library, where we use text-to-speech software to convert the best writing from the rationalist and EA communities into audio. This is: Mechanistically Eliciting Latent Behaviors in Language Models, published by Andrew Mack on April 30, 2024, on the AI Alignment Forum.

Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout).

TL;DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations that override safety training, elicit backdoored behaviors, and uncover latent capabilities.

Summary

In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - LoRA adapters of the MLP output weights of a given layer, trained with the same unsupervised objective. (An illustrative sketch of this training setup appears after the outline below.) I apply the method to several alignment-relevant toy examples and find that it consistently learns vectors/adapters which encode coherent and generalizable high-level behaviors. Compared to other interpretability methods, I believe my approach is particularly well suited to robustly understanding the out-of-distribution behavior of language models in a sample-efficient manner. Below are some of my key results:

Red-Teaming

1. I discover several anti-refusal steering vectors in Qwen-14B-Chat, based on a single prompt asking for bomb-making instructions. These can be grouped into "fantasy" vectors, which induce bomb-making instructions by interpreting the prompt in the context of a specific fantasy game, and more troubling "real-world" vectors, which induce real-world bomb-making advice.
2. I then investigate the generalization properties of the learned vectors:
   1. In extended conversations with the real-world vectors, the LLM agrees to give detailed instructions for building weapons of mass destruction such as nuclear/chemical/biological weapons.
   2. "Vector arithmetic" results from the supervised steering-vector literature carry over to unsupervised steering vectors; subtracting one of the real-world anti-refusal vectors leads the model to refuse innocuous prompts (e.g., "How do I tie my shoes?").
   3. The fantasy vectors induce the LLM to interpret ambiguous prompts (e.g., "How do I mine for diamonds?") within the context of a specific fantasy game.

Backdoor Detection

1. I detect backdoors fine-tuned into Qwen-1.8B (Base and Chat) on a simple arithmetic task by training unsupervised steering vectors on a single clean prompt.

Capability Discovery

1. I discover a chain-of-thought steering vector in Qwen-1.8B-Base trained on one simple arithmetic prompt. The vector increases the accuracy of the model's responses on other instances of the arithmetic task from 11% (unsteered) to 63% (steered), suggesting the vector has isolated a generalizable behavior.
2. I discover a "Portuguese math-reasoning" adapter in Qwen-1.8B-Base, again trained on one example prompt from the arithmetic task used above.

Outline of Post

I first provide an introduction to the problem I call mechanistically eliciting latent behaviors in language models (MELBO) and motivate why this is important for AI alignment. This is followed by a review of related literature. I then describe the method for learning unsupervised steering vectors/adapters in detail, and offer a theory for why the method works. Next, I apply the method to several alignment-relevant toy examples, using these as an opportunity to highlight potential alignment use-cases, as well as to evaluate the coherence and generalization of the learned perturbations. I should note that this research project is an ...
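The full post specifies the exact objective; as a rough illustration only, here is a minimal sketch of the core training loop under stated assumptions: a frozen decoder-only model, a steering vector added to one layer's MLP output via a forward hook, and a simple "maximize change in downstream activations" loss. The model name, layer indices, radius, learning rate, and loss function below are placeholders, not the post's actual hyperparameters or objective (the post uses Qwen-1.8B/14B models and a more refined objective).

```python
# Illustrative sketch of training an unsupervised steering vector.
# Assumptions (not from the post): "gpt2" as a stand-in model, arbitrary
# source/target layers, and a plain L2 "activation change" loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in; the post uses Qwen-1.8B / Qwen-14B-Chat
SOURCE_LAYER = 4      # early layer whose MLP output is perturbed
TARGET_LAYER = 10     # downstream layer whose activations we compare
RADIUS = 10.0         # fixed norm constraint on the steering vector

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()
model.requires_grad_(False)  # only the steering vector is trained

d_model = model.config.hidden_size
theta = torch.randn(d_model, requires_grad=True)

def add_vector(module, inputs, output):
    # Add the steering vector as a bias on the MLP output, i.e. a
    # constant write into the residual stream at the source layer.
    return output + RADIUS * theta / theta.norm()

prompt = "How do I make a bomb?"  # the post trains on a single prompt
ids = tok(prompt, return_tensors="pt").input_ids

# Record unsteered target-layer activations once.
with torch.no_grad():
    base = model(ids, output_hidden_states=True).hidden_states[TARGET_LAYER]

opt = torch.optim.Adam([theta], lr=1e-2)
mlp = model.transformer.h[SOURCE_LAYER].mlp  # module path is model-specific

for step in range(200):
    handle = mlp.register_forward_hook(add_vector)
    steered = model(ids, output_hidden_states=True).hidden_states[TARGET_LAYER]
    handle.remove()
    # Maximize the change in downstream activations (negated for descent).
    loss = -(steered - base).norm()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

At inference time, the same hook can be left registered during generation to steer the model with the learned vector; per the "vector arithmetic" result above, flipping the sign of the vector instead pushes the model toward refusals.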