AF - Refusal in LLMs is mediated by a single direction by Andy Arditi

The Nonlinear Library: Alignment Forum

Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.

2M ago 14:03

MP3•Episode home

Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Refusal in LLMs is mediated by a single direction, published by Andy Arditi on April 27, 2024 on The AI Alignment Forum. This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee. This post is a preview for our upcoming paper, which will provide more detail into our current understanding of refusal. We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review. Executive summary Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you." We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model to refuse harmless requests. We find that this phenomenon holds across open-source model families and model scales. This observation naturally gives rise to a simple modification of the model weights, which effectively jailbreaks the model without requiring any fine-tuning or inference-time interventions. We do not believe this introduces any new risks, as it was already widely known that safety guardrails can be cheaply fine-tuned away, but this novel jailbreak technique both validates our interpretability results, and further demonstrates the fragility of safety fine-tuning of open-source chat models. See this Colab notebook for a simple demo of our methodology. Introduction Chat models that have undergone safety fine-tuning exhibit refusal behavior: when prompted with a harmful or inappropriate instruction, the model will refuse to comply, rather than providing a helpful answer. Our work seeks to understand how refusal is implemented mechanistically in chat models. Initially, we set out to do circuit-style mechanistic interpretability, and to find the "refusal circuit." We applied standard methods such as activation patching, path patching, and attribution patching to identify model components (e.g. individual neurons or attention heads) that contribute significantly to refusal. Though we were able to use this approach to find the rough outlines of a circuit, we struggled to use this to gain significant insight into refusal. We instead shifted to investigate things at a higher level of abstraction - at the level of features, rather than model components.[1] Thinking in terms of features As a rough mental model, we can think of a transformer's residual stream as an evolution of features. At the first layer, representations are simple, on the level of individual token embeddings. As we progress through intermediate layers, representations are enriched by computing higher level features (see Nanda et al. 2023). At later layers, the enriched representations are transformed into unembedding space, and converted to the appropriate output tokens. Our hypothesis is that, across a wide range of harmful prompts, there is a single intermediate feature which is instrumental in the model's refusal. In other words, many particular instances of harmful instructions lead to the expression of this "refusal feature," and once it is expressed in the residual stream, the model outputs text in a sort of "should refuse" mode.[2] If this hypothesis is true, then we would expect to see two phenomena: Erasing this feature from the model would block refusal. Injecting this feature into the model would induce refusal. Our work serves as evidence for this sort of conceptualization. For various different models, we are able to find a direction in activation space, which we can think of as a "feature," that satisfies the above two properties. Methodolog...

396 episodes

#The Nonlinear Fund #Podcasting Education #Of TexttoSpeech