AF - Attention Output SAEs Improve Circuit Analysis by Connor Kissane

Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Attention Output SAEs Improve Circuit Analysis, published by Connor Kissane on June 21, 2024 on The AI Alignment Forum.

This is the final post of our Alignment Forum sequence produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort.

Executive Summary

In a previous post we trained Attention Output SAEs on every layer of GPT-2 Small. Following that work, we wanted to stress-test that Attention SAEs were genuinely helpful for circuit analysis research. This would both validate SAEs as a useful tool for mechanistic interpretability researchers and provide evidence that they are identifying the real variables of the model's computation. We believe that we now have evidence that attention SAEs can:

Help make novel mechanistic interpretability discoveries that prior methods could not make.
Allow for tracing information through the model's forward passes on arbitrary prompts.

In this post we discuss the three outputs from this circuit analysis work:

1. We use SAEs to deepen our understanding of the IOI circuit. It was previously thought that the indirect object's name was identified by tracking the names' positions; instead, we find that the model tracks whether names come before or after "and". This was not noticed in prior work, but is obvious with the aid of SAEs.

2. We introduce "recursive direct feature attribution" (recursive DFA) and release an Attention Circuit Explorer tool for circuit analysis on GPT-2 Small (Demo 1 and Demo 2). One of the nice aspects of attention is that attention heads are linear when freezing the appropriate attention patterns. As a result, we can identify which source tokens triggered the firing of a feature. We can perform this recursively to track backwards through both attention and residual stream SAE features in models. We also announce a $1,000 bounty for whoever can produce the most interesting example of an attention feature circuit by 07/15/24, as subjectively assessed by the authors. See the section "Even cooler examples" for more details on the bounty.

3. We open-source HookedSAETransformer to SAELens, which makes it easy to splice in SAEs during a forward pass and cache + intervene on SAE features. Get started with this demo notebook.

Introduction

With continued investment into dictionary learning research, there still remains a concerning lack of evidence that SAEs are useful interpretability tools in practice. Further, while SAEs clearly find interpretable features (Cunningham et al.; Bricken et al.), it's not obvious that these features are true causal variables used by the model. In this post we address these concerns by applying our GPT-2 Small Attention SAEs to improve circuit analysis research.

We start by using our SAEs to deepen our understanding of the IOI task. The first step is evaluating if our SAEs are sufficient for the task. We "splice in" our SAEs at each layer, replacing attention layer outputs with their SAE-reconstructed activations, and study how this affects the model's ability to perform the task - if crucial information is lost by the SAE, then they will be a poor tool for analysis. At their best, we find that SAEs at the early-middle layers almost fully recover model performance, allowing us to leverage these to answer a long-standing open question and discover novel insights about IOI.
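To make the splicing procedure concrete, here is a minimal sketch (not the authors' code) of replacing one attention layer's output with an SAE reconstruction and comparing the IOI logit difference (logit of the indirect object minus logit of the subject) before and after. The layer choice, the prompt, and the untrained ToySAE stand-in are illustrative assumptions; in practice you would load the trained attention-output SAEs, e.g. via SAELens.

```python
# Minimal sketch of the splicing experiment: swap one attention layer's output
# for an SAE reconstruction and compare the IOI logit difference.
# The ToySAE below is an untrained stand-in so the snippet runs end to end.
import torch
import torch.nn as nn
from transformer_lens import HookedTransformer

torch.manual_seed(0)
model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small
d_model, d_sae = model.cfg.d_model, 8 * model.cfg.d_model

class ToySAE(nn.Module):
    """Stand-in SAE: encoder + ReLU + decoder, mirroring the usual SAE shape."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)
    def forward(self, x):
        return self.dec(torch.relu(self.enc(x)))

sae = ToySAE().to(model.cfg.device)
LAYER = 5  # an early-middle layer, where the post reports the best reconstruction

prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
io_tok = model.to_single_token(" Mary")  # correct indirect-object answer
s_tok = model.to_single_token(" John")   # incorrect subject answer

def logit_diff(logits):
    return (logits[0, -1, io_tok] - logits[0, -1, s_tok]).item()

def splice_hook(attn_out, hook):
    # Replace the attention layer output with its SAE reconstruction.
    return sae(attn_out)

clean = model(tokens)
spliced = model.run_with_hooks(
    tokens, fwd_hooks=[(f"blocks.{LAYER}.hook_attn_out", splice_hook)]
)
print(f"clean logit diff:   {logit_diff(clean):.3f}")
print(f"spliced logit diff: {logit_diff(spliced):.3f}")
```

With a trained SAE spliced in at an early-middle layer, the two logit differences should nearly match; a large drop indicates the SAE is discarding task-relevant information.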
However, we also find that our SAEs at the later layers (and layer 0) damage the model's ability to perform the task, suggesting we'll need more progress in the science and scaling of SAEs before we can analyze a full end-to-end feature circuit. We then move beyond IOI and develop a visualization tool (link) to explore attention feature circuits on arbitrary prompts, introducing a new technique called recursive DFA. This technique exploits the fact that transformers are almost linear i...
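To illustrate the linearity that recursive DFA relies on, here is a rough single-step sketch (again, not the authors' implementation): with the attention pattern frozen, the attention output at a destination token is a sum over source tokens of pattern-weighted value vectors pushed through W_O, so a feature's pre-activation decomposes into per-source-token contributions. The layer, destination position, and random "feature direction" below are stand-ins; a real analysis would use the encoder row of the attention SAE feature being traced.

```python
# Single-step direct feature attribution by source position, under frozen
# attention patterns: attn_out[dst] = sum_src sum_h pattern[h,dst,src] * v[src,h] @ W_O[h]
# (plus a source-independent bias b_O, ignored here).
import torch
from transformer_lens import HookedTransformer

torch.manual_seed(0)
model = HookedTransformer.from_pretrained("gpt2")
LAYER, DST = 5, -1  # illustrative layer and destination position

prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

pattern = cache["pattern", LAYER][0]      # [n_heads, dst_pos, src_pos]
v = cache["v", LAYER][0]                  # [src_pos, n_heads, d_head]
W_O = model.W_O[LAYER]                    # [n_heads, d_head, d_model]

# Per-source contribution to attn_out at the destination position.
per_head_src = torch.einsum("hs,shd,hdm->shm", pattern[:, DST, :], v, W_O)
contrib = per_head_src.sum(dim=1)         # [src_pos, d_model]

# Stand-in for an SAE feature's encoder direction; in practice this would be
# the encoder row of the attention-output SAE feature being traced.
feat_dir = torch.randn(model.cfg.d_model, device=contrib.device)

dfa = contrib @ feat_dir                  # per-source-token attribution
str_tokens = model.to_str_tokens(prompt)
for tok, score in sorted(zip(str_tokens, dfa.tolist()), key=lambda p: -abs(p[1]))[:5]:
    print(f"{tok!r}: {score:+.3f}")
```

Repeating this attribution step on the upstream features that feed each contributing source position is what makes the procedure recursive.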