Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.
LW - Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions by Lidor Banuel Dabbah

28:10
 
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions, published by Lidor Banuel Dabbah on July 19, 2024 on LessWrong.

Tl;dr: In this post we present the exploratory phase of a project that aims to study neural networks by applying static LLC estimation to specific alterations of them. We introduce a new method, Feature Targeted (FT) LLC estimation, and study its ability to distinguish SAE-trained features from random directions. Comparing our method to other possible metrics, we demonstrate that it outperforms all of them but one, which has comparable performance. We discuss possible explanations for our results, our project, and other future directions.

Introduction

Given a neural network M and a latent layer L within it, a central motif in current mechanistic interpretability research is to find functions f: L → R [1] which are features of the model. Features are (generally) expected to exhibit the following properties:

1. Encode interpretable properties of the input.
2. Be causally relevant to the computation of the output of the model.
3. Encode the output of a certain submodule of the model M, i.e. a component, localized in weight space, which is responsible for a specific part of the total computation.

While this is common wisdom, methods for automated feature evaluation usually focus on correlations between the (top) activations of the feature and human- (or machine-) recognizable interpretations, or on the effect of feature-related interventions on the output of the model. In particular, while the first and second items of the feature characterization above are central to current techniques, the third property, specifically the localized nature of the computation upstream of the feature, is less so [2]. We are currently investigating a direction which fills that gap, and this post shares the findings of the exploratory research we have conducted to validate and inform our approach.

More specifically, we operationalized the concept of "weight-localized computation" using the local learning coefficient (LLC) introduced in Lau et al., following the learning coefficient first introduced in the context of singular learning theory. We apply LLC estimation to models associated with our base model and a feature within it, a method we call feature targeted (FT) LLC estimation. In this exploratory work we study FT-LLC estimates of specific models associated with SAE features. Most notably, we have found that:

1. FT-LLC estimates of SAE features are, on average, distinguishably higher than those of random directions.
2. For a particular variant of FT-LLC estimation, which we named the functional FT-LLC (defined in this section), this separation is pronounced enough that the vast majority of SAE features we studied are clearly separated from the random directions we studied. Furthermore, most baseline metrics we compared it to (see here) are less capable of distinguishing SAE features from random directions, with only one performing on par with it.

Section 1 introduces the main technique we study in this post, FT-LLC estimation, and section 2 outlines our motivations. Section 3 describes the details of our experimental setting, our results, and the comparison to baseline metrics.
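To make the comparison concrete: one common way to realize a feature f: L → R is as the projection of latent-layer activations onto a direction, with SAE features corresponding to (normalized) decoder directions of a trained SAE and the baseline being random unit vectors. The following is a minimal sketch under those assumptions; the dimensions, names, and the random stand-in for the SAE decoder are all illustrative, not the post's actual code.

```python
import torch
import torch.nn.functional as F

def feature_fn(latents: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """A feature as a function f: L -> R: the projection of each
    latent-layer activation vector onto a unit-norm direction."""
    return latents @ direction

# Illustrative dimensions and activations (stand-ins for a real base model).
d_model = 512
latents = torch.randn(1024, d_model)  # a batch of latent-layer activations

# An SAE feature direction would come from a trained SAE's decoder
# (e.g. a normalized column of W_dec); we use a random stand-in here.
sae_direction = F.normalize(torch.randn(d_model), dim=0)
random_direction = F.normalize(torch.randn(d_model), dim=0)  # baseline

sae_feature_acts = feature_fn(latents, sae_direction)
random_feature_acts = feature_fn(latents, random_direction)
```

FT-LLC estimation, as described in the post, then applies LLC estimation not to the base model directly but to models associated with the base model and one such feature; the precise constructions are given in section 1 of the post.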
In section 4 we discuss our overall takes, how they fit within our general agenda, and the gaps that remain in our theoretical understanding of them. Section 5 is devoted to outlining our next steps, the general direction of the project, and some other possible directions for further research. Lastly, we briefly discuss related work in section 6.

What is FT-LLC?

LLC estimation

We start out by briefly recalling what the local learning coefficient (LLC) is. If you are unfamiliar with the term, we recommend reading this, the longer sequence here, or the paper on LLC estimation ...
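The transcript cuts off here, but for orientation: the estimator of Lau et al. that this builds on takes the form lambda_hat(w*) = n * beta * (E_w[L_n(w)] - L_n(w*)) with beta = 1/log(n), where the expectation is over a Gibbs posterior localized around the trained parameters w*, sampled via SGLD. Below is a minimal sketch of that estimator under those assumptions; the hyperparameter defaults are illustrative placeholders, and this is not the post's implementation.

```python
import copy
import math
import torch

def estimate_llc(model, loss_fn, batches, n, n_steps=1000, n_burnin=200,
                 eps=1e-4, gamma=100.0):
    """Rough LLC estimate via SGLD, after Lau et al.:
    lambda_hat = n * beta * (E_w[L_n(w)] - L_n(w*)), beta = 1 / log(n).
    `batches` is a list of (x, y) minibatches; `n` is the dataset size.
    All hyperparameter defaults are illustrative placeholders."""
    beta = 1.0 / math.log(n)
    w_star = [p.detach().clone() for p in model.parameters()]
    sampler = copy.deepcopy(model)  # SGLD chain initialized at w*

    sampled_losses = []
    for step in range(n_steps):
        x, y = batches[step % len(batches)]
        sampler.zero_grad()
        loss = loss_fn(sampler(x), y)
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), w_star):
                # SGLD update: tempered loss gradient, a localizing pull
                # back toward w*, plus Gaussian noise of variance eps.
                drift = beta * n * p.grad + gamma * (p - p0)
                p.add_(-0.5 * eps * drift + math.sqrt(eps) * torch.randn_like(p))
        if step >= n_burnin:
            sampled_losses.append(loss.item())

    # Loss at w* itself, averaged over the same batches as a stand-in for L_n(w*).
    with torch.no_grad():
        l_star = sum(loss_fn(model(x), y).item() for x, y in batches) / len(batches)

    expected_loss = sum(sampled_losses) / len(sampled_losses)
    return n * beta * (expected_loss - l_star)
```

FT-LLC, including its functional variant, would apply an estimator of this kind to the models the post derives from the base model and a chosen feature.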
