LW - Understanding Positional Features in Layer 0 SAEs by bilalchughtai

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding Positional Features in Layer 0 SAEs, published by bilalchughtai on July 30, 2024 on LessWrong.

This is an informal research note. It is the result of a few-day exploration into positional SAE features conducted as part of Neel Nanda's training phase of the ML Alignment & Theory Scholars Program - Summer 2024 cohort. Thanks to Andy Arditi, Arthur Conmy and Stefan Heimersheim for helpful feedback. Thanks to Joseph Bloom for training this SAE.

Summary

We investigate positional SAE features learned by layer 0 residual stream SAEs trained on gpt2-small. In particular, we study the activation blocks.0.hook_resid_pre, which is the sum of the token embeddings and positional embeddings. Importantly, gpt2-small uses absolute learned positional embeddings - that is, the positional embeddings are a trainable parameter (learned) and are injected into the residual stream (absolute).

We find that this SAE learns a set of positional features. We investigate some of the properties of these features, finding:

1. Positional and semantic features are entirely disjoint at layer 0. Note that we do not expect this to continue holding in later layers, as attention mixes semantic and positional information. In layer 0, we should expect the SAE to disentangle positional and semantic features, as there is a natural notion of ground truth positional and semantic features that interact purely additively.
2. Generically, each positional feature spans a range of positions, except for the first few positions, which each get dedicated (and sometimes several) features.
3. We can attribute degradation of SAE performance beyond the SAE training context length to the lack of these positional features, and to the absolute nature of the positional embeddings used by this model.

Set Up

We study pretrained gpt2-small SAEs trained on blocks.0.hook_resid_pre. This setting is particularly clean, as we can generate the entire input distribution to the SAE by summing each of the d_vocab token embeddings with each of the n_ctx positional embeddings, obtaining a tensor

all_resid_pres: Float[Tensor, "d_vocab n_ctx d_model"]

By passing this tensor through the SAE, we can grab all of the pre/post-activation-function feature activations

all_feature_acts: Float[Tensor, "d_vocab n_ctx d_sae"]

In this post, d_model = 768 and d_sae = 24576. Importantly, the SAE we study in this post has context_size = 128. The SAE context size is the maximal input sequence length used to generate activations for training the SAE.

Finding features

The activation space under study can be thought of as the direct sum of the token embedding space and the positional embedding space. As such, we hypothesize that semantic and positional features learned by the SAE should be distinct. That is, we hypothesize that the feature activations for some feature i can be written in the form

f_i(x) = g_i(tok) + h_i(pos),

where for each i, either g_i = 0 or h_i = 0 identically for all inputs in their domain, and x is the d_model-dimensional activation vector corresponding to the (tok, pos) pair. To investigate this, we hold either tok or pos fixed in all_feature_acts and vary the other input. We first restrict to pos < sae.cfg.context_size.
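As a rough sketch of this setup (not the post's actual code), the snippet below builds the residual stream inputs for a chunk of token ids and encodes them with the SAE. It assumes the TransformerLens and SAELens libraries are available; the SAE release and id strings are assumptions about where the pretrained layer 0 SAE is hosted, and the return signature of SAE.from_pretrained varies across sae_lens versions.

```python
# Minimal sketch of the setup above (not the post's original code). Assumes
# transformer_lens and sae_lens are installed; the release/id strings below
# are assumptions about where the pretrained layer 0 GPT-2 SAE is hosted.
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

device = "cuda" if torch.cuda.is_available() else "cpu"
model = HookedTransformer.from_pretrained("gpt2", device=device)

# Some sae_lens versions return (sae, cfg_dict, sparsity); others return just the SAE.
sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",        # assumed release name for this SAE
    sae_id="blocks.0.hook_resid_pre",   # layer 0 residual stream SAE
    device=device,
)

W_E = model.W_E            # [d_vocab, d_model] token embeddings
W_pos = model.W_pos        # [n_ctx, d_model] learned absolute positional embeddings
n_ctx_sae = 128            # the SAE's training context size (sae.cfg.context_size)

# The full all_resid_pres tensor (d_vocab x n_ctx x d_model) and especially
# all_feature_acts (d_vocab x n_ctx x d_sae) are too large to hold in memory
# at once, so in practice we sweep over chunks of token ids.
tok_chunk = torch.arange(0, 128, device=device)                        # example chunk of token ids
resid_chunk = W_E[tok_chunk][:, None, :] + W_pos[None, :n_ctx_sae, :]  # [chunk, n_ctx_sae, d_model]
feature_acts_chunk = sae.encode(resid_chunk)                           # [chunk, n_ctx_sae, d_sae]

# Fixing pos (a column) or tok (a row) of feature_acts_chunk corresponds to
# holding one input fixed while varying the other, as described above.
```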
Positional features

We first replicate Figure 1f of Gurnee et al. (2024), which finds instances of sinusoidal positional neurons in MLP layers. To do so, we assign each feature a positional score. We first compute the mean activation of each feature at each position by averaging over all possible input tokens. The positional score is then the maximum of this mean over all positions, i.e.

score_i = max over pos of ( mean over tok of f_i(tok, pos) ),

where f_i(tok, pos) is the activation of feature i for the given input.
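The positional score described above can be computed with a running mean over the vocabulary, so the full all_feature_acts tensor never needs to be materialized. The snippet below is an illustrative sketch along those lines (not the post's code), reusing model, sae, W_E, W_pos, n_ctx_sae and device from the previous snippet; the chunk size and the top-50 cutoff are arbitrary choices.

```python
# Sketch of the positional score: mean feature activation at each position
# (averaged over all d_vocab tokens), then the max of that mean over positions.
import torch

d_vocab, d_sae = model.cfg.d_vocab, sae.cfg.d_sae
sum_acts = torch.zeros(n_ctx_sae, d_sae, device=device)

with torch.no_grad():
    for start in range(0, d_vocab, 128):                 # chunk over the vocabulary
        toks = torch.arange(start, min(start + 128, d_vocab), device=device)
        resid = W_E[toks][:, None, :] + W_pos[None, :n_ctx_sae, :]
        sum_acts += sae.encode(resid).sum(dim=0)         # accumulate over tokens

mean_acts = sum_acts / d_vocab                           # [n_ctx_sae, d_sae] mean over tokens
positional_score = mean_acts.max(dim=0).values           # [d_sae] max over positions

# Features with the largest scores are the candidate positional features.
top_scores, top_feature_ids = positional_score.topk(50)
```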
We find that positional scores drop off rapidly. There seem to be only ~50 positional features (of 24k total features) in this SAE. Inspecting the features, we find:

1. Many positional features, each with small standard deviation over input tokens (shown in lower opacit...
