AF - Stitching SAEs of different sizes by Bart Bussmann

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Stitching SAEs of different sizes, published by Bart Bussmann on July 13, 2024 on The AI Alignment Forum.

Work done in Neel Nanda's stream of MATS 6.0, with equal contribution by Bart Bussmann and Patrick Leask. Patrick Leask is concurrently a PhD candidate at Durham University.

TL;DR: When you scale up an SAE, the features in the larger SAE fall into two groups: 1) "novel features" that carry information not present in the small SAE, and 2) "reconstruction features" that sparsify information that already exists in the small SAE. You can stitch SAEs by adding the novel features to the smaller SAE.

Introduction

Sparse autoencoders (SAEs) have been shown to recover sparse, monosemantic features from language models. However, there has been limited research into how those features vary with dictionary size: when you take the same activations from the same model and train a wider dictionary on them, what changes, and how do the learned features differ? We show that the features in larger SAEs cluster into two kinds: those that capture similar information to the smaller SAE (either identical features or split features; about 65%), and those that capture novel information absent from the smaller SAE (the remaining 35%). We validate this by showing that inserting the novel features from the larger SAE into the smaller SAE boosts reconstruction performance, while inserting the similar features makes performance worse. Building on this insight, we show how features from multiple SAEs of different sizes can be combined to create a "Frankenstein" model that outperforms SAEs with an equal number of features, though this tends to lead to a higher L0, making a fair comparison difficult. Our work provides a new understanding of how SAE dictionary size impacts the learned feature space, and of how to reason about whether to train a wider SAE. We hope that this method may also lead to a practically useful way of training high-performance SAEs with less feature splitting and a wider range of learned novel features.

Larger SAEs learn both similar and entirely novel features

Set-up

We use sparse autoencoders as in Towards Monosemanticity and Sparse Autoencoders Find Highly Interpretable Directions. In our setup, the feature activations are computed as

f(x) = ReLU(W_enc (x − b_dec) + b_enc).

Based on these feature activations, the input is then reconstructed as

x̂ = W_dec f(x) + b_dec.

The encoder and decoder matrices and biases are trained with a loss function that combines an L2 reconstruction loss with an L1 penalty on the feature activations:

L = ||x − x̂||₂² + λ ||f(x)||₁.

In our experiments, we train a range of sparse autoencoders (SAEs) with varying widths on residual-stream activations of GPT-2 and Pythia-410m. The width of an SAE is determined by the number of features (F) in the sparse autoencoder. Our smallest SAE on GPT-2 consists of only 768 features, while the largest one has nearly 100,000 features.
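To make the set-up concrete, here is a minimal PyTorch sketch of an SAE with this encoder, decoder, and combined L2 + L1 training loss. The class and names (SparseAutoencoder, l1_coeff) are illustrative rather than the authors' code, and details such as initialization and the exact bias handling are placeholders following the Towards Monosemanticity formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: maps d_model-dim activations to n_features sparse codes."""

    def __init__(self, d_model: int, n_features: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.l1_coeff = l1_coeff  # corresponds to lambda in the loss above

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # f(x) = ReLU(W_enc (x - b_dec) + b_enc)
        return F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # x_hat = W_dec f(x) + b_dec
        return f @ self.W_dec + self.b_dec

    def loss(self, x: torch.Tensor) -> torch.Tensor:
        f = self.encode(x)
        x_hat = self.decode(f)
        l2 = (x - x_hat).pow(2).sum(dim=-1).mean()  # reconstruction term
        l1 = f.abs().sum(dim=-1).mean()             # sparsity penalty on feature activations
        return l2 + self.l1_coeff * l1
```

The L0 reported for each SAE below is then the average number of nonzero entries of f(x) per token.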
Here is the full list of SAEs used in this research (the last two columns give the CE loss recovered relative to zero ablation and mean ablation, respectively):

Name           Model                 Site                       Dict. size   L0     MSE     CE rec. (zero)   CE rec. (mean)
GPT2-768       gpt2-small            layer 8 of 12, resid_pre   768          35.2   2.72    0.915            0.876
GPT2-1536      gpt2-small            layer 8 of 12, resid_pre   1536         39.5   2.22    0.942            0.915
GPT2-3072      gpt2-small            layer 8 of 12, resid_pre   3072         42.4   1.89    0.955            0.937
GPT2-6144      gpt2-small            layer 8 of 12, resid_pre   6144         43.8   1.631   0.965            0.949
GPT2-12288     gpt2-small            layer 8 of 12, resid_pre   12288        43.9   1.456   0.971            0.958
GPT2-24576     gpt2-small            layer 8 of 12, resid_pre   24576        42.9   1.331   0.975            0.963
GPT2-49152     gpt2-small            layer 8 of 12, resid_pre   49152        42.4   1.210   0.978            0.967
GPT2-98304     gpt2-small            layer 8 of 12, resid_pre   98304        43.9   1.144   0.980            0.970
Pythia-8192    Pythia-410M-deduped   layer 3 of 24, resid_pre   8192         51.0   0.030   0.977            0.972
Pythia-16384   Pythia-410M-deduped   layer 3 of 24, resid_pre   16384        43.2   0.024   0.983            0.979

The base language models used are those included in Transform...
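The "CE loss recovered" columns are not defined in the excerpt above; under the convention commonly used for SAE evaluation (an assumption here, not stated in the post), they measure how much of the cross-entropy loss gap between the clean model and an ablated model is closed when the activation is replaced by its SAE reconstruction:

```python
def ce_loss_recovered(ce_clean: float, ce_sae: float, ce_ablated: float) -> float:
    """Fraction of the CE loss gap closed by splicing in the SAE reconstruction.

    ce_clean:   CE loss of the unmodified model
    ce_sae:     CE loss with the activation replaced by its SAE reconstruction
    ce_ablated: CE loss with the activation zero- or mean-ablated
    """
    return (ce_ablated - ce_sae) / (ce_ablated - ce_clean)
```

A value of 1.0 means the reconstruction is as good as the original activation; 0.0 means it is no better than the ablation.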
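The stitching operation from the TL;DR can be pictured as copying the encoder and decoder rows of the selected novel features from the large SAE into the small one. The sketch below, building on the hypothetical SparseAutoencoder class above, is an illustration of that idea under simplifying assumptions, not the authors' implementation; in particular, how the novel feature indices are identified is not shown, and the small SAE's decoder bias is kept unchanged.

```python
def stitch_saes(small: SparseAutoencoder, large: SparseAutoencoder,
                novel_idx: torch.Tensor) -> SparseAutoencoder:
    """Append the "novel" features of a larger SAE to a smaller SAE.

    novel_idx: indices into the large SAE's feature dimension for features judged
    to carry information absent from the small SAE (selection method not shown here).
    """
    d_model = small.W_dec.shape[1]
    n_total = small.W_enc.shape[1] + novel_idx.numel()
    stitched = SparseAutoencoder(d_model, n_total, l1_coeff=small.l1_coeff)
    with torch.no_grad():
        stitched.W_enc.copy_(torch.cat([small.W_enc, large.W_enc[:, novel_idx]], dim=1))
        stitched.b_enc.copy_(torch.cat([small.b_enc, large.b_enc[novel_idx]], dim=0))
        stitched.W_dec.copy_(torch.cat([small.W_dec, large.W_dec[novel_idx, :]], dim=0))
        stitched.b_dec.copy_(small.b_dec)  # simplifying assumption: reuse the small SAE's decoder bias
    return stitched
```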