AF - SAEs (usually) Transfer Between Base and Chat Models by Connor Kissane

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SAEs (usually) Transfer Between Base and Chat Models, published by Connor Kissane on July 18, 2024 on The AI Alignment Forum.

This is an interim report sharing preliminary results that we are currently building on. We hope this update will be useful to related research occurring in parallel.

Executive Summary

- We train SAEs on base / chat model pairs and find that SAEs trained on the base model transfer surprisingly well to reconstructing chat activations (and vice versa) on Mistral-7B and Qwen 1.5 0.5B. We also find that they don't transfer on Gemma v1 2B, and are generally bad at reconstructing the <1% of activations with unusually high norm (e.g. BOS tokens) from the opposite model.
- We fine-tune our base Mistral-7B SAE (on 5 million chat activations) to cheaply obtain an SAE with sparsity and reconstruction fidelity competitive with a chat SAE trained from scratch (on 800M tokens).
- We open source base, chat, and fine-tuned SAEs (plus wandb runs) for Mistral-7B and Qwen 1.5 0.5B:[1]
  - Mistral 7B base SAEs, Mistral 7B chat SAEs, Mistral 7B base SAEs fine-tuned on chat
  - Qwen 1.5 0.5B base SAEs, Qwen 1.5 0.5B chat SAEs, Qwen 1.5 0.5B base SAEs fine-tuned on chat
- We release accompanying evaluation code at https://github.com/ckkissane/sae-transfer

Introduction

Fine-tuning is a common technique for improving frontier language models; however, we don't actually understand what fine-tuning changes within a model's internals. Sparse autoencoders (SAEs) are a popular technique for decomposing the internal activations of LLMs into sparse, interpretable features, and may provide a path to zoom in on the differences between base and fine-tuned representations. In this update, we share preliminary results using SAEs to study the representation drift caused by fine-tuning. We investigate whether SAEs trained to accurately reconstruct a base model's activations also accurately reconstruct activations from the model after fine-tuning (and vice versa).

Beyond studying representation drift, we also think this is an important question for gauging the usefulness of sparse autoencoders as a general-purpose technique. One flaw of SAEs is that they are expensive to train, so training a new suite of SAEs from scratch each time a model is fine-tuned may be prohibitive. If we can fine-tune existing SAEs much more cheaply, or even just re-use them, their utility looks more promising.

We find that SAEs trained on the middle-layer residual stream of base models transfer surprisingly well to the corresponding chat model, and vice versa. Splicing the base SAE into the chat model achieves CE loss similar to the chat SAE on both Mistral-7B and Qwen 1.5 0.5B (a sketch of this splicing evaluation appears below). This suggests that the residual streams of these base and chat models are very similar.

However, we also identify cases where the SAEs don't transfer. First, the SAEs fail to reconstruct activations from the opposite model that have outlier norms (e.g. BOS tokens). These account for less than 1% of total activations, but cause cascading errors, so we need to filter them out in much of our analysis. We also find that SAEs don't transfer on Gemma v1 2B; the difference in weights between the Gemma v1 2B base and chat models is unusually large compared to other fine-tuned models, which explains this phenomenon.
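The splicing evaluation referred to above is the standard "CE loss recovered" style of measurement: run the model with the SAE's reconstruction substituted into the residual stream and compare the resulting cross-entropy loss against the unmodified model and a zero-ablation baseline. The sketch below is a minimal illustration under assumptions, not the authors' released evaluation code: it assumes a TransformerLens HookedTransformer `model`, an `sae` whose forward pass returns reconstructions of residual-stream activations, and zero ablation as the reference baseline.

```python
# Minimal sketch of a "CE loss recovered" evaluation (assumptions noted above;
# not the authors' exact code).
import torch
from transformer_lens import utils


def ce_loss_recovered(model, sae, tokens, layer):
    hook_name = utils.get_act_name("resid_pre", layer)

    # Clean loss: the model with no intervention.
    clean_loss = model(tokens, return_type="loss").item()

    # Spliced loss: replace the residual stream with the SAE reconstruction.
    def splice_sae(resid, hook):
        return sae(resid)  # assumes sae(x) returns the reconstruction of x

    spliced_loss = model.run_with_hooks(
        tokens, return_type="loss", fwd_hooks=[(hook_name, splice_sae)]
    ).item()

    # Zero-ablation baseline: how bad the loss gets with no signal at this layer.
    zero_loss = model.run_with_hooks(
        tokens,
        return_type="loss",
        fwd_hooks=[(hook_name, lambda resid, hook: torch.zeros_like(resid))],
    ).item()

    # A value near 1.0 means splicing in the SAE preserves essentially all of
    # the loss-relevant information at this layer.
    return (zero_loss - spliced_loss) / (zero_loss - clean_loss)
```

To reproduce the "base SAE spliced into the chat model" condition, the chat model would be passed together with the base-model SAE (and vice versa); the exact baseline and token filtering used in the post may differ from this sketch.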
Finally, to solve the outlier norm issue, we fine-tune a Mistral 7B base SAE on just 5 million chat tokens (compared to 800M tokens of pre-training) and obtain a chat SAE of comparable quality to one trained from scratch, without needing to filter out outlier activations.

Investigating SAE Transfer between base and chat models

In this section we investigate whether base SAEs transfer to chat models, and vice versa. We find that, with the exception of outlier-norm tokens (e.g. BOS), they transfer surprisingly well, achieving similar CE loss recovered to the orig...
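The fine-tuning described above (adapting a base SAE on roughly 5M chat activations instead of retraining on ~800M tokens) amounts to continuing the usual SAE training objective on activations drawn from the chat model. The following is a hypothetical sketch, not the released training code: it assumes a pretrained `base_sae` exposing `encode`/`decode`, an iterator `chat_activation_batches` yielding [batch, d_model] residual-stream activations from the chat model, and a standard MSE-plus-L1 objective; any auxiliary losses or resampling tricks from the original training run are omitted.

```python
# Hypothetical sketch: continue training a pretrained base-model SAE on
# chat-model activations (names and hyperparameters are illustrative).
import torch


def finetune_sae_on_chat(base_sae, chat_activation_batches,
                         l1_coeff=1e-3, lr=1e-4, n_activations=5_000_000):
    opt = torch.optim.Adam(base_sae.parameters(), lr=lr)
    seen = 0
    for acts in chat_activation_batches:      # acts: [batch, d_model]
        feature_acts = base_sae.encode(acts)  # sparse feature activations
        recon = base_sae.decode(feature_acts)
        mse = (recon - acts).pow(2).sum(-1).mean()
        l1 = feature_acts.abs().sum(-1).mean()
        loss = mse + l1_coeff * l1

        opt.zero_grad()
        loss.backward()
        opt.step()

        seen += acts.shape[0]
        if seen >= n_activations:  # ~5M chat activations vs ~800M for pretraining
            break
    return base_sae
```

Because the optimizer starts from the base SAE's weights, only a small number of chat activations is needed to adapt the dictionary, which is what makes this far cheaper than training a chat SAE from scratch.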
