Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.

AF - Extracting SAE task features for ICL by Dmitrii Kharlapenko

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Extracting SAE task features for ICL, published by Dmitrii Kharlapenko on August 12, 2024 on the AI Alignment Forum.

TL;DR

We try to study task vectors in the SAE basis. This is challenging because there is no canonical way to convert an arbitrary vector in the residual stream into a linear combination of SAE features - you can't just pass an arbitrary vector through the encoder without going off distribution. We explored the gradient pursuit algorithm suggested in Smith et al., but it didn't work for us without modifications.

Our approach is to apply the SAE encoder to the task vector and then apply a gradient-based cleanup (sketched below). This exploits the fact that task vectors have a differentiable objective. We find that this gives a sparser and cleaner reconstruction, which is also highly interpretable, and which also serves as a better task vector because it directly optimizes for log likelihood. This takes us from ~100 active features to ~10.

Using our algorithm, we find two classes of SAE features involved in ICL. One class recognizes the exact tasks or output formats from the examples, and the other encodes the tasks for the model to execute later on. We show that steering with these features has causal effects similar to task vectors.

This work was produced as part of the ML Alignment & Theory Scholars Program - Summer 24 Cohort, under mentorship from Neel Nanda and Arthur Conmy.

Prior work

Task or function vectors are internal representations of some task that LLMs form while processing an ICL prompt. They can be extracted from a model running on a few-shot prompt and then used to make it complete the same task without any prior context or task description. Several papers (Function vectors in large language models; In-Context Learning Creates Task Vectors) have proposed different ways to extract these task vectors. They all center on feeding ICL examples to a model in the form "input output, … " and averaging the residuals on the "separator" token over a batch (see the extraction sketch below). This approach recovers some part of the ICL performance but does not admit a straightforward conversion to the SAE basis.

ITO with gradient pursuit can be used to compute a sparse coding of a residual vector using SAE features. The post suggests using this algorithm for SAE decomposition of steering vectors. Since task vectors can be thought of as steering vectors, ITO may provide some insight into how they operate.

Initial Phi-3 experiments

Direct SAE task vector reconstruction

In our study we trained a set of gated SAEs for Phi-3 Mini 3.8B using a model-generated synthetic instruction dataset. While offering a sparse dictionary decomposition of residuals, SAEs introduce a reconstruction error that impacts model performance. They are also not guaranteed to decompose out-of-distribution vectors, and task vectors - being the product of averaging activations across prompts and tokens - may well be such vectors. We therefore first studied how well SAE reconstructions of task vectors transfer the definition of two tasks: 1) antonym generation and 2) English-to-Spanish word translation. These and the other tasks used to study task vectors were taken from the GitHub repository of the ICL task vectors paper.
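As a concrete illustration of the extraction recipe described under "Prior work", here is a minimal PyTorch sketch, assuming a TransformerLens-style HookedTransformer whose cached activations are keyed by names like "blocks.{layer}.hook_resid_post"; the prompt formatting and separator bookkeeping here are assumptions for illustration, not the authors' code.

import torch

def extract_task_vector(model, prompts, sep_positions, layer):
    """Average the residual stream at the separator token over a batch
    of few-shot ICL prompts."""
    hook_name = f"blocks.{layer}.hook_resid_post"
    resids = []
    for prompt, sep_pos in zip(prompts, sep_positions):
        _, cache = model.run_with_cache(prompt)
        # cache[hook_name] has shape (batch=1, seq_len, d_model)
        resids.append(cache[hook_name][0, sep_pos])
    return torch.stack(resids).mean(dim=0)  # the task vector, shape (d_model,)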
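The gradient-based cleanup from the TL;DR can be sketched in the same spirit: start from the SAE encoder's latents for the task vector, then optimize those latents directly against the model's zero-shot loss on the task, with a sparsity penalty to cut the active feature count. Everything here (sae.encode / sae.decode, the task_nll closure, and the hyperparameters) is an assumption about the surrounding codebase, not the authors' implementation.

import torch

def clean_task_vector(sae, task_vector, task_nll, n_steps=200, lr=1e-2, l1_coef=1e-3):
    """Refine the SAE encoding of a task vector by optimizing the decoded
    vector's zero-shot NLL plus an L1 sparsity penalty on the latents."""
    # Initialize from the (possibly off-distribution) encoder output.
    latents = sae.encode(task_vector).detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([latents], lr=lr)
    for _ in range(n_steps):
        candidate = sae.decode(latents)  # candidate task vector
        # task_nll is assumed to run the model on zero-shot prompts with
        # `candidate` patched in and return the mean NLL of the answers.
        loss = task_nll(candidate) + l1_coef * latents.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            latents.clamp_(min=0.0)  # SAE feature activations are non-negative
    return latents.detach()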
These charts show the NLL loss of the model on the evaluation set of zero-shot prompts for both tasks, as a function of the layer of extraction/insertion. "TV" stands for the original task vector performance; "Recon of TV" stands for using the SAE reconstruction of the task vector instead of the task vector itself; "TV on recon" stands for first taking SAE reconstructions of the residuals and then collecting a task vector from them; "ITO" stands for the ITO algorithm with a target L0 of 40. It can be seen from the charts that SAE reconstruction significantly decrea...
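For reference, an NLL comparison like the one in these charts could be run with a patching loop along the following lines; overwriting the residual stream at the separator position with the candidate vector is one common convention for inserting task vectors, and the hook names again assume a TransformerLens-style model rather than the authors' evaluation code.

import torch

def nll_with_vector(model, tokens, answer_token, vector, layer, pos):
    """NLL of the correct answer when `vector` is patched into the
    residual stream at position `pos` of a zero-shot prompt."""
    def patch_resid(resid, hook):
        resid[:, pos] = vector  # replace the residual with the candidate
        return resid

    logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_post", patch_resid)],
    )
    log_probs = logits[0, -1].log_softmax(dim=-1)
    return -log_probs[answer_token].item()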