Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.
AF - Interpreting Preference Models w/ Sparse Autoencoders by Logan Riggs Smith

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Interpreting Preference Models w/ Sparse Autoencoders, published by Logan Riggs Smith on July 1, 2024 on The AI Alignment Forum.

Preference Models (PMs) are trained to imitate human preferences and are used when training with RLHF (reinforcement learning from human feedback); however, we don't know what features the PM is using when outputting reward. For example, maybe curse words make the reward go down and wedding-related words make it go up. It would be good to verify that the features we wanted to instill in the PM (e.g. helpfulness, harmlessness, honesty) are actually rewarded and those we don't (e.g. deception, sycophancy) aren't.

Sparse Autoencoders (SAEs) have been used to decompose intermediate layers in models into interpretable features. Here we train SAEs on a 7B parameter PM and find the features that are most responsible for the reward going up & down.

High level takeaways:
1. We're able to find SAE features that have a large causal effect on reward, which can be used to "jailbreak" prompts.
2. We do not explain 100% of reward differences through SAE features, even though we tried for a couple hours.

What are PMs? [skip if you're already familiar]

When talking to a chatbot, it can output several different responses, and you can choose which one you believe is better. We can then train the LLM on this feedback for every output, but humans are too slow. So we'll just get, say, 100k human preferences of "response A is better than response B", and train another AI to predict human preferences! But to take in text & output a reward, a PM would benefit from understanding language. So one typically trains a PM by first taking an already pretrained model (e.g. GPT-3) and replacing the last component of the LLM, of shape [d_model, vocab_size] (which converts the residual stream into ~50k numbers, the probability of each word in its vocabulary), with a [d_model, 1] head, which converts it into one number representing reward. This pretrained model with the new "head" is then called a "Preference Model", and it is trained to predict the human-preference dataset. Did it give the human-preferred response [A] a higher number than [B]? Good. If not, bad!

This leads to two important points:
1. Reward is relative - the PM is only trained to say the human-preferred response is better than the alternative. So a large negative reward or a large positive reward has no objective meaning; all that matters is the relative reward difference between two completions given the same prompt. (h/t to Ethan Perez's post)
2. Most features are already learned in pretraining - the PM isn't learning new features from scratch. It's taking advantage of the pretrained model's existing concepts. These features might change a bit or compose with each other differently, though. (Note: this is an unsubstantiated hypothesis of mine.)

Finding High Reward-affecting Features w/ SAEs

We trained 6 SAEs on layers 2, 8, 12, 14, 16, and 20 of an open source 7B parameter PM, finding 32k features for each layer. We then find the most important features for the reward going up or down (specifics in the Technical Details section). Below is a selection of features found through this process that we thought were interesting enough to try to create prompts with. (My list of feature interpretations for each layer can be found here.)

Negative Features

A "negative" feature is a feature that will decrease the reward that the PM predicts. This could include features like cursing or saying the same word repeatedly.
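The head swap and pairwise training objective described above can be sketched in toy form. Everything below is illustrative: the sizes and weights are made up, and since the post doesn't state the PM's exact training loss, I've assumed the standard Bradley-Terry-style pairwise objective commonly used for reward models.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # toy width; the actual PM is 7B parameters

# The pretrained LM head maps the final residual stream [d_model] to
# vocab_size logits. A PM replaces it with a [d_model, 1] reward head.
reward_head = rng.normal(size=(d_model, 1))

def reward(h):
    """Scalar reward from the last-token residual stream h."""
    return float(h @ reward_head)

def preference_loss(h_chosen, h_rejected):
    """Pairwise (Bradley-Terry style) loss: push the chosen reward above
    the rejected reward. Only the reward *difference* matters, which is
    why absolute reward values have no objective meaning."""
    margin = reward(h_chosen) - reward(h_rejected)
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

h_a = rng.normal(size=d_model)  # hidden state for the chosen response
h_b = rng.normal(size=d_model)  # hidden state for the rejected response
loss = preference_loss(h_a, h_b)
```

Note that adding any constant to both rewards leaves `preference_loss` unchanged, which is exactly point 1 above: reward is relative.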
Therefore, we should expect that removing a negative feature makes the reward go up.

When looking at a feature, I'll look at the top datapoints where removing it affected the reward the most: Removing feature 11612 made the chosen reward go up by 1.2, from 4.79 -> 6.02, and had no effect on the rejected completion because it doesn't a...
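This ablation test can be sketched in toy numpy form. All weights and sizes here are made up, and the linear "reward" stand-in replaces the rest of the PM's forward pass; the real pipeline ablates features of a trained SAE spliced into the 7B PM.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64  # toy sizes; the post uses 32k features per layer

# Hypothetical SAE weights; a real SAE is trained so that
# decode(encode(x)) sparsely reconstructs the layer activation x.
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
reward_head = rng.normal(size=(d_model, 1))  # stand-in for the rest of the PM

def encode(x):
    return np.maximum(x @ W_enc, 0.0)  # ReLU encoder -> sparse feature activations

def decode(f):
    return f @ W_dec  # linear decoder back to the residual stream

def reward(x):
    return float(x @ reward_head)

def reward_effect(x, idx):
    """Causal effect of SAE feature idx on reward: reward with the full
    reconstruction minus reward with that one feature zeroed out.
    A 'negative' feature has a negative effect, so removing it
    should make the reward go up."""
    f = encode(x)
    f_ablated = f.copy()
    f_ablated[idx] = 0.0
    return reward(decode(f)) - reward(decode(f_ablated))

x = rng.normal(size=d_model)  # a layer activation from some prompt
effects = [reward_effect(x, i) for i in range(n_features)]
```

Ranking features by `reward_effect` over many (prompt, completion) pairs is one way to surface the "most reward-affecting" features; features that never activate on a datapoint have zero effect on its reward there.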
