LW - Interpreting and Steering Features in Images by Gytis Daujotas

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Interpreting and Steering Features in Images, published by Gytis Daujotas on June 21, 2024 on LessWrong.

We trained an SAE to find sparse features in image embeddings. We found many meaningful, interpretable, and steerable features. We find that steering image diffusion works surprisingly well and yields predictable, high-quality generations. You can see the feature library here. We also have an intervention playground you can try.

Key Results

- We can extract interpretable features from CLIP image embeddings. We observe a diverse set of features, e.g. golden retrievers, there being two of something, image borders, nudity, and stylistic effects.
- Editing features allows for conceptual and semantic changes while maintaining generation quality and coherency.
- We devise a way to preview the causal impact of a feature, and show that many features have an explanation that is consistent with both what they activate for and what they cause.
- Many feature edits can be stacked to perform task-relevant operations, like transferring a subject, mixing in a specific property of a style, or removing something.

Interactive demo

Visit the feature library of over 50k features to explore the features we find. Our main result, the intervention playground, is now available for public use.

Introduction

We trained a sparse autoencoder on 1.4 million image embeddings to find monosemantic features. In our run, we found that 35% (58k) of the 163k total features were alive, meaning they have a non-zero activation for at least one image in our dataset. We found that many features map to human-interpretable concepts, like dog breeds, times of day, and emotions. Some express quantities, human relationships, and political activity. Others express more sophisticated relationships like organizations, groups of people, and pairs. Some features were also safety relevant. We found features for nudity, kink, and sickness and injury, which we won't link here.

Steering Features

Previous work found similarly interpretable features, e.g. in CLIP-ViT. We expand upon that work by training an SAE in a domain that allows for easily testing interventions. To test an explanation derived from describing the top activating images for a particular feature, we can intervene on an embedding and see if the generation (the decoded image) matches our hypothesis. We do this by steering the features of the image embedding and re-adding the reconstruction error. We then use an open-source diffusion model, Kandinsky 2.2, to diffuse an image back out conditioned on this embedding.

Even though an image typically has many active features that appear to encode a similar concept, intervening on one feature with a much higher activation value still works and yields an output without noticeable degradation in quality.

We built an intervention playground where users could adjust the features of an image to test hypotheses, and later found that the steering worked so well that users could perform many meaningful tasks while maintaining an output as coherent and high quality as the original. For instance, the subject of one photo could be transferred to another. We could adjust the time of day and the quantity of the subject. We could add entirely new features to images to sculpt and finely control them.
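The post links a demo rather than code, but the steering procedure described in this section (encode the CLIP image embedding with the SAE, pin one feature's activation, decode, and re-add the reconstruction error before rendering with Kandinsky 2.2) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the SAE architecture and sizes are assumptions, the feature index and activation value are placeholders, and passing zero negative image embeddings to the Kandinsky decoder is a simplification.

```python
import torch
import torch.nn as nn
from diffusers import KandinskyV22Pipeline

class SparseAutoencoder(nn.Module):
    """Minimal ReLU sparse autoencoder. Dimensions are placeholders;
    the post reports 163k total features over CLIP image embeddings."""
    def __init__(self, d_embed: int = 1280, d_features: int = 163_840):
        super().__init__()
        self.enc = nn.Linear(d_embed, d_features)
        self.dec = nn.Linear(d_features, d_embed)

    def encode(self, x):
        return torch.relu(self.enc(x))

    def decode(self, f):
        return self.dec(f)

@torch.no_grad()
def steer(sae, image_embed, feature_idx, value):
    """Pin one feature to a chosen activation, then rebuild the embedding,
    re-adding the SAE's reconstruction error so detail the SAE misses
    is carried through unchanged."""
    acts = sae.encode(image_embed)           # sparse feature activations
    error = image_embed - sae.decode(acts)   # what the SAE fails to capture
    acts[..., feature_idx] = value           # the intervention itself
    return sae.decode(acts) + error

sae = SparseAutoencoder()                    # stands in for a trained SAE
image_embed = torch.randn(1, 1280)           # stands in for a CLIP image embedding
steered = steer(sae, image_embed, feature_idx=1234, value=8.0)  # placeholder edit

# Render the steered embedding back into an image with Kandinsky 2.2.
pipe = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
).to("cuda")
image = pipe(
    image_embeds=steered.to("cuda", torch.float16),
    negative_image_embeds=torch.zeros_like(steered).to("cuda", torch.float16),
    height=512, width=512,
).images[0]
```

Re-adding the error term is what lets a single-feature edit change one concept while leaving the rest of the generation intact.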
We could pick two photos that had a semantic difference and precisely transfer the difference by transferring the features. We could also stack hundreds of edits together. Qualitative tests showed that even relatively untrained users could learn to manipulate image features in meaningful directions. This was an exciting result, because it could suggest that feature-space edits could be useful for setting inference-time rules (e.g. banning some feature that the underlying model learned) or as user interf...
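Under the same assumptions, the transfer of a semantic difference between two photos can be sketched as copying the activations of selected features from the source embedding into the target's, again re-adding the target's reconstruction error. The feature indices below are hypothetical, and the SparseAutoencoder stand-in is the one sketched above.

```python
@torch.no_grad()
def transfer_features(sae, src_embed, dst_embed, feature_idxs):
    """Copy selected feature activations from src to dst. feature_idxs
    would be the features believed to encode the semantic difference
    between the two photos (a hypothetical selection)."""
    src_acts = sae.encode(src_embed)
    dst_acts = sae.encode(dst_embed)
    error = dst_embed - sae.decode(dst_acts)   # preserve unmodeled detail
    dst_acts[..., feature_idxs] = src_acts[..., feature_idxs]
    return sae.decode(dst_acts) + error

src_embed = torch.randn(1, 1280)   # stand-ins for the two photos' CLIP embeddings
dst_embed = torch.randn(1, 1280)
edited = transfer_features(sae, src_embed, dst_embed, feature_idxs=[1234, 5678])
# `edited` can be rendered with the same Kandinsky 2.2 decoder as above;
# stacking more indices into feature_idxs stacks more edits.
```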
