
AF - A List of 45+ Mech Interp Project Ideas from Apollo Research's Interpretability Team by Lee Sharkey

Welcome to The Nonlinear Library, where we use text-to-speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A List of 45+ Mech Interp Project Ideas from Apollo Research's Interpretability Team, published by Lee Sharkey on July 18, 2024, on the AI Alignment Forum.

Why we made this list: The interpretability team at Apollo Research recently wrapped up a few projects[1]. To decide what to work on next, we generated many potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea we were excited about! Previous lists of project ideas (such as Neel's collation of 200 Concrete Open Problems in Mechanistic Interpretability) have been very useful for people breaking into the field. But for all its merits, that list is now over a year and a half old, so many of its project ideas no longer reflect what some researchers consider the frontiers of mech interp. We therefore thought it would be helpful to share our own list of project ideas.

Comments and caveats:
- Some of these projects are more precisely scoped than others; some are vague, others more developed.
- Not every member of the team endorses every project as high priority, but usually more than one team member supports each one, and in many cases most of the team is supportive of someone working on it.
- We attribute each idea to the person(s) who generated it.
- We've grouped the project ideas into categories for convenience, but some projects span multiple categories, and we don't put a huge amount of weight on this particular categorisation.

We hope some people find this list helpful, and we would love to see people working on these! If any sound interesting to you and you'd like to chat about it, don't hesitate to reach out.

Foundational work on sparse dictionary learning for interpretability

Transcoder-related project ideas (see "Transcoders Find Interpretable LLM Feature Circuits", arXiv:2406.11944):
- [Nix] Training and releasing high-quality transcoders, probably using top-k. GPT-2 is a classic candidate for this; I'd be excited for people to try hard on even smaller models, e.g. GELU-4L. (A minimal training sketch follows this list.)
- [Nix] Good tooling for using transcoders: a nice programming API to attribute an input to a collection of paths (see Dunefsky et al.), and perhaps a web user interface, maybe in collaboration with Neuronpedia. That would need a GPU server constantly running, but I'm optimistic you could do it with roughly an A4000.
- [Nix] Further circuit analysis using transcoders. Take random input sequences, run transcoder attribution on them, examine the output, and summarize the findings. High-level summary statistics of how much attribution goes through error terms and how many pathways are needed would be valuable. Also: explaining specific behaviors (IOI, greater-than) with high standards for specificity and faithfulness. Might be convoluted if accuracy [...]. [I could generate more ideas here; feel free to reach out: nix@apolloresearch.ai]
- [Nix, Lee] Cross-layer superposition. Does it happen? Probably, but it would be nice to have specific examples! Look for features with similar decoder vectors and do exploratory research to figure out what exactly is going on (see the decoder-similarity scan after this list). And what precisely does it mean? Answering this question seems likely to shed light on the question of 'What is a feature?'.
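To make the transcoder ideas above concrete, here is a minimal top-k transcoder sketch in PyTorch. It is a hypothetical illustration, not Apollo's implementation: a transcoder regresses an MLP's output activations from its input activations through a sparse bottleneck, with sparsity enforced by keeping only the k largest pre-activations. The dimensions, k, and the random stand-in activation batches are all assumptions for the example.

```python
import torch
import torch.nn as nn


class TopKTranscoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, mlp_in: torch.Tensor) -> torch.Tensor:
        pre_acts = self.encoder(mlp_in)
        # Keep only the k largest pre-activations per token; zero out the rest.
        topk = torch.topk(pre_acts, self.k, dim=-1)
        acts = torch.zeros_like(pre_acts).scatter_(-1, topk.indices, topk.values)
        return self.decoder(acts)


transcoder = TopKTranscoder(d_model=768, d_dict=768 * 32, k=32)
opt = torch.optim.Adam(transcoder.parameters(), lr=1e-4)

# Stand-in activation pairs; in practice these would be (MLP input, MLP output)
# activations cached from the target model (e.g. GPT-2) via hooks.
batches = [(torch.randn(64, 768), torch.randn(64, 768)) for _ in range(10)]
for mlp_in, mlp_out in batches:
    loss = (transcoder(mlp_in) - mlp_out).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

And for the cross-layer superposition idea, one simple exploratory check (again a sketch under assumed inputs, not a prescribed method) is to compare decoder directions from dictionaries trained at different layers and surface highly similar pairs, which are candidates for a feature split across layers:

```python
import torch
import torch.nn.functional as F


def similar_decoder_pairs(dec_a: torch.Tensor, dec_b: torch.Tensor, threshold: float = 0.7):
    """dec_a: (n_feats_a, d_model) and dec_b: (n_feats_b, d_model) decoder matrices."""
    # Cosine similarity between every decoder direction in layer A and layer B.
    sims = F.normalize(dec_a, dim=-1) @ F.normalize(dec_b, dim=-1).T
    i, j = (sims > threshold).nonzero(as_tuple=True)
    return [(int(a), int(b), float(sims[a, b])) for a, b in zip(i, j)]


# Random decoders as placeholders; real ones would come from dictionaries
# trained at two different layers of the same model.
pairs = similar_decoder_pairs(torch.randn(1024, 768), torch.randn(1024, 768))
```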
- [Lucius] Improving transcoder architectures. Some MLPs or attention layers may implement a simple linear transformation in addition to their actual computation. If we modify our transcoders to include a linear 'bypass' that is not counted in the sparsity penalty, do we improve performance, since we would no longer be unduly penalizing linear transformations that are always present and active? (A sketch of this variant follows below.) If we train multiple transcoders in different layers at the same time, can we include a sparsity penalty on their interactions with each other, encouraging a decomposition of the network that leaves us with as few interactions between features a...
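Here is one way the linear-bypass variant could look; the architecture is an assumption read off the description above (a dense linear path added alongside the sparse dictionary, with the L1 penalty applied only to dictionary activations), not a published design. Dimensions, coefficient, and activations are placeholders.

```python
import torch
import torch.nn as nn


class BypassTranscoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)
        # Dense linear path, deliberately exempt from the sparsity penalty, so
        # purely linear structure in the MLP needn't route through sparse features.
        self.bypass = nn.Linear(d_model, d_model, bias=False)

    def forward(self, mlp_in: torch.Tensor):
        acts = torch.relu(self.encoder(mlp_in))
        return self.decoder(acts) + self.bypass(mlp_in), acts


transcoder = BypassTranscoder(d_model=768, d_dict=768 * 32)
mlp_in, mlp_out = torch.randn(64, 768), torch.randn(64, 768)  # stand-in activations
recon, acts = transcoder(mlp_in)
# The L1 penalty touches only the dictionary activations; the bypass is unpenalized.
loss = (recon - mlp_out).pow(2).mean() + 1e-3 * acts.abs().sum(-1).mean()
```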