AF - A List of 45+ Mech Interp Project Ideas from Apollo Research's Interpretability Team by Lee Sharkey
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A List of 45+ Mech Interp Project Ideas from Apollo Research's Interpretability Team, published by Lee Sharkey on July 18, 2024 on The AI Alignment Forum.
Why we made this list:
The interpretability team at Apollo Research wrapped up a few projects recently[1]. In order to decide what we'd work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that we were excited about!
Previous lists of project ideas (such as Neel's collation of 200 Concrete Open Problems in Mechanistic Interpretability) have been very useful for people breaking into the field. But for all its merits, that list is now over a year and a half old. Therefore, many project ideas in that list aren't an up-to-date reflection of what some researchers consider the frontiers of mech interp.
We therefore thought it would be helpful to share our list of project ideas!
Comments and caveats:
Some of these projects are more precisely scoped than others: some are vague, others more developed.
Not every member of the team endorses every project as high priority. Usually more than one team member supports each one, and in many cases most of the team is supportive of someone working on it.
We attribute each idea to the team member(s) who generated it.
We've grouped the project ideas into categories for convenience, but some projects span multiple categories. We don't put a huge amount of weight on this particular categorisation.
We hope some people find this list helpful!
We would love to see people working on these! If any sound interesting to you and you'd like to chat about it, don't hesitate to reach out.
Foundational work on sparse dictionary learning for interpretability
Transcoder-related project ideas
See Transcoders Find Interpretable LLM Feature Circuits (arXiv:2406.11944).
[Nix] Training and releasing high-quality transcoders.
Probably using top-k (a minimal sketch follows this block).
GPT-2 is a classic candidate for this. I'd be excited for people to try hard on even smaller models, e.g. GELU 4L.
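As a concrete starting point, here is a minimal sketch of a top-k transcoder in PyTorch. All names are illustrative, and the training data is assumed to be cached (mlp_in, mlp_out) activation pairs from the model being studied; a real run would add details such as decoder normalization.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKTranscoder(nn.Module):
    """Sketch: approximates an MLP layer's input-to-output map through a sparse bottleneck."""
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)
        self.k = k

    def forward(self, mlp_in: torch.Tensor) -> torch.Tensor:
        acts = F.relu(self.enc(mlp_in))
        # Keep only the k largest feature activations per token; zero out the rest.
        top = torch.topk(acts, self.k, dim=-1)
        sparse = torch.zeros_like(acts).scatter(-1, top.indices, top.values)
        return self.dec(sparse)

# Training objective: regress onto the MLP's true outputs. With top-k,
# sparsity is enforced by construction, so no L1 penalty is needed.
def transcoder_loss(tc: TopKTranscoder, mlp_in: torch.Tensor, mlp_out: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(tc(mlp_in), mlp_out)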
[Nix] Good tooling for using transcoders
A nice programming API to attribute an input to a collection of paths (see Dunefsky et al); a sketch of one such primitive follows this block.
Web user interface? Maybe in collaboration with Neuronpedia. It would need a GPU server constantly running, but I'm optimistic you could do it with a ~A4000.
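To illustrate the kind of primitive such an API might expose, here is a grad-times-activation attribution sketch over transcoder feature activations. This is a simplification of the path-based attribution in Dunefsky et al., not their actual interface, and all names are hypothetical.

import torch

def feature_attribution(metric: torch.Tensor, feature_acts: torch.Tensor) -> torch.Tensor:
    """Sketch: attribute a scalar metric (e.g. a logit difference) to transcoder
    feature activations via gradient * activation.
    feature_acts: [n_tokens, d_dict] tensor that requires grad and lies on
    the computation graph of `metric`."""
    (grads,) = torch.autograd.grad(metric, feature_acts, retain_graph=True)
    return grads * feature_acts  # [n_tokens, d_dict] attribution scores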
[Nix] Further circuit analysis using transcoders.
Take random input sequences, run transcoder attribution on them, examine the output, and summarize the findings.
High-level summary statistics of how much attribution goes through error terms & how many pathways are needed would be valuable (see the sketch after this block).
Explaining specific behaviors (IOI, greater-than) with high standards for specificity & faithfulness. Might be convoluted if accuracy is insufficient.
[I could generate more ideas here; feel free to reach out: nix@apolloresearch.ai]
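For the "how many pathways are needed" statistic above, one simple version (a hypothetical helper; it assumes attribution scores have already been computed, e.g. with the function sketched earlier):

import numpy as np

def n_paths_for_coverage(attributions: np.ndarray, coverage: float = 0.9) -> int:
    """Sketch: number of paths, taken in decreasing order of |attribution|,
    needed to account for the given fraction of total absolute attribution."""
    mags = np.sort(np.abs(attributions).ravel())[::-1]
    cum = np.cumsum(mags) / mags.sum()
    return int(np.searchsorted(cum, coverage) + 1)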
[Nix, Lee] Cross-layer superposition
Does it happen? Probably, but it would be nice to have specific examples! Look for features with similar decoder vectors, and do exploratory research to figure out what exactly is going on (a similarity-scan sketch follows this block).
What precisely does it mean? Answering this question seems likely to shed light on the question of 'What is a feature?'.
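A minimal sketch of the exploratory first step suggested above: scan for near-parallel decoder directions across two layers' dictionaries. Names are hypothetical; dec_a and dec_b are assumed to be decoder weight matrices of shape [n_features, d_model].

import torch
import torch.nn.functional as F

def similar_decoder_pairs(dec_a: torch.Tensor, dec_b: torch.Tensor, threshold: float = 0.9):
    """Sketch: return (i, j, cosine_sim) for cross-layer feature pairs whose
    decoder directions have cosine similarity above the threshold."""
    a = F.normalize(dec_a, dim=-1)
    b = F.normalize(dec_b, dim=-1)
    sims = a @ b.T  # [n_features_a, n_features_b] cosine similarities
    pairs = (sims > threshold).nonzero()
    return [(int(i), int(j), float(sims[i, j])) for i, j in pairs]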
[Lucius] Improving transcoder architectures
Some MLPs or attention layers may implement a simple linear transformation in addition to the actual computation. If we modify our transcoders to include a linear 'bypass' that is not counted in the sparsity penalty, do we improve performance, since we would no longer unduly penalize these linear transformations that are always present and active? (A sketch follows this list.)
If we train multiple transcoders in different layers at the same time, can we include a sparsity penalty for their interactions with each other, encouraging a decomposition of the network that leaves us with as few interactions between features as possible? [...]
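A sketch of the linear-bypass idea, assuming an L1-penalized (rather than top-k) transcoder so that there is a sparsity term from which to exclude the bypass; all names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BypassTranscoder(nn.Module):
    """Sketch: transcoder with a linear bypass excluded from the sparsity penalty."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)
        # Linear 'bypass' capturing any always-on linear component of the layer.
        self.bypass = nn.Linear(d_model, d_model, bias=False)

    def forward(self, mlp_in: torch.Tensor):
        feats = F.relu(self.enc(mlp_in))
        return self.dec(feats) + self.bypass(mlp_in), feats

def bypass_loss(tc: BypassTranscoder, mlp_in: torch.Tensor, mlp_out: torch.Tensor, l1_coef: float = 1e-3):
    pred, feats = tc(mlp_in)
    recon = F.mse_loss(pred, mlp_out)
    sparsity = feats.abs().mean()  # bypass weights are excluded by construction
    return recon + l1_coef * sparsity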