LW - Measuring Structure Development in Algorithmic Transformers by Micurie

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring Structure Development in Algorithmic Transformers, published by Micurie on August 22, 2024 on LessWrong.

tl;dr: We compute the evolution of the local learning coefficient (LLC), a proxy for model complexity, for an algorithmic transformer. The LLC decreases as the model learns more structured solutions, such as head specialization.

This post is structured in three main parts: (1) a summary, giving an overview of the main results, (2) the Fine Print, which delves into various cross-checks and details, and (3) Discussion and Conclusions.

Structure Formation in Algorithmic Transformers

In this work we study the development of simple algorithmic transformers, i.e. transformers that learn to perform algorithmic tasks. A major advantage of this setup is that we can control several (hyper)parameters, such as the complexity of the training data and the network architecture. This allows us to do targeted experiments studying the impact of these parameters on the learning dynamics. The main tool we use to study this development is the local learning coefficient (LLC), and we choose cases where we have a reverse-engineered solution.

Why use the LLC for this purpose? It is a theoretically well-motivated measure of model complexity, defined by Lau et al. For an overview of Singular Learning Theory (which serves as the theoretical foundation for the LLC), see Liam Carroll's Distilling SLT sequence. For a brief overview of the LLC, see e.g. this post.

We use the same setup as CallumMcDougall's October Monthly Algorithmic Mech-Interp Challenge. The model is an attention-only transformer with layer norm, trained to sort numbers using a cross-entropy loss, weight decay, and the Adam optimizer. The residual stream size is 96 and the head dimension is 48. It is trained on sequences consisting of an unsorted list of numbers, a separation token, and the sorted list, and it must predict the next token starting at the separation token. The numbers in the list are sampled uniformly from 0 to 50, which together with the separation token gives a vocabulary of 52 tokens. Numbers do not repeat within a list.

1-Head Model

Let's first look at the case of a 1-head transformer. The model reaches 100% accuracy around training step 100, confirming that a single attention head is sufficient for sorting, as noted in previous work. Once maximum accuracy is reached, the full QK and OV circuits[2] behave as described by Callum for the 2-head model: in the QK circuit, source tokens attend most to the smallest token in the list that is larger than themselves, which shows up as a higher-value band above the diagonal and a lower-value band below it. The OV circuit copies tokens, as seen in the clear positive diagonal pattern.

In addition, we observe a transition around training step 1000, where the LLC decreases while the accuracy stays unchanged. This is supported by a drop in the sum of the ranks[3] of the matrices in the heat maps. It also coincides with the formation of the off-diagonal stripes in the OV circuit. We speculate that these are simpler than the noisier off-diagonal OV pattern observed at peak LLC, and that they correspond to the translational symmetry of the problem. We define a Translational Symmetry measure[1] (see the purple line in the plot) to capture the degree to which the circuits obey this symmetry. It increases throughout most of training, even after the other measures stabilize.
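To make the circuit discussion above concrete, here is a minimal sketch of how the "full" QK and OV circuits of a one-layer, attention-only transformer can be computed from its weight matrices, using the standard definitions (embedding composed with the attention weights, and embedding composed with the output/unembedding path). The weight names, shapes, and helper function below are illustrative assumptions for the dimensions quoted in the post, not the exact code or variable names the authors used.

```python
import torch

def full_qk_ov_circuits(W_E, W_U, W_Q, W_K, W_V, W_O):
    """Per-head full QK and OV circuits for a 1-layer attention-only transformer.

    Assumed shapes (d_vocab=52, d_model=96, d_head=48):
      W_E: (d_vocab, d_model)        token embedding
      W_U: (d_model, d_vocab)        unembedding
      W_Q, W_K, W_V: (n_heads, d_model, d_head)
      W_O: (n_heads, d_head, d_model)
    """
    qk, ov = [], []
    for h in range(W_Q.shape[0]):
        # Full QK circuit: how strongly query token i attends to key token j.
        # A high band just above the diagonal = "attend to the smallest larger token".
        qk.append(W_E @ W_Q[h] @ W_K[h].T @ W_E.T)   # (d_vocab, d_vocab)
        # Full OV circuit: which logits the head writes when attending to token j.
        # A positive diagonal = the head copies the attended token.
        ov.append(W_E @ W_V[h] @ W_O[h] @ W_U)       # (d_vocab, d_vocab)
    return torch.stack(qk), torch.stack(ov)

# Shape check with random weights for the post's dimensions:
d_vocab, d_model, d_head, n_heads = 52, 96, 48, 1
W_E = torch.randn(d_vocab, d_model)
W_U = torch.randn(d_model, d_vocab)
W_Q, W_K, W_V = (torch.randn(n_heads, d_model, d_head) for _ in range(3))
W_O = torch.randn(n_heads, d_head, d_model)
QK, OV = full_qk_ov_circuits(W_E, W_U, W_Q, W_K, W_V, W_O)
print(QK.shape, OV.shape)  # (1, 52, 52), (1, 52, 52)

# The post also tracks the sum of the ranks of these circuit matrices over training:
rank_sum = sum(int(torch.linalg.matrix_rank(m)) for m in list(QK) + list(OV))
```

With trained weights, plotting each (d_vocab, d_vocab) matrix as a heat map gives the banded QK pattern and diagonal OV pattern described above, and the rank sum is the quantity whose drop accompanies the LLC decrease.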
2-Head Model

Let's now turn our attention to the 2-head transformer in Callum's original setup. We see a lot of qualitative similarities to the evolution of the full QK and OV circuits for the 1-head model. As the LLC begins to drop (around training step 1000), we note the following:

QK circuit: Slight changes[5] to the attention pattern, which crystallize into triangular regions late in the training, long after the LLC has stabilized.

OV circuit: The heads specialize, splittin...
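Since the LLC is the quantity being tracked throughout, a rough sketch of how such an estimate is typically obtained may be useful. Following the estimator of Lau et al., one samples weights from a Gibbs posterior localized around the trained weights w* (e.g. with SGLD) and computes lambda_hat = n * beta * (E[L_n(w)] - L_n(w*)). The function, hyperparameters, and sampler details below are illustrative assumptions, not the authors' exact procedure.

```python
import math
import torch

def estimate_llc(model, loss_fn, batches, n_steps=500, lr=1e-5, gamma=100.0, beta=None):
    """Rough SGLD-based LLC estimate: n * beta * (E[L(w)] - L(w*)).
    `batches` is a list of (inputs, targets); all hyperparameters are illustrative."""
    n = sum(len(x) for x, _ in batches)                  # total number of training samples
    beta = beta if beta is not None else 1.0 / math.log(n)

    def dataset_loss(m):
        with torch.no_grad():
            return sum(loss_fn(m(x), y).item() * len(x) for x, y in batches) / n

    w_star = [p.detach().clone() for p in model.parameters()]
    init_loss = dataset_loss(model)

    sampled_losses = []
    for step in range(n_steps):
        x, y = batches[step % len(batches)]
        model.zero_grad()
        loss_fn(model(x), y).backward()
        with torch.no_grad():
            for p, p0 in zip(model.parameters(), w_star):
                # Gradient of the localized log-posterior: n*beta*dL/dw + gamma*(w - w*)
                drift = n * beta * p.grad + gamma * (p - p0)
                # SGLD update: half-step down the drift plus Gaussian noise of scale sqrt(lr)
                p.add_(-0.5 * lr * drift + math.sqrt(lr) * torch.randn_like(p))
        sampled_losses.append(dataset_loss(model))

    # Restore the trained weights so the model is unchanged after estimation.
    with torch.no_grad():
        for p, p0 in zip(model.parameters(), w_star):
            p.copy_(p0)

    return n * beta * (sum(sampled_losses) / len(sampled_losses) - init_loss)
```

In practice one would use a more careful sampler (burn-in, multiple chains, tuned step size and localization strength), and the absolute value of the estimate matters less than how it changes across training checkpoints, which is what the curves in the post track.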
