AF - Degeneracies are sticky for SGD by Guillaume Corlouer

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Degeneracies are sticky for SGD, published by Guillaume Corlouer on June 16, 2024 on The AI Alignment Forum.

Introduction

Singular learning theory (SLT) is a theory of learning dynamics in Bayesian statistical models. It has been argued that SLT could provide insights into the training dynamics of deep neural networks. However, a theory of deep learning inspired by SLT is still lacking. In particular, it seems important to have a better understanding of the relevance of SLT insights to stochastic gradient descent (SGD), the paradigmatic deep learning optimization algorithm. We explore how the degeneracies[1] of toy, low-dimensional loss landscapes affect the dynamics of stochastic gradient descent (SGD).[2] We also investigate the hypothesis that the set of parameters selected by SGD after a large number of gradient steps on a degenerate landscape is distributed like the Bayesian posterior at low temperature (i.e., in the large-sample limit). We do so by running SGD on 1D and 2D loss landscapes with minima of varying degrees of degeneracy. While researchers experienced with SLT are aware of differences between SGD and Bayesian inference, we want to understand the influence of degeneracies on SGD with more precision, and to have specific examples where SGD dynamics and Bayesian inference can differ.

Main takeaways

- Degeneracies influence SGD dynamics in two ways: (1) convergence to a critical point is slower the more degenerate the critical point is; (2) on a (partially) degenerate manifold, SGD preferentially escapes along non-degenerate directions. If all directions are degenerate, we empirically observe that SGD is "stuck".
- To explain our observations, we show that, for our models, the SGD noise covariance is proportional to the Hessian in the neighborhood of a critical point of the loss. Thus, to leading order in the neighborhood of a critical point, the SGD noise covariance goes to zero faster along more degenerate directions.
- Qualitatively, we observe that the concentration of the end-of-training distribution of parameters sampled from a set of SGD trajectories sometimes differs from the Bayesian posterior as predicted by SLT, because of:
  - hyperparameters such as the learning rate;
  - the number of orthogonal degenerate directions;
  - the degree of degeneracy in the neighborhood of a minimum.

Terminology and notation

We advise the reader to skip this section and come back to it if notation or terminology is confusing. Consider a sequence of $n$ input-output pairs $(x_i, y_i)_{1 \le i \le n}$. We can think of $x_i$ as input data to a deep learning model (e.g., a picture, or a token) and $y_i$ as an output the model is trying to learn (e.g., whether the picture represents a cat or a dog, or what the next token is). A deep learning model may be represented as a function $y = f(w, x)$, where $w \in \Omega$ is a point in a parameter space $\Omega$. The one-sample loss function, noted $l_i(w) := \frac{1}{2}(y_i - f(w, x_i))^2$ ($1 \le i \le n$), is a measure of how good the model parametrized by $w$ is at predicting the output $y_i$ on input $x_i$. The empirical loss over $n$ samples is noted $l_n(w) := \frac{1}{n}\sum_{i=1}^{n} l_i(w)$. Noting $q(x, y)$ the probability density function of input-output pairs, the theoretical loss (or the potential) writes $l(w) = \mathbb{E}_q[l_n(w)]$.[4] The loss landscape is the manifold associated with the theoretical loss function $w \mapsto l(w)$. A point $w^*$ is a critical point if the gradient of the theoretical loss is 0 at $w^*$, i.e., $\nabla l(w^*) = 0$.
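To make the notation concrete, here is a minimal numerical sketch of these definitions. It assumes a toy linear model $f(w, x) = wx$ with a scalar parameter and noiseless targets; the model and all names are our illustration, not taken from the post.

```python
import numpy as np

# Minimal sketch of the setup above (our own illustration, not code from the
# post), assuming a toy linear model f(w, x) = w * x with a scalar parameter.

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
w_star = 2.0
y = w_star * x  # noiseless targets, so the theoretical loss vanishes at w*

def f(w, x):
    return w * x

def one_sample_loss(w, i):
    # l_i(w) := (1/2) * (y_i - f(w, x_i))^2
    return 0.5 * (y[i] - f(w, x[i])) ** 2

def empirical_loss(w):
    # l_n(w) := (1/n) * sum_i l_i(w)
    return 0.5 * np.mean((y - f(w, x)) ** 2)

# Numerically check that w* is a critical point: the gradient is ~0 there.
eps = 1e-5
grad_at_w_star = (empirical_loss(w_star + eps) - empirical_loss(w_star - eps)) / (2 * eps)
print(f"l_n(w*) = {empirical_loss(w_star):.3e}")
print(f"finite-difference gradient at w*: {grad_at_w_star:.3e}")
```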
A critical point $w^*$ is degenerate if the Hessian of the loss $H(w^*) := \nabla^2 l(w^*)$ has at least one zero eigenvalue at $w^*$. An eigenvector $u$ of $H$ with zero eigenvalue is a degenerate direction. The local learning coefficient $\lambda(w^*)$ measures the greatest amount of degeneracy of a model around a critical point $w^*$. For the purpose of this work, if locally $l(w = (w_1, w_2)) \propto (w_1 - w_1^*)^{2k_1}(w_2 - w_2^*)^{2k_2}$, then the local learning coefficient is given by $\lambda(w^*) = \min\left(\frac{1}{2k_1}, \frac{1}{2k_2}\right)$. (For instance, if $l(w) \propto w_1^2 w_2^4$ near the origin, then $k_1 = 1$ and $k_2 = 2$, so $\lambda = \min(1/2, 1/4) = 1/4$.) We say that a critical point...
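To illustrate the stickiness described in the main takeaways, here is a small simulation sketch. It assumes a toy realizable model whose theoretical loss is $l(w) = \frac{1}{2}w^{2k}$, so that the critical point $w^* = 0$ becomes more degenerate as $k$ grows; the hyperparameters (learning rate, step count, batch size) are our choices, not the authors'.

```python
import numpy as np

# Illustrative sketch (ours, with assumed hyperparameters): SGD on toy
# realizable models with one-sample loss l_i(w) = (1/2) * (w**k * x_i)**2,
# so the theoretical loss is l(w) = (1/2) * w**(2k) and the critical point
# w* = 0 is non-degenerate for k = 1 and increasingly degenerate for k > 1.

rng = np.random.default_rng(0)

def run_sgd(k, steps=20_000, lr=1e-2, w0=1.0, batch=1):
    w = w0
    for _ in range(steps):
        xb = rng.normal(size=batch)
        # gradient of the minibatch loss: k * w**(2k - 1) * mean(x_i**2)
        w -= lr * k * w ** (2 * k - 1) * np.mean(xb ** 2)
    return w

for k in (1, 2, 3):
    print(f"k = {k}: distance to w* after 20k SGD steps = {abs(run_sgd(k)):.2e}")
# Qualitatively, the distance to the critical point shrinks far more slowly as
# k grows: more degenerate critical points are "stickier" for SGD.
```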