LW - Musings on LLM Scale (Jul 2024) by Vladimir Nesov

Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Musings on LLM Scale (Jul 2024), published by Vladimir Nesov on July 6, 2024 on LessWrong.

In a recent interview, Dario Amodei claimed that the cost of training is (starting with models already available): "Right now, $100 million. There are models in training today that are more like a $1 billion. I think if we go to $10 or a $100 billion, and I think that will happen in 2025-2026, maybe 2027, ..."

(Epistemic status: Fermi estimates, 8 is approximately 10 which is greater than 9.)

Assuming $40,000 per H100 and associated infrastructure in a datacenter, $1 billion gives 25K H100s, which matches the scale of, for example, Meta's new training clusters, and requires about 40MW of power. At $2 per hour, the training time cost of 25K H100s reaches $100 million in 80 days, which seems reasonable, if on the short side, for a production training run. The cost of time matches $1 billion at 2.3 years.

An H100 (SXM) is rated for 2e15 FLOP/s in BF16 (my impression is this is usually stable out of the box). This becomes 4e15 FLOP/s in FP8, which seems practical if done carefully, with no degradation in pre-training loss compared to FP32. The $100 million run then translates to 9e25 FLOPs at 30% utilization in BF16, or 2e26 FLOPs in FP8. (For some reason this SemiAnalysis estimate is 2x lower, a peak of 2e20 FLOP/s for 100,000 H100s at FP8; possibly the sparsity footnote in the H100 specification for the 4,000 teraFLOP/s figure is the culprit.) This is maybe 10x the original GPT-4, estimated at 2e25 FLOPs.

The leading models (Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-4 Omni) cost $15-20 per million output tokens, compared to $75-120 for the once-frontier models Claude 3 Opus, Gemini 1.0 Ultra, and the original GPT-4. Given a Chinchilla optimal model, if we reduce its active parameters 3x and increase training compute 3x, we get approximately the same performance, but it's now at least 3x cheaper for inference. Since training compute scales with the product of parameters and data, this increases data about 10x, which, if everything else fails, can be obtained by repeating the old data, giving 30x overtraining in compute compared to what is Chinchilla optimal for the smaller model. Llama-3-70b is overtrained 10x, Llama-3-8b 90x, though they don't use MoE and their performance is lower than for MoE models with the same active parameters and training cost.

Beyond $100 million

The current frontier models are overtrained on compute that could enable even smarter models. Compute is increasing, but it mostly goes to reducing inference cost, and only a little to capabilities. Why aren't any of the three labs directing the compute to train/release models optimized for maximum capability? Possibly costs are already such that training at too many parameter/data tradeoff points won't be done; instead, they choose the option that's currently most useful and spend the rest on experiments that would make imminent larger-scale runs better. Even OpenAI's next frontier model, in training as of May 28, might just be using compute comparable to what GPT-4 Omni required, not OOMs more, and it could still get much more capable if allowed to be more expensive for inference.

To do a run at $1 billion in cost of time, even 100K H100s would need 200 days (powered by 150MW). There probably aren't any individual clusters of this scale yet (such a cluster would cost about $4 billion).
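For concreteness, the Fermi arithmetic above can be reproduced in a few lines of Python. The inputs ($40,000 per installed H100, $2 per GPU-hour, 2e15/4e15 peak FLOP/s in BF16/FP8, 30% utilization) are the post's own assumptions, and the exact outputs differ slightly from the rounded figures quoted above.

```python
# Sketch of the post's Fermi estimates. All inputs are the post's assumptions,
# not measured figures; comments note where the post rounds differently.

COST_PER_H100_USD = 40_000   # H100 plus associated datacenter infrastructure
HOURLY_RATE_USD = 2.0        # assumed cost of one H100-hour
BF16_FLOPS = 2e15            # peak H100 (SXM) BF16 throughput per GPU
FP8_FLOPS = 4e15             # peak FP8 throughput per GPU
UTILIZATION = 0.30           # assumed compute utilization during training

# $1 billion of hardware at $40K per installed H100 -> ~25K GPUs.
gpus_for_1b = 1e9 / COST_PER_H100_USD
print(f"$1B of hardware: {gpus_for_1b:,.0f} H100s")

# Days for a 25K-H100 cluster to accumulate $100M in cost of time.
cluster = 25_000
days_100m = 1e8 / (cluster * HOURLY_RATE_USD * 24)
print(f"$100M of time on 25K H100s: {days_100m:.0f} days")  # ~83 (post rounds to 80)

# Total training FLOPs for that run at 30% utilization.
seconds = days_100m * 24 * 3600
print(f"BF16: {cluster * BF16_FLOPS * UTILIZATION * seconds:.1e} FLOPs")  # ~1e26 (post: 9e25)
print(f"FP8:  {cluster * FP8_FLOPS * UTILIZATION * seconds:.1e} FLOPs")   # ~2e26

# A $1 billion run in cost of time on a 100K-H100 cluster.
days_1b = 1e9 / (100_000 * HOURLY_RATE_USD * 24)
print(f"$1B of time on 100K H100s: {days_1b:.0f} days")  # ~208 (post: 200)

# Overtraining factor for the 3x-smaller, 3x-more-compute model: with
# compute C ~ 6*N*D, data grows ~9x, while the Chinchilla-optimal compute
# for the smaller model is ~C/9, so the factor is 3*C / (C/9) = 27 (~30x).
print(f"Overtraining factor: ~{3 / (1 / 9):.0f}x")
```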
The Gemini 1.0 report stated that "Training Gemini Ultra used a large fleet of TPUv4 accelerators owned by Google across multiple datacenters. ... we combine SuperPods in multiple datacenters using Google's intra-cluster and inter-cluster network. Google's network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods." This, together with Amodei's claim of current $1 billion training runs and individual 100K H100 clusters still getting built ...