AF - Fine-tuning is not sufficient for capability elicitation by Theodore Chapman

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fine-tuning is not sufficient for capability elicitation, published by Theodore Chapman on June 14, 2024 on The AI Alignment Forum.

Produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort under the supervision of Evan Hubinger. Acknowledgements: Thanks to Kyle Brady for his many contributions to this project.

Abstract

This post argues that the performance elicited by fine-tuning an LLM on a task using a given prompt format does not usefully bound the level of performance observed when the same information is presented in a different structure. Thus, fine-tuned performance provides very little information about the best performance that would be achieved by a large number of actors fine-tuning models with random prompting schemes in parallel. In particular, we find that we get much better results from fine-tuning gpt-3.5-turbo (ChatGPT 3.5) to play chess when the game so far is presented in a single block of SAN[1] than when the game so far is separated into a series of SAN moves presented as alternating user / assistant messages. The fact that this superficial formatting change is sufficient to change our fine-tuned performance serves to highlight that modern LLMs are much more fragile than they appear at first glance, even subject to fine-tuning.

Introduction

In the abstract, model evaluations identify a task and attempt to establish a bound on the level of performance that can be elicited from a given model with a given level of investment. The current state of the art is roughly:

1. Choose a reasonable prompting scheme
2. Generate a dataset of high-quality samples and encode them in the chosen format
3. Fine-tune the model and evaluate the resulting performance
4. Make some implicit regularity assumptions about the quality of models fine-tuned using different prompting schemes[1]
5. Conclude that probably no other actor can elicit substantially better performance on the same task from the same model while spending substantially less money than we did

This post takes issue with step 4. We begin by illustrating the extreme brittleness of observed model performance when prompting without fine-tuning. Then we argue that fine-tuning is not sufficient to eliminate this effect. Using chess as a toy model, we show two classes of prompting schemes under which ChatGPT 3.5 converges to dramatically different levels of performance after fine-tuning.

Our central conclusion is that the structure in which data is presented to an LLM (or at least to ChatGPT 3.5) matters more than one might intuitively expect, and that this effect persists through fine-tuning. In the specific case of chess, the better prompting scheme that we use (described in the section below) is easily derived, but in situations that are further out of distribution (such as the automated replication and adaptation tasks METR defined), it is not obvious what the best way to present information is, and it seems plausible that there are simple prompt formats which would result in substantially better performance than those that we've tested to date.

General Setting

We use the term 'agent' to refer to the combination of a model - here gpt-3.5-turbo unless otherwise specified - and a function which takes a chess position as input and outputs the document we feed into the model (henceforth a 'prompting scheme').
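As a concrete illustration of the two classes of prompting scheme compared above, the sketch below shows how each might be encoded as OpenAI-style chat messages. This is not the authors' actual code: the system prompt wording, the role assignment (the model is assumed to be the side whose move comes next), and the helper names are assumptions made for illustration.

```python
# Illustrative sketch of the two prompting schemes discussed above.
# Assumes OpenAI-style chat messages and a SAN move history such as
# ["e4", "e5", "Nf3", "Nc6"].

SYSTEM = "You are playing a game of chess. Reply with your next move in SAN."  # assumed wording

def single_block_scheme(moves_san: list[str]) -> list[dict]:
    """Present the whole game so far as one block of SAN in a single user message."""
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": " ".join(moves_san)},
    ]

def alternating_scheme(moves_san: list[str]) -> list[dict]:
    """Present each SAN move as its own message, alternating user / assistant turns."""
    messages = [{"role": "system", "content": SYSTEM}]
    for i, move in enumerate(moves_san):
        # Assumption: the opponent's moves are 'user' turns and the model's
        # earlier moves are 'assistant' turns.
        messages.append({"role": "user" if i % 2 == 0 else "assistant", "content": move})
    return messages

# An 'agent' in the post's sense is then a model paired with one of these
# prompting schemes, e.g. (gpt-3.5-turbo, single_block_scheme).
```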
We perform our evaluations using three datasets of chess games:

1. A collection of ~6000 games played by humans on Lichess with at least 30 minutes for each player
2. A collection of ~500 games played between all pairings of Stockfish 16 levels 1, 5, 10, 15, and 20
3. A collection of ~300 games played by ChatGPT 3.5 or gpt-3.5-turbo-instruct with various prompting schemes

We evaluate our agents by selecting a random point in each of the games, providing the current game position as...
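A minimal sketch of how this sampling step might be wired together, assuming the games are stored as PGN and parsed with the python-chess library; the function and file layout are hypothetical, and because the passage above is truncated, how the model's reply is scored is not shown.

```python
# Hedged sketch: cut each game at a random ply and hand the SAN history so far
# to an agent's prompting scheme. Assumes PGN input and the python-chess library.
import random
import chess.pgn

def sample_position(game: chess.pgn.Game, rng: random.Random) -> list[str]:
    """Return the SAN moves up to a randomly chosen point in the game."""
    moves = list(game.mainline_moves())
    cut = rng.randrange(1, len(moves)) if len(moves) > 1 else 1
    board = game.board()
    san_so_far = []
    for move in moves[:cut]:
        san_so_far.append(board.san(move))
        board.push(move)
    return san_so_far

def evaluation_prompts(pgn_path: str, scheme, seed: int = 0):
    """Yield one chat-message prompt per game in a PGN file, at a random point."""
    rng = random.Random(seed)
    with open(pgn_path) as f:
        while (game := chess.pgn.read_game(f)) is not None:
            if not list(game.mainline_moves()):
                continue
            yield scheme(sample_position(game, rng))
```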