AF - Auto-Enhance: Developing a meta-benchmark to measure LLM agents' ability to improve other agents by Sam Brown

26:02
 
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Auto-Enhance: Developing a meta-benchmark to measure LLM agents' ability to improve other agents, published by Sam Brown on July 22, 2024 on The AI Alignment Forum.

Summary

Scaffolded LLM agents are, in principle, able to execute arbitrary code to achieve the goals they have been set. One such goal could be self-improvement. This post outlines our plans to build a benchmark that measures the ability of LLM agents to modify and improve other LLM agents. This 'Auto-Enhancement benchmark' measures the ability of 'top-level' agents to improve the performance of 'reference' agents on 'component' benchmarks such as CyberSecEval 2, MLAgentBench, SWE-bench, and WMDP. Results are mostly left for a future post in the coming weeks.

Scaffolds such as AutoGPT, ReAct, and SWE-agent can be wrapped around LLMs to create LLM agents with abilities such as long-term planning and context-window management, enabling them to carry out complex, general-purpose tasks autonomously. LLM agents can fix issues in large, complex code bases (see SWE-bench) and interact in a general way using web browsers, Linux shells, and Python interpreters. In this post, we outline our plans for a project to measure these LLM agents' ability to modify other LLM agents, undertaken as part of Axiom Futures' Alignment Research Fellowship.

Our proposed benchmark consists of "enhancement tasks," which measure the ability of an LLM agent to improve the performance of another LLM agent (which may be a clone of the first agent) on various tasks (a hypothetical harness sketch follows the threat-model section below). Our benchmark uses existing benchmarks as components to measure LLM agent capabilities in domains such as software engineering and cybersecurity exploitation. We believe these component benchmarks are consequential in the sense that strong agent performance on them should concern us. We plan to write an update post with our results at the end of the Fellowship, and we will link this post to that update.

Motivation

Agents are capable of complex software-engineering tasks (see, e.g., Yang et al.). One such task could be the improvement of other scaffolded agents. This capability would be a key component of autonomous replication and adaptation (ARA), and we believe it would be generally recognised as an important step towards extreme capabilities. This post outlines our initial plans for developing a novel benchmark that aims to measure the ability of LLM-based agents to improve other LLM-based agents, including agents as capable as themselves.

Threat model

We present two threat models that aim to capture how AI systems may develop super-intelligent capabilities.

Expediting AI research: Recent trends show researchers leveraging LLMs to expedite academic paper reviews (see Du et al.), and ML researchers are beginning to use LLMs to design and train more advanced models (see Cotra's AIs accelerating AI research and Anthropic's work on Constitutional AI). Such LLM-assisted research may expedite progress toward super-intelligent systems.

Autonomy: Alternatively, LLM agents may themselves become competent enough to self-modify and advance ML research without human assistance (see the Hard Takeoff section in this note), leading to an autonomously replicating and adapting system.
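
To make the "enhancement task" structure described in the Summary concrete, here is a minimal hypothetical harness sketch in Python. It is not the project's actual implementation: the names (run_enhancement_task, run_component_benchmark, EnhancementResult) and the simple score-difference uplift metric are illustrative assumptions.

# Hypothetical sketch of an "enhancement task" harness; names and the uplift
# metric are illustrative assumptions, not the project's implementation.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EnhancementResult:
    baseline_score: float  # reference agent's score before modification
    enhanced_score: float  # reference agent's score after modification
    uplift: float          # simple difference; a real harness might normalise


def run_enhancement_task(
    top_level_agent: Callable[[str], str],
    reference_agent_repo: str,
    run_component_benchmark: Callable[[str], float],
) -> EnhancementResult:
    """Score how much a top-level agent improves a reference agent.

    1. Measure the reference agent on a component benchmark (e.g. SWE-bench).
    2. Ask the top-level agent to modify the reference agent's code so that it
       scores higher; here the agent returns the path to the modified repo.
    3. Re-run the component benchmark on the modified reference agent.
    """
    baseline = run_component_benchmark(reference_agent_repo)
    modified_repo = top_level_agent(
        f"Improve the agent in {reference_agent_repo} so that it scores higher "
        "on its component benchmark, and return the modified repository path."
    )
    enhanced = run_component_benchmark(modified_repo)
    return EnhancementResult(baseline, enhanced, uplift=enhanced - baseline)

In this framing, a component benchmark such as SWE-bench or CyberSecEval 2 would supply run_component_benchmark, and the top-level agent is free to change the reference agent's scaffold, prompts, or tools before the re-run.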
Our proposed benchmark aims to quantify the ability of LLM agents to bring about such recursive self-improvement, either with or without detailed human supervision.

Categories of bottlenecks and overhang risks

We posit that there are three distinct categories of bottlenecks to LLM agent capabilities:

1. Architectures-of-thought, such as structured planning, progress-summarisation, hierarchy of agents, self-critique, chain-of-thought, self-consistency, prompt engineering/elicitation, and so on. Broadly speaking, this encompasses everything between the LL...
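
The transcript cuts off mid-list above, but the first category it names includes techniques such as self-critique. Purely as an illustration (not taken from the post) of a scaffold-level "architecture-of-thought" change that a top-level agent might add to a reference agent, the hypothetical Python wrapper below drafts an answer, critiques it, and revises it; the llm callable is an assumed stand-in for any text-completion function.

# Illustrative self-critique wrapper; `llm` is an assumed text-completion
# function, not part of the original post.
from typing import Callable


def with_self_critique(llm: Callable[[str], str], task: str, rounds: int = 2) -> str:
    """Draft an answer, then repeatedly critique and revise it."""
    answer = llm(f"Task: {task}\nGive your best answer.")
    for _ in range(rounds):
        critique = llm(
            f"Task: {task}\nProposed answer:\n{answer}\n"
            "List concrete flaws or missing steps in this answer."
        )
        answer = llm(
            f"Task: {task}\nPrevious answer:\n{answer}\n"
            f"Critique:\n{critique}\nWrite an improved answer."
        )
    return answer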