AF - A framework for thinking about AI power-seeking by Joe Carlsmith

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A framework for thinking about AI power-seeking, published by Joe Carlsmith on July 24, 2024 on The AI Alignment Forum.

This post lays out a framework I'm currently using for thinking about when AI systems will seek power in problematic ways. I think this framework adds useful structure to the too-often-left-amorphous "instrumental convergence thesis," and that it helps us recast the classic argument for existential risk from misaligned AI in a revealing way. In particular, I suggest, this recasting highlights how much classic analyses of AI risk load on the assumption that the AIs in question are powerful enough to take over the world very easily, via a wide variety of paths. If we relax this assumption, I suggest, the strategic trade-offs that an AI faces, in choosing whether or not to engage in some form of problematic power-seeking, become substantially more complex.

Prerequisites for rational takeover-seeking

For simplicity, I'll focus here on the most extreme type of problematic AI power-seeking - namely, an AI or set of AIs actively trying to take over the world ("takeover-seeking"). But the framework I outline will generally apply to other, more moderate forms of problematic power-seeking as well - e.g., interfering with shut-down, interfering with goal-modification, seeking to self-exfiltrate, seeking to self-improve, more moderate forms of resource/control-seeking, deceiving/manipulating humans, acting to support some other AI's problematic power-seeking, etc.[2] Just substitute in one of those forms of power-seeking for "takeover" in what follows.

I'm going to assume that in order to count as "trying to take over the world," or to participate in a takeover, an AI system needs to be actively choosing a plan partly in virtue of predicting that this plan will conduce towards takeover.[3] And I'm also going to assume that this is a rational choice from the AI's perspective.[4] This means that the AI's attempt at takeover-seeking needs to have, from the AI's perspective, at least some realistic chance of success - and I'll assume, as well, that this perspective is at least decently well-calibrated. We can relax these assumptions if we'd like - but I think that the paradigmatic concern about AI power-seeking should be happy to grant them.

What's required for this kind of rational takeover-seeking? I think about the prerequisites in three categories:

Agential prerequisites - that is, necessary structural features of an AI's capacity for planning in pursuit of goals.
Goal-content prerequisites - that is, necessary structural features of an AI's motivational system.
Takeover-favoring incentives - that is, the AI's overall incentives and constraints combining to make takeover-seeking rational.

Let's look at each in turn.

Agential prerequisites

In order to be the type of system that might engage in successful forms of takeover-seeking, an AI needs to have the following properties:

1. Agentic planning capability: the AI needs to be capable of searching over plans for achieving outcomes, choosing between them on the basis of criteria, and executing them.
2. Planning-driven behavior: the AI's behavior, in this specific case, needs to be driven by a process of agentic planning.
   1. Note that this isn't guaranteed by agentic planning capability.
      1. For example, an LLM might be capable of generating effective plans, in the sense that that capability exists somewhere in the model, but it could nevertheless be the case that its output isn't driven by a planning process in a given case - i.e., it's not choosing its text output via a process of predicting the consequences of that text output, thinking about how much it prefers those consequences to other consequences, etc. (A minimal code sketch of this distinction follows after the excerpt.)
   2. And note that human behavior isn't always driven by a process of agentic planning, either, despite our ...
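To make the distinction between (1) agentic planning capability and (2) planning-driven behavior concrete, here is a minimal, purely illustrative sketch, not from Carlsmith's post, of what "searching over plans, choosing between them on the basis of criteria, and executing them" might look like. All names here (Plan, predict_outcome, score_outcome, choose_plan) are hypothetical and introduced only for illustration.

```python
# Purely illustrative sketch (not from the original post) of "agentic planning":
# generate candidate plans, predict each plan's outcome, score the predicted
# outcomes against some criteria, and act on the highest-scoring plan.
# All names here (Plan, predict_outcome, score_outcome) are hypothetical.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Plan:
    description: str
    actions: List[str]


def choose_plan(
    candidates: List[Plan],
    predict_outcome: Callable[[Plan], Dict[str, float]],
    score_outcome: Callable[[Dict[str, float]], float],
) -> Plan:
    """Select the plan whose *predicted* outcome scores highest on the agent's criteria.

    Behavior is "planning-driven" in the post's sense only when the system's actual
    output is produced by something like this loop: the choice depends on predicted
    consequences and on preferences over those consequences.
    """
    return max(candidates, key=lambda plan: score_outcome(predict_outcome(plan)))
```

On this reading, an LLM might be able to carry out something like this predict-and-score loop when prompted to (the capability exists "somewhere in the model"), while its ordinary output in a given case is not actually produced by such a loop; that gap is the difference between agentic planning capability and planning-driven behavior.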