AF - Coalitional agency by Richard Ngo

11:23
 
Share
 

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Coalitional agency, published by Richard Ngo on July 22, 2024 on The AI Alignment Forum.

The coalitional frame

Earlier in this sequence I laid out an argument that the goals of increasingly intelligent AIs will become increasingly systematized, until they converge to squiggle-maximization. In my last post, though, I touched on two reasons why this convergence might not happen: humans trying to prevent it, and AIs themselves trying to prevent it. I don't have too much more to say about the former, but it's worth elaborating on the latter.

The best way to understand the deliberate protection of existing goals is in terms of Bostrom's notion of instrumental convergence. Bostrom argues that goal preservation will be a convergent instrumental strategy for a wide range of agents. Perhaps it's occasionally instrumentally useful to change your goals - but once you've done so, you'll never want to course-correct back towards your old goals. So this is a strong reason to be conservative about your goals, and avoid changes where possible.

One immediate problem with preserving goals, though: it requires that agents continue thinking in terms of the same concepts. But in general, an agent's concepts will change significantly as they learn more about the world. For example, consider a medieval theist whose highest-priority goal is ensuring that their soul goes to heaven not hell. Upon becoming smarter, they realize that none of souls, heaven, or hell exist. The sensible thing to do here would be to either discard the goal, or else identify a more reasonable adaptation of it (e.g. the goal of avoiding torture while alive). But if their goals were totally fixed, then their actions would be determined by a series of increasingly convoluted hypotheticals where god did exist after all. (Or to put it another way: continuing to represent their old goal would require recreating a lot of their old ontology.) This would incur a strong systematicity penalty. So while we should expect agents to have some degree of conservatism, they'll likely also have some degree of systematization.

How can we reason about the tradeoff between conservatism and systematization? The approach which seems most natural to me makes three assumptions:

1. We can treat agents' goals as subagents optimizing for their own interests in a situationally-aware way. (E.g. goals have a sense of self-preservation.)
2. Agents have a meta-level goal of systematizing their goals; bargaining between this goal and object-level goals shapes the evolution of the object-level goals.
3. External incentives are not a dominant factor governing the agent's behavior, but do affect the bargaining positions of different goals, and how easily they're able to make binding agreements.

I call the combination of these three assumptions the coalitional frame. The coalitional frame gives a picture of agents whose goals do evolve over time, but in a way which is highly sensitive to initial conditions - unlike squiggle-maximizers, who always converge to similar (from our perspective) goals. For coalitional agents, even "dumb" subagents might maintain significant influence as other subagents become highly intelligent, because they were able to lock in that influence earlier (just as my childhood goals exert a nontrivial influence over my current behavior).

The assumptions I've laid out above are by no means obvious.
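Before turning to objections, here is a purely illustrative toy sketch (not from the original post) of one crude way these assumptions could be cashed out: goals are modeled as subagents that score candidate actions, and the agent picks whatever maximizes a weighted sum of those scores, with the bargaining weights locked in early. Every name, score, and weight below is hypothetical.

```python
# Toy model of the coalitional frame (illustrative sketch only, not from the post).
# Assumption: each goal is a subagent that scores candidate actions, and the agent
# picks the action maximizing a weighted sum of scores. The weights were fixed by
# earlier bargaining and persist, so a "dumb" early goal retains influence even
# once a more systematized goal dominates on raw capability.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subagent:
    name: str
    score: Callable[[str], float]  # how much this goal likes each candidate action


def coalitional_choice(subagents: List[Subagent],
                       weights: Dict[str, float],
                       actions: List[str]) -> str:
    """Return the action with the highest weighted sum of subagent scores."""
    def total(action: str) -> float:
        return sum(weights[s.name] * s.score(action) for s in subagents)
    return max(actions, key=total)


# Hypothetical subagents: an early "comfort" goal and a later, more systematized
# "maximize" goal. The scores and weights are made up purely for illustration.
comfort = Subagent("comfort", lambda a: {"rest": 1.0, "work": 0.7, "squiggle": 0.0}[a])
maximize = Subagent("maximize", lambda a: {"rest": 0.0, "work": 0.8, "squiggle": 1.0}[a])

locked_in_weights = {"comfort": 0.5, "maximize": 0.5}  # fixed by earlier bargaining
actions = ["rest", "work", "squiggle"]

print(coalitional_choice([comfort, maximize], locked_in_weights, actions))  # -> "work"
# A pure maximizer (weight 1.0 on "maximize") would instead pick "squiggle".
```

The point of the toy is only that the outcome depends on the locked-in weights rather than on which subagent is most capable; nothing above is meant as a claim about how bargaining between goals actually works.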
I won't defend them fully here, since the coalitional frame is still fairly nascent in my mind, but I'll quickly go over some of the most obvious objections to each of them:

Premise 1 assumes that subagents will have situational awareness. The idea of AIs themselves having situational awareness is under debate, so it's even more speculative to think about subagents having situational awareness. But it's hard to describe the dynamics of in...