Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.

AF - Instruction-following AGI is easier and more likely than value aligned AGI by Seth Herd

22:34
 
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Instruction-following AGI is easier and more likely than value aligned AGI, published by Seth Herd on May 15, 2024 on The AI Alignment Forum.
Summary: We think a lot about aligning AGI with human values. I think it's more likely that we'll try to make the first AGIs do something else. This might intuitively be described as trying to make instruction-following (IF) or do-what-I-mean-and-check (DWIMAC) be the central goal of the AGI we design. Adopting this goal target seems to improve the odds of success of any technical alignment approach. This goal target avoids the hard problem of specifying human values in an adequately precise and stable way, and substantially helps with goal misspecification and deception by allowing one to treat the AGI as a collaborator in keeping it aligned as it becomes smarter and takes on more complex tasks.
This is similar to, but distinct from, the goal targets of prosaic alignment efforts. Instruction-following is a single goal target that is more likely to be reflexively stable in a full AGI with explicit goals and self-directed learning. It is counterintuitive and concerning to imagine superintelligent AGI that "wants" only to follow the instructions of a human; but on analysis, this approach seems both more appealing and more workable than the alternative of creating sovereign AGI with human values.
Instruction-following AGI could actually work, particularly in the short term. And it seems likely to be tried, even if it won't work. So it probably deserves more thought.
Overview/Intuition
How to use instruction-following AGI as a collaborator in alignment:
- Instruct the AGI to tell you the truth
- Investigate its understanding of itself and "the truth"; use interpretability methods
- Instruct it to check before doing anything consequential
- Instruct it to use a variety of internal reviews to predict consequences
- Ask it a bunch of questions about how it would interpret various commands
- Repeat all of the above as it gets smarter
- Frequently ask it for advice and about how its alignment could go wrong
Now, this won't work if the AGI won't even try to fulfill your wishes. In that case you totally screwed up your technical alignment approach. But if it will even sort of do what you want, and it at least sort of understands what you mean by "tell the truth", you're in business. You can leverage partial alignment into full alignment - if you're careful enough, and the AGI gets smarter slowly enough.
It's looking like the critical risk period is probably going to involve AGI on a relatively slow takeoff toward superintelligence. Being able to ask questions and give instructions, and even retrain or re-engineer the system, is much more useful if you're guiding the AGI's creation and development, not just "making wishes" as we've thought about AGI goals in fast takeoff scenarios.
Instruction-following is safer than value alignment in a slow takeoff
Instruction-following with verification or DWIMAC seems both intuitively and analytically appealing compared to more commonly discussed[1] alignment targets.[2] This is my pitch for why it should be discussed more. It doesn't require solving ethics to safely launch AGI, and it includes most of the advantages of corrigibility,[3] including stopping on command. Thus, it substantially mitigates (although doesn't outright solve) some central difficulties of alignment: goal misspecification (including not knowing what values to give it as goals) and alignment stability over reflection and continuous learning.
This approach makes one major difficulty worse: humans remaining in control, including power struggles and other foolishness. I think the most likely scenario is that we succeed at technical alignment but fail at societal alignment. But I think there is a path to a vibrant future if we limit AGI proliferation, ...

396 episodes
