AF - AXRP Episode 33 - RLHF Problems with Scott Emmons by DanielFilan

The Nonlinear Library: Alignment Forum

Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.

2M ago 1:21:54

MP3•Episode home

Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AXRP Episode 33 - RLHF Problems with Scott Emmons, published by DanielFilan on June 12, 2024 on The AI Alignment Forum. YouTube link Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting. Topics we discuss: Deceptive inflation Overjustification Bounded human rationality Avoiding these problems Dimensional analysis RLHF problems, in theory and practice Scott's research program Following Scott's research Daniel Filan: Hello, everybody. In this episode I'll be speaking with Scott Emmons. Scott is a PhD student at UC Berkeley, working with the Center for Human-Compatible AI on AI safety research. He's previously co-founded far.ai, which is an AI safety non-profit. For links to what we're discussing, you can check the description of the episode, and for a transcript you can read it at axrp.net. Well, welcome to AXRP. Scott Emmons: Great to be here. Deceptive inflation Daniel Filan: Sure. So today we're talking about your paper, When Your AIs Deceive You: Challenges With Partial Observability of Human Evaluators in Reward Learning, by Leon Lang, Davis Foote, Stuart Russell, Erik Jenner, and yourself. Can you just tell us roughly what's going on with this paper? Scott Emmons: Yeah, I could start with the motivation of the paper. Daniel Filan: Yeah, sure. Scott Emmons: We've had a lot of speculation in the x-risk community about issues like deception. So people have been worried about what happens if your AIs try to deceive you. And at the same time, I think for a while that's been a theoretical, a philosophical concern. And I use "speculation" here in a positive way. I think people have done really awesome speculation about how the future of AI is going to play out, and what those risks are going to be. And deception has emerged as one of the key things that people are worried about. I think at the same time, we're seeing AI systems actually deployed, and we're seeing a growing interest of people in what exactly do these risks look like, and how do they play out in current-day systems? So the goal of this paper is to say: how might deception play out with actual systems that we have deployed today? And reinforcement learning from human feedback [RLHF] is one of the main mechanisms that's currently being used to fine-tune models, that's used by ChatGPT, it's used by Llama, variants of it are used by Anthropic. So what this paper is trying to do is it's trying to say, "Can we mathematically pin down, in a precise way, how might these failure modes we've been speculating about play out in RLHF?" Daniel Filan: So in the paper, the two concepts you talk about on this front are I think "deceptive inflation" and "overjustification". So maybe let's start with deceptive inflation. What is deceptive inflation? Scott Emmons: I can give you an example. I think examples from me as a child I find really helpful in terms of thinking about this. So when I was a child, my parents asked me to clean the house, and I didn't care about cleaning the house. I just wanted to go play. So there's a misalignment between my objective and the objective my parents had for me. And in this paper, the main failure cases that we're studying are cases of misalignment. So we're saying: when there is misalignment, how does that play out? How does that play out in the failure modes? So [with] me as a misaligned child, one strategy I would have for cleaning the house would be just to sweep any dirt or any debris under the furniture. So I'm cleaning the house, I just sweep some debris...

393 episodes

#The Nonlinear Fund #Podcasting Education #Of TexttoSpeech