Sleeper Agents | Evan Hubinger | EA Global Bay Area: 2024
If an AI system learned a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? That's the question that Evan and his coauthors at Anthropic sought to answer in their work on "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", which Evan will be discussing.
Evan Hubinger leads the new Alignment Stress-Testing team at Anthropic, which is tasked with red-teaming Anthropic's internal alignment techniques and evaluations. Prior to joining Anthropic, Evan was a Research Fellow at the Machine Intelligence Research Institute, where he worked on a range of theoretical alignment research, including "Risks from Learned Optimization in Advanced Machine Learning Systems". Evan will be talking about the Alignment Stress-Testing team's first paper, "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training".
Watch on YouTube: https://www.youtube.com/watch?v=BgfT0AcosHw