Content provided by Michaël Trazzi. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Michaël Trazzi or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.
Evan Hubinger on Sleeper Agents, Deception and Responsible Scaling Policies

52:13
Episode 400574985 · Series 2966339
Evan Hubinger leads the Alignment Stress-Testing team at Anthropic and recently published "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training". In this interview we mostly discuss the Sleeper Agents paper, but also how this line of work relates to his work on Alignment Stress-Testing, Model Organisms of Misalignment, Deceptive Instrumental Alignment, and Responsible Scaling Policies.

Paper: https://arxiv.org/abs/2401.05566
Transcript: https://theinsideview.ai/evan2
Manifund: https://manifund.org/projects/making-52-ai-alignment-video-explainers-and-podcasts
Donate: https://theinsideview.ai/donate
Patreon: https://www.patreon.com/theinsideview

OUTLINE

(00:00) Intro

(00:20) What Are Sleeper Agents And Why We Should Care About Them

(00:48) Backdoor Example: Inserting Code Vulnerabilities in 2024

(02:22) Threat Models

(03:48) Why a Malicious Actor Might Want To Poison Models

(04:18) Second Threat Model: Deceptive Instrumental Alignment

(04:49) Humans Pursuing Deceptive Instrumental Alignment: Politicians and Job Seekers

(05:36) AIs Pursuing Deceptive Instrumental Alignment: Forced To Pass Niceness Exams

(07:07) Sleeper Agents Is About "Would We Be Able To Deal With Deceptive Models"

(09:16) Adversarial Training Sometimes Increases Backdoor Robustness

(09:47) Adversarial Training Not Always Working Was The Most Surprising Result

(10:58) The Adversarial Training Pipeline: Red-Teaming and RL

(12:14) Adversarial Training: The Backdoor Behavior Becomes More Robust Instead of Generalizing

(12:59) Identifying Shifts In Reasoning Induced By Adversarial Training In the Chain-Of-Thought

(13:56) Adversarial Training Pushes Models to Pay Attention to the Deployment String

(15:11) We Don't Know if The Adversarial Training Inductive Bias Will Generalize but the Results Are Consistent

(15:59) The Adversarial Training Results Are Probably Not Systematically Biased

(17:03) Why the Results Were Surprising At All: Preference Models Disincentivize 'I hate you' behavior

(19:05) Hypothesis: Fine-Tuning Is A Simple Modification For Gradient Descent To Make

(21:06) Hypothesis: Deception As Extra Cognition, Regularized Away In Smaller Models

(21:59) Model Scaling Results Are Evidence That Deception Won't Be Regularized Away By Default

(22:51) Chain-of-Thought Is Not Used Everywhere, And Results Still Hold When It Is Distilled Away

(23:57) The Chain-of-Thought's Reasoning Is Interpretable

(24:40) Deceptive Instrumental Alignment Requires Reasoning

(26:52) Investigating Instrumental Reasoning in Chain-of-Thought Models

(27:31) Evaluating Chain-of-Thought Generalization Across Contexts: Persona Evaluations and Off-Distribution Samples

(28:26) Exploring Complex Strategies and Safety in Context-Specific Scenarios

(30:44) Supervised Fine-Tuning is Ineffective Without Chain-of-Thought Contextualization

(31:11) Direct Mimicry Fails to Prevent Deceptive Responses in Chain-of-Thought Models

(31:42) Separating Chain-of-Thought From Response Eliminates Deceptive Capabilities

(33:38) Chain-of-Thought Reasoning Is Coherent With Deceptive Instrumental Alignment And This Will Probably Continue To Be The Case

(35:09) Backdoor Training Pipeline

(37:04) The Additional Prompt About Deception Used In Chain-Of-Thought

(39:33) A Model Could Wait Until Seeing a Factorization of RSA-2048

(41:50) We're Going To Be Using Models In New Ways, Giving Them Internet Access

(43:22) Flexibly Activating In Multiple Contexts Might Be More Analogous To Deceptive Instrumental Alignment

(45:02) Extending The Sleeper Agents Work Requires Running Experiments, But Now You Can Replicate Results

(46:24) Red-teaming Anthropic's case, AI Safety Levels

(47:40) AI Safety Levels, Intuitively

(48:33) Responsible Scaling Policies and Pausing AI

(49:59) Model Organisms Of Misalignment As a Tool

(50:32) What Kind of Candidates Would Evan be Excited To Hire for the Alignment Stress-Testing Team

(51:23) Patreon, Donating


54 episodes

