Artwork

Content provided by LessWrong. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by LessWrong or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.
Player FM - Podcast App
Go offline with the Player FM app!

“Sycophancy to subterfuge: Investigating reward tampering in large language models” by evhub, Carson Denison

15:37
 
Share
 

Manage episode 424570959 series 3364760
Content provided by LessWrong. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by LessWrong or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.This is a link post.New Anthropic model organisms research paper led by Carson Denison from the Alignment Stress-Testing Team demonstrating that large language models can generalize zero-shot from simple reward-hacks (sycophancy) to more complex reward tampering (subterfuge). Our results suggest that accidentally incentivizing simple reward-hacks such as sycophancy can have dramatic and very difficult to reverse consequences for how models generalize, up to and including generalization to editing their own reward functions and covering up their tracks when doing so.
Abstract:
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too [...]
---
First published:
June 17th, 2024
Source:
https://www.lesswrong.com/posts/FSgGBjDiaCdWxNBhj/sycophancy-to-subterfuge-investigating-reward-tampering-in
---
Narrated by TYPE III AUDIO.
  continue reading

300 episodes

Artwork
iconShare
 
Manage episode 424570959 series 3364760
Content provided by LessWrong. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by LessWrong or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.This is a link post.New Anthropic model organisms research paper led by Carson Denison from the Alignment Stress-Testing Team demonstrating that large language models can generalize zero-shot from simple reward-hacks (sycophancy) to more complex reward tampering (subterfuge). Our results suggest that accidentally incentivizing simple reward-hacks such as sycophancy can have dramatic and very difficult to reverse consequences for how models generalize, up to and including generalization to editing their own reward functions and covering up their tracks when doing so.
Abstract:
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too [...]
---
First published:
June 17th, 2024
Source:
https://www.lesswrong.com/posts/FSgGBjDiaCdWxNBhj/sycophancy-to-subterfuge-investigating-reward-tampering-in
---
Narrated by TYPE III AUDIO.
  continue reading

300 episodes

All episodes

×
 
Loading …

Welcome to Player FM!

Player FM is scanning the web for high-quality podcasts for you to enjoy right now. It's the best podcast app and works on Android, iPhone, and the web. Signup to sync subscriptions across devices.

 

Quick Reference Guide