Erik Jones on Automatically Auditing Large Language Models

22:36
 

Erik is a PhD student at Berkeley working with Jacob Steinhardt. He is interested in making generative machine learning systems more robust, reliable, and aligned, with a focus on large language models. In this interview we talk about his paper "Automatically Auditing Large Language Models via Discrete Optimization", which he presented at ICML.

Youtube: https://youtu.be/bhE5Zs3Y1n8

Paper: https://arxiv.org/abs/2303.04381

Erik: https://twitter.com/ErikJones313

Host: https://twitter.com/MichaelTrazzi

Patreon: https://www.patreon.com/theinsideview

Outline

00:00 Highlights

00:31 Erik's background and research at Berkeley

01:19 Motivation for doing safety research on language models

02:56 Is it too easy to fool today's language models?

03:31 The goal of adversarial attacks on language models

04:57 Automatically Auditing Large Language Models via Discrete Optimization

06:01 Optimizing over a finite set of tokens rather than continuous embeddings

06:44 Goal is revealing behaviors, not necessarily breaking the AI

07:51 On the feasibility of solving adversarial attacks

09:18 Suppressing dangerous knowledge vs just bypassing safety filters

10:35 Can you really ask a language model to cook meth?

11:48 Optimizing a French-to-English translation example

13:07 Forcing toxic celebrity outputs just to test rare behaviors

13:19 Testing the method on GPT-2 and GPT-J

14:03 Adversarial prompts transferred to GPT-3 as well

14:39 How this auditing research fits into the broader AI safety field

15:49 Need for automated tools to audit failures beyond what humans can find

17:47 Auditing to avoid unsafe deployments, not for existential risk reduction

18:41 Adaptive auditing that updates based on the model's outputs

19:54 Prospects for using these methods to detect model deception

22:26 Preferring safety via alignment over auditing constraints alone; closing thoughts

Patreon supporters:

  • Tassilo Neubauer
  • MonikerEpsilon
  • Alexey Malafeev
  • Jack Seroy
  • JJ Hepburn
  • Max Chiswick
  • William Freire
  • Edward Huff
  • Gunnar Höglund
  • Ryan Coppolo
  • Cameron Holmes
  • Emil Wallner
  • Jesse Hoogland
  • Jacques Thibodeau
  • Vincent Weisser