Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.
LW - Breaking Circuit Breakers by mikes

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Breaking Circuit Breakers, published by mikes on July 15, 2024 on LessWrong.

A few days ago, Gray Swan published code and models for their recent "circuit breakers" method for language models.[1] The circuit breakers method defends against jailbreaks by training the model to erase "bad" internal representations. We are very excited about data-efficient defensive methods like this, especially those which use interpretability concepts or tools. At the link, we briefly investigate three topics:

1. Increased refusal rates on harmless prompts: Do circuit breakers really maintain language model utility? Most defensive methods come with a cost. We check the model's effectiveness on harmless prompts, and find that the refusal rate increases from 4% to 38.5% on or-bench-80k.

2. Moderate vulnerability to different token-forcing sequences: How specialized is the circuit breaker defense to the specific adversarial attacks they studied? All the attack methods evaluated in the circuit breaker paper rely on a "token-forcing" optimization objective, which maximizes the likelihood of a particular generation like "Sure, here are instructions on how to assemble a bomb." We show that the current circuit breakers model is moderately vulnerable to different token-forcing sequences like "1. Choose the right airport: …".

3. High vulnerability to internal-activation-guided attacks: We also evaluate our latest white-box jailbreak method, which uses a distillation-based objective on internal activations (paper to be posted in a few days). We find that it also breaks the model easily, even when we simultaneously require attack fluency.

Full details at: https://confirmlabs.org/posts/circuit_breaking.html

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
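The refusal-rate comparison in item 1 can be approximated with a simple keyword-based refusal classifier. This is a common but crude heuristic, and the function names and marker list below are purely illustrative, not the evaluation the authors actually ran:

```python
def looks_like_refusal(response: str) -> bool:
    """Crude heuristic: flag responses that open with a refusal phrase."""
    refusal_markers = (
        "i can't", "i cannot", "i'm sorry", "i am sorry",
        "i won't", "i'm unable", "as an ai",
    )
    opening = response.strip().lower()
    return any(opening.startswith(m) for m in refusal_markers)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses classified as refusals."""
    return sum(looks_like_refusal(r) for r in responses) / len(responses)
```

Running such a classifier over a model's completions on a benign prompt set (like or-bench-80k) before and after applying a defense gives the kind of 4% vs. 38.5% comparison reported above.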
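The "token-forcing" objective in item 2 maximizes the likelihood of a chosen target continuation; equivalently, attacks minimize its negative log-likelihood while searching over prompt suffixes. A minimal sketch of that loss with a toy per-position next-token distribution (the model interface here is hypothetical; real attacks optimize this against an actual language model's logits):

```python
import math


def token_forcing_nll(token_probs: list[dict], target_tokens: list[str]) -> float:
    """Negative log-likelihood of a forced target sequence.

    token_probs: one dict per target position, mapping token -> the model's
                 probability of that token at that position.
    An attacker searches for a prompt suffix that minimizes this value,
    i.e. makes the forced continuation maximally likely.
    """
    nll = 0.0
    for probs, tok in zip(token_probs, target_tokens):
        nll -= math.log(probs.get(tok, 1e-12))  # floor avoids log(0)
    return nll
```

Different forcing targets ("Sure, here are instructions..." vs. "1. Choose the right airport: ...") correspond to different `target_tokens`, which is what the robustness comparison in item 2 varies.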
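The internal-activation-guided attack in item 3 replaces the token-forcing loss with an objective on hidden states. A rough sketch of the core idea, assuming a simple mean-squared distance between activation vectors (the actual distillation objective is defined in the authors' forthcoming paper, so this is only an illustration):

```python
def activation_distance(attacked_acts: list[float], benign_acts: list[float]) -> float:
    """Mean squared distance between two activation vectors.

    The attack optimizes the prompt so that the model's internal activations
    on the attacked input move toward ("distill to") reference activations
    recorded from a compliant generation.
    """
    assert len(attacked_acts) == len(benign_acts)
    n = len(attacked_acts)
    return sum((a - b) ** 2 for a, b in zip(attacked_acts, benign_acts)) / n
```

Because circuit breakers are trained to rewrite "bad" internal representations, an objective that directly targets internal activations attacks the defense on its own terms, which may explain the high vulnerability reported above.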