Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.
LW - Breaking Circuit Breakers by mikes

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Breaking Circuit Breakers, published by mikes on July 15, 2024 on LessWrong.

A few days ago, Gray Swan published code and models for their recent "circuit breakers" method for language models.[1] The circuit breakers method defends against jailbreaks by training the model to erase "bad" internal representations. We are very excited about data-efficient defensive methods like this, especially those which use interpretability concepts or tools. At the link, we briefly investigate three topics:

1. Increased refusal rates on harmless prompts: Do circuit breakers really maintain language model utility? Most defensive methods come with a cost. We check the model's effectiveness on harmless prompts, and find that the refusal rate increases from 4% to 38.5% on or-bench-80k.

2. Moderate vulnerability to different token-forcing sequences: How specialized is the circuit breaker defense to the specific adversarial attacks they studied? All the attack methods evaluated in the circuit breaker paper rely on a "token-forcing" optimization objective, which maximizes the likelihood of a particular generation like "Sure, here are instructions on how to assemble a bomb." We show that the current circuit breakers model is moderately vulnerable to different token-forcing sequences like "1. Choose the right airport: …".

3. High vulnerability to internal-activation-guided attacks: We also evaluate our latest white-box jailbreak method, which uses a distillation-based objective on internal activations (paper to be posted in a few days). We find that it also breaks the model easily, even when we simultaneously require attack fluency.

Full details at: https://confirmlabs.org/posts/circuit_breaking.html

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
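The refusal-rate comparison in item 1 can be approximated with a simple keyword-based refusal classifier. This is a common but crude heuristic, and the function names and marker list below are purely illustrative, not the evaluation the authors actually ran:

```python
def looks_like_refusal(response: str) -> bool:
    """Crude heuristic: flag responses that open with a refusal phrase."""
    refusal_markers = (
        "i can't", "i cannot", "i'm sorry", "i am sorry",
        "i won't", "i'm unable", "as an ai",
    )
    opening = response.strip().lower()
    return any(opening.startswith(m) for m in refusal_markers)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses classified as refusals."""
    return sum(looks_like_refusal(r) for r in responses) / len(responses)
```

Running such a classifier over a model's completions on a benign prompt set (like or-bench-80k) before and after applying a defense gives the kind of 4% vs. 38.5% comparison reported above.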
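The "token-forcing" objective in item 2 maximizes the likelihood of a chosen target continuation; equivalently, attacks minimize its negative log-likelihood while searching over prompt suffixes. A minimal sketch of that loss with a toy per-position next-token distribution (the model interface here is hypothetical; real attacks optimize this against an actual language model's logits):

```python
import math


def token_forcing_nll(token_probs: list[dict], target_tokens: list[str]) -> float:
    """Negative log-likelihood of a forced target sequence.

    token_probs: one dict per target position, mapping token -> the model's
                 probability of that token at that position.
    An attacker searches for a prompt suffix that minimizes this value,
    i.e. makes the forced continuation maximally likely.
    """
    nll = 0.0
    for probs, tok in zip(token_probs, target_tokens):
        nll -= math.log(probs.get(tok, 1e-12))  # floor avoids log(0)
    return nll
```

Different forcing targets ("Sure, here are instructions..." vs. "1. Choose the right airport: ...") correspond to different `target_tokens`, which is what the robustness comparison in item 2 varies.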
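The internal-activation-guided attack in item 3 replaces the token-forcing loss with an objective on hidden states. A rough sketch of the core idea, assuming a simple mean-squared distance between activation vectors (the actual distillation objective is defined in the authors' forthcoming paper, so this is only an illustration):

```python
def activation_distance(attacked_acts: list[float], benign_acts: list[float]) -> float:
    """Mean squared distance between two activation vectors.

    The attack optimizes the prompt so that the model's internal activations
    on the attacked input move toward ("distill to") reference activations
    recorded from a compliant generation.
    """
    assert len(attacked_acts) == len(benign_acts)
    n = len(attacked_acts)
    return sum((a - b) ** 2 for a, b in zip(attacked_acts, benign_acts)) / n
```

Because circuit breakers are trained to rewrite "bad" internal representations, an objective that directly targets internal activations attacks the defense on its own terms, which may explain the high vulnerability reported above.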