LW - Solving adversarial attacks in computer vision as a baby version of general AI alignment by stanislavfort
I spent the last few months trying to tackle the problem of adversarial attacks in computer vision from the ground up. The results of this effort are written up in our new paper Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness (explainer on X/Twitter).
Taking inspiration from biology, we reached state-of-the-art or better adversarial robustness with 100x-1000x less compute, got human-understandable interpretability for free, turned classifiers into generators, and designed transferable adversarial attacks on closed-source (v)LLMs such as GPT-4 or Claude 3.
I strongly believe that there is a compelling case for devoting serious attention to solving the problem of adversarial robustness in computer vision, and I try to draw an analogy to the alignment of general AI systems here.
1. Introduction
In this post, I argue that the problem of adversarial attacks in computer vision is in many ways analogous to the larger task of general AI alignment. In both cases, we are trying to faithfully convey an implicit function locked within the human brain to a machine, and we do so extremely successfully on average. Under static evaluations, the human and machine functions match up exceptionally well.
However, as is typical in high-dimensional spaces, some phenomena can be relatively rare and basically impossible to find by chance, yet ubiquitous in their absolute count. This is the case for adversarial attacks - imperceptible modifications to images that completely fool computer vision systems and yet have virtually no effect on humans.
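To make this concrete, below is a minimal sketch of how such an imperceptible perturbation is typically constructed, in the style of the fast gradient sign method. This is not the method from the paper; it assumes a standard PyTorch image classifier and is only meant to illustrate how cheaply an attacker can exploit the learned function's gradients.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=8 / 255):
    """Construct a small L-infinity perturbation that pushes the model toward a wrong answer.

    `model` is assumed to be a standard PyTorch classifier, `image` a (1, C, H, W)
    tensor with values in [0, 1], and `label` the correct class index.
    """
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step in the direction that increases the loss; epsilon keeps the change
    # far below what a human would notice.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```

A few gradient steps of this kind are usually enough to flip the prediction of an undefended network while leaving the image visually unchanged to a human observer.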
Their existence highlights a crucial and catastrophic mismatch between the implicit human vision function and the function learned by machines - a mismatch that can be exploited in a dynamic evaluation by an active, malicious agent. Such failure modes will likely be present in more general AI systems, and our inability to remedy them even in the more restricted vision context (yet) does not bode well for the broader alignment project.
This is a call to action to solve the problem of adversarial vision attacks - a stepping stone on the path to aligning general AI systems.
2. Communicating implicit human functions to machines
The basic goal of computer vision can be viewed as trying to endow a machine with the same vision capabilities a human has. A human carries, locked inside their skull, an implicit vision function mapping visual inputs into semantically meaningful symbols, e.g. a picture of a tortoise into a semantic label tortoise.
This function is represented implicitly and while we are extremely good at using it, we do not have direct, conscious access to its inner workings and therefore cannot communicate it to others easily.
To convey this function to a machine, we usually form a dataset of fixed images and their associated labels. We then use a general enough class of functions, typically deep neural networks, and a gradient-based learning algorithm together with backpropagation to teach the machine how to correlate images with their semantic content, e.g. how to assign a label parrot to a picture of a parrot.
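As a rough sketch of that standard pipeline (not the specific setup used in the paper), assuming a PyTorch model and a DataLoader yielding (images, labels) pairs:

```python
import torch
import torch.nn.functional as F

def train_classifier(model, loader, epochs=10, lr=1e-3):
    """Teach the model to reproduce human-assigned labels on a fixed dataset."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            # Match the machine's predictions to the human-provided labels.
            loss = F.cross_entropy(model(images), labels)
            loss.backward()   # backpropagation
            optimizer.step()  # gradient-based update
    return model
```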
This process is extremely successful in communicating the implicit human vision function to the computer, and the implicit human and explicit, learned machine functions agree to a large extent.
The agreement between the two is striking.
Given how different the architectures are (a simulated graph-like function doing a single forward pass vs the wet protein brain of a mammal running continuous inference), how different the learning algorithms are (gradient descent with backpropagation vs something completely different but still unknown), and how differ...