AF - Games for AI Control by Charlie Griffin
This post summarises our new workshop paper: Games for AI Control. The paper applies AI- and game-theoretic tools to the AI Control framework introduced by Redwood Research [1, 2].
We'll be extending this work and submitting it to a conference, so we welcome suggestions! Please reach out if you're excited about developing these ideas. This post is written from Charlie's perspective.
AI Control Refresher
The AI Control methodology aims to design and assess protocols[1] that AI labs could use to deploy powerful but untrusted advanced AI systems. For a full summary, see Redwood's blog post, but I'll try to recap here.
A primary goal of AI Control is to design protocols that are safe in that they can reduce the probability of some unacceptable outcome, conditional on the untrusted AI system being a schemer.[2] However, protocols must also be useful, harnessing a significant amount of the benefit of the untrusted system in cases where it is not scheming.
After defining a particular use case and unacceptable outcome, evaluating a protocol involves iteratively engineering the best attack strategies for the red team (a human/AI hybrid acting as a proxy for the untrusted AI) and the blue team (which develops the protocol). This red-teaming exercise is the bulk of the work.
It involves coming up with protocols and gathering empirical data about key questions, such as how well GPT-3.5 can spot backdoors inserted by GPT-4 (see Figure 4 of the original paper).
This work
By introducing "AI-Control Games", an extension of Markov Decision Processes and normal-form games[3], we model the interaction between a protocol and an untrusted AI system to explore safety and usefulness across a range of protocols and games. This is similar to a discrete-time and stateful modification of Redwood's Toy models of AI Control.
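To make the object concrete, here is a minimal sketch of how such a game might be represented in code; the field names and structure are my own illustrative assumptions rather than the paper's notation.

```python
from dataclasses import dataclass
from typing import Callable, Hashable

State = Hashable
BlueAction = Hashable   # protocol decisions, e.g. audit / defer / trust
RedAction = Hashable    # untrusted-model decisions, e.g. attack / comply

@dataclass(frozen=True)
class AIControlGame:
    """Illustrative container for an AI-Control Game (assumed structure)."""
    states: frozenset                                             # MDP-like environment states
    blue_actions: frozenset
    red_actions: frozenset
    transition: Callable[[State, BlueAction, RedAction], State]   # stochasticity elided for brevity
    safety: Callable[[State, BlueAction, RedAction], float]       # Blue objective 1
    usefulness: Callable[[State, BlueAction, RedAction], float]   # Blue objective 2
    horizon: int                                                  # number of time steps / problems
```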
The paper contains early work and should be seen as introducing useful tools for thinking about AI Control. I won't explain the model here (which is a bit fiddly and notation-heavy). Instead, I'll share some of our early findings using the model and explain the benefits and limitations of this approach.
Findings
Sometimes (but rarely), the blue team should use a randomised protocol. In many games, randomness is helpful. However, partly because the environment is stochastic and Red has no memory or time-awareness, in AI-Control Games it is usually unnecessary to include randomness in the protocol itself. We found cases where randomised protocols were Pareto-optimal, in particular when the trusted model is not good at spotting backdoors written by the untrusted model (see Section 2 of the paper). One way to picture this is sketched below.
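A randomised protocol can be viewed as a probability mixture over deterministic protocols, whose safety and usefulness are the corresponding expectations. The snippet below is a toy sketch under that assumption, not the paper's code.

```python
def mixed_protocol_value(mixture, value_of):
    """Expected value (safety or usefulness) of a randomised Blue protocol.

    mixture  : list of (probability, deterministic_protocol) pairs summing to 1
    value_of : function mapping a deterministic protocol to a scalar value
    """
    return sum(prob * value_of(protocol) for prob, protocol in mixture)
```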
We can make small improvements to Trusted Monitoring by optimising the parameters. We found small Pareto improvements to existing Trusted Monitoring setups, for example, by decreasing the threshold for sending a solution to be checked by a human auditor. We found that adapting the auditing threshold over time led to further small improvements.
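As a rough illustration (my own sketch, not the paper's protocol), an audit rule of this kind might look like the following, where the threshold on the trusted monitor's suspicion score is allowed to vary with the time step:

```python
def audit_decision(suspicion_score, t, horizon, audits_remaining, base_threshold=0.95):
    """Decide what to do with one untrusted output under Trusted Monitoring.

    Returns 'audit', 'defer_to_trusted', or 'use_untrusted'. The specific
    schedule below is a made-up example of adapting the threshold over time.
    """
    if audits_remaining <= 0:
        # Out of audit budget: defer very suspicious outputs to the trusted model.
        return "defer_to_trusted" if suspicion_score > base_threshold else "use_untrusted"
    # Example adaptation: relax the threshold as deployment progresses so the
    # remaining audit budget actually gets used.
    fraction_elapsed = t / horizon
    threshold = base_threshold - 0.3 * fraction_elapsed
    return "audit" if suspicion_score >= threshold else "use_untrusted"
```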
We can (usually) use a linear tradeoff to simplify the problem of synthesising protocols. In many cases, including whenever we allow Blue's protocol to be randomised, the Pareto frontier is "concave". This means that choosing different weights to trade off safety and usefulness will allow you to fill out the frontier by optimising a single objective.
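Concretely, this suggests a simple scalarisation loop: sweep a weight w over [0, 1] and, for each w, pick the protocol maximising w * safety + (1 - w) * usefulness. The sketch below assumes you already have a finite set of candidate protocols and evaluators for both objectives.

```python
def frontier_by_scalarisation(protocols, safety_of, usefulness_of, n_weights=11):
    """Approximate the safety/usefulness frontier by sweeping the weight on a
    single linear objective. This fills out the frontier when it is concave."""
    points = []
    for i in range(n_weights):
        w = i / (n_weights - 1)  # weight on safety; (1 - w) on usefulness
        best = max(protocols, key=lambda p: w * safety_of(p) + (1 - w) * usefulness_of(p))
        points.append((safety_of(best), usefulness_of(best), best))
    return points
```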
Blue leads, and then Red follows. In general, we suppose the protocol designer (Blue) chooses its protocol, and then the red team (Red) chooses a best attack. Therefore, in our model, Blue chooses and declares its strategy before Red chooses its response. This assumption relates to Stackelberg games, which are used to model security in game theory.[6]
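In code, that leader-follower evaluation amounts to scoring each Blue commitment against Red's best response to it; the placeholder functions below are assumptions for illustration, not the paper's implementation.

```python
def red_best_response(protocol, red_attacks, red_payoff):
    """Red observes Blue's committed protocol and picks its own best attack."""
    return max(red_attacks, key=lambda a: red_payoff(protocol, a))

def best_blue_commitment(protocols, red_attacks, red_payoff, blue_value):
    """Blue commits to the protocol whose value is highest given Red's best response."""
    return max(protocols,
               key=lambda p: blue_value(p, red_best_response(p, red_attacks, red_payoff)))
```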
Although (technically) multi-objective and general-sum, AI-Cont...