
Content provided by The Nonlinear Fund. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by The Nonlinear Fund or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.

LW - Free Will and Dodging Anvils: AIXI Off-Policy by Cole Wyeth

16:01
 

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Free Will and Dodging Anvils: AIXI Off-Policy, published by Cole Wyeth on September 1, 2024 on LessWrong.

This post depends on a basic understanding of history-based reinforcement learning and the AIXI model. I am grateful to Marcus Hutter and the LessWrong team for early feedback, though any remaining errors are mine.

The universal agent AIXI treats the environment it interacts with like a video game it is playing; the actions it chooses at each step are like hitting buttons, and the percepts it receives are like images on the screen (observations) and an unambiguous point tally (rewards). It has been suggested that since AIXI is inherently dualistic and doesn't believe anything in the environment can "directly" hurt it, if it were embedded in the real world it would eventually drop an anvil on its head to see what would happen. This is certainly possible, because the math of AIXI cannot explicitly represent the idea that AIXI is running on a computer inside the environment it is interacting with. For one thing, that possibility is not in AIXI's hypothesis class (which I will write $\mathcal{M}$). There is no easy patch, because AIXI is defined as the optimal policy for a belief distribution over its hypothesis class, but we don't really know how to talk about optimality for embedded agents (so the expectimax tree definition of AIXI cannot be easily extended to handle embeddedness). On top of that, "any" environment "containing" AIXI is at the wrong computability level for a member of $\mathcal{M}$: our best upper bound on AIXI's computability level is $\Delta^0_2$ = limit-computable (for an ε-approximation) instead of the $\Sigma^0_1$ level of its environment class. Reflective oracles can fix this, but at the moment there does not seem to be a canonical reflective oracle, so there remains a family of equally valid reflective versions of AIXI without an objective favorite.

However, in my conversations with Marcus Hutter (the inventor of AIXI), he has always insisted AIXI would not drop an anvil on its head, because Cartesian dualism is not a problem for humans in the real world, who historically believed in a metaphysical soul and mostly got along fine anyway. But when humans stick electrodes in our brains, we can observe changed behavior and deduce that our cognition is physical. Would this kind of experiment allow AIXI to make the same discovery? Though we could not agree on this for some time, we eventually discovered the crux: we were actually using slightly different definitions for how AIXI should behave off-policy.

In particular, let $\xi^{AI}$ be the belief distribution of AIXI. More explicitly, $\xi^{AI}$ is a Bayesian mixture over $\mathcal{M}$, $\xi^{AI} := \sum_{\nu \in \mathcal{M}} w_\nu \nu$, with positive prior weights $w_\nu$. I will not attempt a formal definition here. The only thing we need to know is that $\mathcal{M}$ is a set of environments which AIXI considers possible. AIXI interacts with an environment by sending it a sequence of actions $a_1, a_2, \ldots$ in exchange for a sequence of percepts, each containing an observation and a reward, $e_1 = o_1 r_1, e_2 = o_2 r_2, \ldots$, so that action $a_t$ precedes percept $e_t$. One neat property of AIXI is that its choice of $\mathcal{M}$ satisfies $\xi^{AI} \in \mathcal{M}$ (this trick is inherited with minor changes from the construction of Solomonoff's universal distribution).

Now let $V^\pi_\mu$ be a (discounted) value function for policy $\pi$ interacting with environment $\mu$, which is the expected sum of discounted rewards obtained by $\pi$. We can define the AIXI agent as $\pi^{AIXI} := \arg\max_\pi V^\pi_{\xi^{AI}}$. By the Bellman equations, this also specifies AIXI's behavior on any history it can produce (all finite percept strings have nonzero probability under $\xi^{AI}$). However, it does not tell us how AIXI behaves when the history includes actions it would not have chosen.

In that case, the natural extension is to choose, at every history, an action that maximizes expected value under $\xi^{AI}$ conditioned on that history, so that AIXI continues to act optimally (with respect to its updated belief distribution) even when some suboptimal actions have previously been taken. The philosophy of this extension is that AIXI acts exactly as if...
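The off-policy extension described above can be illustrated with a toy sketch. It is not AIXI: the hypothesis class here is an assumed two-element set of simple stochastic environments, percepts are reduced to a binary reward, and planning uses a short finite horizon. All names (`env_a`, `env_b`, `posterior`, `q_value`, `act`) are hypothetical. The point it tries to show is that past actions only condition the environments' percept probabilities, so the agent can update its beliefs and plan optimally from any history, including one containing actions it would not have chosen.

```python
from typing import Callable, List, Sequence, Tuple

# A history is a sequence of (action, reward) pairs; observations are omitted
# in this toy model (an assumption made for brevity).
History = Tuple[Tuple[int, int], ...]
# An environment gives P(reward | history, action).
Env = Callable[[History, int, int], float]

ACTIONS = (0, 1)
REWARDS = (0, 1)


def env_a(history: History, action: int, reward: int) -> float:
    """Hypothetical environment in which action 1 usually pays off."""
    p_one = 0.9 if action == 1 else 0.1
    return p_one if reward == 1 else 1.0 - p_one


def env_b(history: History, action: int, reward: int) -> float:
    """Hypothetical environment in which action 0 usually pays off."""
    p_one = 0.9 if action == 0 else 0.1
    return p_one if reward == 1 else 1.0 - p_one


def likelihood(env: Env, history: History) -> float:
    """Probability of the observed rewards, given the actions as inputs."""
    p = 1.0
    for t, (action, reward) in enumerate(history):
        p *= env(history[:t], action, reward)
    return p


def posterior(envs: Sequence[Env], prior: Sequence[float],
              history: History) -> List[float]:
    """Bayesian update of the mixture weights; actions carry no evidence."""
    weights = [w * likelihood(env, history) for env, w in zip(envs, prior)]
    total = sum(weights)
    return [w / total for w in weights]


def q_value(envs: Sequence[Env], prior: Sequence[float], history: History,
            action: int, steps_left: int, gamma: float) -> float:
    """Finite-horizon expectimax value of taking `action` after `history`."""
    post = posterior(envs, prior, history)
    value = 0.0
    for reward in REWARDS:
        p = sum(w * env(history, action, reward) for env, w in zip(envs, post))
        if p == 0.0:
            continue
        future = 0.0
        if steps_left > 1:
            extended = history + ((action, reward),)
            future = max(q_value(envs, prior, extended, a, steps_left - 1, gamma)
                         for a in ACTIONS)
        value += p * (reward + gamma * future)
    return value


def act(envs: Sequence[Env], prior: Sequence[float], history: History,
        steps_left: int, gamma: float = 0.9) -> int:
    """Pick the best action from ANY history, on-policy or not."""
    return max(ACTIONS,
               key=lambda a: q_value(envs, prior, history, a, steps_left, gamma))


ENVS = (env_a, env_b)
PRIOR = (0.5, 0.5)
```

For example, calling `act(ENVS, PRIOR, ((0, 1), (0, 1), (0, 1)), steps_left=2)` asks what the agent does after action 0 was taken three times (whoever chose it) and was rewarded each time. The history is treated as data about the environment, and planning proceeds from the updated mixture.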


