LW - A Simple Toy Coherence Theorem by johnswentworth

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Simple Toy Coherence Theorem, published by johnswentworth on August 2, 2024 on LessWrong.

This post presents a simple toy coherence theorem, and then uses it to address various common confusions about coherence arguments.

Setting

Deterministic MDP. That means at each time t there's a state S[t], the agent/policy takes an action A[t] (which can depend on both the time t and the current state S[t]), and then the next state S[t+1] is fully determined by S[t] and A[t]. The current state and current action are sufficient to tell us the next state. We will think about values over the state at some final time T. Note that often in MDPs there is an incremental reward at each timestep in addition to a final reward at the end; in our setting there is zero incremental reward at each timestep.

One key point about this setting: if the value over final states is uniform, i.e. the same value for all final states, then the MDP is trivial. In that case all policies are optimal: it does not matter at all what the final state is, or what any state along the way is, since everything is equally valuable.

Theorem

There exist policies which cannot be optimal for any values over final states except for the trivial case of uniform values. Furthermore, such policies are exactly those which display inconsistent revealed preferences transitively between all final states.

Proof

As a specific example: consider an MDP in which every state is reachable at every timestep, and a policy which always stays in the same state over time. From each state S every other state is reachable, yet the policy chooses S, so in order for the policy to be optimal, S must be a highest-value final state. Since every state must then be a highest-value state, the policy cannot be optimal for any values over final states except for the trivial case of uniform values. That establishes the existence part of the theorem, and you can probably get the whole idea by thinking about how to generalize that example. The rest of the proof extends the idea of that example to inconsistent revealed preferences in general.

Bulk of Proof

Assume the policy is optimal for some particular values over final states. We can then start from those values over final states and compute the best value achievable starting from each state at each earlier time. That's just dynamic programming:

V[S,t] = max over S' reachable in the next timestep from S of V[S',t+1]

where V[S,T] are the values over final states. A policy is optimal for final values V[S,T] if-and-only-if at each timestep t-1 it chooses a next state with highest reachable V[S,t].
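To make the backward induction and the optimality criterion concrete, here is a minimal Python sketch (not from the post; the function names and the tiny fully-connected MDP are illustrative assumptions). It computes V[S,t] by dynamic programming and then checks whether a given policy always moves to a next state with highest reachable value.

    # Minimal sketch: a deterministic MDP given as a successor relation, values over
    # final states, and a policy written directly as "which next state do I move to
    # from state s at time t" (in a deterministic MDP, an action just picks a successor).
    def value_tables(states, successors, final_values, T):
        # V[t][s] = best achievable final value starting from state s at time t
        V = {T: dict(final_values)}
        for t in range(T - 1, -1, -1):
            V[t] = {s: max(V[t + 1][s2] for s2 in successors(s)) for s in states}
        return V

    def is_optimal(policy, states, successors, final_values, T):
        # Optimality criterion from the post: at each step the policy must move to a
        # successor with highest reachable value at the next timestep.
        V = value_tables(states, successors, final_values, T)
        for t in range(T):
            for s in states:
                chosen = policy(s, t)  # next state the policy moves to from s at time t
                best = max(V[t + 1][s2] for s2 in successors(s))
                if V[t + 1][chosen] < best:
                    return False
        return True

    # The existence example from the proof: every state reachable from every state,
    # and a policy which always stays put. It is optimal only for uniform final values.
    states = ["A", "B", "C"]
    successors = lambda s: set(states)   # fully connected: every state reachable each step
    stay = lambda s, t: s                # never move
    print(is_optimal(stay, states, successors, {"A": 1, "B": 1, "C": 1}, T=3))  # True
    print(is_optimal(stay, states, successors, {"A": 2, "B": 1, "C": 1}, T=3))  # False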
Now, suppose that at timestep t there are two different states, from either of which both state A and state B are reachable in the next timestep. From one of those states the policy chooses A; from the other the policy chooses B. This is an inconsistent revealed preference between A and B at time t: sometimes the policy has a revealed preference for A over B, sometimes for B over A. In order for a policy with an inconsistent revealed preference between A and B at time t to be optimal, the values must satisfy V[A,t] = V[B,t].

Why? Well, a policy is optimal for final values V[S,T] if-and-only-if at each timestep t-1 it chooses a next state with highest reachable V[S,t]. So, if an optimal policy sometimes chooses A over B at timestep t when both are reachable, then we must have V[A,t] ≥ V[B,t]. And if an optimal policy sometimes chooses B over A at timestep t when both are reachable, then we must have V[A,t] ≤ V[B,t]. If both of those occur, i.e. the policy has an inconsistent revealed preference between A and B at time t, then V[A,t] = V[B,t].

Now, we can propagate that equality to a revealed preference on final states. We know that the final state which the policy in fact reaches starting from A at time t must have the highest reachable value, and that value...
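Continuing the same sketch, the revealed-preference condition used in the proof can be checked mechanically: the policy is inconsistent between A and B at time t whenever there are two states from which both A and B are reachable, with the policy moving to A from one and to B from the other. The helper below (hypothetical name, not from the post; it reuses states, successors, and stay from the sketch above) flags such pairs. By the argument above, each flagged triple (t, A, B) forces V[A,t] = V[B,t] for any final values under which the policy is optimal.

    # Sketch: find inconsistent revealed preferences of a policy in the same toy setting.
    def inconsistent_preferences(policy, states, successors, T):
        found = []
        for t in range(1, T + 1):        # t indexes the state the policy moves into
            chosen_over = set()          # (chosen next state, reachable-but-passed-over state)
            for s in states:
                a = policy(s, t - 1)     # next state chosen from s at timestep t-1
                for other in successors(s) - {a}:
                    chosen_over.add((a, other))
            for x, y in chosen_over:
                # Both "x chosen over y" and "y chosen over x" occur somewhere at this timestep.
                if (y, x) in chosen_over and (t, y, x) not in found:
                    found.append((t, x, y))
        return found

    # The stay-put policy from the proof's example is inconsistent between every pair of states.
    print(inconsistent_preferences(stay, states, successors, T=3))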