Policy Gradient Theorem
The Policy Gradient Theorem states that the gradient of the performance measure with respect to the policy parameter is proportional to a sum, over states weighted by how often they are visited, of the action-value function times the gradient of the policy:

$$\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)$$
The task is assumed to be episodic, and the performance measure $J(\theta)$ is the value of the start state of the episode:

$$J(\theta) \doteq v_{\pi_\theta}(s_0)$$
- $\mathcal{S}$ is the state space.
- $\pi(a \mid s, \theta)$ is the policy, parameterized by $\theta$.
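The text does not commit to a particular parameterization; a common concrete choice (assumed here purely for illustration, with hypothetical per-action feature vectors `phi`) is a softmax over linear action preferences, for which both $\pi(a \mid s, \theta)$ and $\nabla \ln \pi(a \mid s, \theta)$ have simple closed forms:

```python
import numpy as np

def softmax_policy(theta, phi):
    """pi(a|s,theta) with linear preferences h(s,a,theta) = theta . phi[a].

    phi: (num_actions, d) array -- one feature vector per action for the
    current state (a hypothetical feature layout, not from the text).
    """
    prefs = phi @ theta          # action preferences h(s, a, theta)
    prefs = prefs - prefs.max()  # shift for numerical stability
    expp = np.exp(prefs)
    return expp / expp.sum()

def grad_log_pi(theta, phi, a):
    """grad_theta log pi(a|s,theta) = phi[a] - sum_b pi(b|s,theta) phi[b]."""
    pi = softmax_policy(theta, phi)
    return phi[a] - pi @ phi

rng = np.random.default_rng(0)
theta = rng.normal(size=3)        # policy parameter, d = 3
phi = rng.normal(size=(4, 3))     # 4 actions, 3 features each
pi = softmax_policy(theta, phi)
print(pi.sum())                   # -> 1.0 up to floating-point error
```

The `grad_log_pi` form is the one that appears in the theorem via the identity $\nabla \pi(a \mid s, \theta) = \pi(a \mid s, \theta)\, \nabla \ln \pi(a \mid s, \theta)$.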
On-Policy Distribution
Definition of the on-policy state distribution $\mu(s)$:

- The state distribution $\mu(s)$ is the probability of being in state $s$, normalized so that $\sum_s \mu(s) = 1$:

$$\mu(s) \doteq \frac{\eta(s)}{\sum_{s'} \eta(s')}$$

- $h(s)$ denotes the probability of starting an episode in state $s$.
- $\eta(s)$ denotes the number of time steps, on average, that state $s$ is visited in an episode:

$$\eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_{a} \pi(a \mid \bar{s}, \theta)\, p(s \mid \bar{s}, a)$$
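As a concrete illustration (a hypothetical two-state MDP, not from the text), the recursion for $\eta$ is linear in $\eta$, so it can be solved directly, and $\mu$ then follows by normalizing:

```python
import numpy as np

# Hypothetical two-state episodic MDP (states 0 and 1; a terminal state
# is implicit).  P[s_bar, s] = sum_a pi(a|s_bar, theta) p(s|s_bar, a):
# policy-averaged transition probabilities among the NON-terminal states.
# Rows may sum to less than 1 -- the remainder is termination probability.
P = np.array([[0.0, 0.5],    # from 0: to 1 w.p. 0.5, terminate w.p. 0.5
              [0.2, 0.0]])   # from 1: to 0 w.p. 0.2, terminate w.p. 0.8
h = np.array([1.0, 0.0])     # h(s): every episode starts in state 0

# eta(s) = h(s) + sum_{s_bar} eta(s_bar) P[s_bar, s], i.e. eta = h + P^T eta,
# so (I - P^T) eta = h:
eta = np.linalg.solve(np.eye(2) - P.T, h)
mu = eta / eta.sum()         # on-policy distribution mu(s)
print(eta)                   # -> [10/9, 5/9] ~ [1.111, 0.556]
print(mu)                    # -> [2/3, 1/3]
```

Note that $\eta$ sums to the average episode length ($5/3$ here), while $\mu$ sums to $1$.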
Proof of the Policy Gradient Theorem
- Given a start state, the performance measure depends on both the action selections and the distribution of states in which those selections are made.
- The policy parameter $\theta$ affects both of those.
- The goal is to determine the performance gradient without involving the gradient of the state distribution, since the state distribution depends on the environment and is usually not known.
- The Policy Gradient Theorem provides a solution: it expresses the gradient of the performance measure in terms of the gradient of the policy and the action-value function, with no gradient of the state distribution:

$$\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)$$
- All gradients are with respect to $\theta$.
- $\Pr(s \to x, k, \pi)$ is the probability of transitioning from state $s$ to state $x$ in $k$ steps under policy $\pi$.
- $\propto$ denotes proportionality; in the episodic case the constant of proportionality is the average length of an episode.
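The bullets above can be expanded into the standard derivation: apply the product rule to $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$, express $\nabla q_\pi(s, a)$ through the successor state's value (the reward term does not depend on $\theta$), and unroll the recursion:

$$
\begin{aligned}
\nabla v_\pi(s) &= \nabla \sum_a \pi(a \mid s)\, q_\pi(s, a) \\
&= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s)\, \nabla q_\pi(s, a) \Big] \\
&= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s') \Big] \\
&= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla \pi(a \mid x)\, q_\pi(x, a),
\end{aligned}
$$

after unrolling $\nabla v_\pi(s')$ repeatedly. Setting $s = s_0$ and using $\eta(x) = \sum_{k=0}^{\infty} \Pr(s_0 \to x, k, \pi)$ gives

$$
\nabla J(\theta) = \nabla v_\pi(s_0) = \sum_{s} \eta(s) \sum_a \nabla \pi(a \mid s, \theta)\, q_\pi(s, a) \propto \sum_{s} \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s, \theta).
$$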