
Policy Gradient Theorem

The Policy Gradient Theorem states that the gradient of the performance measure with respect to the policy parameter $\theta$ is proportional to a sum of the gradients of the policy with respect to $\theta$, weighted by the action-value function.

The task is assumed to be episodic, and the performance measure $J(\theta)$ is the value of the start state of the episode, $J(\theta) = v_\pi(s_0)$.

  • $\mathcal{S}$ is the state space.
  • $\pi$ is the policy, parameterized by $\theta$.

On-Policy Distribution

Definition of the on-policy state distribution $\mu(s)$:

  • The state distribution $\mu(s)$ is the probability of being in state $s$, with $\sum_{s \in \mathcal{S}} \mu(s) = 1$.
  • $h(s)$ denotes the probability of an episode starting in state $s$.
  • $\eta(s)$ denotes the number of times, on average, $s$ is visited in an episode.

$$\eta(s)=h(s)+\sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s})\, p(s \mid \bar{s}, a), \quad \forall s \in \mathcal{S}$$

$$\mu(s)=\frac{\eta(s)}{\sum_{s^{\prime}} \eta\left(s^{\prime}\right)}, \quad \forall s \in \mathcal{S}$$
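To make the recursion concrete, here is a minimal sketch (with a hypothetical two-state episodic MDP whose transition numbers are invented for illustration). Since the $\eta$ recursion is linear, $\eta = h + P_\pi^\top \eta$ can be solved directly, and $\mu$ is just the normalization:

```python
import numpy as np

# Hypothetical episodic MDP: non-terminal states {0, 1}; any transition
# probability mass not listed below terminates the episode.
pi = np.array([[0.6, 0.4],       # pi(a|s): policy probabilities
               [0.3, 0.7]])
P = np.array([[[0.0, 0.8],       # p(s'|s=0, a=0): -> s'=1 w.p. 0.8, else terminate
               [0.0, 0.5]],      # p(s'|s=0, a=1)
              [[0.2, 0.0],       # p(s'|s=1, a=0)
               [0.4, 0.3]]])     # p(s'|s=1, a=1)
h = np.array([1.0, 0.0])         # episodes always start in state 0

# Policy-averaged transition matrix: P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)
P_pi = np.einsum('sa,sax->sx', pi, P)

# eta(s) = h(s) + sum_{s_bar} eta(s_bar) P_pi[s_bar, s] is a linear system:
# eta = h + P_pi^T eta  =>  eta = (I - P_pi^T)^{-1} h
eta = np.linalg.solve(np.eye(2) - P_pi.T, h)

# On-policy distribution: normalize eta so it sums to 1
mu = eta / eta.sum()
print("eta:", eta)
print("mu:", mu)
```

Note that $\eta(0) \geq 1$ here, since the start state is counted once by $h$ and again on every revisit.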

Proof of the Policy Gradient Theorem

  • Given a start state, the performance measure $J$ depends on both the actions chosen and the distribution of states in which those choices are made.
  • The policy parameter $\theta$ affects both.
  • The goal is to express the performance gradient without involving the gradient of the state distribution, since the state distribution depends on the environment and is usually unknown.
  • The Policy Gradient Theorem provides exactly this: the gradient of the performance measure with respect to $\theta$ is proportional to a sum of the gradients of the policy with respect to $\theta$, weighted by the action-value function.
  • All gradients below are with respect to $\theta$.
$$
\begin{aligned}
\nabla v_\pi(s) &= \nabla\Big[\sum_a \pi(a \mid s)\, q_\pi(s, a)\Big], \quad \text{for all } s \in \mathcal{S} \\
&= \sum_a\Big[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s)\, \nabla q_\pi(s, a)\Big] \quad \text{(product rule of calculus)} \\
&= \sum_a\Big[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s)\, \nabla \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left(r + v_\pi\left(s^{\prime}\right)\right)\Big] \\
&= \sum_a\Big[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \nabla v_\pi\left(s^{\prime}\right)\Big] \\
&= \sum_a\Big[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \\
&\qquad \sum_{a^{\prime}}\big[\nabla \pi\left(a^{\prime} \mid s^{\prime}\right) q_\pi\left(s^{\prime}, a^{\prime}\right) + \pi\left(a^{\prime} \mid s^{\prime}\right) \sum_{s^{\prime\prime}} p\left(s^{\prime\prime} \mid s^{\prime}, a^{\prime}\right) \nabla v_\pi\left(s^{\prime\prime}\right)\big]\Big] \quad \text{(unrolling)} \\
&= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \operatorname{Pr}(s \rightarrow x, k, \pi) \sum_a \nabla \pi(a \mid x)\, q_\pi(x, a) \quad \text{(after repeated unrolling)}
\end{aligned}
$$
  • $\operatorname{Pr}(s \rightarrow x, k, \pi)$ is the probability of transitioning from state $s$ to state $x$ in $k$ steps under policy $\pi$.
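The key identity used next, $\sum_{k=0}^{\infty} \operatorname{Pr}(s_0 \rightarrow s, k, \pi) = \eta(s)$, can be checked numerically. In a finite MDP, $\operatorname{Pr}(s \rightarrow x, k, \pi)$ is the $(s, x)$ entry of the $k$-th power of the policy-averaged transition matrix $P_\pi$. A sketch on a hypothetical two-state episodic MDP (illustrative numbers):

```python
import numpy as np

# Hypothetical episodic MDP; unlisted probability mass terminates the episode.
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])                  # pi(a|s)
P = np.array([[[0.0, 0.8], [0.0, 0.5]],
              [[0.2, 0.0], [0.4, 0.3]]])     # p(s'|s,a), sub-stochastic
h = np.array([1.0, 0.0])                     # start distribution (s0 = 0)

P_pi = np.einsum('sa,sax->sx', pi, P)        # P_pi[s,s'] = sum_a pi(a|s) p(s'|s,a)

# Truncated sum over k: Pr(s0 -> s, k, pi) is row s0 of P_pi^k.
# The series converges because P_pi is sub-stochastic (episodes terminate).
total = np.zeros(2)
P_k = np.eye(2)                              # P_pi^0, the k = 0 term
for k in range(200):
    total += h @ P_k
    P_k = P_k @ P_pi

# Compare with eta from its recursion: eta = (I - P_pi^T)^{-1} h
eta = np.linalg.solve(np.eye(2) - P_pi.T, h)
print(total, eta)
```

The two vectors agree to numerical precision, which is exactly the step that turns the unrolled sum into $\eta(s)$ below.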
$$
\begin{aligned}
\nabla J(\boldsymbol{\theta}) &= \nabla v_\pi\left(s_0\right) \\
&= \sum_s\Big(\sum_{k=0}^{\infty} \operatorname{Pr}\left(s_0 \rightarrow s, k, \pi\right)\Big) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a) \\
&= \sum_s \eta(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a) \\
&= \sum_{s^{\prime}} \eta\left(s^{\prime}\right) \sum_s \frac{\eta(s)}{\sum_{s^{\prime}} \eta\left(s^{\prime}\right)} \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a) \\
&= \sum_{s^{\prime}} \eta\left(s^{\prime}\right) \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a) \\
&\propto \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)
\end{aligned}
$$
  • Here $\propto$ denotes proportionality; the constant of proportionality is $\sum_{s^{\prime}} \eta\left(s^{\prime}\right)$, the average length of an episode.
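The full result can be sanity-checked end to end: the line $\nabla J(\theta) = \sum_s \eta(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)$ is an exact equality, so it should match a finite-difference gradient of $J(\theta) = v_\pi(s_0)$. The sketch below uses a hypothetical two-state episodic MDP with invented rewards and a softmax policy over a preference table $\theta[s, a]$ (for softmax, $\sum_a \nabla_{\theta_{s,b}} \pi(a \mid s)\, q_\pi(s, a) = \pi(b \mid s)\,(q_\pi(s, b) - v_\pi(s))$):

```python
import numpy as np

# Hypothetical episodic MDP; unlisted transition mass terminates the episode.
P = np.array([[[0.0, 0.8], [0.0, 0.5]],
              [[0.2, 0.0], [0.4, 0.3]]])  # p(s'|s,a)
R = np.array([[1.0, 0.5],
              [0.0, 2.0]])                # expected reward r(s,a), illustrative
s0 = 0

def policy(theta):
    """Softmax policy pi(a|s) from a table of action preferences theta[s,a]."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def J(theta):
    """Performance measure J(theta) = v_pi(s0), computed exactly."""
    pi = policy(theta)
    P_pi = np.einsum('sa,sax->sx', pi, P)
    r_pi = (pi * R).sum(axis=1)
    v = np.linalg.solve(np.eye(2) - P_pi, r_pi)  # v = r_pi + P_pi v
    return v[s0]

theta = np.array([[0.1, -0.2], [0.5, 0.3]])      # arbitrary parameters
pi = policy(theta)
P_pi = np.einsum('sa,sax->sx', pi, P)
r_pi = (pi * R).sum(axis=1)
v = np.linalg.solve(np.eye(2) - P_pi, r_pi)
q = R + np.einsum('sax,x->sa', P, v)             # q(s,a) = r + sum_s' p v(s')
eta = np.linalg.solve(np.eye(2) - P_pi.T, np.array([1.0, 0.0]))

# Theorem (softmax form): dJ/dtheta[s,b] = eta(s) pi(b|s) (q(s,b) - v(s))
grad_theorem = eta[:, None] * pi * (q - v[:, None])

# Central finite differences of J for comparison
eps = 1e-6
grad_fd = np.zeros_like(theta)
for s in range(2):
    for a in range(2):
        d = np.zeros_like(theta); d[s, a] = eps
        grad_fd[s, a] = (J(theta + d) - J(theta - d)) / (2 * eps)

print(np.max(np.abs(grad_theorem - grad_fd)))    # near zero
```

The two gradients agree to numerical precision, confirming the $\eta$-weighted form of the theorem on this toy problem.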