
Policy Gradient Theorem

The Policy Gradient Theorem states that the gradient of the performance measure with respect to the policy parameter $\theta$ is proportional to a sum of the gradients of the policy with respect to $\theta$, weighted by the action-value function.

The task is assumed to be episodic, and the performance measure $J(\theta)$ is the value of the start state of the episode, $J(\theta) = v_\pi(s_0)$.

  • $\mathcal{S}$ is the state space.
  • $\pi$ is the policy, parameterized by $\theta$.

On-Policy Distribution

Definition of the on-policy state distribution $\mu(s)$:

  • The state distribution $\mu(s)$ is the probability of being in state $s$, with $\sum_{s \in \mathcal{S}} \mu(s) = 1$.
  • $h(s)$ denotes the probability of an episode starting in state $s$.
  • $\eta(s)$ denotes the number of times, on average, $s$ is visited in an episode.

$$\eta(s)=h(s)+\sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s})\, p(s \mid \bar{s}, a), \quad \forall s \in \mathcal{S}$$

$$\mu(s)=\frac{\eta(s)}{\sum_{s^{\prime}} \eta\left(s^{\prime}\right)}, \quad \forall s \in \mathcal{S}$$
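To make the recursion concrete, here is a minimal sketch (with a hypothetical two-state episodic MDP whose transition numbers are invented for illustration). Since the $\eta$ recursion is linear, $\eta = h + P_\pi^\top \eta$ can be solved directly, and $\mu$ is just the normalization:

```python
import numpy as np

# Hypothetical episodic MDP: non-terminal states {0, 1}; any transition
# probability mass not listed below terminates the episode.
pi = np.array([[0.6, 0.4],       # pi(a|s): policy probabilities
               [0.3, 0.7]])
P = np.array([[[0.0, 0.8],       # p(s'|s=0, a=0): -> s'=1 w.p. 0.8, else terminate
               [0.0, 0.5]],      # p(s'|s=0, a=1)
              [[0.2, 0.0],       # p(s'|s=1, a=0)
               [0.4, 0.3]]])     # p(s'|s=1, a=1)
h = np.array([1.0, 0.0])         # episodes always start in state 0

# Policy-averaged transition matrix: P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)
P_pi = np.einsum('sa,sax->sx', pi, P)

# eta(s) = h(s) + sum_{s_bar} eta(s_bar) P_pi[s_bar, s] is a linear system:
# eta = h + P_pi^T eta  =>  eta = (I - P_pi^T)^{-1} h
eta = np.linalg.solve(np.eye(2) - P_pi.T, h)

# On-policy distribution: normalize eta so it sums to 1
mu = eta / eta.sum()
print("eta:", eta)
print("mu:", mu)
```

Note that $\eta(0) \geq 1$ here, since the start state is counted once by $h$ and again on every revisit.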

Proof of the Policy Gradient Theorem

  • Given a start state, the performance measure $J$ depends on both the actions chosen and the distribution of states in which those choices are made.
  • The policy parameter $\theta$ affects both.
  • The goal is to express the performance gradient without involving the gradient of the state distribution, since the state distribution depends on the environment and is usually unknown.
  • The Policy Gradient Theorem provides exactly this: the gradient of the performance measure with respect to $\theta$ is proportional to a sum of the gradients of the policy with respect to $\theta$, weighted by the action-value function.
  • All gradients below are with respect to $\theta$.
$$
\begin{aligned}
\nabla v_\pi(s) &= \nabla\Big[\sum_a \pi(a \mid s)\, q_\pi(s, a)\Big], \quad \text{for all } s \in \mathcal{S} \\
&= \sum_a\Big[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s)\, \nabla q_\pi(s, a)\Big] \quad \text{(product rule of calculus)} \\
&= \sum_a\Big[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s)\, \nabla \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right)\left(r + v_\pi\left(s^{\prime}\right)\right)\Big] \\
&= \sum_a\Big[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \nabla v_\pi\left(s^{\prime}\right)\Big] \\
&= \sum_a\Big[\nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) \\
&\qquad \sum_{a^{\prime}}\big[\nabla \pi\left(a^{\prime} \mid s^{\prime}\right) q_\pi\left(s^{\prime}, a^{\prime}\right) + \pi\left(a^{\prime} \mid s^{\prime}\right) \sum_{s^{\prime\prime}} p\left(s^{\prime\prime} \mid s^{\prime}, a^{\prime}\right) \nabla v_\pi\left(s^{\prime\prime}\right)\big]\Big] \quad \text{(unrolling)} \\
&= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \operatorname{Pr}(s \rightarrow x, k, \pi) \sum_a \nabla \pi(a \mid x)\, q_\pi(x, a) \quad \text{(after repeated unrolling)}
\end{aligned}
$$
  • $\operatorname{Pr}(s \rightarrow x, k, \pi)$ is the probability of transitioning from state $s$ to state $x$ in $k$ steps under policy $\pi$.
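The key identity used next, $\sum_{k=0}^{\infty} \operatorname{Pr}(s_0 \rightarrow s, k, \pi) = \eta(s)$, can be checked numerically. In a finite MDP, $\operatorname{Pr}(s \rightarrow x, k, \pi)$ is the $(s, x)$ entry of the $k$-th power of the policy-averaged transition matrix $P_\pi$. A sketch on a hypothetical two-state episodic MDP (illustrative numbers):

```python
import numpy as np

# Hypothetical episodic MDP; unlisted probability mass terminates the episode.
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])                  # pi(a|s)
P = np.array([[[0.0, 0.8], [0.0, 0.5]],
              [[0.2, 0.0], [0.4, 0.3]]])     # p(s'|s,a), sub-stochastic
h = np.array([1.0, 0.0])                     # start distribution (s0 = 0)

P_pi = np.einsum('sa,sax->sx', pi, P)        # P_pi[s,s'] = sum_a pi(a|s) p(s'|s,a)

# Truncated sum over k: Pr(s0 -> s, k, pi) is row s0 of P_pi^k.
# The series converges because P_pi is sub-stochastic (episodes terminate).
total = np.zeros(2)
P_k = np.eye(2)                              # P_pi^0, the k = 0 term
for k in range(200):
    total += h @ P_k
    P_k = P_k @ P_pi

# Compare with eta from its recursion: eta = (I - P_pi^T)^{-1} h
eta = np.linalg.solve(np.eye(2) - P_pi.T, h)
print(total, eta)
```

The two vectors agree to numerical precision, which is exactly the step that turns the unrolled sum into $\eta(s)$ below.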
$$
\begin{aligned}
\nabla J(\boldsymbol{\theta}) &= \nabla v_\pi\left(s_0\right) \\
&= \sum_s\Big(\sum_{k=0}^{\infty} \operatorname{Pr}\left(s_0 \rightarrow s, k, \pi\right)\Big) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a) \\
&= \sum_s \eta(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a) \\
&= \sum_{s^{\prime}} \eta\left(s^{\prime}\right) \sum_s \frac{\eta(s)}{\sum_{s^{\prime}} \eta\left(s^{\prime}\right)} \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a) \\
&= \sum_{s^{\prime}} \eta\left(s^{\prime}\right) \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a) \\
&\propto \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)
\end{aligned}
$$
  • Here $\propto$ denotes proportionality; the constant of proportionality is $\sum_{s^{\prime}} \eta\left(s^{\prime}\right)$, the average length of an episode.
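The full result can be sanity-checked end to end: the line $\nabla J(\theta) = \sum_s \eta(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)$ is an exact equality, so it should match a finite-difference gradient of $J(\theta) = v_\pi(s_0)$. The sketch below uses a hypothetical two-state episodic MDP with invented rewards and a softmax policy over a preference table $\theta[s, a]$ (for softmax, $\sum_a \nabla_{\theta_{s,b}} \pi(a \mid s)\, q_\pi(s, a) = \pi(b \mid s)\,(q_\pi(s, b) - v_\pi(s))$):

```python
import numpy as np

# Hypothetical episodic MDP; unlisted transition mass terminates the episode.
P = np.array([[[0.0, 0.8], [0.0, 0.5]],
              [[0.2, 0.0], [0.4, 0.3]]])  # p(s'|s,a)
R = np.array([[1.0, 0.5],
              [0.0, 2.0]])                # expected reward r(s,a), illustrative
s0 = 0

def policy(theta):
    """Softmax policy pi(a|s) from a table of action preferences theta[s,a]."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def J(theta):
    """Performance measure J(theta) = v_pi(s0), computed exactly."""
    pi = policy(theta)
    P_pi = np.einsum('sa,sax->sx', pi, P)
    r_pi = (pi * R).sum(axis=1)
    v = np.linalg.solve(np.eye(2) - P_pi, r_pi)  # v = r_pi + P_pi v
    return v[s0]

theta = np.array([[0.1, -0.2], [0.5, 0.3]])      # arbitrary parameters
pi = policy(theta)
P_pi = np.einsum('sa,sax->sx', pi, P)
r_pi = (pi * R).sum(axis=1)
v = np.linalg.solve(np.eye(2) - P_pi, r_pi)
q = R + np.einsum('sax,x->sa', P, v)             # q(s,a) = r + sum_s' p v(s')
eta = np.linalg.solve(np.eye(2) - P_pi.T, np.array([1.0, 0.0]))

# Theorem (softmax form): dJ/dtheta[s,b] = eta(s) pi(b|s) (q(s,b) - v(s))
grad_theorem = eta[:, None] * pi * (q - v[:, None])

# Central finite differences of J for comparison
eps = 1e-6
grad_fd = np.zeros_like(theta)
for s in range(2):
    for a in range(2):
        d = np.zeros_like(theta); d[s, a] = eps
        grad_fd[s, a] = (J(theta + d) - J(theta - d)) / (2 * eps)

print(np.max(np.abs(grad_theorem - grad_fd)))    # near zero
```

The two gradients agree to numerical precision, confirming the $\eta$-weighted form of the theorem on this toy problem.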