
Convergence of Dynamic Programming

Preliminaries

Contraction Mapping Theorem

$\mathbf{Definition.}$ $\boldsymbol{\gamma}$-contraction
An operator $F$ on a normed vector space $\mathcal{X}$ is a $\boldsymbol{\gamma}$-contraction for some $0 < \gamma < 1$ if, for all $x, y \in \mathcal{X}$,

\[\begin{aligned} \| F(x) - F(y) \| \leq \gamma \| x - y \|._\blacksquare \end{aligned}\]


$\mathbf{Theorem.}$ Contraction Mapping Theorem
For a $\gamma$-contraction $F$ on a complete normed vector space $\mathcal{X}$,

  • $F$ admits a unique fixed point $\mathbf{v}$, i.e., $F \mathbf{v} = \mathbf{v}$;
  • iterative application of $F$ converges to this unique fixed point in $\mathcal{X}$, independent of the initial point;
  • and the convergence is at a linear rate determined by $\gamma._\blacksquare$

This theorem is also known as the Banach fixed-point theorem, the contractive mapping theorem, or the Banach–Caccioppoli theorem.
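
Before moving on, here is a minimal numerical sketch of the theorem (the map `F`, the matrix `A`, and the constants below are made up for illustration, not taken from the text): iterating an affine $\gamma$-contraction from an arbitrary starting point converges to its unique fixed point, and successive errors shrink by a factor of at most $\gamma$ per step.

```python
import numpy as np

# Illustrative gamma-contraction on R^2: F(x) = gamma * A x + b,
# where A is row-stochastic, so ||A z||_inf <= ||z||_inf.
gamma = 0.9
A = np.array([[0.5, 0.5],
              [0.2, 0.8]])          # rows sum to 1
b = np.array([1.0, -2.0])
F = lambda x: gamma * A @ x + b

# The unique fixed point solves x = gamma * A x + b, i.e. (I - gamma A) x = b.
x_star = np.linalg.solve(np.eye(2) - gamma * A, b)

x = np.zeros(2)                     # any initial point works
errors = []
for _ in range(30):
    x = F(x)
    errors.append(np.max(np.abs(x - x_star)))

# Each successive error ratio is at most gamma (and approaches it): linear convergence.
print([round(errors[k + 1] / errors[k], 3) for k in range(5)])
```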


Bellman Expectation Backup

Now, consider the vector space $V$ of value functions. For a finite state space it has at most $| \mathcal{S} |$ dimensions, and each point in this space fully specifies a value function $v(s)$. We will show that the Bellman backup is a contraction on this space: it brings value functions closer together, and therefore its iterative application must converge to a unique solution.

Define the Bellman expectation backup operator $F^\pi$ as

\[\begin{aligned} F^\pi (\mathbf{v}) = r^\pi + \gamma \mathcal{T}^\pi \mathbf{v} \end{aligned}\]

where \(r^\pi_s = \sum_{a \in \mathcal{A}} \pi (a \vert s) \, \mathcal{R}_s^a\) and \(\mathcal{T}^\pi_{ss^\prime} = \sum_{a \in \mathcal{A}} \pi (a \vert s) \, \mathcal{P}_{ss^\prime}^a\). We now show that this operator is a $\gamma$-contraction in the $\infty$-norm:

\[\begin{aligned} \| F^\pi (\mathbf{u}) - F^\pi (\mathbf{v}) \|_\infty & = \| \gamma \mathcal{T}^\pi (\mathbf{u} - \mathbf{v}) \|_\infty \\ & \leq \| \gamma \mathcal{T}^\pi ( \| \mathbf{u} - \mathbf{v} \|_\infty \cdot \mathbf{1} ) \|_\infty \\ & = \gamma \cdot \| \mathbf{u} - \mathbf{v} \|_\infty \cdot \| \mathcal{T}^\pi \cdot \mathbf{1} \|_\infty \\ & = \gamma \cdot \| \mathbf{u} - \mathbf{v} \|_\infty \end{aligned}\]

where we define $\mathbf{1} \equiv [1, 1, \cdots, 1]^\top$. The inequality uses that $\mathcal{T}^\pi$ has non-negative entries and that $\vert \mathbf{u} - \mathbf{v} \vert \leq \| \mathbf{u} - \mathbf{v} \|_\infty \cdot \mathbf{1}$ elementwise, and the last step uses $\mathcal{T}^\pi \cdot \mathbf{1} = \mathbf{1}$, since each row of $\mathcal{T}^\pi$ is a probability distribution over next states. Also note that the choice of norm does not matter for convergence: all norms on a finite-dimensional space are equivalent. Hence, by the contraction mapping theorem, the Bellman expectation operator $F^\pi$ has a unique fixed point.
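
To make the backup concrete, here is a minimal numerical sketch (the random MDP, the policy `pi`, and all variable names below are made up for illustration): it assembles $r^\pi$ and $\mathcal{T}^\pi$ and checks the $\gamma$-contraction bound on a few random pairs of value vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Made-up MDP: P[a, s, s'] transition probabilities, R[s, a] expected rewards,
# and a fixed stochastic policy pi[s, a] (all illustrative, not from the text).
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

# r^pi_s = sum_a pi(a|s) R_s^a   and   T^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
r_pi = (pi * R).sum(axis=1)
T_pi = np.einsum('sa,asp->sp', pi, P)

def F_pi(v):
    """Bellman expectation backup F^pi(v) = r^pi + gamma * T^pi v."""
    return r_pi + gamma * T_pi @ v

# The contraction bound ||F^pi(u) - F^pi(v)||_inf <= gamma * ||u - v||_inf
# holds for every pair of value vectors.
for _ in range(5):
    u, v = rng.normal(size=n_states), rng.normal(size=n_states)
    assert np.max(np.abs(F_pi(u) - F_pi(v))) <= gamma * np.max(np.abs(u - v)) + 1e-12
print("contraction bound verified on random pairs")
```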


Convergence of Iterative Policy Evaluation

So far, we have proved that $F^\pi$ is a contraction, so convergence to a unique fixed point is guaranteed. It remains to show that this unique fixed point is $v_\pi$.

But we already know that $v_\pi$ is a fixed point of $F^\pi$, by the Bellman expectation equation. Therefore, by the contraction mapping theorem, iterative policy evaluation converges to $v_\pi$.
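
As a sanity check, here is a short sketch of iterative policy evaluation on the same kind of made-up random MDP as above (all names are illustrative): repeated application of $F^\pi$ converges to the $v_\pi$ obtained by solving the linear system $(I - \gamma \mathcal{T}^\pi)\, v = r^\pi$ directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Made-up MDP and fixed stochastic policy (illustrative only).
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)
r_pi = (pi * R).sum(axis=1)
T_pi = np.einsum('sa,asp->sp', pi, P)

# Iterative policy evaluation: v <- F^pi(v) until the sup-norm change is tiny.
v = np.zeros(n_states)
for _ in range(10_000):          # contraction => the change shrinks by a factor gamma per sweep
    v_next = r_pi + gamma * T_pi @ v
    done = np.max(np.abs(v_next - v)) < 1e-10
    v = v_next
    if done:
        break

# The fixed point is v_pi, the solution of (I - gamma T^pi) v = r^pi.
v_pi = np.linalg.solve(np.eye(n_states) - gamma * T_pi, r_pi)
print(np.max(np.abs(v - v_pi)))  # small: bounded by gamma/(1-gamma) times the final update
```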


Convergence of Policy Iteration

Does policy iteration with greedy policy improvement converge to \(q_*\) and \(\pi_*\)? In this section, we will:

1. show monotone improvement: $v_{\pi_{k+1}} \geq v_{\pi_k}$;
2. show that the algorithm has to stop after a finite number of steps for finite $\mathcal{S}$ and $\mathcal{A}$;
3. show that the converged policy is optimal.

First, we show monotone policy improvement. Consider a deterministic policy $a = \pi (s)$. We can improve this policy by acting greedily:

\[\begin{aligned} \pi^\prime (s) = \underset{a \in \mathcal{A}}{\text{ argmax }} q_\pi (s, a) \end{aligned}\]

And this improves the value from any state $s$ over one step:

\[\begin{aligned} q_\pi (s, \pi^\prime(s)) = \underset{a \in \mathcal{A}}{\text{ max }} q_\pi (s, a) \geq q_\pi (s, \pi (s)) = v_\pi (s) \end{aligned}\]

Thus it improves the overall value function, $v_{\pi^\prime} (s) \geq v_\pi (s)$: repeatedly applying $v_\pi (S_t) \leq q_\pi (S_t, \pi^\prime (S_t))$,

\[\begin{aligned} v_\pi (s) & \leq q_\pi (s, \pi^\prime(s)) = \mathbb{E}_{\pi^\prime} [R_{t+1} + \gamma v_\pi (S_{t+1}) | S_t = s ] \\ & \leq \mathbb{E}_{\pi^\prime} [R_{t+1} + \gamma q_\pi (S_{t+1}, \pi^\prime (S_{t+1})) | S_t = s ] \\ & = \mathbb{E}_{\pi^\prime} [R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi (S_{t+2}) | S_t = s ] \\ & \leq \mathbb{E}_{\pi^\prime} [R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi (S_{t+2}, \pi^\prime (S_{t+2})) | S_t = s ] \\ & \leq \cdots \\ & \leq \mathbb{E}_{\pi^\prime} [R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots | S_t = s ] \\ & = v_{\pi^\prime} (s)._\blacksquare \end{aligned}\]

Notice that since there exist only finitely many deterministic policies (at most $|\mathcal{A}|^{|\mathcal{S}|}$ for finite $\mathcal{S}$ and $\mathcal{A}$), the algorithm has to stop after a finite number of steps. And when improvement stops, the following equality holds:

\[\begin{aligned} q_\pi (s, \pi^\prime (s)) = \underset{a \in \mathcal{A}}{\text{ max }} q_\pi (s, a) = q_\pi (s, \pi (s)) = v_\pi (s) \end{aligned}\]

Thus, the Bellman optimality equation is satisfied:

\[\begin{aligned} v_\pi (s) = \underset{a \in \mathcal{A}}{\text{ max }} q_\pi (s, a) \end{aligned}\]

Therefore, we conclude that $v_\pi (s) = v_* (s)$ for all $s \in \mathcal{S}$ and the converged policy is an optimal policy.
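
Below is a compact policy-iteration sketch on the same kind of made-up random MDP (evaluation is done exactly by solving a linear system; all names are illustrative): each greedy step weakly improves the value of every state, and the loop terminates once the greedy policy stops changing.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Made-up random MDP (illustrative only): P[a, s, s'], R[s, a].
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def evaluate(policy):
    """Exact evaluation of a deterministic policy: solve (I - gamma T^pi) v = r^pi."""
    T_pi = P[policy, np.arange(n_states), :]   # row s is P^{policy(s)}_{s, .}
    r_pi = R[np.arange(n_states), policy]
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, r_pi)

policy = np.zeros(n_states, dtype=int)         # arbitrary initial deterministic policy
while True:
    v = evaluate(policy)
    # q_pi(s, a) = R_s^a + gamma * sum_{s'} P^a_{ss'} v_pi(s')
    q = R + gamma * np.einsum('asp,p->sa', P, v)
    greedy = q.argmax(axis=1)
    if np.array_equal(greedy, policy):          # greedy = pi: Bellman optimality holds, stop
        break
    policy = greedy                             # monotone improvement: v_{pi'} >= v_pi

print("converged policy:", policy)
print("converged values:", v)
```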


Convergence of Value Iteration

The proof of convergence of value iteration is analogous to the contraction argument for iterative policy evaluation. First, we define the Bellman optimality backup operator \(T^*\) as

\[\begin{aligned} T^* (\mathbf{v}) = \underset{a \in \mathcal{A}}{\text{ max }} \left[ \mathcal{R}^a + \gamma \mathcal{P}^a \mathbf{v} \right] \end{aligned}\]

where the max is taken componentwise, i.e., state by state. Again, we show that this operator is a $\gamma$-contraction. Note that

\[\begin{aligned} \underset{a \in \mathcal{A}}{\text{ max }} [\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{u}] - \underset{a \in \mathcal{A}}{\text{ max }} [\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{v}] \leq \underset{a \in \mathcal{A}}{\text{ max }} [\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{u} - (\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{v})]. \end{aligned}\]

(To see this, let \(a^*\) be a maximizer of the first term on the left-hand side. Since \(\mathcal{R}_s^{a^*} + \gamma \mathcal{P}^{a^*} \mathbf{v} \leq \max_a [\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{v}]\), the left-hand side is at most \([\mathcal{R}_s^{a^*} + \gamma \mathcal{P}^{a^*} \mathbf{u}] - [\mathcal{R}_s^{a^*} + \gamma \mathcal{P}^{a^*} \mathbf{v}]\), which is bounded by the right-hand side.)

\[\begin{aligned} \| T^* (\mathbf{u}) - T^* (\mathbf{v}) \|_\infty & = \| \underset{a \in \mathcal{A}}{\text{ max }} [\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{u}] - \underset{a \in \mathcal{A}}{\text{ max }} [\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{v}] \|_\infty \\ & \leq \| \underset{a \in \mathcal{A}}{\text{ max }} [\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{u} - \mathcal{R}_s^a - \gamma \mathcal{P}^a \mathbf{v}] \|_\infty \\ & = \gamma \cdot \| \underset{a \in \mathcal{A}}{\text{ max }} \mathcal{P}^a (\mathbf{u} - \mathbf{v}) \|_\infty \\ & \leq \gamma \cdot \| \underset{a \in \mathcal{A}}{\text{ max }} \mathcal{P}^a ( \| \mathbf{u} - \mathbf{v} \|_\infty \cdot \mathbf{1} ) \|_\infty \\ & = \gamma \cdot \| \mathbf{u} - \mathbf{v} \|_\infty \cdot \| \underset{a \in \mathcal{A}}{\text{ max }} \mathcal{P}^a \cdot \mathbf{1} \|_\infty \\ & = \gamma \cdot \| \mathbf{u} - \mathbf{v} \|_\infty \end{aligned}\]

where $\mathbf{1} \equiv [1, 1, \cdots, 1]^\top$ as before, and \(\mathcal{P}^a \cdot \mathbf{1} = \mathbf{1}\) since each row of \(\mathcal{P}^a\) is a probability distribution.

Hence, by the Banach fixed-point theorem, the Bellman optimality backup operator \(T^*\) has a unique fixed point, and by the Bellman optimality equation \(v_*\) is a fixed point of \(T^*\). Thus, we conclude that value iteration (iterative application of the Bellman optimality operator) converges to \(v_*\).
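
Finally, a value-iteration sketch on the same kind of made-up random MDP (illustrative only): repeatedly applying $T^*$ drives the value estimate to its fixed point, and a greedy policy with respect to the converged values recovers an optimal policy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Made-up random MDP as before: P[a, s, s'], R[s, a].
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def T_star(v):
    """Bellman optimality backup: (T* v)(s) = max_a [ R_s^a + gamma * sum_{s'} P^a_{ss'} v(s') ]."""
    q = R + gamma * np.einsum('asp,p->sa', P, v)
    return q.max(axis=1)

# Value iteration: apply T* until the sup-norm change is negligible.
v = np.zeros(n_states)
for _ in range(10_000):
    v_next = T_star(v)
    done = np.max(np.abs(v_next - v)) < 1e-12
    v = v_next
    if done:
        break

# Greedy policy with respect to the converged value function.
q = R + gamma * np.einsum('asp,p->sa', P, v)
print("greedy policy:", q.argmax(axis=1))
print("value function:", v)
```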





