
Introduction

Dynamic programming is a method for solving complex problems by breaking them down into subproblems and combining the solutions to those subproblems. The term “dynamic” indicates that there is a sequential or temporal component to the problem, and “programming” stands for optimizing a “program”. For instance, we may read linear programming as linear optimization.

Dynamic programming can be used to compute optimal policies given a perfect model of the environment. It is a very general solution method for problems which have two properties:

  • Optimal substructure
    • Principle of optimality applies
    • Optimal solution can be decomposed into subproblems
  • Overlapping subproblems
    • Subproblems recur many times
    • Solutions can be cached and reused

One example that satisfies both conditions is the MDP: the Bellman equation gives a recursive decomposition, and the value function stores and reuses solutions.
Dynamic programming assumes full knowledge of the MDP, so it is used for planning in an MDP.


Policy Evaluation

First, let’s consider how we can compute the state-value function for an arbitrary policy $\pi$. That is, we want to evaluate the policy $\pi$.

In dynamic programming, we compute the value function of $\pi$ by iteratively applying the Bellman expectation backup:

  • At each iteration $k + 1$, for all states $s \in \mathcal{S}$,
  • Update $v_{k+1}(s)$ from $v_k (s^\prime)$, where $s^\prime$ is a successor state of $s$.

This is called a synchronous backup, since it updates all states at each iteration. There are also asynchronous backups, which we will see later in this post. Moreover, this algorithm is guaranteed to converge to $v_\pi$, and this can be proved explicitly; we also discuss this later.

The explicit update formula, obtained by using the Bellman expectation equation for $v_\pi$ as an update rule, is as follows:

\[\begin{aligned} v_{k+1} (s) = \sum_{a \in \mathcal{A}} \pi (a | s) [\mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a v_k (s^\prime) ] \end{aligned}\]

Note that the true value function $v_\pi$ is a fixed point of this update. Since we update $v$ for every possible state, this kind of backup is called a full backup. The update can also be written in matrix form:

\[\begin{aligned} \mathbf{v}^{k + 1} = \boldsymbol{\mathcal{R}}^\boldsymbol{\pi} + \gamma \boldsymbol{\mathcal{P}}^\boldsymbol{\pi} \mathbf{v}^{k} \end{aligned}\]

In this way, two arrays, one for \(\mathbf{v}^{k + 1}\) and one for \(\mathbf{v}^{k}\), are maintained during the algorithm (although it is also possible to use a single array with in-place backups), so the new values \(\mathbf{v}^{k + 1}\) can be computed without changing the old values.
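
As a minimal sketch of this procedure (not a reference implementation), the synchronous update could be written with NumPy as follows. The arrays `policy`, `P`, and `R` are hypothetical encodings of $\pi(a|s)$, $\mathcal{P}_{ss^\prime}^a$, and $\mathcal{R}_s^a$:

```python
import numpy as np

def policy_evaluation(policy, P, R, gamma=1.0, theta=1e-8):
    """Iterative policy evaluation with synchronous (full) backups.

    policy : (S, A) array, policy[s, a] = pi(a|s)
    P      : (A, S, S) array, P[a, s, s'] = transition probability
    R      : (S, A) array, R[s, a] = expected immediate reward
    """
    n_states = P.shape[1]
    v = np.zeros(n_states)
    while True:
        # q[s, a] = R_s^a + gamma * sum_s' P_ss'^a v(s')
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        # Bellman expectation backup: v_{k+1}(s) = sum_a pi(a|s) q(s, a)
        v_new = np.sum(policy * q, axis=1)
        if np.max(np.abs(v_new - v)) < theta:   # stop once updates are tiny
            return v_new
        v = v_new
```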

$\mathbf{Example.}$ Evaluating a Random Policy in the Small Gridworld


Consider the $4 \times 4$ small gridworld:

image


The nonterminal states are \(\mathcal{S} = \{ 1, 2, \cdots, 14 \}\), and there are 4 possible actions in each state, \(\mathcal{A} = \{ \text{up}, \text{down}, \text{right}, \text{left} \}\). Note that the actions determine the next state deterministically, except when an action would take the agent off the grid; in that case, the agent’s state is unchanged. Each transition is penalized with reward $R = -1$, so the agent should reach the terminal state (right corner) as soon as possible.

Suppose the agent follows the completely random policy (all actions equally likely). The left column of the figure shows the sequence \(\{ v_k \}\) produced by iterative policy evaluation.

image


The final estimate is in fact $v_\pi$, which in this case gives for each state the negation of the expected number of steps from that state until termination.
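
Reusing the `policy_evaluation` sketch above, the gridworld could be encoded roughly as follows. This is only an illustration: it models a single absorbing terminal state in the last cell, and the exact numbers depend on how the terminal corner(s) are modelled:

```python
import numpy as np

n, n_states, n_actions = 4, 16, 4
moves = [(-1, 0), (1, 0), (0, 1), (0, -1)]   # up, down, right, left
P = np.zeros((n_actions, n_states, n_states))
R = np.full((n_states, n_actions), -1.0)     # reward -1 on every transition

for s in range(n_states):
    r, c = divmod(s, n)
    for a, (dr, dc) in enumerate(moves):
        nr, nc = r + dr, c + dc
        # Actions that would leave the grid keep the agent in place
        s_next = nr * n + nc if (0 <= nr < n and 0 <= nc < n) else s
        P[a, s, s_next] = 1.0

# Terminal state: absorbing, with zero reward
P[:, n_states - 1, :] = 0.0
P[:, n_states - 1, n_states - 1] = 1.0
R[n_states - 1, :] = 0.0

random_policy = np.full((n_states, n_actions), 1.0 / n_actions)
v = policy_evaluation(random_policy, P, R, gamma=1.0)
print(np.round(v.reshape(n, n), 1))   # negated expected steps to termination
```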



Policy Iteration

Now we are able to evaluate a policy $\pi$ with the iterative method called policy evaluation. Then, how can we improve the policy? Policy iteration gives a way to evaluate and improve a given policy $\pi$ until it becomes an optimal policy $\pi_*$. It considers $q_\pi (s, a)$ to decide whether (and how) to improve the current policy.

Policy Improvement

After we have evaluated the policy $\pi$, i.e., computed

\[\begin{aligned} v_\pi (s) = \mathbb{E} [R_{t+1} + \gamma R_{t+2} + \cdots | S_t = s] \end{aligned}\]

we obtain a better policy by acting greedily with respect to $v_\pi$. Consider a deterministic policy $a = \pi (s)$. Then we can improve the policy by acting greedily:

\[\begin{aligned} \pi^\prime (s) = \underset{a \in \mathcal{A}}{\text{ arg max }} q_\pi (s, a) \end{aligned}\]

since it improves the value from any state $s$ over one step,

\[\begin{aligned} q_\pi (s, \pi^\prime (s)) = \underset{a \in \mathcal{A}}{\text{ max }} q_\pi (s, a) \geq q_\pi (s, \pi (s)) = v_\pi (s) \end{aligned}\]

therefore improves the value function, i.e., $v_{\pi^\prime} (s) \geq v_\pi (s)$:

\[\begin{aligned} v_\pi (s) &\leq q_\pi (s, \pi^\prime (s)) = \mathbb{E}_{\pi^\prime} [R_{t+1} + \gamma v_\pi (S_{t+1}) | S_t = s] \\ &\leq \mathbb{E}_{\pi^\prime} [R_{t+1} + \gamma q_{\pi^\prime} (S_{t+1}, \pi^\prime (S_{t+1})) | S_t = s] \\ &\leq \mathbb{E}_{\pi^\prime} [R_{t+1} + \gamma R_{t+2} + \gamma^2 q_{\pi^\prime} (S_{t+2}, \pi^\prime (S_{t+2})) | S_t = s] \\ &\leq \mathbb{E}_{\pi^\prime} [R_{t+1} + \gamma R_{t+2} + \cdots | S_t = s] = v_{\pi^\prime} (s) \end{aligned}\]

This process is summarized by the policy improvement theorem:

$\mathbf{Thm.}$ Policy Improvement Theorem
Let $\pi$ and $\pi^\prime$ be any pair of deterministic policies such that, for all $s \in \mathcal{S}$, $q_\pi (s, \pi^\prime(s)) \geq v_\pi (s)$. Then the policy $\pi^\prime$ must be as good as, or better than, $\pi$; in other words, $v_{\pi^\prime} (s) \geq v_\pi (s)$ for all $s \in \mathcal{S}$. $_\blacksquare$

Here, the notation $q_\pi (s, \pi^\prime (s))$ means the expected return when, starting from state $s$, we take an action $A \sim \pi^\prime (\cdot | s)$ and then follow the policy $\pi$ for the remaining steps. In other words, the following definition may be easier to understand:

\[\begin{aligned} q_\pi (s, A) = \mathbb{E}_{\pi} [ G(s, A) | S_t = s] \text{ where } A \sim \pi^\prime (\cdot | s) \end{aligned}\]

Hence, if $\pi^\prime$ is deterministic, it becomes

\[\begin{aligned} q_\pi (s, A = \pi^\prime (s)) & = \mathbb{E}_{\pi} [G (s, \pi^\prime (s)) | S_t = s] \\ & = \mathbb{E}_{\pi} [G (s, A_t = a) | S_t = s, A_t = \pi^\prime (s)] \end{aligned}\]

Or, if $\pi^\prime$ is stochastic (in a later post we will see the $\varepsilon$-greedy policy, which is an example of a stochastic policy),

\[\begin{aligned} q_\pi (s, A = \pi^\prime (s)) & = \mathbb{E}_{\pi} [G (s, A) | S_t = s] \\ & = \sum_{a \in \mathcal{A}} \pi^\prime (a | s) \mathbb{E}_{\pi} [G (s, a) | S_t = s] \\ & = \sum_{a \in \mathcal{A}} \pi^\prime (a | s) \mathbb{E}_{\pi} [G (s, A_t) | S_t = s, A_t = a] \\ & = \sum_{a \in \mathcal{A}} \pi^\prime (a | s) q_\pi (s, a) \end{aligned}\]

Then, we may denote $v_\pi (s)$ as $q_\pi (s, \pi(s))$ via the above definition.
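
In code, acting greedily with respect to $v_\pi$ amounts to a one-step lookahead to recover $q_\pi (s, a)$ followed by an argmax. A small sketch under the same (assumed) array conventions as before:

```python
import numpy as np

def greedy_improvement(v, P, R, gamma=1.0):
    """One step of policy improvement: pi'(s) = argmax_a q_pi(s, a).

    Returns a deterministic policy encoded as an (S, A) one-hot array.
    """
    # q_pi(s, a) = R_s^a + gamma * sum_s' P_ss'^a v_pi(s')
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    n_states, n_actions = q.shape
    new_policy = np.zeros((n_states, n_actions))
    new_policy[np.arange(n_states), np.argmax(q, axis=1)] = 1.0
    return new_policy
```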


Let’s now consider when the iteration terminates. If improvements stop, i.e.,

\[\begin{aligned} q_\pi (s, \pi^\prime (s)) = \underset{a \in \mathcal{A}}{\text{ max }} q_\pi (s, a) = v_\pi (s) \end{aligned}\]

then the Bellman optimality equation $v_\pi (s) = \underset{a \in \mathcal{A}}{\text{ max }} q_\pi (s, a)$ is satisfied, so we conclude that $\pi$ is an optimal policy. This fact implies that policy improvement yields a strictly better policy except when the original policy is already optimal.

The following pseudocode shows the full policy iteration algorithm with iterative policy evaluation.

image

$\mathbf{Fig. 1}$ Full policy iteration algorithm


Note that the argument above is based on deterministic policies. However, it extends naturally to stochastic policies via $q_\pi (s, \pi^\prime (s)) = \sum_{a \in \mathcal{A}} \pi^\prime (a | s) q_\pi (s, a)$.
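
Putting the two pieces together, a rough sketch of the full policy iteration loop (alternating the `policy_evaluation` and `greedy_improvement` sketches above until the greedy policy stops changing, not the exact pseudocode of Fig. 1) might look like this:

```python
import numpy as np

def policy_iteration(P, R, gamma=1.0, theta=1e-8):
    """Policy iteration: evaluate the current policy, act greedily,
    and repeat until the greedy policy is stable."""
    n_actions, n_states, _ = P.shape
    policy = np.full((n_states, n_actions), 1.0 / n_actions)  # start uniform
    while True:
        v = policy_evaluation(policy, P, R, gamma, theta)     # policy evaluation
        new_policy = greedy_improvement(v, P, R, gamma)       # policy improvement
        if np.array_equal(new_policy, policy):                # policy stable
            return new_policy, v
        policy = new_policy
```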



Generalized Policy Iteration

In summary, the following figure shows the generalized policy iteration (policy evaluation + policy improvement):

image

$\mathbf{Fig 2.}$ Generalized Policy Iteration


Policy evaluation estimates $v_\pi$, and based on the computed $v_\pi$, policy improvement generates a better policy, i.e., $\pi^\prime \geq \pi$. There are also various different algorithms for policy evaluation and policy improvement.

$\mathbf{Example.}$ Jack's Car Rental


Jack manages two locations for a car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited $10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight.

Then, we can construct a finite MDP environment as follows:

  • Timestep: days
  • States: Number of cars at each location (at most 20 per location)
  • Actions: Number of cars moved overnight from the 1st location to the 2nd
  • Reward: $10 for each car rented (must be available)
  • Transitions: Cars returned and requested randomly
    • Poisson distribution; $n$ returns and requests with probability $\frac{\lambda^n}{n!} e^{-\lambda}$
    • 1st location: average request = 3, average returns = 3
    • 2nd location: average request = 4, average returns = 2
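
The transition dynamics are driven by these Poisson probabilities, which can be precomputed; a tiny illustrative sketch (the full construction of the transition and reward tables is omitted):

```python
from math import exp, factorial

def poisson_pmf(n, lam):
    """P(N = n) = lam^n / n! * e^(-lam)."""
    return lam ** n / factorial(n) * exp(-lam)

REQUEST_LAMBDA = (3, 4)   # average daily requests at the 1st and 2nd location
RETURN_LAMBDA = (3, 2)    # average daily returns at the 1st and 2nd location

# e.g., probability of exactly 3 rental requests at the 1st location
print(poisson_pmf(3, REQUEST_LAMBDA[0]))   # ~0.224
```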

The following figure shows the sequence of policies found by policy iteration.

image



Value Iteration

Policy iteration consists of several sweeps for evaluating the current policy, followed by a step that improves it. However, each policy evaluation may itself require many sweeps to converge to $v_\pi$, and waiting for exact convergence is often wasteful, since the extra precision may not even affect the resulting improvement.

There are several ways to truncate these iterations, and value iteration is one of the most notable examples.

Principle of Optimality

Note that any optimal policy can be subdivided into two components:

  • An optimal first action $A_*$
  • Followed by an optimal policy from successor state $S^\prime$

$\mathbf{Thm.}$ Principle of Optimality
A policy $\pi (a | s)$ achieves the optimal value from state $s$, $v_\pi (s) = v_* (s)$, if and only if, for any state $s^\prime$ reachable from $s$, $\pi$ achieves the optimal value from state $s^\prime$, i.e., $v_\pi(s^\prime) = v_* (s^\prime)$.

Deterministic Value Iteration

Hence, if we know the solutions to the subproblems $v_* (s^\prime)$, the solution $v_* (s)$ can be found by a one-step lookahead:

\[\begin{aligned} v_* (s) \longleftarrow \underset{a \in \mathcal{A}}{\text{ max }} \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a v_* (s^\prime) \end{aligned}\]

Value iteration applies this update iteratively: at each iteration $k+1$, for all states $s \in \mathcal{S}$,

\[\begin{aligned} v_{k+1} (s) & = \underset{a \in \mathcal{A}}{\text{ max }} \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a v_k (s^\prime) \\ \mathbf{v}_{k + 1} & = \underset{a \in \mathcal{A}}{\text{ max }} [\boldsymbol{\mathcal{R}^a} + \gamma \boldsymbol{\mathcal{P}^a} \mathbf{v}_k ]\\ \end{aligned}\]

image

$\mathbf{Fig. 3}$ Backup diagram for Value Iteration


Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule. Since we only work with value functions, there is no explicit policy during the iterations; moreover, the intermediate value functions may not correspond to any policy.

The following pseudocode shows the full value iteration.

image

$\mathbf{Fig. 4}$ Full Value Iteration
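
As with policy iteration, here is a minimal sketch of value iteration under the same assumed array conventions (it is not the exact pseudocode above, but follows the same backup):

```python
import numpy as np

def value_iteration(P, R, gamma=1.0, theta=1e-8):
    """Value iteration: apply the Bellman optimality backup until convergence,
    then read off a greedy (optimal) policy at the end."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # One-step lookahead followed by a max over actions
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        v_new = np.max(q, axis=1)                 # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # Extract a deterministic greedy policy from the converged values
    q = R + gamma * np.einsum('ast,t->sa', P, v_new)
    policy = np.zeros((n_states, n_actions))
    policy[np.arange(n_states), np.argmax(q, axis=1)] = 1.0
    return policy, v_new
```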



Extension to DP

Asynchronous DP

So far, we have considered synchronous DP algorithms, which update all states in parallel at each iteration. A main drawback of such algorithms is that sweeping through the entire state space can be computationally very expensive.

image

$\mathbf{Fig. 5}$ Summary of synchronous DP


Asynchronous DP algorithms, an alternative that avoids these expensive sweeps, back up states individually, in any order. The values of some states may be backed up several times before others are backed up at all, so convergence to the correct values is guaranteed only if the algorithm continues to back up the values of all states. Asynchronous DP algorithms allow great flexibility in selecting the states to which backup operations are applied.

There are 3 simple ideas for asynchronous DP algorithms:

  • In-place dynamic programming
  • Prioritised sweeping
  • Real-time dynamic programming


In in-place DP, we do not maintain two arrays for updated values and old values. Instead, we back up the new value in-place. In other words, instead of

\[\begin{aligned} &\text{for all } s \in \mathcal{S}: \\ &\quad v_{\text{new}} (s) = \underset{a \in \mathcal{A}}{\text{ max }} \left( \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a v_{\text{old}} (s^\prime) \right) \\ &v_{\text{old}} \longleftarrow v_{\text{new}} \end{aligned}\]

in-place value iteration stores only one copy of the value function:

\[\begin{aligned} &\text{for all } s \in \mathcal{S}: \\ &\quad v (s) = \underset{a \in \mathcal{A}}{\text{ max }} \left( \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a v (s^\prime) \right) \end{aligned}\]

Since the extra array and the associated copies are eliminated, it is usually more efficient than synchronous value iteration. However, the value function is updated state by state, and each update immediately affects subsequent updates of other states, so the updates can be less stable.
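
A sketch of this in-place variant, keeping a single value array and sweeping over states one at a time (same assumed array conventions as before):

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=1.0, theta=1e-8):
    """In-place value iteration: a single value array, updated state by state,
    so later updates within a sweep already see the newest values."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = v[s]
            # v(s) <- max_a [ R_s^a + gamma * sum_s' P_ss'^a v(s') ]
            v[s] = np.max(R[s] + gamma * P[:, s, :] @ v)
            delta = max(delta, abs(v[s] - v_old))
        if delta < theta:
            return v
```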


Prioritised sweeping assigns a priority to each state to decide which one to back up next: states that are most in need of an update are updated first. To select the next state to back up, we use the magnitude of the Bellman error:

\[\begin{aligned} \left| \underset{a \in \mathcal{A}}{\text{max }} \left( \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a v (s^\prime) \right) - v(s) \right| \end{aligned}\]

We back up the state with the largest remaining Bellman error, and after each backup we update the Bellman errors of the affected (predecessor) states. With an additional data structure such as a priority queue, this can be implemented efficiently. Also, notice that the Bellman error is 0 if and only if $v(s)$ satisfies the Bellman optimality equation, i.e., $v(s)$ is the optimal value.
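
A rough sketch of prioritised sweeping under the same assumed array conventions, using Python's `heapq` as a priority queue keyed on the (negated) Bellman error; the predecessor bookkeeping is kept deliberately simple:

```python
import heapq
import numpy as np

def bellman_error(v, P, R, s, gamma=1.0):
    """| max_a (R_s^a + gamma * sum_s' P_ss'^a v(s')) - v(s) |"""
    return abs(np.max(R[s] + gamma * P[:, s, :] @ v) - v[s])

def prioritised_sweeping(P, R, gamma=1.0, theta=1e-8):
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    # Predecessors of s': all states s with P[a, s, s'] > 0 for some action a
    preds = [set(np.nonzero(P[:, :, s].sum(axis=0))[0]) for s in range(n_states)]
    # heapq is a min-heap, so priorities are negated Bellman errors
    pq = [(-bellman_error(v, P, R, s, gamma), s) for s in range(n_states)]
    heapq.heapify(pq)
    while pq:
        neg_err, s = heapq.heappop(pq)
        if -neg_err < theta:
            continue                                   # stale or negligible entry
        v[s] = np.max(R[s] + gamma * P[:, s, :] @ v)   # back up the chosen state
        # The backup may change the Bellman error of predecessor states
        for p in preds[s]:
            e = bellman_error(v, P, R, p, gamma)
            if e >= theta:
                heapq.heappush(pq, (-e, p))
    return v
```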


Lastly, real-time DP backs up only the states that are relevant to the agent, i.e., the states the agent actually visits, and ignores states the agent has never experienced.
At each timestep $t$, the agent observes $S_t, A_t, R_{t+1}$, and we back up the state $S_t$ by

\[\begin{aligned} v (S_t) = \underset{a \in \mathcal{A}}{\text{ max }} \left( \mathcal{R}_{S_t}^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{S_t s^\prime}^a v (s^\prime) \right) \end{aligned}\]
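
One common form of this idea (a sketch, not a definitive implementation) has the agent act greedily with respect to its current value estimates and back up only the visited state at each step; the MDP arrays follow the same assumed conventions as before:

```python
import numpy as np

def real_time_dp(P, R, start_state, n_steps=1000, gamma=1.0, rng=None):
    """Real-time DP: follow a trajectory through the MDP, acting greedily
    w.r.t. the current values, and back up only the visited states."""
    rng = np.random.default_rng() if rng is None else rng
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    s = start_state
    for _ in range(n_steps):
        q = R[s] + gamma * P[:, s, :] @ v      # one-step lookahead at S_t only
        v[s] = np.max(q)                       # back up the visited state S_t
        a = int(np.argmax(q))                  # greedy action A_t
        s = rng.choice(n_states, p=P[a, s])    # sample the next state S_{t+1}
    return v
```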



Full-width and sample backups

Notice that in DP, whether synchronous or asynchronous, every successor state and action is considered when updating a value, and the updates use full knowledge of the MDP transition dynamics and reward function. For moderately sized state spaces this may be fine, but for large problems it suffers from the curse of dimensionality. Instead of such full-width backups, in the next posts (Monte-Carlo methods, Temporal-Difference learning) we will look at algorithms with an alternative kind of backup, called the sample backup.

Sample backups use sampled rewards and sampled transitions \(\left\langle S, A, R, S^\prime \right\rangle\) instead of the reward function $\mathcal{R}$ and transition dynamics $\mathcal{P}$. This breaks the curse of dimensionality through sampling, and full knowledge of the MDP is no longer required (model-free), which allows us to learn instead of plan. Moreover, the cost of a backup becomes constant, independent of the size of the state space $\mathcal{S}$.


Summary

We walked through two iterative methods for finding an optimal policy $\pi_*$: policy iteration and value iteration. Policy iteration decomposes into policy evaluation and policy improvement, while value iteration effectively combines one sweep of each. The two algorithms are compared side by side below [5]:

image

$\mathbf{Fig. 6}$ Comparison between policy and value iteration


In general, policy iteration is likely to converge in fewer iterations than value iteration, since a policy tends to converge more quickly than a value function.





Reference

[1] Reinforcement Learning: An Introduction. Richard S. Sutton and Andrew G. Barto, The MIT Press.
[2] Introduction to Reinforcement Learning with David Silver
[3] Planning by Dynamic Programming by Maël Fabien
[4] A (Long) Peek into Reinforcement Learning
[5] What is the difference between value iteration and policy iteration?
