
Model-Free Prediction

In model-based learning, e.g., dynamic programming on a known MDP, the agent has a great deal of knowledge of the environment (the transition probabilities between states and the rewards). In model-free learning, e.g., Q-learning, the agent does not need a model of the environment; it only needs to know which states exist and which actions are possible in each state. In this post, we consider learning methods for estimating the value function of an unknown MDP, i.e., when we do not have complete knowledge of the environment.



Monte-Carlo Learning

The Monte-Carlo method is one of the most broadly used approximation techniques in mathematics, applicable to problems involving randomness. For example, by the law of large numbers, the expected value of a random variable can be approximated by the sample mean of a random sample. The following figure shows an example of the Monte-Carlo method that approximates $\pi$. We sample points $(x, y)$ uniformly from the unit square $[0, 1] \times [0, 1] \subset \mathbb{R}^2$ and check whether each sampled point lies inside the quarter circle with center $(0, 0)$ and radius $1$. The ratio of the number of points inside the circle to the total number of sampled points then approximates $\frac{\pi}{4}$.

image

$\mathbf{Fig\ 1.}$ The process of approximating $\pi$ by Monte-Carlo method
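
As a minimal sketch of this estimator (assuming nothing beyond the Python standard library; the sample count is an arbitrary choice):

```python
import random

def estimate_pi(n_samples: int = 100_000) -> float:
    """Approximate pi by uniform sampling in the unit square [0, 1] x [0, 1]."""
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        # Count points that fall inside the quarter circle of radius 1 centered at (0, 0).
        if x * x + y * y <= 1.0:
            inside += 1
    # The fraction of points inside the quarter circle approximates pi / 4.
    return 4.0 * inside / n_samples

print(estimate_pi())  # roughly 3.14
```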


As such, the Monte-Carlo method is a wide class of approximation and simulation techniques for computing values that cannot be expressed in closed form or are difficult to calculate. In reinforcement learning, Monte-Carlo methods allow the agent to learn directly from episodes of experience, without knowledge of the MDP transitions and rewards (model-free), simply by sampling trajectories.

Monte-Carlo Prediction

First, we consider Monte-Carlo methods for learning $v_\pi$. The goal of Monte-Carlo policy evaluation is to learn the true value function $v_\pi (s) = \mathbb{E}_\pi [G_t | S_t = s]$ from episodes of experience under policy $\pi$:

\[\begin{aligned} S_1, A_1, R_2, \cdots, S_k \sim \pi \end{aligned}\]

image

$\mathbf{Fig\ 2.}$ Monte-Carlo backup


The idea is simple: use the empirical mean of returns instead of the expectation. Suppose we evaluate a state $s$ from a set of episodes obtained under policy $\pi$ that pass through $s$. Within a single episode, the agent may encounter $s$ once or multiple times; each occurrence of $s$ in an episode is called a visit to $s$.

The first-visit Monte-Carlo policy evaluation estimates $v_\pi (s)$ by averaging the returns that follow first visits to $s$. Let $t$ be the first time step at which state $s$ is visited in an episode, and let $N(s)$ and $S(s)$ be the visit counter and the cumulative return for state $s$, respectively.
Whenever the agent first visits state $s$ in one of the episodes, we update

\[\begin{align} & N(s) \longleftarrow N(s) + 1 \\ & S(s) \longleftarrow S(s) + G_t \\ \end{align}\]

After the loop over the set of trajectories is done, the value is estimated by the empirical mean of the returns:

\[\begin{aligned} V(s) = \frac{S(s)}{N(s)} \end{aligned}\]

Note that the convergence of $V(s)$ to $v_\pi (s)$ as $N(s) \to \infty$ is ensured by the law of large numbers.

Every-visit Monte-Carlo policy evaluation instead considers all visits to $s$: we increment the counter and add $G_t$ to the total return for every time step $t$ at which state $s$ is visited within an episode. Notice that this update procedure can be implemented incrementally:

After episode $S_1, A_1, R_2, \cdots, S_T$, for each state $S_t$ with return $G_t$,

\[\begin{aligned} & N(S_t) \longleftarrow N(S_t) + 1 \\ & V(S_t) \longleftarrow V(S_t) + \frac{1}{N(S_t)} (G_t - V(S_t)) \end{aligned}\]

which follows from the incremental form of the sample mean:

\[\begin{aligned} \mu_k & = \frac{1}{k} \sum_{i=1}^k x_i \\ & = \frac{1}{k} ( x_k + \sum_{i=1}^{k-1} x_i ) \\ & = \mu_{k-1} + \frac{1}{k} (x_k - \mu_{k-1}) \end{aligned}\]

This kind of online averaging is also useful for tracking a running mean, i.e., gradually forgetting old episodes, in non-stationary problems (for example, when the policy keeps changing, as in policy iteration):

\[\begin{aligned} V(S_t) \longleftarrow V(S_t) + \alpha (G_t - V(S_t)) \end{aligned}\]

This can be interpreted as an exponentially weighted mean: the closer $\alpha$ is to $1$, the more quickly values from old episodes are forgotten. (This idea also leads us to temporal-difference learning.)
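
As a small illustration (the returns and the value of $\alpha$ below are arbitrary choices), the following sketch contrasts the exact incremental mean with the constant-$\alpha$ running mean:

```python
def incremental_mean(returns):
    """Exact sample mean, updated one return at a time (step size 1/N)."""
    v, n = 0.0, 0
    for g in returns:
        n += 1
        v += (g - v) / n
    return v

def running_mean(returns, alpha=0.5):
    """Exponentially weighted mean: recent returns dominate, old ones are forgotten."""
    v = 0.0
    for g in returns:
        v += alpha * (g - v)
    return v

returns = [1.0, 0.0, 2.0, 1.5, 3.0]
print(incremental_mean(returns))  # 1.5, equals sum(returns) / len(returns)
print(running_mean(returns))      # about 2.16, pulled toward the most recent return (3.0)
```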

Here, the state-value estimates are independent across states: we do not build the value estimate of one state upon the estimate of any other state (we do not bootstrap). The following diagram summarizes the backup for Monte-Carlo prediction:

image

$\mathbf{Fig\ 3.}$ The backup diagram for Monte-Carlo prediction of $v_\pi$. The trajectory must be *complete*.

And here is the pseudocode for first-visit MC prediction:

image

$\mathbf{Fig\ 4.}$ The first-visit MC method for estimating $v_\pi$.
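
Below is a minimal Python sketch of first-visit MC prediction under some assumptions: `env` is a hypothetical environment with `reset()` and `step(action)` returning `(next_state, reward, done)`, and `policy` maps a state to an action; neither is defined in this post.

```python
from collections import defaultdict

def first_visit_mc_prediction(env, policy, num_episodes, gamma=0.99):
    """Estimate v_pi by averaging the returns that follow first visits to each state."""
    N = defaultdict(int)      # visit counter N(s)
    V = defaultdict(float)    # value estimate V(s)

    for _ in range(num_episodes):
        # Generate one complete episode under the given policy.
        episode = []
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))   # reward plays the role of R_{t+1}
            state = next_state

        # Record the first time step at which each state appears.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)

        # Walk the episode backwards, accumulating the return G_t.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:           # update only on the first visit to s
                N[s] += 1
                V[s] += (G - V[s]) / N[s]     # incremental mean update
    return V
```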



Temporal-Difference Learning

Temporal-Difference learning is a central idea of reinforcement learning, combining Monte-Carlo methods and dynamic programming. Like MC methods, TD methods learn directly from episodes of raw experience without knowledge of the MDP transitions or reward function. And like DP, they continuously update value estimates step by step (they do not wait until the episode finishes). In other words, TD learns from incomplete episodes by bootstrapping; TD updates a guess towards a guess.



Temporal-Difference Prediction

Let’s consider policy evaluation, or prediction (estimating $v_\pi$ for a given policy $\pi$), with temporal-difference methods. While Monte-Carlo prediction must wait until the trajectory ends to obtain the return used to update $V(s)$, a temporal-difference method does not need to wait until the end; it only needs to wait until the next time step.

Recall the Bellman expectation equation for $v_\pi (s)$:

\[\begin{aligned} v_\pi (s) & = \sum_{a \in \mathcal{A}} \pi (a | s) ( \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a v_\pi (s^\prime)) \\ & = \sum_{a \in \mathcal{A}} \pi (a | s) ( \mathcal{R}_s^a + \gamma \mathbb{E}_{s^\prime \sim \mathcal{P}_{ss^\prime}^a} [v_\pi (s^\prime)]) \end{aligned}\]

Then, we can also estimate this expectation by a sample mean with only one sample:

\[\begin{aligned} v_\pi (s) = \mathbb{E}_\pi [\mathcal{R}_s^a + \gamma v_\pi (s^\prime)] \approx \mathcal{R}_s^a + \gamma v_\pi (s^\prime) \text{ where } a \sim \pi(\cdot \vert s), \; s^\prime \sim \mathcal{P}_{ss^\prime}^a \end{aligned}\]

Then, \(\mathcal{R}_s^a + \gamma v_\pi (s^\prime)\) is an unbiased estimator of the true $v_\pi (s)$, and this motivates temporal-difference learning. TD prediction uses \(\mathcal{R}_s^a + \gamma v_\pi (s^\prime) \text{ where } s^\prime \sim \mathcal{P}_{ss^\prime}^a\) as the update target, replacing the unknown $v_\pi (s^\prime)$ with the current estimate $V(s^\prime)$. The following update rule is the most basic TD method, denoted $\text{TD}(0)$:

\[\begin{aligned} V(S_t) \longleftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t)) \end{aligned}\]

image

$\mathbf{Fig\ 5.}$ Temporal-Difference backup


We update our guess of the value $V (S_t)$ toward the estimated return $R_{t+1} + \gamma V (S_{t+1})$ in the sense of an online mean. Since $R_{t+1} + \gamma V (S_{t+1})$ estimates the expected return, we call it the TD target, and $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error. Note that the update is based in part on an existing estimate; we say TD(0) is a bootstrapping method.

This is also why the TD target is not an unbiased estimate, even though $R_{t+1}$ and $S_{t+1}$ are sampled from real experience: it uses the current estimate $V(S_{t+1})$ in place of the true $v_\pi (S_{t+1})$. Suppose we start with an initial function $V$ that is zero for all states. This $V$ is the first approximation of $v_\pi$; the agent then updates this estimate from the samples obtained through interaction with the environment.

The following diagram summarizes the backup for Temporal-Difference prediction:

image

$\mathbf{Fig\ 6.}$ The backup diagram for Temporal-Difference prediction of $v_\pi$

and pseudocode for $\text{TD}(0)$:

image

$\mathbf{Fig\ 7.}$ TD(0) for estimating $v_\pi$

MC vs. TD

The main difference between MC and TD methods is that TD can learn before knowing the final outcome, whereas MC must wait until the end of the episode, when the return becomes available to the agent. This allows TD to learn from incomplete trajectories, and it also works in non-terminating environments.

Bias-Variance Tradeoff

One of the interesting aspects of the two methods is the bias-variance tradeoff. While the Monte-Carlo method estimates $v_\pi$ by the sample mean of the return $G_t$, which is an unbiased estimator, the temporal-difference method learns $v_\pi$ online with the TD target $R_{t+1} + \gamma V (S_{t+1})$. Since $V (S_{t+1})$ is the result of bootstrapping, the TD target is biased.

However, the TD target has much lower variance than the return: the return depends on many random actions, transitions, and rewards, since it is the result of one complete (and possibly long) trajectory, whereas the TD target depends on only one random action, transition, and reward, since it is computed after a single time step. Although TD methods are biased, fortunately, for any fixed policy $\pi$ the TD algorithm has been proved to converge to $v_\pi$.
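
To make the variance gap concrete, here is a small toy simulation (not from the post): each episode lasts 5 steps with i.i.d. rewards in $\{0, 1\}$ and $\gamma = 1$, and we compare the spread of the full return against a one-step target computed with the true value of the remaining steps.

```python
import random
import statistics

GAMMA, HORIZON = 1.0, 5

def rollout_return():
    """Full Monte-Carlo return: sum of HORIZON random rewards."""
    return sum(random.choice([0.0, 1.0]) for _ in range(HORIZON))

def one_step_target():
    """One-step target: one random reward plus the true expected value of the rest."""
    true_next_value = 0.5 * (HORIZON - 1)   # expected sum of the remaining rewards
    return random.choice([0.0, 1.0]) + GAMMA * true_next_value

mc_samples = [rollout_return() for _ in range(10_000)]
td_samples = [one_step_target() for _ in range(10_000)]
print(statistics.variance(mc_samples))  # about HORIZON * 0.25 = 1.25
print(statistics.variance(td_samples))  # about 0.25
```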



Sampling & Bootstrapping

The target for the Monte-Carlo update is $G_t$, an estimate of $\mathbb{E} [G_t | S_t = s]$, whereas DP methods use an estimate of $\mathbb{E} [R_{t+1} + \gamma v_\pi (S_{t+1}) | S_t = s]$ as the target. The temporal-difference method combines both: the target for the TD update is $R_{t+1} + \gamma V (S_{t+1})$.

Note that the targets of all three methods are estimates, not exact values. The MC target approximates the expectation of the return $G_t$ by sampling and taking the sample mean, while the DP target is an estimate because $v_\pi (S_{t+1})$ is not known and $V(S_{t+1})$ is used instead. The TD target is an estimate for both reasons.

Thus, TD methods combine the sampling of Monte Carlo with the bootstrapping of DP: they sample $R_{t+1}$ from the next time step and update the estimated value $V(S_t)$ toward the estimated return $R_{t+1} + \gamma V(S_{t+1})$.

In summary,

  • Bootstrapping: update involves an estimate
    • MC doesn’t bootstrap
    • DP bootstraps
    • TD bootstraps
  • Sampling: update samples an expectation
    • MC samples
    • DP doesn’t sample
    • TD samples

image

$\mathbf{Fig\ 8.}$ Unified view of Reinforcement Learning.



TD($\lambda$)

Then, could we combine the two methods so that their strengths and weaknesses offset each other? How can we merge the two ideas well?

One possible way is to consider $n$-step returns. We define the $n$-step return $G_t^{(n)}$ as

\[\begin{aligned} G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}) \end{aligned}\]


Then, for each $n = 1, 2, \cdots$,

\[\begin{alignat*}{2} &n = 1 \text{ (TD) } &&\longrightarrow G_t^{(1)} = R_{t+1} + \gamma V (S_{t+1}) \\ &n = 2 &&\longrightarrow G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V (S_{t+2}) \\ &\quad \vdots \\ &n = \infty \text{ (MC) } &&\longrightarrow G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T \\ \end{alignat*}\]

So, as $n \to \infty$, the $n$-step return recovers the full Monte-Carlo return, while $n = 1$ recovers the TD target. And unlike the full return, the $n$-step return is guaranteed to be available to the agent after $n$ time steps, which means it works in non-terminating environments too.
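
For concreteness, a small sketch that computes $G_t^{(n)}$ from hypothetical inputs: a list `rewards` with `rewards[k]` storing $R_{k+1}$, a list `states` with `states[k]` storing $S_k$, and a value table `V`.

```python
def n_step_return(rewards, states, V, t, n, gamma=0.99):
    """Compute G_t^(n) = R_{t+1} + gamma R_{t+2} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).

    If the episode terminates before time step t + n, the bootstrap term is
    dropped and G_t^(n) reduces to the Monte-Carlo return.
    """
    T = len(rewards)                        # number of recorded transitions
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        G += discount * rewards[k]          # accumulate gamma^{k-t} * R_{k+1}
        discount *= gamma
    if t + n < T:
        G += discount * V[states[t + n]]    # bootstrap with gamma^n * V(S_{t+n})
    return G
```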

Since $n$ is a hyperparameter and we do not know which $n$ is best, we can average $n$-step returns over different values of $n$, for example over a chosen set $\{ i_1, \cdots, i_N \}$:

\[\begin{aligned} \frac{1}{N} \sum_{k=1}^N G_t^{(i_k)} \end{aligned}\]


This idea of averaging $n$-step returns can be generalized to the $\lambda$-return $G_t^\lambda$, which combines all $n$-step returns $G_t^{(n)}$ by a geometrically weighted average:

\[\begin{aligned} G_t^\lambda = \sum_{n=1}^\infty (1 - \lambda) \lambda^{n-1} G_t^{(n)} \end{aligned}\]

for $\lambda \in (0, 1)$. Note that the sum of weights $(1 - \lambda) \lambda^{n-1}$ is $1$.
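
Indeed, for $\lambda \in (0, 1)$ the geometric series gives

\[\begin{aligned} \sum_{n=1}^{\infty} (1 - \lambda) \lambda^{n-1} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} = (1 - \lambda) \cdot \frac{1}{1 - \lambda} = 1 \end{aligned}\]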





