[RL] Model-Free Control
Model-Free Control
We now delve into how to optimize the value function without complete knowledge of the environment.
There are two approaches to model-free control. In on-policy learning, the policy that generates actions and the policy being updated are the same, whereas in off-policy learning the two policies may differ.
On-Policy Learning
Typical examples of on-policy learning are Monte-Carlo control and Temporal-Difference control, and we will look at these two algorithms one by one. Recall that in the generalized policy iteration in dynamic programming, we are allowed to use any algorithm for policy evaluation and policy improvement instead of iterative policy evaluation and greedy policy improvement.
$\mathbf{Fig\ 1.}$ Generalized Policy Iteration
The high generality of the framework implies that it can be utilized even when the model is unknown, as long as the evaluation and improvement algorithms do not require knowledge of the environment. Hence, Monte-Carlo control and TD control are both based on this generalized policy iteration framework.
However, greedy policy improvement over $V(s)$ requires a model of the MDP to find the improved policy:

\[\begin{aligned} \pi^\prime (s) = \underset{a \in \mathcal{A}}{\text{ argmax }} \left( \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a V(s^\prime) \right) \end{aligned}\]

In contrast, greedy policy improvement over $Q(s, a)$ is model-free: once we know $Q(s, a)$ for every action $a \in \mathcal{A}$, we can improve the policy simply by comparing action values, without any model of the MDP. Hence, to eliminate the need for a model, we usually estimate the action-value function $Q = q_\pi$ instead of $V = v_\pi$ in the policy evaluation step of model-free control.
\[\begin{aligned} \pi^\prime (s) = \underset{a \in \mathcal{A}}{\text{ argmax }} Q(s, a) \end{aligned}\]For policy improvement, this greedy approach is still available, but there is one problem. Since there are many state-action pairs, some pairs may never be visited, which makes it difficult to select an action based on its value. Imagine a state $s$ where two actions $a_1$ and $a_2$ are possible. What if we have an estimate for $Q(s, a_1)$, but $(s, a_2)$ has never been visited, so we cannot estimate $Q(s, a_2)$? Which action should we choose?
To avoid such a dilemma, one may assume that episodes start at a state-action pair and that every state-action pair has a nonzero probability of being selected as the start. This assumption is called exploring starts, and it is one way to guarantee continual exploration. (Otherwise, the agent may favor actions whose values have already been updated over actions whose values are still at their zero initialization.)
$\varepsilon$-greedy Exploration
Like exploring starts, $\varepsilon$-greedy exploration is one of the simplest ideas for ensuring continual exploration. We introduce a hyperparameter $\varepsilon \in [0, 1]$, the probability of choosing a random action instead of the greedy action. That way, every action in the action space is chosen with non-zero probability:
\[\begin{aligned} \pi (a|s) = \begin{cases} \frac{\varepsilon}{m} + 1 - \varepsilon && \text{ if } a = \underset{a^\prime \in \mathcal{A}(s)}{\text{ argmax }} Q(s, a^\prime) \\ \frac{\varepsilon}{m} && \text{ otherwise } \end{cases} \end{aligned}\]where $m = | \mathcal{A}(s) |$. And here is a theorem that ensures the improvement of $\varepsilon$-greedy policy iteration:
$\mathbf{Thm.}$ The monotone improvement of $\boldsymbol{\varepsilon}$-greedy policy iteration
For any $\varepsilon$-greedy policy $\pi$, the $\varepsilon$-greedy policy $\pi^\prime$ with respect to $q_\pi$ is an improvement, i.e., $v_{\pi^\prime} (s) \geq v_\pi (s)$ for all $s \in \mathcal{S}$.
$\mathbf{Proof.}$
For the first equality, refer to the dynamic programming post. We claim that $\underset{a \in \mathcal{A}(s)}{\text{ max }} q_\pi (s, a) \geq \sum_{a \in \mathcal{A}(s)} \frac{\pi (a | s) - \frac{\varepsilon}{m}}{1 - \varepsilon} q_\pi (s, a)$.
Let $w_a = \frac{\pi (a | s) - \frac{\varepsilon}{m}}{1 - \varepsilon}$. Since $\pi$ is $\varepsilon$-greedy, $\pi(a|s) \geq \frac{\varepsilon}{m}$, so each $w_a$ is non-negative, and $\sum_{a \in \mathcal{A}(s)} w_a = 1$ as $m = | \mathcal{A}(s) |$. Then $\sum_{a \in \mathcal{A}(s)} \frac{\pi (a | s) - \frac{\varepsilon}{m}}{1 - \varepsilon} q_\pi (s, a)$ is a weighted average of the $q_\pi (s, a)$ with non-negative weights summing to $1$, and as such it must be less than or equal to the largest of the $q_\pi (s, a)$.
Hence,
\[\begin{aligned} q_\pi (s, \pi^\prime (s)) & = \sum_{a \in \mathcal{A}(s)} \pi^\prime (a | s) q_\pi (s, a) \\ & = \frac{\varepsilon}{m} \sum_{a \in \mathcal{A}(s)} q_\pi (s, a) + (1 - \varepsilon) \underset{a \in \mathcal{A}(s)}{\text{ max }} q_\pi (s, a) \\ & \geq \frac{\varepsilon}{m} \sum_{a \in \mathcal{A}(s)} q_\pi (s, a) + (1 - \varepsilon) \sum_{a \in \mathcal{A}(s)} \frac{\pi (a | s) - \frac{\varepsilon}{m} }{1 - \varepsilon} q_\pi (s, a) \\ & = \sum_{a \in \mathcal{A}(s)} \pi (a | s) q_\pi (s, a) = v_\pi (s). \; _\blacksquare \end{aligned}\]Since $q_\pi (s, \pi^\prime (s)) \geq v_\pi (s)$ for all states $s$, the policy improvement theorem gives $v_{\pi^\prime} (s) \geq v_\pi (s)$, which proves the theorem.
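Before moving on, here is a minimal NumPy sketch of sampling an action from the $\varepsilon$-greedy policy defined above, for a tabular $Q$. The function name and array layout are my own illustrative choices, not part of the original algorithm descriptions.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng=np.random.default_rng()):
    """Sample an action from the epsilon-greedy policy derived from a tabular Q.

    Q       : array of shape (n_states, n_actions) holding action-value estimates
    state   : index of the current state
    epsilon : exploration probability in [0, 1]
    """
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        # with probability epsilon, choose uniformly among all m actions
        return int(rng.integers(n_actions))
    # otherwise act greedily, so the greedy action has total probability 1 - epsilon + epsilon/m
    return int(np.argmax(Q[state]))
```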
Monte-Carlo Control
In Monte-Carlo policy iteration, we use Monte-Carlo prediction of action values for the policy evaluation step. To estimate action values with the Monte-Carlo method, we must consider not only the state but also the action: the first-visit and every-visit MC methods approximate $Q(s, a)$ by averaging, over episodes, the returns that follow visits to state $s$ in which action $a$ was selected.
$\mathbf{Fig\ 2.}$ Monte-Carlo Policy Iteration
However, the convergence of Monte-Carlo prediction is only guaranteed as the number of samples goes to infinity. In practice we avoid an infinite number of episodes by stopping MC policy evaluation after finitely many steps: each evaluation iteration moves our estimate of the value function toward the true $q_{\pi_k}$, but we stop before reaching the actual value, which would require an expensive number of iterations. (One extreme form of this idea is value iteration in DP: even though only one iteration of policy evaluation is performed between each step of policy improvement, convergence is still ensured by the Banach fixed-point theorem.)
It is therefore natural to alternate evaluation and improvement on an episode-by-episode basis: after each episode, the observed returns are used for policy evaluation, and the policy is then improved at all states visited in the episode.
$\mathbf{Fig\ 3.}$ Monte-Carlo Control
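Putting evaluation and improvement together, a rough sketch of on-policy every-visit MC control with an $\varepsilon$-greedy policy might look as follows. It assumes a Gymnasium-style discrete environment (`env.reset()`, `env.step()`); the function and variable names are mine, not from a reference implementation.

```python
import numpy as np
from collections import defaultdict

def mc_control(env, n_episodes=10_000, gamma=1.0, epsilon=0.1):
    """On-policy every-visit Monte-Carlo control with an epsilon-greedy policy (sketch)."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))       # action-value estimates
    counts = defaultdict(lambda: np.zeros(n_actions))  # visit counts for incremental averaging
    rng = np.random.default_rng()

    for _ in range(n_episodes):
        # generate one episode with the current epsilon-greedy policy
        episode, done = [], False
        state, _ = env.reset()
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state, done = next_state, terminated or truncated

        # policy evaluation: average the observed returns; policy improvement is implicit,
        # since the next episode acts epsilon-greedily w.r.t. the updated Q
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G                      # return following (state, action)
            counts[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / counts[state][action]
    return Q
```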
On-policy Temporal-Difference Control (SARSA)
We have already seen that temporal-difference (TD) learning has several advantages over Monte-Carlo (MC). In temporal-difference control, we use the TD method to estimate $Q(s, a)$ instead of MC in order to exploit these advantages. Recall that an episode consists of an alternating sequence of state–action pairs (and rewards between each of them):
$\mathbf{Fig\ 4.}$ Some parts of an episode
Here is the update formula for policy evaluation; we update $Q(S_t, A_t)$ at every time step of each episode:

\[\begin{aligned} Q(S_t, A_t) \longleftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right) \end{aligned}\]
$\mathbf{Fig\ 5.}$ The backup diagram of SARSA
The target is motivated by the Bellman expectation equation, combined with single-sample Monte-Carlo estimation:
\[\begin{aligned} q_\pi (s, a) & = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \sum_{a^\prime \in \mathcal{A}} \pi(a^\prime | s^\prime) q_\pi (s^\prime, a^\prime) \\ & = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \mathbb{E}_{a^\prime \sim \pi( \cdot | s^\prime)} [q_\pi (s^\prime, a^\prime)] \\ & \approx \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a q_\pi (s^\prime, A^\prime) \text{ where } A^\prime \sim \pi( \cdot | s^\prime) \\ & \approx \mathcal{R}_s^a + \gamma q_\pi (S^\prime, A^\prime) \text{ where } S^\prime \sim \mathcal{P}_s^a, A^\prime \sim \pi( \cdot | S^\prime) \\ \end{aligned}\]Note that a single tuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ is all that is needed for the update, and this gives rise to the name SARSA.
Now, it is straightforward to design the full control algorithm with TD prediction. As usual, it follows the pattern of generalized policy iteration. However, we should use an alternative policy improvement step (or add assumptions to the policy), such as $\varepsilon$-greedy, to ensure convergence: all state–action pairs must be visited an infinite number of times.
The following figure and pseudocode show the implementation of the SARSA algorithm:
$\mathbf{Fig\ 6.}$ SARSA
$\mathbf{Fig\ 7.}$ The pseudocode for SARSA
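Under the same assumptions as the MC control sketch above (a Gymnasium-style discrete environment, illustrative names of my own), a minimal tabular SARSA sketch is:

```python
import numpy as np

def sarsa(env, n_episodes=10_000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular SARSA for a Gymnasium-style discrete environment (sketch)."""
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()

    def policy(s):  # epsilon-greedy; behaviour policy = target policy (on-policy)
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        state, _ = env.reset()
        action = policy(state)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_action = policy(next_state)   # A' is sampled *before* the update
            # the TD target bootstraps from the action the agent will actually take next
            target = reward + (0.0 if terminated else gamma * Q[next_state, next_action])
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```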
Like TD prediction, SARSA can be extended to $n$-step SARSA and SARSA$(\lambda)$. $n$-step SARSA updates $Q(S_t, A_t)$ towards the $n$-step $Q$-return

\[\begin{aligned} q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q (S_{t+n}, A_{t+n}). \end{aligned}\]
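For illustration only, a small helper (hypothetical, not from the original post) that computes this $n$-step $Q$-return from a recorded trajectory could look like:

```python
def n_step_q_return(rewards, states, actions, Q, t, n, gamma):
    """q_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n Q(S_{t+n}, A_{t+n}).

    rewards[k] is R_{k+1}, the reward received after taking actions[k] in states[k];
    if the episode terminates before step t + n, we simply sum the remaining rewards.
    Q is indexed as Q[state, action] (a 2-D table).
    """
    T = len(rewards)                    # number of transitions in the episode
    horizon = min(n, T - t)             # truncate the sum at the terminal step
    G = sum(gamma ** k * rewards[t + k] for k in range(horizon))
    if t + n < T:                       # bootstrap only from a non-terminal S_{t+n}
        G += gamma ** n * Q[states[t + n], actions[t + n]]
    return G
```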
The $q^\lambda$ return then combines all $n$-step $Q$-returns:

\[\begin{aligned} q_t^\lambda = (1- \lambda) \sum_{n=1}^\infty \lambda^{n-1} q_t^{(n)} \end{aligned}\]

Forward-view SARSA$(\lambda)$ updates $Q(S_t, A_t)$ towards this return instead.

Off-Policy Learning
In off-policy learning, we distinguish two policies: the policy to be optimized (the target policy $\pi(a | s)$, whose value function is the target of the learning process) and the policy used to generate experience (the behaviour policy $\mu (a | s)$, which controls the agent and generates actions). In other words:
- Evaluate the target policy $\pi (a | s)$ to compute $v_\pi (s)$ or $q_\pi (s, a)$,
- while following the behaviour policy $\mu (a | s)$, which may differ from $\pi$.
There are several reasons for doing so:
(1) It allows the agent to learn from other sources such as humans or other agents.
(2) The target policy may be deterministic, while the behavior policy can continue to sample all possible actions.
(3) The agent can re-use experience from old policies $\pi_1, \cdots, \pi_{k - 1}$.
(4) The agent can learn about optimal policy while following exploratory policy.
(5) The agent can learn about multiple policies while following one policy.
Thus, learning with two different policies is called off-policy learning, as the agent learns about a policy from experience gathered off (i.e., not following) that policy. Note that learning $\pi$ from episodes of $\mu$ is possible only when $\pi(a | s) > 0 \Longrightarrow \mu (a | s) > 0$, i.e., every action taken under $\pi$ is also taken, at least occasionally, under $\mu$.
Importance Sampling
We can convert on-policy algorithms such as MC and TD prediction into off-policy ones. One possible technique is called importance sampling: a general technique for estimating an expectation under one distribution given samples from another. Since the samples are generated by a different policy, we must correct the estimate so that it reflects the target distribution:
\[\begin{aligned} \mathbb{E}_{X \sim P} [f (X)] & = \sum P(X) f(X) \\ & = \sum Q(X) \frac{P(X)}{Q(X)} f(X) \\ & = \mathbb{E}_{X \sim Q} \left[ \frac{P(X)}{Q(X)} f (X) \right] \\ & \approx \frac{1}{N} \sum_{n=1}^N \frac{P(X^{(n)})}{Q(X^{(n)})} f(X^{(n)}) \text{ where } X^{(n)} \sim Q \end{aligned}\]Conversely, if $P(X)$ is too complex to sample from, this technique also lets us replace it with another distribution $Q(X)$ that is easier to sample. The ratio of the two distributions, $\frac{P(X)}{Q(X)}$, is usually called the importance weight or importance-sampling ratio.
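As a quick numerical illustration (a toy example of my own, not from the post), we can estimate $\mathbb{E}_P[f(X)]$ using only samples drawn from $Q$:

```python
import numpy as np

rng = np.random.default_rng(0)

# two discrete distributions over the same support {0, 1, 2}
P = np.array([0.1, 0.3, 0.6])   # distribution we care about (e.g. target policy)
Q = np.array([0.5, 0.3, 0.2])   # distribution we sample from (e.g. behaviour policy)
f = np.array([1.0, 2.0, 3.0])   # an arbitrary function of X

true_value = np.sum(P * f)                     # E_P[f(X)] = 2.5

samples = rng.choice(3, size=100_000, p=Q)     # X^(n) ~ Q
weights = P[samples] / Q[samples]              # importance weights P(X)/Q(X)
is_estimate = np.mean(weights * f[samples])    # (1/N) sum_n w_n f(X^(n))

print(true_value, is_estimate)                 # the estimate is close to 2.5
```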
Off-Policy Monte-Carlo
Note that the probability of the state-action trajectory $S_t, A_t, S_{t+1}, A_{t+1}, \cdots, S_T$ under policy $\pi$ is
\[\begin{aligned} \prod_{k=t}^{T-1} \pi (A_k | S_k) \cdot p(S_{k+1} | S_k, A_k). \end{aligned}\]Thus, the importance weight for this trajectory is
\[\begin{aligned} \rho_t^T = \frac{\prod_{k=t}^{T-1} \pi (A_k | S_k) \cdot p(S_{k+1} | S_k, A_k)}{\prod_{k=t}^{T-1} \mu (A_k | S_k) \cdot p(S_{k+1} | S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi (A_k | S_k)}{\mu (A_k | S_k)} \end{aligned}\]We can see that the ratio is independent of the environment, i.e., of the transition probabilities, so it does not break the model-free nature of the method. In the off-policy version of MC learning, returns generated from $\mu$ are used to evaluate $\pi$. Hence, to correct a return generated under the behaviour policy $\mu$ so that it estimates the value under $\pi$, we multiply by importance-sampling corrections along the whole episode:
\[\begin{aligned} G_t^{\pi / \mu} & = \frac{\pi (A_t | S_t) }{\mu (A_t | S_t)} \cdot \frac{\pi (A_{t+1} | S_{t+1})}{\mu (A_{t+1} | S_{t+1})} \cdots \frac{\pi (A_{T-1} | S_{T-1})}{\mu (A_{T-1} | S_{T-1})} G_t = \rho_t^T G_t \end{aligned}\]Then we update the value with the corrected return,
\[\begin{aligned} V(S_t) \longleftarrow V(S_t) + \alpha (G_t^{\pi / \mu} - V(S_t)) \end{aligned}\]and our Monte-Carlo estimate of $v_\pi (s)$ is as follows:
\[\begin{aligned} V(s)= \frac{\sum_{t \in \mathcal{T}(s)} \rho_t^{T(t)} G_t}{|\mathcal{T}(s)|} \end{aligned}\]where $\mathcal{T}(s)$ is the set of all time steps at which state $s$ is visited and $T(t)$ is the termination time of the episode containing $t$. But since importance sampling can dramatically increase the variance, this approach is not used very often in practice.
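Even so, to make the mechanics concrete, here is a small sketch (illustrative names of my own) that computes the importance-sampled return $\rho_t^T G_t$ for every time step of one episode generated by $\mu$:

```python
def importance_sampled_returns(episode, pi, mu, gamma=1.0):
    """Compute the corrected return G_t^{pi/mu} = rho_t^T * G_t for every time step t.

    episode       : list of (state, action, reward) tuples generated by the behaviour policy mu
    pi[s, a], mu[s, a] : 2-D arrays of action probabilities for target / behaviour policy
    """
    T = len(episode)
    corrected = []
    for t in range(T):
        G, rho = 0.0, 1.0
        for k in range(t, T):
            s, a, r = episode[k]
            G += gamma ** (k - t) * r      # ordinary return G_t
            rho *= pi[s, a] / mu[s, a]     # cumulative importance weight rho_t^T
        corrected.append(rho * G)
    return corrected
```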
Off-Policy TD with importance sampling
TD learning may also be utilized instead of MC. As in the MC case, the off-policy version uses TD targets generated by the behaviour policy $\mu$.
Recall the Bellman expectation equation for $v_\pi (s)$:
\[\begin{aligned} v_\pi (s) & = \sum_{a \in \mathcal{A}} \pi (a | s) ( \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a v_\pi (s^\prime)) \\ & = \sum_{a \in \mathcal{A}} \pi (a | s) ( \mathcal{R}_s^a + \gamma \mathbb{E}_{s^\prime \sim \mathcal{P}_{ss^\prime}^a} [ v_\pi (s^\prime) ]) \\ \end{aligned}\]But this is not yet off-policy, since the expectation over actions is taken under $\pi$; so we rewrite the equation with the importance weight:
\[\begin{aligned} v_\pi (s) & = \sum_{a \in \mathcal{A}} \mu (a | s) \frac{\pi (a | s)}{\mu (a | s)} ( \mathcal{R}_s^a + \gamma \mathbb{E}_{s^\prime \sim \mathcal{P}_{ss^\prime}^a} [ v_\pi (s^\prime) ]) \\ \end{aligned}\]Since we consider only a single time step, a single importance-sampling correction is enough (hence it has much lower variance than MC importance sampling). Finally, we remove the need for the transition matrix by approximating the expectation with a sample. Here is the iterative update formula:
\[\begin{aligned} V(S_t) & \longleftarrow V(S_t) \\ & + \alpha ( \frac{\pi (A_t | S_t)}{\mu (A_t | S_t)} (R_{t+1} + \gamma V(S_{t+1})) - V(S_t)) \end{aligned}\]
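As a one-transition sketch (the function and parameter names are my own, and tabular policies are assumed), this update could be implemented as:

```python
def off_policy_td_update(V, s, a, r, s_next, pi, mu, alpha=0.1, gamma=1.0):
    """One off-policy TD(0) update with a single importance-sampling correction.

    V        : 1-D array of state values
    pi[s, a] : target-policy probability of action a in state s
    mu[s, a] : behaviour-policy probability (must be > 0 whenever pi[s, a] > 0)
    """
    rho = pi[s, a] / mu[s, a]                      # one-step importance weight
    td_target = rho * (r + gamma * V[s_next])      # weighted TD target
    V[s] += alpha * (td_target - V[s])
    return V
```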
Q-Learning

Now, let's consider off-policy learning of action-values $Q(s, a)$ without importance sampling, which is called Q-learning, one of the most important breakthroughs in RL.
Recall the Bellman optimality equation for $Q$, which is the true target for control if we know the transition matrix:
\[\begin{aligned} q_* (s, a) = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \underset{a^\prime \in \mathcal{A}}{\text{ max }} q_* (s^\prime, a^\prime) \end{aligned}\]Note that calculating this equation does not require a policy: there is no expectation with respect to $\pi$, so we do not need to sample from any policy. Turning it into an iteration (value iteration on $Q$) gives
\[\begin{aligned} q_{k+1} (s, a) = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \underset{a^\prime \in \mathcal{A}}{\text{ max }} q_k (s^\prime, a^\prime) \end{aligned}\]Now we apply MC estimation to the above; that is, we approximate the expectation with respect to $\mathcal{P}_{ss^\prime}^a$ with a single sample:
\[\begin{aligned} q_{k+1} (s, a) = \mathcal{R}_s^a + \gamma \underset{a^\prime \in \mathcal{A}}{\text{ max }} q_k (s^\prime, a^\prime) \text{ where } s^\prime \sim \mathcal{P}_{ss^\prime}^a \end{aligned}\]So, we do not need importance sampling here! Finally, Q-learning performs incremental learning with this Monte-Carlo estimated target:
\[\begin{aligned} q_{k+1} (s, a) = q_k (s, a) + \alpha ( \mathcal{R}_{s}^a + \gamma \underset{a^\prime}{\text{ max }} q_k (s^\prime, a^\prime) - q_k (s, a)) \end{aligned}\]Hence, with this update rule, the learned $Q$ directly estimates the optimal \(q_*\): Q-learning is like value iteration on $Q$ with one-step MC sampling of $s^\prime$. In other words, Q-learning aims to learn the values of the optimal greedy policy. Notice that the state-action pair whose Q-value is updated is determined by the behaviour policy $\mu$, so the algorithm is off-policy.
It is mathematically known that Q-learning control converges with probability 1 to the optimal action-value function, provided the behaviour policy visits every state-action pair infinitely often so that all pairs continue to be updated.
The following figure and pseudocode summarize Q-learning:
$\mathbf{Fig\ 8.}$ The backup diagram for Q-Learning
$\mathbf{Fig\ 9.}$ The pseudocode for Q-Learning
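A minimal tabular Q-learning sketch, under the same Gymnasium-style assumptions as the SARSA sketch above (names are illustrative):

```python
import numpy as np

def q_learning(env, n_episodes=10_000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning for a Gymnasium-style discrete environment (sketch)."""
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()

    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # behaviour policy mu: epsilon-greedy w.r.t. the current Q
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # target policy pi: greedy, so bootstrap from the max over next-state values
            best_next = 0.0 if terminated else np.max(Q[next_state])
            Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])
            state = next_state
    return Q
```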
$\mathbf{Remark.}$ If all policies are greedy in Q-learning and SARSA, are two algorithms equivalent?
(Reference: [5])
No. SARSA samples the next action used in the update before updating the Q-values, whereas Q-learning samples it after updating. (Compare the pseudocode of the two algorithms.)
As an explicit example, consider a transition from $S$ to $S^\prime = S$ (a self-loop).
- SARSA
- Initialize $S_1 = S$, and choose \(A_1 = \text{ argmax}_A \; Q_1 (S_1 = S, A)\)
- Take action $A_1$ and obtain $R$ and $S_2 = S^\prime = S$
- Choose action \(A_2 = \text{ argmax}_A \; Q_1 (S_2 = S, A)\)
- $Q_2 (S_1, A_1) = Q_1 (S_1, A_1) + \alpha \left( R + \gamma Q_1 (S_2, A_2) - Q_1(S_1, A_1) \right)$
- Q-Learning
- Initialize $S_1 = S$
- Choose \(A_1 = \text{ argmax}_A \; Q_1 (S_1 = S, A)\)
- Take action $A_1$ and obtain $R$ and $S_2 = S^\prime = S$
- $Q_2 (S_1, A_1) = Q_1 (S_1, A_1) + \alpha \left( R + \gamma \text{ max}_A \; Q_1 (S_2, A) - Q_1(S_1, A_1) \right)$
- Choose action \(A_2 = \text{ argmax}_A \; Q_2 (S_2 = S, A)\)
Note that $A_1$ is the action used for the 1st update, and $A_2$ is the action for the 2nd update. It can be deduced that the fundamental reason for this difference is the distinction between on-policy and off-policy learning.
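A tiny numerical check of this remark, with toy numbers of my own choosing, on a self-loop $S \to S$ with two actions $a_0, a_1$:

```python
import numpy as np

alpha, gamma, R = 0.5, 1.0, -1.0       # toy step size, discount, and reward
Q_sarsa = np.array([1.0, 0.9])         # Q_1(S, a0), Q_1(S, a1); the only state is S
Q_qlearn = Q_sarsa.copy()

# Both algorithms greedily pick A_1 = a0 and observe (R, S' = S).

# SARSA: choose A_2 greedily from Q_1 *before* updating, then update towards Q_1(S, A_2)
A2_sarsa = int(np.argmax(Q_sarsa))     # a0
Q_sarsa[0] += alpha * (R + gamma * Q_sarsa[A2_sarsa] - Q_sarsa[0])

# Q-learning: update towards max_A Q_1(S, A) first, then choose A_2 greedily from Q_2
Q_qlearn[0] += alpha * (R + gamma * np.max(Q_qlearn) - Q_qlearn[0])
A2_qlearn = int(np.argmax(Q_qlearn))   # now a1, since Q_2(S, a0) has dropped to 0.5

print(Q_sarsa, A2_sarsa)    # [0.5 0.9], next action a0
print(Q_qlearn, A2_qlearn)  # [0.5 0.9], next action a1
```

After one step the value tables coincide, but the next action already differs, so the subsequent updates and trajectories diverge.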
$\mathbf{Remark.}$ What if we do 2-step Q-Learning?
Like $n$-step $\text{SARSA}$ or $\text{SARSA}(\lambda)$, can we do 2-step Q-Learning?
\[\begin{aligned} q_{k+1} (s, a) = q_k (s, a) + \alpha (\mathcal{R}_s^a + \gamma \mathcal{R}_{s^\prime}^{a^\prime} + \gamma^2 \underset{a^{\prime \prime} \in \mathcal{A}}{\text{ max }} q_k (s^{\prime \prime}, a^{\prime \prime}) - q_k (s, a)) \end{aligned}\]Actually, the above target involves the policy $\pi$: to obtain the second reward and state, we must sample $a^\prime$ from $\pi$, while our experience is generated by $\mu$. More precisely, writing the TD target exactly (before the MC approximations), it is:
\[\begin{aligned} q_{k+1} (s, a) & = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \sum_{a^\prime \in \mathcal{A}} \boldsymbol{\pi (a^\prime | s^\prime)} \left( \mathcal{R}_{s^\prime}^{a^\prime} + \gamma \sum_{s^{\prime \prime} \in \mathcal{S}} \mathcal{P}_{s^{\prime} s^{\prime \prime}}^{a^\prime} \underset{a^{\prime \prime} \in \mathcal{A}}{\text{ max }} q_k (s^{\prime \prime}, a^{\prime \prime}) \right) \\ & = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \sum_{a^\prime \in \mathcal{A}} \boldsymbol{\mu (a^\prime | s^\prime) \frac{\pi (a^\prime | s^\prime)}{\mu (a^\prime | s^\prime)}} \left( \mathcal{R}_{s^\prime}^{a^\prime} + \gamma \sum_{s^{\prime \prime} \in \mathcal{S}} \mathcal{P}_{s^{\prime} s^{\prime \prime}}^{a^\prime} \underset{a^{\prime \prime} \in \mathcal{A}}{\text{ max }} q_k (s^{\prime \prime}, a^{\prime \prime}) \right) \\ \end{aligned}\]Hence, when $a^\prime$ is sampled from $\mu$, we should correct the scale by multiplying by the importance weight $\frac{\pi (a^\prime | s^\prime)}{\mu (a^\prime | s^\prime)}$ to obtain a valid objective.
SARSA vs. Q-Learning
So far, we have considered two versions of TD control: SARSA (on-policy) and Q-learning (off-policy).
SARSA must sample $A_{t+1} = a^\prime$ from $\pi$ in order to approximate the Bellman expectation equation, and this will be the next action taken, since the agent follows $\pi$. In contrast, although Q-learning seems to follow the target policy $\pi$ (in Q-learning, $\pi$ is the greedy policy), the agent actually follows $\mu$, because the state-action pair whose Q-value is updated is determined by $\mu$. Hence, Q-learning is off-policy whereas SARSA is on-policy. [4]
Although they look very similar, the two algorithms are distinctly different. We also confirmed that even when all policies in both algorithms are greedy, SARSA and Q-learning are not the same. Comparing their TD targets, SARSA aims to learn the optimal $\varepsilon$-greedy policy, while Q-learning aims to learn the optimal greedy policy.
One well-known example that highlights the effect of this slight difference is cliff walking (Example 6.6, page 106, of Reinforcement Learning: An Introduction by Sutton and Barto). It is very similar to FrozenLake but with a larger observation space.
In this environment, the agent must reach the goal along the edge of a cliff without falling off. Each step incurs a reward of $-1$; falling off the cliff is penalized with a reward of $-100$ and instantly sends the agent back to the initial state.
With SARSA and Q-learning, the agent learns the values and finds the following paths:
$\mathbf{Fig\ 10.}$ The result of SARSA and Q-learning in cliff-walking task. $\varepsilon = 0.1$
Q-learning learns the optimal path and policy: it travels right along the edge of the cliff. SARSA, on the other hand, takes the action selection into account and learns a longer but much safer path than Q-learning: it travels through the topmost grid cells. However, Q-learning's online performance is worse than that of SARSA, which learns the roundabout policy, because $\varepsilon$-greedy exploration occasionally drives the agent off the cliff along the optimal path.
But note that this result does not mean that SARSA cannot converge to the optimal policy: if $\varepsilon$ of the $\varepsilon$-greedy policy is gradually reduced during training, both methods asymptotically converge to the optimal policy.
See also: https://ai.stackexchange.com/questions/34071/is-the-optimal-policy-the-one-with-the-highest-accumulative-reward-q-learning-v
Summary
In summary, we compare the different algorithms of DP and TD learning:
$\mathbf{Fig\ 11.}$ Relationship Between DP and TD
The notation \(x \overset{\alpha}{\longleftarrow} y\) means \(x \leftarrow x + \alpha (y - x)\).
Reference
[1] Reinforcement Learning: An Introduction. Richard S. Sutton and Andrew G. Barto, The MIT Press.
[2] Introduction to Reinforcement Learning with David Silver
[3] A (Long) Peek into Reinforcement Learning
[4] What is the difference between off-policy and on-policy learning?
[5] Are Q-learning and SARSA the same when action selection is greedy?
[6] Stanford, CS234: Reinforcement Learning Winter 2022