
Markov Process

$\color{blue}{\mathbf{Definition.}}$ Markov State
A state $S_t$ is Markov if and only if $$ \begin{aligned} P[S_{t+1} | S_t] = P[S_{t+1} | S_1, \cdots, S_t] \end{aligned} $$ Thus, for a Markov state $S_t = s$ and successor state $S_{t+1} = s^\prime$, the following probability can be defined: $$ \begin{aligned} \mathcal{P}_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s] \end{aligned} $$ which is called the state transition probability. If the state space is finite, the state transition probabilities from every state $s$ to every successor state $s^\prime$ can be collected into a matrix, $$ \mathcal{P}= \left[\begin{array}{ccc} \mathcal{P}_{11} & \ldots & \mathcal{P}_{1 n} \\ \vdots & & \vdots \\ \mathcal{P}_{n 1} & \ldots & \mathcal{P}_{n n} \end{array}\right] $$ where each row of the matrix sums to $1$. Such a matrix is called the state transition matrix.
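For concreteness, a finite state transition matrix can be stored as a 2-D array whose rows are probability distributions over successor states; the probabilities below are made up purely for illustration:

```python
import numpy as np

# Illustrative 3-state transition matrix (rows: current state, columns: next state).
# These probabilities are placeholders for demonstration only.
P = np.array([
    [0.1, 0.6, 0.3],
    [0.0, 0.5, 0.5],
    [0.2, 0.2, 0.6],
])

# Each row must be a valid probability distribution over successor states.
assert np.allclose(P.sum(axis=1), 1.0)

# P[s, s'] is the probability of moving from state s to state s'.
print(P[0, 1])  # P_{12} = 0.6
```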


$\color{blue}{\mathbf{Definition.}}$ Markov Process
A Markov process, or Markov chain, is a memoryless random process that can be fully represented by a tuple $\left\langle \mathcal{S}, \mathcal{P} \right\rangle$ where
  1. $\mathcal{S}$ is a (finite) state space
  2. $\mathcal{P}$ is a state transition probability matrix $$ \begin{aligned} \mathcal{P}_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s] \end{aligned} $$


Markov Process in graph

A Markov process can be schematized as a graph with states (nodes) and transition probabilities (edges, arrows):

image

$\mathbf{Fig\ 1.}$ Graphical representation of Markov chain (source: [2])


In this case, the state transition matrix $\mathcal{P}$ is

image


Note that the above process ends when it reaches the Sleep state, which is called a terminal state. It is usually denoted by a square node instead of a circle. There are two types of reinforcement learning tasks:

  • Episodic Tasks: tasks that have a terminal state

  • Continuing Tasks: tasks that never end (no terminal state)

So, the above process is an episodic task. Each sequence of states from the initial state to the terminal state is called an episode. For instance, Class 1 - Class 2 - Class 3 - Pass - Sleep is one example.
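As a small sketch, an episode can be generated by repeatedly sampling the next state from the current row of $\mathcal{P}$ until the terminal state is reached. The state names follow $\mathbf{Fig\ 1}$, but the transition probabilities in this snippet are placeholders rather than the exact values of the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
terminal = "Sleep"

# Placeholder transition probabilities over the states of Fig 1 (each row sums to 1);
# the actual values from the figure would go here.
P = {
    "Class1":   {"Class2": 0.5, "Facebook": 0.5},
    "Class2":   {"Class3": 0.8, "Sleep": 0.2},
    "Class3":   {"Pass": 0.6, "Pub": 0.4},
    "Pass":     {"Sleep": 1.0},
    "Pub":      {"Class1": 0.2, "Class2": 0.4, "Class3": 0.4},
    "Facebook": {"Facebook": 0.9, "Class1": 0.1},
}

def sample_episode(start="Class1"):
    """Sample one episode (a sequence of states) until the terminal state is reached."""
    s, episode = start, [start]
    while s != terminal:
        nxt, probs = zip(*P[s].items())
        s = str(rng.choice(nxt, p=probs))   # draw the next state from the current row
        episode.append(s)
    return episode

print(sample_episode())  # e.g. ['Class1', 'Class2', 'Class3', 'Pass', 'Sleep']
```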

Markov Reward Process

To utilize such a Markov process for an RL problem, there must be values or rewards for the agent. We extend the Markov process to a Markov reward process by adding rewards to the tuple:

$\color{blue}{\mathbf{Definition.}}$ Markov Reward Process
A Markov Reward Process is a tuple $\left\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \right\rangle$ where
  1. $\mathcal{S}$ is a (finite) state space
  2. $\mathcal{P}$ is a state transition probability matrix $$ \begin{aligned} \mathcal{P}_{ss^\prime} = P[S_{t+1} = s^\prime | S_t = s] \end{aligned} $$
  3. $\mathcal{R}$ is a reward function $$ \begin{aligned} \mathcal{R}_{s} = \mathbb{E} [R_{t+1} | S_t = s] \end{aligned} $$
  4. $\gamma \in [0, 1]$ is a discount factor.


Return and Value Function

Recall that the goal of an RL agent is to maximize the expected cumulative reward. To compute the cumulative reward, future rewards are weighted by the discount factor $\gamma \in [0, 1]$, and the resulting sum is called the return.

$\color{blue}{\mathbf{Definition.}}$ Return and Value function
The return $G_t$ is the total discounted reward from timestep $t$: $$ \begin{aligned} G_t = R_{t + 1} + \gamma R_{t+2} + \cdots = \sum_{k = 0}^\infty \gamma^k R_{t + k + 1} \end{aligned} $$ And the value function $v(s)$ is defined as the expected return from the current state: $$ \begin{aligned} v(s) = \mathbb{E}[G_t \; | \; S_t = s] \end{aligned} $$


Why is a discount factor needed?

Most Markov reward and decision processes are discounted. There are several reasons:

  • Mathematical convenience and tractability (avoids a divergent infinite sum of rewards)
  • Preference for immediate reward over delayed reward
  • etc.

Markov Reward Process in graph

Graphically, $\mathbf{Fig\ 1}$ can be extended to

image

$\mathbf{Fig\ 2.}$ Markov reward process (source: [2])


Starting from $S_1 = \text{C1}$ with $\gamma = \frac{1}{2}$, sample returns $G_1$ can be computed as:

\[\begin{array}{l|lll} \text { C1 C2 C3 Pass Sleep } & G_1=-2-2 * \frac{1}{2}-2 * \frac{1}{4}+10 * \frac{1}{8} & = & -2.25 \\ \text { C1 FB FB C1 C2 Sleep } & G_1=-2-1 * \frac{1}{2}-1 * \frac{1}{4}-2 * \frac{1}{8}-2 * \frac{1}{16} & = & -3.125 \\ \text { C1 C2 C3 Pub C2 C3 Pass Sleep } & G_1=-2-2 * \frac{1}{2}-2 * \frac{1}{4}+1 * \frac{1}{8}-2 * \frac{1}{16} \cdots & = & -3.41 \\ \text { C1 FB FB C1 C2 C3 Pub C1 } \ldots & G_1=-2-1 * \frac{1}{2}-1 * \frac{1}{4}-2 * \frac{1}{8}-2 * \frac{1}{16} \cdots & = & -3.20 \end{array}\]
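As a sanity check, the first two returns above can be reproduced directly from the definition of $G_t$, using the per-state rewards implied by the coefficients in the table ($-2$ for each class, $-1$ for FB, $+1$ for Pub, $+10$ for Pass, $0$ at Sleep); a minimal sketch:

```python
# Rewards received upon leaving each state, as implied by the coefficients above.
reward = {"C1": -2, "C2": -2, "C3": -2, "Pass": +10, "Pub": +1, "FB": -1, "Sleep": 0}

def discounted_return(episode, gamma=0.5):
    """G_1 = R_2 + gamma * R_3 + gamma^2 * R_4 + ... for one sampled episode."""
    return sum(gamma ** k * reward[s] for k, s in enumerate(episode))

print(discounted_return(["C1", "C2", "C3", "Pass", "Sleep"]))      # -2.25
print(discounted_return(["C1", "FB", "FB", "C1", "C2", "Sleep"]))  # -3.125
```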


Bellman Equation

A fundamental property of value functions used throughout RL is that they satisfy particular recursive relationships. According to the Bellman equation, the value function can be decomposed into two parts: the immediate reward $R_{t + 1}$ and the discounted value of the successor state $\gamma v (S_{t+1})$.

\[\begin{aligned} v(s) & =\mathbb{E}\left[G_t \mid S_t=s\right] \\ & =\mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots \mid S_t=s\right] \\ & =\mathbb{E}\left[R_{t+1}+\gamma\left(R_{t+2}+\gamma R_{t+3}+\ldots\right) \mid S_t=s\right] \\ & =\mathbb{E}\left[R_{t+1}+\gamma G_{t+1} \mid S_t=s\right] \\ & =\mathbb{E}\left[R_{t+1}+\gamma v\left(S_{t+1}\right) \mid S_t=s\right] \end{aligned}\]


This recursive relation allows us to visualize the value function with a backup diagram:

image

$\mathbf{Fig\ 3.}$ Backup diagram for Bellman equation in MRP

which summarizes the following equation:

\[\begin{aligned} v(s) = \mathcal{R}_s+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}} v\left(s^{\prime}\right) \text{ where } \mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s] \end{aligned}\]

It is called a backup diagram because it diagrams the update, or backup, operations that transfer value information back to a state (or a state-action pair) from its successor states (or state-action pairs). (Note that time always flows downward in a backup diagram.) In other words, we proceed backward from future states to the current state, using their values to compute the current value function.


Now, through the Bellman equation, we can calculate the value function more easily. Consider the MRP in $\mathbf{Fig\ 2}$ again:

image

$\mathbf{Fig\ 4.}$ Value function values for MRP (source: [2])

Note that the Bellman equation can be expressed concisely in matrix form,

\[\begin{aligned} v = \mathcal{R} + \gamma \mathcal{P} v \end{aligned}\]

where $v$ is a column vector with one entry per state:

\[\begin{aligned} \left[\begin{array}{c} v(1) \\ \vdots \\ v(n) \end{array}\right]=\left[\begin{array}{c} \mathcal{R}_1 \\ \vdots \\ \mathcal{R}_n \end{array}\right]+\gamma\left[\begin{array}{ccc} \mathcal{P}_{11} & \ldots & \mathcal{P}_{1 n} \\ \vdots & & \vdots \\ \mathcal{P}_{n 1} & \ldots & \mathcal{P}_{n n} \end{array}\right]\left[\begin{array}{c} v(1) \\ \vdots \\ v(n) \end{array}\right] \end{aligned}\]


Numerically, this can be solved directly by matrix inversion: $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$. However, it is expensive: the computational complexity is $O(n^3)$ for $n$ states. For large MRPs we will therefore later see iterative methods, for instance Dynamic Programming, Monte-Carlo evaluation, and Temporal-Difference learning.
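Here is a minimal sketch of the direct solution in NumPy, for a small hypothetical MRP whose transition matrix and rewards are invented for illustration (in practice `np.linalg.solve` is preferred over forming the inverse explicitly):

```python
import numpy as np

# Hypothetical 3-state MRP (all values are for illustration only).
P = np.array([[0.0, 0.9, 0.1],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])   # state 3 is absorbing (terminal)
R = np.array([-1.0, -2.0, 0.0])   # expected immediate reward R_s per state
gamma = 0.9

# Solve (I - gamma * P) v = R instead of explicitly inverting the matrix.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)   # value function, one entry per state
```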


Markov Decision Process

Based on the environment of a Markov reward process, the agent is now able to decide on actions. A Markov decision process (MDP) is a Markov reward process with decisions.

$\color{blue}{\mathbf{Definition.}}$ Markov Decision Process
A Markov Decision Process is a tuple $\left\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \right\rangle$ where
  1. $\mathcal{S}$ is a (finite) state space
  2. $\mathcal{A}$ is a (finite) action space
  3. $\mathcal{P}$ is a state transition probability matrix $$ \begin{aligned} \mathcal{P}_{ss^\prime}^a = P[S_{t+1} = s^\prime | S_t = s, A_t = a] \end{aligned} $$
  4. $\mathcal{R}$ is a reward function $$ \begin{aligned} \mathcal{R}_{s}^a = \mathbb{E} [R_{t+1} | S_t = s, A_t = a] \end{aligned} $$
  5. $\gamma \in [0, 1]$ is a discount factor.


The assumption of finite state and action spaces mathematically simplifies the concepts, allowing us to work with sums and probabilities rather than integrals and probability densities. Of course, the arguments can easily be extended to the continuous case.

Policy

Then, how does the agent decide how to act in the environment? In reinforcement learning, a policy defines which action to perform in a particular state.

$\color{blue}{\mathbf{Definition.}}$ Policy
A policy $\pi$ is a distribution over actions given states, $$ \begin{aligned} \pi(a | s) = P(A_t = a | S_t = s) \end{aligned} $$


A policy fully defines the behaviour of an agent. Policies of an MDP depend only on the current state (not the history), i.e., policies are stationary (time-independent): for all $t > 0$,

\[\begin{aligned} A_t \sim \pi (\cdot | S_t) \end{aligned}\]

So, given an MDP \(\mathcal{M} = \left\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \right\rangle\) and a policy $\pi$, the induced processes are still Markov (see the computational sketch after this list), i.e.,

  • $S_1, S_2, \cdots$ is a Markov process \(\left\langle \mathcal{S}, \mathcal{P}^\pi \right\rangle\)
  • $S_1, R_2, S_2, R_3, \cdots$ is a Markov reward process \(\left\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \right\rangle\) where
\[\begin{aligned} \mathcal{P}_{s, s^\prime}^\pi & = \sum_{a \in \mathcal{A}} \pi (a | s) \mathcal{P}_{ss^\prime}^a \\ \mathcal{R}_{s}^\pi & = \sum_{a \in \mathcal{A}} \pi (a | s) \mathcal{R}_{s}^a \end{aligned}\]
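In array form, these averages over the policy reduce to simple tensor contractions. A minimal computational sketch, where all shapes and numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 3, 2                                    # illustrative sizes

# P_a[a, s, s'] = P^a_{ss'} and R_a[s, a] = R^a_s, filled with random placeholders.
P_a = rng.dirichlet(np.ones(nS), size=(nA, nS))  # each (a, s) row sums to 1
R_a = rng.normal(size=(nS, nA))

# A stochastic policy: pi[s, a] = pi(a | s).
pi = rng.dirichlet(np.ones(nA), size=nS)

# P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},   R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum("sa,ast->st", pi, P_a)
R_pi = (pi * R_a).sum(axis=1)

assert np.allclose(P_pi.sum(axis=1), 1.0)        # P^pi is again a stochastic matrix
```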


State and Action-value functions

Since the successor state depends on both the current state and the action in an MDP, two value functions can be defined:

$\color{blue}{\mathbf{Definition.}}$ State-value function and Action-value function
The state-value function $v_\pi (s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$: $$ \begin{aligned} v_\pi (s) = \mathbb{E}_\pi [G_t \; | \; S_t = s] \end{aligned} $$ And the action-value function $q_\pi (s, a)$ is the value of taking action $a$ in state $s$: the expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $$ \begin{aligned} q_\pi (s, a) = \mathbb{E}_\pi [G_t \; | \; S_t = s, A_t = a] \end{aligned} $$


Optimal Value function and Policy

$\color{blue}{\mathbf{Definition.}}$ Optimal value functions
The optimal state-value function $v_\star (s)$ is the maximum value function over all policies: $$ \begin{aligned} v_\star (s) = \underset{\pi \in \mathcal{\Pi}}{\text{max }} v_\pi (s) \end{aligned} $$ The optimal action-value function $q_\star (s, a)$ is the maximum action-value function over all policies: $$ \begin{aligned} q_\star (s, a) = \underset{\pi \in \mathcal{\Pi}}{\text{max }} q_\pi (s, a) \end{aligned} $$


Hence, the optimal value function specifies the best possible performance in the MDP. An optimal policy is then naturally defined as a policy that maximizes the value function. To make this precise, define a partial ordering $\geq$ over policies:

\[\begin{aligned} \pi \geq \pi^\prime \quad \text{ if } \quad v_\pi (s) \geq v_{\pi^\prime} (s) \; \forall s \end{aligned}\]

One may wonder whether an optimal policy exists in every case. Mathematically, there is a theorem that guarantees its existence:

$\color{red}{\mathbf{Theorem.}}$ Existence of optimal policy
For any Markov Decision Process,
  1. There exists an optimal policy $\pi_\star$ that is better than or equal to all other policies, i.e., $\pi_\star \geq \pi \; \text{ for all } \pi$
  2. All optimal policies achieve the optimal value function, i.e., $v_{\pi_\star} (s) = v_\star (s)$
  3. All optimal policies achieve the optimal action-value function, i.e., $q_{\pi_\star} (s, a) = q_\star (s, a)$


And, an optimal policy can be found by maximizing over $q_\star(s, a)$:

\[\begin{aligned} \pi_*(a \mid s)= \begin{cases}1 & \text { if } a=\underset{a \in \mathcal{A}}{\operatorname{arg max}} q_*(s, a) \\ 0 & \text { otherwise }\end{cases} \end{aligned}\]

Thus, there is always a deterministic optimal policy for any MDP.
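For instance, given a table of (approximate) optimal action values, the corresponding deterministic greedy policy is just an arg max per state; the table below is a made-up placeholder:

```python
import numpy as np

# Placeholder table q_star[s, a] of optimal action values (3 states, 2 actions).
q_star = np.array([[ 1.0, 0.5],
                   [-0.2, 0.3],
                   [ 2.0, 2.5]])

greedy_action = q_star.argmax(axis=1)          # one greedy action per state
print(greedy_action)                           # [0 1 1]

# As a distribution pi_*(a|s): probability 1 on the greedy action, 0 elsewhere.
pi_star = np.eye(q_star.shape[1])[greedy_action]
print(pi_star)
```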


Bellman Expectation Equations

Based on the definition of the value functions, the state-value and action-value functions can again be decomposed into the immediate reward plus the discounted value of the successor state (or state-action pair):

\[\begin{aligned} v_\pi (s) & = \mathbb{E}_\pi [R_{t+1} + \gamma v_\pi (S_{t+1}) \; | \; S_t = s] \\ q_\pi (s, a) & = \mathbb{E}_\pi [R_{t+1} + \gamma q_\pi (S_{t+1}, A_{t+1}) \; | \; S_t = s, A_t = a] \end{aligned}\]

Note that we assumed finite state and action spaces. Hence, from the original definition, for $v_\pi$ we have

\[\begin{aligned} v_\pi (s) & = \mathbb{E}_\pi [G_t \; | \; S_t = s] \\ & = \int g \; P(G_t = g \; | \; S_t = s) \; dg \\ & = \int g \; \frac{P(G_t = g \cap S_t = s)}{P(S_t = s)} \; dg \\ & = \int \sum_{a \in \mathcal{A}} g \; \frac{P(G_t = g \cap S_t = s \cap A_t = a)}{P(S_t = s)} \; dg \\ & = \int \sum_{a \in \mathcal{A}} g \; \frac{P(G_t = g \cap S_t = s \cap A_t = a)}{P(S_t = s \cap A_t = a)} \; \frac{P(S_t = s \cap A_t = a)}{P(S_t = s)} \; dg \\ & = \sum_{a \in \mathcal{A}} \pi(a \; | \; s) \int g \; P(G_t = g \; | \; S_t = s, A_t = a) \; dg \\ & = \sum_{a \in \mathcal{A}} \pi(a \; | \; s) \; \mathbb{E}_\pi [G_t \; | \; S_t = s, A_t = a] = \sum_{a \in \mathcal{A}} \pi(a \; | \; s) \; q_\pi (s, a). \end{aligned}\]

by the linearity of the integral. As with the Bellman equation in an MRP, this can be represented as a backup diagram:

\[\begin{aligned} v_{\pi} (s) = \sum_{a \in \mathcal{A}} \pi(a \; | \; s) \; q_\pi (s, a) \end{aligned}\]

image

$\mathbf{Fig\ 5.}$ Backup diagram of Bellman expectation equation for $v_\pi$


In the case of $q_\pi$,

\[\begin{aligned} q_\pi (s, a) & = \mathbb{E}_\pi [G_t \; | \; S_t = s, A_t = a] \\ & = \mathbb{E}_\pi [R_{t+1} \; | \; S_t = s, A_t = a] + \gamma \; \mathbb{E}_\pi [G_{t + 1} \; | \; S_t = s, A_t = a] \\ & = \mathcal{R}_s^a + \gamma \int g \; P(G_{t+1} = g \; | \; S_t = s, A_t = a) \; dg \\ & = \mathcal{R}_s^a + \gamma \int g \; \frac{P(G_{t+1} = g \cap S_t = s, A_t = a)}{P(S_t = s, A_t = a)} \; dg \\ & = \mathcal{R}_s^a + \gamma \int \sum_{s^\prime \in \mathcal{S}} g \; \frac{P(G_{t+1} = g \cap (S_t = s, A_t = a) \cap S_{t+1} = s^\prime)}{P(S_t = s, A_t = a)} \; dg \\ & = \mathcal{R}_s^a + \gamma \int \sum_{s^\prime \in \mathcal{S}} g \; \frac{P(G_{t+1} = g \cap (S_t = s, A_t = a, S_{t+1} = s^\prime))}{P(S_t = s, A_t = a, S_{t+1} = s^\prime)} \; P(S_{t+1} = s^\prime \; | \; S_t = s, A_t = a) \; dg \\ & = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \int g \; P(G_{t+1} = g \; | \; S_t = s, A_t = a, S_{t+1} = s^\prime) \; P(S_{t+1} = s^\prime \; | \; S_t = s, A_t = a) \; dg \\ & = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \; \mathbb{E}_\pi [G_{t+1} \; | \; S_{t+1} = s^\prime] = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \; v_\pi (s^\prime). \end{aligned}\]

where the last line follows from the finite state space assumption and the Markov property.

\[\begin{aligned} q_{\pi} (s, a) = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \; v_{\pi} (s^\prime) \end{aligned}\]

image

$\mathbf{Fig\ 6.}$ Backup diagram of Bellman expectation equation for $q_\pi$


Notice that it is not convenient to deal with these mutually recursive relations. Hence, by substituting one into the other, we finally obtain

\[\begin{aligned} v_\pi(s) & =\sum_{a \in \mathcal{A}} \pi(a \mid s)\left(\mathcal{R}_s^a+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}}^a v_\pi\left(s^{\prime}\right)\right) \\ q_\pi(s, a) & =\mathcal{R}_s^a+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}}^a \sum_{a^{\prime} \in \mathcal{A}} \pi\left(a^{\prime} \mid s^{\prime}\right) q_\pi\left(s^{\prime}, a^{\prime}\right) \end{aligned}\]

image

$\mathbf{Fig\ 7.}$ Backup diagram of Bellman expectation equations
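These combined equations translate directly into iterative policy evaluation: repeatedly apply the right-hand side of the $v_\pi$ equation as an update until the values stop changing. A minimal sketch, reusing the illustrative array conventions (`P_a`, `R_a`, `pi`) from the earlier snippet:

```python
import numpy as np

def policy_evaluation(P_a, R_a, pi, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- sum_a pi(a|s) [R^a_s + gamma * sum_s' P^a_{ss'} v(s')]."""
    nS = R_a.shape[0]
    v = np.zeros(nS)
    while True:
        # q[s, a] = R^a_s + gamma * sum_s' P^a_{ss'} v(s')
        q = R_a + gamma * np.einsum("ast,t->sa", P_a, v)
        v_new = (pi * q).sum(axis=1)          # average over the policy
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```

With the placeholder arrays defined earlier, `policy_evaluation(P_a, R_a, pi)` returns a vector with one entry of $v_\pi$ per state; the iteration converges for $\gamma < 1$ because the update is a contraction.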



Bellman Optimality Equations

The goal in an MDP is to find the optimal policy $\pi_\star$ that attains the optimal value functions $v_\star (s)$ and $q_\star (s, a)$.

Note that the optimal policy $\pi_\star$ puts all probability on $\underset{a \in \mathcal{A}}{\operatorname{arg max}} \; q_{\pi_\star} (s, a)$ in each state $s$. Hence, from $v_{\pi_\star} (s) = \sum_{a \in \mathcal{A}} \pi_\star (a | s) \; q_{\pi_\star} (s, a)$, it follows that $v_\star (s) = \underset{a \in \mathcal{A}}{\text{max }} q_\star (s, a)$.

Hence, the optimal value functions are recursively related by the Bellman optimality equations:

\[\begin{aligned} v_\star (s) & = \underset{a \in \mathcal{A}}{\text{max }} q_\star (s, a) \\ q_\star (s, a) & = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a v_\star (s^\prime) \end{aligned}\]

image

$\mathbf{Fig\ 8.}$ Backup diagram of Bellman optimality equations



Again, by substituting each other, we obtain

\[\begin{aligned} v_\star (s) & = \underset{a \in \mathcal{A}}{\text{max }} [\mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a v_\star (s^\prime)] \\ q_\star (s, a) & = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \underset{a^\prime \in \mathcal{A}}{\text{ max }} q_\star (s^\prime, a^\prime) \end{aligned}\]

image

$\mathbf{Fig\ 9.}$ Backup diagram of Bellman optimality equations



The Bellman optimality equation is non-linear, so in general it has no closed-form solution. Instead, many iterative solution methods, such as Value Iteration, Policy Iteration, Q-learning, and SARSA, have been developed to solve the optimality equations.
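As one example, value iteration repeatedly applies the Bellman optimality backup $v(s) \leftarrow \max_a \left[ \mathcal{R}_s^a + \gamma \sum_{s^\prime} \mathcal{P}_{ss^\prime}^a v(s^\prime) \right]$. A minimal sketch under the same illustrative array conventions as before:

```python
import numpy as np

def value_iteration(P_a, R_a, gamma=0.9, tol=1e-8):
    """Return (v_star, greedy policy) for an MDP given as arrays
    P_a[a, s, s'] = P^a_{ss'} and R_a[s, a] = R^a_s."""
    nS = R_a.shape[0]
    v = np.zeros(nS)
    while True:
        q = R_a + gamma * np.einsum("ast,t->sa", P_a, v)   # Bellman optimality backup
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)                 # v_star and a greedy policy
        v = v_new
```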

image

$\mathbf{Fig\ 10.}$ Summary of Bellman equations in MDP


$\mathbf{Note.}$ Extension to stochastic reward

The above discussion is based on [2], which assumes a deterministic reward. In general, however, the reward is not always deterministic; it may be stochastic. Hence, [1] develops a similar discussion based on a stochastic reward.

Suppose that, given any state $s$ and action $a$, the probability of each possible pair of next state $s^\prime$ and reward $r$ is denoted

\[\begin{aligned} p(s^\prime, r | s, a) = P \{ S_{t+1} = s^\prime, R_{t+1} = r | S_t = s, A_t = a \}. \end{aligned}\]

which completely specifies the dynamics of a finite MDP. From it, we can compute the expected reward for state-action pairs,

\[\begin{aligned} r(s, a) = \mathbb{E} [R_{t+1} | S_t = s, A_t = a] = \sum_{r \in \mathcal{R}} r \sum_{s^\prime \in \mathcal{S}} p(s^\prime, r | s, a) \end{aligned}\]

the state-transition probabilities

\[\begin{aligned} p(s^\prime | s, a) = P \{ S_{t+1} = s^\prime | S_t = s, A_t = a \} = \sum_{r \in \mathcal{R}} p (s^\prime, r | s, a) \end{aligned}\]

and the expected rewards for state–action–next-state triples,

\[\begin{aligned} r(s, a, s^\prime) = \mathbb{E}[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s^\prime] = \frac{\sum_{r \in \mathcal{R}} r \; p(s^\prime, r | s, a)}{p(s^\prime | s, a)}. \end{aligned}\]
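These three quantities are just marginalizations and expectations over the joint dynamics $p(s^\prime, r \mid s, a)$. A sketch, storing $p$ as a 4-D array indexed by (state, action, next state, reward index) together with a vector of possible reward values; this layout and all numbers are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, nR = 3, 2, 4                        # illustrative sizes
rewards = np.array([-1.0, 0.0, 1.0, 10.0])  # the possible reward values

# p[s, a, b, i] = P{S_{t+1}=b, R_{t+1}=rewards[i] | S_t=s, A_t=a}  (random placeholder)
p = rng.dirichlet(np.ones(nS * nR), size=(nS, nA)).reshape(nS, nA, nS, nR)

# r(s, a)      = sum_{s', r} r * p(s', r | s, a)
r_sa = np.einsum("sabr,r->sa", p, rewards)

# p(s' | s, a) = sum_r p(s', r | s, a)
p_sas = p.sum(axis=3)

# r(s, a, s')  = sum_r r * p(s', r | s, a) / p(s' | s, a)
r_sas = np.einsum("sabr,r->sab", p, rewards) / p_sas
```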

Then, the Bellman expectation equation can be derived as follows:

\[\begin{aligned} v_\pi (s) & = \mathbb{E}_\pi [G_t | S_t = s] \\ & = \mathbb{E}_\pi [ \sum_{k=0}^\infty \gamma^k R_{t + 1 + k} | S_t = s] \\ & = \mathbb{E}_\pi [ R_{t+1} + \gamma \sum_{k=0}^\infty \gamma^k R_{t + 2 + k} | S_t = s] \\ & = \mathbb{E}_\pi [R_{t+1} | S_t = s] + \mathbb{E}_\pi [ \gamma \sum_{k=0}^\infty \gamma^k R_{t + 2 + k} | S_t = s] \end{aligned}\]

For the first term,

\[\begin{aligned} \mathbb{E}_\pi [R_{t+1} | S_t = s] & = \sum_{r \in \mathcal{R}} r \cdot p (r | s) \\ & = \sum_{r \in \mathcal{R}} r \cdot \sum_{s^\prime \in \mathcal{S}} p (r, s^\prime | s) \\ & = \sum_{s^\prime \in \mathcal{S}} \sum_{r \in \mathcal{R}} r \cdot p (r, s^\prime | s) \\ & = \sum_{a \in \mathcal{A}} \sum_{s^\prime \in \mathcal{S}} \sum_{r \in \mathcal{R}} r \cdot p (r, s^\prime, a | s) \\ & = \sum_{a \in \mathcal{A}} \sum_{s^\prime \in \mathcal{S}} \sum_{r \in \mathcal{R}} r \cdot p (r, s^\prime | s, a) \cdot \pi(a | s) \\ & = \sum_{a \in \mathcal{A}} \pi(a | s) \sum_{s^\prime \in \mathcal{S}} \sum_{r \in \mathcal{R}} r \cdot p (r, s^\prime | s, a) \end{aligned}\]

Note that, due to the Markov property, $p(g | s^\prime, r, a, s) = p(g | s^\prime)$. Then the second term is

\[\begin{aligned} \mathbb{E}_\pi [ \gamma \sum_{k=0}^\infty \gamma^k R_{t + 2 + k} | S_t = s] & = \mathbb{E}_\pi [ \gamma G_{t + 1} | S_t = s] \\ & = \gamma \sum_{g \in \mathcal{G}} g \cdot p (g | s) \\ & = \gamma \sum_{r \in \mathcal{R}} \sum_{a \in \mathcal{A}} \sum_{s^\prime \in \mathcal{S}} \sum_{g \in \mathcal{G}} g \cdot p (g | s^\prime, r, s, a) \cdot p(s^\prime, r | s, a) \cdot \pi (a | s) \\ & = \gamma \sum_{r \in \mathcal{R}} \sum_{a \in \mathcal{A}} \sum_{s^\prime \in \mathcal{S}} \sum_{g \in \mathcal{G}} g \cdot p(g | s^\prime) \cdot p(s^\prime, r | s, a) \cdot \pi (a | s) \\ & = \gamma \sum_{r \in \mathcal{R}} \sum_{a \in \mathcal{A}} \sum_{s^\prime \in \mathcal{S}} v_\pi (s^\prime) \cdot p(s^\prime, r | s, a) \cdot \pi (a | s) \\ \end{aligned}\]

Thus, $v_\pi (s) = \sum_{a \in \mathcal{A}} \pi(a | s) \sum_{s^\prime \in \mathcal{S}} \sum_{r \in \mathcal{R}} p (s^\prime, r | s, a) \left[ r + \gamma v_\pi (s^\prime) \right].$

Lastly, the Bellman optimality equations follow similarly:

\[\begin{aligned} v_* (s) & = \underset{a \in \mathcal{A}(s)}{\text{ max }} q_{\pi_*} (s, a) \\ & = \underset{a}{\text{ max }} \mathbb{E}_{\pi_*} [G_t | S_t = s, A_t = a] \\ & = \underset{a}{\text{ max }} \mathbb{E}_{\pi_*} [\sum_{k=0}^\infty \gamma^k R_{t + k + 1} | S_t = s, A_t = a] \\ & = \underset{a}{\text{ max }} \mathbb{E}_{\pi_*} [R_{t + 1} + \gamma \sum_{k=0}^\infty \gamma^k R_{t + k + 2} | S_t = s, A_t = a] \\ & = \underset{a}{\text{ max }} \mathbb{E}_{\pi_*} [R_{t + 1} + \gamma v_* (S_{t+1}) | S_t = s, A_t = a] \\ &= \underset{a}{\text{ max }} \sum_{s^\prime \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s^\prime, r | s, a) \cdot [ r + \gamma v_* (s^\prime) ]. \end{aligned}\]

and

\[\begin{aligned} q_* (s, a) & = \mathbb{E} [R_{t+1} + \gamma \underset{a^\prime} {\text{ max }} q_* (S_{t+1}, a^\prime) | S_t = s, A_t = a] \\ & = \sum_{s^\prime, r} p(s^\prime, r | s, a) \cdot [r + \gamma \underset{a^\prime}{\text{ max }} q_* (s^\prime, a^\prime)] \end{aligned}\]
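In this four-argument form, a single Bellman optimality backup for $v_*$ becomes one tensor contraction followed by a max over actions; continuing the illustrative layout from the previous sketch:

```python
import numpy as np

def optimality_backup(p, rewards, v, gamma=0.9):
    """One sweep of v(s) <- max_a sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))."""
    # target[s', i] = rewards[i] + gamma * v(s') for every (next state, reward) pair
    target = rewards[None, :] + gamma * v[:, None]
    q = np.einsum("sabr,br->sa", p, target)   # q(s, a) under the current v
    return q.max(axis=1)

# Iterating this backup from v = 0 converges to v_* (value iteration).
```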





Reference

[1] Reinforcement Learning: An Introduction. Richard S. Sutton and Andrew G. Barto, The MIT Press.
[2] Introduction to Reinforcement Learning with David Silver
[3] Introduction to Reinforcement Learning by Maël Fabien
[4] A (Long) Peek into Reinforcement Learning
