4 Basics of Reinforcement Learning (RL)
In this section, we outline the main ideas behind reinforcement learning and how they can be applied in the context of this thesis. The reader familiar with the material may skip this section.
4.1 A non mathematical, yet delicious example!
In Reinforcement Learning tasks, we are training an agent that interacts with its environment by taking decisions. In this example, we are the agent, and the environment is the kitchen. Suppose we want to cook a delicious meal. At any point in time, we are making decisions such as;
- Which ingredients we use. Do we use tofu or seitan? Do we add spice or more chili pepper? When do we incorporate the sauce?
- Which cookware we use? Cast iron, or non-stick pan?
- Whether to put the oven to \(200^\circ C\) or \(220^\circ C\).
- Or simply do nothing!
All of these decisions, which we will call actions from now on, are taken based on the current state of the environment, that is the cooking process. How we decide which action to take given a current state will be called the policy from now on.
After each action, the cooking process gets to a new state and we taste the meal. By tasting it, we get a reward that depend on how well we did. Maybe the food started to burn in which case we get a negative reward, or maybe we made the food tastier, in which case we get a positive reward. In this example, there is a starting state, where we decide to cook something, and a terminal state, in which we finished cooking and get to enjoy the meal.
But how do we learn how to cook, how do we know what action to take at a specific state? That is, how do we learn the policy? We learn it by getting feedback, which is defined by the reward we get after each action. Some of those rewards are immediate, for example, if we add some spices to our food and it tastes better. We want to have a policy that maximizes the total rewards we get over a entire cooking session. This also mean that we have to balance how we prefer the immediate rewards against the future rewards. For example, adding a spice may make the meal taste better in the short term, for which we get a reward, but it may clash later when we add other ingredients, leading to a worse meal and worse rewards down the line.
Each time we cook, we learn what works and what doesn’t, and remember that for any future time we cook. But, if we want to get better at cooking, we must not just repeat the actions that worked! We also have to take some risks, and explore the potential actions we can take at each state! On the other hand, we still need to rely and exploit what we know. There is a balance to find between exploitation and exploration so as to learn as fast as possible.
4.2 Another example: Leonardo the rabbit
The last example is an intuitive way of thinking of reinforcement learning as similar to the way we animals learn about the world and its processes. The ideas behind reinforcement learning borrow a lot from the fields of psychology and neuroscience[1], and modelling how we learn is a gargantuan task that is, for this very reason, outside of the scope of this thesis!
We turn our attention to a more modest example that is much easier to model, and is an example that one can find in a lot of reinforcement learning books[2], [3]. We consider the case of Leonardo the rabbit. Leonardo, the agent, is situated in the outside world, which is represented as a \(3\times 3\) grid (the environment). He wants to get to the carrot at the bottom right as fast as possible. To help Leonardo get to his meal, we will use reinforcement learning.
4.2.1 States
The first thing we do is give a number to each box in the grid, from \(1\) to \(9\). We call the set of all boxes number as the state set, which we denote by \(\mathcal{S}\). In this example, \(\mathcal{S} = \{1,2,\dots,9\}\) (see Figure fig-gridworld1). A state is defined as any element in the state set, which we denote by \(s\in\mathcal{S}\). The state is the box Leonardo is in.
4.2.2 Actions
Leonardo, in this grid, can move in any 4 directions, that is left, right, up or down. We call this the action set \(\mathcal{A}\), and in this example \(\mathcal{A} = \{\text{left,right,up,down}\}\). An action is defined as any element in the action set, which we denote by \(a\in\mathcal{A}\).
4.2.3 State transitions
At this point, we can introduce a time variable \(t\). The initial time is set to \(t=0\), and, after Leonardo takes an action \(t\) moves forward by \(1\), and he finds himself in a new state. This is what we call a state transition(see Figure fig-gridworld_transition).
We want to keep track of Leonardo positions and actions over time, which is why we denote the state Leonardo is in at time \(t\) by \(S_t\), and the action he takes by \(A_t\). In this example, there is the initial state \(S_0 = 1\).
Remark. \(S_t\) and \(A_t\) are random variables, which is we note in uppercase. Specific observations of \(S_t\) and \(A_t\) will be in lowercase, that is respectively \(s_t\) and \(a_t\).
4.2.4 Policy
Leonardo, as the agent, only has access to his current state \(S_t\). He has to take an action \(A_t\), but how does he know which action to take? To do that, he uses a policy, which we denote by a function \(\pi\). More formally, \(\pi\) is a function that defines the probability of taking the action \(A_t = a\) if the state is \(S_t = s\). We denote this by \(\pi(a|s)\).
Suppose for example that \(S_t = 3\). Leonardo has no idea of where the carrot is, but he knows that he can not go up nor to the right, so his policy is to go down, or right at random. Then:
- The probability to go right is \(\pi(\text{right}, 3) = 0.5\).
- The probability to go down is \(\pi(\text{down}, 3) = 0.5\).
More specifically, for any state \(s\), we define the conditional probability mass function \(\pi(a|s) = \Pr(A_t = a | S_t = s)\), where \(\Pr\) denote a probability. Hence, for any fixed state \(s\), \(\sum_{a\in\mathcal{A}} \pi(a|s) = 1\).
Remark. We will assume that Leonardo only cares about what his current state is to take an action, and not for how long he has been in the grid. This makes the policy independent of the time \(t\).
4.2.5 Rewards
While Leonardo only takes actions by looking at his current state, he still wants to get to the carrot as fast as possible. He knows his current state \(s_t\) and takes the action \(a_t\). Doing so, he ends up in the state \(s_{t+1}\) and he gets a reward.
- The red colored box are difficult to get in, so if he ends up on one of the red colored box, he gets a reward of \(-5\). This is for example the case in Figure fig-gridworld_transition.
- If he ends up on the carrot, he gets a reward of \(+5\).
- If he ends up in any other state, he gets a reward of \(-1\), as he does not want to lose time.
More formally, we denote the reward Leonardo gets after taking the action \(A_t\) from the state \(S_t\) by \(R_{t+1}\). The set of all possible rewards is denoted by \(\mathcal{R}\). Here \(\mathcal{R} = \{-1,5,-5\}\). \(R_t\) is again a random variable and we denote an observation of the reward at time \(t\) by \(r_t\).
4.2.6 State transitions and rewards probabilities
Suppose now that there is a teleporter in the 4th box. This teleporter is however unreliable. Half the time, it teleports whoever steps in the box to the \(9^{th}\) box, meaning Leonardo could potentially get directly to his prize! The other half of the time, however, it teleports the user to the \(7^{th}\) box.
Suppose now that Leonardo is at state \(s_t=1\), he takes the action \(a_t = \text{down}\) to the teleporter(see Figure fig-gridworld_teleporter). Then:
- The next state is \(s_{t+1} = 9\) with probability \(0.5\).
- The next state is \(s_{t+1} = 7\) with probability \(0.5\).
But now, the reward he gets is random too!
- If he end up in the \(9^{th}\) box, \(r_{t+1} = 5\).
- If the teleporter does not work and he ends up in the \(7^{th}\) box, \(r_{t+1} = -1\).
More specifically, this means that state transitions and rewards need to be modelled by a probability, more specifically, the probability of getting a reward \(r\in\mathcal{R}\), and that the next state is \(s'\in\mathcal{S}\) given that the agent takes the action \(a\in\mathcal{A}\) at the state \(s\in\mathcal{S}\). We formalize a state transition probability as the conditional probability defined in the sample space \(\mathcal{S}\times \mathcal{R}\)
\[ p(s',r|s,a) = \Pr(S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a). \]
4.3 Finite Markov decision process
We formalize the above example by defining a Markov decision process (MDP). This definition and the ones up until the end of this chapter are adapted from [3].
Definition 4.1 (Markov decision process). A finite Markov decision process (MDP) is defined as a discrete time process, where we have:
- A finite set of all states \(\mathcal{S}\).
- A finite set of all possible actions \(\mathcal{A}\).
- A reward set \(\mathcal{R}(s,a)\), which contains the potential rewards received after taking any action \(a\in\mathcal{A}\) from any state \(s\in\mathcal{S}\).
We use the notation \(S_t, A_t\) as the state and action of the process at time \(t\). The reward \(R_t\) is the reward received at time t. \(S_t,A_t\) and \(R_t\) are random variables.
A Markov decision process also has a model, which consists of the state and reward transition probabilities:
- The probability, given that the current state is \(s\), and that the action taken is \(a\), that the next state is \(s'\) and the next reward is \(r\). That is \(p(s',r|s,a) = \Pr(S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a)\).
Furthermore, a Markov decision process has a policy that governs, for any state \(s\in\mathcal{S}\), the probability of taking action \(a\in\mathcal{A}\), that probability is \(\pi(a|s) = \Pr(A_t = a|S_t = s)\). We assume that the policy is not dependent on time.
Finally, a Markov decision process has the Markov property, or lack of memory. The state transition and rewards probabilities are only dependent on the current state \(S_t\) and action \(A_t\), and not the states and actions that preceeded. Mathematically, \(\Pr(S_{t+1} = s', R_{t+1} = r|S_t, A_t, S_{t-1}, A_{t-1}, \dots , S_0, A_0) = \Pr(S_{t+1} = s', R_{t+1} = r | S_t, A_t)\).
An example of Markov decision process with two states can be seen in Figure fig-MDP_example.
Remark. The state space \(\mathcal{S}\) and the action space \(\mathcal{A}\) can be finite or not. We only consider the case of finite Markov decision process to make matters easier.
Remark. The model in a Markov decision process is often impossible to define in advance. This problem is remedied by using model free algorithms.
4.4 State Value and Bellman Equation
We have a Markov decision process, which serves as a nice mathematical formalization of an agent and its environment [2]. Now we want to train the agent to make the best possible decisions? Answering this question is the goal of the next sections.
We first define a trajectory. We denote by \(S_t\) the state of an agent at instant \(t\). Then, according to the policy, this agent takes the action \(A_t\). After taking this action, the agent is now at the state \(S_{t+1}\), and it gets the rewards \(R_{t+1}\). Then the agent takes action \(A_{t+1}\), and gets to a new state \(S_{t+2}\) with reward \(R_{t+2}\). This continues indefinitely. We define the trajectory of an agent with starting state \(S_t = s_t\) as the chain of states, actions and rewards from time \(t\) onward:
\[ S_t = s_t,A_t \to R_{t+1},S_{t+1},A_{t+1} \to R_{t+2},S_{t+2},A_{t+2} \to \cdots, \]
Note that, due to the Markov property and the fact that we assume the policy is time independent, the starting value of \(t\) is not important.
Remark. In some environments, it is natural for the agent to have a task that has a starting state and a finishing states (for example, beginning a cooking session and finishing it, or starting a game and winning/losing at it.) We call these tasks episodic tasks and in these cases, a finite trajectory \(S_0,A_0 \to \dots \to S_T\) is also called an episode. In the cases where the task is such that no such state can be defined, a trajectory is not finite and we call these tasks continuing tasks, which will be the case in this thesis.
In reinforcement learning setting, we assume that we have no control of the environment model (for example, one can not change the rules of a game), but that we have control over the agent decisions (i.e the policy) and how we reward that agent. The goal of any reinforcement learning algorithm is thus to define the rewards properly and then to find a policy that maximizes the rewards the agent gets. We now define the discounted return along a trajectory,
Definition 4.2 Let \(t = 0, 1, \dots\). The (discounted) return along the trajectory \(S_t,A_t \to S_{t+1},A_{t+1}, R_{t+1} \to S_{t+2},A_{t+2}, R_{t+2} \to \dots\) is the random variable given by
\[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{+\infty}\gamma^k R_{t+1+k}, \]
where \(\gamma \in [0,1)\) is called the discount rate.
Remark. By setting a discount rate that is less than 1 in continuing tasks, we make sure that the discounted return is well defined in the case of bounded rewards. Indeed, if, for any \(t\), \(|R_t|\leq M\), then \(\sum_{k=0}^{+\infty}|\gamma^k R_{t+1+k}| \leq \sum_{k=0}^{+\infty}\gamma^k M = \frac{M}{1-\gamma}\), so the series is absolutely convergent.
The discounted return is thus the sum of rewards along a trajectory, with a penalty for rewards far in the future, controlled by the discount rate. The discount rate is chosen depending on whether we want the agent to favor short term rewards, in which case a discount rate closer to \(0\) can be chosen, or long term rewards, with a discount rate closer to \(1\).
Since the discounted return is a random variable, we can look at its expectation, in particular, we are interested in its conditional expectation, given a starting state \(S_t = s\). This expectation is called the state value [2].
Definition 4.3 State value The state value of a state \(s\) is the function, defined for any \(s\in\mathcal{S}\) as the conditional expectation of the discounted return, given \(S_t = s\),
\[ v_\pi(s) = E[G_t|S_t = s] = E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots | S_t = s], \]
where \(\pi\) is a given policy.
Remark. Once again, the Markov property and the time independence of the policy mean that the state value does not depend on time.
We remark that
\[\begin{align} G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \nonumber\\ &=R_{t+1} + \gamma \left(R_{t+2} + \gamma R_{t+3}+ \dots \right) \nonumber\\ &=R_{t+1} + \gamma G_{t+1}. \end{align}\]
This expression of the return can be used in conjunction with the definition of the state value above to get
\[ v_\pi(s) = E[G_t|S_t = s]= E[R_{t+1}| S_t = s] + \gamma E[G_{t+1} | S_t = s]. \tag{4.1}\]
The first term is the expectation of immediate reward, following a certain policy \(\pi\), the second is the expectation of future rewards. Let us expand on that formula a bit more. We now make use of the “law of total expectation”:
Theorem 4.1 Let \(X\) and \(Y\) be random variables, and suppose \(E[|Y|]<\infty\). Then \[ E[Y] = E\left[E[Y|X]\right] \]
Using this, the expectation of immediate reward is
\[ E[R_{t+1}| S_t = s] = E\big[E[R_{t+1}|S_t = s,A_t]\big] = \sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}} \sum_{s'\in \mathcal{S}}rp(s',r|s,a). \]
We now develop the second part in the RHS of Equation eq-state_value_part1, and use the law of total expectation again to get
\[ E[G_{t+1} | S_t = s] = E\big[E[G_{t+1} | S_t = s , S_{t+1}]\big] = \sum_{s'\in\mathcal{S}}p(s'|s)E[G_{t+1}|S_t = s, S_{t+1} = s'], \]
where \(p(s'|s) = \sum_{a\in\mathcal{A}} \sum_{r\in\mathcal{R}}p(s',r|s,a)\pi(a|s)\) is the probability of the next state being \(s'\) if the current state is \(s\). Because of the Markov property of the MDP, we can remove the conditioning \(S_t = s\) and thus, \(E[G_{t+1}|S_t = s, S_{t+1} = s'] = E[G_{t+1}|S_{t+1} = s] = v_\pi(s')\). Then \[ E[G_{t+1} | S_t = s] = \sum_{s'\in\mathcal{S}}\sum_{a\in\mathcal{A}}\sum_{r\in\mathcal{R}}v_\pi(s')\pi(a|s)p(s',r|s,a). \tag{4.2}\]
Putting Equation eq-state_value_part1 and Equation eq-state_value_part2 together, we get Bellman’s equation:
\[ v_\pi(s) = \sum_{a\in\mathcal{A}}\sum_{r\in\mathcal{R}}\sum_{s'\in\mathcal{S}}\pi(a|s)p(s',r|s,a)\left[ r + \gamma v_\pi(s')\right]. \tag{4.3}\]
Remark. The Bellman equation depends on the given policy and gives a recursive relation for the state values. Solving this equation is called policy evaluation which involves fixed point iterations (see example below).
Example 4.1 We can directly derive the state values in the MDP in Figure fig-MDP_example. We remark that in this example, given a specific state transition, the reward we get is deterministic, which simplifies the computations.
In particular, for the state \(s_2\), There are two possible actions \(a_0\) and \(a_1\) we can take. The policy is to take action \(a_0\) with a probability \(0.6\), and action \(a_1\) with a probability \(0.4\). When we take for example action \(a_0\), the probability of the next state being \(s_1\) is \(0.3\), in which case the reward is \(5\). Proceeding similarly for all the possible actions and rewards, we get
\[\begin{align*} v_\pi(s_2) &= \sum_{a=0}^1\sum_{r\in\mathcal{R}}\sum_{s'=1}^2\pi(a|s)p(s',r|s_2,a)\left[ r + \gamma v_\pi(s')\right]\\ &= 0.6 \sum_{r\in\mathcal{R}}\sum_{s'=1}^2 p(s',r|s_2,a=0)\left[ r + \gamma v_\pi(s')\right] + 0.4 \sum_{r\in\mathcal{R}}\sum_{s'=1}^2 p(s',r|s_2,a=1)\left[ r + \gamma v_\pi(s')\right]\\ &= 0.6 \left[0.3(5 + \gamma v_\pi(s_1)) + 0.7(-2 + \gamma v_\pi(s_2))\right] + 0.4\left[ 0.2(-2 + \gamma v_\pi(s_2)) + 0.8(5+\gamma v_\pi(s_1) \right]. \end{align*}\]
After some computations, we end up with
\[ v_\pi(s_2) = 1.5 + \gamma (0.5,0.5)\begin{pmatrix} v_\pi(s_1) \\ v_\pi(s_2) \end{pmatrix}. \]
Similarly \(v_\pi(s_1) = 4.1 + \gamma(0.9,0.1)(v_\pi(s_1),v_\pi(s_2))^\intercal\). This leads to the system:
\[ \begin{pmatrix} v_\pi(s_1)\\ v_\pi(s_2) \end{pmatrix} = \begin{pmatrix} 4.1\\ 1.5 \end{pmatrix} + \gamma \begin{pmatrix} 0.9 & 0.1\\ 0.5 & 0.5 \end{pmatrix}\begin{pmatrix} v_\pi(s_1)\\ v_\pi(s_2) \end{pmatrix}. \]
We stop here to remark that this equation is of the form \(v_\pi = r_\pi + \gamma \mathbfit{P}_\pi v_\pi\). \(\mathbfit{P}_\pi\) can be related to a state transition matrix in a markov chain and is row stochastic. Furthermore, since \(\gamma<1\), we motivate solving the equation by using fixed point iterations. This is the main idea behind dynamic programming [4]. In this case, we can simply solve the system directly. For example, with \(\gamma=0.5\), we get the state values \(v_\pi(s_1) = 7.875\), \(v_\pi(s_2) = 4.625\).
4.5 Action Value
The state value gives information about a specific state, however, we are also often interested in knowing how much we stand to gain by taking a particular action at a particular state. This lead to the definition of the action value.
Definition 4.4 Action value The action value is defined as the expectation of discounted return \(G_t\), given a specific action \(a\), taken at the current state \(s\):
\[ q_\pi(a|s) = E\left[G_t|A_t=a,S_t=s\right] = E\left[\sum_|A_t=a,S_t=s\right], \]
where \(G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2R_{t+3} + \dots\).
We also have, from Definition def-state_value, and the law of total expectation,
\[ v_\pi(s) = E[G_t|S_t = s] = E\left[E[G_t|S_t = s, A_t = a]\right]. \]
Then, \[ v_\pi(s) = \sum_{a\in\mathcal{A}}\pi(a|s)E\left[G_t|S_t=s,A_t =a\right], \]
and we can get the relation between state value and action value:
\[ v_\pi(s) = \sum_{a\in\mathcal{A}} \pi(a|s)q_\pi(a|s). \tag{4.4}\]
We remark that by viewing \(\pi(a|s)\) as a probability mass function, we can express the state values as another expectation:
\[ v_\pi(s) = E[q_\pi(a|s)], \]
where \(A\) is random variable with p.m.f \(\pi(a|s)\). Actions values are important in the sense that they tell us of the “value of taking an action over another”, and they appear naturally in almost all reinforcement learning algorithms. One important thing to note is that, by “comparing” Equation eq-state_value_action_relation and Equation eq-Bellman, we get an equivalent definitions of the action values as
\[ q_\pi(a|s) = \sum_{r\in\mathcal{R}}\sum_{s'\in\mathcal{S}}p(s',r|s,a)\left[ r + \gamma v_\pi(s')\right]. \tag{4.5}\]
Equation eq-action_value_to_state_value means that if we have access to the state values, we can compute the action values, while Equation eq-state_value_action_relation works in the opposite way, deriving state values from the action values.
Remark. A more rigorous approach to derive Equation eq-action_value_to_state_value would be similar to how we derive Bellman’s equation.
4.6 Optimal policy and value iteration
Now that we have defined the state values, we want to find a policy that maximizes them, that is, find a policy which we denote by \(\pi^*(a,s)\) such that, for any state \(s\) and for any policy \(\pi(a|s)\), \(v_{\pi^*}(s)\geq v_\pi(s)\). It turns out that not only this optimal policy exist, but that we can find it by repeating the following steps, starting from any policy \(\pi_0\):
- Test the current policy, that is evaluate the state values.
- From these state values, compute the action values.
- Using these action values, set a new and better policy that aim to choose the best actions.
More specifically, we present the pseudo code for the value iteration algorithm.
INPUT:
- An initial policy \(\pi_0\).
- Discount rate \(\gamma\).
- A stopping criterion.
OUTPUT: An approximation of the optimal policy \(\pi^*\), at an arbitrary precision;
i <- 0;
DO:
Compute the state values \(v_{\pi_i}(s)\), using fixed point iterations;
FOR all state \(s\):
Compute, for all \(a\in\mathcal{A}\),the action values \(q_{\pi_i}(a,s)\) using Equation eq-action_value_to_state_value and the computed state values;
Denote by \(a^*\) the action with the best action value \(q_{\pi_i}(a^*,s)\);
Set the new policy \(\pi_{i+1}(a^*,s) = 1\), and set for all the other actions \(\pi_{i+1}(a,s) = 0\);
END FOR
i <- i + 1
UNTIL Stopping criterion is met.
This algorithm is important in the sense that we can prove that it converges to an optimal policy, that maximizes all state values! Unfortunately, this algorithm scales poorly. In Example exm-the_example, we found that computing the state values is equivalent to solving a \(2\times 2\) linear system. In the general case, this system as the same dimensions as the number of states. Depending on the problem, solving this linear system can become prohibitively expensive (for example, there are several orders of magnitude more legal board states in a game of go than atoms in the observable universe [5]), and yet we can design a program that can beat the best human players handily [6]! Nevertheless, the main idea of starting with an initial policy, then getting a better and better policy over time is a fundamental idea in reinforcement learning.