5 Policy Gradient Method

Now that we have access to the main definitions used in RL, we can study the problem we had at the end of chapter 3 through the lens of RL.

The last chapter has been quite long as we introduced reinforcement learning, so as a reminder, we summarize the work we’ve done so far. We have a test problem, the convection diffusion equation Equation eq-convec_diff_steady, which we discretize, and with the following problem parameters:

A parameter \(b \in [0,1]\) in the steady-state convection diffusion equation , and
a discretization parameter \(n\in\mathbb{N}\) defining the number of interior points in the linear grid used to discretize this equation.

We end up with a linear system of the form \(\mathbfit{Mu} = \mathbfit{e}\) (Equation eq-test_problem_scaled_system). To solve this system, we solve the ODE \(\mathbfit{u}'(t) = \mathbfit{e} - \mathbfit{Mu}(t)\) using an explicit Runge-Kutta method with two solver parameters (see sec-RK_solver_test_problem):

\(\Delta t\), the (pseudo) time step, and
\(\alpha\), a parameter specific to the Runge Kutta method used.

We relate this solver to a stationary iterative method of the form \(\mathbfit{u}_{n+1} = \mathbfit{Ku}_n + \mathbfit{Le}\), where \(\mathbfit{K} = \mathbfit{I}-\mathbfit{LM}\). This method is convergent if and only if the spectral radius of \(\mathbfit{K}\) is strictly less than one. We could compute this spectral radius, but this is an computationally intensive task, so we use an approximation. This approximation is the residual ratio after 10 iterations of the Runge-Kutta solver, starting with \(\mathbfit{u}_0 = \mathbfit{e}\).

We define this ratio as \(\rho_{10, b, n}(\Delta t, \alpha)\), a function parametrized by \(b\) and \(n\), with arguments \(\Delta t\) and \(\alpha\). We are faced with the following optimization problem:

For any problem parameters \(b\), \(n\), find the optimal solver parameters \[ (\Delta t^*, \alpha^*) = \arg \min_{\Delta t, \alpha} \rho_{10, b , n}(\Delta t, \alpha). \tag{5.1}\]

Remark. We’ve already seen in sec-small_experiment that the optimal parameters can lead to divergence of the solver once more iterations are computed, which is problematic, so we are perfectly happy to find “good enough, but not optimal” solver parameters where this issue will not happen, hopefully. This issue can be mitigated by computing the residual ratio after more iterations, at the cost of it being more computationally expensive.

5.1 Modelling the problem as a reinforcement learning problem

We are interested in using reinforcement learning to solve the above problem. The last chapter provided an overview of the elements of reinforcement learning, and we can now translate our problem in a RL setting.

Modelling the states

We start by modelling the states. The most natural way of defining the states is to use the problem parameters \(b\) and \(n\). We thus define a specific state as a pair of problem parameters \(s = (b,n) \in [0,1]\times \mathbb{N^*}\).

Modelling the actions and the policy

Once we know a specific state, that is the problem parameters, we need to choose the two solver parameters \(\Delta t\) and \(\alpha\). A specific action is then a pair \(a = (\Delta t, \alpha) \in \mathbb{R^+}\times [0,1]\). The policy is then denoted by \(\pi(a = (\Delta t, \alpha) |s = (b,n))\). We will discuss the policy more in depth in the next chapter.

Modelling the rewards

Once a state-action pair is chosen, the residual ratio \(\rho_{10,b,n}(\Delta t, \alpha)\) is computed. The reward can then be defined as a function of the computed residual ratio,

\[ r = 1 - \rho_{10,b,n}(\Delta t, \alpha). \]

This reward is positive when the residual ratio is less than one, and negative otherwise. This mean that a reinforcement learning agent, which seek to maximize the reward it gets, will aim to minimize the residual ratio.

State transitions

In the definition of a Markov decision process (Definition def-Markov_decision_process), we also have a probabilistic model of the state and rewards transition \(p(s',r|s,a)\). Right away, we can see that this model is difficult to define, as we can not now, for a specific state and action, what reward we will get.

On the other hand, we can still control the state transitions. In this regard, we choose a new state, at random, after an action and reward is computed. More precisely, we choose a new parameter \(b\), uniformly between 0 and 1, and a new parameter \(n\), between \(5\) and \(200\), following a discrete uniform distribution as well. These values for \(n\) are arbitrary, with a maximum of \(200\) to spare us of long computational time when computing the rewards. We also cap the minimum value of \(n\) to an arbitrary minimum of \(5\) as those values are simply too low to get a acceptable discretization error, and we do not want to train an agent to solve for these states.

Other challenges

There are still several challenges that need to be addressed:

In our problem, the State-Action space is continuous. We previously assumed finite spaces.
In the definition of a MDP, we have a model: if we know a specific state and action, we have a probabilistic model of the reward the agent get and the next state of the environment. In our case, we know the model the state transitions, but we have no way of knowing the rewards.

5.2 Dealing with a large state-action space.

In the last chapter, we made the assumption that every space, be it state, action, or reward is finite. This assumption, while practical to derive theoretical results from, is in practice not always followed, as some states may be continuously defined for example.

We take our problem as formulated before. The state is defined as the problem parameters, that is \(b\in[0,1]\) and \(n = 1 , 2, \dots\). Without any adjustment, the state space is of the form \([0,1] \times \mathbb{N}\), and is not finite.

Similarly, the policy is defined by choosing the values \((\alpha,\Delta t) \in [0,1]\times \mathbb{R}^+\), depending on the state. Once again, the action space is continuous.

One approach would be to discretize the entire state \(\times\) action space, and then to apply classical dynamic programming algorithms to get some results. Then, after an optimal policy is found, do some form of interpolation for problem parameters outside of the discretized space. This approach has its own merit, as there are 3 dimensions that need to be discretized, and \(n\) can be chosen within a finite range.

Another approach is to use an approximation function. One way to do that is to approximate the value function \(v(s)\) by some parametrization \(v(s) \approx \hat{v}(s,\omega)\) where \(\omega \in \mathbb{R}^d\) are \(d\) parameters. Such methods are called value based. The method we use in this thesis, on the other hand, use an approximation of the policy function defined as \(\pi(a|s,\mathbfit{\theta})\), where \(\mathbfit{\theta}\in \mathbb{R}^d\) is a parameter vector. Such methods are called policy based. The reason to chose from this class of algorithm is two-fold.

When thinking about the test problem, a straightforward approach is to choose the solver parameters as a linear function of the problem parameters. A policy based approach allow us to do exactly this.
By doing so, we automatically take care of the need to interpolate between discrete states and action, which would be another headache we would have to deal with.

Remark. Approximation is usually done using neural networks. In the case of this thesis, a linear approximation is used.

5.3 Model-based, model-free

One problem we are faced with is the issue of the model of the state and rewards transition, that is

\[ p(s',r|s,a) = \Pr(S_{t+1} = s', R_{t+1} = a | S_t = s, A_t = a). \]

Thankfully, we are dealing with random variables, and with random variables, Monte-Carlo methods follow.

In particular, we often only need to compute the expectations of functions of random variables. This can be done is the following way. Let \(X\) be a random variable and \(x_1, x_2, \dots , x_n\) be independent samples of \(X\). Then, we can estimate \(E[X]\) as the empirical mean of our samples, that is:

\[ \hat{X}_n = \frac{x_1 + \dots + x_n}{n}. \]

5.4 Policy gradient methods

5.4.1 Objective function

Let \(\mathbfit{\theta} \in \mathbb{R}^d\) be a parameter vector and \(\pi(a|s,\mathbfit{\theta}) = p(A_t = a | S_t = s , \mathbfit{\theta})\) an approximate policy that is derivable w.r.t \(\theta\). We want to define an objective function \(J(\mathbfit{\theta})\) that we want to maximize in order to find the best value of \(\mathbfit{\theta}\).

To this end, we make the following assumption, the first one being for simplicity, and the second one being specific to the problem we modelled in the former sections in this chapter.

The states and action set are finite.
The states are uniformly distributed, and so are the state transitions. That is, for any \(s,s'\in\mathcal{S}, \Pr(S_t = s) = 1 / |\mathcal{S}| = \Pr(S_t = s | S_{t-1} = s')\), where \(|\mathcal{S}|\) is the number of element of \(S\). This correspond to the idea of taking a new state at random in our problem.

We define the objective function

\[ J(\mathbfit{\theta}) = \overline{v_\pi(S)} = \frac{1}{|\mathcal{S}|}\sum_{s\in S}v_\pi(s). \tag{5.2}\]

That is, \(J(\mathbfit{\theta})\) is the average, (non weighted, as per assumption) state value.

We want to maximize this objective function by changing the policy parameter \(\theta\). To this end, we use a gradient ascent algorithm of the form

\[ \mathbfit{\theta}_{t+1} = {\mathbfit{\theta}}_t + \alpha \nabla_{\mathbfit{\theta}} J(\mathbfit{\theta}), \tag{5.3}\]

where \(\nabla_{\mathbfit{\theta}}\) represents the gradient operator, w.r.t \(\mathbfit{\theta}\). This gradient is

\[ \nabla_{\mathbfit{\theta}} J(\mathbfit{\theta}) = \frac{1}{|\mathcal{S}|}\sum_{s\in S}\nabla_{\mathbfit{\theta}}v_\pi(s). \tag{5.4}\]

We are faced with the immediate issue that the algorithm requires knowing this gradient.

5.4.2 Policy gradient theorem

We prove, using the aforementioned assumptions a specific case of the policy gradient theorem. This proof is adapted from [1],chap 13.2. We also remind the reader that both the state values and the action values depend on the policy \(\pi\) and thus depend on \(\mathbfit{\theta}\).

From the last chapter, we have the expression of the state values \[ v_\pi(s) = \sum_{a\in\mathcal{A}}\pi(a|s, \mathbfit{\theta})q_\pi(a,s). \]

We take the gradient of \(v_\pi(s)\) w.r.t \(\theta\) to get \[ \nabla_{\mathbfit{\theta}} v_\pi(s) = \sum_{a\in\mathcal{A}}\nabla_{\mathbfit{\theta}}\pi(a|s, {\mathbfit{\theta}})q_\pi(a,s) + \pi(a|s,{\mathbfit{\theta}})\nabla_{\mathbfit{\theta}} q_\pi(a,s). \tag{5.5}\]

We now turn our attention to the \(\nabla q_\pi(a,s)\) term above. We use the expression of the actions value in Equation eq-action_value_to_state_value,

\[ \nabla_{\mathbfit{\theta}} q_\pi(a,s) = \nabla_{\mathbfit{\theta}} \left[ \sum_{r}\sum_{s'} p(s',r|s,a)(r + \gamma v_\pi(s'))\right]. \]

Both \(p(s',r|s,a)\) and the reward \(r\) do not depend on the policy, and therefore not on \(\mathbfit{\theta}\). The gradient is thus

\[ \nabla_{\mathbfit{\theta}} q_\pi(a,s) = \gamma\sum_{s'}\left[\sum_{r}p(s',r|a,s)\right]\nabla_{\mathbfit{\theta}} v_\pi(s'). \]

By the assumption of the state transition probabilities and the law of total probabilities \(\sum_{r}p(s',r|a,s)= 1/|\mathcal{S}|\), and thus

\[ \nabla q_\pi(a,s) = \gamma\sum_{s'}\frac{1}{|\mathcal{S}|}\nabla v_\pi(s'). \]

We recognize the expression of the objective function’s gradient \(\nabla_{\mathbfit{\theta}} J({\mathbfit{\theta}})\) to get \(\nabla_{\mathbfit{\theta}} q_\pi(a,s) = \gamma \nabla_{\mathbfit{\theta}} J({\mathbfit{\theta}})\). We insert this in Equation eq-gradient_wrt_theta and we get

\[ \nabla_{\mathbfit{\theta}} v_\pi(s) = \sum_{a\in\mathcal{A}}\nabla_{\mathbfit{\theta}}\pi(a|s, {\mathbfit{\theta}})q_\pi(a,s) + \gamma\pi(a|s, {\mathbfit{\theta}})\nabla_{\mathbfit{\theta}} J({\mathbfit{\theta}}). \]

Since the policy \(\pi(a|s)\) is a probability over the action space, it sums to \(1\) and we can get the second part of the RHS out of the sum

\[ \nabla_{\mathbfit{\theta}} v_\pi(s) = \gamma J({\mathbfit{\theta}}) + \sum_{a\in\mathcal{A}}\nabla_{\mathbfit{\theta}}\pi(a|s,{\mathbfit{\theta}})q_\pi(a,s). \]

Using \(\nabla J({\mathbfit{\theta}}) = \frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}}\nabla_{\mathbfit{\theta}} v_\pi(s)\), we get

\[\begin{align} \nabla_{\mathbfit{\theta}} J({\mathbfit{\theta}}) &= \frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}}\left[\gamma J({\mathbfit{\theta}}) + \sum_{a\in\mathcal{A}}\nabla_{\mathbfit{\theta}}\pi(a|s, {\mathbfit{\theta}})q_\pi(a,s)\right],\\ &= \gamma \nabla_{\mathbfit{\theta}} J({\mathbfit{\theta}}) + \sum_{s\in\mathcal{S}}\frac{1}{|\mathcal{S}|}\sum_{a\in\mathcal{A}}\nabla_{\mathbfit{\theta}}\pi(a|s, {\mathbfit{\theta}})q_\pi(a,s). \end{align}\]

And after a small rearrangement of the terms,

\[ \nabla_{\mathbfit{\theta}} J(\mathbfit{\theta}) = \frac{1}{1-\gamma}\sum_{s\in \mathcal{S}}\frac{1}{|\mathcal{S}|}\sum_{a\in\mathcal{A}}\nabla\pi(a|s,{\mathbfit{\theta}})q_\pi(a,s,{\mathbfit{\theta}}). \]

This is a special case of the policy gradient theorem. The reason to put the fraction \(1/|\mathcal{S}|\) inside the first sum is to get a parallel with the more general expression, where in general, we have a weighted sum with different weight depending on the state. Depending on the objective function used, this can be for example the stationary distribution of the states for a given policy.

We state the policy gradient theorem in a more general form.

Theorem 5.1 Policy gradient theorem (For continuing cases, with discount factor \(\gamma < 1\).)

Let \(\pi(a|s,\mathbfit{\theta})\) be a stochastic policy that is derivable w.r.t \(\mathbfit{\theta}\).

Let \(\mu(s)\) be the probability mass function of the stationary distribution of the states, following the policy \(\pi\).

Define the objective function \(J(\mathbfit{\theta}) = \overline{v_\pi(S)} = \sum_{s\in \mathcal{S}}\mu(s)v_\pi(s)\). The gradient of \(J\) w.r.t \(\mathbfit{\theta}\) is then proportional to the weighted sum

\[ \nabla_{\mathbfit{\theta}} J({\mathbfit{\theta}}) \propto \sum_s \mu(s) \sum_a q_\pi(a,s, {\mathbfit{\theta}})\nabla \pi(a|s, {\mathbfit{\theta}}). \]

Remark. With our assumptions \(\mu(s) = \frac{1}{|S|}\). A proper treatment of the problem would involve properly defining Markov chains and stationary distributions, which is out of the scope of this thesis. We’ve seen in example Example exm-the_example that the state transition matrix \(P_\pi\) appears. This relation between Markov chains and MDP is explored in [2], as well as the policy gradient theorem. For more information on Markov chains, see [3].

The policy gradient theorem is powerful in the sense that we can derive the gradient of the objective function, something that is tied to the environment, to establishing the gradient of the parametrized policy function, which we have more control over.

5.4.3 REINFORCE algorithm

Here, we introduce reinforce the classic REINFORCE algorithm [4]. Even with the policy gradient theorem, we are still faced with the problem of estimating the action values \(q_\pi\). We remark that the formula in the policy gradient is an expectation,

\[ \nabla_{\mathbfit{\theta}} J(\mathbfit{\theta}) \propto E\left[\sum_a q_\pi(a,S)\nabla \pi(a|S, \mathbfit{\theta})\right], \]

where \(S\) is the random variable given by the probability mass function \(\mu(s)\). By using the identity \(\frac{\nabla f}{f} = \nabla \ln f\), we can also rewrite the inner term as

\[ \sum_a q_\pi(a,S)\nabla_{\mathbfit{\theta}} \pi(a|S, {\mathbfit{\theta}}) = \sum_a \pi(a|S)q_\pi(a,S)\nabla_{\mathbfit{\theta}} \ln \pi(a|S, {\mathbfit{\theta}}), \]

which is also an expectation, and thus

\[ \nabla J(\theta) \propto E\left[q_\pi(A,S)\nabla_{\mathbfit{\theta}} \ln \pi(A|S, {\mathbfit{\theta}})\right]. \]

We also know from before that the action value is also the conditional expectation of the return \(q_\pi(s,a) = E[G_t|S_t = s, A_t = a]\). Thus,

\[ \nabla_{\mathbfit{\theta}} J({\mathbfit{\theta}}) \propto E\left[G_t\nabla \ln \pi(A_t|S_t, {\mathbfit{\theta}})\right]. \tag{5.6}\]

Note that the variable \(t\) has been introduced Since this is an expectation, we can estimate it by using samples. Retracing our steps, the \(k\)’th sample, which we note as \(e_k\) have to be chosen as follow.

Chose a state \(S_0 = s\) at random, following its stationary distribution.
Chose an action \(A_0 = a\) according to the policy \(\pi(A_0 = a|S_0 = s, {\mathbfit{\theta}})\).
Compute the log policy gradient. Then, get the return \(G_0 = g\) for the state-action pair \((s,a)\). The sample is then \(e_k = g\nabla_{\mathbfit{\theta}} \ln \pi(a|s, {\mathbfit{\theta}})\).

Then, the estimator for the RHS in Equation eq-policy_gradient_expectation is given by

\[ \hat{E}_n = \frac{1}{n}\sum_{k=1}^n e_k, \]

where \(n\) is the number of samples we have. Using a gradient ascent algorithm, we can update the parameters \(\mathbfit{\theta}\),

\[ \mathbfit{\theta}_{t+1} = \mathbfit{\theta}_t + \alpha\frac{1}{n}\sum_{k=1}^n e_k. \]

This method has three problems:

The states need to be chosen according to the stationary distribution \(\mu(s)\), which is not trivial. Thankfully, with our assumption of random state transitions, \(\mu(s) = \frac{1}{|\mathcal{S}|}\).
To get each sample \(e_k\), we need to compute a return. Doing so, we end up visiting a lot of different states and gathering a lot of information that we end up discarding. This issue is an issue of low sample efficiency and is usually best handled via temporal difference based methods, where one estimate the returns after a finite number of steps. These methods are out of scope of the thesis.
For continuing cases(where the are no final states), the return is an infinite sum of random variable, which we can not sample. We will have to stop after \(\tau\) transitions and use the estimate \(G_t \approx \sum_{t=0}^\tau \gamma^t R_{t+1}\). This introduces some bias, in particular when \(\gamma \approx 1\) and \(\tau\) is small. Once again, this can be resolved by temporal difference based methods.

Let us forget about the truncations issues for now. When we sample the expectation in Equation eq-policy_gradient_expectation, we get a trajectory

\[ s_0,a_0 \to s_{1}, a_{1} \to s_{2},a_{2} \dots . \]

Then, we can estimate, via Monte Carlo estimation, the return \(G_0\). Doing this, we also have access to the trajectory

\[ s_{1}, a_{1} \to s_{2},a_{2} \dots, \]

and thus we can also estimate the return \(G_{1}\)! Therefore, we can use a single episode to estimate multiple samples! Using this idea, we can generate an episode of length \(\tau +1\):

\[ s_0,a_0 \to s_1,a_1,r_1 \to s_2,a_2,r_2 \to \dots \to s_{\tau +1}, r_{\tau +1}. \]

For any \(t = 0,\dots ,T\), the estimated return is then defined as

\[ \hat{G}_t = \sum_{k=t}^\tau \gamma^{t-k}r_{k+1}. \]

Remark. Because the initial state is chosen following a stationary distribution, we also ensure that the subsequent states are chosen following this same distribution.

We can now state the REINFORCE algorithm [4], also called policy gradient Monte-Carlo in pseudo code format:

REINFORCE algorithm pseudocode
INPUT:
- A parameter vector \(\mathbfit{\theta} \in\mathbb{R}^d\), and a parametrized policy \(\pi(a|s,\mathbfit{\theta})\) with computable gradient \(\nabla_{\mathbfit{\theta}} \pi(a|s,\mathbfit{\theta})\);
- Learning rate \(\alpha\);
- Discount rate \(\gamma\);
- Episode length \(\tau+1\);
- Number of episode to iterate for \(n\);
OUTPUT: The updated parameter \(\mathbfit{\theta}\);

FOR \(n\) episodes:
      Generate an episode, following \(\pi(a|s,\mathbfit{\theta})\), of length T+2 the form \(s_0,a_0 \to s_1,a_1,r_1 \to s_2,a_2,r_2 \to \dots \to s_{\tau+1}, r_{\tau+1}\);
      FOR t=0 … :
          Compute the estimated return \(\hat{G_t} = \sum_{k=t}^\tau \gamma^{t-k}r_{k+1}\);
          Compute the log gradient \(\nabla \ln \pi(a_t|s_t,\theta)\);
          Update \(\theta \leftarrow \theta + \alpha \hat{G_t} \nabla \ln \pi(a_t|s_t,\mathbfit{\theta})\);

Remark. Because of the finite episode length, the REINFORCE algorithm is more suited for episodic tasks, but it is also usable for continuing tasks, if we accept some bias. Another alternative to reduce bias would be to discard the last few estimated returns \(\hat{G}_\tau, \hat{G}_{\tau-1}, \dots\) as they are the most biased.

The REINFORCE update can be interpreted as updating the parameters to make it more likely to take an action if the estimated sample return is good, and the opposite otherwise. Furthermore, by looking at the term \(\nabla \ln \pi = \frac{\nabla \pi}{\pi}\), we can see that if the probability of taking the action is low, then the gradient becomes bigger! That way, this gradient act as a balance between exploration and exploitation. Otherwise we would update the parameters as much for a rare action than a common one, and the common action is taken more often which lead to the common action having much more sway in the process.

[1]

R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, Second. The MIT Press, 2018. Available: http://incompleteideas.net/book/the-book-2nd.html

[2]

S. Zhao, “Mathematical foundations of reinforcement learning.” 2023. https://github.com/MathFoundationRL/Book-Mathmatical-Foundation-of-Reinforcement-Learning (accessed Mar. 30, 2023).

[3]

S. M. Ross and E. A. Peköz, A second course in probability. ProbabilityBookstore.com, 2007. Available: https://books.google.se/books?id=g5j6DwAAQBAJ

[4]

R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3, pp. 229–256, May 1992, doi: 10.1007/BF00992696.