Reinforcement Learning: Monte Carlo Control

Author

Michael Hahsler

This code is provided under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License.

1 Introduction

Reinforcement Learning: An Introduction (RL) by Sutton and Barto (2020) introduce several Monte Carlo control algorithms in Chapter 5: Monte Carlo Methods.

We will implement the key concepts using R for the AIMA 3x4 grid world example. The used environment is an MDP, but instead of trying to solve the MDP directly and estimating the value function estimates $U(s)$, we will try to learn a policy from interactions with the environment. The MDP’s transition model will only be used to simulate complete episodes following a policy that the algorithm will use to updates its current policy.

The code in this notebook defines explicit functions matching the textbook definitions and is for demonstration purposes only. Efficient implementations for larger problems use fast vector multiplications instead.

2 The 4x3 Grid World Example

The used example is the simple 4x3 grid world described in AIMA Figure 12.1 and used again in 22.1 as:

AIMA Figure 17.1: (a) A simple, stochastic $4 \times 3$ environment that presents the agent with a sequential decision problem. (b) Illustration of the transition model of the environment: the “intended” outcome occurs with probability 0.8, but with probability 0.2 the agent moves at right angles to the intended direction. A collision with a wall results in no movement. Transitions into the two terminal states have reward +1 and -1, respectively, and all other transitions have a reward of -0.04.

In the following, we implement states, actions, the reward function,and the transition model. We also show how to represent a policy and how to estimate the expected utility using simulation.

The MDP will be defined as the following global variables/functions:

S: set of states.
A: set of actions.
actions(s): returns the available actions for state $s$.
R(s, a, s_prime): reward function for the transition from $s$ to $s`$ with action a.
P(sp, s, a): a function returning the transition probability $P(s' | s, a)$.

Policies are represented as:

Deterministic policies are a vector with the action for each state.
Stochastic policies are a matrix with probabilities where each row is a state and the columns are the actions

Other useful functions:

sample_transition(s, a): returns a $s'$ sampled using the transition model.
simulate_utilities(pi, s0 = 1, N = 1000, max_t = 100): Estimates the utility of following policy $\pi$, starting $N$ episodes in state $s_0$. The maximal episode length is max_t to ensure that the function finishes also for policies that do not lead to a terminal state.

2.1 States

We define the atomic state space $S$ by labeling the states $1, 2, ...$. We convert coordinates (rows, columns) to the state label.

# I use capitalized variables as global constants
COLS <- 4
ROWS <- 3

S = seq_len(ROWS * COLS)

LAYOUT <- matrix(S, nrow = ROWS, ncol = COLS)
LAYOUT

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

Note that the rows are displayed upside-down compared to the text book, so we use a function to display them in reverse order.

show_layout <- function(x) {
    x <- matrix(x, ncol = COLS, nrow = ROWS, dimnames = list(row = seq_len(ROWS),
        col = seq_len(COLS)))
    x[rev(seq_len(ROWS)), ]
}

show_layout(LAYOUT)

   col
row 1 2 3  4
  3 3 6 9 12
  2 2 5 8 11
  1 1 4 7 10

Convert between coordinates and state labels.

rc_to_s <- function(rc) LAYOUT[rbind(rc)]

s_to_rc <- function(s) drop(which(LAYOUT == s, arr.ind = TRUE, useNames = FALSE))


rc_to_s(c(3, 4))

[1] 12

s_to_rc(12)

[1] 3 4

Start state ::: {.cell}

start <- 1L

:::

Define terminal states.

is_terminal <- function(s) s %in% c(5, 11, 12)

2.2 Actions

The complete set of actions is $A = \{\mathrm{'Up', 'Right', 'Down', 'Left', 'None'}\}$. Not all actions are available in every state. Also, action None is added as the only possible action in an absorbing state.

A = c("Up", "Right", "Down", "Left", "None")

actions <- function(s) {

    # absorbing states
    if (s == 11 || s == 12)
        return("None")

    # illegal state
    if (s == 5)
        return("None")

    c("Up", "Right", "Down", "Left")
}

lapply(S, actions)

[[1]]
[1] "Up"    "Right" "Down"  "Left" 

[[2]]
[1] "Up"    "Right" "Down"  "Left" 

[[3]]
[1] "Up"    "Right" "Down"  "Left" 

[[4]]
[1] "Up"    "Right" "Down"  "Left" 

[[5]]
[1] "None"

[[6]]
[1] "Up"    "Right" "Down"  "Left" 

[[7]]
[1] "Up"    "Right" "Down"  "Left" 

[[8]]
[1] "Up"    "Right" "Down"  "Left" 

[[9]]
[1] "Up"    "Right" "Down"  "Left" 

[[10]]
[1] "Up"    "Right" "Down"  "Left" 

[[11]]
[1] "None"

[[12]]
[1] "None"

2.3 Rewards

$R(s, a, s')$ define the reward for the transition from $s$ to $s'$ with action $a$.

For the textbook example we have:

Any move costs utility (a reward of -0.04).
Going to state 12 has a reward of +1
Going to state 11 has a reward of -1.

Note that once you are in an absorbing state (11 or 12), then the problem is over and there is no more reward!

R <- function(s, a, s_prime) {
    ## no more reward when we in 11 or 12.
    if (a == "None" || s == 11 || s == 12)
        return(0)

    ## transition to the absorbing states.
    if (s_prime == 12)
        return(+1)
    if (s_prime == 11)
        return(-1)

    ## cost for each move
    return(-0.04)
}

R(1, "Up", 2)

[1] -0.04

R(9, "Right", 12)

[1] 1

R(12, "None", 12)

[1] 0

2.4 Transition Model

$P(s' | s, a)$ is the probability of going from state $s$ to $s'$ by when taking action $a$. We will create a matrix $P_a(s' | s)$ for each action.

calc_transition <- function(s, action) {
    action <- match.arg(action, choices = A)

    if (length(s) > 1)
        return(t(sapply(s, calc_transition, action = action)))

    # deal with absorbing and illegal state
    if (s == 11 || s == 12 || s == 5 || action == "None") {
        P <- rep(0, length(S))
        P[s] <- 1
        return(P)
    }

    action_to_delta <- list(Up = c(+1, 0), Down = c(-1, 0), Right = c(0, +1), Left = c(0,
        -1))
    delta <- action_to_delta[[action]]
    dr <- delta[1]
    dc <- delta[2]

    rc <- s_to_rc(s)
    r <- rc[1]
    c <- rc[2]

    if (dr != 0 && dc != 0)
        stop("You can only go up/down or right/left!")

    P <- matrix(0, nrow = ROWS, ncol = COLS)

    # UP/DOWN
    if (dr != 0) {
        new_r <- r + dr
        if (new_r > ROWS || new_r < 1)
            new_r <- r
        ## can't got to (2, 2)
        if (new_r == 2 && c == 2)
            new_r <- r
        P[new_r, c] <- 0.8

        if (c < COLS & !(r == 2 & (c + 1) == 2))
            P[r, c + 1] <- 0.1 else P[r, c] <- P[r, c] + 0.1
        if (c > 1 & !(r == 2 & (c - 1) == 2))
            P[r, c - 1] <- 0.1 else P[r, c] <- P[r, c] + 0.1
    }

    # RIGHT/LEFT
    if (dc != 0) {
        new_c <- c + dc
        if (new_c > COLS || new_c < 1)
            new_c <- c
        ## can't got to (2, 2)
        if (r == 2 && new_c == 2)
            new_c <- c
        P[r, new_c] <- 0.8

        if (r < ROWS & !((r + 1) == 2 & c == 2))
            P[r + 1, c] <- 0.1 else P[r, c] <- P[r, c] + 0.1
        if (r > 1 & !((r - 1) == 2 & c == 2))
            P[r - 1, c] <- 0.1 else P[r, c] <- P[r, c] + 0.1
    }

    as.vector(P)
}

Try to go up from state 1 (this is (1,1), the bottom left corner). Note: we cannot go left so there is a .1 chance to stay in place. ::: {.cell}

calc_transition(1, "Up")

 [1] 0.1 0.8 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

show_layout(calc_transition(1, "Up"))

   col
row   1   2 3 4
  3 0.0 0.0 0 0
  2 0.8 0.0 0 0
  1 0.1 0.1 0 0

:::

Try to go right from (2,1). Since right is blocked, there is a .8 probability of staying in place. ::: {.cell}

show_layout(calc_transition(2, "Right"))

   col
row   1 2 3 4
  3 0.1 0 0 0
  2 0.8 0 0 0
  1 0.1 0 0 0

:::

Calculate transitions for each state to each other state. Each row represents a state $s$ and each column a state $s'$ so we get a complete definition for $P_a(s' | s)$. Note that the matrix is stochastic (all rows add up to 1).

Create a matrix for each action.

P_matrices <- lapply(A, FUN = function(a) calc_transition(S, a))
names(P_matrices) <- A
str(P_matrices)

List of 5
 $ Up   : num [1:12, 1:12] 0.1 0 0 0.1 0 0 0 0 0 0 ...
 $ Right: num [1:12, 1:12] 0.1 0.1 0 0 0 0 0 0 0 0 ...
 $ Down : num [1:12, 1:12] 0.9 0.8 0 0.1 0 0 0 0 0 0 ...
 $ Left : num [1:12, 1:12] 0.9 0.1 0 0.8 0 0 0 0 0 0 ...
 $ None : num [1:12, 1:12] 1 0 0 0 0 0 0 0 0 0 ...

Create a function interface for $P(s' | s, a)$.

P <- function(sp, s, a) P_matrices[[a]][s, sp]

P(2, 1, "Up")

[1] 0.8

P(5, 4, "Up")

[1] 0

2.5 Policy

The solution to an MDP is a policy $\pi$ which defines which action to take in each state. We represent deterministic policies as a vector of actions. I make up a policy that always goes up and then to the right once the agent hits the top.

pi_manual <- rep("Up", times = length(S))
pi_manual[c(3, 6, 9)] <- "Right"
pi_manual

 [1] "Up"    "Up"    "Right" "Up"    "Up"    "Right" "Up"    "Up"    "Right"
[10] "Up"    "Up"    "Up"

show_layout(pi_manual)

   col
row 1       2       3       4   
  3 "Right" "Right" "Right" "Up"
  2 "Up"    "Up"    "Up"    "Up"
  1 "Up"    "Up"    "Up"    "Up"

We can also create a random policy by randomly choosing from the available actions for each state.

create_random_deterministic_policy <- function() structure(sapply(S, FUN = function(s) sample(actions(s),
    1L)), names = S)

set.seed(1234)
pi_random <- create_random_deterministic_policy()
pi_random

      1       2       3       4       5       6       7       8       9      10 
 "Left"  "Left" "Right" "Right"  "None"  "Left"  "Down"    "Up"    "Up" "Right" 
     11      12 
 "None"  "None"

show_layout(pi_random)

   col
row 1       2       3      4      
  3 "Right" "Left"  "Up"   "None" 
  2 "Left"  "None"  "Up"   "None" 
  1 "Left"  "Right" "Down" "Right"

Stochastic policies use probabilities of actions in each state. We use as simple table with probabilities where each row is a state and the columns are the actions. Here we create a random $\epsilon$-soft policy. Each available has at least a probability of $\epsilon$.

We can make a deterministic policy soft. ::: {.cell}

make_policy_soft <- function(pi, epsilon = 0.1) {
    if (!is.vector(pi))
        stop("pi is not a deterministic policy!")

    p <- matrix(0, nrow = length(S), ncol = length(A), dimnames = list(S, A))

    for (s in S) {
        p[s, actions(s)] <- epsilon/length(actions(s))
        p[s, pi[s]] <- p[s, pi[s]] + (1 - epsilon)
    }

    p
}

make_policy_soft(pi_random)

      Up Right  Down  Left None
1  0.025 0.025 0.025 0.925    0
2  0.025 0.025 0.025 0.925    0
3  0.025 0.925 0.025 0.025    0
4  0.025 0.925 0.025 0.025    0
5  0.000 0.000 0.000 0.000    1
6  0.025 0.025 0.025 0.925    0
7  0.025 0.025 0.925 0.025    0
8  0.925 0.025 0.025 0.025    0
9  0.925 0.025 0.025 0.025    0
10 0.025 0.925 0.025 0.025    0
11 0.000 0.000 0.000 0.000    1
12 0.000 0.000 0.000 0.000    1

:::

Or we can create a completely random soft policy

create_random_epsilon_soft_policy <- function(epsilon = 0.1) {

    # split total randomly into n numbers that add up to total
    random_split <- function(n, total) {
        if (n == 1)
            return(total)

        bordersR <- c(sort(runif(n - 1)), 1)
        bordersL <- c(0, bordersR[1:(n - 1)])
        (bordersR - bordersL) * total
    }

    p <- matrix(0, nrow = length(S), ncol = length(A), dimnames = list(S, A))
    for (s in S) p[s, actions(s)] <- epsilon/length(actions(s)) + random_split(n = length(actions(s)),
        1 - epsilon)

    p
}

set.seed(1234)
pi_random_epsilon_soft <- create_random_epsilon_soft_policy()
pi_random_epsilon_soft

      Up Right  Down  Left None
1  0.127 0.471 0.037 0.365    0
2  0.586 0.040 0.224 0.150    0
3  0.034 0.226 0.415 0.326    0
4  0.488 0.053 0.159 0.301    0
5  0.000 0.000 0.000 0.000    1
6  0.279 0.034 0.593 0.094    0
7  0.265 0.042 0.521 0.171    0
8  0.193 0.066 0.101 0.640    0
9  0.061 0.132 0.154 0.653    0
10 0.222 0.301 0.281 0.195    0
11 0.000 0.000 0.000 0.000    1
12 0.000 0.000 0.000 0.000    1

2.6 Expected Utility

The expected utility can be calculated by

$U^\pi = E\left[\sum_{t=0}^\infty \gamma^t R(s_t, \pi(s_t), s_{t+1})\right]$

We need to define the discount factor.

GAMMA <- 1

We can evaluate the utility of a policy using simulation. For the stochastic transition model, we need to be able to sample the state $s'$ the system transitions to when using action $a$ in state $s$.

sample_transition <- function(s, a) sample(S, size = 1, prob = P_matrices[[a]][s,
    ])

sample_transition(1, "Up")

[1] 1

table(replicate(n = 100, sample_transition(1, "Up")))


 1  2  4 
 9 84  7

We can now simulate the utility for one episode. Note that we use the cutoff max_t in case a policy does not end up in a terminal state before that.

simulate_utility <- function(pi, s0 = 1, max_t = 100) {
    s <- s0
    U <- 0
    t <- 0

    while (TRUE) {
        ## get action from policy (matrix means it is a stochastic policy)
        if (!is.matrix(pi))
            a <- pi[s] else a <- sample(A, size = 1, prob = pi[s, ])

        ## sample a transition given the action from the policy
        s_prime <- sample_transition(s, a)

        ##
        U <- U + GAMMA^t * R(s, a, s_prime)

        s <- s_prime

        ## reached an absorbing state?
        if (s == 11 || s == 12 || s == 5)
            break

        t <- t + 1
        if (t >= max_t)
            break
    }

    U
}

Simulate the N episodes.

simulate_utilities <- function(pi, s0 = 1, N = 1000, max_t = 100) replicate(N, simulate_utility(pi,
    s0, max_t))

utility_manual <- simulate_utilities(pi_manual)

The expected utility for starting from state $s_0 = 1$ is.

mean(utility_manual)

[1] 0.7

hist(utility_manual, xlim = c(-1, 1))

Compare with the utility of the random policy. ::: {.cell}

utility_random <- simulate_utilities(pi_random, max_t = 100)

table(utility_random)

utility_random
  -4 
1000

:::

The random policy performs really bad and most likely always stumbles around for max_t moves at a cost of .04 each. The manually created policy is obviously better.

We can use simulation to estimate the expected utility for starting from each state following the policy.

U_manual <- sapply(S, FUN = function(s) mean(simulate_utilities(pi_manual, s0 = s)))
show_layout(U_manual)

   col
row    1     2    3     4
  3 0.86  0.90 0.95  0.00
  2 0.79 -0.04 0.72  0.00
  1 0.72  0.36 0.52 -0.87

U_random <- sapply(S, FUN = function(s) mean(simulate_utilities(pi_random, s0 = s)))
show_layout(U_random)

   col
row  1    2    3    4
  3 -4 -4.0 -1.6  0.0
  2 -4  0.0 -1.5  0.0
  1 -4 -1.8 -1.8 -1.4

3 Monte Carlo Methods

This method uses experience by sampling sequences ($s, a, r, s, ...$) from the environment and then averaged sample returns for each state-action pair (i.e., Q-values). They perform generalized policy iteration by performing after each episode:

Current policy evaluation to estimate the new action value function $Q$.
Policy improvement by changing the policy to a greedy policy with respect to $Q$.

An important issue is that we make sure that the algorithm keeps exploring. We will implement several different MC methods that incorporate exploration in different ways.

First, we implement several helper functions used my MC algorithms.

Find the greedy (an $\epsilon$-greedy) action given the action-value function $Q$. Find the greedy policy given $Q$.

greedy_action <- function(s, Q, epsilon = 0) {
    available_A <- actions(s)

    if (epsilon == 0 || length(available_A) == 1L || runif(1) > epsilon) {
        a <- available_A[which.max(Q[s, available_A])]
    } else {
        a <- sample(available_A, size = 1L)
    }

    a
}

greedy_policy <- function(Q, epsilon = 0) {
    if (epsilon == 0) {
        p <- structure(A[apply(Q, MARGIN = 1, which.max)], names = S)
    } else {
        p <- matrix(0, nrow = length(S), ncol = length(A), dimnames = list(S, A))
        for (s in S) p[s, actions(s)] <- epsilon/length(actions(s))
        a <- greedy_action(Q, s)
        p[s, a] + p[s, a] + (1 - epsilon)
    }
    p
}

Find the action used in state $s$ given a soft or deterministic policy $\pi$.

next_action <- function(pi, s) {
    if (!is.matrix(pi))
        pi[s] else sample(A, size = 1L, prob = pi[s, ])
}

Simulate an episode following policy $\pi$ and starting in state $s_0$ with $action $a_0$. max_length is used to make sure that the episode length cannot become infinite for a policy that does not end in a terminal state.

simulate_episode <- function(pi, s_0, a_0 = NULL, max_length = 100) {
    # rows are s_t, a_t, r_t+1; row 1 is t = 0
    episode <- data.frame(t = rep(NA_integer_, max_length), s = rep(NA_character_,
        max_length), a = rep(NA_character_, max_length), r = rep(NA_real_, max_length))

    if (is.null(a_0))
        a_0 <- next_action(pi, s_0)

    s <- s_0
    a <- a_0
    r <- NA
    i <- 1L  # i == 1 means t == 0!
    while (TRUE) {

        if (is_terminal(s))
            break

        s_prime <- sample_transition(s, a)
        r <- R(s, a, s_prime)

        episode[i, ] <- data.frame(t = i - 1L, s, a, r)

        if (is_terminal(s_prime))
            break
        if (i >= max_length)
            break

        s <- s_prime
        a <- next_action(pi, s)

        i <- i + 1L
    }

    episode[1:i, ]
}

Simulate an episode following a randomly generated epsilon soft policy.

pi <- create_random_epsilon_soft_policy(epsilon = 0.1)
pi

      Up Right  Down  Left None
1  0.364  0.49 0.109 0.032    0
2  0.321  0.21 0.206 0.265    0
3  0.026  0.11 0.470 0.395    0
4  0.165  0.14 0.395 0.302    0
5  0.000  0.00 0.000 0.000    1
6  0.097  0.32 0.045 0.539    0
7  0.373  0.16 0.334 0.130    0
8  0.113  0.30 0.075 0.507    0
9  0.037  0.66 0.115 0.192    0
10 0.230  0.14 0.105 0.521    0
11 0.000  0.00 0.000 0.000    1
12 0.000  0.00 0.000 0.000    1

simulate_episode(pi, 1, "Up")

  t s     a     r
1 0 1    Up -0.04
2 1 2    Up -0.04
3 2 3  Left -0.04
4 3 3  Down -0.04
5 4 2  Left -0.04
6 5 3  Down -0.04
7 6 6 Right -0.04
8 7 9 Right  1.00

Note that the table represents the sequence:

\[s_0, a_0, r_1, s_1, a_1, ..., s_{T-1}, a_{T-1}, r_T\]

Each row contains $s_t, a_t, r_{t+1}$ and the row index for $t=0$ is 1.

4 Implementing Monte Carlo Exploring Starts Control

The most simple MC algorithm learns a greedy policy. In order to still keep exploring it uses the idea of exploring starts: All state-action pairs have a non-zero probability of being selected as the start of an episode.

Here is the pseudo code from the RL book, Chapter 5:

Reinforcement Learning Chapter 5: MC ES control

We implement the algorithm with the following change. Instead of collecting the utilities $G$ as lists in the Results data structure and then averaging them to calculate new $Q$-values, we keep the number of utilities used to average each $Q$-value and then update the running average.

MC_exploring_states <- function(N = 100, gamma = 1, verbose = FALSE) {
    # Initialize
    pi <- create_random_deterministic_policy()

    Q <- matrix(0, nrow = length(S), ncol = length(A), dimnames = list(S, A))

    # instead of returns we use a more efficient running average where Q_N is
    # the number of averaged values.
    Q_N <- matrix(0L, nrow = length(S), ncol = length(A), dimnames = list(S, A))

    if (verbose) {
        cat("Initial policy:\n")
        print(pi)

        cat("Initial Q:\n")
        print(Q)
    }

    # Loop through N episodes
    for (e in seq(N)) {
        # Sample a starting state and action (= exploring states)
        s_0 <- sample(S[!is_terminal(S)], size = 1L)
        a_0 <- sample(actions(s_0), size = 1L)

        ep <- simulate_episode(pi, s_0, a_0)

        if (verbose) {
            cat(paste("*** Episode", e, "***\n"))
            print(ep)
        }

        G <- 0
        for (i in rev(seq(nrow(ep)))) {
            r_t_plus_1 <- ep$r[i]
            s_t <- ep$s[i]
            a_t <- ep$a[i]

            G <- gamma * G + r_t_plus_1

            # Only update for first visit of a s/a combination
            if (i < 2L || !any(s_t == ep$s[1:(i - 1L)] & a_t == ep$a[1:(i - 1L)])) {

                if (verbose)
                  cat(paste0("Update at step ", i, ":\n", "  - Q(", s_t, ", ", a_t,
                    "): ", round(Q[s_t, a_t], 3)))

                # running average instead of averaging Returns lists.
                Q[s_t, a_t] <- (Q[s_t, a_t] * Q_N[s_t, a_t] + G)/(Q_N[s_t, a_t] +
                  1)
                Q_N[s_t, a_t] <- Q_N[s_t, a_t] + 1L

                if (verbose)
                  cat(paste0(" -> ", round(Q[s_t, a_t], 3), " (G = ", round(G, 3),
                    ")\n"))

                if (verbose)
                  cat(paste0("  - pi[", s_t, "]: ", pi[s_t]))

                pi[s_t] <- greedy_action(s_t, Q)

                if (verbose)
                  cat(paste0(" -> ", pi[s_t], "\n"))
            }
        }
    }
    list(Q = Q, pi = pi)
}

ret <- MC_exploring_states(N = 1000, verbose = FALSE)

ret

$Q
       Up  Right   Down   Left None
1   0.034  0.159 0.0151 -0.785    0
2   0.747  0.143 0.2178  0.169    0
3   0.201  0.844 0.5342  0.541    0
4   0.194 -0.085 0.0065 -0.079    0
5   0.000  0.000 0.0000  0.000    0
6   0.383  0.912 0.6295  0.510    0
7   0.502 -0.043 0.4736 -0.174    0
8   0.699 -0.731 0.1969  0.587    0
9   0.858  0.959 0.9092  0.686    0
10 -0.829 -0.162 0.2240  0.312    0
11  0.000  0.000 0.0000  0.000    0
12  0.000  0.000 0.0000  0.000    0

$pi
      1       2       3       4       5       6       7       8       9      10 
"Right"    "Up" "Right"    "Up"  "None" "Right"    "Up"    "Up" "Right"  "Left" 
     11      12 
 "None"  "None"

show_layout(ret$pi)

   col
row 1       2       3       4     
  3 "Right" "Right" "Right" "None"
  2 "Up"    "None"  "Up"    "None"
  1 "Right" "Up"    "Up"    "Left"

5 Implementing On-Policy Monte Carlo Control

Exploring starts are not always an option. E.g., when learning from interaction with the environment where we cannot simply set the starting condition to $s_0$ and $a_0$.

The first option is to learn an $\epsilon$-greedy policy and also use it as the behavior policy (on-policy control).

Here is the pseudo code from the RL book, Chapter 5:

Reinforcement Learning Chapter 5: On-policy MC control We implement the algorithm again with a running average.

MC_on_policy <- function(N = 100, epsilon = 0.1, gamma = 1, verbose = FALSE) {
    # Initialize
    pi <- create_random_epsilon_soft_policy(epsilon)

    Q <- matrix(0, nrow = length(S), ncol = length(A), dimnames = list(S, A))

    Q_N <- matrix(0L, nrow = length(S), ncol = length(A), dimnames = list(S, A))

    # Loop through N episodes
    for (e in seq(N)) {
        # always start from the start state defined by the problem
        s_0 <- start

        ep <- simulate_episode(pi, s_0)

        if (verbose) {
            cat(paste("*** Episode", e, "***\n"))
            print(ep)
        }

        G <- 0
        for (i in rev(seq(nrow(ep)))) {
            r_t_plus_1 <- ep$r[i]
            s_t <- ep$s[i]
            a_t <- ep$a[i]

            G <- gamma * G + r_t_plus_1

            # Only update for first visit of a s/a combination
            if (i < 2L || !any(s_t == ep$s[1:(i - 1L)] & a_t == ep$a[1:(i - 1L)])) {
                if (verbose)
                  cat(paste0("Update at step ", i, ":\n", "  - Q(", s_t, ", ", a_t,
                    "): ", round(Q[s_t, a_t], 3)))

                # running average instead of averaging Returns lists.
                Q[s_t, a_t] <- (Q[s_t, a_t] * Q_N[s_t, a_t] + G)/(Q_N[s_t, a_t] +
                  1)
                Q_N[s_t, a_t] <- Q_N[s_t, a_t] + 1L

                if (verbose)
                  cat(paste0(" -> ", round(Q[s_t, a_t], 3), " (G = ", round(G, 3),
                    ")\n"))

                a_star <- greedy_action(s_t, Q)

                if (verbose) {
                  cat(paste0("  - pi for state ", s_t, " is updated:\n"))
                  print(pi[s_t, ])
                }
                pi[s_t, actions(s_t)] <- epsilon/length(actions(s_t))
                pi[s_t, a_star] <- pi[s_t, a_star] + (1 - epsilon)
                if (verbose) {
                  print(pi[s_t, ])
                  cat("\n")
                }

            }
        }
    }
    list(Q = Q, pi = pi)
}

ret <- MC_on_policy(N = 1000, epsilon = 0.1, verbose = FALSE)

ret

$Q
        Up Right   Down  Left None
1   0.6389  0.31  0.449  0.45    0
2   0.7061  0.58  0.469  0.57    0
3   0.5284  0.78  0.456  0.66    0
4   0.0031 -0.34  0.228  0.52    0
5   0.0000  0.00  0.000  0.00    0
6   0.6897  0.84  0.689  0.75    0
7   0.0274 -0.86 -1.240 -0.28    0
8   0.4527 -0.46  0.074  0.19    0
9   0.7900  0.92  0.280  0.68    0
10 -1.0053 -1.04 -1.360 -1.28    0
11  0.0000  0.00  0.000  0.00    0
12  0.0000  0.00  0.000  0.00    0

$pi
      Up Right  Down  Left None
1  0.925 0.025 0.025 0.025    0
2  0.925 0.025 0.025 0.025    0
3  0.025 0.925 0.025 0.025    0
4  0.025 0.025 0.025 0.925    0
5  0.000 0.000 0.000 0.000    1
6  0.025 0.925 0.025 0.025    0
7  0.925 0.025 0.025 0.025    0
8  0.925 0.025 0.025 0.025    0
9  0.025 0.925 0.025 0.025    0
10 0.925 0.025 0.025 0.025    0
11 0.000 0.000 0.000 0.000    1
12 0.000 0.000 0.000 0.000    1

We have learned a soft ($\epsilon$-greedy) policy. We can extract a deterministic policy by always using the action with the larges execution probability.

show_layout(A[apply(ret$pi, MARGIN = 1, which.max)])

   col
row 1       2       3       4     
  3 "Right" "Right" "Right" "None"
  2 "Up"    "None"  "Up"    "None"
  1 "Up"    "Left"  "Up"    "Up"

To learn a policy that is closer to the a deterministic greedy policy, $\epsilon$ can be reduced over time.

6 Implementing Off-Policy Monte Carlo Control

The on-policy control algorithm above learns an $\epsilon$-greedy policy. Off-policy control uses an arbitrary soft policy for the behavior and uses the generated episodes to learn a deterministic greedy policy. This is done by adjusting the observed returns from the behavior policy using importance sampling.

The importance sampling ration for the reward at $t$ till the end of the episode $T-1$ denoted by $G_{t:T-1}$ is

\[\rho_{t:T-1} = \prod_{k=t}^{T-1}\frac{\pi(a_k|s_k)}{b(a_k|s_k)}\] where $\pi$ is the target policy and $b$ is the behavior policy. Importance sampling makes sure that the expected rescaled value following the behavior policy $b$ starting at state $s$ gives the value of the state following $\pi$.

\[\mathbb{E}[\rho_{t:T-1} G_t | s_t=s] = v_\pi(s)\] The expectation can be estimated from sequences by averaging the rescaled returns for each state. Regular averaging results in extremely high variance when only few reward values are available. An alternative to regular averaging is called weighted importance sampling which uses a weighted average defined as:

\[V(s) \dot = \frac{\sum_{k=1}^{n-1} W_kG_k}{\sum_{k=1}^{n-1} W_k}, \qquad n\ge 2\] Weighted importance sampling has lower variance and is used here.

Here is the pseudo code from the RL book, Chapter 5:

Reinforcement Learning Chapter 5: Off-policy MC control

MC_off_policy <- function(N = 100, epsilon = 0.1, gamma = 1, verbose = FALSE) {
    # Initialize
    Q <- matrix(0, nrow = length(S), ncol = length(A), dimnames = list(S, A))

    # cumulative sum of the weights W used in incremental updates
    C <- matrix(0L, nrow = length(S), ncol = length(A), dimnames = list(S, A))

    pi <- greedy_policy(Q)

    # Loop through N episodes
    for (e in seq(N)) {
        b <- create_random_epsilon_soft_policy(epsilon)

        s_0 <- start
        ep <- simulate_episode(b, s_0)

        if (verbose) {
            cat(paste("*** Episode", e, "***\n"))
            print(ep)
        }

        G <- 0
        W <- 1
        for (i in rev(seq(nrow(ep)))) {
            r_t_plus_1 <- ep$r[i]
            s_t <- ep$s[i]
            a_t <- ep$a[i]

            G <- gamma * G + r_t_plus_1

            # increase cumulative sum of W and update Q with weighted G
            C[s_t, a_t] <- C[s_t, a_t] + W
            Q[s_t, a_t] <- Q[s_t, a_t] + (W/C[s_t, a_t]) * (G - Q[s_t, a_t])

            pi[s_t] <- greedy_action(s_t, Q)

            # the algorithm can only learn from the tail of the episode where b
            # also used the greedy actions in pi. The method is inefficient and
            # cannot use all the data!
            if (a_t != pi[s_t])
                break

            W <- W * 1/b[s_t, a_t]
            # note that pi[s_t, a_t] is by construction 1!
        }
    }

    list(Q = Q, pi = pi)
}

ret <- MC_off_policy(N = 1000, epsilon = 0.1, verbose = FALSE)

ret

$Q
      Up Right   Down  Left None
1   0.79  0.67 -0.087  0.75    0
2   0.85  0.83 -0.040  0.84    0
3   0.88  0.91  0.920  0.87    0
4   0.30  0.78 -1.052 -0.04    0
5   0.00  0.00  0.000  0.00    0
6   0.90  0.95  0.914  0.86    0
7   0.77  0.69 -0.403  0.55    0
8   0.78 -0.36 -0.452  0.68    0
9   0.92  0.96  0.698  0.90    0
10 -1.00  0.28 -0.801  0.30    0
11  0.00  0.00  0.000  0.00    0
12  0.00  0.00  0.000  0.00    0

$pi
      1       2       3       4       5       6       7       8       9      10 
   "Up"    "Up"  "Down" "Right"    "Up" "Right"    "Up"    "Up" "Right"  "Left" 
     11      12 
   "Up"    "Up"

show_layout(ret$pi)

   col
row 1      2       3       4     
  3 "Down" "Right" "Right" "Up"  
  2 "Up"   "Up"    "Up"    "Up"  
  1 "Up"   "Right" "Up"    "Left"