Reinforcement Learning: A Q-Learning Agent

Author

Michael Hahsler

This code is provided under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License.


1 Introduction

AIMA Chapter 22 is about reinforcement learning, where the agent learns from reward signals. We will implement the key concepts in R for the AIMA 4x3 grid world example. A more detailed treatment of the algorithm and its properties can be found in Reinforcement Learning: An Introduction (RL) by Sutton and Barto (2020) in Chapter 6: Temporal-Difference Learning.

The environment is an MDP, but instead of solving the MDP directly to estimate the value function \(U(s)\), we will learn a policy from interactions with the environment. The MDP’s transition model will only be used to simulate the environment’s response to the agent’s actions.

The code in this notebook defines explicit functions matching the textbook definitions and is for demonstration purposes only. Efficient implementations for larger problems use fast vector multiplications instead.

2 The 4x3 Grid World Example

The example is the simple 4x3 grid world described in AIMA Figure 17.1 and used again in Section 22.1:

AIMA Figure 17.1: (a) A simple, stochastic \(4 \times 3\) environment that presents the agent with a sequential decision problem. (b) Illustration of the transition model of the environment: the “intended” outcome occurs with probability 0.8, but with probability 0.2 the agent moves at right angles to the intended direction. A collision with a wall results in no movement. Transitions into the two terminal states have reward +1 and -1, respectively, and all other transitions have a reward of -0.04.

In the following, we implement the states, actions, the reward function, and the transition model. We also show how to represent a policy and how to estimate the expected utility using simulation.

The MDP will be defined as the following global variables/functions:

  • S: set of states.
  • A: set of actions.
  • actions(s): returns the available actions for state \(s\).
  • P(sp, s, a): a function returning the transition probability \(P(s' | s, a)\).
  • R(s, a, s_prime): the reward function for the transition from \(s\) to \(s'\) with action \(a\).

Policies are represented as:

  • Deterministic policies are a vector with the action for each state.
  • Stochastic policies are a matrix of probabilities where each row is a state and the columns are the actions.

Other useful functions:

  • sample_transition(s, a): returns a state \(s'\) sampled using the transition model.
  • simulate_utilities(pi, s0 = 1, N = 1000, max_t = 100): estimates the utility of following policy \(\pi\) by starting \(N\) episodes in state \(s_0\). The maximal episode length max_t ensures that the function also terminates for policies that never reach a terminal state.

2.1 States

We define the atomic state space \(S\) by labeling the states \(1, 2, \dots\) and provide functions to convert between (row, column) coordinates and state labels.

# I use capitalized variables as global constants
COLS <- 4
ROWS <- 3

S <- seq_len(ROWS * COLS)

LAYOUT <- matrix(S, nrow = ROWS, ncol = COLS)
LAYOUT
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

Note that the rows are displayed upside-down compared to the textbook, so we use a function to display them in reverse order.

show_layout <- function(x) {
    x <- matrix(x, ncol = COLS, nrow = ROWS, dimnames = list(row = seq_len(ROWS),
        col = seq_len(COLS)))
    x[rev(seq_len(ROWS)), ]
}

show_layout(LAYOUT)
   col
row 1 2 3  4
  3 3 6 9 12
  2 2 5 8 11
  1 1 4 7 10

Convert between coordinates and state labels.

rc_to_s <- function(rc) LAYOUT[rbind(rc)]

s_to_rc <- function(s) drop(which(LAYOUT == s, arr.ind = TRUE, useNames = FALSE))


rc_to_s(c(3, 4))
[1] 12
s_to_rc(12)
[1] 3 4

Define the start state.

start <- 1L


Define the terminal states. The unreachable wall square (state 5) is also included.

is_terminal <- function(s) s %in% c(5, 11, 12)

2.2 Actions

The complete set of actions is \(A = \{\mathrm{Up}, \mathrm{Right}, \mathrm{Down}, \mathrm{Left}, \mathrm{None}\}\). Not all actions are available in every state. The action None is added as the only possible action in an absorbing state.

A <- c("Up", "Right", "Down", "Left", "None")

actions <- function(s) {

    # absorbing states
    if (s == 11 || s == 12)
        return("None")

    # illegal state
    if (s == 5)
        return("None")

    c("Up", "Right", "Down", "Left")
}

lapply(S, actions)
[[1]]
[1] "Up"    "Right" "Down"  "Left" 

[[2]]
[1] "Up"    "Right" "Down"  "Left" 

[[3]]
[1] "Up"    "Right" "Down"  "Left" 

[[4]]
[1] "Up"    "Right" "Down"  "Left" 

[[5]]
[1] "None"

[[6]]
[1] "Up"    "Right" "Down"  "Left" 

[[7]]
[1] "Up"    "Right" "Down"  "Left" 

[[8]]
[1] "Up"    "Right" "Down"  "Left" 

[[9]]
[1] "Up"    "Right" "Down"  "Left" 

[[10]]
[1] "Up"    "Right" "Down"  "Left" 

[[11]]
[1] "None"

[[12]]
[1] "None"

2.3 Transition Model

\(P(s' | s, a)\) is the probability of going from state \(s\) to \(s'\) when taking action \(a\). We will create a matrix \(P_a(s' | s)\) for each action.

calc_transition <- function(s, action) {
    action <- match.arg(action, choices = A)

    if (length(s) > 1)
        return(t(sapply(s, calc_transition, action = action)))

    # deal with absorbing and illegal state
    if (s == 11 || s == 12 || s == 5 || action == "None") {
        P <- rep(0, length(S))
        P[s] <- 1
        return(P)
    }

    action_to_delta <- list(Up = c(+1, 0), Down = c(-1, 0), Right = c(0, +1), Left = c(0,
        -1))
    delta <- action_to_delta[[action]]
    dr <- delta[1]
    dc <- delta[2]

    rc <- s_to_rc(s)
    r <- rc[1]
    c <- rc[2]

    if (dr != 0 && dc != 0)
        stop("You can only go up/down or right/left!")

    P <- matrix(0, nrow = ROWS, ncol = COLS)

    # UP/DOWN
    if (dr != 0) {
        new_r <- r + dr
        if (new_r > ROWS || new_r < 1)
            new_r <- r
        ## can't go to the wall at (2, 2)
        if (new_r == 2 && c == 2)
            new_r <- r
        P[new_r, c] <- 0.8

        if (c < COLS & !(r == 2 & (c + 1) == 2))
            P[r, c + 1] <- 0.1 else P[r, c] <- P[r, c] + 0.1
        if (c > 1 & !(r == 2 & (c - 1) == 2))
            P[r, c - 1] <- 0.1 else P[r, c] <- P[r, c] + 0.1
    }

    # RIGHT/LEFT
    if (dc != 0) {
        new_c <- c + dc
        if (new_c > COLS || new_c < 1)
            new_c <- c
        ## can't go to the wall at (2, 2)
        if (r == 2 && new_c == 2)
            new_c <- c
        P[r, new_c] <- 0.8

        if (r < ROWS & !((r + 1) == 2 & c == 2))
            P[r + 1, c] <- 0.1 else P[r, c] <- P[r, c] + 0.1
        if (r > 1 & !((r - 1) == 2 & c == 2))
            P[r - 1, c] <- 0.1 else P[r, c] <- P[r, c] + 0.1
    }

    as.vector(P)
}

Try to go up from state 1 (this is (1,1), the bottom-left corner). Note: we cannot move left off the grid, so there is a 0.1 probability of staying in place.

calc_transition(1, "Up")
 [1] 0.1 0.8 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
show_layout(calc_transition(1, "Up"))
   col
row   1   2 3 4
  3 0.0 0.0 0 0
  2 0.8 0.0 0 0
  1 0.1 0.1 0 0


Try to go right from state 2, which is (2,1). Since the square to the right is blocked by the wall, there is a 0.8 probability of staying in place.

show_layout(calc_transition(2, "Right"))
   col
row   1 2 3 4
  3 0.1 0 0 0
  2 0.8 0 0 0
  1 0.1 0 0 0


Calculate the transition probabilities from each state to every other state. Each row represents a state \(s\) and each column a state \(s'\), so we get a complete definition of \(P_a(s' | s)\). Note that the matrix is stochastic (all rows add up to 1).

Create a matrix for each action.

P_matrices <- lapply(A, FUN = function(a) calc_transition(S, a))
names(P_matrices) <- A
str(P_matrices)
List of 5
 $ Up   : num [1:12, 1:12] 0.1 0 0 0.1 0 0 0 0 0 0 ...
 $ Right: num [1:12, 1:12] 0.1 0.1 0 0 0 0 0 0 0 0 ...
 $ Down : num [1:12, 1:12] 0.9 0.8 0 0.1 0 0 0 0 0 0 ...
 $ Left : num [1:12, 1:12] 0.9 0.1 0 0.8 0 0 0 0 0 0 ...
 $ None : num [1:12, 1:12] 1 0 0 0 0 0 0 0 0 0 ...
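
As a quick sanity check (this snippet is an addition to the original code), we can verify that each \(P_a\) is indeed row-stochastic:

sapply(P_matrices, FUN = function(m) all(abs(rowSums(m) - 1) < 1e-12))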

Create a function interface for \(P(s' | s, a)\).

P <- function(sp, s, a) P_matrices[[a]][s, sp]

P(2, 1, "Up")
[1] 0.8
P(5, 4, "Up")
[1] 0

2.4 Reward

\(R(s, a, s')\) defines the reward for the transition from \(s\) to \(s'\) with action \(a\).

For the textbook example we have:

  • Any move costs utility (a reward of -0.04).
  • Going to state 12 has a reward of +1.
  • Going to state 11 has a reward of -1.

Note that once you are in an absorbing state (11 or 12), the problem is over and there is no more reward!

R <- function(s, a, s_prime) {
    ## no more reward once we are in 11 or 12.
    if (a == "None" || s == 11 || s == 12)
        return(0)

    ## transition to the absorbing states.
    if (s_prime == 12)
        return(+1)
    if (s_prime == 11)
        return(-1)

    ## cost for each move
    return(-0.04)
}

R(1, "Up", 2)
[1] -0.04
R(9, "Right", 12)
[1] 1
R(12, "None", 12)
[1] 0

2.5 Policy

The solution to an MDP is a policy \(\pi\), which defines which action to take in each state. We represent deterministic policies as a vector of actions. I make up a policy that always goes up and then to the right once the agent hits the top row.

pi_manual <- rep("Up", times = length(S))
pi_manual[c(3, 6, 9)] <- "Right"
pi_manual
 [1] "Up"    "Up"    "Right" "Up"    "Up"    "Right" "Up"    "Up"    "Right"
[10] "Up"    "Up"    "Up"   
show_layout(pi_manual)
   col
row 1       2       3       4   
  3 "Right" "Right" "Right" "Up"
  2 "Up"    "Up"    "Up"    "Up"
  1 "Up"    "Up"    "Up"    "Up"

We can also create a random policy by randomly choosing from the available actions for each state.

create_random_deterministic_policy <- function() structure(sapply(S, FUN = function(s) sample(actions(s),
    1L)), names = S)

set.seed(1234)
pi_random <- create_random_deterministic_policy()
pi_random
      1       2       3       4       5       6       7       8       9      10 
 "Left"  "Left" "Right" "Right"  "None"  "Left"  "Down"    "Up"    "Up" "Right" 
     11      12 
 "None"  "None" 
show_layout(pi_random)
   col
row 1       2       3      4      
  3 "Right" "Left"  "Up"   "None" 
  2 "Left"  "None"  "Up"   "None" 
  1 "Left"  "Right" "Down" "Right"

Stochastic policies use probabilities for the actions in each state. We use a simple table of probabilities where each row is a state and the columns are the actions. Here we create a random \(\epsilon\)-soft policy, where each available action in a state has a probability of at least \(\epsilon / |A(s)|\).

We can make a deterministic policy soft.

make_policy_soft <- function(pi, epsilon = 0.1) {
    if (!is.vector(pi))
        stop("pi is not a deterministic policy!")

    p <- matrix(0, nrow = length(S), ncol = length(A), dimnames = list(S, A))

    for (s in S) {
        p[s, actions(s)] <- epsilon/length(actions(s))
        p[s, pi[s]] <- p[s, pi[s]] + (1 - epsilon)
    }

    p
}

make_policy_soft(pi_random)
      Up Right  Down  Left None
1  0.025 0.025 0.025 0.925    0
2  0.025 0.025 0.025 0.925    0
3  0.025 0.925 0.025 0.025    0
4  0.025 0.925 0.025 0.025    0
5  0.000 0.000 0.000 0.000    1
6  0.025 0.025 0.025 0.925    0
7  0.025 0.025 0.925 0.025    0
8  0.925 0.025 0.025 0.025    0
9  0.925 0.025 0.025 0.025    0
10 0.025 0.925 0.025 0.025    0
11 0.000 0.000 0.000 0.000    1
12 0.000 0.000 0.000 0.000    1
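
A quick check (added here) that each row of the resulting soft policy is a proper probability distribution:

rowSums(make_policy_soft(pi_random))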


Or we can create a completely random \(\epsilon\)-soft policy.

create_random_epsilon_soft_policy <- function(epsilon = 0.1) {

    # split total randomly into n numbers that add up to total
    random_split <- function(n, total) {
        if (n == 1)
            return(total)

        bordersR <- c(sort(runif(n - 1)), 1)
        bordersL <- c(0, bordersR[1:(n - 1)])
        (bordersR - bordersL) * total
    }

    p <- matrix(0, nrow = length(S), ncol = length(A), dimnames = list(S, A))
    for (s in S) p[s, actions(s)] <- epsilon/length(actions(s)) + random_split(n = length(actions(s)),
        1 - epsilon)

    p
}

set.seed(1234)
pi_random_epsilon_soft <- create_random_epsilon_soft_policy()
pi_random_epsilon_soft
      Up Right  Down  Left None
1  0.127 0.471 0.037 0.365    0
2  0.586 0.040 0.224 0.150    0
3  0.034 0.226 0.415 0.326    0
4  0.488 0.053 0.159 0.301    0
5  0.000 0.000 0.000 0.000    1
6  0.279 0.034 0.593 0.094    0
7  0.265 0.042 0.521 0.171    0
8  0.193 0.066 0.101 0.640    0
9  0.061 0.132 0.154 0.653    0
10 0.222 0.301 0.281 0.195    0
11 0.000 0.000 0.000 0.000    1
12 0.000 0.000 0.000 0.000    1

2.6 Expected Utility

The expected utility can be calculated by

\(U^\pi = E\left[\sum_{t=0}^\infty \gamma^t R(s_t, \pi(s_t), s_{t+1})\right]\)

We need to define the discount factor.

GAMMA <- 1

We can evaluate the utility of a policy using simulation. For the stochastic transition model, we need to be able to sample the state \(s'\) the system transitions to when using action \(a\) in state \(s\).

sample_transition <- function(s, a) sample(S, size = 1, prob = P_matrices[[a]][s,
    ])

sample_transition(1, "Up")
[1] 1
table(replicate(n = 100, sample_transition(1, "Up")))

 1  2  4 
 9 84  7 

We can now simulate the utility for one episode. Note that we use the cutoff max_t in case a policy does not end up in a terminal state before that.

simulate_utility <- function(pi, s0 = 1, max_t = 100) {
    s <- s0
    U <- 0
    t <- 0

    while (TRUE) {
        ## get action from policy (matrix means it is a stochastic policy)
        if (!is.matrix(pi))
            a <- pi[s] else a <- sample(A, size = 1, prob = pi[s, ])

        ## sample a transition given the action from the policy
        s_prime <- sample_transition(s, a)

        ##
        U <- U + GAMMA^t * R(s, a, s_prime)

        s <- s_prime

        ## reached an absorbing state?
        if (s == 11 || s == 12 || s == 5)
            break

        t <- t + 1
        if (t >= max_t)
            break
    }

    U
}

Simulate \(N\) episodes.

simulate_utilities <- function(pi, s0 = 1, N = 1000, max_t = 100) replicate(N, simulate_utility(pi,
    s0, max_t))

utility_manual <- simulate_utilities(pi_manual)

The estimated expected utility when starting from state \(s_0 = 1\) is:

mean(utility_manual)
[1] 0.7
hist(utility_manual, xlim = c(-1, 1))
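
Since this is a Monte Carlo estimate, we can also gauge its precision. The following added snippet computes an approximate 95% confidence interval (the exact numbers depend on the random seed):

mean(utility_manual) + c(-1.96, 1.96) * sd(utility_manual)/sqrt(length(utility_manual))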

Compare with the utility of the random policy.

utility_random <- simulate_utilities(pi_random, max_t = 100)

table(utility_random)
utility_random
  -4 
1000 


The random policy performs very poorly: in every episode the agent gets stuck and stumbles around for the full max_t = 100 moves at a cost of 0.04 each, giving a utility of \(100 \times -0.04 = -4\). The manually created policy is clearly better.

We can use simulation to estimate the expected utility for starting from each state following the policy.

U_manual <- sapply(S, FUN = function(s) mean(simulate_utilities(pi_manual, s0 = s)))
show_layout(U_manual)
   col
row    1     2    3     4
  3 0.86  0.90 0.95  0.00
  2 0.79 -0.04 0.72  0.00
  1 0.72  0.36 0.52 -0.87
U_random <- sapply(S, FUN = function(s) mean(simulate_utilities(pi_random, s0 = s)))
show_layout(U_random)
   col
row  1    2    3    4
  3 -4 -4.0 -1.6  0.0
  2 -4  0.0 -1.5  0.0
  1 -4 -1.8 -1.8 -1.4

3 Q-Learning

Q-Learning is an off-policy TD control algorithm that estimates \(Q \approx q_*\). The pseudocode of the Q-learning algorithm is given in AIMA Figure 22.8 as:

AIMA Figure 22.8: An exploratory Q-learning agent. It is an active learner that learns the value \(Q(s, a)\) of each action in each situation. It uses the exploration function \(f\).

The algorithm uses temporal-difference (TD) learning since it updates \(Q[s, a]\) using the TD error \(r + \gamma \max_{a'} Q[s', a'] - Q[s, a]\).

The algorithm performs off-policy learning since it learns a different policy than the one used to perform actions.

  • Behavior policy: defined by the exploration function \(f\). It needs to guarantee exploration, that is, every available action needs to have a chance of being tried. A popular choice is an \(\epsilon\)-greedy policy, but any \(\epsilon\)-soft policy that increases an action's probability with \(Q\) and decreases it with \(N\) can be used.

  • Target policy: the learned policy. Q-learning learns a deterministic greedy policy, represented by the \(\max_{a'} Q[s', a']\) term in the update of \(Q\) (written out below).
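
Putting the pieces together, the update implemented in the agent's run() method below is

\(Q[s, a] \leftarrow Q[s, a] + \alpha(N[s, a]) \left(r + \gamma \max_{a'} Q[s', a'] - Q[s, a]\right)\)

where \(\alpha\) is the learning rate and \(N[s, a]\) counts how often the state/action pair has been tried.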

3.1 Agent

The Q-learning agent parameters and helper functions.

# helper: argmax breaking ties randomly.
argmax <- function(x) {
    y <- seq_along(x)[x == max(x)]
    if (length(y) > 1L)
        sample(y, 1L) else y
}
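
Since all Q-values are initialized to 0, ties are common early in learning. The following added snippet illustrates that argmax breaks ties uniformly at random:

set.seed(1234)
table(replicate(n = 1000, argmax(c(0, 0, 0, 0))))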

We use a reference class to store persistent information: the agent's state and the \(Q\) and \(N\) tables. Reference classes use <<- to assign values to member variables (fields).

QL_Agent <- setRefClass(
  "QL_Agent",
  fields = c("S", "A", "Q", "N", "s", "a", "gamma"),
  methods = list(
    initialize = function(S, A, gamma = 1) {
      S <<- S
      A <<- A
      gamma <<- gamma
      reset()
    },
    
    next_episode = function() {
      "Start a new episode."
      
      s <<- NULL
      a <<- 'None'
    },
    
    reset = function() {
      "Reset the agent's state and erase all persistent information."
      
      # set q-values to 0 except use NA for unavailable actions
      Q <<-
        matrix(NA_real_,
               nrow = length(S),
               ncol = length(A),
               dimnames = list(S, A))
      
      for (s in S) Q[s, actions(s)] <<- 0
      
      N <<-
        matrix(0L,
               nrow = length(S),
               ncol = length(A),
               dimnames = list(S, A))
      
      next_episode()
    }
    
  )
)

The agent function run is called with the percepts, i.e., the new state and the associated reward.

QL_Agent$methods(
  # Learning rate:
  # Fixed learning rate
  #alpha = function(n, rate = 0.1) return(rate)

  # A decreasing learning rate converges faster. n is the number of times
  # the state/action pair was visited. The constants depend on the problem.
  alpha = function(n) {
    return (10 / (9 + n))
  },

  # Exploration function (behavior policy): a simple epsilon-greedy policy.
  # Note that epsilon here is the probability of choosing the current best
  # action; otherwise a random available action is sampled.
  maxf = function(s_prime, a_best, epsilon = .8) {
    if (runif(1) < epsilon)
      action <- a_best
    else
      action <- sample(actions(s_prime), size = 1)
  
    return(action)
  },
 
  # Q-learning algorithm 
  run = function(s_prime, r) {
    "Agent function. The pecepts are the current state and the last reward."
    
    # how often did the agent try this s/a combination
    N[s, a] <<- N[s, a] + 1L
    
    # Greedy target policy: find best available action for s' (breaking ties randomly).
    available_actions <- actions(s_prime)
    a_best <- available_actions[argmax(Q[s_prime,available_actions])]
    
    # No update for start of new episode
    if (!is.null(s)) {
      Q[s, a] <<-
        Q[s, a] + alpha(N[s, a]) * (r + gamma * Q[s_prime, a_best] - Q[s, a])
    }
    
    if (is_terminal(s_prime))
      a <<- 'None'
    else
      a <<- maxf(s_prime, a_best)
    
    s <<- s_prime
    
    return(a)
  }
  
)

Create an agent.

agent <- QL_Agent$new(S, A)
agent
Reference class object of class "QL_Agent"
Field "S":
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
Field "A":
[1] "Up"    "Right" "Down"  "Left"  "None" 
Field "Q":
   Up Right Down Left None
1   0     0    0    0   NA
2   0     0    0    0   NA
3   0     0    0    0   NA
4   0     0    0    0   NA
5  NA    NA   NA   NA    0
6   0     0    0    0   NA
7   0     0    0    0   NA
8   0     0    0    0   NA
9   0     0    0    0   NA
10  0     0    0    0   NA
11 NA    NA   NA   NA    0
12 NA    NA   NA   NA    0
Field "N":
   Up Right Down Left None
1   0     0    0    0    0
2   0     0    0    0    0
3   0     0    0    0    0
4   0     0    0    0    0
5   0     0    0    0    0
6   0     0    0    0    0
7   0     0    0    0    0
8   0     0    0    0    0
9   0     0    0    0    0
10  0     0    0    0    0
11  0     0    0    0    0
12  0     0    0    0    0
Field "s":
NULL
Field "a":
[1] "None"
Field "gamma":
[1] 1
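
To see how quickly the learning rate decays with the visit count, we can evaluate the agent's alpha() method for a few counts (an added illustration):

# alpha is 1 for the first visit and decreases toward 0 as n grows
sapply(c(1, 10, 100, 1000), agent$alpha)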

3.2 Environment

This environment simulates \(n\) episodes and returns the utility the agent received in each episode.

simulate_environment <- function(n = 1000, agent, reset_agent = TRUE, verbose = FALSE) {

    reward_history <- numeric(n)
    state <- start

    if (reset_agent)
        agent$reset()
    agent$next_episode()

    i <- 1L  # episode counter
    j <- 1L  # step in episode counter
    reward <- 0
    reward_episode <- 0
    while (TRUE) {
        reward_episode <- reward_episode + GAMMA^j * reward

        # call agent function with percepts (current state and reward)
        action <- agent$run(state, reward)

        if (verbose) {
            if (j == 1L)
                cat("\n*** Episode", i, "***\n")
            cat("Step", j, paste0("(s = ", state, ", r = ", reward, ") -> ", action),
                "\n")
        }

        j <- j + 1L

        if (is_terminal(state)) {
            reward_history[i] <- reward_episode
            if (verbose) {
                cat("Episode reward: ", reward_episode, "\n")

                cat("Q:\n")
                print(agent$Q)

                cat("U:\n")
                print(show_layout(apply(agent$Q, MARGIN = 1, max, na.rm = TRUE)))
            }

            j <- 1L

            # finished n episodes?
            i <- i + 1L
            if (i > n)
                break

            state <- start
            reward <- 0
            reward_episode <- 0

            agent$next_episode()
        } else if (action %in% actions(state)) {
            state_prime <- sample_transition(state, action)
            reward <- R(state, action, state_prime)
            state <- state_prime
        } else {
            stop("The agent chose an illegal action!")
            # reward <- -1e9
        }
    }
    reward_history
}

3.3 Analyzing the First Few Episodes

Run a few episodes.

simulate_environment(n = 10, agent, verbose = TRUE)

*** Episode 1 ***
Step 1 (s = 1, r = 0) -> Up 
Step 2 (s = 4, r = -0.04) -> Up 
Step 3 (s = 4, r = -0.04) -> Right 
Step 4 (s = 7, r = -0.04) -> Right 
Step 5 (s = 10, r = -0.04) -> Right 
Step 6 (s = 10, r = -0.04) -> Left 
Step 7 (s = 11, r = -1) -> None 
Episode reward:  -1.2 
Q:
      Up Right Down Left None
1  -0.04  0.00    0    0   NA
2   0.00  0.00    0    0   NA
3   0.00  0.00    0    0   NA
4  -0.04 -0.04    0    0   NA
5     NA    NA   NA   NA    0
6   0.00  0.00    0    0   NA
7   0.00 -0.04    0    0   NA
8   0.00  0.00    0    0   NA
9   0.00  0.00    0    0   NA
10  0.00 -0.04    0   -1   NA
11    NA    NA   NA   NA    0
12    NA    NA   NA   NA    0
U:
   col
row 1 2 3 4
  3 0 0 0 0
  2 0 0 0 0
  1 0 0 0 0

*** Episode 2 ***
Step 1 (s = 1, r = 0) -> Right 
Step 2 (s = 4, r = -0.04) -> Right 
Step 3 (s = 7, r = -0.04) -> Down 
Step 4 (s = 7, r = -0.04) -> Up 
Step 5 (s = 8, r = -0.04) -> Right 
Step 6 (s = 11, r = -1) -> None 
Episode reward:  -1.2 
Q:
      Up Right  Down Left None
1  -0.04 -0.04  0.00    0   NA
2   0.00  0.00  0.00    0   NA
3   0.00  0.00  0.00    0   NA
4  -0.04 -0.04  0.00    0   NA
5     NA    NA    NA   NA    0
6   0.00  0.00  0.00    0   NA
7  -0.04 -0.04 -0.04    0   NA
8   0.00 -1.00  0.00    0   NA
9   0.00  0.00  0.00    0   NA
10  0.00 -0.04  0.00   -1   NA
11    NA    NA    NA   NA    0
12    NA    NA    NA   NA    0
U:
   col
row 1 2 3 4
  3 0 0 0 0
  2 0 0 0 0
  1 0 0 0 0

*** Episode 3 ***
Step 1 (s = 1, r = 0) -> Right 
Step 2 (s = 1, r = -0.04) -> Down 
Step 3 (s = 1, r = -0.04) -> Down 
Step 4 (s = 1, r = -0.04) -> Left 
Step 5 (s = 1, r = -0.04) -> Left 
Step 6 (s = 1, r = -0.04) -> Up 
Step 7 (s = 2, r = -0.04) -> Up 
Step 8 (s = 2, r = -0.04) -> Right 
Step 9 (s = 2, r = -0.04) -> Down 
Step 10 (s = 1, r = -0.04) -> Up 
Step 11 (s = 2, r = -0.04) -> Left 
Step 12 (s = 2, r = -0.04) -> Left 
Step 13 (s = 3, r = -0.04) -> Right 
Step 14 (s = 6, r = -0.04) -> Up 
Step 15 (s = 9, r = -0.04) -> Down 
Step 16 (s = 8, r = -0.04) -> Up 
Step 17 (s = 11, r = -1) -> None 
Episode reward:  -1.6 
Q:
      Up Right  Down   Left None
1  -0.04 -0.04 -0.04 -0.076   NA
2  -0.04 -0.04 -0.08 -0.040   NA
3   0.00 -0.04  0.00  0.000   NA
4  -0.04 -0.04  0.00  0.000   NA
5     NA    NA    NA     NA    0
6  -0.04  0.00  0.00  0.000   NA
7  -0.04 -0.04 -0.04  0.000   NA
8  -1.00 -1.00  0.00  0.000   NA
9   0.00  0.00 -0.04  0.000   NA
10  0.00 -0.04  0.00 -1.000   NA
11    NA    NA    NA     NA    0
12    NA    NA    NA     NA    0
U:
   col
row     1 2 3 4
  3  0.00 0 0 0
  2 -0.04 0 0 0
  1 -0.04 0 0 0

*** Episode 4 ***
Step 1 (s = 1, r = 0) -> Down 
Step 2 (s = 1, r = -0.04) -> Right 
Step 3 (s = 4, r = -0.04) -> Down 
Step 4 (s = 4, r = -0.04) -> Left 
Step 5 (s = 1, r = -0.04) -> Right 
Step 6 (s = 4, r = -0.04) -> Down 
Step 7 (s = 1, r = -0.04) -> Up 
Step 8 (s = 2, r = -0.04) -> Left 
Step 9 (s = 3, r = -0.04) -> Left 
Step 10 (s = 2, r = -0.04) -> Right 
Step 11 (s = 2, r = -0.04) -> Up 
Step 12 (s = 3, r = -0.04) -> Up 
Step 13 (s = 3, r = -0.04) -> Down 
Step 14 (s = 2, r = -0.04) -> Left 
Step 15 (s = 2, r = -0.04) -> Left 
Step 16 (s = 2, r = -0.04) -> Up 
Step 17 (s = 3, r = -0.04) -> Right 
Step 18 (s = 2, r = -0.04) -> Up 
Step 19 (s = 3, r = -0.04) -> Up 
Step 20 (s = 3, r = -0.04) -> Left 
Step 21 (s = 3, r = -0.04) -> Down 
Step 22 (s = 6, r = -0.04) -> Left 
Step 23 (s = 6, r = -0.04) -> Down 
Step 24 (s = 3, r = -0.04) -> Down 
Step 25 (s = 2, r = -0.04) -> Down 
Step 26 (s = 2, r = -0.04) -> Right 
Step 27 (s = 2, r = -0.04) -> Left 
Step 28 (s = 2, r = -0.04) -> Left 
Step 29 (s = 2, r = -0.04) -> Up 
Step 30 (s = 3, r = -0.04) -> Up 
Step 31 (s = 6, r = -0.04) -> Right 
Step 32 (s = 9, r = -0.04) -> Up 
Step 33 (s = 6, r = -0.04) -> Up 
Step 34 (s = 6, r = -0.04) -> Up 
Step 35 (s = 3, r = -0.04) -> Up 
Step 36 (s = 3, r = -0.04) -> Up 
Step 37 (s = 3, r = -0.04) -> Up 
Step 38 (s = 3, r = -0.04) -> Down 
Step 39 (s = 6, r = -0.04) -> Left 
Step 40 (s = 3, r = -0.04) -> Down 
Step 41 (s = 6, r = -0.04) -> Right 
Step 42 (s = 9, r = -0.04) -> Right 
Step 43 (s = 12, r = 1) -> None 
Episode reward:  -0.64 
Q:
       Up  Right   Down   Left None
1  -0.071 -0.071 -0.073 -0.076   NA
2  -0.106 -0.110 -0.113 -0.113   NA
3  -0.131 -0.107 -0.082 -0.113   NA
4  -0.040 -0.040 -0.076 -0.080   NA
5      NA     NA     NA     NA    0
6  -0.084 -0.040 -0.084 -0.118   NA
7  -0.040 -0.040 -0.040  0.000   NA
8  -1.000 -1.000  0.000  0.000   NA
9  -0.080  1.000 -0.040  0.000   NA
10  0.000 -0.040  0.000 -1.000   NA
11     NA     NA     NA     NA    0
12     NA     NA     NA     NA    0
U:
   col
row      1     2 3 4
  3 -0.082 -0.04 1 0
  2 -0.106  0.00 0 0
  1 -0.071 -0.04 0 0

*** Episode 5 ***
Step 1 (s = 1, r = 0) -> Down 
Step 2 (s = 1, r = -0.04) -> Right 
Step 3 (s = 4, r = -0.04) -> Right 
Step 4 (s = 7, r = -0.04) -> Left 
Step 5 (s = 4, r = -0.04) -> Right 
Step 6 (s = 7, r = -0.04) -> Down 
Step 7 (s = 7, r = -0.04) -> Up 
Step 8 (s = 8, r = -0.04) -> Left 
Step 9 (s = 8, r = -0.04) -> Left 
Step 10 (s = 9, r = -0.04) -> Right 
Step 11 (s = 12, r = 1) -> None 
Episode reward:  0.64 
Q:
       Up  Right   Down   Left None
1  -0.071 -0.077 -0.102 -0.076   NA
2  -0.106 -0.110 -0.113 -0.113   NA
3  -0.131 -0.107 -0.082 -0.113   NA
4  -0.040 -0.071 -0.076 -0.080   NA
5      NA     NA     NA     NA    0
6  -0.084 -0.040 -0.084 -0.118   NA
7  -0.040 -0.040 -0.076 -0.080   NA
8  -1.000 -1.000  0.000  0.869   NA
9  -0.080  1.000 -0.040  0.000   NA
10  0.000 -0.040  0.000 -1.000   NA
11     NA     NA     NA     NA    0
12     NA     NA     NA     NA    0
U:
   col
row      1     2     3 4
  3 -0.082 -0.04  1.00 0
  2 -0.106  0.00  0.87 0
  1 -0.071 -0.04 -0.04 0

*** Episode 6 ***
Step 1 (s = 1, r = 0) -> Up 
Step 2 (s = 2, r = -0.04) -> Right 
Step 3 (s = 2, r = -0.04) -> Up 
Step 4 (s = 3, r = -0.04) -> Down 
Step 5 (s = 2, r = -0.04) -> Right 
Step 6 (s = 2, r = -0.04) -> Left 
Step 7 (s = 2, r = -0.04) -> Left 
Step 8 (s = 2, r = -0.04) -> Down 
Step 9 (s = 2, r = -0.04) -> Down 
Step 10 (s = 1, r = -0.04) -> Left 
Step 11 (s = 1, r = -0.04) -> Left 
Step 12 (s = 2, r = -0.04) -> Up 
Step 13 (s = 3, r = -0.04) -> Down 
Step 14 (s = 6, r = -0.04) -> Right 
Step 15 (s = 6, r = -0.04) -> Right 
Step 16 (s = 9, r = -0.04) -> Right 
Step 17 (s = 12, r = 1) -> None 
Episode reward:  0.4 
Q:
       Up  Right   Down  Left None
1  -0.124 -0.077 -0.102 -0.15   NA
2  -0.135 -0.149 -0.123 -0.15   NA
3  -0.131 -0.107 -0.098 -0.11   NA
4  -0.040 -0.071 -0.076 -0.08   NA
5      NA     NA     NA    NA    0
6  -0.084  0.722 -0.084 -0.12   NA
7  -0.040 -0.040 -0.076 -0.08   NA
8  -1.000 -1.000  0.000  0.87   NA
9  -0.080  1.000 -0.040  0.00   NA
10  0.000 -0.040  0.000 -1.00   NA
11     NA     NA     NA    NA    0
12     NA     NA     NA    NA    0
U:
   col
row      1     2     3 4
  3 -0.098  0.72  1.00 0
  2 -0.123  0.00  0.87 0
  1 -0.077 -0.04 -0.04 0

*** Episode 7 ***
Step 1 (s = 1, r = 0) -> Right 
Step 2 (s = 4, r = -0.04) -> Up 
Step 3 (s = 4, r = -0.04) -> Up 
Step 4 (s = 1, r = -0.04) -> Right 
Step 5 (s = 4, r = -0.04) -> Right 
Step 6 (s = 7, r = -0.04) -> Right 
Step 7 (s = 10, r = -0.04) -> Down 
Step 8 (s = 10, r = -0.04) -> Down 
Step 9 (s = 10, r = -0.04) -> Up 
Step 10 (s = 11, r = -1) -> None 
Episode reward:  -1.3 
Q:
       Up  Right   Down  Left None
1  -0.124 -0.099 -0.102 -0.15   NA
2  -0.135 -0.149 -0.123 -0.15   NA
3  -0.131 -0.107 -0.098 -0.11   NA
4  -0.112 -0.077 -0.076 -0.08   NA
5      NA     NA     NA    NA    0
6  -0.084  0.722 -0.084 -0.12   NA
7  -0.040 -0.040 -0.076 -0.08   NA
8  -1.000 -1.000  0.000  0.87   NA
9  -0.080  1.000 -0.040  0.00   NA
10 -1.000 -0.040 -0.040 -1.00   NA
11     NA     NA     NA    NA    0
12     NA     NA     NA    NA    0
U:
   col
row      1      2     3     4
  3 -0.098  0.722  1.00  0.00
  2 -0.123  0.000  0.87  0.00
  1 -0.099 -0.076 -0.04 -0.04

*** Episode 8 ***
Step 1 (s = 1, r = 0) -> Right 
Step 2 (s = 4, r = -0.04) -> Down 
Step 3 (s = 4, r = -0.04) -> Down 
Step 4 (s = 4, r = -0.04) -> Left 
Step 5 (s = 1, r = -0.04) -> Down 
Step 6 (s = 4, r = -0.04) -> Right 
Step 7 (s = 7, r = -0.04) -> Up 
Step 8 (s = 8, r = -0.04) -> Left 
Step 9 (s = 8, r = -0.04) -> Left 
Step 10 (s = 8, r = -0.04) -> Left 
Step 11 (s = 8, r = -0.04) -> Left 
Step 12 (s = 8, r = -0.04) -> Left 
Step 13 (s = 8, r = -0.04) -> Up 
Step 14 (s = 9, r = -0.04) -> Right 
Step 15 (s = 12, r = 1) -> None 
Episode reward:  0.48 
Q:
       Up  Right   Down  Left None
1  -0.124 -0.109 -0.113 -0.15   NA
2  -0.135 -0.149 -0.123 -0.15   NA
3  -0.131 -0.107 -0.098 -0.11   NA
4  -0.112 -0.079 -0.116 -0.14   NA
5      NA     NA     NA    NA    0
6  -0.084  0.722 -0.084 -0.12   NA
7   0.684 -0.040 -0.076 -0.08   NA
8   0.782 -1.000  0.000  0.72   NA
9  -0.080  1.000 -0.040  0.00   NA
10 -1.000 -0.040 -0.040 -1.00   NA
11     NA     NA     NA    NA    0
12     NA     NA     NA    NA    0
U:
   col
row      1      2    3     4
  3 -0.098  0.722 1.00  0.00
  2 -0.123  0.000 0.78  0.00
  1 -0.109 -0.079 0.68 -0.04

*** Episode 9 ***
Step 1 (s = 1, r = 0) -> Up 
Step 2 (s = 2, r = -0.04) -> Down 
Step 3 (s = 1, r = -0.04) -> Right 
Step 4 (s = 4, r = -0.04) -> Left 
Step 5 (s = 1, r = -0.04) -> Down 
Step 6 (s = 1, r = -0.04) -> Down 
Step 7 (s = 4, r = -0.04) -> Right 
Step 8 (s = 7, r = -0.04) -> Up 
Step 9 (s = 8, r = -0.04) -> Up 
Step 10 (s = 9, r = -0.04) -> Right 
Step 11 (s = 9, r = -0.04) -> Right 
Step 12 (s = 8, r = -0.04) -> Up 
Step 13 (s = 9, r = -0.04) -> Right 
Step 14 (s = 12, r = 1) -> None 
Episode reward:  0.52 
Q:
       Up Right   Down  Left None
1  -0.150 -0.11 -0.127 -0.15   NA
2  -0.135 -0.15 -0.142 -0.15   NA
3  -0.131 -0.11 -0.098 -0.11   NA
4  -0.112  0.37 -0.116 -0.15   NA
5      NA    NA     NA    NA    0
6  -0.084  0.72 -0.084 -0.12   NA
7   0.729 -0.04 -0.076 -0.08   NA
8   0.890 -1.00  0.000  0.72   NA
9  -0.080  0.97 -0.040  0.00   NA
10 -1.000 -0.04 -0.040 -1.00   NA
11     NA    NA     NA    NA    0
12     NA    NA     NA    NA    0
U:
   col
row      1    2    3     4
  3 -0.098 0.72 0.97  0.00
  2 -0.135 0.00 0.89  0.00
  1 -0.115 0.37 0.73 -0.04

*** Episode 10 ***
Step 1 (s = 1, r = 0) -> Right 
Step 2 (s = 4, r = -0.04) -> Right 
Step 3 (s = 4, r = -0.04) -> Right 
Step 4 (s = 7, r = -0.04) -> Up 
Step 5 (s = 8, r = -0.04) -> Up 
Step 6 (s = 9, r = -0.04) -> Right 
Step 7 (s = 9, r = -0.04) -> Up 
Step 8 (s = 9, r = -0.04) -> Right 
Step 9 (s = 12, r = 1) -> None 
Episode reward:  0.72 
Q:
       Up Right   Down  Left None
1  -0.150  0.12 -0.127 -0.15   NA
2  -0.135 -0.15 -0.142 -0.15   NA
3  -0.131 -0.11 -0.098 -0.11   NA
4  -0.112  0.54 -0.116 -0.15   NA
5      NA    NA     NA    NA    0
6  -0.084  0.72 -0.084 -0.12   NA
7   0.815 -0.04 -0.076 -0.08   NA
8   0.918 -1.00  0.000  0.72   NA
9   0.816  0.98 -0.040  0.00   NA
10 -1.000 -0.04 -0.040 -1.00   NA
11     NA    NA     NA    NA    0
12     NA    NA     NA    NA    0
U:
   col
row      1    2    3     4
  3 -0.098 0.72 0.98  0.00
  2 -0.135 0.00 0.92  0.00
  1  0.121 0.54 0.81 -0.04
 [1] -1.20 -1.16 -1.60 -0.64  0.64  0.40 -1.32  0.48  0.52  0.72

Note that the agent initially has no idea about the optimal policy and runs around randomly. After it stumbles upon the goal for the first time, the reward starts to propagate back by one square for each episode in which the agent reaches the goal. The agent also starts to find the goal faster.

Calculate the value function \(U\) from the learned Q-function as the largest Q-value of any action in a state, i.e., \(U(s) = \max_a Q(s, a)\).

U <- apply(agent$Q, MARGIN = 1, max, na.rm = TRUE)
show_layout(U)
   col
row      1    2    3     4
  3 -0.098 0.72 0.98  0.00
  2 -0.135 0.00 0.92  0.00
  1  0.121 0.54 0.81 -0.04

Note that not all states are visited the same number of times, so some estimates are worse than others. Since Q-learning uses an \(\epsilon\)-greedy behavior policy, the best actions are chosen most often, leading to better estimates along the optimal policy.
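
The agent's \(N\) table lets us inspect this directly. The following snippet (an addition to the original analysis; the counts will differ between runs) shows how often each state was visited:

show_layout(rowSums(agent$N))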

Extract the learned policy from \(Q\) as the action with the highest Q-value for each state.

pi <- A[apply(agent$Q, MARGIN = 1, which.max)]
show_layout(pi)
   col
row 1       2       3       4      
  3 "Down"  "Right" "Right" "None" 
  2 "Up"    "None"  "Up"    "None" 
  1 "Right" "Right" "Up"    "Right"

Even after simulating just 10 episodes, we get a reasonable policy that mostly leads the agent to the goal state. We can estimate the expected utility of this policy using simulation.

mean(simulate_utilities(pi, s0 = 1, N = 100))
[1] 0.3

3.4 Learning a Good Policy

Simulating more episodes results in better estimates (also for the rarely visited states).

reward_history <- simulate_environment(n = 10000, agent, verbose = FALSE, reset_agent = TRUE)

plot(cumsum(reward_history[1:500]), type = "l", ylab = "Total reward collected",
    xlab = "Episode")

The cumulative reward approaches a straight line after just a few episodes, meaning that a reasonable policy was found early on.
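
We can quantify this with the average per-episode reward after the initial exploration phase (an added check; the exact value depends on the run):

mean(tail(reward_history, 9000))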

show_layout(LAYOUT)
   col
row 1 2 3  4
  3 3 6 9 12
  2 2 5 8 11
  1 1 4 7 10
agent$Q
      Up Right Down  Left None
1   0.75  0.67 0.70 0.715   NA
2   0.80  0.76 0.72 0.762   NA
3   0.82  0.85 0.78 0.810   NA
4   0.66  0.62 0.66 0.699   NA
5     NA    NA   NA    NA    0
6   0.87  0.91 0.87 0.829   NA
7   0.62  0.38 0.58 0.657   NA
8   0.66 -0.57 0.39 0.653   NA
9   0.92  0.96 0.71 0.860   NA
10 -0.40  0.18 0.24 0.078   NA
11    NA    NA   NA    NA    0
12    NA    NA   NA    NA    0

Learned state-value function.

U <- apply(agent$Q, MARGIN = 1, max, na.rm = TRUE)
show_layout(U)
   col
row    1    2    3    4
  3 0.85 0.91 0.96 0.00
  2 0.80 0.00 0.66 0.00
  1 0.75 0.70 0.66 0.24

Extract the greedy policy.

pi <- A[apply(agent$Q, MARGIN = 1, which.max)]
show_layout(pi)
   col
row 1       2       3       4     
  3 "Right" "Right" "Right" "None"
  2 "Up"    "None"  "Up"    "None"
  1 "Up"    "Left"  "Left"  "Down"

We can estimate the expected utility of the policy using simulation.

mean(simulate_utilities(pi, s0 = 1, N = 100))
[1] 0.7
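
As with the manually created policy above, we could also estimate the utility of the learned policy from every start state (an added check; the values are again simulation-based estimates):

U_learned <- sapply(S, FUN = function(s) mean(simulate_utilities(pi, s0 = s)))
show_layout(U_learned)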