This code is provided under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License.
AIMA chapter 22 is about reinforcement learning, where the agent learns from reward signals. We will implement the key concepts using R for the AIMA 4x3 grid world example. A more detailed treatment of the algorithm and its properties can be found in Reinforcement Learning: An Introduction (RL) by Sutton and Barto (2020) in Chapter 6: Temporal-Difference Learning.
The environment used is an MDP, but instead of trying to solve the MDP directly and estimating the value function \(U(s)\), we will try to learn a policy from interactions with the environment. The MDP’s transition model will only be used to simulate the response of the environment to actions by the agent.
The code in this notebook defines explicit functions matching the textbook definitions and is for demonstration purposes only. Efficient implementations for larger problems use fast vector multiplications instead.
The example used is the simple 4x3 grid world described in AIMA Figure 12.1 and used again in Section 22.1.
In the following, we implement states, actions, the reward function, and the transition model. We also show how to represent a policy and how to estimate the expected utility using simulation.
The MDP will be defined as the following global variables/functions:

S
: set of states.

A
: set of actions.

actions(s)
: returns the available actions for state \(s\).

P(sp, s, a)
: a function returning the transition probability \(P(s' | s, a)\).

R(s, a, s_prime)
: reward function for the transition from \(s\) to \(s'\) with action \(a\).

Policies are represented as a vector of actions, one per state (deterministic policies), or as a table with the probability of each action in each state (stochastic policies).

Other useful functions:

sample_transition(s, a)
: returns a state \(s'\) sampled using the transition model.

simulate_utilities(pi, s0 = 1, N = 1000, max_t = 100)
: estimates the utility of following policy \(\pi\), starting \(N\) episodes in state \(s_0\). The maximal episode length is max_t to ensure that the function finishes also for policies that do not lead to a terminal state.

We define the atomic state space \(S\) by labeling the states \(1, 2, ...\). We convert coordinates (rows, columns) to the state label.
# I use capitalized variables as global constants
COLS <- 4
ROWS <- 3
S <- seq_len(ROWS * COLS)
LAYOUT <- matrix(S, nrow = ROWS, ncol = COLS)
LAYOUT
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
Note that the rows are displayed upside-down compared to the textbook, so we use a function to display them in reverse order.
show_layout <- function(x) {
x <- matrix(x, ncol = COLS, nrow = ROWS, dimnames = list(row = seq_len(ROWS),
col = seq_len(COLS)))
x[rev(seq_len(ROWS)), ]
}
show_layout(LAYOUT)
col
row 1 2 3 4
3 3 6 9 12
2 2 5 8 11
1 1 4 7 10
Convert between coordinates and state labels.
rc_to_s <- function(rc) LAYOUT[rbind(rc)]
s_to_rc <- function(s) drop(which(LAYOUT == s, arr.ind = TRUE, useNames = FALSE))
rc_to_s(c(3, 4))
[1] 12
s_to_rc(12)
[1] 3 4
Define the start state.
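The content of this cell is not rendered. Every episode trace below starts in state 1 and the environment code later uses a variable called start, so presumably:
start <- 1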
Define terminal states.
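This cell is also not rendered. Given how states 11 and 12 are treated as absorbing throughout the code, a sketch could be:
is_terminal <- function(s) s == 11 || s == 12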
The complete set of actions is \(A = \{\text{Up}, \text{Right}, \text{Down}, \text{Left}, \text{None}\}\). Not all actions are available in every state. Also, the action None is added as the only possible action in an absorbing state.
A <- c("Up", "Right", "Down", "Left", "None")
actions <- function(s) {
# absorbing states
if (s == 11 || s == 12)
return("None")
# illegal state
if (s == 5)
return("None")
c("Up", "Right", "Down", "Left")
}
lapply(S, actions)
[[1]]
[1] "Up" "Right" "Down" "Left"
[[2]]
[1] "Up" "Right" "Down" "Left"
[[3]]
[1] "Up" "Right" "Down" "Left"
[[4]]
[1] "Up" "Right" "Down" "Left"
[[5]]
[1] "None"
[[6]]
[1] "Up" "Right" "Down" "Left"
[[7]]
[1] "Up" "Right" "Down" "Left"
[[8]]
[1] "Up" "Right" "Down" "Left"
[[9]]
[1] "Up" "Right" "Down" "Left"
[[10]]
[1] "Up" "Right" "Down" "Left"
[[11]]
[1] "None"
[[12]]
[1] "None"
\(P(s' | s, a)\) is the probability of going from state \(s\) to \(s'\) when taking action \(a\). We will create a matrix \(P_a(s' | s)\) for each action.
calc_transition <- function(s, action) {
action <- match.arg(action, choices = A)
if (length(s) > 1)
return(t(sapply(s, calc_transition, action = action)))
# deal with absorbing and illegal state
if (s == 11 || s == 12 || s == 5 || action == "None") {
P <- rep(0, length(S))
P[s] <- 1
return(P)
}
action_to_delta <- list(Up = c(+1, 0), Down = c(-1, 0), Right = c(0, +1), Left = c(0,
-1))
delta <- action_to_delta[[action]]
dr <- delta[1]
dc <- delta[2]
rc <- s_to_rc(s)
r <- rc[1]
c <- rc[2]
if (dr != 0 && dc != 0)
stop("You can only go up/down or right/left!")
P <- matrix(0, nrow = ROWS, ncol = COLS)
# UP/DOWN
if (dr != 0) {
new_r <- r + dr
if (new_r > ROWS || new_r < 1)
new_r <- r
## can't go to (2, 2)
if (new_r == 2 && c == 2)
new_r <- r
P[new_r, c] <- 0.8
if (c < COLS & !(r == 2 & (c + 1) == 2))
P[r, c + 1] <- 0.1 else P[r, c] <- P[r, c] + 0.1
if (c > 1 & !(r == 2 & (c - 1) == 2))
P[r, c - 1] <- 0.1 else P[r, c] <- P[r, c] + 0.1
}
# RIGHT/LEFT
if (dc != 0) {
new_c <- c + dc
if (new_c > COLS || new_c < 1)
new_c <- c
## can't go to (2, 2)
if (r == 2 && new_c == 2)
new_c <- c
P[r, new_c] <- 0.8
if (r < ROWS & !((r + 1) == 2 & c == 2))
P[r + 1, c] <- 0.1 else P[r, c] <- P[r, c] + 0.1
if (r > 1 & !((r - 1) == 2 & c == 2))
P[r - 1, c] <- 0.1 else P[r, c] <- P[r, c] + 0.1
}
as.vector(P)
}
Try to go up from state 1 (this is (1,1), the bottom-left corner). Note that the slip to the left would leave the grid, so there is a 0.1 chance of staying in place.
[1] 0.1 0.8 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
col
row 1 2 3 4
3 0.0 0.0 0 0
2 0.8 0.0 0 0
1 0.1 0.1 0 0
Try to go right from (2,1). Since the square to the right, (2,2), is blocked, there is a 0.8 probability of staying in place.
col
row 1 2 3 4
3 0.1 0 0 0
2 0.8 0 0 0
1 0.1 0 0 0
Calculate transitions for each state to each other state. Each row represents a state \(s\) and each column a state \(s'\) so we get a complete definition for \(P_a(s' | s)\). Note that the matrix is stochastic (all rows add up to 1).
Create a matrix for each action.
P_matrices <- lapply(A, FUN = function(a) calc_transition(S, a))
names(P_matrices) <- A
str(P_matrices)
List of 5
$ Up : num [1:12, 1:12] 0.1 0 0 0.1 0 0 0 0 0 0 ...
$ Right: num [1:12, 1:12] 0.1 0.1 0 0 0 0 0 0 0 0 ...
$ Down : num [1:12, 1:12] 0.9 0.8 0 0.1 0 0 0 0 0 0 ...
$ Left : num [1:12, 1:12] 0.9 0.1 0 0.8 0 0 0 0 0 0 ...
$ None : num [1:12, 1:12] 1 0 0 0 0 0 0 0 0 0 ...
Create a function interface for \(P(s' | s, a)\).
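The cell with this interface is not rendered. Since each row of \(P_a\) represents \(s\) and each column \(s'\), a minimal sketch is:
P <- function(sp, s, a) P_matrices[[a]][s, sp]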
\(R(s, a, s')\) defines the reward for the transition from \(s\) to \(s'\) with action \(a\).
For the textbook example we have:
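The cell with the definition is not rendered. A sketch consistent with the note below and with the rewards in the episode traces later on (-0.04 per move, +1 for entering state 12, -1 for entering state 11):
R <- function(s, a, s_prime) {
    # no further reward once the agent is in an absorbing or illegal state
    if (s == 11 || s == 12 || s == 5)
        return(0)
    if (s_prime == 12)
        return(+1)
    if (s_prime == 11)
        return(-1)
    -0.04
}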
Note that once you are in an absorbing state (11 or 12), then the problem is over and there is no more reward!
The solution to an MDP is a policy \(\pi\) which defines which action to take in each state. We represent deterministic policies as a vector of actions. I make up a policy that always goes up and then to the right once the agent hits the top.
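The cell constructing this policy is not rendered; a sketch that reproduces the output below (the name pi_manual is an assumption):
pi_manual <- rep("Up", times = length(S))
pi_manual[c(3, 6, 9)] <- "Right"  # top row (except the terminal state 12) goes right
pi_manual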
[1] "Up" "Up" "Right" "Up" "Up" "Right" "Up" "Up" "Right"
[10] "Up" "Up" "Up"
col
row 1 2 3 4
3 "Right" "Right" "Right" "Up"
2 "Up" "Up" "Up" "Up"
1 "Up" "Up" "Up" "Up"
We can also create a random policy by randomly choosing from the available actions for each state.
create_random_deterministic_policy <- function() structure(sapply(S, FUN = function(s) sample(actions(s),
1L)), names = S)
set.seed(1234)
pi_random <- create_random_deterministic_policy()
pi_random
1 2 3 4 5 6 7 8 9 10
"Left" "Left" "Right" "Right" "None" "Left" "Down" "Up" "Up" "Right"
11 12
"None" "None"
col
row 1 2 3 4
3 "Right" "Left" "Up" "None"
2 "Left" "None" "Up" "None"
1 "Left" "Right" "Down" "Right"
Stochastic policies use probabilities of actions in each state. We use a simple table with probabilities, where each row is a state and the columns are the actions. Here we create a random \(\epsilon\)-soft policy, in which each available action has a probability of at least \(\epsilon/|A(s)|\).
We can make a deterministic policy soft.
make_policy_soft <- function(pi, epsilon = 0.1) {
if (!is.vector(pi))
stop("pi is not a deterministic policy!")
p <- matrix(0, nrow = length(S), ncol = length(A), dimnames = list(S, A))
for (s in S) {
p[s, actions(s)] <- epsilon/length(actions(s))
p[s, pi[s]] <- p[s, pi[s]] + (1 - epsilon)
}
p
}
make_policy_soft(pi_random)
Up Right Down Left None
1 0.025 0.025 0.025 0.925 0
2 0.025 0.025 0.025 0.925 0
3 0.025 0.925 0.025 0.025 0
4 0.025 0.925 0.025 0.025 0
5 0.000 0.000 0.000 0.000 1
6 0.025 0.025 0.025 0.925 0
7 0.025 0.025 0.925 0.025 0
8 0.925 0.025 0.025 0.025 0
9 0.925 0.025 0.025 0.025 0
10 0.025 0.925 0.025 0.025 0
11 0.000 0.000 0.000 0.000 1
12 0.000 0.000 0.000 0.000 1
Or we can create a completely random \(\epsilon\)-soft policy.
create_random_epsilon_soft_policy <- function(epsilon = 0.1) {
# split total randomly into n numbers that add up to total
random_split <- function(n, total) {
if (n == 1)
return(total)
bordersR <- c(sort(runif(n - 1)), 1)
bordersL <- c(0, bordersR[1:(n - 1)])
(bordersR - bordersL) * total
}
p <- matrix(0, nrow = length(S), ncol = length(A), dimnames = list(S, A))
for (s in S) p[s, actions(s)] <- epsilon/length(actions(s)) + random_split(n = length(actions(s)),
1 - epsilon)
p
}
set.seed(1234)
pi_random_epsilon_soft <- create_random_epsilon_soft_policy()
pi_random_epsilon_soft
Up Right Down Left None
1 0.127 0.471 0.037 0.365 0
2 0.586 0.040 0.224 0.150 0
3 0.034 0.226 0.415 0.326 0
4 0.488 0.053 0.159 0.301 0
5 0.000 0.000 0.000 0.000 1
6 0.279 0.034 0.593 0.094 0
7 0.265 0.042 0.521 0.171 0
8 0.193 0.066 0.101 0.640 0
9 0.061 0.132 0.154 0.653 0
10 0.222 0.301 0.281 0.195 0
11 0.000 0.000 0.000 0.000 1
12 0.000 0.000 0.000 0.000 1
The expected utility can be calculated by
\(U^\pi = E\left[\sum_{t=0}^\infty \gamma^t R(s_t, \pi(s_t), s_{t+1})\right]\)
We need to define the discount factor.
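The cell defining it is not rendered. The agent's gamma field and the undiscounted episode rewards later on suggest:
GAMMA <- 1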
We can evaluate the utility of a policy using simulation. For the stochastic transition model, we need to be able to sample the state \(s'\) the system transitions to when using action \(a\) in state \(s\).
sample_transition <- function(s, a) sample(S, size = 1, prob = P_matrices[[a]][s,
])
sample_transition(1, "Up")
[1] 1
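The next output tabulates 100 repeated samples; the call is not shown, but it presumably resembles:
table(replicate(100, sample_transition(1, "Up")))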
1 2 4
9 84 7
We can now simulate the utility for one episode. Note that we use the cutoff max_t in case a policy does not end up in a terminal state before that.
simulate_utility <- function(pi, s0 = 1, max_t = 100) {
s <- s0
U <- 0
t <- 0
while (TRUE) {
## get action from policy (matrix means it is a stochastic policy)
if (!is.matrix(pi))
a <- pi[s] else a <- sample(A, size = 1, prob = pi[s, ])
## sample a transition given the action from the policy
s_prime <- sample_transition(s, a)
##
U <- U + GAMMA^t * R(s, a, s_prime)
s <- s_prime
## reached an absorbing state?
if (s == 11 || s == 12 || s == 5)
break
t <- t + 1
if (t >= max_t)
break
}
U
}
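The definition of simulate_utilities() described at the beginning is not rendered in this document; a minimal sketch consistent with that description:
simulate_utilities <- function(pi, s0 = 1, N = 1000, max_t = 100) replicate(N,
    simulate_utility(pi, s0 = s0, max_t = max_t))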
Simulate N episodes. The expected utility for starting from state \(s_0 = 1\) is estimated as the average utility over the simulated episodes.
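These cells are not rendered. With the simulate_utilities() sketch above, the computation is roughly (utility_manual is an assumed name):
utility_manual <- simulate_utilities(pi_manual, s0 = 1, N = 1000)
mean(utility_manual)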
Compare with the utility of the random policy.
utility_random
-4
1000
The random policy performs very badly and most likely stumbles around for max_t moves at a cost of 0.04 each. The manually created policy is obviously better.
We can use simulation to estimate the expected utility for starting from each state following the policy.
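A sketch of this computation, reusing the simulate_utilities() sketch above with the manually created policy (variable names are assumptions):
U_sim <- sapply(S, FUN = function(s) mean(simulate_utilities(pi_manual, s0 = s, N = 100)))
show_layout(U_sim)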
Q-Learning is an off-policy TD control algorithm to estimate \(Q \approx q_*\). The pseudocode of the Q-learning algorithm is given in AIMA Figure 22.8.
The algorithm uses temporal-difference (TD) learning since it updates \(Q[s, a]\) using the TD error \(r + \gamma \max_{a'} Q[s', a'] - Q[s, a]\).
The algorithm performs off-policy learning since it learns a different policy than the one used to perform actions.
Behavior policy: defined by the exploration function \(f\). It needs to guarantee exploration, that is, every available action needs to have a chance of being tried. A popular choice is an \(\epsilon\)-greedy policy, but any \(\epsilon\)-soft policy that increases the probability of actions with a high \(Q\) value and decreases it with a high visit count \(N\) can be used.
Target policy: this is the learned policy. Q-learning learns a deterministic greedy policy, represented by the \(\max_{a'} Q[s', a']\) term in the update of \(Q\).
The Q-learning agent parameters and helper functions.
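The cell defining them is not rendered. In particular, the helper argmax() used in run() below needs to return the index of the largest value; a sketch matching the "breaking ties randomly" comment:
argmax <- function(x) {
    # index of the maximum value, breaking ties randomly
    best <- which(x == max(x, na.rm = TRUE))
    if (length(best) > 1L)
        best <- sample(best, size = 1L)
    best
}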
We use a class to store persistent information: the agent state and the \(Q\) and \(N\) tables. Reference classes use <<- to assign values to member variables (fields).
QL_Agent <- setRefClass(
"QL_Agent",
fields = c("S", "A", "Q", "N", "s", "a", "gamma"),
methods = list(
initialize = function(S, A, gamma = 1) {
S <<- S
A <<- A
gamma <<- gamma
reset()
},
next_episode = function() {
"Start a new episode."
s <<- NULL
a <<- 'None'
},
reset = function() {
"Reset the agent's state and erase all persistent information."
# set q-values to 0 except use NA for unavailable actions
Q <<-
matrix(NA_real_,
nrow = length(S),
ncol = length(A),
dimnames = list(S, A))
for (s in S) Q[s, actions(s)] <<- 0
N <<-
matrix(0L,
nrow = length(S),
ncol = length(A),
dimnames = list(S, A))
next_episode()
}
)
)
The agent function is called run with the percepts, which include the new state and the associated reward.
QL_Agent$methods(
# Learning rate:
# Fixed learning rate
#alpha = function(n, rate = 0.1) return(rate)
# A decreasing learning rate converges faster. n is the number of times
# the state/action combination was visited. The constants depend on the problem.
alpha = function(n) {
return (10 / (9 + n))
},
# Exploration function (behavior policy): We use a simple
# epsilon-greedy policy here.
maxf = function(s_prime, a_best, epsilon = .8) {
if (runif(1) < epsilon)
action <- a_best
else
action <- sample(actions(s_prime), size = 1)
return(action)
},
# Q-learning algorithm
run = function(s_prime, r) {
"Agent function. The percepts are the current state and the last reward."
# how often did the agent try this s/a combination
N[s, a] <<- N[s, a] + 1L
# Greedy target policy: find best available action for s' (breaking ties randomly).
available_actions <- actions(s_prime)
a_best <- available_actions[argmax(Q[s_prime,available_actions])]
# No update for start of new episode
if (!is.null(s)) {
Q[s, a] <<-
Q[s, a] + alpha(N[s, a]) * (r + gamma * Q[s_prime, a_best] - Q[s, a])
}
if (is_terminal(s_prime))
a <<- 'None'
else
a <<- maxf(s_prime, a_best)
s <<- s_prime
return(a)
}
)
Create an agent.
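The construction call is not rendered; based on the fields printed below, it was presumably along these lines:
agent <- QL_Agent$new(S = S, A = A, gamma = GAMMA)
agent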
Reference class object of class "QL_Agent"
Field "S":
[1] 1 2 3 4 5 6 7 8 9 10 11 12
Field "A":
[1] "Up" "Right" "Down" "Left" "None"
Field "Q":
Up Right Down Left None
1 0 0 0 0 NA
2 0 0 0 0 NA
3 0 0 0 0 NA
4 0 0 0 0 NA
5 NA NA NA NA 0
6 0 0 0 0 NA
7 0 0 0 0 NA
8 0 0 0 0 NA
9 0 0 0 0 NA
10 0 0 0 0 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
Field "N":
Up Right Down Left None
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0
11 0 0 0 0 0
12 0 0 0 0 0
Field "s":
NULL
Field "a":
[1] "None"
Field "gamma":
[1] 1
This is an environment that can simulate \(n\) episodes and return the utility the agent received in each episode.
simulate_environment <- function(n = 1000, agent, reset_agent = TRUE, verbose = FALSE) {
reward_history <- numeric(n)
state <- start
if (reset_agent)
agent$reset()
agent$next_episode()
i <- 1L # episode counter
j <- 1L # step in episode counter
reward <- 0
reward_episode <- 0
while (TRUE) {
reward_episode <- reward_episode + GAMMA^j * reward
# call agent function with percepts (current state and reward)
action <- agent$run(state, reward)
if (verbose) {
if (j == 1L)
cat("\n*** Episode", i, "***\n")
cat("Step", j, paste0("(s = ", state, ", r = ", reward, ") -> ", action),
"\n")
}
j <- j + 1L
if (is_terminal(state)) {
reward_history[i] <- reward_episode
if (verbose) {
cat("Episode reward: ", reward_episode, "\n")
cat("Q:\n")
print(agent$Q)
cat("U:\n")
print(show_layout(apply(agent$Q, MARGIN = 1, max, na.rm = TRUE)))
}
j <- 1L
# finished n episodes?
i <- i + 1L
if (i > n)
break
state <- start
reward <- 0
reward_episode <- 0
agent$next_episode()
} else if (action %in% actions(state)) {
state_prime <- sample_transition(state, action)
reward <- R(state, action, state_prime)
state <- state_prime
} else {
stop("The agent chose an illegal action!")
# reward <- -1e9
}
}
reward_history
}
Run a few episodes.
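The call itself is not rendered; given the ten verbose episodes below and the reward history printed afterwards, it was presumably similar to:
reward_history <- simulate_environment(n = 10, agent, verbose = TRUE)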
*** Episode 1 ***
Step 1 (s = 1, r = 0) -> Up
Step 2 (s = 4, r = -0.04) -> Up
Step 3 (s = 4, r = -0.04) -> Right
Step 4 (s = 7, r = -0.04) -> Right
Step 5 (s = 10, r = -0.04) -> Right
Step 6 (s = 10, r = -0.04) -> Left
Step 7 (s = 11, r = -1) -> None
Episode reward: -1.2
Q:
Up Right Down Left None
1 -0.04 0.00 0 0 NA
2 0.00 0.00 0 0 NA
3 0.00 0.00 0 0 NA
4 -0.04 -0.04 0 0 NA
5 NA NA NA NA 0
6 0.00 0.00 0 0 NA
7 0.00 -0.04 0 0 NA
8 0.00 0.00 0 0 NA
9 0.00 0.00 0 0 NA
10 0.00 -0.04 0 -1 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
U:
col
row 1 2 3 4
3 0 0 0 0
2 0 0 0 0
1 0 0 0 0
*** Episode 2 ***
Step 1 (s = 1, r = 0) -> Right
Step 2 (s = 4, r = -0.04) -> Right
Step 3 (s = 7, r = -0.04) -> Down
Step 4 (s = 7, r = -0.04) -> Up
Step 5 (s = 8, r = -0.04) -> Right
Step 6 (s = 11, r = -1) -> None
Episode reward: -1.2
Q:
Up Right Down Left None
1 -0.04 -0.04 0.00 0 NA
2 0.00 0.00 0.00 0 NA
3 0.00 0.00 0.00 0 NA
4 -0.04 -0.04 0.00 0 NA
5 NA NA NA NA 0
6 0.00 0.00 0.00 0 NA
7 -0.04 -0.04 -0.04 0 NA
8 0.00 -1.00 0.00 0 NA
9 0.00 0.00 0.00 0 NA
10 0.00 -0.04 0.00 -1 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
U:
col
row 1 2 3 4
3 0 0 0 0
2 0 0 0 0
1 0 0 0 0
*** Episode 3 ***
Step 1 (s = 1, r = 0) -> Right
Step 2 (s = 1, r = -0.04) -> Down
Step 3 (s = 1, r = -0.04) -> Down
Step 4 (s = 1, r = -0.04) -> Left
Step 5 (s = 1, r = -0.04) -> Left
Step 6 (s = 1, r = -0.04) -> Up
Step 7 (s = 2, r = -0.04) -> Up
Step 8 (s = 2, r = -0.04) -> Right
Step 9 (s = 2, r = -0.04) -> Down
Step 10 (s = 1, r = -0.04) -> Up
Step 11 (s = 2, r = -0.04) -> Left
Step 12 (s = 2, r = -0.04) -> Left
Step 13 (s = 3, r = -0.04) -> Right
Step 14 (s = 6, r = -0.04) -> Up
Step 15 (s = 9, r = -0.04) -> Down
Step 16 (s = 8, r = -0.04) -> Up
Step 17 (s = 11, r = -1) -> None
Episode reward: -1.6
Q:
Up Right Down Left None
1 -0.04 -0.04 -0.04 -0.076 NA
2 -0.04 -0.04 -0.08 -0.040 NA
3 0.00 -0.04 0.00 0.000 NA
4 -0.04 -0.04 0.00 0.000 NA
5 NA NA NA NA 0
6 -0.04 0.00 0.00 0.000 NA
7 -0.04 -0.04 -0.04 0.000 NA
8 -1.00 -1.00 0.00 0.000 NA
9 0.00 0.00 -0.04 0.000 NA
10 0.00 -0.04 0.00 -1.000 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
U:
col
row 1 2 3 4
3 0.00 0 0 0
2 -0.04 0 0 0
1 -0.04 0 0 0
*** Episode 4 ***
Step 1 (s = 1, r = 0) -> Down
Step 2 (s = 1, r = -0.04) -> Right
Step 3 (s = 4, r = -0.04) -> Down
Step 4 (s = 4, r = -0.04) -> Left
Step 5 (s = 1, r = -0.04) -> Right
Step 6 (s = 4, r = -0.04) -> Down
Step 7 (s = 1, r = -0.04) -> Up
Step 8 (s = 2, r = -0.04) -> Left
Step 9 (s = 3, r = -0.04) -> Left
Step 10 (s = 2, r = -0.04) -> Right
Step 11 (s = 2, r = -0.04) -> Up
Step 12 (s = 3, r = -0.04) -> Up
Step 13 (s = 3, r = -0.04) -> Down
Step 14 (s = 2, r = -0.04) -> Left
Step 15 (s = 2, r = -0.04) -> Left
Step 16 (s = 2, r = -0.04) -> Up
Step 17 (s = 3, r = -0.04) -> Right
Step 18 (s = 2, r = -0.04) -> Up
Step 19 (s = 3, r = -0.04) -> Up
Step 20 (s = 3, r = -0.04) -> Left
Step 21 (s = 3, r = -0.04) -> Down
Step 22 (s = 6, r = -0.04) -> Left
Step 23 (s = 6, r = -0.04) -> Down
Step 24 (s = 3, r = -0.04) -> Down
Step 25 (s = 2, r = -0.04) -> Down
Step 26 (s = 2, r = -0.04) -> Right
Step 27 (s = 2, r = -0.04) -> Left
Step 28 (s = 2, r = -0.04) -> Left
Step 29 (s = 2, r = -0.04) -> Up
Step 30 (s = 3, r = -0.04) -> Up
Step 31 (s = 6, r = -0.04) -> Right
Step 32 (s = 9, r = -0.04) -> Up
Step 33 (s = 6, r = -0.04) -> Up
Step 34 (s = 6, r = -0.04) -> Up
Step 35 (s = 3, r = -0.04) -> Up
Step 36 (s = 3, r = -0.04) -> Up
Step 37 (s = 3, r = -0.04) -> Up
Step 38 (s = 3, r = -0.04) -> Down
Step 39 (s = 6, r = -0.04) -> Left
Step 40 (s = 3, r = -0.04) -> Down
Step 41 (s = 6, r = -0.04) -> Right
Step 42 (s = 9, r = -0.04) -> Right
Step 43 (s = 12, r = 1) -> None
Episode reward: -0.64
Q:
Up Right Down Left None
1 -0.071 -0.071 -0.073 -0.076 NA
2 -0.106 -0.110 -0.113 -0.113 NA
3 -0.131 -0.107 -0.082 -0.113 NA
4 -0.040 -0.040 -0.076 -0.080 NA
5 NA NA NA NA 0
6 -0.084 -0.040 -0.084 -0.118 NA
7 -0.040 -0.040 -0.040 0.000 NA
8 -1.000 -1.000 0.000 0.000 NA
9 -0.080 1.000 -0.040 0.000 NA
10 0.000 -0.040 0.000 -1.000 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
U:
col
row 1 2 3 4
3 -0.082 -0.04 1 0
2 -0.106 0.00 0 0
1 -0.071 -0.04 0 0
*** Episode 5 ***
Step 1 (s = 1, r = 0) -> Down
Step 2 (s = 1, r = -0.04) -> Right
Step 3 (s = 4, r = -0.04) -> Right
Step 4 (s = 7, r = -0.04) -> Left
Step 5 (s = 4, r = -0.04) -> Right
Step 6 (s = 7, r = -0.04) -> Down
Step 7 (s = 7, r = -0.04) -> Up
Step 8 (s = 8, r = -0.04) -> Left
Step 9 (s = 8, r = -0.04) -> Left
Step 10 (s = 9, r = -0.04) -> Right
Step 11 (s = 12, r = 1) -> None
Episode reward: 0.64
Q:
Up Right Down Left None
1 -0.071 -0.077 -0.102 -0.076 NA
2 -0.106 -0.110 -0.113 -0.113 NA
3 -0.131 -0.107 -0.082 -0.113 NA
4 -0.040 -0.071 -0.076 -0.080 NA
5 NA NA NA NA 0
6 -0.084 -0.040 -0.084 -0.118 NA
7 -0.040 -0.040 -0.076 -0.080 NA
8 -1.000 -1.000 0.000 0.869 NA
9 -0.080 1.000 -0.040 0.000 NA
10 0.000 -0.040 0.000 -1.000 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
U:
col
row 1 2 3 4
3 -0.082 -0.04 1.00 0
2 -0.106 0.00 0.87 0
1 -0.071 -0.04 -0.04 0
*** Episode 6 ***
Step 1 (s = 1, r = 0) -> Up
Step 2 (s = 2, r = -0.04) -> Right
Step 3 (s = 2, r = -0.04) -> Up
Step 4 (s = 3, r = -0.04) -> Down
Step 5 (s = 2, r = -0.04) -> Right
Step 6 (s = 2, r = -0.04) -> Left
Step 7 (s = 2, r = -0.04) -> Left
Step 8 (s = 2, r = -0.04) -> Down
Step 9 (s = 2, r = -0.04) -> Down
Step 10 (s = 1, r = -0.04) -> Left
Step 11 (s = 1, r = -0.04) -> Left
Step 12 (s = 2, r = -0.04) -> Up
Step 13 (s = 3, r = -0.04) -> Down
Step 14 (s = 6, r = -0.04) -> Right
Step 15 (s = 6, r = -0.04) -> Right
Step 16 (s = 9, r = -0.04) -> Right
Step 17 (s = 12, r = 1) -> None
Episode reward: 0.4
Q:
Up Right Down Left None
1 -0.124 -0.077 -0.102 -0.15 NA
2 -0.135 -0.149 -0.123 -0.15 NA
3 -0.131 -0.107 -0.098 -0.11 NA
4 -0.040 -0.071 -0.076 -0.08 NA
5 NA NA NA NA 0
6 -0.084 0.722 -0.084 -0.12 NA
7 -0.040 -0.040 -0.076 -0.08 NA
8 -1.000 -1.000 0.000 0.87 NA
9 -0.080 1.000 -0.040 0.00 NA
10 0.000 -0.040 0.000 -1.00 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
U:
col
row 1 2 3 4
3 -0.098 0.72 1.00 0
2 -0.123 0.00 0.87 0
1 -0.077 -0.04 -0.04 0
*** Episode 7 ***
Step 1 (s = 1, r = 0) -> Right
Step 2 (s = 4, r = -0.04) -> Up
Step 3 (s = 4, r = -0.04) -> Up
Step 4 (s = 1, r = -0.04) -> Right
Step 5 (s = 4, r = -0.04) -> Right
Step 6 (s = 7, r = -0.04) -> Right
Step 7 (s = 10, r = -0.04) -> Down
Step 8 (s = 10, r = -0.04) -> Down
Step 9 (s = 10, r = -0.04) -> Up
Step 10 (s = 11, r = -1) -> None
Episode reward: -1.3
Q:
Up Right Down Left None
1 -0.124 -0.099 -0.102 -0.15 NA
2 -0.135 -0.149 -0.123 -0.15 NA
3 -0.131 -0.107 -0.098 -0.11 NA
4 -0.112 -0.077 -0.076 -0.08 NA
5 NA NA NA NA 0
6 -0.084 0.722 -0.084 -0.12 NA
7 -0.040 -0.040 -0.076 -0.08 NA
8 -1.000 -1.000 0.000 0.87 NA
9 -0.080 1.000 -0.040 0.00 NA
10 -1.000 -0.040 -0.040 -1.00 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
U:
col
row 1 2 3 4
3 -0.098 0.722 1.00 0.00
2 -0.123 0.000 0.87 0.00
1 -0.099 -0.076 -0.04 -0.04
*** Episode 8 ***
Step 1 (s = 1, r = 0) -> Right
Step 2 (s = 4, r = -0.04) -> Down
Step 3 (s = 4, r = -0.04) -> Down
Step 4 (s = 4, r = -0.04) -> Left
Step 5 (s = 1, r = -0.04) -> Down
Step 6 (s = 4, r = -0.04) -> Right
Step 7 (s = 7, r = -0.04) -> Up
Step 8 (s = 8, r = -0.04) -> Left
Step 9 (s = 8, r = -0.04) -> Left
Step 10 (s = 8, r = -0.04) -> Left
Step 11 (s = 8, r = -0.04) -> Left
Step 12 (s = 8, r = -0.04) -> Left
Step 13 (s = 8, r = -0.04) -> Up
Step 14 (s = 9, r = -0.04) -> Right
Step 15 (s = 12, r = 1) -> None
Episode reward: 0.48
Q:
Up Right Down Left None
1 -0.124 -0.109 -0.113 -0.15 NA
2 -0.135 -0.149 -0.123 -0.15 NA
3 -0.131 -0.107 -0.098 -0.11 NA
4 -0.112 -0.079 -0.116 -0.14 NA
5 NA NA NA NA 0
6 -0.084 0.722 -0.084 -0.12 NA
7 0.684 -0.040 -0.076 -0.08 NA
8 0.782 -1.000 0.000 0.72 NA
9 -0.080 1.000 -0.040 0.00 NA
10 -1.000 -0.040 -0.040 -1.00 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
U:
col
row 1 2 3 4
3 -0.098 0.722 1.00 0.00
2 -0.123 0.000 0.78 0.00
1 -0.109 -0.079 0.68 -0.04
*** Episode 9 ***
Step 1 (s = 1, r = 0) -> Up
Step 2 (s = 2, r = -0.04) -> Down
Step 3 (s = 1, r = -0.04) -> Right
Step 4 (s = 4, r = -0.04) -> Left
Step 5 (s = 1, r = -0.04) -> Down
Step 6 (s = 1, r = -0.04) -> Down
Step 7 (s = 4, r = -0.04) -> Right
Step 8 (s = 7, r = -0.04) -> Up
Step 9 (s = 8, r = -0.04) -> Up
Step 10 (s = 9, r = -0.04) -> Right
Step 11 (s = 9, r = -0.04) -> Right
Step 12 (s = 8, r = -0.04) -> Up
Step 13 (s = 9, r = -0.04) -> Right
Step 14 (s = 12, r = 1) -> None
Episode reward: 0.52
Q:
Up Right Down Left None
1 -0.150 -0.11 -0.127 -0.15 NA
2 -0.135 -0.15 -0.142 -0.15 NA
3 -0.131 -0.11 -0.098 -0.11 NA
4 -0.112 0.37 -0.116 -0.15 NA
5 NA NA NA NA 0
6 -0.084 0.72 -0.084 -0.12 NA
7 0.729 -0.04 -0.076 -0.08 NA
8 0.890 -1.00 0.000 0.72 NA
9 -0.080 0.97 -0.040 0.00 NA
10 -1.000 -0.04 -0.040 -1.00 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
U:
col
row 1 2 3 4
3 -0.098 0.72 0.97 0.00
2 -0.135 0.00 0.89 0.00
1 -0.115 0.37 0.73 -0.04
*** Episode 10 ***
Step 1 (s = 1, r = 0) -> Right
Step 2 (s = 4, r = -0.04) -> Right
Step 3 (s = 4, r = -0.04) -> Right
Step 4 (s = 7, r = -0.04) -> Up
Step 5 (s = 8, r = -0.04) -> Up
Step 6 (s = 9, r = -0.04) -> Right
Step 7 (s = 9, r = -0.04) -> Up
Step 8 (s = 9, r = -0.04) -> Right
Step 9 (s = 12, r = 1) -> None
Episode reward: 0.72
Q:
Up Right Down Left None
1 -0.150 0.12 -0.127 -0.15 NA
2 -0.135 -0.15 -0.142 -0.15 NA
3 -0.131 -0.11 -0.098 -0.11 NA
4 -0.112 0.54 -0.116 -0.15 NA
5 NA NA NA NA 0
6 -0.084 0.72 -0.084 -0.12 NA
7 0.815 -0.04 -0.076 -0.08 NA
8 0.918 -1.00 0.000 0.72 NA
9 0.816 0.98 -0.040 0.00 NA
10 -1.000 -0.04 -0.040 -1.00 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
U:
col
row 1 2 3 4
3 -0.098 0.72 0.98 0.00
2 -0.135 0.00 0.92 0.00
1 0.121 0.54 0.81 -0.04
[1] -1.20 -1.16 -1.60 -0.64 0.64 0.40 -1.32 0.48 0.52 0.72
Note that the agent initially has no idea about the optimal policy and runs around randomly. After it finds the goal by chance for the first time, the reward starts to spread back by one square in every episode in which the agent reaches the goal. The agent also starts to find the goal faster.
Calculate the value function \(U\) from the learned Q-function as the largest Q value of any action in a state.
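The cell is not rendered; the display below is produced by the same expression used for the verbose output above:
show_layout(apply(agent$Q, MARGIN = 1, max, na.rm = TRUE))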
col
row 1 2 3 4
3 -0.098 0.72 0.98 0.00
2 -0.135 0.00 0.92 0.00
1 0.121 0.54 0.81 -0.04
Note that not all states are visited the same number of times, so some estimates may be worse than others. Since Q-learning uses an \(\epsilon\)-greedy behavior policy, the best actions are chosen most often, leading to better estimates along the optimal policy.
Extract the learned policy from \(Q\) as the action with the highest Q-value for each state.
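The extraction cell is not rendered; a sketch that reproduces the output below (the name pi_learned is an assumption):
pi_learned <- A[apply(agent$Q, MARGIN = 1, which.max)]
show_layout(pi_learned)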
col
row 1 2 3 4
3 "Down" "Right" "Right" "None"
2 "Up" "None" "Up" "None"
1 "Right" "Right" "Up" "Right"
Even after simulating just 10 episodes, we get a reasonable policy that will lead the agent mostly to the goal state. We can estimate the expected utility of the policy using simulation.
Simulating more episodes results in better estimates (also for the rarely visited states).
reward_history <- simulate_environment(n = 10000, agent, verbose = FALSE, reset_agent = TRUE)
plot(cumsum(reward_history[1:500]), type = "l", ylab = "Total reward collected",
xlab = "Episode")
The cumulative reward approaches a straight line after just a few episodes, meaning that a reasonable policy was found early on. Below are the state layout (shown again for reference) and the Q-table learned over the longer run.
col
row 1 2 3 4
3 3 6 9 12
2 2 5 8 11
1 1 4 7 10
Up Right Down Left None
1 0.75 0.67 0.70 0.715 NA
2 0.80 0.76 0.72 0.762 NA
3 0.82 0.85 0.78 0.810 NA
4 0.66 0.62 0.66 0.699 NA
5 NA NA NA NA 0
6 0.87 0.91 0.87 0.829 NA
7 0.62 0.38 0.58 0.657 NA
8 0.66 -0.57 0.39 0.653 NA
9 0.92 0.96 0.71 0.860 NA
10 -0.40 0.18 0.24 0.078 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
Learned state-value function.
col
row 1 2 3 4
3 0.85 0.91 0.96 0.00
2 0.80 0.00 0.66 0.00
1 0.75 0.70 0.66 0.24
Extract the greedy policy.
col
row 1 2 3 4
3 "Right" "Right" "Right" "None"
2 "Up" "None" "Up" "None"
1 "Up" "Left" "Left" "Down"
We can estimate the expected utility of the policy using simulation.
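Reusing the simulate_utilities() and pi_learned sketches from above (both assumed names), this could look like:
utility_learned <- simulate_utilities(pi_learned, s0 = 1, N = 1000)
mean(utility_learned)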