# I use capitalized variables as global constants
COLS <- 4
ROWS <- 3
S = seq_len(ROWS * COLS)
LAYOUT <- matrix(S, nrow = ROWS, ncol = COLS)
LAYOUT
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
This code is provided under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License.
Reinforcement Learning: An Introduction (RL) by Sutton and Barto (2020) introduce several temporal-difference learning control algorithms in Chapter 6: Temporal-Difference Learning. Here we implement the on-policy TD control algorithm Sarsa.
We will implement the key concepts using R for the AIMA 3x4 grid world example. The used environment is an MDP, but instead of trying to solve the MDP directly and estimating the value function estimates \(U(s)\), we will try to learn a policy from interactions with the environment. The MDP’s transition model will only be used to simulate the response of the environment to actions by the agent.
The code in this notebook defines explicit functions matching the textbook definitions and is for demonstration purposes only. Efficient implementations for larger problems use fast vector multiplications instead.
The used example is the simple 4x3 grid world described in AIMA Figure 12.1 and used again in 22.1 as:
In the following, we implement states, actions, the reward function,and the transition model. We also show how to represent a policy and how to estimate the expected utility using simulation.
The MDP will be defined as the following global variables/functions:
S
: set of states.A
: set of actions.actions(s)
: returns the available actions for state \(s\).P(sp, s, a)
: a function returning the transition probability \(P(s' | s, a)\).R(s, a, s_prime)
: reward function for the transition from \(s\) to \(s`\) with action a
.Policies are represented as:
Other useful functions:
sample_transition(s, a)
: returns a \(s'\) sampled using the transition model.simulate_utilities(pi, s0 = 1, N = 1000, max_t = 100)
: Estimates the utility of following policy \(\pi\), starting \(N\) episodes in state \(s_0\). The maximal episode length is max_t
to ensure that the function finishes also for policies that do not lead to a terminal state.We define the atomic state space \(S\) by labeling the states \(1, 2, ...\). We convert coordinates (rows, columns)
to the state label.
# I use capitalized variables as global constants
COLS <- 4
ROWS <- 3
S = seq_len(ROWS * COLS)
LAYOUT <- matrix(S, nrow = ROWS, ncol = COLS)
LAYOUT
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
Note that the rows are displayed upside-down compared to the text book, so we use a function to display them in reverse order.
show_layout <- function(x) {
x <- matrix(x, ncol = COLS, nrow = ROWS, dimnames = list(row = seq_len(ROWS),
col = seq_len(COLS)))
x[rev(seq_len(ROWS)), ]
}
show_layout(LAYOUT)
col
row 1 2 3 4
3 3 6 9 12
2 2 5 8 11
1 1 4 7 10
Convert between coordinates and state labels.
rc_to_s <- function(rc) LAYOUT[rbind(rc)]
s_to_rc <- function(s) drop(which(LAYOUT == s, arr.ind = TRUE, useNames = FALSE))
rc_to_s(c(3, 4))
[1] 12
[1] 3 4
Start state ::: {.cell}
:::
Define terminal states.
The complete set of actions is \(A = \{\mathrm{'Up', 'Right', 'Down', 'Left', 'None'}\}\). Not all actions are available in every state. Also, action None
is added as the only possible action in an absorbing state.
A = c("Up", "Right", "Down", "Left", "None")
actions <- function(s) {
# absorbing states
if (s == 11 || s == 12)
return("None")
# illegal state
if (s == 5)
return("None")
c("Up", "Right", "Down", "Left")
}
lapply(S, actions)
[[1]]
[1] "Up" "Right" "Down" "Left"
[[2]]
[1] "Up" "Right" "Down" "Left"
[[3]]
[1] "Up" "Right" "Down" "Left"
[[4]]
[1] "Up" "Right" "Down" "Left"
[[5]]
[1] "None"
[[6]]
[1] "Up" "Right" "Down" "Left"
[[7]]
[1] "Up" "Right" "Down" "Left"
[[8]]
[1] "Up" "Right" "Down" "Left"
[[9]]
[1] "Up" "Right" "Down" "Left"
[[10]]
[1] "Up" "Right" "Down" "Left"
[[11]]
[1] "None"
[[12]]
[1] "None"
\(P(s' | s, a)\) is the probability of going from state \(s\) to \(s'\) by when taking action \(a\). We will create a matrix \(P_a(s' | s)\) for each action.
calc_transition <- function(s, action) {
action <- match.arg(action, choices = A)
if (length(s) > 1)
return(t(sapply(s, calc_transition, action = action)))
# deal with absorbing and illegal state
if (s == 11 || s == 12 || s == 5 || action == "None") {
P <- rep(0, length(S))
P[s] <- 1
return(P)
}
action_to_delta <- list(Up = c(+1, 0), Down = c(-1, 0), Right = c(0, +1), Left = c(0,
-1))
delta <- action_to_delta[[action]]
dr <- delta[1]
dc <- delta[2]
rc <- s_to_rc(s)
r <- rc[1]
c <- rc[2]
if (dr != 0 && dc != 0)
stop("You can only go up/down or right/left!")
P <- matrix(0, nrow = ROWS, ncol = COLS)
# UP/DOWN
if (dr != 0) {
new_r <- r + dr
if (new_r > ROWS || new_r < 1)
new_r <- r
## can't got to (2, 2)
if (new_r == 2 && c == 2)
new_r <- r
P[new_r, c] <- 0.8
if (c < COLS & !(r == 2 & (c + 1) == 2))
P[r, c + 1] <- 0.1 else P[r, c] <- P[r, c] + 0.1
if (c > 1 & !(r == 2 & (c - 1) == 2))
P[r, c - 1] <- 0.1 else P[r, c] <- P[r, c] + 0.1
}
# RIGHT/LEFT
if (dc != 0) {
new_c <- c + dc
if (new_c > COLS || new_c < 1)
new_c <- c
## can't got to (2, 2)
if (r == 2 && new_c == 2)
new_c <- c
P[r, new_c] <- 0.8
if (r < ROWS & !((r + 1) == 2 & c == 2))
P[r + 1, c] <- 0.1 else P[r, c] <- P[r, c] + 0.1
if (r > 1 & !((r - 1) == 2 & c == 2))
P[r - 1, c] <- 0.1 else P[r, c] <- P[r, c] + 0.1
}
as.vector(P)
}
Try to go up from state 1 (this is (1,1), the bottom left corner). Note: we cannot go left so there is a .1 chance to stay in place. ::: {.cell}
[1] 0.1 0.8 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
col
row 1 2 3 4
3 0.0 0.0 0 0
2 0.8 0.0 0 0
1 0.1 0.1 0 0
:::
Try to go right from (2,1). Since right is blocked, there is a .8 probability of staying in place. ::: {.cell}
col
row 1 2 3 4
3 0.1 0 0 0
2 0.8 0 0 0
1 0.1 0 0 0
:::
Calculate transitions for each state to each other state. Each row represents a state \(s\) and each column a state \(s'\) so we get a complete definition for \(P_a(s' | s)\). Note that the matrix is stochastic (all rows add up to 1).
Create a matrix for each action.
P_matrices <- lapply(A, FUN = function(a) calc_transition(S, a))
names(P_matrices) <- A
str(P_matrices)
List of 5
$ Up : num [1:12, 1:12] 0.1 0 0 0.1 0 0 0 0 0 0 ...
$ Right: num [1:12, 1:12] 0.1 0.1 0 0 0 0 0 0 0 0 ...
$ Down : num [1:12, 1:12] 0.9 0.8 0 0.1 0 0 0 0 0 0 ...
$ Left : num [1:12, 1:12] 0.9 0.1 0 0.8 0 0 0 0 0 0 ...
$ None : num [1:12, 1:12] 1 0 0 0 0 0 0 0 0 0 ...
Create a function interface for \(P(s' | s, a)\).
\(R(s, a, s')\) define the reward for the transition from \(s\) to \(s'\) with action \(a\).
For the textbook example we have:
Note that once you are in an absorbing state (11 or 12), then the problem is over and there is no more reward!
The solution to an MDP is a policy \(\pi\) which defines which action to take in each state. We represent deterministic policies as a vector of actions. I make up a policy that always goes up and then to the right once the agent hits the top.
[1] "Up" "Up" "Right" "Up" "Up" "Right" "Up" "Up" "Right"
[10] "Up" "Up" "Up"
col
row 1 2 3 4
3 "Right" "Right" "Right" "Up"
2 "Up" "Up" "Up" "Up"
1 "Up" "Up" "Up" "Up"
We can also create a random policy by randomly choosing from the available actions for each state.
create_random_deterministic_policy <- function() structure(sapply(S, FUN = function(s) sample(actions(s),
1L)), names = S)
set.seed(1234)
pi_random <- create_random_deterministic_policy()
pi_random
1 2 3 4 5 6 7 8 9 10
"Left" "Left" "Right" "Right" "None" "Left" "Down" "Up" "Up" "Right"
11 12
"None" "None"
col
row 1 2 3 4
3 "Right" "Left" "Up" "None"
2 "Left" "None" "Up" "None"
1 "Left" "Right" "Down" "Right"
Stochastic policies use probabilities of actions in each state. We use as simple table with probabilities where each row is a state and the columns are the actions. Here we create a random \(\epsilon\)-soft policy. Each available has at least a probability of \(\epsilon\).
We can make a deterministic policy soft. ::: {.cell}
make_policy_soft <- function(pi, epsilon = 0.1) {
if (!is.vector(pi))
stop("pi is not a deterministic policy!")
p <- matrix(0, nrow = length(S), ncol = length(A), dimnames = list(S, A))
for (s in S) {
p[s, actions(s)] <- epsilon/length(actions(s))
p[s, pi[s]] <- p[s, pi[s]] + (1 - epsilon)
}
p
}
make_policy_soft(pi_random)
Up Right Down Left None
1 0.025 0.025 0.025 0.925 0
2 0.025 0.025 0.025 0.925 0
3 0.025 0.925 0.025 0.025 0
4 0.025 0.925 0.025 0.025 0
5 0.000 0.000 0.000 0.000 1
6 0.025 0.025 0.025 0.925 0
7 0.025 0.025 0.925 0.025 0
8 0.925 0.025 0.025 0.025 0
9 0.925 0.025 0.025 0.025 0
10 0.025 0.925 0.025 0.025 0
11 0.000 0.000 0.000 0.000 1
12 0.000 0.000 0.000 0.000 1
:::
Or we can create a completely random soft policy
create_random_epsilon_soft_policy <- function(epsilon = 0.1) {
# split total randomly into n numbers that add up to total
random_split <- function(n, total) {
if (n == 1)
return(total)
bordersR <- c(sort(runif(n - 1)), 1)
bordersL <- c(0, bordersR[1:(n - 1)])
(bordersR - bordersL) * total
}
p <- matrix(0, nrow = length(S), ncol = length(A), dimnames = list(S, A))
for (s in S) p[s, actions(s)] <- epsilon/length(actions(s)) + random_split(n = length(actions(s)),
1 - epsilon)
p
}
set.seed(1234)
pi_random_epsilon_soft <- create_random_epsilon_soft_policy()
pi_random_epsilon_soft
Up Right Down Left None
1 0.127 0.471 0.037 0.365 0
2 0.586 0.040 0.224 0.150 0
3 0.034 0.226 0.415 0.326 0
4 0.488 0.053 0.159 0.301 0
5 0.000 0.000 0.000 0.000 1
6 0.279 0.034 0.593 0.094 0
7 0.265 0.042 0.521 0.171 0
8 0.193 0.066 0.101 0.640 0
9 0.061 0.132 0.154 0.653 0
10 0.222 0.301 0.281 0.195 0
11 0.000 0.000 0.000 0.000 1
12 0.000 0.000 0.000 0.000 1
The expected utility can be calculated by
\(U^\pi = E\left[\sum_{t=0}^\infty \gamma^t R(s_t, \pi(s_t), s_{t+1})\right]\)
We need to define the discount factor.
We can evaluate the utility of a policy using simulation. For the stochastic transition model, we need to be able to sample the state \(s'\) the system transitions to when using action \(a\) in state \(s\).
sample_transition <- function(s, a) sample(S, size = 1, prob = P_matrices[[a]][s,
])
sample_transition(1, "Up")
[1] 1
1 2 4
9 84 7
We can now simulate the utility for one episode. Note that we use the cutoff max_t
in case a policy does not end up in a terminal state before that.
simulate_utility <- function(pi, s0 = 1, max_t = 100) {
s <- s0
U <- 0
t <- 0
while (TRUE) {
## get action from policy (matrix means it is a stochastic policy)
if (!is.matrix(pi))
a <- pi[s] else a <- sample(A, size = 1, prob = pi[s, ])
## sample a transition given the action from the policy
s_prime <- sample_transition(s, a)
##
U <- U + GAMMA^t * R(s, a, s_prime)
s <- s_prime
## reached an absorbing state?
if (s == 11 || s == 12 || s == 5)
break
t <- t + 1
if (t >= max_t)
break
}
U
}
Simulate the N
episodes.
The expected utility for starting from state \(s_0 = 1\) is.
Compare with the utility of the random policy. ::: {.cell}
utility_random
-4
1000
:::
The random policy performs really bad and most likely always stumbles around for max_t
moves at a cost of .04 each. The manually created policy is obviously better.
We can use simulation to estimate the expected utility for starting from each state following the policy.
Here is the pseudo code for Sarsa from the RL book, Chapter 6.4:
The algorithm uses a temporal-difference (TD) learning since it updates using the TD error given by \(Q(S',A') - Q(S, A)\).
The algorithm performs on-policy learning since it uses only a single policy (e.g., \(\epsilon\)-greedy) as the behavior and target policy.
Next, we implement the greedy and \(\epsilon\)-greedy policy action choice given \(Q\). The greedy policy is deterministic and always chooses the best action with the highest \(Q\)-value for the current state. The \(\epsilon\)-greedy policy is a stochastic policy which chooses the best action with probability \(1 - \epsilon\) and a random action otherwise. Setting \(\epsilon = 0\) reduces the \(\epsilon\)-greedy policy to a deterministic greedy policy.
greedy_action <- function(s, Q, epsilon = 0) {
available_A <- actions(s)
if (epsilon == 0 || length(available_A) == 1L || runif(1) > epsilon) {
a <- available_A[which.max(Q[s, available_A])]
} else {
a <- sample(available_A, size = 1L)
}
a
}
greedy_prob <- function(s, Q, epsilon = 0) {
p <- structure(rep(0, length(A)), names = A)
available_A <- actions(s)
a <- available_A[which.max(Q[s, available_A])]
p[a] <- 1 - epsilon
p[available_A] <- p[available_A] + epsilon/length(available_A)
p
}
The temporal-difference learning algorithm follows the Sarsa implementation, but just changing how the TD error is calculated lets the algorithm also perform Q-learning and expected Sarsa.
TD_learning <- function(method = "sarsa",
alpha = 0.1,
epsilon = 0.1,
gamma = 1,
N = 100,
verbose = FALSE) {
method <- match.arg(method, c("sarsa", "q", "expected_sarsa"))
# Initialize Q
Q <-
matrix(
NA_real_,
nrow = length(S),
ncol = length(A),
dimnames = list(S, A)
)
for (s in S)
Q[s, actions(s)] <- 0
# loop episodes
for (e in seq(N)) {
s <- start
a <- greedy_action(s, Q, epsilon)
# loop steps in episode
i <- 1L
while (TRUE) {
s_prime <- sample_transition(s, a)
r <- R(s, a, s_prime)
a_prime <- greedy_action(s_prime, Q, epsilon)
if (verbose) {
if (step == 1L)
cat("\n*** Episode", e, "***\n")
cat("Step", i, "- s a r s' a':", s, a, r, s_prime, a_prime, "\n")
}
if (method == "sarsa")
# is called Sarsa because it uses the sequence s, a, r, s', a'
Q[s, a] <-
Q[s, a] + alpha * (r + gamma * Q[s_prime, a_prime] - Q[s, a])
else if (method == "q") {
# a' is greedy instead of using the behavior policy
a_max <- greedy_action(s_prime, Q, epsilon = 0)
Q[s, a] <-
Q[s, a] + alpha * (r + gamma * Q[s_prime, a_max] - Q[s, a])
} else if (method == "expected_sarsa") {
p <- greedy_prob(s_prime, Q, epsilon)
exp_Q_prime <-
sum(greedy_prob(s_prime, Q, epsilon) * Q[s_prime, ], na.rm = TRUE)
Q[s, a] <-
Q[s, a] + alpha * (r + gamma * exp_Q_prime - Q[s, a])
}
s <- s_prime
a <- a_prime
if (is_terminal(s))
break
i <- i + 1L
}
}
Q
}
Sarsa is on-policy and calculates the TD-error for the update as:
\[ \gamma\,Q(S', A') - Q(S, A) \] where the \(S'\) and \(A'\) are determined by the same policy that is used for the agents behavior. In this case it is \(\epsilon\)-greedy.
Up Right Down Left None
1 0.72 0.600 0.647 0.66 NA
2 0.76 0.716 0.668 0.72 NA
3 0.80 0.810 0.737 0.78 NA
4 0.57 0.402 0.582 0.64 NA
5 NA NA NA NA 0
6 0.82 0.882 0.826 0.79 NA
7 0.37 -0.045 0.020 0.23 NA
8 0.73 -0.845 0.252 0.52 NA
9 0.87 0.973 0.722 0.82 NA
10 -0.19 -0.175 -0.052 0.13 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
Calculate the value function \(U\) from the learned Q-function as the largest Q value of any action in a state.
col
row 1 2 3 4
3 0.81 0.88 0.97 0.00
2 0.76 0.00 0.73 0.00
1 0.72 0.64 0.37 0.13
Extract the greedy policy for the learned \(Q\)-value function.
Q-Learning is off-policy and calculates the TD-error for the update as:
\[ \gamma\,\mathrm{max}_a Q(S', a) - Q(S, A) \] where the target policy is greedy reflected by the maximum that chooses the action with the largest \(Q\)-value.
Up Right Down Left None
1 0.74 0.671 0.703 0.71 NA
2 0.79 0.766 0.713 0.76 NA
3 0.80 0.854 0.781 0.80 NA
4 0.66 0.574 0.651 0.69 NA
5 NA NA NA NA 0
6 0.86 0.911 0.851 0.83 NA
7 0.38 0.177 0.388 0.64 NA
8 0.67 -0.503 0.363 0.64 NA
9 0.86 0.968 0.725 0.81 NA
10 -0.37 -0.062 -0.013 0.24 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
Calculate the value function \(U\) from the learned Q-function as the largest Q value of any action in a state.
col
row 1 2 3 4
3 0.85 0.91 0.97 0.00
2 0.79 0.00 0.67 0.00
1 0.74 0.69 0.64 0.24
Extract the greedy policy for the learned \(Q\)-value function.
Expected Sarsa calculates the TD-error for the update as:
\[ \gamma\, E_\pi[Q(S', Q')] - Q(S, A) = \gamma \sum_a\pi(a|S')Q(S', a) - Q(S, A) \] using the expected value under the current policy. It moves deterministically in the same direction as Sarsa moved in expectation. Because it uses the expectation, we can set \(\alpha\) to large values and even 1.
Up Right Down Left None
1 0.625 -0.16 -0.070 -0.0375 NA
2 0.714 0.62 0.068 -0.0035 NA
3 0.342 0.79 0.160 0.3404 NA
4 -0.456 -0.39 -0.512 -0.0318 NA
5 NA NA NA NA 0
6 0.056 0.91 0.032 0.2461 NA
7 0.079 -1.04 -1.043 0.0315 NA
8 0.891 -1.00 0.308 -1.0409 NA
9 0.792 1.00 0.169 -0.0726 NA
10 -1.000 -1.09 -1.043 -1.0000 NA
11 NA NA NA NA 0
12 NA NA NA NA 0
Calculate the value function \(U\) from the learned Q-function as the largest Q value of any action in a state.
col
row 1 2 3 4
3 0.79 0.907 1.000 0
2 0.71 0.000 0.891 0
1 0.62 -0.032 0.079 -1
Extract the greedy policy for the learned \(Q\)-value function.
col
row 1 2 3 4
3 "Right" "Right" "Right" "None"
2 "Up" "None" "Up" "None"
1 "Up" "Left" "Up" "Up"
Not implemented yet.
To improve convergence, \(\epsilon\) and \(\alpha\) are typically reduced slowly over time. This is not implemented here.