MDP - Markov Decision Processes

Author

Michael Hahsler

This code is provided under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License.


1 Introduction

AIMA chapter 17 is about sequential decision problems where the agent’s utility depends on a sequence of decisions. We will implement the key concepts using R for the AIMA 4x3 grid world example.

More details can be found in Reinforcement Learning: An Introduction (RL) by Sutton and Barto (2020) in Chapter 4: Dynamic Programming.

Note that the code in this notebook defines explicit functions matching the textbook definitions and is for demonstration purposes only. Efficient implementations for larger problems use fast vector multiplications instead.

2 Markov Decision Processes

MDPs are sequential decision problems with

  • a fully observable, stochastic environment,
  • a Markovian transition model, and
  • additive rewards.

MDPs are defined by:

  • A set of states \(S\) with an initial state \(s_0\).
  • A set of available \(\mathrm{actions}(s)\) in each state.
  • A transition model \(P(s'|s,a)\) to define how we move between states depending on actions.
  • A reward function \(R(s, a, s')\) defined on state transitions and the actions taken.

A policy \(\pi = \{\pi(s_0), \pi(s_1), \dots\}\) defines for each state which action to take. If we assume that under policy \(\pi\) the agent goes through the state sequence \(s_0, s_1, ..., s_\infty\), then the expected utility of being in state \(s_0\) can be calculated as a sum. To account for the fact that earlier rewards are more important, a discount factor \(\gamma\) is used.

\(U^\pi = E\left[\sum_{t=0}^\infty \gamma^t R(s_t, \pi(s_t), s_{t+1})\right]\)
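
For example (a small illustration not in the original text), with \(\gamma = 0.9\) a reward sequence of \(-0.04, -0.04, +1\) gives the discounted sum \(-0.04 + 0.9 \cdot (-0.04) + 0.9^2 \cdot 1 = 0.734\).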

The goal of solving an MDP is to find an optimal policy that maximizes the expected future utility.

\(\pi^*(s) = \mathrm{argmax}_\pi U^\pi(s)\) for all \(s \in S\)

3 The 4x3 Grid World Example

The example used is the simple 4x3 grid world described in AIMA Figure 17.1 and used again in Chapter 22 as:

AIMA Figure 17.1: (a) A simple, stochastic \(4 \times 3\) environment that presents the agent with a sequential decision problem. (b) Illustration of the transition model of the environment: the “intended” outcome occurs with probability 0.8, but with probability 0.2 the agent moves at right angles to the intended direction. A collision with a wall results in no movement. Transitions into the two terminal states have reward +1 and -1, respectively, and all other transitions have a reward of -0.04.

In the following, we implement states, actions, the reward function, and the transition model. We also show how to represent a policy and how to estimate the expected utility using simulation.

The MDP will be defined as the following global variables/functions:

  • S: set of states.
  • A: set of actions.
  • actions(s): returns the available actions for state \(s\).
  • P(sp, s, a): a function returning the transition probability \(P(s' | s, a)\).
  • R(s, a, s_prime): reward function for the transition from \(s\) to \(s'\) with action \(a\).

Policies are represented as:

  • Deterministic policies are represented as a vector with the action for each state.
  • Stochastic policies are represented as a matrix of probabilities where each row is a state and the columns are the actions.

Other useful functions:

  • sample_transition(s, a): returns a state \(s'\) sampled using the transition model.
  • simulate_utilities(pi, s0 = 1, N = 1000, max_t = 100): estimates the utility of following policy \(\pi\) by starting \(N\) episodes in state \(s_0\). The maximal episode length is max_t to ensure that the function also finishes for policies that never reach a terminal state.

3.1 States

We define the atomic state space \(S\) by labeling the states \(1, 2, ...\). We convert coordinates (rows, columns) to the state label.

# I use capitalized variables as global constants
COLS <- 4
ROWS <- 3

S = seq_len(ROWS * COLS)

LAYOUT <- matrix(S, nrow = ROWS, ncol = COLS)
LAYOUT
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

Note that the rows are displayed upside-down compared to the textbook, so we use a function to display them in reverse order.

show_layout <- function(x) {
    x <- matrix(x, ncol = COLS, nrow = ROWS, dimnames = list(row = seq_len(ROWS),
        col = seq_len(COLS)))
    x[rev(seq_len(ROWS)), ]
}

show_layout(LAYOUT)
   col
row 1 2 3  4
  3 3 6 9 12
  2 2 5 8 11
  1 1 4 7 10

Convert between coordinates and state labels.

rc_to_s <- function(rc) LAYOUT[rbind(rc)]

s_to_rc <- function(s) drop(which(LAYOUT == s, arr.ind = TRUE, useNames = FALSE))


rc_to_s(c(3, 4))
[1] 12
s_to_rc(12)
[1] 3 4

Start state:

start <- 1L


Define terminal states.

is_terminal <- function(s) s %in% c(5, 11, 12)
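
A quick look (not part of the original output) at which cells of the grid are terminal:

show_layout(is_terminal(S))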

3.2 Actions

The complete set of actions is \(A = \{\mathrm{'Up', 'Right', 'Down', 'Left', 'None'}\}\). Not all actions are available in every state. Also, action None is added as the only possible action in an absorbing state.

A = c("Up", "Right", "Down", "Left", "None")

actions <- function(s) {

    # absorbing states
    if (s == 11 || s == 12)
        return("None")

    # illegal state
    if (s == 5)
        return("None")

    c("Up", "Right", "Down", "Left")
}

lapply(S, actions)
[[1]]
[1] "Up"    "Right" "Down"  "Left" 

[[2]]
[1] "Up"    "Right" "Down"  "Left" 

[[3]]
[1] "Up"    "Right" "Down"  "Left" 

[[4]]
[1] "Up"    "Right" "Down"  "Left" 

[[5]]
[1] "None"

[[6]]
[1] "Up"    "Right" "Down"  "Left" 

[[7]]
[1] "Up"    "Right" "Down"  "Left" 

[[8]]
[1] "Up"    "Right" "Down"  "Left" 

[[9]]
[1] "Up"    "Right" "Down"  "Left" 

[[10]]
[1] "Up"    "Right" "Down"  "Left" 

[[11]]
[1] "None"

[[12]]
[1] "None"

3.3 Transition Model

\(P(s' | s, a)\) is the probability of going from state \(s\) to \(s'\) when taking action \(a\). We will create a matrix \(P_a(s' | s)\) for each action.

calc_transition <- function(s, action) {
    action <- match.arg(action, choices = A)

    if (length(s) > 1)
        return(t(sapply(s, calc_transition, action = action)))

    # deal with absorbing and illegal state
    if (s == 11 || s == 12 || s == 5 || action == "None") {
        P <- rep(0, length(S))
        P[s] <- 1
        return(P)
    }

    action_to_delta <- list(Up = c(+1, 0), Down = c(-1, 0), Right = c(0, +1), Left = c(0,
        -1))
    delta <- action_to_delta[[action]]
    dr <- delta[1]
    dc <- delta[2]

    rc <- s_to_rc(s)
    r <- rc[1]
    c <- rc[2]

    if (dr != 0 && dc != 0)
        stop("You can only go up/down or right/left!")

    P <- matrix(0, nrow = ROWS, ncol = COLS)

    # UP/DOWN
    if (dr != 0) {
        new_r <- r + dr
        if (new_r > ROWS || new_r < 1)
            new_r <- r
        ## can't go to the wall at (2, 2)
        if (new_r == 2 && c == 2)
            new_r <- r
        P[new_r, c] <- 0.8

        if (c < COLS & !(r == 2 & (c + 1) == 2))
            P[r, c + 1] <- 0.1 else P[r, c] <- P[r, c] + 0.1
        if (c > 1 & !(r == 2 & (c - 1) == 2))
            P[r, c - 1] <- 0.1 else P[r, c] <- P[r, c] + 0.1
    }

    # RIGHT/LEFT
    if (dc != 0) {
        new_c <- c + dc
        if (new_c > COLS || new_c < 1)
            new_c <- c
        ## can't go to the wall at (2, 2)
        if (r == 2 && new_c == 2)
            new_c <- c
        P[r, new_c] <- 0.8

        if (r < ROWS & !((r + 1) == 2 & c == 2))
            P[r + 1, c] <- 0.1 else P[r, c] <- P[r, c] + 0.1
        if (r > 1 & !((r - 1) == 2 & c == 2))
            P[r - 1, c] <- 0.1 else P[r, c] <- P[r, c] + 0.1
    }

    as.vector(P)
}

Try to go up from state 1 (this is (1,1), the bottom left corner). Note: we cannot go left so there is a .1 chance to stay in place.

calc_transition(1, "Up")
 [1] 0.1 0.8 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
show_layout(calc_transition(1, "Up"))
   col
row   1   2 3 4
  3 0.0 0.0 0 0
  2 0.8 0.0 0 0
  1 0.1 0.1 0 0


Try to go right from (2,1). Since the square to the right, (2,2), is blocked, there is a .8 probability of staying in place.

show_layout(calc_transition(2, "Right"))
   col
row   1 2 3 4
  3 0.1 0 0 0
  2 0.8 0 0 0
  1 0.1 0 0 0


Calculate transitions for each state to each other state. Each row represents a state \(s\) and each column a state \(s'\) so we get a complete definition for \(P_a(s' | s)\). Note that the matrix is stochastic (all rows add up to 1).

Create a matrix for each action.

P_matrices <- lapply(A, FUN = function(a) calc_transition(S, a))
names(P_matrices) <- A
str(P_matrices)
List of 5
 $ Up   : num [1:12, 1:12] 0.1 0 0 0.1 0 0 0 0 0 0 ...
 $ Right: num [1:12, 1:12] 0.1 0.1 0 0 0 0 0 0 0 0 ...
 $ Down : num [1:12, 1:12] 0.9 0.8 0 0.1 0 0 0 0 0 0 ...
 $ Left : num [1:12, 1:12] 0.9 0.1 0 0.8 0 0 0 0 0 0 ...
 $ None : num [1:12, 1:12] 1 0 0 0 0 0 0 0 0 0 ...
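
As a quick sanity check (not part of the original output), every transition matrix should be row-stochastic, i.e., all rows add up to 1 (a small tolerance is used for floating-point arithmetic):

sapply(P_matrices, FUN = function(m) all(abs(rowSums(m) - 1) < 1e-12))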

Create a function interface for \(P(s' | s, a)\).

P <- function(sp, s, a) P_matrices[[a]][s, sp]

P(2, 1, "Up")
[1] 0.8
P(5, 4, "Up")
[1] 0

3.4 Reward

\(R(s, a, s')\) defines the reward for the transition from \(s\) to \(s'\) with action \(a\).

For the textbook example we have:

  • Any move costs utility (a reward of -0.04).
  • Going to state 12 has a reward of +1.
  • Going to state 11 has a reward of -1.

Note that once you are in an absorbing state (11 or 12), then the problem is over and there is no more reward!

R <- function(s, a, s_prime) {
    ## no more reward once we are in absorbing state 11 or 12.
    if (a == "None" || s == 11 || s == 12)
        return(0)

    ## transition to the absorbing states.
    if (s_prime == 12)
        return(+1)
    if (s_prime == 11)
        return(-1)

    ## cost for each move
    return(-0.04)
}

R(1, "Up", 2)
[1] -0.04
R(9, "Right", 12)
[1] 1
R(12, "None", 12)
[1] 0
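
The expected immediate reward for a state-action pair combines the transition model and the reward function: \(r(s,a) = \sum_{s'} P(s'|s,a) \, R(s,a,s')\). A small helper (not part of the original notebook) can compute it:

# expected immediate reward r(s, a) = sum_s' P(s'|s,a) R(s, a, s')
expected_reward <- function(s, a) sum(sapply(S, FUN = function(sp) P(sp, s, a) *
    R(s, a, sp)))

# e.g., going 'Right' in state 9: 0.8 * (+1) + 0.2 * (-0.04) = 0.792
expected_reward(9, "Right")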

3.5 Policy

The solution to an MDP is a policy \(\pi\) which defines which action to take in each state. We represent deterministic policies as a vector of actions. I make up a policy that always goes up and then to the right once the agent hits the top.

pi_manual <- rep("Up", times = length(S))
pi_manual[c(3, 6, 9)] <- "Right"
pi_manual
 [1] "Up"    "Up"    "Right" "Up"    "Up"    "Right" "Up"    "Up"    "Right"
[10] "Up"    "Up"    "Up"   
show_layout(pi_manual)
   col
row 1       2       3       4   
  3 "Right" "Right" "Right" "Up"
  2 "Up"    "Up"    "Up"    "Up"
  1 "Up"    "Up"    "Up"    "Up"

We can also create a random policy by randomly choosing from the available actions for each state.

create_random_deterministic_policy <- function() structure(sapply(S, FUN = function(s) sample(actions(s),
    1L)), names = S)

set.seed(1234)
pi_random <- create_random_deterministic_policy()
pi_random
      1       2       3       4       5       6       7       8       9      10 
 "Left"  "Left" "Right" "Right"  "None"  "Left"  "Down"    "Up"    "Up" "Right" 
     11      12 
 "None"  "None" 
show_layout(pi_random)
   col
row 1       2       3      4      
  3 "Right" "Left"  "Up"   "None" 
  2 "Left"  "None"  "Up"   "None" 
  1 "Left"  "Right" "Down" "Right"

Stochastic policies use probabilities for the actions in each state. We use a simple table of probabilities where each row is a state and the columns are the actions. Below we create random \(\epsilon\)-soft policies, in which each available action has a probability of at least \(\epsilon/|A(s)|\).

We can make a deterministic policy soft.

make_policy_soft <- function(pi, epsilon = 0.1) {
    if (!is.vector(pi))
        stop("pi is not a deterministic policy!")

    p <- matrix(0, nrow = length(S), ncol = length(A), dimnames = list(S, A))

    for (s in S) {
        p[s, actions(s)] <- epsilon/length(actions(s))
        p[s, pi[s]] <- p[s, pi[s]] + (1 - epsilon)
    }

    p
}

make_policy_soft(pi_random)
      Up Right  Down  Left None
1  0.025 0.025 0.025 0.925    0
2  0.025 0.025 0.025 0.925    0
3  0.025 0.925 0.025 0.025    0
4  0.025 0.925 0.025 0.025    0
5  0.000 0.000 0.000 0.000    1
6  0.025 0.025 0.025 0.925    0
7  0.025 0.025 0.925 0.025    0
8  0.925 0.025 0.025 0.025    0
9  0.925 0.025 0.025 0.025    0
10 0.025 0.925 0.025 0.025    0
11 0.000 0.000 0.000 0.000    1
12 0.000 0.000 0.000 0.000    1

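Each row of the resulting stochastic policy should again add up to 1, which we can verify (a quick check not in the original output):

rowSums(make_policy_soft(pi_random))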

Or we can create a completely random \(\epsilon\)-soft policy.

create_random_epsilon_soft_policy <- function(epsilon = 0.1) {

    # split total randomly into n numbers that add up to total
    random_split <- function(n, total) {
        if (n == 1)
            return(total)

        bordersR <- c(sort(runif(n - 1)), 1)
        bordersL <- c(0, bordersR[1:(n - 1)])
        (bordersR - bordersL) * total
    }

    p <- matrix(0, nrow = length(S), ncol = length(A), dimnames = list(S, A))
    for (s in S) p[s, actions(s)] <- epsilon/length(actions(s)) + random_split(n = length(actions(s)),
        1 - epsilon)

    p
}

set.seed(1234)
pi_random_epsilon_soft <- create_random_epsilon_soft_policy()
pi_random_epsilon_soft
           Up      Right      Down       Left None
1  0.12733307 0.47101419 0.0367222 0.36493054    0
2  0.58604150 0.04023805 0.2235443 0.15017615    0
3  0.03354618 0.22574927 0.4151799 0.32552462    0
4  0.48782603 0.05265132 0.1587548 0.30076784    0
5  0.00000000 0.00000000 0.0000000 0.00000000    1
6  0.27946023 0.03362403 0.5930059 0.09390986    0
7  0.26513870 0.04246225 0.5209651 0.17143393    0
8  0.19305051 0.06595281 0.1009479 0.64004879    0
9  0.06099633 0.13214508 0.1542826 0.65257597    0
10 0.22191959 0.30120821 0.2814109 0.19546130    0
11 0.00000000 0.00000000 0.0000000 0.00000000    1
12 0.00000000 0.00000000 0.0000000 0.00000000    1

3.6 Expected Utility

The expected utility can be calculated by

\(U^\pi = E\left[\sum_{t=0}^\infty \gamma^t R(s_t, \pi(s_t), s_{t+1})\right]\)

We need to define the discount factor.

GAMMA <- 1

We can evaluate the utility of a policy using simulation. For the stochastic transition model, we need to be able to sample the state \(s'\) the system transitions to when using action \(a\) in state \(s\).

sample_transition <- function(s, a) sample(S, size = 1, prob = P_matrices[[a]][s,
    ])

sample_transition(1, "Up")
[1] 1
table(replicate(n = 100, sample_transition(1, "Up")))

 1  2  4 
 9 84  7 

We can now simulate the utility for one episode. Note that we use the cutoff max_t in case a policy does not end up in a terminal state before that.

simulate_utility <- function(pi, s0 = 1, max_t = 100) {
    s <- s0
    U <- 0
    t <- 0

    while (TRUE) {
        ## get action from policy (matrix means it is a stochastic policy)
        if (!is.matrix(pi))
            a <- pi[s] else a <- sample(A, size = 1, prob = pi[s, ])

        ## sample a transition given the action from the policy
        s_prime <- sample_transition(s, a)

        ## add the discounted reward for this transition
        U <- U + GAMMA^t * R(s, a, s_prime)

        s <- s_prime

        ## reached an absorbing state?
        if (s == 11 || s == 12 || s == 5)
            break

        t <- t + 1
        if (t >= max_t)
            break
    }

    U
}

Simulate N episodes.

simulate_utilities <- function(pi, s0 = 1, N = 1000, max_t = 100) replicate(N, simulate_utility(pi,
    s0, max_t))

utility_manual <- simulate_utilities(pi_manual)

The expected utility for starting from state \(s_0 = 1\) is:

mean(utility_manual)
[1] 0.69896
hist(utility_manual, xlim = c(-1, 1))

Compare with the utility of the random policy.

utility_random <- simulate_utilities(pi_random, max_t = 100)

table(utility_random)
utility_random
  -4 
1000 


The random policy performs very badly and most likely stumbles around for max_t moves at a cost of 0.04 each. The manually created policy is obviously better.

We can use simulation to estimate the expected utility for starting from each state following the policy.

U_manual <- sapply(S, FUN = function(s) mean(simulate_utilities(pi_manual, s0 = s)))
show_layout(U_manual)
   col
row       1        2       3      4
  3 0.86024  0.89644 0.95220  0.000
  2 0.79400 -0.04000 0.71636  0.000
  1 0.71748  0.35900 0.51628 -0.867
U_random <- sapply(S, FUN = function(s) mean(simulate_utilities(pi_random, s0 = s)))
show_layout(U_random)
   col
row  1        2        3       4
  3 -4 -4.00000 -1.61420  0.0000
  2 -4  0.00000 -1.45852  0.0000
  1 -4 -1.84936 -1.79792 -1.3744

4 Solution Methods

It is convenient to solve MDPs using state-action values, so we first define the Q-function.

4.1 Q-Function

Let’s define the expected utility of a state given that the agent always chooses the optimal action (which it hopefully will if it is rational).

\(U(s) = \max_{a \in A(s)}\sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma U(s')]\)

This equation is called the Bellman equation, resulting in a system with one equation per state \(s\). This system of equations is hard to solve for all \(U(s)\) values because of the nonlinear \(\max()\) operator.

Let’s define a function for the expected utility of any possible action \(a\) (not just the optimal one) in a given state \(s\). This is called the Q-function (or state-action value function):

\(Q(s,a) = \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma U(s')]\)

This function is convenient for solving MDPs and can easily be implemented.

Q_value <- function(s, a, U) {
    if (!(a %in% actions(s)))
        return(NA)

    sum(sapply(S, FUN = function(sp) P(sp, s, a) * (R(s, a, sp) + GAMMA * U[sp])))
}

The issue is that we need to know \(U\), the expected utility of a state given optimal decisions. Value iteration uses a simple iterative algorithm to solve this problem by successively updating Q and U.

Note that \(U(s) = \max_a Q(s, a)\) holds and we get:

\(Q(s,a) = \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma \max_{a'} Q(s', a')]\)

4.2 Value Iteration

The goal is to find the unique utility function \(U\) (a vector of utilities, one for each state) for the MDP and then derive the implied optimal policy \(\pi^*\).

Algorithm: Start with a \(U(s)\) vector of 0 for all states and then update the vector iteratively (Bellman update) until it converges. This procedure is guaranteed to converge to the unique optimal solution. The pseudocode is given in AIMA Figure 17.6:

AIMA Figure 17.6: The value iteration algorithm for calculating utilities of states.

Stopping criterion: \(||U^\pi - U||_\infty\) is called the policy loss (i.e., the most the agent can lose by using policy \(\pi\) instead of the optimal policy \(\pi^*\) implied in \(U\)). The max-norm \(||x||_\infty\) is defined as the largest component of a vector \(x\).

It can be shown that if \(||U_{i+1} - U_i||_\infty < \epsilon(1-\gamma)/\gamma\), then \(||U_{i+1} - U||_\infty < \epsilon\). This can be used as a stopping criterion with a guarantee of a policy loss of less than \(\epsilon\).

value_iteration <- function(eps, verbose = FALSE) {
    U_prime <- rep(0, times = length(S))
    i <- 1L

    while (TRUE) {
        if (verbose)
            cat("Iteration:", i)
        # cat('U:', U_prime, '\n')

        U <- U_prime
        delta <- 0

        for (s in S) {
            U_prime[s] <- max(sapply(actions(s), FUN = function(a) Q_value(s, a,
                U)))
            delta <- max(delta, abs(U_prime[s] - U[s]))
        }

        if (verbose)
            cat(" -> delta:", delta, "\n")

        ## Note: for GAMMA = 1 this threshold is 0, so we stop only once the
        ## utilities no longer change.
        if (delta <= eps * (1 - GAMMA)/GAMMA)
            break

        i <- i + 1L
    }

    cat("Iterations needed:", i, "\n")

    U
}
U <- value_iteration(eps = 1e-06)
Iterations needed: 58 
show_layout(U)
   col
row         1         2         3         4
  3 0.8515582 0.9078082 0.9578082 0.0000000
  2 0.8015582 0.0000000 0.7002740 0.0000000
  1 0.7453082 0.6953082 0.6514155 0.4279249

For the optimal policy, we choose in each state the action that maximizes the expected utility. This is called the maximum expected utility (MEU) policy. The action that maximizes the utility can be found using the Q-function.

\(\pi^*(s) = \mathrm{argmax}_a Q(s, a)\)

For state 1, 'Up' is the best move:

sapply(A, FUN = function(a) Q_value(s = 1, a, U = U))
       Up     Right      Down      Left      None 
0.7453082 0.6709332 0.7003082 0.7109332        NA 

Calculate the Q-function for all \(S \times A\).

Q_value_vec <- Vectorize(Q_value, vectorize.args = c("s", "a"))

QVs <- outer(S, A, FUN = function(s, a) Q_value_vec(s, a, U = U))
colnames(QVs) <- A
QVs
              Up      Right      Down      Left None
 [1,]  0.7453082  0.6709332 0.7003082 0.7109332   NA
 [2,]  0.8015582  0.7609332 0.7165582 0.7609332   NA
 [3,]  0.8171832  0.8515582 0.7771832 0.8065582   NA
 [4,]  0.6559189  0.6201941 0.6559189 0.6953082   NA
 [5,]         NA         NA        NA        NA    0
 [6,]  0.8671832  0.9078082 0.8671832 0.8228082   NA
 [7,]  0.6325425  0.4375089 0.5934557 0.6514155   NA
 [8,]  0.7002740 -0.6470776 0.4551598 0.6811416   NA
 [9,]  0.9210274  0.9578082 0.7150000 0.8520548   NA
[10,] -0.7000660  0.2491324 0.4102740 0.4279249   NA
[11,]         NA         NA        NA        NA    0
[12,]         NA         NA        NA        NA    0
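
Since \(U(s) = \max_a Q(s, a)\), taking the row maxima of this table (ignoring unavailable actions) should recover the utilities found by value iteration (a quick check not in the original output):

show_layout(apply(QVs, MARGIN = 1, FUN = max, na.rm = TRUE))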

The optimal policy is the greedy policy that always picks the action with the largest Q-value.

pi_star <- A[apply(QVs, MARGIN = 1, which.max)]
show_layout(pi_star)
   col
row 1       2       3       4     
  3 "Right" "Right" "Right" "None"
  2 "Up"    "None"  "Up"    "None"
  1 "Up"    "Left"  "Left"  "Left"

Estimate the expected utility using simulation.

utility_opt <- simulate_utilities(pi_star)

# expected utility
mean(utility_opt)
[1] 0.73692
hist(utility_opt, xlim = c(-1, 1))

Compare three policies.

c(random = mean(utility_random), manual = mean(utility_manual), opt = mean(utility_opt))
  random   manual      opt 
-4.00000  0.69896  0.73692 

Since the utility estimate for the optimal policy is very close to \(U\), we can estimate the policy loss (i.e., the most the agent can lose by using \(\pi\) instead of \(\pi^*\)) of the other policies, which is given by:

\(||U^\pi - U||_\infty\)

Here is the policy loss for the manual policy. The maximum norm is the component with the largest difference. First, we calculate the absolute difference for each state.

show_layout(abs(U_manual - U))
   col
row           1          2           3        4
  3 0.008681781 0.01136822 0.005608219 0.000000
  2 0.007558219 0.04000000 0.016086027 0.000000
  1 0.027828219 0.33630822 0.135135525 1.294925

The maximum is:

max(abs(U_manual - U))
[1] 1.294925
which.max(abs(U_manual - U))
[1] 10

The policy loss is driven by the bad action taken in state 10, which is at coordinate (1, 4).

4.3 Policy Iteration

Policy iteration tries to directly find the optimal policy. It alternates between two steps:

  1. Policy evaluation: given a current policy \(\pi_i\), calculate \(U^{\pi_i}\).
  2. Policy improvement: calculate a new MEU policy \(\pi_{i+1}\).

The pseudocode is given in AIMA Figure 17.9:

AIMA Figure 17.9: The policy iteration algorithm for calculating an optimal policy.

For policy evaluation, we need to solve:

\(U_i(s) = \sum_{s'} P(s'|s, \pi_i(s))[R(s, \pi_i(s), s') + \gamma U_i(s')]\)

This is slightly simpler than the general Bellman equation, since the action in each state is fixed by the policy and there is no non-linear \(\max()\) operator. For small state spaces, this system of linear equations can be solved exactly in \(O(n^3)\) with standard linear algebra methods.
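
As an illustration (this helper is not part of the original notebook), the linear system can be solved directly with R's solve(). We fix the utility of the absorbing/illegal states at 0 and solve only for the remaining states. Note that with \(\gamma = 1\) this only works for a proper policy, i.e., one that eventually reaches a terminal state (like pi_manual); otherwise the system is singular.

exact_policy_evaluation <- function(pi) {
    U <- rep(0, times = length(S))
    nt <- S[!is_terminal(S)]  # solve only for the non-terminal states

    # transition matrix and expected immediate reward under policy pi
    P_pi <- t(sapply(nt, FUN = function(s) P_matrices[[pi[s]]][s, nt]))
    r_pi <- sapply(nt, FUN = function(s) sum(sapply(S, FUN = function(sp) P(sp,
        s, pi[s]) * R(s, pi[s], sp))))

    # solve (I - GAMMA * P_pi) U = r_pi
    U[nt] <- solve(diag(length(nt)) - GAMMA * P_pi, r_pi)
    U
}

show_layout(exact_policy_evaluation(pi_manual))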

For large state spaces, we can do approximate policy evaluation by performing only a few iterations of a simplified Bellman update:

\(U_{i+1}(s) \leftarrow \sum_{s'} P(s'|s, \pi_i(s))[R(s, \pi_i(s), s') + \gamma U_i(s')]\)

We implement here approximate policy evaluation with N iterations.

approx_policy_evaluation <- function(pi, U = NULL, N = 10) {
    # start with all 0s if no previous U is given
    if (is.null(U))
        U <- rep(0, times = length(S))

    for (i in seq_len(N)) {
        for (s in S) {
            U[s] = sum(sapply(S, FUN = function(s_prime) {
                P(s_prime, s, pi[s]) * (R(s, pi[s], s_prime) + GAMMA * U[s_prime])
            }))
        }
    }
    U
}
approx_policy_evaluation(pi_random)
 [1] -0.42447359 -0.48544307 -0.62180170 -0.60470718  0.00000000 -0.65671466
 [7] -0.64166748 -0.06952390  0.07027771 -0.88579732  0.00000000  0.00000000
approx_policy_evaluation(pi_manual)
 [1]  0.6772416  0.7999334  0.8512886  0.1822976 -0.4000000  0.9078008
 [7]  0.4536453  0.7002716  0.9578078 -0.8474310  0.0000000  0.0000000

We will implement modified policy iteration, where modified means that we use approximate policy evaluation.

policy_iteration <- function(N = 10) {
    U <- rep(0, times = length(S))
    pi <- create_random_deterministic_policy()

    while (TRUE) {
        U <- approx_policy_evaluation(pi, U, N)
        unchanged <- TRUE
        for (s in S) {
            actns <- actions(s)
            a <- actns[which.max(sapply(actns, FUN = function(a) Q_value(s, a, U)))]
            if (Q_value(s, a, U) > Q_value(s, pi[s], U)) {
                pi[s] <- a
                unchanged <- FALSE
            }
        }

        if (unchanged)
            break
    }
    pi
}
pi_opt_policy_it <- policy_iteration()
show_layout(pi_opt_policy_it)
   col
row 1       2       3       4     
  3 "Right" "Right" "Right" "None"
  2 "Up"    "None"  "Up"    "None"
  1 "Up"    "Left"  "Left"  "Left"

4.4 Linear Programming

Rewriting the Bellman equations as an LP formulation requires replacing the non-linear \(\max()\) operation using additional constraints. The LP can be solved in polynomial time. In practice this is too slow for larger problems. The dynamic programming solution above is typically more efficient, but it is also restricted to small problems.
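
Below is a sketch of this LP (not part of the original notebook), assuming the lpSolve package is installed. We fix the utilities of the absorbing/illegal states at 0, add one constraint per state-action pair, and minimize the sum of the utilities. Note that lpSolve restricts decision variables to be non-negative, which is unproblematic here because all utilities of this problem are non-negative.

library(lpSolve)

nt <- S[!is_terminal(S)]
moves <- c("Up", "Right", "Down", "Left")

# one constraint per state-action pair:
# U(s) - GAMMA * sum_s' P(s'|s,a) U(s') >= sum_s' P(s'|s,a) R(s,a,s')
const_mat <- NULL
const_rhs <- NULL
for (s in nt) {
    for (a in moves) {
        row <- -GAMMA * P_matrices[[a]][s, nt]
        row[match(s, nt)] <- row[match(s, nt)] + 1
        const_mat <- rbind(const_mat, row)
        const_rhs <- c(const_rhs, sum(sapply(S, FUN = function(sp) P(sp, s, a) *
            R(s, a, sp))))
    }
}

sol <- lp(direction = "min", objective.in = rep(1, times = length(nt)),
    const.mat = const_mat, const.dir = rep(">=", times = nrow(const_mat)),
    const.rhs = const_rhs)

U_lp <- rep(0, times = length(S))
U_lp[nt] <- sol$solution
show_layout(round(U_lp, 4))

The result should agree with the utilities found by value iteration up to numerical precision.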

4.5 Approximate Offline Methods

Reinforcement learning is discussed in AIMA Chapter 22.