Solving a Maze using Reinforcement Learning

Author

Michael Hahsler

Introduction

This report demonstrates how value iteration (model-based RL) and Q-learning (model-free RL) solve a maze similar to the mazes we have used for tree search.

Install and Load Package markovDP

Install the current development version of [markovDP](https://mhahsler.r-universe.dev/markovDP).

if (!"markovDP" %in% rownames(installed.packages())) 
  install.packages('markovDP', repos = c(
    'https://mhahsler.r-universe.dev', 
    'https://cloud.r-project.org'))

library(markovDP)

Read a Maze

Mazes are so-called gridworlds, where the agent moves around in a 2D environment composed of tiles.
The markovDP package already comes with the mazes we use in class.

maze_dir <- system.file("mazes", package = "markovDP")
dir(maze_dir)
[1] "empty_2_maze.txt" "empty_maze.txt"   "L_maze.txt"       "large_maze.txt"  
[5] "loops_maze.txt"   "medium_maze.txt"  "open_maze.txt"    "small_maze.txt"  
[9] "wall_maze.txt"   
maze <- gw_read_maze(file.path(maze_dir, "L_maze.txt"))
maze
MDP, list - Maze
  Discount factor: 1
  Horizon: Inf epochs
  Size: 155 states / 4 actions
  Start: s(10,6)

  List components: 'name', 'discount', 'horizon', 'states', 'actions',
    'start', 'transition_prob', 'reward', 'info', 'absorbing_states'
gw_plot(maze)

The transition model is stored as a function.

maze$transition_prob
function(model, action, start.state) {
  P <- structure(numeric(length(S(model))), names = S(model))
  
  ai <- match(action, A(model))
  
  # stay in place for unknown actions
  if (is.na(ai)) {
    warning("Unknown action", action)
    P[start.state] <- 1
    return(P)
  }
  
  # absorbing states
  if (start.state %in% model$info$absorbing_states) {
    P[start.state] <- 1
    return(P)
  }
  
  # move
  rc <- gw_s2rc(start.state)
  rc <- switch(ai, rc + c(-1, 0), rc + c(0, +1), rc + c(+1, 0), rc + c(0, -1), )
  
  es <- gw_rc2s(rc)
  
  # stay in place if we would leave the gridworld
  if (!(es %in% S(model))) {
    es <- start.state
  }
  
  P[es] <- 1
  P
}
<bytecode: 0x65305f5f5188>
<environment: namespace:markovDP>
gw_plot_transition_graph(maze)
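
The stored function can also be called directly to inspect the dynamics. For example, the following call (using the action names that appear in the trajectories and policies below) returns a probability vector over all 155 states with a single 1 at the tile reached by moving right from the start state:

# Inspect the deterministic transition for moving right from the start state.
# The result is a named probability vector over all states (not shown here).
p <- maze$transition_prob(maze, "right", "s(10,6)")
p[p > 0]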

The reward model is stored as a table.

maze$reward
  action start.state end.state value
1   <NA>        <NA>      <NA>    -1
2   <NA>        <NA>   s(3,13)   100
3   <NA>     s(3,13)      <NA>     0

The reward model specifies a cost of 1 (a reward of -1) for each transition and a reward of +100 for reaching the goal state s(3,13); the third rule gives no further reward once the agent is in the goal state, which is absorbing. The best solution is therefore the shortest path to the goal.
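
The <NA> entries act as wildcards that match any action or state. Below is a minimal sketch of how such a rule table could be resolved for a single transition, assuming maze$reward is the data frame printed above and that later, more specific rows override the general step cost (markovDP performs this lookup internally):

# Sketch only: resolve the reward for one transition from the wildcard
# table above. Assumes the last matching row wins; the real lookup is
# done inside markovDP.
lookup_reward <- function(reward_df, action, start.state, end.state) {
  r <- 0
  for (i in seq_len(nrow(reward_df))) {
    row <- reward_df[i, ]
    if ((is.na(row$action)      || row$action == action) &&
        (is.na(row$start.state) || row$start.state == start.state) &&
        (is.na(row$end.state)   || row$end.state == end.state))
      r <- row$value   # keep the value of the last matching row
  }
  r
}

lookup_reward(maze$reward, "right", "s(10,6)", "s(10,7)")  # regular step: -1
lookup_reward(maze$reward, "up", "s(4,13)", "s(3,13)")     # reaching the goal: 100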

Remember that tree search starts at the start state and explores the search space until it finds the goal state.

Solve the Maze MDP as a Planning Agent

We solve the MDP directly using the problem description as a model for the environment. This approach would be used by a planning agent similar to the tree search algorithms.

Since the agent uses a complete model of the environment, this is a model-based reinforcement learning approach. Here we use value iteration. In each iteration, the algorithm sweeps through all states and updates each state value using the immediate reward plus the utility of the best state the agent could reach. It stops when the state values no longer change by more than a set error threshold.
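
The update in each sweep is the Bellman backup. Here is a minimal sketch using the generic accessors S() and A() and the transition function shown above; reward_fn(a, s, s_prime) stands for a hypothetical reward lookup such as the lookup_reward() sketch earlier, and solve_MDP() implements all of this much more efficiently inside the package:

# Sketch only: value iteration with Bellman backups.
# reward_fn(a, s, s2) is a hypothetical reward lookup (see the sketch above).
value_iteration_sketch <- function(model, reward_fn, error = 1e-6) {
  V <- structure(numeric(length(S(model))), names = S(model))
  repeat {
    delta <- 0
    for (s in S(model)) {
      # Q(s,a) = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
      Q <- sapply(A(model), function(a) {
        P <- model$transition_prob(model, a, s)
        R <- sapply(S(model), function(s2) reward_fn(a, s, s2))
        sum(P * (R + model$discount * V))
      })
      delta <- max(delta, abs(max(Q) - V[s]))
      V[s] <- max(Q)
    }
    if (delta < error) break   # values changed less than the error threshold
  }
  V
}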

system.time(sol <- solve_MDP(maze, method = "value_iteration"))
   user  system elapsed 
  0.077   0.001   0.082 
sol
MDP, list - Maze
  Discount factor: 1
  Horizon: Inf epochs
  Size: 155 states / 4 actions
  Start: s(10,6)
  Solved:
    Method: 'value_iteration'
    Solution converged: TRUE

  List components: 'name', 'discount', 'horizon', 'states', 'actions',
    'start', 'transition_prob', 'reward', 'info', 'absorbing_states',
    'solution'
gw_plot(sol)

gw_path(sol)
Warning in sample_MDP(model, n = 1, horizon = horizon, epsilon = 0, trajectories = TRUE): Simulation for undiscounted problems need a finite simulation horizon.
Using a maximum horizon of |S| x |A| = 620 to avoid infinite loops.
$path
   episode time        s     a   r  s_prime
1        1    0  s(10,6) right  -1  s(10,7)
2        1    1  s(10,7)  down  -1  s(11,7)
3        1    2  s(11,7)  down  -1  s(12,7)
4        1    3  s(12,7) right  -1  s(12,8)
5        1    4  s(12,8) right  -1  s(12,9)
6        1    5  s(12,9) right  -1 s(12,10)
7        1    6 s(12,10) right  -1 s(12,11)
8        1    7 s(12,11) right  -1 s(12,12)
9        1    8 s(12,12) right  -1 s(12,13)
10       1    9 s(12,13)    up  -1 s(11,13)
11       1   10 s(11,13)    up  -1 s(10,13)
12       1   11 s(10,13)    up  -1  s(9,13)
13       1   12  s(9,13)    up  -1  s(8,13)
14       1   13  s(8,13)    up  -1  s(7,13)
15       1   14  s(7,13)    up  -1  s(6,13)
16       1   15  s(6,13)    up  -1  s(5,13)
17       1   16  s(5,13)    up  -1  s(4,13)
18       1   17  s(4,13)    up 100  s(3,13)

$reward
[1] 83

$solved
[1] TRUE

The color represents the state value. Here are the learned state values and the optimal action for each state (policy).

policy(sol)
      states   V action
1     s(2,2)  89   down
2     s(3,2)  90  right
3     s(4,2)  89     up
4     s(5,2)  88     up
5     s(6,2)  87  right
6     s(7,2)  86     up
7     s(8,2)  85     up
8     s(9,2)  84  right
9    s(10,2)  83     up
10   s(11,2)  82     up
11   s(12,2)  81  right
12   s(13,2)  80     up
13   s(14,2)  79     up
14    s(2,3)  90   down
15    s(3,3)  91  right
16    s(4,3)  90  right
17    s(5,3)  89     up
18    s(6,3)  88     up
19    s(7,3)  87     up
20    s(8,3)  86     up
21    s(9,3)  85     up
22   s(10,3)  84     up
23   s(11,3)  83     up
24   s(12,3)  82  right
25   s(13,3)  81  right
26   s(14,3)  80     up
27    s(2,4)  91   down
28    s(3,4)  92  right
29    s(4,4)  91  right
30    s(6,4)  87   left
31    s(7,4)  86   left
32    s(8,4)  85   left
33    s(9,4)  84   left
34   s(10,4)  83   left
35   s(11,4)  82   down
36   s(12,4)  83  right
37   s(13,4)  82     up
38   s(14,4)  81  right
39    s(2,5)  92  right
40    s(3,5)  93  right
41    s(4,5)  92  right
42    s(6,5)  86   left
43    s(7,5)  85     up
44    s(8,5)  84   left
45    s(9,5)  83     up
46   s(10,5)  82     up
47   s(11,5)  83  right
48   s(12,5)  84  right
49   s(13,5)  83  right
50   s(14,5)  82     up
51    s(2,6)  93   down
52    s(3,6)  94  right
53    s(4,6)  93     up
54    s(6,6)  85   left
55    s(7,6)  84     up
56    s(8,6)  83   left
57    s(9,6)  82   down
58   s(10,6)  83  right
59   s(11,6)  84  right
60   s(12,6)  85  right
61   s(13,6)  84  right
62   s(14,6)  83  right
63    s(2,7)  94  right
64    s(3,7)  95  right
65    s(4,7)  94  right
66    s(6,7)  84   left
67    s(7,7)  83     up
68    s(8,7)  82   left
69    s(9,7)  83   down
70   s(10,7)  84   down
71   s(11,7)  85   down
72   s(12,7)  86  right
73   s(13,7)  85  right
74   s(14,7)  84  right
75    s(2,8)  95   down
76    s(3,8)  96  right
77    s(4,8)  95  right
78    s(6,8)  83   left
79    s(7,8)  82  right
80    s(8,8)  83  right
81    s(9,8)  84   down
82   s(10,8)  85   down
83   s(11,8)  86   down
84   s(12,8)  87  right
85   s(13,8)  86  right
86   s(14,8)  85  right
87    s(2,9)  96   down
88    s(3,9)  97  right
89    s(4,9)  96     up
90    s(6,9)  82  right
91    s(7,9)  83   down
92    s(8,9)  84  right
93    s(9,9)  85   down
94   s(10,9)  86   down
95   s(11,9)  87   down
96   s(12,9)  88  right
97   s(13,9)  87     up
98   s(14,9)  86     up
99   s(2,10)  97  right
100  s(3,10)  98  right
101  s(4,10)  97     up
102  s(6,10)  83   down
103  s(7,10)  84   down
104  s(8,10)  85   down
105  s(9,10)  86   down
106 s(10,10)  87   down
107 s(11,10)  88   down
108 s(12,10)  89  right
109 s(13,10)  88  right
110 s(14,10)  87  right
111  s(2,11)  98   down
112  s(3,11)  99  right
113  s(4,11)  98     up
114 s(12,11)  90  right
115 s(13,11)  89  right
116 s(14,11)  88  right
117  s(2,12)  99  right
118  s(3,12) 100  right
119  s(4,12)  99  right
120  s(5,12)  98     up
121  s(6,12)  97     up
122  s(7,12)  96  right
123  s(8,12)  95  right
124  s(9,12)  94     up
125 s(10,12)  93  right
126 s(11,12)  92     up
127 s(12,12)  91  right
128 s(13,12)  90     up
129 s(14,12)  89  right
130  s(2,13) 100   down
131  s(3,13)   0     up
132  s(4,13) 100     up
133  s(5,13)  99     up
134  s(6,13)  98     up
135  s(7,13)  97     up
136  s(8,13)  96     up
137  s(9,13)  95     up
138 s(10,13)  94     up
139 s(11,13)  93     up
140 s(12,13)  92     up
141 s(13,13)  91     up
142 s(14,13)  90     up
143  s(2,14)  99   left
144  s(3,14) 100   left
145  s(4,14)  99   left
146  s(5,14)  98     up
147  s(6,14)  97     up
148  s(7,14)  96     up
149  s(8,14)  95     up
150  s(9,14)  94     up
151 s(10,14)  93     up
152 s(11,14)  92   left
153 s(12,14)  91   left
154 s(13,14)  90   left
155 s(14,14)  89   left

Solve the maze step-by-step.

gw_animate(maze, "value_iteration", n = 25, zlim = c(-1,100))

Value iteration looks a lot like breadth-first search starting at the goal, or like a flood-fill algorithm. The goal is to find the optimal state value for each state.

The space complexity of \(O(|S|)\) is problematic because the number of states is typically very large.

Reinforcement Learning for Unknown Mazes

An unknown environment means that we do not have a model of the environment. The agent does not know the transition function or the reward structure and has to learn a good policy and at least parts of the transition and reward models.

Here we assume that the agent has no direct access to the MDP. The agent can only try actions, and the environment uses the MDP to simulate how it reacts. These methods are called model-free reinforcement learning approaches.
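
The environment's side of this interaction can be sketched as a single step function that uses the MDP internally, while the agent only ever sees the resulting next state and reward. lookup_reward() refers to the hypothetical helper sketched earlier:

# Sketch only: the environment simulates one step for the agent.
# The agent never sees the MDP, only the returned s_prime and r.
env_step <- function(model, s, a) {
  P <- model$transition_prob(model, a, s)     # the environment knows the model
  s_prime <- sample(S(model), 1, prob = P)    # draw the successor state
  r <- lookup_reward(model$reward, a, s, s_prime)
  list(s_prime = s_prime, r = r)
}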

Monte Carlo Control

On-policy Monte Carlo control uses an \(\epsilon\)-greedy policy to explore the maze. It is called on-policy because it learns the value of the same \(\epsilon\)-greedy policy that it uses for exploration.

Monte Carlo methods create one episode (a search from the start to the goal) at a time and then update the state values in reverse order using the learning parameter \(\alpha\) (a sketch of this update follows the sample episode below). This is what such an episode looks like:

set.seed(1111)
sample_MDP(maze, n = 1, horizon = 50, trajectories = TRUE)$trajectories
   episode time       s     a  r s_prime
1        1    0 s(10,6) right -1 s(10,7)
2        1    1 s(10,7)    up -1  s(9,7)
3        1    2  s(9,7) right -1  s(9,8)
4        1    3  s(9,8)  left -1  s(9,7)
5        1    4  s(9,7)    up -1  s(8,7)
6        1    5  s(8,7)  left -1  s(8,6)
7        1    6  s(8,6)  down -1  s(9,6)
8        1    7  s(9,6)    up -1  s(8,6)
9        1    8  s(8,6)    up -1  s(7,6)
10       1    9  s(7,6) right -1  s(7,7)
11       1   10  s(7,7) right -1  s(7,8)
12       1   11  s(7,8)  left -1  s(7,7)
13       1   12  s(7,7) right -1  s(7,8)
14       1   13  s(7,8)  left -1  s(7,7)
15       1   14  s(7,7)  down -1  s(8,7)
16       1   15  s(8,7)  left -1  s(8,6)
17       1   16  s(8,6)    up -1  s(7,6)
18       1   17  s(7,6)    up -1  s(6,6)
19       1   18  s(6,6) right -1  s(6,7)
20       1   19  s(6,7) right -1  s(6,8)
21       1   20  s(6,8)  left -1  s(6,7)
22       1   21  s(6,7)  left -1  s(6,6)
23       1   22  s(6,6)    up -1  s(6,6)
24       1   23  s(6,6) right -1  s(6,7)
25       1   24  s(6,7) right -1  s(6,8)
26       1   25  s(6,8)  left -1  s(6,7)
27       1   26  s(6,7)  down -1  s(7,7)
28       1   27  s(7,7)  left -1  s(7,6)
29       1   28  s(7,6) right -1  s(7,7)
30       1   29  s(7,7)    up -1  s(6,7)
31       1   30  s(6,7)  left -1  s(6,6)
32       1   31  s(6,6) right -1  s(6,7)
33       1   32  s(6,7)  down -1  s(7,7)
34       1   33  s(7,7)  left -1  s(7,6)
35       1   34  s(7,6)  down -1  s(8,6)
36       1   35  s(8,6)    up -1  s(7,6)
37       1   36  s(7,6)  down -1  s(8,6)
38       1   37  s(8,6)  down -1  s(9,6)
39       1   38  s(9,6)  left -1  s(9,5)
40       1   39  s(9,5)    up -1  s(8,5)
41       1   40  s(8,5) right -1  s(8,6)
42       1   41  s(8,6)  down -1  s(9,6)
43       1   42  s(9,6)  down -1 s(10,6)
44       1   43 s(10,6)  down -1 s(11,6)
45       1   44 s(11,6)  left -1 s(11,5)
46       1   45 s(11,5)  left -1 s(11,4)
47       1   46 s(11,4)  left -1 s(11,3)
48       1   47 s(11,3)  down -1 s(12,3)
49       1   48 s(12,3)    up -1 s(11,3)
50       1   49 s(11,3)  left -1 s(11,2)

After 50 steps, the random walk still has not found the goal, which is a problem. This is why I use a much larger horizon below.
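
Once an episode ends (or the horizon is reached), the state-action values are updated backwards through the trajectory. Here is a minimal sketch of the every-visit, constant-\(\alpha\) update, assuming a trajectory data frame like the one above and a Q-table with one row per state and one column per action; the implementation inside solve_MDP() may differ in details:

# Sketch only: every-visit, constant-alpha Monte Carlo update for one
# episode (a trajectory data frame as shown above).
# Q <- matrix(0, length(S(maze)), length(A(maze)),
#             dimnames = list(S(maze), A(maze)))
mc_update_sketch <- function(Q, episode, alpha = 0.3, gamma = 1) {
  G <- 0
  for (i in rev(seq_len(nrow(episode)))) {       # walk the episode backwards
    s <- as.character(episode$s[i])
    a <- as.character(episode$a[i])
    G <- gamma * G + episode$r[i]                # return following step i
    Q[s, a] <- Q[s, a] + alpha * (G - Q[s, a])   # move Q(s,a) toward the return
  }
  Q
}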

system.time(sol <- solve_MDP(maze, method = "MC_on_policy", 
                 horizon = 1000, n = 100, epsilon =.2, alpha = 0.3))
   user  system elapsed 
  1.486   0.005   1.491 
gw_plot(sol)

We see that the agent explores a large part of the maze but then concentrates on the direct path to the goal.

Step-by-step solution.

sol <- gw_animate(maze, "MC_on_policy", n = 25, 
                  horizon = 1000, epsilon = 0.2, alpha = 0.3, first_visit = FALSE)

Initially, the random exploration does not lead to the goal within the horizon, but eventually it finds a path. The behavior can be compared to

  • repeated depth-first search with a depth limit,
  • or Monte Carlo tree search for game trees, where the utility is propagated from the end of the game to the current state.

Q-Learning

Q-learning is a popular model-free RL method that does not wait for complete episodes but updates its estimates one step at a time.
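
The core of the method is the one-step temporal-difference update applied after every observed transition \((s, a, r, s')\). A minimal sketch, again assuming a Q-table with states as rows and actions as columns:

# Sketch only: one-step Q-learning update for a single transition.
# The target bootstraps from the best action in s_prime (off-policy).
q_update_sketch <- function(Q, s, a, r, s_prime, alpha = 0.3, gamma = 1) {
  target <- r + gamma * max(Q[s_prime, ])
  Q[s, a] <- Q[s, a] + alpha * (target - Q[s, a])
  Q
}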

system.time(sol <- solve_MDP(maze, method ="q_learning", 
                             horizon = 1000, n = 100, alpha = 0.3))
   user  system elapsed 
  1.668   0.000   1.668 
sol
MDP, list - Maze
  Discount factor: 1
  Horizon: Inf epochs
  Size: 155 states / 4 actions
  Start: s(10,6)
  Solved:
    Method: 'q_learning'
    Solution converged: NA

  List components: 'name', 'discount', 'horizon', 'states', 'actions',
    'start', 'transition_prob', 'reward', 'info', 'absorbing_states',
    'solution'
gw_plot(sol)

Step-by-step solution using Q-learning. Q-learning follows an \(\epsilon\)-greedy policy where it takes the current best action, but with probability \(\epsilon\) it chooses a random action to perform exploration. Initially, the agent has no good policy and essentially performs a random walk.
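
A minimal sketch of \(\epsilon\)-greedy action selection, using the same states × actions Q-table as the update sketch above:

# Sketch only: epsilon-greedy action choice (explore with probability
# epsilon, otherwise take the currently best action for state s).
epsilon_greedy <- function(Q, s, epsilon = 0.2) {
  if (runif(1) < epsilon)
    sample(colnames(Q), 1)
  else
    colnames(Q)[which.max(Q[s, ])]
}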

I use a horizon of 1000 in the hope that the agent will randomly run into the goal. After some experimentation, I settled on an \(\epsilon\) of \(0.2\), which means every 5th action is chosen randomly, and a relatively high, fixed learning rate \(\alpha\) of \(0.3\).

sol <- gw_animate(maze, "q_learning", n = 25, zlim = c(-1,100), 
                  horizon = 1000, epsilon = 0.2, alpha = 0.3)

Model-free RL methods are typically horribly data inefficient, but they do not need a model of the environment and do not need to calculate state values for all states.