Solving a Maze using Reinforcement Learning

Author

Michael Hahsler

Introduction

This report demonstrates how value iteration (model-based RL) and Q-learning (model-free RL) solve a maze similar to the mazes we have used for tree search.

Install and Load Package markovDP

Install the current development version of [markovDP](https://mhahsler.r-universe.dev/markovDP).

if (!"markovDP" %in% rownames(installed.packages())) 
  install.packages('markovDP', repos = c(
    'https://mhahsler.r-universe.dev', 
    'https://cloud.r-project.org'))

library(markovDP)

Read a Maze

Mazes are so-called gridworlds, where the agent moves around in a 2D environment composed of tiles.
The markovDP package already comes with the mazes we use in class.

maze_dir <- system.file("mazes", package = "markovDP")
dir(maze_dir)
[1] "empty_2_maze.txt" "empty_maze.txt"   "L_maze.txt"       "large_maze.txt"  
[5] "loops_maze.txt"   "medium_maze.txt"  "open_maze.txt"    "small_maze.txt"  
[9] "wall_maze.txt"   
maze <- gw_read_maze(file.path(maze_dir, "L_maze.txt"))
maze
MDP, list - Maze
  Discount factor: 1
  Horizon: Inf epochs
  Size: 155 states / 4 actions
  Start: s(10,6)

  List components: 'name', 'discount', 'horizon', 'states', 'actions',
    'start', 'transition_prob', 'reward', 'info', 'absorbing_states'
gw_plot(maze)

The transition model is stored as a function.

maze$transition_prob
function(model, action, start.state) {
  P <- structure(numeric(length(S(model))), names = S(model))
  
  ai <- match(action, A(model))
  
  # stay in place for unknown actions
  if (is.na(ai)) {
    warning("Unknown action", action)
    P[start.state] <- 1
    return(P)
  }
  
  # absorbing states
  if (start.state %in% model$info$absorbing_states) {
    P[start.state] <- 1
    return(P)
  }
  
  # move
  rc <- gw_s2rc(start.state)
  rc <- switch(ai, rc + c(-1, 0), rc + c(0, +1), rc + c(+1, 0), rc + c(0, -1), )
  
  es <- gw_rc2s(rc)
  
  # stay in place if we would leave the gridworld
  if (!(es %in% S(model))) {
    es <- start.state
  }
  
  P[es] <- 1
  P
}
<bytecode: 0x65305f5f5188>
<environment: namespace:markovDP>
gw_plot_transition_graph(maze)
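
The stored function can also be called directly to inspect the dynamics. For example, the following call (using the action names that appear in the trajectories and policies below) returns a probability vector over all 155 states with a single 1 at the tile reached by moving right from the start state:

# Inspect the deterministic transition for moving right from the start state.
# The result is a named probability vector over all states (not shown here).
p <- maze$transition_prob(maze, "right", "s(10,6)")
p[p > 0]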

The reward model is stored as a table.

maze$reward
  action start.state end.state value
1   <NA>        <NA>      <NA>    -1
2   <NA>        <NA>   s(3,13)   100
3   <NA>     s(3,13)      <NA>     0

The reward model specifies a cost of 1 (a reward of -1) for each transition and a reward of +100 for reaching the goal state s(3,13); the third rule gives no further reward once the agent is in the goal state, which is absorbing. The best solution is therefore the shortest path to the goal.
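
The <NA> entries act as wildcards that match any action or state. Below is a minimal sketch of how such a rule table could be resolved for a single transition, assuming maze$reward is the data frame printed above and that later, more specific rows override the general step cost (markovDP performs this lookup internally):

# Sketch only: resolve the reward for one transition from the wildcard
# table above. Assumes the last matching row wins; the real lookup is
# done inside markovDP.
lookup_reward <- function(reward_df, action, start.state, end.state) {
  r <- 0
  for (i in seq_len(nrow(reward_df))) {
    row <- reward_df[i, ]
    if ((is.na(row$action)      || row$action == action) &&
        (is.na(row$start.state) || row$start.state == start.state) &&
        (is.na(row$end.state)   || row$end.state == end.state))
      r <- row$value   # keep the value of the last matching row
  }
  r
}

lookup_reward(maze$reward, "right", "s(10,6)", "s(10,7)")  # regular step: -1
lookup_reward(maze$reward, "up", "s(4,13)", "s(3,13)")     # reaching the goal: 100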

Remember that tree search starts at the start state and explores the search space until it finds the goal state.

Solve the Maze MDP as a Planning Agent

We solve the MDP directly using the problem description as a model for the environment. This approach would be used by a planning agent similar to the tree search algorithms.

Since the agent uses a complete model of the environment, this is a model-based reinforcement learning approach. Here we use value iteration. In each iteration, the algorithm sweeps through all states and updates each state value using the immediate reward plus the utility of the best state the agent could reach. It stops when the state values no longer change by more than a set error threshold.
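
The update in each sweep is the Bellman backup. Here is a minimal sketch using the generic accessors S() and A() and the transition function shown above; reward_fn(a, s, s_prime) stands for a hypothetical reward lookup such as the lookup_reward() sketch earlier, and solve_MDP() implements all of this much more efficiently inside the package:

# Sketch only: value iteration with Bellman backups.
# reward_fn(a, s, s2) is a hypothetical reward lookup (see the sketch above).
value_iteration_sketch <- function(model, reward_fn, error = 1e-6) {
  V <- structure(numeric(length(S(model))), names = S(model))
  repeat {
    delta <- 0
    for (s in S(model)) {
      # Q(s,a) = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
      Q <- sapply(A(model), function(a) {
        P <- model$transition_prob(model, a, s)
        R <- sapply(S(model), function(s2) reward_fn(a, s, s2))
        sum(P * (R + model$discount * V))
      })
      delta <- max(delta, abs(max(Q) - V[s]))
      V[s] <- max(Q)
    }
    if (delta < error) break   # values changed less than the error threshold
  }
  V
}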

system.time(sol <- solve_MDP(maze, method = "value_iteration"))
   user  system elapsed 
  0.077   0.001   0.082 
sol
MDP, list - Maze
  Discount factor: 1
  Horizon: Inf epochs
  Size: 155 states / 4 actions
  Start: s(10,6)
  Solved:
    Method: 'value_iteration'
    Solution converged: TRUE

  List components: 'name', 'discount', 'horizon', 'states', 'actions',
    'start', 'transition_prob', 'reward', 'info', 'absorbing_states',
    'solution'
gw_plot(sol)

gw_path(sol)
Warning in sample_MDP(model, n = 1, horizon = horizon, epsilon = 0, trajectories = TRUE): Simulation for undiscounted problems need a finite simulation horizon.
Using a maximum horizon of |S| x |A| = 620 to avoid infinite loops.
$path
   episode time        s     a   r  s_prime
1        1    0  s(10,6) right  -1  s(10,7)
2        1    1  s(10,7)  down  -1  s(11,7)
3        1    2  s(11,7)  down  -1  s(12,7)
4        1    3  s(12,7) right  -1  s(12,8)
5        1    4  s(12,8) right  -1  s(12,9)
6        1    5  s(12,9) right  -1 s(12,10)
7        1    6 s(12,10) right  -1 s(12,11)
8        1    7 s(12,11) right  -1 s(12,12)
9        1    8 s(12,12) right  -1 s(12,13)
10       1    9 s(12,13)    up  -1 s(11,13)
11       1   10 s(11,13)    up  -1 s(10,13)
12       1   11 s(10,13)    up  -1  s(9,13)
13       1   12  s(9,13)    up  -1  s(8,13)
14       1   13  s(8,13)    up  -1  s(7,13)
15       1   14  s(7,13)    up  -1  s(6,13)
16       1   15  s(6,13)    up  -1  s(5,13)
17       1   16  s(5,13)    up  -1  s(4,13)
18       1   17  s(4,13)    up 100  s(3,13)

$reward
[1] 83

$solved
[1] TRUE

The color represents the state value. Here are the learned state values and the optimal action for each state (policy).

policy(sol)
      states   V action
1     s(2,2)  89   down
2     s(3,2)  90  right
3     s(4,2)  89     up
4     s(5,2)  88     up
5     s(6,2)  87  right
6     s(7,2)  86     up
7     s(8,2)  85     up
8     s(9,2)  84  right
9    s(10,2)  83     up
10   s(11,2)  82     up
11   s(12,2)  81  right
12   s(13,2)  80     up
13   s(14,2)  79     up
14    s(2,3)  90   down
15    s(3,3)  91  right
16    s(4,3)  90  right
17    s(5,3)  89     up
18    s(6,3)  88     up
19    s(7,3)  87     up
20    s(8,3)  86     up
21    s(9,3)  85     up
22   s(10,3)  84     up
23   s(11,3)  83     up
24   s(12,3)  82  right
25   s(13,3)  81  right
26   s(14,3)  80     up
27    s(2,4)  91   down
28    s(3,4)  92  right
29    s(4,4)  91  right
30    s(6,4)  87   left
31    s(7,4)  86   left
32    s(8,4)  85   left
33    s(9,4)  84   left
34   s(10,4)  83   left
35   s(11,4)  82   down
36   s(12,4)  83  right
37   s(13,4)  82     up
38   s(14,4)  81  right
39    s(2,5)  92  right
40    s(3,5)  93  right
41    s(4,5)  92  right
42    s(6,5)  86   left
43    s(7,5)  85     up
44    s(8,5)  84   left
45    s(9,5)  83     up
46   s(10,5)  82     up
47   s(11,5)  83  right
48   s(12,5)  84  right
49   s(13,5)  83  right
50   s(14,5)  82     up
51    s(2,6)  93   down
52    s(3,6)  94  right
53    s(4,6)  93     up
54    s(6,6)  85   left
55    s(7,6)  84     up
56    s(8,6)  83   left
57    s(9,6)  82   down
58   s(10,6)  83  right
59   s(11,6)  84  right
60   s(12,6)  85  right
61   s(13,6)  84  right
62   s(14,6)  83  right
63    s(2,7)  94  right
64    s(3,7)  95  right
65    s(4,7)  94  right
66    s(6,7)  84   left
67    s(7,7)  83     up
68    s(8,7)  82   left
69    s(9,7)  83   down
70   s(10,7)  84   down
71   s(11,7)  85   down
72   s(12,7)  86  right
73   s(13,7)  85  right
74   s(14,7)  84  right
75    s(2,8)  95   down
76    s(3,8)  96  right
77    s(4,8)  95  right
78    s(6,8)  83   left
79    s(7,8)  82  right
80    s(8,8)  83  right
81    s(9,8)  84   down
82   s(10,8)  85   down
83   s(11,8)  86   down
84   s(12,8)  87  right
85   s(13,8)  86  right
86   s(14,8)  85  right
87    s(2,9)  96   down
88    s(3,9)  97  right
89    s(4,9)  96     up
90    s(6,9)  82  right
91    s(7,9)  83   down
92    s(8,9)  84  right
93    s(9,9)  85   down
94   s(10,9)  86   down
95   s(11,9)  87   down
96   s(12,9)  88  right
97   s(13,9)  87     up
98   s(14,9)  86     up
99   s(2,10)  97  right
100  s(3,10)  98  right
101  s(4,10)  97     up
102  s(6,10)  83   down
103  s(7,10)  84   down
104  s(8,10)  85   down
105  s(9,10)  86   down
106 s(10,10)  87   down
107 s(11,10)  88   down
108 s(12,10)  89  right
109 s(13,10)  88  right
110 s(14,10)  87  right
111  s(2,11)  98   down
112  s(3,11)  99  right
113  s(4,11)  98     up
114 s(12,11)  90  right
115 s(13,11)  89  right
116 s(14,11)  88  right
117  s(2,12)  99  right
118  s(3,12) 100  right
119  s(4,12)  99  right
120  s(5,12)  98     up
121  s(6,12)  97     up
122  s(7,12)  96  right
123  s(8,12)  95  right
124  s(9,12)  94     up
125 s(10,12)  93  right
126 s(11,12)  92     up
127 s(12,12)  91  right
128 s(13,12)  90     up
129 s(14,12)  89  right
130  s(2,13) 100   down
131  s(3,13)   0     up
132  s(4,13) 100     up
133  s(5,13)  99     up
134  s(6,13)  98     up
135  s(7,13)  97     up
136  s(8,13)  96     up
137  s(9,13)  95     up
138 s(10,13)  94     up
139 s(11,13)  93     up
140 s(12,13)  92     up
141 s(13,13)  91     up
142 s(14,13)  90     up
143  s(2,14)  99   left
144  s(3,14) 100   left
145  s(4,14)  99   left
146  s(5,14)  98     up
147  s(6,14)  97     up
148  s(7,14)  96     up
149  s(8,14)  95     up
150  s(9,14)  94     up
151 s(10,14)  93     up
152 s(11,14)  92   left
153 s(12,14)  91   left
154 s(13,14)  90   left
155 s(14,14)  89   left

Solve the maze step-by-step.

gw_animate(maze, "value_iteration", n = 25, zlim = c(-1,100))

Value iteration looks a lot like breadth-first search starting at the goal, or like a flood-fill algorithm. The goal is to find the optimal state value for each state.

The space complexity of \(O(|S|)\) is problematic because the number of states is typically very large.

Reinforcement Learning for Unknown Mazes

An unknown environment means that we do not have a model of the environment. The agent does not know the transition function or the reward structure and has to learn a good policy and at least parts of the transition and reward models.

Here we assume that the agent has no direct access to the MDP. The agent can only try actions, and the environment uses the MDP to simulate how it reacts. These methods are called model-free reinforcement learning approaches.
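
The environment's side of this interaction can be sketched as a single step function that uses the MDP internally, while the agent only ever sees the resulting next state and reward. lookup_reward() refers to the hypothetical helper sketched earlier:

# Sketch only: the environment simulates one step for the agent.
# The agent never sees the MDP, only the returned s_prime and r.
env_step <- function(model, s, a) {
  P <- model$transition_prob(model, a, s)     # the environment knows the model
  s_prime <- sample(S(model), 1, prob = P)    # draw the successor state
  r <- lookup_reward(model$reward, a, s, s_prime)
  list(s_prime = s_prime, r = r)
}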

Monte Carlo Control

On-policy Monte Carlo control uses an \(\epsilon\)-greedy policy to explore the maze. It is called on-policy because it learns the value of the same \(\epsilon\)-greedy policy that it uses for exploration.

Monte Carlo methods create one episode (a search from the start to the goal) at a time and then update the state values in reverse order using the learning parameter \(\alpha\) (a sketch of this update follows the sample episode below). This is what such an episode looks like:

set.seed(1111)
sample_MDP(maze, n = 1, horizon = 50, trajectories = TRUE)$trajectories
   episode time       s     a  r s_prime
1        1    0 s(10,6) right -1 s(10,7)
2        1    1 s(10,7)    up -1  s(9,7)
3        1    2  s(9,7) right -1  s(9,8)
4        1    3  s(9,8)  left -1  s(9,7)
5        1    4  s(9,7)    up -1  s(8,7)
6        1    5  s(8,7)  left -1  s(8,6)
7        1    6  s(8,6)  down -1  s(9,6)
8        1    7  s(9,6)    up -1  s(8,6)
9        1    8  s(8,6)    up -1  s(7,6)
10       1    9  s(7,6) right -1  s(7,7)
11       1   10  s(7,7) right -1  s(7,8)
12       1   11  s(7,8)  left -1  s(7,7)
13       1   12  s(7,7) right -1  s(7,8)
14       1   13  s(7,8)  left -1  s(7,7)
15       1   14  s(7,7)  down -1  s(8,7)
16       1   15  s(8,7)  left -1  s(8,6)
17       1   16  s(8,6)    up -1  s(7,6)
18       1   17  s(7,6)    up -1  s(6,6)
19       1   18  s(6,6) right -1  s(6,7)
20       1   19  s(6,7) right -1  s(6,8)
21       1   20  s(6,8)  left -1  s(6,7)
22       1   21  s(6,7)  left -1  s(6,6)
23       1   22  s(6,6)    up -1  s(6,6)
24       1   23  s(6,6) right -1  s(6,7)
25       1   24  s(6,7) right -1  s(6,8)
26       1   25  s(6,8)  left -1  s(6,7)
27       1   26  s(6,7)  down -1  s(7,7)
28       1   27  s(7,7)  left -1  s(7,6)
29       1   28  s(7,6) right -1  s(7,7)
30       1   29  s(7,7)    up -1  s(6,7)
31       1   30  s(6,7)  left -1  s(6,6)
32       1   31  s(6,6) right -1  s(6,7)
33       1   32  s(6,7)  down -1  s(7,7)
34       1   33  s(7,7)  left -1  s(7,6)
35       1   34  s(7,6)  down -1  s(8,6)
36       1   35  s(8,6)    up -1  s(7,6)
37       1   36  s(7,6)  down -1  s(8,6)
38       1   37  s(8,6)  down -1  s(9,6)
39       1   38  s(9,6)  left -1  s(9,5)
40       1   39  s(9,5)    up -1  s(8,5)
41       1   40  s(8,5) right -1  s(8,6)
42       1   41  s(8,6)  down -1  s(9,6)
43       1   42  s(9,6)  down -1 s(10,6)
44       1   43 s(10,6)  down -1 s(11,6)
45       1   44 s(11,6)  left -1 s(11,5)
46       1   45 s(11,5)  left -1 s(11,4)
47       1   46 s(11,4)  left -1 s(11,3)
48       1   47 s(11,3)  down -1 s(12,3)
49       1   48 s(12,3)    up -1 s(11,3)
50       1   49 s(11,3)  left -1 s(11,2)

After 50 steps, the random walk still has not found the goal, which is a problem. This is why I use a much larger horizon below.
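
Once an episode ends (or the horizon is reached), the state-action values are updated backwards through the trajectory. Here is a minimal sketch of the every-visit, constant-\(\alpha\) update, assuming a trajectory data frame like the one above and a Q-table with one row per state and one column per action; the implementation inside solve_MDP() may differ in details:

# Sketch only: every-visit, constant-alpha Monte Carlo update for one
# episode (a trajectory data frame as shown above).
# Q <- matrix(0, length(S(maze)), length(A(maze)),
#             dimnames = list(S(maze), A(maze)))
mc_update_sketch <- function(Q, episode, alpha = 0.3, gamma = 1) {
  G <- 0
  for (i in rev(seq_len(nrow(episode)))) {       # walk the episode backwards
    s <- as.character(episode$s[i])
    a <- as.character(episode$a[i])
    G <- gamma * G + episode$r[i]                # return following step i
    Q[s, a] <- Q[s, a] + alpha * (G - Q[s, a])   # move Q(s,a) toward the return
  }
  Q
}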

system.time(sol <- solve_MDP(maze, method = "MC_on_policy", 
                 horizon = 1000, n = 100, epsilon =.2, alpha = 0.3))
   user  system elapsed 
  1.486   0.005   1.491 
gw_plot(sol)

We see that the agent explores a large part of the maze but then concentrates on the direct path to the goal.

Step-by-step solution.

sol <- gw_animate(maze, "MC_on_policy", n = 25, 
                  horizon = 1000, epsilon = 0.2, alpha = 0.3, first_visit = FALSE)

Initially, the random exploration does not lead to the goal within the horizon, but eventually it finds a path. The behavior can be compared to

  • repeated depth-first search with a depth limit,
  • or Monte Carlo tree search for game trees, where the utility is propagated from the end of the game to the current state.

Q-Learning

Q-learning is a popular model-free RL method that does not wait for complete episodes but updates its estimates one step at a time.
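
The core of the method is the one-step temporal-difference update applied after every observed transition \((s, a, r, s')\). A minimal sketch, again assuming a Q-table with states as rows and actions as columns:

# Sketch only: one-step Q-learning update for a single transition.
# The target bootstraps from the best action in s_prime (off-policy).
q_update_sketch <- function(Q, s, a, r, s_prime, alpha = 0.3, gamma = 1) {
  target <- r + gamma * max(Q[s_prime, ])
  Q[s, a] <- Q[s, a] + alpha * (target - Q[s, a])
  Q
}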

system.time(sol <- solve_MDP(maze, method ="q_learning", 
                             horizon = 1000, n = 100, alpha = 0.3))
   user  system elapsed 
  1.668   0.000   1.668 
sol
MDP, list - Maze
  Discount factor: 1
  Horizon: Inf epochs
  Size: 155 states / 4 actions
  Start: s(10,6)
  Solved:
    Method: 'q_learning'
    Solution converged: NA

  List components: 'name', 'discount', 'horizon', 'states', 'actions',
    'start', 'transition_prob', 'reward', 'info', 'absorbing_states',
    'solution'
gw_plot(sol)

Step-by-step solution using Q-learning. Q-learning follows an \(\epsilon\)-greedy policy where it takes the current best action, but with probability \(\epsilon\) it chooses a random action to perform exploration. Initially, the agent has no good policy and essentially performs a random walk.
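
A minimal sketch of \(\epsilon\)-greedy action selection, using the same states × actions Q-table as the update sketch above:

# Sketch only: epsilon-greedy action choice (explore with probability
# epsilon, otherwise take the currently best action for state s).
epsilon_greedy <- function(Q, s, epsilon = 0.2) {
  if (runif(1) < epsilon)
    sample(colnames(Q), 1)
  else
    colnames(Q)[which.max(Q[s, ])]
}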

I use a horizon of 1000 in the hope that the agent will randomly run into the goal. After some experimentation, I settled on an \(\epsilon\) of \(0.2\), which means every 5th action is chosen randomly, and a relatively high, fixed learning rate \(\alpha\) of \(0.3\).

sol <- gw_animate(maze, "q_learning", n = 25, zlim = c(-1,100), 
                  horizon = 1000, epsilon = 0.2, alpha = 0.3)

Model-free RL methods are typically horribly data inefficient, but they do not need a model of the environment and do not need to calculate state values for all states.