
Solving the Room Maze With a Tabular Q(0) RL Method

In this post we provide a Python solution for the room maze defined by mnemstudio.org. The solution is based solely on Python and numpy, using a tabular Q(0) reinforcement learning algorithm, also known as Q-learning.

Problem description

The link above already provides a good outline of the problem, so let us keep things short:

Visual representation of the room maze (image source: mnemstudio.org)

The maze consists of six rooms; some of them are connected by doors, others have no connection. The agent is placed randomly in one of the rooms and has to find its way until it reaches room number 5, which is defined as outside. The other rooms are labeled 0 - 4. The maze provides two ways to leave the building, either through room 1 or through room 4; both of them lead to the exit.

Environment

Given the room maze, a reward matrix $R$ can be derived. It contains the reward for a given state (row), i.e. being in room $S_i$ where $i$ is the room number, and a chosen action (column) $A_j$ that lets the agent go to room $j$, where $j$ is the new room number. The reward is then given by $R(i, j)$.

      A0  A1  A2  A3  A4  A5
S0 [[ -1  -1  -1  -1   0  -1]
S1  [ -1  -1  -1   0  -1 100]
S2  [ -1  -1  -1   0  -1  -1]
S3  [ -1   0   0  -1   0  -1]
S4  [  0  -1  -1   0  -1 100]
S5  [ -1   0  -1  -1   0 100]]
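For reference, this reward matrix can be written down directly as a numpy array. The following is a sketch, not the exact code from the notebook; the name R is simply our choice here.

import numpy as np

# Reward matrix: rows are states S0..S5, columns are actions A0..A5.
# -1 marks an impossible move, 0 a regular move, 100 a move into room 5 (outside).
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])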

For example, say we are in room 3 ($S_3$) and the agent chooses to go to room 1 ($A_1$). The reward of this move is given by position $(3, 1)$ in the reward matrix. It is $0$, because room 1 is not the terminal state. Also notice that the environment tells us which actions are possible in a given state: possible actions are those columns whose value is larger than $-1$. Moves into the terminal state have a reward of $100$, whereas all other moves have a reward of $0$.
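This example and the rule for possible actions translate into a few lines of numpy. Again a sketch building on the R array above; possible_actions is a helper name we introduce here, not something from the original notebook.

def possible_actions(state):
    # All columns with a reward larger than -1 are valid moves from this state.
    return np.where(R[state] > -1)[0]

print(R[3, 1])              # reward for moving from room 3 to room 1: 0
print(possible_actions(3))  # rooms reachable from room 3: [1 2 4]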

Note that $R(5, 5)$ is defined and yields a reward of $100$, whereas $R(5, 1)$ and $R(5, 4)$ are defined as $0$. Since we explicitly allow the agent to leave the terminal state if it happens to start in room 5, we also give it the opportunity to stay in room 5, which is different from all other rooms. One could also rewrite the problem so that the agent never starts in a terminal state, which is how such problems are often defined elsewhere.

Problem classification

The problem is discrete and episodic. It satisfies the Markov property, because the future depends only on the current state and action, but not on the past.

Algorithm

The following pseudocode has been taken from Sutton and Barto (2017). Its inner loops have been adjusted to run at least once, even if the initial state is a terminal state.

SARSA

This pseudocode describes the SARSA(0) (state, action, reward, (next) state, (next) action) learning method.

Output: action value function $Q$
initialize $Q$ arbitrarily, e.g. to 0 for all states, set action value for terminal states as 0
for each episode do
....initialize state $s$
....$a \leftarrow$ action for $s$ derived by $Q$ ($\epsilon$-greedy)
....forever do
........take action $a$, observe $r, s'$
........$a' \leftarrow$ action for $s'$ derived by $Q$ ($\epsilon$-greedy)
........$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s', a') - Q(s,a) \right]$
........$s \leftarrow s'$, $a \leftarrow a'$
........if state $s$ is terminal state do
............break
........end
....end
end
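A minimal numpy sketch of this SARSA(0) loop for the room maze could look as follows. It reuses the R matrix and the possible_actions helper introduced above; the default hyperparameter values are placeholders, not the ones used in the notebook.

def epsilon_greedy(Q, state, epsilon):
    # With probability epsilon explore a random valid action,
    # otherwise exploit the current Q estimates.
    actions = possible_actions(state)
    if np.random.rand() < epsilon:
        return np.random.choice(actions)
    return actions[np.argmax(Q[state, actions])]

def sarsa(episodes=500, alpha=0.1, gamma=0.8, epsilon=0.1, terminal_state=5):
    Q = np.zeros_like(R, dtype=float)
    for _ in range(episodes):
        s = np.random.randint(R.shape[0])    # random start room
        a = epsilon_greedy(Q, s, epsilon)
        while True:                          # runs at least once, even if s is terminal
            r, s_next = R[s, a], a           # taking action a moves the agent into room a
            a_next = epsilon_greedy(Q, s_next, epsilon)
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
            if s == terminal_state:
                break
    return Q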

Tabular Q(0) Method

Output: action value function $Q$
initialize $Q$ arbitrarily, e.g. to 0 for all states, set action value for terminal states as 0
for each episode do
....initialize state $s$
....forever do
........$a \leftarrow$ action for $s$ derived by $Q$ ($\epsilon$-greedy)
........take action $a$, observe $r, s'$
........$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right]$
........$s \leftarrow s'$
........if state $s$ is terminal state do
............break
........end
....end
end
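The corresponding Q(0) sketch differs only in the update, which bootstraps from the best possible action in the next room instead of the action actually chosen next. Same assumptions as for the SARSA sketch above.

def q_learning(episodes=500, alpha=0.1, gamma=0.8, epsilon=0.1, terminal_state=5):
    Q = np.zeros_like(R, dtype=float)
    for _ in range(episodes):
        s = np.random.randint(R.shape[0])    # random start room
        while True:                          # runs at least once, even if s is terminal
            a = epsilon_greedy(Q, s, epsilon)
            r, s_next = R[s, a], a           # next room equals the chosen action
            # Off-policy target: best value over the actions possible in the next room.
            best_next = Q[s_next, possible_actions(s_next)].max()
            Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
            s = s_next
            if s == terminal_state:
                break
    return Q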

Python implementation

The full solution is too lengthy to show here; instead, we provide it as a Jupyter notebook.
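To give a rough idea of how the pieces above fit together, training and normalizing the table could look like this; the parameter values are again placeholders rather than the ones from the notebook.

Q = q_learning(episodes=1000, alpha=0.8, gamma=0.8, epsilon=0.1)
Q_normalized = np.round(Q / Q.max(), 2)   # scale to [0, 1] for easier reading
print(Q_normalized)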

Results

Both algorithms converge and are able to solve the maze. This is what the final Q table (normalized) looks like:

      A0    A1    A2    A3    A4    A5
S0 [[ 0.    0.    0.    0.    0.8   0.  ]
S1  [ 0.    0.    0.    0.64  0.    1.  ]
S2  [ 0.    0.    0.    0.64  0.    0.  ]
S3  [ 0.    0.8   0.51  0.    0.8   0.  ]
S4  [ 0.64  0.    0.    0.64  0.    1.  ]
S5  [ 0.    0.8   0.    0.    0.8   1.  ]]

Using this table as the policy for the agent, we can place the agent somewhere in the maze, say room 2, and just follow the rules in the table. From room 2 the agent chooses the best available action, which is $A_3 = 0.64 \rightarrow S_3$. From there it can choose either $A_1$ or $A_4$, which have identical Q values. Assume it picks $A_1 = 0.8 \rightarrow S_1$. From $S_1$ the best action is $A_5 = 1.0$. This brings the agent to room 5, which is the terminal room and finishes the episode.

So the full chain is then $S_2 \rightarrow A_3 \rightarrow S_3 \rightarrow A_1 \rightarrow S_1 \rightarrow A_5 \rightarrow S_5$.
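Following the greedy policy programmatically yields the same chain. The greedy_path helper below is our own illustration, not part of the notebook; ties such as $A_1$ vs. $A_4$ from room 3 are broken by np.argmax, which picks the first maximum.

def greedy_path(Q, start, terminal_state=5, max_steps=10):
    # Walk through the maze by always taking the highest-valued possible action.
    path, s = [start], start
    for _ in range(max_steps):
        if s == terminal_state:
            break
        actions = possible_actions(s)
        s = actions[np.argmax(Q[s, actions])]
        path.append(s)
    return path

print(greedy_path(Q, start=2))   # with the table above: [2, 3, 1, 5]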

Problem solved 😊