A Graph-Based Reinforcement Learning Method with Converged State Exploration and Exploitation

In any classical value-based reinforcement learning method, an agent, despite its continuous interactions with the environment, is unable to quickly generate a complete and independent description of the entire environment, leaving the learning method to struggle with the difficult dilemma of choosing between two tasks, namely exploration and exploitation. This problem becomes more pronounced when the agent has to deal with a dynamic environment, whose configuration and/or parameters are constantly changing. In this paper, this problem is approached by first mapping a reinforcement learning scheme to a directed graph, so that the set containing all the states already explored can continue to be exploited in the context of such a graph. We prove that the two tasks of exploration and exploitation eventually converge in the decision-making process, and thus there is no need to face the exploration vs. exploitation tradeoff as all existing reinforcement learning methods do. Rather, this observation indicates that a reinforcement learning scheme is essentially the same as searching for the shortest path in a dynamic environment, which is readily tackled by the modified Floyd-Warshall algorithm proposed in this paper. The experimental results confirm that the proposed graph-based reinforcement learning algorithm achieves significantly higher performance than both the standard Q-learning algorithm and an improved Q-learning algorithm in solving mazes, rendering it an algorithm of choice in applications involving dynamic environments.


Introduction
Reinforcement Learning (RL) has found great use in many practical applications, ranging from mobile robotics [Mataric (1997); Smart and Kaelbling (2002); Huang, Cao and Guo (2005)] and adaptive control [Sutton, Barto and Williams (1992); Lewis, Vrabie and Vamvoudakis (2012); Lewis and Vrabie (2009)] to AI-backed chess playing [Silver, Hubert, Schrittwieser et al. (2017); Silver, Schrittwieser, Simonyan et al. (2017); Silver, Huang, Maddison et al. (2016)], among many others. The idea behind reinforcement learning, as illustrated in Fig. 1, is that an agent learns from the environment by interacting with it, receiving positive or negative rewards for the actions it takes, and the cycle is repeated. The key issue of the whole process is to learn a way of controlling the system so as to maximize the total reward. When the agent begins to sense and learn a completely or partially unknown environment, it engages in two distinct tasks: exploration, which attempts to collect as much information about the environment as possible, and exploitation, which attempts to receive positive rewards as quickly as possible.
Figure 1: In reinforcement learning, the agent observes the environment, takes an action to interact with the environment, updates its own state, and receives a reward

There is a dilemma in choosing between the two tasks of exploration and exploitation, though. Too much exploration adversely influences the efficiency and convergence of the learning algorithm, while putting too much emphasis on exploitation increases the possibility of falling into a locally optimal solution. The existing RL algorithms all attempt to balance these two tasks in their learning cycles (Fig. 1), but there is no guarantee that the best result can always be obtained. Besides the exploration-exploitation dilemma, RL algorithms have to employ value distributions that implicitly assume the environment is static (i.e., unchanging), or that it changes very slowly and/or insignificantly. However, in many real applications, the environment rarely stays unchanged. More than likely, the environment, described in terms of states (Fig. 1), changes over the course of exploration. In this case, the value distribution has little to do with the problem at hand, and all the information obtained from previous exploration efforts becomes less relevant, or totally irrelevant. To effectively solve the aforementioned problems in reinforcement learning, we herein present a new algorithm based on the partitioning of the state set and the search for the shortest path in a directed graph that represents an RL method. We formally prove and experimentally verify that exploration and exploitation in reinforcement learning actually converge at the end of the decision-making process, and thus the learning process does not need to face the exploration/exploitation dilemma as other existing reinforcement learning methods do. This observation indicates that a reinforcement learning scheme is essentially the same as searching for the shortest path in a dynamic environment, which is readily tackled by the modified Floyd-Warshall algorithm proposed in this paper. The experiment that applies the proposed algorithm to solve mazes confirms the better performance of the new algorithm, particularly its effectiveness in addressing issues pertaining to a dynamic environment.

Preliminaries and background
In this section, we first survey the basic structure of reinforcement learning (RL) algorithms, particularly the value-oriented methods of RL, and formally define the exploration vs. exploitation tradeoff in RL. In the literature, RL has been mapped to various graph representations, and these methods are briefly described in this section as well. With graph representations, RL can benefit from the rich results in graph algorithms, and we therefore finish this section by reviewing algorithms that search for the shortest path in a graph, as they are related to this paper.

Value-oriented method for the exploration-exploitation tradeoff in RL
Most RL problems can be formalized using Markov Decision Processes (MDPs), and the key elements in RL are defined below.

1. Agent: an agent takes actions.
2. Environment: the physical world in which the agent operates.
3. State: a state is a concrete and immediate situation in which the agent finds itself. In this paper, we denote $stt_i$ as the state of the agent at time instance $i$, and the set $S$ contains all the states that the agent can operate on. That is, $stt_i \in S$.
4. Action: the agent chooses among a list of possible actions. Denote $act_i$ as the action that the agent might perform at time instance $i$. $A$ is defined as the set of all possible moves the agent can make, i.e., $act_i \in A$.
5. Reward: a reward is the feedback used to measure the success or failure of an agent's action. Here the reward at time instance $i$ is denoted $r_i$. Actions may affect both the immediate reward and, through the next situation, all the subsequent rewards [Sutton and Barto (2017)].
6. Exploitation: a task that makes the best decision given all the current information.
7. Exploration: a task that gathers more information to be used for making the best decision in the future.
8. Episode: the behavior cycle of the agent from the beginning of one exploration to the beginning of the next. The interaction between the agent and the environment breaks naturally into subsequences, which are referred to as episodes. Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states [Sutton and Barto (2017)].

In RL, the exploration-exploitation tradeoff refers to a decision-making process that chooses between exploration and exploitation. Value-oriented RL methods have to deal with this tradeoff through the value distribution defined by the value function, or through a probability that decides the chain of actions leading from the start state to the target state through a series of rewards. A decision chain refers to a series of decision-making steps taken by the agent. In order to strike a balance between exploration and exploitation, there are two main decision methods, ϵ-greedy and softmax. In the ϵ-greedy method, the action is selected by

$$act_i = \begin{cases} act^{*} & \text{with probability } 1-\epsilon \\ \text{a random action in } A & \text{with probability } \epsilon \end{cases} \tag{1}$$

where $act^{*}$ is the action at which the value function assumes the highest value:

$$act^{*} = \arg\max_{act} Q(stt, act) \tag{2}$$

where $Q(stt, act)$ is the action-value function which evaluates each possible action in the current state. One drawback of ϵ-greedy action selection is that when it explores, all possible actions are given equal opportunity, as indicated in Eq. (1). In simple terms, this method is as likely to choose the worst-appearing action as it is to choose the next-to-best action. This gives rise to the so-called softmax method that varies the action probabilities through a graded function:

$$\pi(act \mid stt) = \frac{e^{Q(stt, act)/\tau}}{\sum_{act' \in A} e^{Q(stt, act')/\tau}} \tag{3}$$

where $\pi(act \mid stt)$ is the probability of choosing action $act$ in state $stt$, $\tau$ is a "computational" temperature, and $Q$ is the action-value function that evaluates each possible action in the current state. The problem with the value-oriented method is its weak ability to eliminate exploration blindness resulting from the large number of repeatedly explored states introduced by the value distribution structure. The stochastic factors that are added to help the search process jump out of loops and balance exploration and exploitation actually come at the expense of more blindness in exploration.
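As an illustration, a minimal sketch of the two selection rules is given below in Python; the names `q_values`, `epsilon`, and `tau` are illustrative, not notation from the paper.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """Eq. (1)-(2): exploit the greedy action with probability 1 - epsilon,
    otherwise pick any action uniformly at random."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: every action equally likely
    return int(np.argmax(q_values))               # exploit: act* = argmax_act Q(stt, act)

def softmax(q_values, tau=1.0, rng=None):
    """Eq. (3): sample an action from the Boltzmann distribution with temperature tau."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()                          # shift for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))
```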

RL over graphs
An RL scheme can be represented as a directed graph G<V, E>, where a vertex $v_i \in V(G)$ corresponds to a state $stt_i$ in reinforcement learning, and an edge $e_{ij} \in E(G)$ connects two vertices (two states $stt_i$ and $stt_j$) through a decision action $act_i$ in reinforcement learning. In this graph, a path can be regarded as a decision sequence in reinforcement learning. In the literature, many RL methods are related to graph representations. In the PartiGame algorithm [Moore (1994)], the environment of RL is divided into cells modeled by a kd-tree, and in each cell the available actions consist of aiming at the neighboring cells [Kaelbling, Littman and Moore (1996)]. In Dayan et al. [Dayan and Hinton (1993)], a speedup of reinforcement learning is achieved by creating a Q-learning managerial hierarchy in which high-level managers learn how to set tasks for their lower-level managers. The hierarchical Q-learning algorithm in Dietterich [Dietterich (1998)] is proved to converge and is shown experimentally to learn much faster than ordinary "flat" Q-learning. None of these methods, however, solves the root problem concerning the dilemma of exploration and exploitation.
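A minimal sketch of such a graph representation is shown below, assuming states and actions are hashable Python values; the class name `RLGraph` is ours, not the paper's.

```python
from collections import defaultdict

class RLGraph:
    """Directed graph G<V, E>: vertices are states, and a directed edge labelled
    with an action connects a state to the state that action led to."""
    def __init__(self):
        self.edges = defaultdict(dict)             # edges[state][action] = next_state

    def add_transition(self, state, action, next_state):
        self.edges[state][action] = next_state     # record one explored decision

    def successors(self, state):
        return set(self.edges[state].values())     # successor states of `state`

    def vertices(self):
        verts = set(self.edges)
        for nxt in self.edges.values():
            verts |= set(nxt.values())
        return verts
```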

Floyd-Warshall algorithm
Denote SSA(G, $v_i$, $v_j$) as a shortest path search algorithm applied to a graph G that represents an RL scheme, from vertex $v_i$ to vertex $v_j$. The classical shortest path algorithms such as Dijkstra [Dijkstra (1959)] and A* [Hart, Nilsson and Raphael (1968)] are single-source path-finding algorithms. The Floyd-Warshall algorithm [Floyd (1962)] (FW), which is adopted in this study, provides the shortest path between any two vertices of a given graph, and it is found to be adaptive to changes of the graph. In the standard Floyd-Warshall algorithm, two matrices (DIST and NEXT) are used to express the information of all the shortest paths in the graph. The matrix DIST records the shortest path length between two vertices. The matrix NEXT contains the name of an intermediate vertex through which the two vertices are connected along the shortest path. Because of the optimal substructure property of the shortest path, no matter how many intermediate vertices the shortest path passes through, simply recording one of the intermediate vertices is sufficient to express the entire shortest path.
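For reference, a compact sketch of the standard algorithm follows. It uses the common variant in which NEXT stores the first hop of each shortest path (the paper describes recording an intermediate vertex, which carries the same information), and `weight` is an assumed dict mapping directed edges to lengths.

```python
import math

def floyd_warshall(vertices, weight):
    """Standard all-pairs shortest paths; weight[(u, v)] is the edge length, absent if no edge."""
    DIST = {u: {v: (0 if u == v else weight.get((u, v), math.inf)) for v in vertices} for u in vertices}
    NEXT = {u: {v: (v if (u, v) in weight else None) for v in vertices} for u in vertices}
    for k in vertices:                         # allow k as an intermediate (relay) vertex
        for i in vertices:
            for j in vertices:
                if DIST[i][k] + DIST[k][j] < DIST[i][j]:
                    DIST[i][j] = DIST[i][k] + DIST[k][j]
                    NEXT[i][j] = NEXT[i][k]    # first hop of the improved path i -> ... -> j
    return DIST, NEXT

def reconstruct_path(NEXT, u, v):
    """Follow NEXT pointers to rebuild the shortest path from u to v."""
    if u == v:
        return [u]
    if NEXT[u][v] is None:
        return []                              # v is unreachable from u
    path = [u]
    while u != v:
        u = NEXT[u][v]
        path.append(u)
    return path
```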

Convergence of exploration and exploitation
In this section, we first define the completely explored graph, which serves as the foundation for a graph-based iterative framework for reinforcement learning. Under this framework, the knowledge acquired from the RL exploration task is recorded in the graph, and the shortest path search can then be conducted to determine the next decision chain. This new approach is able to track the graph changes that are caused by exploration and, sometimes, by the changing environment. In this section, we shall prove that with this framework involving the shortest path search, exploration actually converges to exploitation. In simple terms, exploration will find the shortest path to reach the same reward as exploitation does.

Completely explored graph
Definition 1. A Completely Explored State is a state of which all possible successor states have already been explored. If at least one of a state's successor states has been explored and at least one has not yet been explored, the state is called a Partially Explored State.

Definition 2. If the vertex set V of a connected graph G<V, E> includes the start states of the episodes and all these states have been completely explored, and the edge set E represents all the actions that need to be taken to connect all the different states, graph G is called a Completely Explored Graph (CEG).

Fig. 2 shows an example of a CEG where each state ($stt_{ij}$) is linked with up to 4 possible actions: $act_0$, $act_1$, $act_2$, $act_3$. Some explored edges are omitted for simplicity, such as action $act_1$ for $stt_{20}$ and action $act_0$ for $stt_{01}$; they point to nonexistent state transitions. If we denote the environment feedback function by Env, then for a given action $act_i$, the next state $stt_{i+1}$ can be determined as $stt_{i+1} = Env(stt_i, act_i)$.
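The following small sketch illustrates Definition 1 together with the Env feedback; `env_feedback` and `explored_transitions` are illustrative names assumed here, not part of the paper.

```python
def explore_step(env_feedback, stt, act, explored_transitions):
    """One interaction with the environment: stt_next = Env(stt, act), recorded for later use."""
    stt_next = env_feedback(stt, act)
    explored_transitions[(stt, act)] = stt_next
    return stt_next

def is_completely_explored(stt, actions, explored_transitions):
    """Definition 1: a state is completely explored once every possible action
    from it has been tried and its resulting state recorded."""
    return all((stt, act) in explored_transitions for act in actions)
```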

Exploration converges to exploitation
From the CEG, we can prove that exploration converges to exploitation. Here $stt_{rwd}$ denotes the reward state.

Lemma 1. Suppose that exploration in each episode starts from state $stt_0$, and ends in the reward state $stt_{rwd}$ after passing some intermediate states through a series of episodes. Over all possible episodes, $stt_{rwd} \notin SE$.

Proof: Once the agent has landed in state $stt_{rwd}$, the episode ends, so there is no further exploration originating from $stt_{rwd}$. Since, according to Definitions 1 and 2, the CEG contains only completely explored states, $stt_{rwd}$ cannot be a member of the set SE. (End of proof)

Define the envelope set SEE of the set SE as

$$SEE = \{stt_i : (stt_i \in SE) \land (Succ_{set}(stt_i) \not\subset SE)\}$$

Corollary 1. For any $stt_i \in SEE$, which consists of members of SE, at least one of the successor states of $stt_i$ does not belong to SE.

Proof: It follows directly from the definition of SEE. (End of proof)

Corollary 2. If $stt_i \in (Pred_{set}(stt_{rwd}) \cap SE)$ in an exploring episode, then irrespective of the exploration strategies adopted in the future, $stt_i$ will always be part of SEE.

Proof: From Lemma 1, if $stt_i \in SE$ has the successor state $stt_{rwd}$, it is impossible for $stt_{rwd}$ to become a member of SE. By the definition of SEE it follows that $stt_i$ is always a member of SEE. (End of proof)

Definition 3. If SSA(CEG) plans a decision chain that eventually reaches $stt_{rwd}$, and it does not produce any change in either V(CEG) or E(CEG), this condition is referred to as Exploration Convergence.

Corollary 3. Once the exploration converges, the planned decision chain from SSA(CEG) remains the same in the next exploration episode.

Proof: At the beginning of each episode, a CEG is constructed as needed to explore the new states. If the CEG stays unchanged at the end of the current episode, the next episode produces exactly the same path. At this point, the algorithm converges as it satisfies the condition set by Definition 3. (End of proof)

Theorem 1. Assume there is a finite number of states and SSA(CEG) is able to find the shortest path in the CEG. Then exploration becomes finding a path from the start state $stt_0$ to $stt_{rwd}$; in other words, exploration converges to exploitation.

Proof:

i. During exploration, state $stt_0$ reaches the SEE through SSA(CEG). That is, one needs to find the shortest path, $path_k$, among all the paths, such that

$$path_k = \arg\min_{stt_i \in SEE} Len(stt_0, stt_i) \tag{4}$$

where $Len(stt_i, stt_j)$ is the length of the shortest path between state $stt_i$ and state $stt_j$. If $stt_{rwd} \in Succ_{set}(stt_k)$, then the chain $(stt_0, \dots, stt_k, \dots, stt_{rwd})$ produced by the SSA(CEG) algorithm marks the shortest path from $stt_0$ to $stt_{rwd}$. In this case, the conditions of exploration convergence (Definition 3) are met, and exploration converges to exploitation.

ii. If $stt_{rwd} \notin Succ_{set}(stt_k)$, the CEG continues to evolve as exploration progresses.

iii. As exploration continues, new members are added into SEE and replace the old ones, extending the shortest path; according to Corollary 2, any new member $stt_i \in (Pred_{set}(stt_{rwd}) \cap SE)$ will always be part of SEE.

iv. When exploration ends, the state $stt_k$ that satisfies Eq. (4) will eventually meet the condition $stt_{rwd} \in Succ_{set}(stt_k)$.

v. The agent is bound to pass through the state $stt_k$ associated with the shortest path within $Pred_{set}(stt_{rwd})$. If not, there would be a different state $stt_j \in Pred_{set}(stt_{rwd})$ such that $Len(stt_0, stt_j) < Len(stt_0, stt_k)$. If $stt_j \in SE$, it is impossible for SSA(CEG) to have chosen $stt_k$ as a state on the shortest path. If $stt_j \notin SE$, there must be a state $stt_m \in SE$ in $stt_j$'s predecessor chain that makes $Len(stt_0, stt_m) < Len(stt_0, stt_k)$, and the algorithm has not yet converged in this episode.

vi. Putting everything together, exploration by SSA(CEG) must converge to the shortest path from the start state $stt_0$ to $stt_{rwd}$. As indicated in Corollary 3, once the algorithm has found the shortest path from $stt_0$ to $stt_{rwd}$, the path will be repeated with no change in the following episodes. In this case, the exploration can readily be halted. (End of proof)

Algorithm implementation
Based on Theorem 1 in the previous section, we propose a framework for RL that does not need to be concerned with the dilemma of exploration and exploitation. There are two major components in the framework, namely the CEG and the incompletely explored states, and there are two iterative steps, as illustrated in Fig. 3:
i. Based on the current CEG, an action decision, in the form of a single decision or a chain of multiple decisions, is made to guide the next exploration.
ii. The CEG is updated with the new knowledge acquired from the latest exploration. In a static or nearly static environment, exploration helps the CEG continue to grow, while in a changing environment, CEG members can be added or deleted according to the exploration result.

Note that when the CEG is updated, nodes or edges can be added to or deleted from the graph. In a static environment, the number of nodes and edges tends to increase as exploration progresses, while in a dynamic environment, the number of nodes and edges may increase or decrease.
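A sketch of this two-step loop is given below, reusing the `RLGraph` sketch from Section 2; `plan` stands for the decision routine built on the shortest path search, and `terminal_stts`, `env_feedback` and `max_steps` are assumed names, not the paper's.

```python
def run_episode(start_stt, terminal_stts, graph, env_feedback, plan, max_steps=10_000):
    """One episode: (i) plan a decision chain on the current graph/CEG,
    (ii) execute it and fold every newly observed transition back into the graph."""
    stt, steps = start_stt, 0
    while steps < max_steps:
        chain = plan(graph, stt)                   # step i: decide from the current CEG
        if not chain:                              # nothing left to try
            break
        for act in chain:
            nxt = env_feedback(stt, act)           # the environment may have changed meanwhile
            graph.add_transition(stt, act, nxt)    # step ii: update the graph with new knowledge
            stt, steps = nxt, steps + 1
            if stt in terminal_stts or steps >= max_steps:
                return stt
    return stt
```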

Shortest path search in dynamic environment
The standard Floyd-Warshall algorithm calculates the matrices DIST and NEXT in batch, organized by the length of the shortest path (here, the number of relay vertices) for each vertex pair.
For a graph whose vertices and structure keep changing as exploration proceeds, a more efficient method is needed, and one is proposed in this section.
During exploration in reinforcement learning, the completely explored states are discovered in sequence and subsequently added to the CEG, after which the corresponding edges are added as well. In addition, if the agent is to adapt to a dynamic environment, the removal of vertices must also be taken into account. In this section, a modified Floyd-Warshall algorithm (SFW) is presented, which is able to search for the shortest path in a graph that represents a dynamic environment. In simple terms, SFW serves as the SSA(CEG).
In SFW, each time a new vertex is added, the algorithm not only adds the shortest paths associated with the new vertex directly to the two matrices defined in Floyd-Warshall, but also compares the length of each new path introduced by the new vertex against that of the shortest path obtained from the prior iteration. These comparisons may result in updates to the two matrices.
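A sketch of this insertion step in Python is given below, operating on DIST and NEXT as nested dicts; it is a simplified rendering of the vertex-addition steps summarized later in the paper, and the function name and the `out_edges`/`in_edges` arguments are our own. Vertex removal, also needed in a dynamic environment, is not shown.

```python
import math

def sfw_add_vertex(DIST, NEXT, new_v, out_edges, in_edges):
    """Insert one newly completed state into the all-pairs matrices without a full
    recomputation: initialise its row and column, relax paths to/from it through the
    existing vertices, then check whether any old pair improves by routing via new_v."""
    old = list(DIST.keys())
    DIST[new_v], NEXT[new_v] = {new_v: 0}, {new_v: new_v}
    for v in old:
        DIST[new_v][v], NEXT[new_v][v] = math.inf, None
        DIST[v][new_v], NEXT[v][new_v] = math.inf, None

    for v, w in out_edges.items():                    # direct edges new_v -> v
        DIST[new_v][v], NEXT[new_v][v] = w, v
    for u, w in in_edges.items():                     # direct edges u -> new_v
        DIST[u][new_v], NEXT[u][new_v] = w, new_v

    for u in old:                                     # best paths to and from new_v via old vertices
        for v in old:
            if DIST[u][v] + DIST[v][new_v] < DIST[u][new_v]:
                DIST[u][new_v] = DIST[u][v] + DIST[v][new_v]
                NEXT[u][new_v] = NEXT[u][v]
            if DIST[new_v][u] + DIST[u][v] < DIST[new_v][v]:
                DIST[new_v][v] = DIST[new_v][u] + DIST[u][v]
                NEXT[new_v][v] = NEXT[new_v][u]

    for u in old:                                     # can routing through new_v shorten old pairs?
        for v in old:
            if DIST[u][new_v] + DIST[new_v][v] < DIST[u][v]:
                DIST[u][v] = DIST[u][new_v] + DIST[new_v][v]
                NEXT[u][v] = NEXT[u][new_v]
```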

Guided exploration
As proved in Section 3, exploration finally converges to the shortest path that connects to the target state. Since exploration and exploitation essentially produce the same result, our algorithm only needs to consider a single task, exploration.
The steps of guided exploration are listed in Tab. 1. The major steps are as follows (a code sketch follows the list):

Step 1: If $stt_{rwd}$ is a neighbor of the current state $stt$, the agent takes the corresponding action directly, which transitions the state to $stt_{rwd}$.

Step 2: If $stt$ is not a member of SE, a decision is made by calling a random action selection routine.

Step 3: If $stt$ is an edge state of SE (i.e., a member of SEE), a decision is likewise made by calling the random action selection routine.

Step 4: If $stt$ is a member of SE but not a member of SEE, the shortest path is obtained using SFW. This path represents the decision chain by which the agent can exit SE in the most efficient way.
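The sketch below condenses these four steps into a single decision function; `reward_action`, `sfw_chain` and `rand_act` are assumed helper names, not identifiers from the paper.

```python
import random

def guide_exploration(stt, SE, SEE, reward_action, actions, sfw_chain, rand_act=random.choice):
    """Sketch of the decision rule of Tab. 1.
    reward_action(stt) returns the action reaching stt_rwd if stt is its neighbour, else None;
    sfw_chain(stt, SEE) returns the SFW decision chain from stt to the nearest SEE member."""
    act = reward_action(stt)
    if act is not None:                  # Step 1: the reward state is one action away
        return [act]
    if stt not in SE or stt in SEE:      # Steps 2 and 3: unexplored or boundary state
        return [rand_act(actions)]
    return sfw_chain(stt, SEE)           # Step 4: leave SE along the shortest path
```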

Update of the CEG
Before exploration starts, SE is empty and the agent has no a priori knowledge of the environment. Denote $stt_0$ as the start state of each exploration episode. Once exploration begins, from the initial state $stt_0$, for each successive state $stt_i$ an action $act_i$ is selected from the action set according to SFW, after which the agent moves to the next state $stt_{i+1}$ obtained from the feedback of the environment. This exploration cycle is repeated. Whenever a new state is found, it is immediately added to the graph GU. When the current state becomes completely explored, it is added to the set SE, and sometimes to SEE simultaneously. This algorithm is listed in Tab. 2. One can see that whenever a new completely explored state, corresponding to a vertex in the graph, is added to the CEG, it must generate some action decision reflected as edge changes in the graph. The new SEE, by definition, can be readily derived from the updated CEG.
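The bookkeeping after each transition can be sketched as follows, with GU an instance of the `RLGraph` sketch from Section 2; the function and argument names are illustrative only.

```python
def update_after_step(stt, act, stt_next, GU, tried, actions, SE, SEE):
    """Sketch of the update in Tab. 2: record the new transition in GU, promote stt to SE
    once every action from it has been tried, and refresh the envelope SEE."""
    GU.add_transition(stt, act, stt_next)
    tried.setdefault(stt, set()).add(act)
    if tried[stt] == set(actions):                 # stt is now completely explored
        SE.add(stt)                                # a new CEG vertex; SFW adds its row/column here
    SEE.clear()
    SEE.update(s for s in SE if not GU.successors(s) <= SE)   # Succ_set(s) not a subset of SE
```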

Notes on the proposed algorithm
If the current state of the agent is in SE, the shortest path to the boundary of the explored region is selected, as seen from the algorithm listed in Tab. 1. As far as the completely explored states are concerned, our approach traverses them rather than exploring them repeatedly. In the classical Q-learning algorithm, a large number of states have to be traversed repeatedly. This subtle difference makes our algorithm more computationally efficient, as evidenced by the experimental results reported in the next section.
Compared to value-oriented algorithms, the experience our approach obtains from the exploration history is recorded in GU and SE rather than derived from a value distribution. The construction of the two matrices in SFW relies on GU and SE, and their structure contains the shortest paths between all pairs of states in SE. Therefore, in the case of a changing environment, modest modifications of GU and SE and updates of the two matrices make the proposed algorithm adaptive to the new, changed environment. This feature can be clearly seen from the results reported in Section 5.3.

Experimental results
The new graph-based algorithm detailed in Section 4 has been applied to solve mazes. Maze solving has been widely adopted for testing reinforcement learning algorithms. The agent in the experiment can be seen as a ground robot roaming in a maze, and it can always sense its current position (state) as it moves around. At the beginning of the experiment, the agent knows nothing about the maze, and it needs to find the reward (target) position and complete its journey along a path from a specified start position.

Setup of experiment
The maze has a size of 16 rows by 16 columns for a total of 256 blocks. There are 4 types of blocks, namely target, trap, obstacle and ordinary pass. The fixed start position is treated as a normal pass block. The agent gets a reward of 1 when it reaches the target, but if the agent falls into a trap, it gets a reward of -1. Both conditions end the current episode, and the agent has to return to the start position and restart its exploration. Note that the agent keeps the exploration information from all previous episodes. There are 975 mazes in the experiment, and they differ from each other in terms of the locations of the obstacle blocks. In our experiment, there are 46 obstacles in each of the 975 mazes. For each maze, the initial position of the agent is at the upper left corner (1,1), and the target position is set to (9,9). There are 4 fixed traps, located at (4,4), (12,4), (4,12) and (12,12). Tab. 3 summarizes the main characteristics of the maze. Fig. 4 illustrates a sample of mazes, with reference numbers 25, 36, 159, 256, 377, 512, 666 and 908. In these mazes, the red circle represents the agent, gray blocks represent obstacles, black blocks represent traps, the yellow block represents the target, and the rest are normal pass blocks. The proposed algorithm, referred to as SFW, is compared against the classical Q-learning algorithm (ql) and an improved Q-learning algorithm (qlm). Tab. 4 tabulates the main parameter values for ql. The main improvement of qlm over ql is that qlm can remember the locations of the obstacles and traps found during exploration, and avoid them during subsequent explorations. Even if the next action is randomly selected based on some probability, qlm can filter out the obstacles and traps.
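For concreteness, the maze layout described above can be encoded roughly as follows; this is a sketch, and the reward of 0 for ordinary moves is our assumption, since the text only specifies the target and trap rewards.

```python
MAZE = {
    "size": (16, 16),                                 # 16 x 16 = 256 blocks
    "start": (1, 1),                                  # upper left corner, a normal pass block
    "target": (9, 9),                                 # reward +1, episode ends
    "traps": {(4, 4), (12, 4), (4, 12), (12, 12)},    # reward -1, episode ends
    "num_obstacles": 46,                              # obstacle layout differs across the 975 mazes
}

def reward(pos, maze=MAZE):
    if pos == maze["target"]:
        return 1
    if pos in maze["traps"]:
        return -1
    return 0                                          # assumed for ordinary pass blocks
```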

Performance comparison with Q-learning algorithms

Single maze comparison
All three algorithms are compared in terms of the number of steps per episode when applied to solve the mazes; the results from mazes 25 and 908 are plotted in Fig. 5 and Fig. 6, respectively. In solving both mazes, SFW converges more quickly than the other two algorithms, and it requires fewer steps during the exploratory process. As expected, qlm performs better than the classical ql.

Statistical performance comparisons for all mazes
The experiments in this section include all 975 mazes. The X axis of each figure corresponds to the maze number. The comparison of convergence speed for every maze is shown in Fig. 9 and Fig. 10, where ql and qlm are compared with SFW separately. The Y axis of each figure is the number of episodes after which the agent first reaches convergence. In both figures, one can see that the proposed algorithm converges more quickly than the other two algorithms, which is especially noticeable when a large number of episodes is required. In fact, the SFW algorithm converges after no more than 20 episodes, while the other two algorithms need as many as 100+ episodes. The exploration efficiency obtained from solving every maze is shown in Fig. 13. One can see that SFW outperforms qlm in this regard, and both algorithms are significantly better than ql. The X axis represents the maze number. The Y axis is the ratio of the total number of explored states to the total number of steps when the agent first reaches convergence.

The maze in the dynamic environment

Changes of the environment are categorized as obstacle change and target position change. We will in this subsection examine how these changes can affect the performance of the three algorithms.

Obstacle change
Taking Maze #8 as an example, the changes of the obstacles are tabulated in Tab. 5.

Computation efficiency
All three algorithms are compared for their respective computation efficiency on the same computation platform. The hardware used in the experiments has an Intel(R) Core(TM) i5-3210M CPU running at 2.50 GHz and 8 GB of RAM. The operating system is 64-bit Ubuntu. The tools used to measure CPU time and memory occupation are line_profiler and memory_profiler, respectively. The average CPU time reported in Tab. 7 is the average time for solving all 975 mazes. The basic memory usage in Tab. 7 refers to the stable memory usage collected from solving a selection of 62 mazes. One can see that SFW requires more memory space than the other two algorithms; the memory usage of ql and qlm is comparable. The peak memory usage of SFW is also higher than that of ql or qlm.

Unlike the classical Q-learning algorithm and the improved Q-learning algorithm, the proposed algorithm does not struggle with the exploration vs. exploitation tradeoff, as it was proved that the two tasks of exploration and exploitation actually converge in the decision-making process. As such, the proposed graph-based algorithm finds the shortest path during exploration, which gives higher efficiency and faster convergence than the Q-learning algorithm and its variant. Another big advantage of the proposed algorithm is that it can be applied to dynamic environments where value-oriented algorithms fail to work. The efficiency and convergence performance of the proposed algorithm come at the cost of increased computational complexity. Future study will focus on confining the computational complexity and, particularly, the memory usage.

Figure 2: An example of a completely explored graph. Nodes represent states and directed edges between nodes represent actions. The shadowed area that includes all the state nodes (colored yellow) and all the associated directed edges represents the CEG. The unfilled nodes outside the shadowed area represent incompletely explored states, even though they connect to the CEG.

Suppose the complete action set $A = \{act_0, \dots, act_n\}$ is known. State $stt_i$ is a completely explored state if every reachable next state of $stt_i$, denoted $stt_{i+1}$ and obtained by taking a possible action $act_i \in A$, has been traversed. Let GU<V, E> represent a graph that contains all the traversed states, including both the completely and the partially explored states. We define the predecessor state set of a state $stt$ as

$$Pred_{set}(stt) = \{p : (p \in V(GU)) \land ((p, stt) \in E(GU))\}$$

and the successor state set as

$$Succ_{set}(stt) = \{q : (q \in V(GU)) \land ((stt, q) \in E(GU))\}$$

where SE denotes the set V(CEG).

Figure 5: Comparison of the number of steps per episode for maze #25. The X axis is the episode number, and the Y axis represents the number of steps in each episode.

Figure 6: The corresponding comparison of the number of steps per episode for maze #908.

Figure 11: Step length at convergence comparison: ql and SFW.

Figure 12: Step length at convergence comparison: qlm and SFW.

Figure 13: Exploration efficiency comparison.

Fig. 14 shows snapshots of exploration, convergence, environment change and adaptation. The maze has undergone three major changes in the locations of the obstacles during the experiments. The green squares in Fig. 14 represent the members of SE, and the purple squares represent dynamically added obstacles that are located within the current convergent path.

Figure 14: Dynamic obstacles, Maze #8.

Figure 15:

Changes of target positions

Tab. 6 summarizes the changes that occur to maze #243. Other mazes have gone through similar changes. One can see that the target position is changed once, relocated from the center of the maze to its lower left corner.
The major steps of SFW are summarized as follows. If the current state stt is to be added to the set SE, perform the following steps:

i. Add and initialize a new row in matrices DIST and NEXT.
ii. Add and initialize a new column in matrices DIST and NEXT.
iii. Update the new column by computing the shortest paths from all the existing vertices to the new vertex.
iv. Update the new row by computing the shortest paths from the new vertex to all the other vertices.
v. Update matrices DIST and NEXT by comparing, for each vertex pair, the length of the old shortest path recorded in the matrices with that of the new path passing through the added vertex:

$$DIST(stt_i, stt_j) \leftarrow \min\{DIST(stt_i, stt_j),\; DIST(stt_i, stt) + DIST(stt, stt_j)\}$$

Denote $Len_{min}(stt_i, Pred_{set}(stt))$ as the length of the shortest path from an arbitrary state $stt_i$ to the set $Pred_{set}(stt)$ defined in Section 2:

$$Len_{min}(stt_i, Pred_{set}(stt)) = \min_{stt_j}\{DIST(stt_i, stt_j) : stt_j \in Pred_{set}(stt)\} \tag{5}$$

Matrices DIST and NEXT are then updated by performing the following operations:

$$DIST(stt_i, stt) = Len_{min}(stt_i, Pred_{set}(stt)) + 1 \tag{6}$$

$$NEXT(stt_i, stt) = stt_{j^*} \tag{7}$$

where $stt_{j^*}$ is the minimizer in Eq. (5). In the same way, one can update the entries from stt to the other states using stt's successor state set $Succ_{set}(stt)$:

$$Len_{min}(Succ_{set}(stt), stt_i) = \min_{stt_j}\{DIST(stt_j, stt_i) : stt_j \in Succ_{set}(stt)\}$$

$$DIST(stt, stt_i) = Len_{min}(Succ_{set}(stt), stt_i) + 1$$

$$NEXT(stt, stt_i) = stt_{j^*} \tag{12}$$

Table 3: Maze design parameters

Table 7: Algorithm complexity comparison

In this paper, a new graph-based method was presented for reinforcement learning.