
navigation problem. Section 4 presents the simulation procedures and results, with comparisons against a related method. Finally, conclusions are given in Section 5.

DAQL-enabled multiple-goal reinforcement learning

General overview
Autonomous navigation is inherently a multiple-goal problem involving destination seeking, collision avoidance, lane/wall following, and others. Fig. 1 depicts the concept of multiple-goal reinforcement learning with a total of G goals. A multiple-goal scenario can be generalized such that both conventional QL and DAQL can be used for learning, depending on the nature of the environment. The individual Q-values are eventually fused to produce a final action. For instance, consider the vehicle navigation problem limited to two goals: destination seeking (DS) and collision avoidance (CA). If the obstacles and the destination are non-stationary, then both goals can be handled by DAQL, whereas if they are all stationary, then QL suffices. Here, this general concept is illustrated by assuming that the destination is stationary and the obstacles are mobile. As such, QL is used for DS and DAQL is used for CA.

Reinforcement learning framework
An effective tool for mapping states (that describe the environment) to actions (that are taken by an agent) and carrying out appropriate optimization (based on a value function) is the Markov Decision Process (MDP) model. It is a model for sequential decision making under uncertainty. In an MDP, the transition probability and the reward function are determined only by the current state and the action selected by the agent (Puterman, 1994). It can be explained by considering a specific time instant of an agent and its environment, as depicted in Fig. 2. At each time step t, the agent observes the state $s_t \in S$, where S is the set of possible states, then chooses an action $a_t \in A(s_t)$, where $A(s_t)$ is the set of actions available in $s_t$, based on $s_t$ and an exploration policy (e.g. the $\varepsilon$-greedy policy). The action causes the environment to change to a new state $s_{t+1}$ according to a transition probability.

In RL, the value function is introduced to estimate the value for the agent of being in a given state. It is the expected infinite discounted sum of reward that the agent will gain (Sutton & Barto, 1998):

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\} \qquad (1)$$

where $E_\pi\{\cdot\}$ is the expected value when policy $\pi$ is adopted, $R_t$ is the discounted sum of future rewards, $\gamma$ is the discounting factor, and $r_{t+k+1}$ is the reward (or penalty) received at time $t+k+1$. Policy $\pi$ is a mapping from each state–action pair to the probability $\pi(s,a)$ of taking action a when in state s. To solve the RL task, an optimal policy should be determined that results in the action $a_t$ with the highest expected discounted reward from s to the end of the episode. The optimal value function corresponding to the optimal policy is then achieved by maximizing the value function:

$$V^*(s) = \max_\pi V^\pi(s) \qquad (2)$$

The corresponding action-value function is given as:

$$Q^\pi(s,a) = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s,\ a_t = a\Big\} \qquad (3)$$

and the optimal action-value function is given as:

$$Q^*(s,a) = \max_\pi Q^\pi(s,a) \qquad (4)$$

QL (Watkins & Dayan, 1992) has been proven to converge to the optimal action-value function with probability one if each action is executed in each state an infinite number of times (Kaelbling et al., 1996; Watkins & Dayan, 1992), and it works reasonably well in a single-agent environment, where the agent is the only object able to evoke a state transition.
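As an illustration, the tabular QL backup can be sketched as follows; the integer state/action encoding and the single toy transition are our own assumptions, not from the chapter:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.6, 0.9          # learning rate and discount factor

Q = defaultdict(float)           # Q[(state, action)] -> value, initialized to 0

def ql_update(s, a, r, s_next, actions):
    """One Q-learning backup: Q(s,a) += alpha*[r + gamma*max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

# a single illustrative backup: reward 1.0 for moving from state 0 to state 1
ql_update(s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```

With all Q-values initially zero, this single backup moves Q(0, 1) to 0.6, i.e. a fraction ALPHA of the observed reward.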

Double action Q-learning
In general, it is fair to assume that a dynamic environment (DE) consists of static obstacles (e.g. walls) and dynamic agents/obstacles. In this case, the assumption that a state transition is solely caused by the agent is not exactly appropriate (Littman, 2001; Boutilier, 1995; Claus & Boutilier, 1998). In the new model, the state changes when either the agent or the environment (or both) has taken its action, and $r_{t+1}$ is the reward received as a result. To reflect the fact that the state transition is now determined by $a^1$ and $a^2$, the new value function is formulated below:

$$V^{\pi_1,\pi_2}(s) = E_{\pi_1,\pi_2}\Big\{\sum_{k=0}^{\infty}\gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\} \qquad (7)$$

where $E_{\pi_1,\pi_2}\{\cdot\}$ represents the expected value when policy $\pi_1$ is adopted by the agent and policy $\pi_2$ is adopted by the environment. Similarly, there exists an optimal value function when an optimal policy pair $(\pi_1, \pi_2)$ is applied. Although there may be more than one such pair, we call all the optimal pairs $\pi_1^*$ and $\pi_2^*$. They have the optimal value function $V^*(s)$ defined as:

$$V^*(s) = \max_{a^1_t,\,a^2_t} Q^*(s_t, a^1_t, a^2_t) \qquad (8)$$

with the corresponding optimal action-value function:

$$Q^*(s_t, a^1_t, a^2_t) = E\Big\{ r_{t+1} + \gamma \max_{a^1_{t+1},\,a^2_{t+1}} Q^*(s_{t+1}, a^1_{t+1}, a^2_{t+1}) \Big\} \qquad (9)$$

Although $a^2_t$ is involved in calculating Eqts. (7)–(9), it is inherently uncontrollable by the agent, and therefore maximizing over $a^2_t$ in (8) and $a^2_{t+1}$ in (9) is meaningless. Instead, an approximation to the optimal action-value function is obtained by using the observed $a^2_{t+1}$ and maximizing over $a^1_{t+1}$ only. As such, the new update rule for DAQL is:

$$Q(s_t, a^1_t, a^2_t) \leftarrow Q(s_t, a^1_t, a^2_t) + \alpha\big[ r_{t+1} + \gamma \max_{a^1} Q(s_{t+1}, a^1, a^2_{t+1}) - Q(s_t, a^1_t, a^2_t) \big] \qquad (10)$$

where $s_t$ and $a^1_t$ are known at t; $a^2_t$, $s_{t+1}$ and $r_{t+1}$ are known at t+1; and $a^2_{t+1}$ can only be known at t+2. Therefore, the learning is delayed by two time steps when compared with conventional QL, but with $a^2_t$ and $a^2_{t+1}$ appropriately included. Comparing Eqt. (5) with (10), the difference between DAQL and QL is that the action $a^2_t$ has been explicitly specified in the update rule. The optimal value function obtained by maximizing over $a^1_t$ only, while $a^2_t$ is considered explicitly but unknown, is:

$$V^*(s) = \max_{a^1_t} \sum_{a^2_t} p_{a^2_t}\, Q^*(s_t, a^1_t, a^2_t) \qquad (11)$$

To predict the obstacles' actions, an AR model is applied, which allows the calculation of the expected Q-value. In case the obstacles' actions are not predictable, such as when they move randomly, we assume that $a^2_t$ has equal probability of taking any of the $|A^2(s)|$ actions.
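A tabular sketch of the DAQL backup over (state, agent action, obstacle action) triples; here `a2` (observed at t) and `a2_next` (observed at t+1) are explicit inputs, and only the agent's own action is maximized. The encoding and the toy transition are assumptions for illustration:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.6, 0.9

Q = defaultdict(float)  # Q[(state, agent_action, obstacle_action)] -> value

def daql_update(s, a1, a2, r, s_next, a2_next, agent_actions):
    """DAQL backup: the obstacle's observed actions a2 and a2_next enter the
    rule explicitly; the max runs over the agent's next action a1 only."""
    best_next = max(Q[(s_next, a1n, a2_next)] for a1n in agent_actions)
    Q[(s, a1, a2)] += ALPHA * (r + GAMMA * best_next - Q[(s, a1, a2)])

# one illustrative backup: a collision penalty of -1 on this transition
daql_update(s=0, a1=1, a2=0, r=-1.0, s_next=1, a2_next=2, agent_actions=[0, 1])
```

Because `a2_next` is only observed one step later, a real agent would buffer each transition and apply this backup with a two-step delay, as the text describes.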

Goal fusion
The purpose of goal fusion (GF) is to derive a single final action from the actions of different goals. Available methods for the coordination of goals include simple summation or switching of action-value functions (Uchibe et al., 1996), mixtures of local experts by supervised learning (Jacobs et al., 1991), and multiple model-based reinforcement learning (Doya et al., 2002). Here, we adopt a modified summation method to coordinate multiple goals. A GF function based on this is formulated as follows:

$$Q_{final}(a^1_t) = w_1 Q_1(a^1_t) + w_2 Q_2(a^1_t) + \dots + w_G Q_G(a^1_t)$$

where $w_1 + w_2 + \dots + w_G = 1$, G is the number of goals to be achieved, and $Q_1(a^1_t), \dots, Q_G(a^1_t)$ are the Q-values of the G goals respectively. The importance of a goal with respect to the whole task is represented by the value of its weight: a more important goal is given a larger $w_g$, while a less important goal is given a smaller $w_g$.
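The weighted-summation fusion can be sketched directly; the action names and Q-values below are made up for illustration:

```python
def fuse(q_per_goal, weights):
    """Q_final(a) = sum_g w_g * Q_g(a), with the weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    actions = q_per_goal[0].keys()
    return {a: sum(w * q[a] for w, q in zip(weights, q_per_goal)) for a in actions}

q_ds = {"left": 0.2, "straight": 0.9}   # illustrative destination-seeking Q-values
q_ca = {"left": 0.8, "straight": -0.5}  # illustrative collision-avoidance Q-values

# a small DS weight gives a strong bias towards collision avoidance
q_final = fuse([q_ds, q_ca], weights=[0.1, 0.9])
```

Here the CA-preferred action "left" wins the fusion (0.1·0.2 + 0.9·0.8 = 0.74) even though DS alone would prefer "straight".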

Geometrical relations between agent and environment
The control variables of the agent and the i-th obstacle at time t are depicted in Fig. 4. It is assumed that there are N moving obstacles in the environment and that obstacle distances can be sensed by distance sensors on the agent, which have minimum and maximum detectable distances of $d_{s,min}$ (10 cm) and $d_{s,max}$ (500 cm) respectively. Further assume that only M obstacles in the environment can be detected, where M ≤ N. The location of the i-th obstacle is denoted by distance $d_i \in D_o$, where $D_o = [d_{s,min}, d_{s,max}] \subset \Re$, and angle $\theta_i \in \Theta$, where $\Theta = [0, 2\pi] \subset \Re$.
We assume that the agent is $d_{dest} \in \Re^+$ away from the destination and at an angle $\varphi \in \Theta$. Angles are quantized into 16 sectors, $\Theta_q = \{j \mid j = 0, 1, \dots, 15\}$, and obstacle speed and heading are quantized such that there are altogether 161 actions for each obstacle as observed by the agent (including the stationary case). The concept of factored MDP (Boutilier et al., 2000; Guestrin et al., 2001) can be applied if necessary to reduce the number of states required.

Destination seeking
For convenience, the destination is assumed stationary here; otherwise, the actions performed by a moving destination may be treated in the same way as those of the obstacles, to which the same DAQL formulation applies.
Fig. 6. Change in d_dest from t−1 to t.
The purpose of using reinforcement learning in destination seeking is for the agent to learn the limitations of the underlying vehicle mechanics, such as limited acceleration and deceleration. The crux of the QL formulation for DS is that the agent is punished if its trajectory towards the destination contains more steps than necessary, with reference to Fig. 6, where $-3 \le r_{DS,t} \le 1$. In Eqt. (22), $d_{extra}$ is a penalty to the agent to ensure that it follows the shortest path to the destination. The reward function is further shifted to $-1 \le r_{DS,t} \le 0$ by $r_{DS,t} \leftarrow (r_{DS,t} - 1)/4$, so that the Q-values calculated are in line with those from CA. By using the QL update rule, the agent can learn to use the most appropriate action in the current state to reach the destination by the most direct path, as depicted in Fig. 7.

Fig. 7. QL for destination seeking.
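A minimal sketch of the DS reward computation. The raw-reward formula here is a reconstruction consistent with the stated range $-3 \le r_{DS,t} \le 1$ and the definitions of $\Delta d_{dest}$ and $d_{extra}$; the function name and arguments are ours:

```python
def ds_reward(delta_d, va, T=1.0):
    """Reconstructed DS reward: d_extra = |v_a|*T - delta_d penalizes any step
    longer than the direct one; the raw reward lies in [-3, 1]."""
    d_extra = va * T - delta_d           # 0 on a shortest-path step
    r = (delta_d - d_extra) / (va * T)   # raw reward in [-3, 1]
    return (r - 1.0) / 4.0               # shifted to [-1, 0], in line with CA

# shortest-path step: delta_d == |v_a|*T  ->  raw reward 1, shifted reward 0
print(ds_reward(delta_d=50.0, va=50.0))  # prints 0.0
```

Moving directly away from the destination (delta_d = −|v_a|T) gives the worst shifted reward of −1.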

Collision avoidance
Given multiple mobile obstacles in the environment, DAQL is most applicable here. The reward function adopted by DAQL represents a punishment (−1) to the agent when a collision occurs:

$$r_{CA,i,t} = \begin{cases} 0 & \text{if no collision occurred} \\ -1 & \text{if a collision occurred} \end{cases}$$

When $r_{CA,i,t}$ is available, the agent uses the DAQL update rule to learn CA, as depicted in Fig. 8. Given the obstacles' actions in two time steps (t−2 and t−1), the agent updates its Q-values $q_i(s_{i,t}, a^1_t, a^2_{i,t})$ at t. If there are M obstacles detectable by the agent, the DAQL update rule is applied M times and the results are combined based on the parallel learning concept introduced by Laurent & Piat (Laurent & Piat, 2001; Laurent & Piat, 2002). Their proposal of taking the sum of the Q-values over all the obstacles is used, as opposed to taking the maximum Q-value over all the obstacles:

$$Q_{CA}(a^1_t) = \sum_{i=1}^{M} \sum_{a^2_{i,t}} p_{a^2_{i,t}}\, q_i(s_{i,t}, a^1_t, a^2_{i,t})$$

where $Q_{CA}(a^1_t)$ is the overall Q-value set for the entire obstacle population when the agent takes $a^1_t$, and $p_{a^2_{i,t}}$ is the probability that the environment takes action $a^2_{i,t}$. The expected value of the overall Q-value is obtained by summing the product of the Q-value of each obstacle when it takes action $a^2_{i,t}$ with its probability of occurrence. The combined Q-value for the entire DE, $Q_{CA}(a^1_t)$, is the summation of the Q-values of all obstacles.
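The summation over obstacles of expected per-obstacle Q-values can be sketched as follows; the states, actions, and Q-values are illustrative (chosen as powers of two so the arithmetic is exact):

```python
def q_ca(agent_action, obstacles, q_tables):
    """Q_CA(a1) = sum over obstacles i of sum over a2 of p(a2)*q_i(s_i, a1, a2)."""
    total = 0.0
    for (state, action_probs), q_i in zip(obstacles, q_tables):
        total += sum(p * q_i[(state, agent_action, a2)]
                     for a2, p in action_probs.items())
    return total

# two obstacles: obstacle 1's action fully predicted (p=1), obstacle 2 random
obs = [(3, {0: 1.0}), (7, {0: 0.5, 1: 0.5})]
q1 = {(3, "straight", 0): -0.5}
q2 = {(7, "straight", 0): -0.25, (7, "straight", 1): 0.25}
print(q_ca("straight", obs, [q1, q2]))  # -0.5 + (0.5*-0.25 + 0.5*0.25) = -0.5
```

When an obstacle's action is perfectly predicted, its inner sum collapses to a single Q-value; a uniformly random obstacle contributes the plain average of its Q-values.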

Prediction
To predict $a^2_{i,t}$, a linear prediction technique based on the autoregressive (AR) model is adopted. We assume that the accelerations of obstacles change slowly over the time interval T between two time steps. A 1st-order AR model (Kehtarnavaz & Li, 1988; Ye, 1999) is used to model the acceleration $a_i(t)$:

$$a_i(t) = B_{i,t}\, a_i(t-1) + e(t) \qquad (27)$$

where $e(t)$ is the prediction error and $B_{i,t}$ is a time-dependent coefficient estimated adaptively from the new distance measurements. The acceleration is approximated by finite differences of velocity and position:

$$a_i(t) = \frac{v_i(t) - v_i(t-1)}{T}, \qquad v_i(t) = \frac{r_i(t) - r_i(t-1)}{T} \qquad (28)$$

where $v_i(t)$ and $r_i(t)$ are the velocity and position of the i-th obstacle at time step t, respectively. Substituting Eqt. (28) into (27) gives a 3rd-order AR model in position, from which the next position of the i-th obstacle at time t+1 can be predicted if the coefficient $B_{i,t}$ is known:

$$\hat{r}_i(t+1) = 2 r_i(t) - r_i(t-1) + \hat{B}_{i,t}\big[\, r_i(t) - 2 r_i(t-1) + r_i(t-2) \,\big] \qquad (30)$$

The time-dependent coefficient $\hat{B}_{i,t}$ is updated by the adaptive algorithm in (Shensa, 1981):

$$\Delta_{i,t} = \rho\,\Delta_{i,t-1} + a_i(t)\,a_i(t-1), \qquad R_{i,t} = \rho\,R_{i,t-1} + a_i^2(t-1), \qquad \hat{B}_{i,t} = \frac{\Delta_{i,t}}{R_{i,t}}$$

where $0 < \rho \le 1$ is a weighting factor close to 1. Since $a_i(t)$, $\Delta_{i,t}$, $R_{i,t}$ and $\rho$ are all known, $\hat{B}_{i,t}$ can be computed and thus $\hat{r}_i(t+1)$ can be predicted, from which the action performed by the i-th obstacle at t can be predicted and the probability $p_{a^2_{i,t}}$ determined. A probability of 1 is given to the predicted action and 0 to all other actions.
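A hedged sketch of the position predictor. The recursive estimator for B (a Delta/R ratio with forgetting factor rho) is a reconstruction of the garbled equations in the style of exponentially weighted least squares; all names are ours:

```python
def predict_next_position(r2, r1, r0, B):
    """r_hat(t+1) = 2 r(t) - r(t-1) + B*[r(t) - 2 r(t-1) + r(t-2)],
    i.e. linear extrapolation plus an AR(1) correction on acceleration."""
    return 2 * r0 - r1 + B * (r0 - 2 * r1 + r2)

def update_B(delta, R, a_now, a_prev, rho=0.95):
    """Exponentially weighted estimate B_hat = Delta/R with forgetting factor rho."""
    delta = rho * delta + a_now * a_prev
    R = rho * R + a_prev * a_prev
    return delta, R, (delta / R if R else 0.0)

# constant-velocity obstacle (accelerations zero): the predictor reduces to
# pure linear extrapolation, here 0 -> 10 -> 20 -> 30
print(predict_next_position(0.0, 10.0, 20.0, B=0.0))  # 30.0
```

With B = 1 (constant acceleration), positions 0, 10, 25 extrapolate to 45: the velocity step grows from 10 to 15 to 20 per interval.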

Fusion of DS and CA
Given the two sets of Q-values from DS and CA, they are combined using $\lambda$, a parameter that varies between 0 and 1, to balance the influence of the two goals:

$$Q_{final}(a^1_t) = \lambda\, Q_{DS}(a^1_t) + (1-\lambda)\, Q_{CA}(a^1_t) \qquad (34)$$

where $Q_{DS}(a^1_t)$ and $Q_{CA}(a^1_t)$ are normalized. For $\lambda$ closer to 1, $Q_{final}(a^1_t)$ is biased towards DS, giving the agent better DS performance but poorer CA performance. Conversely, for $\lambda$ closer to 0, $Q_{final}(a^1_t)$ is biased towards CA, giving the agent poorer DS performance but better CA performance. The final decision of the agent is made by using the $\varepsilon$-greedy policy, as shown in Eqt. (35). Fig. 9 and Fig. 10 depict the functional diagram and the pseudo code of the proposed method for multiple obstacles, respectively; the pseudo code initializes $q_i(s, a^1, a^2)$ arbitrarily and repeats, for each episode, until the destination state $l_{dest}$ is terminal.
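The final ε-greedy selection over the fused Q-values can be sketched as follows; the action names and values are illustrative:

```python
import random

def select_action(q_final, epsilon, rng=random):
    """With probability epsilon explore uniformly; otherwise act greedily
    with respect to the fused Q-values."""
    if rng.random() < epsilon:
        return rng.choice(list(q_final))
    return max(q_final, key=q_final.get)

q_final = {"left": 0.74, "straight": -0.36}
action = select_action(q_final, epsilon=0.0)  # purely greedy -> "left"
```

During training a nonzero ε (0.5 for DS, 0.1 for CA in the simulations) keeps the agent exploring; during evaluation ε is effectively 0.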

Simulation conditions
In this simulation, length has units of cm and time has units of seconds. The agent and obstacles are assumed to be circular with a diameter of 100 cm, and the environment is 2500×2500 cm², as depicted in Fig. 11. The numbers in the figure represent the locations of the agents and targets every 10 s. The maximum speed of the agent ($v_{a,max}$) is assumed to be 50 cm/s, with a maximum acceleration and deceleration of 20 cm/s². The agent is required to start from rest and to decelerate to 0 cm/s when it reaches the destination.
To acquire environmental information, a sensor simulator has been implemented to measure distances between the agent and obstacles. The sensor simulator can produce either accurate or erratic distance measurements of up to 500 cm, at intervals of T (typically 1 s), to simulate practical sensor limitations. The other parameters are set as follows: $\alpha$ for both DS and CA learning is set to 0.6 for faster update of Q-values; $\gamma$ is set to 0.9 for CA and 0.1 for DS; $\lambda$ is set to 0.1 to give a strong bias towards CA, at the expense of a longer path; $\varepsilon$ is set to 0.5 for DS and 0.1 for CA.

Factors affecting navigation performance
The factors that affect an agent's performance in a DE are: relative velocity, relative heading angle, separation, and obstacle density. They define the bounds within which the agent can navigate without collision from an origin to a destination. The relative velocity of an obstacle as observed by the agent can be defined as $v_{r,i} = v_{o,i} - v_a$, where $v_{o,i}$ and $v_a$ are the velocity vectors of the i-th obstacle ($O_i$) and the agent (A) respectively. In essence, $v_{r,i}$ represents the rate of change in separation between A and $O_i$. Given $\delta_{o,i}$, the heading angle of $O_i$ with respect to the line joining the centres of $O_i$ and A, and $\delta_a$, the heading angle of A, as depicted in Fig. 12, the relative heading angle is defined as $\psi = \pi - (\delta_a + \delta_{o,i})$. It should be noted that $\psi$ equals $\pi$ when A and $O_i$ are moving towards each other, $-\pi$ when they are moving away from each other, and 0 when both are moving in the same direction. Let $d_i$ be the separation between A and $O_i$. It determines the priority A adopts when considering $O_i$ among other obstacles. If $d_{s,max}$ is the maximum sensor measurement range of A and $d_{s,max} < d_i$, then $O_i$ simply does not exist from A's point of view. Obstacle density can be defined as $D = N\pi r_o^2/(A_{env} - \pi r_a^2)$, where N is the number of obstacles in the environment and $A_{env}$ is the area of the closed environment. We also assume that the obstacles are identical with radius $r_o$, and that A has radius $r_a$. Given $A_{env} = 2500^2$ cm² and $r_o = r_a = 50$ cm, D = 0.00125N.

Step 2 was repeated 100 times to obtain an average performance over 10,000 episodes. Fig. 13 depicts the mean percentage difference between the actual and the shortest paths. It can be seen that, given sufficient training, the agent achieves a path difference of 3–4%. This is due to the discrete actions the agent adopted in the simulation. In Fig. 13, the data are curve-fitted with a 6th-order polynomial (solid line), from which a cost function is applied to
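The density formula can be checked numerically against the stated figures (2500 cm × 2500 cm arena, $r_o = r_a = 50$ cm); the function name is ours:

```python
import math

def obstacle_density(N, r_o=50.0, r_a=50.0, side=2500.0):
    """D = N*pi*r_o^2 / (A_env - pi*r_a^2): fraction of the free area
    (arena minus the agent's own footprint) covered by N obstacles."""
    return N * math.pi * r_o**2 / (side**2 - math.pi * r_a**2)

print(round(obstacle_density(1), 5))  # 0.00126
```

The per-obstacle density comes out at about 0.00126, matching the chapter's D ≈ 0.00125·N to rounding.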

Training for collision avoidance
For the different environmental factors, we trained the agent for 10,000 episodes in each case, with the Q-values for CA initially set to zero. After training, simulation results were obtained by allowing the agent to maneuver in an environment with the same set of environmental factors, without learning or exploration. When obstacles were present, the agent traveled between a fixed origin–destination (OD) pair. The agent learnt from the rewards or punishments it received when interacting with the obstacles. When the agent reached the destination, the episode was terminated and the agent and obstacles were returned to their origins for the next episode. Furthermore, to illustrate the behavior of the agent in a more complex environment involving multiple sets of different environmental factors at the same time, environments with randomly moving obstacles were constructed. The Q-values for CA were initially set to zero and the agent was trained for 10,000 episodes in each test case. After training, simulation results were obtained by allowing the agent to maneuver in the same environment without learning or exploration. In each training episode, the agent was required to travel from a fixed origin to a fixed destination through the crowd of randomly moving obstacles, which were randomly placed in the environment; the termination condition was the same as before.

Obstacles at constant velocity
This simulation investigates how the agent reacts to one or more obstacles moving at constant velocity, with an initial separation of more than 500 cm. The AR model in Section 3.4 was used for obstacle action prediction. For one obstacle, two values of $v_o$ and two values of $\psi$ were considered: 50 cm/s and 100 cm/s; and $\pi$ and $\tfrac{3}{4}\pi$. The simulation was repeated for $v_o$ = 50 cm/s with two obstacles present at different heading angles. It was also repeated for a group of obstacles having the same heading angle. These cases are tabulated in Tables 1 to 3.
Cases A and B: The obstacle moved directly towards the agent, at a different velocity in each case, as depicted in Fig. 15. For Case A, the obstacle moved at the same speed as the agent. The agent maintained maximum speed until the obstacle was within range, then responded appropriately, as seen from its velocity and heading-angle profiles: it made a gradual change in heading angle to avoid collision, remained on the changed course for a while, and then converged to the destination. For Case B, as the obstacle moved faster than the agent, the agent responded earlier with a larger change in velocity. As the CA event ended sooner, the agent in Case B reached the destination earlier.

Cases C and D:
The obstacle crossed the agent's path at an angle of $\tfrac{3}{4}\pi$, as depicted in Fig. 16. For Case C, with the obstacle at the same speed as the agent, the agent moved slightly to the right to let the obstacle pass. For Case D, the agent responded earlier and also decided to let the obstacle pass first. As the obstacle moved faster in this case, the velocity and heading-angle changes of the agent were larger.

Cases E and F:
These cases deal with two obstacles simultaneously. The obstacles moved at speed $v_o$ = 50 cm/s in both cases, but at different heading angles. Case E, as depicted in Fig. 17(a–c), consists of two obstacles moving at an intersecting angle of $\tfrac{3}{4}\pi$ with respect to the agent. As can be seen from its V and H profiles, there are two responses: one at t = 23 s when Obstacle 1 approached the agent first, and one at t = 29 s when Obstacle 2 followed. The speed changes in both responses were minor, although the agent stepped backward in the first instance to avoid collision. For Case F, two obstacles moved perpendicularly to the agent, as depicted in Fig. 17(d–f). There were two distinct responses (at t = 9 s and t = 22 s), both of which required slowing down and a change in heading angle to let the obstacle pass first.

Table 3. Summary of simulation parameters with a GROUP of obstacles.

Cases G and H:
These cases deal with a larger number of obstacles in the DE. In Case G, seven obstacles moved in a cluster towards the agent at $v_o$ = 50 cm/s. As the path diagram in Fig. 18(b) shows, because the obstacles were well apart, the agent had no difficulty navigating through them, as seen in its V and H profiles. For Case H, the cluster of seven obstacles moved at an angle of $\tfrac{3}{4}\pi$ with respect to the agent. Again, the agent navigated through the cluster appropriately, without collision.

Obstacles at variable velocity
The objective of this simulation is to study agent behavior in a simple, randomly changing environment. In Cases I and J, a single obstacle moved at varying velocity directly towards the agent ($\psi = \pi$). The obstacle's velocity ranges were 0–50 cm/s and 0–100 cm/s respectively, in steps of 10 cm/s. The agent was evaluated over 1,000 episodes in the same environment in each case. A summary of the two cases is given in Table 4. The results show that for Case I the proportion of collision-free episodes is 97.6%, with a mean path time of 62.97 s. A collision-free episode is one in which the agent travels to the destination without causing any collision. Compared with the shortest path time (50 s), the agent used an extra 12.97 s. For Case J, the obstacle moved faster and over a wider range. As a result, the proportion of collision-free episodes was reduced to 95.7%, but the mean path time was also reduced, to 59.32 s. This can be explained by the faster-moving obstacle: the agent experienced more collisions, but managed less convoluted paths.

Inaccurate sensor measurement
In this simulation, we investigate how the proposed method tolerates inaccuracy in sensor measurements. As in Cases A & B, at three different speeds ($v_o$ = 10, 50 or 100 cm/s), the output of the sensor simulator was deliberately corrupted by a Gaussian noise function with a mean ($\mu$) equal to $d_i$ and a standard deviation ($\sigma$) of $n \times \mu$, where n = 0, 0.1, 0.2, 0.3, 0.4, 0.5, and 0.6 (Ye et al., 2003). For each combination of n and $v_o$, the Q-values for CA were initially set to zero and the agent was trained for 10,000 episodes. After training, the agent was evaluated in the same environment 1,000 times for each n and $v_o$. From Table 5, for n < 0.2, none of the obstacle speeds caused a collision. For n ≥ 0.2, collisions began to appear. At low speed, the number of collisions can be kept small, with a worst case of 2.1%. For $v_o$ = 50 cm/s, the proportion of collision-free episodes was reduced to 95.4% at n = 0.6. For $v_o$ = 100 cm/s, it went down to 57.6%, i.e. almost half of the episodes had collisions. This is logical, as slow obstacles are easier to avoid than fast ones, and inaccurate sensor measurements make collisions harder to avoid. The mean path time generally increases as n increases, although minima appear at n = 0.2 for low speed, n = 0.1 for medium speed, and n = 0 for high speed. As in Case A, the mean path time is longer when the obstacle speed is low, because of more convoluted paths. As n increases, the agent learns to respond earlier to the inaccuracy, which results in shorter paths. However, for larger n, the agent travels extra steps to cope with the large sensor error, which results in even longer paths. The same applies when the obstacle speed is relatively high, except that the minima appear at smaller n because the agent responded earlier in those cases.
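The corrupted sensor model described above can be sketched as follows; the clamping to the sensor's detectable range is our assumption, and the function name is ours:

```python
import random

def noisy_distance(d_i, n, d_min=10.0, d_max=500.0, rng=random):
    """Gaussian-corrupted reading with mean mu = d_i and sigma = n*mu,
    clamped to the sensor's detectable range [d_min, d_max]."""
    reading = rng.gauss(d_i, n * d_i)
    return min(max(reading, d_min), d_max)

# n = 0 reproduces the exact distance; larger n spreads the readings
exact = noisy_distance(250.0, n=0.0)
```

With n = 0.6 a 250 cm target yields readings with a 150 cm standard deviation, which is the regime where the collision-free rate drops sharply for fast obstacles.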

Randomly moving obstacles and performance comparison
The purpose of this simulation is to evaluate the proposed method in an environment with up to 50 moving obstacles, and to compare it against another navigation method designed to work in such a complex environment. Obviously, methods that work on static environments (Pimenta et al., 2006; Belkhouche et al., 2006; Jing et al., 2006), those that consider relatively simple cases with very low obstacle density (Soo et al., 2005; Kunwar & Benhabib, 2006), or those that assume perfect communication among agents, e.g. robot soccer (Bruce & Veloso, 2006), are unsuitable. A suitable candidate is the artificial potential field method proposed by Ratering & Gini (R&G) (Ratering & Gini, 1995), which was simulated in a relatively complex environment with a high density of multiple moving obstacles. To enable the comparison, the obstacle size was reduced to 20 cm in diameter, and the obstacles were placed and moved randomly in speed and direction, as in (Ratering & Gini, 1995). The origin and destination of the agent were located at the lower-left and upper-right corners of the environment, respectively. Since the obstacles moved randomly, prediction was not used in the proposed method. Different obstacle densities D and obstacle velocities $v_o$ were studied, and the results are tabulated in Table 6. Each result shown in the table was derived from 100 episodes after training. The R&G method used a static potential field and dynamic potential fields to handle static and moving obstacles respectively, and their results are also depicted in Table 6.

For a given $v_{r,i}$, the number of collision-free episodes decreases as D increases for the proposed method. Obviously, more obstacles in a fixed area increase the chance of collision. This is also true when $v_{r,i}$ increases. In the extreme, only 69% of episodes are collision-free when $v_{r,i}$ = 0–100 cm/s and D = 0.0025 (the maximum). Compared with the R&G method, when D is very low, the differences in the number of collision-free episodes between the two methods are insignificant. However, when D is larger (>10 obstacles), the proposed method performed consistently better. This is also the case when $v_{r,i}$ increases. On average, the improvement in the number of collision-free episodes is 23.63%, and the best is slightly over 115% for the largest D. The average path time increases as D increases. This is to be expected, as there are more obstacles and more CA actions, resulting in longer path times. On the other hand, for small $v_{r,i}$ and large D, clustering of obstacles becomes a real possibility that can block the agent's path. This is confirmed by the large standard deviation of path time compared with larger $v_{r,i}$. Although the R&G method employed the adjustable hill extent method to deal with this issue, their average path times are in fact longer. When $v_{r,i}$ is large, obstacle clustering is unlikely, but the obstacles' speed makes more convoluted paths necessary to avoid them; the resultant path time is therefore longer, with a smaller standard deviation. Again, there is a minimum in average path time at medium $v_{r,i}$, depending on D. When compared with the R&G method, an average improvement of 20.6% is achieved.

Conclusion
In this chapter we have presented a multiple-goal reinforcement learning framework and illustrated it on a two-goal problem in autonomous vehicle navigation. In general, DAQL can be applied to any goal for which an environmental response is available, whereas QL suffices if the environmental response is not available or can be ignored. A proportional goal fusion function was used to maintain the balance between the two goals in this case. Extensive simulations have been carried out to evaluate the method's performance under different obstacle behaviors and sensing accuracies. The results showed that the proposed method is characterized by its ability to (1) deal with single obstacles at any speed and from any direction; (2) deal with two obstacles approaching from different directions; (3) cope with large sensor noise; and (4) navigate in environments with high obstacle density and high relative velocity. A detailed comparison of the proposed method with the R&G method reveals that its improvements in path time and in the number of collision-free episodes are substantial.

Acknowledgement
The authors would like to thank the anonymous referees for their thorough review of the paper and many constructive comments. The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. HKU7194/06E), and partially supported by a postgraduate studentship from the University of Hong Kong.

Fig. 4. Control variables of agent and the i th obstacle.

Fig. 12. Heading angles of O i and A.
Fig. 15. Simulation results of Cases A and B.
Fig. 16. Simulation results of Cases C and D.
Fig. 17. Simulation results of Cases E and F.
Fig. 18. Simulation results of Cases G and H.
Fig. 2. State diagram of the MDP model given that $s_t = s$, $s_{t+1} = s'$, and $a_t = a$.
Q-learning (QL) is one of the efficient methods for solving the RL problem through the action-value function in Eqt. (4). In QL, the agent chooses $a_t$ according to policy $\pi$ and the Q-values corresponding to state $s_t$. After performing action $a_t$ in state $s_t$ and making the transition to state $s_{t+1}$, it receives an immediate reward (or penalty) $r_{t+1}$. It then updates the Q-value for $a_t$ in $s_t$ using the Q-values of the new state $s_{t+1}$ and the reward $r_{t+1}$, as given by the update rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big] \qquad (5)$$

In other words, a state transition in a DE may be caused by the action taken by the agent, $a^1_t \in A^1(s_t)$, and a collective action taken by the other agents/obstacles, $a^2_t \in A^2(s_t)$, where $A^1(s_t)$ and $A^2(s_t)$ are the sets of actions available in $s_t$ for the agent and for the obstacles in the environment, respectively. Fig. 3 depicts a new MDP that reflects this relationship, where $a^2_t$ describes the action performed by an obstacle. The net state transition at each time step is the result of all the action pairs taken together.
Fig. 3. State diagram of the new MDP model given that $s_t = s$, $s_{t+1} = s'$, $a^1_t = a^1$, and $a^2_t = a^2$.

The seven parameters of the new MDP are: T, $s_t$, $s_{t+1}$, $a^1_t$, $a^2_t$, … Using the same technique as QL, the function $Q^*(s_t, a^1_t, a^2_t)$ can be updated continuously, which fulfils the purpose of RL; the QL-type update rule for the new MDP model is given in Eqt. (10).

The four parameters $d_i$, $\theta_i$, $d_{dest}$, and $\varphi$ are quantized into states. The state set for the relative location of the destination is $l_{dest} \in L_{dest}$, where the quantized distance index runs over $\{i \mid i = 0, 1, \dots, 11\}$ and the quantized angle set is $\Theta_q = \{j \mid j = 0, 1, \dots, 15\}$; there are altogether 192 states for $L_{dest}$. The state set for obstacle location is $s_i \in S_i$, with 160 states. The output actions are given by $a \in A$, where $A = \{(|v_a|, \theta_a) \mid |v_a| \in V_a,\ \theta_a \in \Theta_a\}$, $V_a = \{m \times v_{max}/5 \mid m = 0, 1, \dots, 5\}$, $\Theta_a = \{n\pi/8 \mid n = 0, 1, \dots, 15\}$, and $v_{max}$ is the maximum agent speed. For $|v_a| = 0$ the agent is at rest regardless of $\theta_a$, resulting in only 81 actions. For DAQL, we assume that obstacles have speed $v_o \in \Re^+$ and heading angle $\theta_o \in \Theta$, which are quantized to $a^2_i \in A_o$.

Let us define $\Delta d_{dest} = d_{dest,t-1} - d_{dest,t}$, where $d_{dest,t-1}$ is the distance between the agent and the destination at t−1, $d_{dest,t}$ is the distance at t, and the agent travels at $v_a$ from t−1 to t. If the agent performs a shortest-path maneuver, then $|v_a|T = \Delta d_{dest}$; otherwise $|v_a|T > \Delta d_{dest}$, and the worst case is when the agent has moved away from the destination, i.e., $\Delta d_{dest} = -|v_a|T$. Let us define $d_{extra} = |v_a|T - \Delta d_{dest}$. The normalized reward function of the agent is thus defined as:

$$r_{DS,t} = \frac{\Delta d_{dest} - d_{extra}}{|v_a|T} \qquad (22)$$
Here $q_i(s_{i,t}, a^1_t, a^2_{i,t})$ is the Q-value set due to the i-th obstacle; $s_{i,t}$ is the state of the i-th obstacle observed by the agent at time t; and $a^2_{i,t}$ is the action performed by the i-th obstacle at t. Since all M obstacles share a single set of Q-values, the Q-values are updated M times in one time step. As $a^2_t$ is not known at t, it has to be predicted. The prediction can be treated independently from RL, i.e. the agent predicts from the environment's historical information, or it can be based on concepts (rules learned from examples) and instances (pools of examples). To incorporate the predicted $a^2_t$, Eqt. (24) is modified by weighting each $q_i(s_{i,t}, a^1_t, a^2_{i,t})$ with the probability $p_{a^2_{i,t}}$ of the predicted action. The Q-values $Q_{DS}(a^1_t)$ and $Q_{CA}(a^1_t)$ are normalized.

Table 1 .
Summary of simulation parameters with ONE obstacle.

Table 2 .
Summary of simulation parameters with TWO obstacles.

Table 4 .
Simulation parameters with ONE obstacle at random speed.

Table 5 .
Robustness to sensor noise. Table 5 depicts the simulation summary.

Table 6 .
Cases of randomly moving obstacles in a fixed area. (Numbers in brackets show the results of the R&G method (Ratering & Gini, 1995).)