Adaptive Negotiation-rules Acquisition Methods in Decentralized AGV Transportation Systems by Reinforcement Learning with a State Space Filter

In this paper, we introduce an autonomous decentralized method for multiple Automated Guided Vehicles (AGVs). In the proposed system, each AGV, acting as an agent, computes its transportation route by referring to the static path information. Once potential collisions are detected, one of the two agents, chosen by a negotiation rule, modifies its route plan. The rules are improved by reinforcement learning with a state space filter, and the performance of the adaptive negotiation rules is then confirmed.


Introduction
Recently, the problems of planning and operation have been recognized as among the most important issues in production and logistics systems. In particular, transportation planning using Automated Guided Vehicles (AGVs) in steel production, semiconductor production, and warehousing systems has been widely studied from both theoretical and practical viewpoints 1,2. This paper focuses on the AGV routing problem 2. In this problem, all transportation requests are given, but a transportation route plan must be determined in which collisions among AGVs do not occur.
In this paper, we introduce an autonomous decentralized approach to the AGV routing problem 3. First, each AGV, acting as an agent, creates an optimal route that satisfies the requests assigned to it. Once potential collisions are detected, one of the two agents, chosen by a negotiation-rule, modifies its route. A set of negotiation-rules is used at every collision avoidance. In this paper, we propose a reinforcement learning (RL) 4 approach with a state space filter 5 for improving the negotiation-rules. Furthermore, the effectiveness of the proposed approach is discussed through a computational experiment.

AGV route planning problem
In this section, we describe the problem of routing AGVs to their destinations, under given rail networks and a given assignment of transportation requests, as an AGV route planning problem. In this problem, it is hard to find routing plans without collisions. The AGV route planning problem is defined as follows: N_V AGVs V_i (i = 1, …, N_V) are available in the rail system. The rail system is represented by N_N nodes N_j (j = 1, …, N_N) and N_A arcs A_k (k = 1, …, N_A). A node represents an area for loading and unloading points and for branching points of the rail. An arc represents a path for AGVs. V_i moves to the destination node D_l specified by the request R_l (l = 1, …, N_R) assigned to V_i. R_l is given to the system at the start time t_l^R, and the assignment of R_l to V_i is given. The following constraints should be taken into account in the planning process:
(i) All AGVs move synchronously.
(ii) The velocity of each AGV is constant.
(iii) Each AGV can travel only on arcs and turn only at nodes. The distance between nodes is constant.
(iv) Each arc has the width of one AGV; therefore, two AGVs cannot travel between the same two nodes in opposite directions simultaneously.
As for the evaluation of the routing plan, various criteria may be considered. We consider here the maximum completion time C_max:

C_max = max_l t_l^V,

where t_l^V represents the arrival time at the destination node D_l assigned to V_i.
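As a concrete illustration of the evaluation criterion, the maximum completion time is simply the latest arrival among all requests. The following minimal sketch uses hypothetical data; the function name and the sample times are not from the paper.

```python
# Minimal sketch of the C_max criterion: given the arrival time of each
# AGV at its assigned destination node D_l, the plan is scored by the
# latest arrival. Data below is illustrative.

def c_max(arrival_times):
    """arrival_times[l] is the arrival time t_l^V at destination D_l."""
    return max(arrival_times)

# Hypothetical plan: three requests completing at steps 5, 7, and 6.
print(c_max([5, 7, 6]))  # -> 7
```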

Algorithm based on negotiation
In this section, we introduce an autonomous decentralized planning method 3, in which each AGV, acting as an agent, repeatedly plans its own route and exchanges the route information with the others until there is no collision. The route of each AGV is calculated by Dijkstra's method 6 on the graph representation. Once potential collisions are detected, one of the two agents, chosen by a negotiation-rule, modifies its route. The whole route planning procedure is designed as follows:
(i) Initialization: Each AGV calculates its minimum route from the start point to the destination point.
(ii) Information Exchange: All AGVs within a certain distance d_F exchange their route information with each other.
(iii) Termination: If any potential collisions are detected up to a certain time step t_L ahead, go to stage (iv); otherwise the current set of routes is output and time proceeds one step.
(iv) Negotiation and Re-planning: One of the two AGVs involved in a potential collision is chosen by a negotiation-rule and re-plans its route in order to avoid the collision. Go back to stage (ii).
In stage (iv), a new route plan is obtained by re-planning with Dijkstra's method on the network graph, in such a way that the arc adjacent to the potential collision point is given one extra unit of weight.
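Stage (iv) can be sketched as follows: a standard Dijkstra search, plus a helper that adds one unit of weight to the arc adjacent to a detected collision before re-planning. The tiny rail network, the adjacency-list layout, and the function names are illustrative assumptions, not the paper's implementation.

```python
# Sketch of stage (iv): re-planning with Dijkstra's method after adding
# an extra unit of weight to the arc adjacent to a potential collision.
import heapq

def dijkstra(adj, src, dst):
    """adj: {node: [(neighbor, weight), ...]}. Returns (cost, path)."""
    pq = [(0, src, [src])]
    seen = set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in adj[node]:
            if nxt not in seen:
                heapq.heappush(pq, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

def penalize(adj, u, v, extra=1):
    """Add `extra` weight to arc (u, v) in both directions."""
    for a, b in ((u, v), (v, u)):
        adj[a] = [(n, w + extra if n == b else w) for n, w in adj[a]]

# Tiny example rail network: two equal-length routes from node 0 to node 3.
adj = {0: [(1, 1), (2, 1)], 1: [(0, 1), (3, 1)],
       2: [(0, 1), (3, 1)], 3: [(1, 1), (2, 1)]}
_, route = dijkstra(adj, 0, 3)
penalize(adj, route[0], route[1])   # a collision was detected on the first arc
_, new_route = dijkstra(adj, 0, 3)  # re-planning now avoids the penalized arc
```

After the penalty, the re-planned route takes the alternative branch even though both routes were originally of equal length.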
However, it is impossible to pass through the same node several times to avoid collisions using Dijkstra's method alone. In this route planning, it is effective to move backward into an adjacent node in order to avoid collisions, and for that reason it is necessary to pass through the same node several times.
Therefore, we enable each AGV to pass through the same node up to twice with Dijkstra's method by virtually adding one dummy node to each of the nodes, though it may also be effective to pass through the same node three or more times. Only one of the two AGVs re-plans its route in this procedure. Nishi et al. 2 have proposed agent-based AGV routing algorithms in which both of the two AGVs in collision modify their route plans separately. In their approach, both AGVs might move in the same direction to avoid the collision points, so that the new route plans are still infeasible. In contrast, our method is free from such oscillation phenomena.
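The dummy-node device can be sketched as a two-layer graph: each node also exists as a virtual copy, and arcs from the real layer may drop into the dummy layer but never return, so a plain single-visit Dijkstra search can visit each physical node at most twice. The layering and naming here are an illustrative reading of the paper's construction, not its exact implementation.

```python
# Sketch of the dummy-node device: each node j gets a virtual copy
# ('dup', j) with the same arcs. Arcs lead from the real layer into the
# dummy layer but not back, so a search that visits every graph node at
# most once can visit every physical node at most twice.

def add_dummies(adj):
    """adj: {node: [(neighbor, weight), ...]}. Returns the layered graph."""
    out = {}
    for j, edges in adj.items():
        # Real layer: stay in the real layer, or drop into the dummy layer.
        out[j] = list(edges) + [(("dup", n), w) for n, w in edges]
        # Dummy layer: one-way; no arc returns to the real layer.
        out[("dup", j)] = [(("dup", n), w) for n, w in edges]
    return out

g = add_dummies({0: [(1, 1)], 1: [(0, 1)]})
```

A route then reaches destination D_l if it arrives at either D_l or its dummy copy; a path such as 0 → 1 → ('dup', 0) corresponds to backing into node 0 a second time.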

Structure of negotiation rules
The AGV which re-plans its own route is determined by a negotiation rule. A negotiation rule consists of a condition-part and an action-part. The rule which matches the difference between the states of the two AGVs under negotiation is selected from the set of rules. As for the state parameters, we introduce three kinds of variables to represent the whole state space S:
s_O: the amount of incremental change in the route length after the re-creation;
s_R: the route length from the collision point to the destination point;
s_F: the number of nodes with three or more arcs.
Among the state parameters, s_F represents the flexibility of route creation. S is divided into n_S portions S_u (u = 1, …, n_S). The action-part expresses the index of the AGV which re-plans its route.
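A rule set of this form can be sketched as a lookup table: the condition-part maps the differences in (s_O, s_R, s_F) between the two AGVs to one of n_S partitions, and the action-part stores which AGV (index 0 or 1) must re-plan. The binning thresholds and the example rule table below are illustrative assumptions.

```python
# Sketch of a negotiation-rule set: the condition-part partitions the
# 3-dimensional difference of (s_O, s_R, s_F) between the two AGVs, and
# the action-part names which AGV (0 or 1) re-plans its route.

def condition_index(ds_o, ds_r, ds_f, thresholds=(0, 0, 0)):
    """Map signed state differences to one of 2*2*2 = 8 partitions S_u."""
    bits = [int(d > t) for d, t in zip((ds_o, ds_r, ds_f), thresholds)]
    return bits[0] * 4 + bits[1] * 2 + bits[2]

# action_part[u] = index of the AGV that re-plans in partition u
# (hypothetical rule set; learning would adjust these entries).
action_part = [0, 0, 1, 0, 1, 1, 0, 1]
replanner = action_part[condition_index(ds_o=2, ds_r=-1, ds_f=3)]
```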

A Negotiation-rules Acquisition Method by Reinforcement Learning
It is hard to design good negotiation-rules of the form shown in Sec. 3 in advance, without prior knowledge of the target system. In this paper, the negotiation-rules are improved autonomously by reinforcement learning (RL) 4. Specifically, we use Q-learning (QL) 7, one of the typical RL methods, where the state space and the action space are constructed from the condition-parts of the negotiation-rules and from the action-parts (namely, 0 or 1), respectively. It is necessary to design a positive reinforcement signal (reward) based on an evaluation of the route plans of all AGVs. In this paper, the reward is given to the RL agent only when all AGVs arrive at their destination points, which allows the maximum completion time C_max to be evaluated.
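The setup above amounts to standard tabular Q-learning over the rule partitions with a delayed reward. The following minimal sketch uses the optimistic initial value of 5.0 mentioned later in the paper; the learning-rate and discount values are illustrative assumptions, not the paper's settings.

```python
# Minimal Q-learning sketch for the negotiation-rule agent: states are
# the condition-part partitions, actions are {0, 1} (which AGV re-plans),
# and the reward arrives only when all AGVs reach their destinations.
ALPHA, GAMMA = 0.1, 0.9  # illustrative learning rate and discount

q = {}  # q[(state, action)] -> value

def q_val(s, a):
    return q.get((s, a), 5.0)  # optimistic initial Q-value, as in the paper

def update(s, a, r, s_next):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma*max_a' Q(s',a') - Q(s,a))."""
    target = r + GAMMA * max(q_val(s_next, 0), q_val(s_next, 1))
    q[(s, a)] = q_val(s, a) + ALPHA * (target - q_val(s, a))

# Episode end: all AGVs arrived, so the agent receives reward r = 10.
update(s=3, a=1, r=10.0, s_next=0)
```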

Reinforcement Learning with a State Space Filter
We have proposed a state space filter based on the entropy defined by the action selection probability distribution in a state 5.
The entropy H(s) of the action selection probability distribution using Boltzmann selection in a state s is defined by

H(s) = - Σ_{a∈A} π(a|s) log π(a|s) / log|A|,

where π(a|s) specifies the probability of taking action a in state s, A is the action space, and |A| is the number of available actions.
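The normalized entropy can be computed directly from the Q-values via the Boltzmann (softmax) probabilities, so that H(s) lies in [0, 1]: equal Q-values give maximal uncertainty, a dominant action gives a small value. The temperature value below is an illustrative assumption.

```python
# Sketch of the filter's entropy index: Boltzmann (softmax) action
# probabilities in a state, then entropy normalized by log|A|.
import math

def boltzmann(q_values, tau=1.0):
    """Softmax action probabilities at temperature tau."""
    exps = [math.exp(v / tau) for v in q_values]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(q_values, tau=1.0):
    """Normalized entropy H(s) in [0, 1] of the Boltzmann distribution."""
    probs = boltzmann(q_values, tau)
    return -sum(p * math.log(p, len(probs)) for p in probs if p > 0)

print(entropy([1.0, 1.0]))        # -> 1.0 (maximal uncertainty)
print(entropy([9.0, 0.0]) < 0.1)  # -> True (one action clearly dominates)
```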
The state space filter is adjusted by treating this entropy H(s) as an index of the correctness of the state aggregation in state s. In particular, when the input state space is mapped to a coarsely digitized inner state space, a perceptual aliasing problem occurs; that is, the action which the agent should select cannot be identified clearly. Thus, the entropy may not become small in a state that should be divided. In this paper, the sufficiency of the number of learning opportunities is judged using a threshold value θ_L. Therefore, if the entropy does not become smaller than a threshold value θ_H although the number of learning opportunities is sufficient, the state space filter is adjusted by dividing the state, on the grounds that a perceptual aliasing problem has occurred.
Similarly, if the entropy is smaller than θ_H both in a state s and in the different state to which a transited input state s' is mapped, and the representative actions in the two states are the same, the state space filter is adjusted by integrating the states, on the grounds that the state space is divided too finely. In addition, if there exists a state s to which no input state has been mapped, not even once, during a certain period θ_t, then the state space filter is adjusted by integrating s with a state adjacent to s.
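The split/merge policy above can be sketched as a per-state decision rule. All threshold values, the visit bookkeeping, and the representative-action table below are illustrative assumptions; the sketch only shows the decision, not the actual re-partitioning of the filter.

```python
# Sketch of the state-space-filter adjustment: a state is split when its
# entropy stays above theta_H despite enough learning opportunities
# (more than theta_L visits), and merged with a neighboring state when
# both entropies are below theta_H and their representative (greedy)
# actions agree. Thresholds here are illustrative.
THETA_H, THETA_L = 0.6, 50

greedy = {0: 1, 1: 1}  # hypothetical representative action per state

def adjust(state, visits, entropy, neighbor=None):
    """Return 'split', 'merge', or 'keep' for one state of the filter."""
    if visits[state] > THETA_L and entropy[state] > THETA_H:
        return "split"   # perceptual aliasing suspected: divide the state
    if (neighbor is not None
            and entropy[state] < THETA_H and entropy[neighbor] < THETA_H
            and greedy[state] == greedy[neighbor]):
        return "merge"   # over-divided: integrate the two states
    return "keep"

print(adjust(0, visits={0: 80, 1: 80}, entropy={0: 0.9, 1: 0.2}))  # -> split
print(adjust(1, visits={0: 80, 1: 80}, entropy={0: 0.2, 1: 0.3},
             neighbor=0))                                          # -> merge
```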

Computational Example
The effectiveness of the proposed approach is investigated in this section. We have prepared an AGV route planning problem (hereafter called "N18"): N_N = 18, N_A = 26, N_V = 8. One step is defined as the time required to move to an adjacent node. The period from when all AGVs are located at their start points to when all AGVs arrive at their destination points, or 100 steps have passed, is defined as one episode. All requests occur at time zero (t_l^R = 0 for all l), the detectable time step is t_L = 2, and the exchangeable distance is d_F = 2 (Manhattan distance). The least number of steps required for all AGVs to arrive at their destination points is 7 in N18.

Fig. 1. A framework of rule learning using reinforcement learning.
To apply QL with a state space filter, we use a 3-dimensional initial state space constructed from the differences between the state elements of the two AGVs, designed so that the state space is evenly divided into 2×2×2 partitions (hereafter called "2-2-2"). For comparison, we also prepare three state space constructions with a 1-dimensional initial state space, each constructed from the difference in a single state element and evenly divided into 2 partitions (hereafter called "2-1-1", "1-2-1" and "1-1-2"), and one construction whose initial state space has size 1 (hereafter called "1-1-1"). A reward r = 10 is given to the agent only when all AGVs arrive at their destination points, a punishment r = -15 is given only when the weight of an arc exceeds 100, and r = 0 is given at any other time.
Computational experiments have been carried out with the parameters shown in Table 1. In addition, all initial Q-values are set at 5.0 as optimistic initial values. The average number of steps required for all AGVs to arrive at their destination points in N18 was observed during learning over 20 simulations for each initial state space construction, as described in Fig. 5.
It can be seen from Fig. 5 that all initial state space constructions could acquire feasible route plans with nearly the least number of required steps.

Conclusion
In this paper, we dealt with the AGV route planning problem and introduced an autonomous decentralized route planning method. We proposed a reinforcement learning approach with a state space filter for improving the negotiation-rules. Through a computational experiment, it was observed that the proposed approach can acquire feasible and efficient rule-sets for the problem. Our future projects include evaluating the effectiveness of the proposed approach under the condition that more than one request is assigned to each AGV.

Table 1. Parameters for Q-learning with a state space filter.

Fig. 5. Required steps for task N18 by QLs with a state space filter.