BAT Q-LEARNING ALGORITHM

The cooperative Q-learning approach allows multiple learners to learn independently and then share their Q-values with each other using a Q-value sharing strategy. A major problem with this approach is that the solutions of the learners may not converge to optimality, because the optimal Q-values may not be found. Another problem is that some cooperative algorithms perform very well on single-task problems, but quite poorly on multi-task problems. This paper proposes a new cooperative Q-learning algorithm called the Bat Q-learning algorithm (BQ-learning) that implements a Q-value sharing strategy based on the Bat algorithm. The Bat algorithm is a powerful optimization algorithm that increases the possibility of finding the optimal Q-values by tuning its parameters to balance the exploration and exploitation of actions. The BQ-learning algorithm was tested using two problems: the shortest path problem (single-task problem) and the taxi problem (multi-task problem). The experimental results suggest that BQ-learning performs better than single-agent Q-learning and some well-known cooperative Q-learning algorithms.


INTRODUCTION
Q-learning is a well-known reinforcement learning (RL) algorithm that allows machines and software agents to develop an ideal behavior within a specific environment based on trial and error [1]-[3]. A Q-learning agent learns how to behave by trying actions to determine how to maximize some reward. This is usually accomplished using temporal difference learning to find a mapping from state-action pairs to quality values (Q-values). The Q-value of a state-action pair $(s, a)$ represents the expected utility of taking action $a$ in state $s$ and following a fixed policy thereafter. The Q-values are normally calculated using a utility function known as a Q-function and are usually stored in a data structure known as a Q-table.
Cooperation among several reinforcement learners in the same multi-agent environment provides an opportunity for the learners to cooperatively solve a learning problem. Such an approach to RL, which is called cooperative RL, is increasingly used by research labs around the world to solve real-world problems, such as robot control and autonomous navigation [4], [5]. This is because cooperative reinforcement learners can learn and converge faster than independent reinforcement learners by sharing information (e.g., Q-values, episodes, policies) [3], [6]-[8]. One such example is cooperative Q-learning, in which several learners share their Q-values with each other in order to accelerate their convergence to optimal solutions [9], [10].
Cooperative Q-learning is normally broken into two stages. The first stage is known as the independent learning stage, in which each reinforcement learner individually applies Q-learning to enhance its own solution. In the second stage, the learning by interaction stage, the learners share their Q-values based on a sharing strategy. A Q-value sharing strategy defines how the independent learners can share their Q-values with each other to obtain new Q-tables.
This strategy can only be applied when the agents have Q-tables with a similar structure.

BACKGROUND INFORMATION
This section briefly summarizes some of the underlying concepts of Q-learning and Bat algorithms.

Q-learning
The problem model of Q-learning is commonly represented as a Markov Decision Process (MDP) [1]. In this model, invalid transitions are explicitly marked as such. The immediate expected reward for executing a valid transition from state $s_x$ with action $a_z$ is the deterministic reward $R(s_x, a_z)$ [3]. It is important to note that the application of Q-learning to stochastic MDPs is beyond the scope of this paper.
A learner is normally required to apply Q-learning to an MDP for a number of learning episodes in order to learn which action is optimal for each state. A learning episode is the time the agent takes to reach the goal state starting from an initial selected state. Reaching the goal state requires the learner to apply a simple value iteration procedure during each learning episode.
This procedure starts when the learner uses its selection policy to select an action $a$ from the set of possible actions $A$ of the current state $s$. The learner then receives a reward $R(s, a)$ and observes a new state $s'$ of the environment. Subsequently, the agent uses this information to update its Q-table using the following Q-function:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \tag{1}$$
where $\alpha$ is the learning rate and $\gamma$ is the discount factor. Upon successful convergence to a solution, the output of Q-learning is the optimal Q-function, from which an optimal policy $\pi^* : S \rightarrow A$ (i.e., a mapping from states to actions that maximizes the total discounted reward) can be derived using a greedy selection method.
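The value iteration step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the Q-table is assumed to be a dictionary keyed by (state, action) pairs, the update is the standard temporal-difference form of Q-learning, and the names `q_update` and `greedy_policy` are hypothetical.

```python
from collections import defaultdict


def q_update(Q, s, a, reward, s_next, actions, alpha=0.01, gamma=0.9):
    """Apply one Q-learning update to Q[(s, a)] and return the new value.

    Q is a dict-like Q-table; alpha is the learning rate and gamma the
    discount factor (the default values are illustrative assumptions).
    """
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


def greedy_policy(Q, states, actions):
    """Derive a policy that picks the action with the highest Q-value."""
    return {s: max(actions, key=lambda b: Q[(s, b)]) for s in states}


# Tiny usage example: a single transition from state 0 to state 1.
Q = defaultdict(float)          # all Q-values start at zero
actions = ["left", "right"]
q_update(Q, s=0, a="right", reward=1.0, s_next=1,
         actions=actions, alpha=0.5, gamma=0.9)
policy = greedy_policy(Q, states=[0], actions=actions)
```

After the single update, the Q-value of taking "right" in state 0 has moved halfway toward the received reward, so the greedy policy for state 0 selects "right".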

Bat Algorithm
Microbats are small bats that usually eat insects. An amazing feature of this species is that they rely on a special type of sonar called echolocation to locate their prey. Microbats make loud sound pulses as they fly. When these pulses hit an object, they produce echoes that return to the ears of the bats. The time required for the sound waves to return to the microbat is used to calculate the distance of an object.
The Bat algorithm (BA) is a metaheuristic method that is inspired by the echolocation behavior of microbats [20]. This algorithm combines the advantages of existing metaheuristic algorithms, such as particle swarm optimization (PSO) and intensive local search, in one algorithm. The research works of Yang and Gandomi [21] and Yang [20] suggest that BA performs better than many existing metaheuristic algorithms, such as PSO, intensive local search, harmony search and the genetic algorithm.
The following simplifications of the main characteristics of the echolocation process were followed in order to simulate BA as a problem solver [22]:
• Microbats know the difference between prey and other objects and use echolocation to calculate the distance of their prey.
[Figure 1: pseudocode of the Bat algorithm [20]; its final step ranks the bats and finds the current best $x^*$.]
In the standard formulation of BA [20], the frequency $f_i$, the velocity $v_i$ and the position $x_i$ of each bat $i$ are updated at each iteration $t$ as follows:
$$f_i = f_{min} + (f_{max} - f_{min})\,\beta$$
$$v_i^{t+1} = v_i^{t} + (x_i^{t} - x^*)\, f_i$$
$$x_i^{t+1} = x_i^{t} + v_i^{t+1}$$
where $\beta \in [0, 1]$ is a random parameter extracted from a uniform distribution and $x^*$ is the current best position among the positions of all bats.
After calculating $x^*$, a local solution can be generated randomly for each bat based on the following equation:
$$x_{new} = x_{old} + \epsilon A^*$$
where $\epsilon \in [-1, 1]$ is a tuning random parameter and $A^*$ is the average loudness of all bats at instant $t$. The update equations of the velocities, positions and frequencies of the bats are similar to the update equations of the velocities and positions of the particles in PSO (Section 3). In fact, BA can be considered as a combination of PSO and intensive local search that aims to balance the exploration and exploitation of solutions.
In nature, when a microbat finds prey, it usually decreases the loudness and increases the pulse emission rate of its sound. This aspect is simplified in BA by assuming that $A_{min} = 0$ and $A_0 = 1$. The assumption that $A_{min} = 0$ indicates that a bat has located prey and has temporarily stopped emitting any sound. At the beginning of the simulation process of BA, positive random values around zero are generated and assigned to the pulse emission rate of each bat, while positive random values around 1 are generated and assigned to the loudness of each bat.
At each iteration of BA, a local search for a new best solution around one of the best solutions $x^*$ (line 8) is triggered when the pulse rate is less than a randomly generated number $rand \in [0, 1]$. Then, each time $x^*$ is improved (line 14), the pulse rate $r_i$ is increased and the loudness $A_i$ is decreased as follows:
$$A_i^{t+1} = \alpha A_i^{t}$$
$$r_i^{t+1} = r_i^{0} \left[ 1 - \exp(-\gamma t) \right]$$
where $\alpha$ and $\gamma$ are constant parameters that can be determined experimentally. As a general rule, $0 < \alpha < 1$ and $\gamma > 0$ guarantee that the loudness will decrease and the pulse rate will increase as new best solutions are discovered.
A new solution $x^*$ is accepted if it satisfies two conditions. First, the estimation $fun(x^*)$ of the new $x^*$ must be better than the estimation of a randomly selected bat's solution. Second, the value of $rand$ should be less than the average loudness of all the solutions.
The purpose of setting the loudness to a value near one and the pulse rate to a value near zero is to encourage the exploration of new solutions around the current best solutions. This is because a pulse rate near zero is expected, with high probability, to be less than the randomly generated number $rand \in [0, 1]$ (line 8). Consequently, there is a high probability that a new solution will be generated around one of the best solutions (lines 8 to 11). As the pulse rates are increased each time a better solution is accepted (line 14), the probability of generating a new solution around one of the best solutions decreases (line 8).
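The interplay between frequency, velocity, loudness and pulse rate described in this section can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the pseudocode of Figure 1: the objective is minimized, the search range $[-5, 5]$, the parameter defaults and the pulse-rate limits are arbitrary choices, a candidate is compared with the bat's own current solution rather than with a randomly selected bat, and the names `sphere` and `bat_algorithm` are hypothetical.

```python
import math
import random


def sphere(x):
    """Toy objective to minimize: the sum of squared coordinates."""
    return sum(c * c for c in x)


def bat_algorithm(objective, dim, n_bats=10, iterations=200,
                  f_min=0.0, f_max=1.0, alpha=0.9, gamma=0.9, seed=0):
    rng = random.Random(seed)
    x = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(n_bats)]
    v = [[0.0] * dim for _ in range(n_bats)]
    fit = [objective(p) for p in x]
    loud = [1.0] * n_bats                 # loudness starts at 1
    rate = [0.0] * n_bats                 # pulse rate starts at 0
    rate_limit = [rng.uniform(0.5, 1.0) for _ in range(n_bats)]  # assumed limits
    i_best = min(range(n_bats), key=lambda i: fit[i])
    best, best_fit = list(x[i_best]), fit[i_best]

    for t in range(1, iterations + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Global move: frequency-scaled velocity update, then position update.
            f = f_min + (f_max - f_min) * rng.random()
            v[i] = [vi + (xi - bi) * f for vi, xi, bi in zip(v[i], x[i], best)]
            cand = [xi + vi for xi, vi in zip(x[i], v[i])]
            # Local search around the best solution while the pulse rate is low.
            if rng.random() > rate[i]:
                cand = [bi + rng.uniform(-1.0, 1.0) * mean_loud for bi in best]
            cand_fit = objective(cand)
            # Accept an improving candidate with a loudness-dependent probability.
            if cand_fit < fit[i] and rng.random() < mean_loud:
                x[i], fit[i] = cand, cand_fit
                loud[i] *= alpha                                   # quieter
                rate[i] = rate_limit[i] * (1.0 - math.exp(-gamma * t))  # more pulses
            if fit[i] < best_fit:
                best, best_fit = list(x[i]), fit[i]
    return best, best_fit


best, best_fit = bat_algorithm(sphere, dim=2)
```

Because the loudness shrinks and the pulse rate grows with every accepted improvement, the sketch gradually shifts from local search around the best solution toward the frequency-driven moves, which mirrors the exploration and exploitation balance discussed above.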

RELATED WORK
This section provides an overview of well-known cooperative Q-learning algorithms, with a special focus on the second learning stage of these algorithms.
Iima and Kuroe [11]-[13] proposed three cooperative Q-learning algorithms (BEST-Q, AVE-Q and PSO-Q) that allow multiple learners to share their Q-values after each round of independent learning. Each one of these algorithms evaluates its Q-values during the independent learning stage using an evaluation method that approximates the rewards [6], [13], [14]. This method evaluates each state-action pair $(s, a)$ by calculating the sum $E(s, a)$ of the discounted rewards used to update its Q-value during an episode (independent learning stage). Discounting the rewards is important to increase the weight of rewards received toward the end of the episode, because the Q-values change continuously during the episode. At the end of the independent learning stage, each learner $i$ calculates $E_i(s, a)$ as follows:
$$E_i(s, a) = \sum_{k=1}^{n} \gamma^{\,n-k} R_i(s, a) \tag{9}$$
where $n$ denotes the number of times the state-action pair $(s, a)$ has been updated by agent $i$ during the episode, $R_i(s, a)$ is the reward received for performing action $a$ at state $s$ and $\gamma$ is the discount parameter. The parameter $\gamma$ is the same discount factor used in Equation 1. This parameter is used in Equation 9 to balance the rewards received at the beginning of the episode against the rewards received at the end of it.
In BEST-Q, the superior Q-values are extracted from the Q-tables of all of the learners and then copied to the Q-table of each agent. According to this description, an agent $i$ updates $Q_i(s, a)$ as follows:
$$Q_i(s, a) \leftarrow Q^{best}(s, a)$$
In the above equation, $Q^{best}(s, a)$ is the Q-value of $(s, a)$ held by the learner with the best evaluation $E(s, a)$, and the update is applied for all agents. The main disadvantage of BEST-Q is that it might not find the optimal Q-values, because the Q-tables of all of the learners become the same after each update. As a result, the diversity of the Q-values is negatively affected [11].
AVE-Q is a modification of BEST-Q that retains the diversity of each agent's Q-values after the learning by interaction stage. In this algorithm, the Q-values of learner $i$ are updated by averaging each Q-value in the learner's Q-table with its corresponding best Q-value for all $(s, a)$ as follows:
$$Q_i(s, a) \leftarrow \frac{Q_i(s, a) + Q^{best}(s, a)}{2}$$
In effect, AVE-Q moves each Q-value at the interaction stage to the midpoint between the agent's Q-value and the corresponding best value without investigating the quality of the agent's Q-values. As a consequence, AVE-Q may produce an incorrect policy, because it does not remove the bad Q-values at the interaction stage [3].
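The evaluation and the two sharing rules described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: Q-tables and evaluation tables are dictionaries with identical keys, the evaluation weights the $k$-th of $n$ rewards by $\gamma^{n-k}$ so that later rewards count more, a higher evaluation is treated as better, and the function names are hypothetical.

```python
def evaluate(rewards, gamma=0.9):
    """Discounted sum of the rewards of one state-action pair.

    The k-th of n rewards is weighted by gamma**(n - k), so rewards
    received near the end of the episode weigh more.
    """
    n = len(rewards)
    return sum(gamma ** (n - k) * r for k, r in enumerate(rewards, start=1))


def best_q(q_tables, evaluations):
    """For each key, take the Q-value of the learner with the best evaluation."""
    best = {}
    for key in q_tables[0]:
        winner = max(range(len(q_tables)), key=lambda i: evaluations[i][key])
        best[key] = q_tables[winner][key]
    return best


def share_best_q(q_tables, evaluations):
    """BEST-Q: every learner copies the best Q-values."""
    best = best_q(q_tables, evaluations)
    return [dict(best) for _ in q_tables]


def share_ave_q(q_tables, evaluations):
    """AVE-Q: every learner averages its own Q-value with the best one."""
    best = best_q(q_tables, evaluations)
    return [{k: (q[k] + best[k]) / 2 for k in q} for q in q_tables]


# Two learners, one state-action pair.
key = ("s0", "up")
q_tables = [{key: 2.0}, {key: 4.0}]
evaluations = [{key: 1.0}, {key: 3.0}]
after_best = share_best_q(q_tables, evaluations)
after_ave = share_ave_q(q_tables, evaluations)
```

In this toy case BEST-Q makes both tables identical, while AVE-Q keeps the two learners distinct, which illustrates the diversity argument made for AVE-Q.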
The Particle Swarm Optimization (PSO) algorithm is a powerful metaheuristic method that attempts to iteratively optimize a solution with respect to a particular measure [23]. An optimization problem can be solved with PSO by moving the candidate solutions (particles) in the search space based on their positions and velocities. The movement of a particle is controlled by the particle's local best position and directed toward the global best positions in the search space. The global best positions are the best positions found by all of the particles after each iteration of the algorithm.
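PSO-Q uses PSO at its second learning stage as a Q-value sharing method. In this method, the particles are the Q-values and the quality measure is the Q-function. The Q-table of each learner is updated based on the velocities and positions of the Q-values as follows [12]:
$$V_i(s, a) \leftarrow W V_i(s, a) + C_1 R_1 \left[ P_i(s, a) - Q_i(s, a) \right] + C_2 R_2 \left[ G(s, a) - Q_i(s, a) \right]$$
$$Q_i(s, a) \leftarrow Q_i(s, a) + V_i(s, a)$$
where $V_i(s, a)$ is the velocity of the Q-value, $P_i(s, a)$ is the best Q-value found so far by learner $i$, $G(s, a)$ is the best Q-value found so far by all of the learners, $W$, $C_1$ and $C_2$ are weight parameters, and $R_1$ and $R_2$ are uniform random numbers in $[0, 1]$.
Two issues should be taken into consideration when applying PSO-Q to a specific problem. First, determining suitable values for the parameters of PSO-Q usually requires multiple simulations to ensure that PSO-Q will perform well. Second, there is no guarantee that PSO-Q will search outside the surroundings of the best Q-value for each possible combination of states and actions for all agents.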
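A single PSO-style update of one Q-value can be sketched in Python as follows. This is only an illustration of the general PSO velocity and position update applied to a Q-value: the function name `pso_q_update`, the argument names and the default weights are hypothetical and are not taken from [12].

```python
import random


def pso_q_update(q, v, personal_best, global_best,
                 w=0.0, c1=1.0, c2=1.0, rng=random):
    """Return the updated (Q-value, velocity) pair for one state-action pair.

    w, c1 and c2 are weight parameters (illustrative defaults); r1 and r2
    are drawn uniformly from [0, 1) at every call.
    """
    r1, r2 = rng.random(), rng.random()
    v_new = w * v + c1 * r1 * (personal_best - q) + c2 * r2 * (global_best - q)
    return q + v_new, v_new


# With w = 0 and c1 = 0, the Q-value moves part of the way toward the global best.
q_new, v_new = pso_q_update(q=0.0, v=0.0, personal_best=0.0, global_best=2.0,
                            w=0.0, c1=0.0, c2=1.0, rng=random.Random(1))
```

The example shows why the search stays in the neighborhood of the best known values: each move is a random fraction of the distance to the personal and global bests.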
Ahmadabadi and Asadpour [18] proposed a cooperative Q-learning algorithm called Weighted Strategy Sharing (WSS). In this algorithm, each learner learns from its peers by following a two-step learning process. First, each learner assigns expertness values to the Q-tables of the other learners according to their relative expertness. Second, each learner updates its own Q-table by calculating the weighted average of the Q-values of the learners' Q-tables as follows:
$$Q_i^{new}(s, a) \leftarrow \sum_{j=1}^{n} W_{ij}\, Q_j^{old}(s, a)$$
where $W_{ij}$ is the weight assigned by learner $i$ to learner $j$'s expertness.
An expertness value can be evaluated using one of many expertness measures that have similar outcomes [18]. One such measure is the Normal measure (Nrm), which calculates the expertness of a learner ($xr$) by finding the sum of the rewards that the learner has obtained during the previous independent learning stage:
$$xr_i = \sum_{t} r_i(t)$$
where $r_i(t)$ is the reward that learner $i$ obtains at instant $t$. Based on the output of the above formula, learner $i$ can assign a weight to the knowledge of learner $j$ by taking into account the expertness of all learners as follows:
$$W_{ij} = \frac{xr_j}{\sum_{k=1}^{n} xr_k}$$
where $n$ is the number of learners and $xr_k$ is the expertness of learner $k$ for $k = 1, \ldots, n$.
A problem with WSS is that it might not converge to optimality when the shared Q-values are extreme, because these values will distort the average Q-value [9].
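The two steps of WSS can be sketched in Python as follows. This is a minimal sketch under stated assumptions: expertness is the plain sum of rewards (the Nrm measure), weights are obtained by normalizing over all learners, expertness values are assumed to be non-negative, and the uniform fallback for a zero total as well as the function names are hypothetical additions.

```python
def nrm_expertness(rewards):
    """Nrm measure: the sum of rewards collected while learning independently."""
    return sum(rewards)


def wss_weights(expertness):
    """Normalize expertness values into weights that sum to one."""
    total = sum(expertness)
    if total == 0:
        # Assumed fallback: treat all learners as equally expert.
        return [1.0 / len(expertness)] * len(expertness)
    return [e / total for e in expertness]


def wss_share(q_tables, weights):
    """Return the weighted average of the learners' Q-tables."""
    return {key: sum(w * q[key] for w, q in zip(weights, q_tables))
            for key in q_tables[0]}


key = ("s0", "up")
expertness = [nrm_expertness([1, 0, 0]), nrm_expertness([1, 1, 1])]
weights = wss_weights(expertness)
shared = wss_share([{key: 0.0}, {key: 4.0}], weights)
```

The more expert learner dominates the average here, which also shows how a single extreme Q-value from a highly weighted learner can pull the shared value away from the others.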
Abed-alguni et al. [9] suggested a new cooperative Q-learning algorithm called average aggregation Q-learning, which combines WSS, AVE-Q, BEST-Q and PSO-Q into one algorithm in order to reduce the instability in the performance of these algorithms across different problems. In this algorithm, each agent improves its Q-values by averaging the Q-values that result from applying the WSS, BEST-Q, AVE-Q and PSO-Q algorithms. Accordingly, each agent $i$ updates its Q-values as follows:
$$Q_i(s, a) \leftarrow \frac{Q_i^{WSS}(s, a) + Q_i^{BEST\text{-}Q}(s, a) + Q_i^{AVE\text{-}Q}(s, a) + Q_i^{PSO\text{-}Q}(s, a)}{4}$$
where $i$ is the learner's identification number and the denominator is the number of the combined algorithms.
Although average aggregation Q-learning solves the variability in performance for four famous cooperative Q-learning algorithms, it requires heavy computations to do so, because it mainly depends on the results of the other cooperative Q-learning algorithms.
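The aggregation step itself is a plain element-wise average, as the following Python sketch illustrates. The sketch assumes that the four candidate Q-tables of a learner have already been produced by the underlying sharing strategies and share the same keys; the function name `average_aggregation` and the numbers are hypothetical.

```python
def average_aggregation(candidate_tables):
    """Average the Q-tables produced by several sharing strategies."""
    count = len(candidate_tables)
    return {key: sum(table[key] for table in candidate_tables) / count
            for key in candidate_tables[0]}


key = ("s0", "up")
# Hypothetical outputs of WSS, BEST-Q, AVE-Q and PSO-Q for one learner.
candidates = [{key: 1.0}, {key: 2.0}, {key: 3.0}, {key: 6.0}]
aggregated = average_aggregation(candidates)
```

The cost noted above comes from producing the four candidate tables, since every one of the combined algorithms has to be run before the average can be taken.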
In conclusion, there is no guarantee that the algorithms discussed in this section will converge to optimal solutions. Moreover, none of these algorithms has a stable performance when applied to various learning problems [9]. The next section presents the BQ-learning algorithm, which attempts to solve these problems.

BQ-LEARNING
The BQ-learning algorithm comprises two repetitive sequential learning stages:
• First stage (individual learning): standard Q-learning is used to improve the Q-values $Q_i(s, a)$ of each learner $i$.
• Second stage (learning by interaction): the Q-values $Q_i(s, a)$ are updated for all of the learners by interaction between the learners based on the bat Q-value sharing strategy described in Figure 3.
Figure 2 shows the BQ-learning algorithm. At the beginning of BQ-learning, the number of learners $n$ and the total number of episodes of BQ-learning $p$ should be specified. Also, the number of learning episodes $m_i$ that each learner $i$ performs during the first learning stage of BQ-learning should be specified. In addition, the Q-values and Q-value evaluations of each learner are initialized to zero (lines 8 to 10). That is, $Q_i(s, a) = 0$ and $E_i(s, a) = 0$ for every learner $i$ and every state-action pair $(s, a)$.

First Stage of BQ-learning
In the first learning stage, each learner $i$ independently performs $m_i$ episodes of Q-learning to improve its own Q-values.

Second Stage of BQ-learning
Line 17 in Figure 2 represents the second learning stage of BQ-learning, which is described in detail in Figure 3. It is important to keep in mind that the second learning stage of BQ-learning is what really distinguishes it from the other cooperative Q-learning algorithms described in Section 3. Figure 3 shows the flow of the proposed Q-value sharing strategy that is based on the Bat algorithm. In Figure 3, the Q-values represent the locations of the bats (line 2), the velocity $V(s, a)$ of a Q-value is the rate at which it changes (line 5) and the objective function is the evaluation function $E(s, a)$ of the Q-values (line 12).
Line 14 in Figure 3 accepts a new Q-value only when two conditions are fulfilled, the second of which is that the value of $rand$ should be less than the average loudness of $(s, a)$ of all the learners. Fulfilling these conditions also implies that the pulse rate $r_i(s, a)$ should be increased and the loudness $A_i(s, a)$ should be decreased as follows:
$$A_i^{t+1}(s, a) = \alpha A_i^{t}(s, a)$$
$$r_i^{t+1}(s, a) = r_i^{0}(s, a) \left[ 1 - \exp(-\gamma t) \right]$$
where $\alpha$ and $\gamma$ are constant parameters. As a general rule, $0 < \alpha < 1$ to decrease the loudness and $\gamma > 0$ to increase the pulse rate each time the Q-values are improved.
Assigning a low pulse rate $r_i(s, a)$ to each $(s, a)$ at the beginning of the optimization process (line 17) and then increasing it (line 28) is an essential factor for the success of the algorithm. This is because it reduces the rate of local search around $Q^*(s, a)$ as BQ-learning approaches the best Q-value. The local search for the best Q-values can be performed simultaneously by multiple agents in BQ-learning as well as in other optimization-based cooperative Q-learning algorithms, such as PSO-Q and average aggregation Q-learning. BQ-learning is expected to perform better than the cooperative Q-learning algorithms discussed in Section 3, because it attempts to balance the exploration and exploitation of the nominated best Q-values for sharing using tuning techniques that control its parameters (frequencies, pulse emission rates and loudness of the potential solutions). Neither BEST-Q nor AVE-Q attempts to search around the best Q-values before sharing them. Consequently, BEST-Q might not find the optimal Q-values [11], while AVE-Q may produce an incorrect policy [11]-[12].

EXPERIMENTS
In this section, the performance of BQ-learning is compared with the performance of single-agent Q-learning, AVE-Q, BEST-Q, PSO-Q, WSS and average aggregation Q-learning (Section 3) using two problems: the shortest path problem [12] and the taxi problem [24]. These problems have been widely used in the literature to evaluate the performance of cooperative Q-learning algorithms [12]-[13], [24]-[26].

Test Problems
RL can be applied to two types of learning problems [24]. First, single-task problems (e.g., the shortest path problem), in which the learner is required to learn a single task. Second, multi-task problems (e.g., the taxi domain problem), in which the learner is required to learn multiple tasks.
The shortest path problem is a single-task problem that has been used in many research studies to evaluate the efficiency of cooperative Q-learning algorithms [11]-[13]. In this problem, an agent is required to learn one task, which is finding the shortest path from one cell to another in a grid, such that the number of visited cells is minimized. The grid in this problem is usually represented as a two-dimensional array that is indexed by two subscripts, one for the row and one for the column. In the shortest path problem, the target cell is usually specified prior to learning and the start cell is randomly selected before the beginning of each learning episode. The learner can move during each episode in four directions (up, down, right and left) as long as there are no obstacles or barriers obstructing its way. Figure 4 shows an example of the shortest path problem on a grid of size 20 × 20. Filled squares represent obstacles that the agent cannot pass, $s_0$ is the start cell and $s_g$ is the target cell.
The taxi domain problem is an episodic multi-task problem that has been used in many research studies to evaluate the performance of hierarchical Q-learning algorithms [24]-[26]. In each episode, a taxi agent in a grid world of size 5 × 5 is required to perform multiple tasks: finding a customer, picking up the customer, driving the customer to a destination location and dropping off the customer at the destination location. The taxi agent can accomplish these goals by choosing actions from a set of six actions: move one cell (left, right, up or down), pick up and drop off. If any of these actions leads the taxi agent to a barrier or a wall cell, the location of the taxi agent remains unchanged. In the grid, there are four source and destination locations. Figure 5 shows an example of the taxi domain problem. In the figure, a taxi is located on a 5 × 5 grid. There are four predetermined locations in the grid, marked as Red (R), Blue (B), Green (G) and Yellow (Y). At the beginning of the simulation process, one of these locations is selected as a pick-up point and another location is selected as a drop-off point.

Setup
The shortest path problem in Figure 4 was modeled as an MDP as follows:
• The cells in the 20 × 20 grid represent the states of the MDP.
The taxi domain problem in Figure 5 was modeled as an MDP as follows:
• The cells in the 5 × 5 grid represent the states of the MDP.
The experiments were implemented using two models of knowledge [19]. In the first model, the learners were assumed to have equal levels of knowledge. This was simulated by allowing the learners to learn for the same number of episodes before sharing their Q-values. In the second model, the learners were assumed to have different levels of knowledge, which was achieved by allowing each learner to learn for a different number of episodes each time it learns independently. For example, a learner that has learned for 25 episodes has more practical knowledge than a learner that has learned for only 10 episodes.
The action selection policy was the $\epsilon$-soft policy, in which a random action is uniformly selected with probability $\epsilon$ and the action with the highest expected reward is chosen the rest of the time [22].
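The selection rule described above can be sketched in Python as follows. This is a minimal sketch: the Q-values of the current state are assumed to be given as a dictionary from actions to values, and the function name `epsilon_soft` and the example values are hypothetical.

```python
import random


def epsilon_soft(q_values, epsilon=0.05, rng=random):
    """Pick a uniformly random action with probability epsilon,
    otherwise the action with the highest Q-value."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


q_values = {"up": 0.1, "down": 0.7, "left": 0.0, "right": 0.3}
greedy_choice = epsilon_soft(q_values, epsilon=0.0)
random_choice = epsilon_soft(q_values, epsilon=1.0, rng=random.Random(0))
```

Setting `epsilon` to zero gives a purely greedy learner, while a small positive value keeps every action reachable so that exploration never stops entirely.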
The learning parameters for the experiments were set as follows:
• In Q-learning, the learning rate $\alpha$ was tuned dynamically, so that low Q-values have larger learning rates than high Q-values, as recommended by Ray and Oates [27]. The discount factor was $\gamma = 1$ [28].
• In all the cooperative Q-learning algorithms, the learning rate was $\alpha = 0.01$ and the discount factor was $\gamma = 0.9$. These values ensure that each cooperative learner learns adequately and makes the best use of its current knowledge at each learning episode, as recommended by Abed-alguni et al. [9].
• In each episode, a learner starts learning from a randomly selected state and finishes learning when a goal state is reached. Otherwise, the learner finishes learning after 5,000 moves without meeting its goal.
• In order to ensure an adequate exploration/exploitation ratio, the probability of selecting a random action was $\epsilon = 0.05$ in the $\epsilon$-soft selection policy.
• The Nrm measure was selected as the expertness measure of WSS. This measure has a similar performance to that of all the other tested expertness measures.
• As in Abed-alguni et al. [9], the weight parameters in PSO-Q included $W = 0$.
• In BQ-learning, the frequency, the loudness and the pulse rate were in the range [0, 1] for each solution. The discount parameter of the frequencies was $\beta = 0.5$. Initially, the loudness $A$ was set to 1 and the pulse rate $r$ was set to 0 for each Q-value.
Three agents were involved in the experiments. The total number of learning episodes was 2,000 for the shortest path problem and 12,000 for the taxi problem. Each algorithm was executed 100 times in order to provide a meaningful statistical analysis of the experiments.
In the experiments, an algorithm is considered to have converged to a good policy when the average number of moves in its policy improves by less than one move over 100 successive episodes.
Figure 6 shows the average number of moves per 10 episodes to find the shortest path to the goal state in a 20 × 20 grid. The second learning stage of the cooperative Q-learning algorithms takes place after each 10 episodes of individual learning. We can see from the figure that BQ-learning converges to a solution after 420 episodes. On the other hand, single-agent Q-learning, AVE-Q, WSS, average aggregation Q-learning and PSO-Q converge to solutions after around 520 episodes, while BEST-Q requires 60 additional episodes to converge to a solution. These results suggest that BQ-learning performs better than the other algorithms in single-task problems when the agents have similar levels of knowledge before sharing.
Figure 7 shows the corresponding results on the grid when the agents have different levels of knowledge: the first, the second and the third agents learn for 10, 5 and 1 episodes, respectively, before sharing their Q-values with each other. In this experiment, BQ-learning requires 300 episodes of learning to converge to a solution, which is only 14.9% of the number of episodes required for single-agent Q-learning (2,020), 54% of BEST-Q (550), 53.6% of WSS (560), 62.5% of PSO-Q (480), 61.2% of AVE-Q (490) and 29.4% of average aggregation Q-learning (1,020). These results indicate that BQ-learning outperforms the other algorithms in single-task problems when the agents have different levels of knowledge before sharing.
[Figure caption: each curve is the average of 100 runs; one, five and ten episodes of learning occur before implementing a Q-value sharing strategy.]
Figures 8 and 9 show the results for the taxi problem on the 5 × 5 grid. Sharing of Q-values occurs in Figure 8 after each 10 episodes of independent learning, while in Figure 9 the first, the second and the third learners learn for 1, 5 and 10 episodes, respectively, before sharing their Q-values.
Figure 8 shows that BQ-learning requires 7,180 episodes to converge to a solution, followed by PSO-Q, which requires 7,560 episodes (5% more episodes than BQ-learning). On the other hand, the other algorithms failed to converge to a solution by the end of the simulation process. These results suggest that BQ-learning converges to a solution faster than the other algorithms in multi-task problems when the agents have similar experiences. Figure 9 shows that all of the cooperative Q-learning algorithms failed to converge to a solution except BQ-learning and AVE-Q. As expected, BQ-learning has the fastest convergence speed among all algorithms. From Figure 9, we can also see that WSS has the worst performance among all the algorithms, including single-agent Q-learning. These results indicate that BQ-learning outperforms the other algorithms in multi-task problems when the agents have different levels of experience.
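The convergence criterion used in these experiments can be expressed as a small Python check. The sketch below is one possible reading of the criterion, stated as an assumption: the input is a list holding the average number of moves at each episode, and convergence is reported at the first episode after which the average improves by less than one move over the next 100 episodes. The function name `convergence_episode` and the synthetic curves are hypothetical.

```python
def convergence_episode(avg_moves, window=100, tolerance=1.0):
    """Return the first episode index whose average number of moves
    improves by less than `tolerance` over the next `window` episodes,
    or None if no such episode exists."""
    for episode in range(len(avg_moves) - window):
        improvement = avg_moves[episode] - avg_moves[episode + window]
        if improvement < tolerance:
            return episode
    return None


# Synthetic learning curve: 50 episodes of steady improvement, then flat.
curve = [100 - 2 * k for k in range(50)] + [0] * 200
converged_at = convergence_episode(curve)

# A curve that keeps improving never satisfies the criterion.
still_improving = [300 - 2 * k for k in range(150)]
never = convergence_episode(still_improving)
```

Under this reading, an algorithm marked as not converged is one for which the function would return `None` within the episode budget of the experiment.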

Performance Analysis
Two statistical measures were used in Table 1 to compare the performance of the tested algorithms over 100 runs. The results are in the format: average number of iterations ± standard deviation of iterations. The last row of the table shows that BQ-learning requires fewer iterations to converge to a solution than the other algorithms. In addition, the standard deviations of the number of iterations of BQ-learning are the lowest among the algorithms that converge to a solution. This means that the performance of BQ-learning is more stable than the performance of the other tested algorithms.
[Figure caption: each curve is the average of 100 runs; one, five and ten episodes of learning occur before implementing a Q-value sharing strategy.]
In Figure 11 (taxi domain), BQ-learning (1-5-10) converges after 9,136 episodes, BQ-learning (15-30-45) converges after 9,233 episodes and BQ-learning (25-50-100) converges after 9,417 episodes. To sum up, the results in both figures indicate that the convergence speed of BQ-learning is not highly sensitive to the number of episodes that each agent learns before sharing Q-values.
The overall results of the experiments suggest that BQ-learning performs better than conventional Q-learning and the other cooperative Q-learning algorithms, regardless of the levels of experience of the agents (similar experiences vs. different experiences) and the types of the learning problems (single-task vs. multi-task problems).
Table 1. Average and standard deviation of the number of iterations over 100 runs. The star symbol * indicates that the algorithm did not converge to a solution at the end of the simulation process.

CONCLUSION AND FUTURE WORK
The cooperative Q-learning approach is an efficient learning approach that accelerates the learning process of individual learners in homogeneous multi-agent systems. This paper presented the BQ-learning algorithm, which is a new cooperative Q-learning algorithm inspired by the bat algorithm. The learning process of BQ-learning comprises two stages. First, the individual learning stage, where each agent learns or improves its own policy by applying the standard Q-learning algorithm. Second, the learning by interaction stage, where the learners share their Q-values with each other using a Q-value sharing strategy based on the bat algorithm.
The BQ-learning algorithm has several advantages. First, compared to current cooperative Q-learning algorithms, the BQ-learning algorithm can be applied to both single-task and multi-task problems, because optimizing the tasks of a learning problem using the bat algorithm improves the overall solution for the problem. Second, the bat sharing strategy in BQ-learning increases the possibility of finding the optimal Q-values, because it attempts to balance the exploration and exploitation of actions using tuning techniques that control its parameters (frequencies, pulse emission rates and loudness of the potential solutions). Finally, the results of the pilot experiments suggest that BQ-learning converges faster than single-agent Q-learning and other well-known cooperative Q-learning algorithms, whether the agents have similar or different levels of experience and regardless of the type of the learning problem (single-task vs. multi-task problems).
Future work includes applying the BQ-learning algorithm to continuous-space learning problems and developing a new cooperative Q-learning algorithm based on a combination of the firefly and monkey algorithms.