Learning Bayesian networks based on bi-velocity discrete particle swarm optimization with mutation operator

: The problem of structures learning in Bayesian networks is to discover a directed acyclic graph that in some sense is the best representation of the given database. Score-based learning algorithm is one of the important structure learning methods used to construct the Bayesian networks. These algorithms are implemented by using some heuristic search strategies to maximize the score of each candidate Bayesian network. In this paper, a bi-velocity discrete particle swarm optimization with mutation operator algorithm is proposed to learn Bayesian networks. The mutation strategy in proposed algorithm can efficiently prevent premature convergence and enhance the exploration capability of the population. We test the proposed algorithm on databases sampled from three well-known benchmark networks, and compare with other algorithms. The experimental results demonstrate the superiority of the proposed algorithm in learning Bayesian networks.


Introduction
Bayesian networks (BNs) are probabilistic graphical models used for representing the probabilistic relationships among the random variables in the domain and doing probabilistic inference with these variables. They have been successfully used for modeling and reasoning, with applications such as pattern recognition [1,2], medical diagnosis [3,4], risk analysis [5,6], computational biology [7][8][9], and many others.
The issue of learning BN structures from data is receiving increasing attention. Algorithms for BN structures learning can be grouped into two categories. Constraint-based algorithms [10][11][12][13] construct a graph from data by employing conditional independence tests; statistical or information theoretic measures are used to test conditional independencce between the variables. Local structure learning algorithms, designed by using the knowledge of the Markov blankets of the variables, can reduce the number of dependence tests to some extent. However, these algorithms depend on the accuracy of the statistical tests, they may perform badly with the insu cient or noisy data. Score-based learning algorithms [14][15][16] try to construct a network by maximizing the score function of each candidate network using some greedy search or heuristic search algorithms. However, the space of all possible structures increases rapidly with the increasing number of variables [17], deterministic search method may fail to nd optimal solution and are often trapped in local optimum.
Although, these swarm intelligence algorithms have good performance in the problem of BN structures learning, there may exist some unavoidable drawbacks. For instance, in global optimization problems, the challenge for PSO is that it may be trapped in the local optimum due to its poor exploration. To enhance the exploration ability of PSO, many strategies have been proposed, so that it can be widely used in many research and application areas with its advantages not only in easy implementation and few parameters to adjust, but also in the ability of quick discovery of optimal solutions. BN structures learning is one of this cases. The classical PSO is designed for continuous problems. In order to extent its applications, and apply it in discrete space to learn BNs, several discrete PSOs for discrete optimization have been presented.
To keep the e ciency of the classical PSO in continuous space, the fast convergence speed and global search advantages, the bi-velocity discrete PSO was proposed and applied in the steiner tree problem in graphs and the multicast routing problem in communication networks [30,31]. Since a BN structure is represented as a connectivity matrix, in which each element is 0 or 1, and the particle corresponding to the BN structure can be represented by a binary string. Thus, we adopt the velocity and position updating rules similar to that proposed in [30,31]. However, in PSO each particle moves toward its past best position and the global best position found so far, the exploitation ability is enhanced, but the number of nonzero elements of the velocity tends to zero with the increasing iterations. In this case, if the current global best position is not the global optimum, the particles in the swarm may be trapped in the local optima. To prevent the algorithm from being trapped in a local optimum and enhance the exploration capability, a mutation strategy is introduced to conduct mutation on each new particle. In this paper, an e cient bi-velocity discrete particle swarm optimization with mutation operator algorithm is designed to solve the problem of BN structures learning (BVD-MPSO-BN).

Bayesian networks and k metric
Bayesian networks are knowledge representation tools capable of representing independence and dependence relationships among variables. A Bayesian network, on one hand, is a directed acyclic graph (DAG) G = (X, E), where X = {X , X , ⋯, X n } the set of nodes, represents the random variables in a special domain. E is the set of edges, each edge represents the directed in uence of one node on another. On the other hand, a Bayesian network uniquely encodes a joint probability distribution over the random variables. It decomposes according to the structure as where Pa(X i ) is the set of the parents of node X i in G. When performing the score-and-searching approach for learning BNs from data, a score metric must be speci ed. So far many score criteria for evaluating the learned networks have been proposed. One of the most well-known score criterion in learning BN structures has been given by Cooper and Herskovits (1992). The score function for a given structure G and training database D is where n is the number of variables, each variable X i has r i possible values, q i is the number of parent con gurations of variable X i , N ijk is the number of cases in D, where X i takes on value k with parent con guration j and N ij = ∑ r i k= N ijk .
By using the logarithm of the above function and assuming a uniform prior for P(G), the decomposable k metric can be expressed as where f (X i , Pa(X i )) is the k score of node X i and de ned as

Particle swarm optimization
Particle swarm optimization is a population based stochastic optimization technique, each particle in PSO is a potential solution, the position of a particle is represented as in which D is the dimension of the search space, N is the number of particles. Each particle has a velocity . When a particle updates its position, it records its past best position pbest i = (pbest i , pbest i , ⋯, pbest iD ) and the global best position gbest = (gbest , gbest , ⋯, gbest D ) found by any particle in the population. In the standard PSO, the new velocity is calculated according to its previous velocity, the distances of the current position from both its past best position and the global best position. After a particle updates its velocity via Eq.(5), it ies toward a new position according to Eq.(6). Each particle compares its current tness value with its own past best tness value -if it is better then a particle updates the past best position and the past best tness value with the current position and its tness value. The particle also compares its tness value with global best tness value -if it is better then a particle updates the global best position and the global best tness value with the current position and its tness value.
where ω is the inertia weight, c and c are positive acceleration coe cients, r and r are two independently uniformly distributed random values in the range [ , ], t is the number of iterations.

Learning BNs using the bi-velocity discrete PSO with mutation operator
Considering the fact that the original PSO algorithm operates only in a continuous search space, some strategies were proposed to solve the problems in a discrete search space and then applied to learn BN structures. To keep the original PSO framework, a bi-velocity discrete particle swarm optimization was proposed, and used for the steiner tree problem in graphs [30] and the multicast routing problem in communication networks [31]. In this section, we intent to use the bi-velocity discrete particle swarm optimization with mutation to solve the problem of BN structures learning.

. Problem representation
The problem of BN structures learning is discrete. A BN structure can be represented by an n × n connectivity matrix A, whose each element a ij is de ned as in Eq. (7), where n is the number of nodes. We intend to use the bi-velocity discrete particle swarm optimization to solve the BN structures learning problem. The position of a particle is encoded as a binary string : a , a , ⋯, a n , a , ⋯, a n , ⋯, a n , ⋯, a nn , similar to [32]. A BN structure G is shown in Fig. 1, the corresponding connectivity matrix A = , and the binary string representing X = ( , , , , , , , , , , , , , , , , , , , , , , , , ).

Fig. 1. An example of Bayesian network
When bi-velocity discrete PSO works in BN structures learning, the position of each particle i is represented as while the velocity is encoded as where D = n , . v ij is the probability of x ij being 0, and v ij is the probability of x ij being 1.

. Initial solution construction
To generate initial solutions, we use a method analogous to the one used in [25]. Each initial solution is derived by rst starting with an empty graph dose not having any edges, and then adding the absent edges one by one to the current graph if and only if the new graph is a directed acyclic graph and the score of the new graph is higher than that of the previous graph. This procedure repeats until the number of added edges reaches the prede ned value. By using the method mentioned above, a certain number of initial solutions are generated.

. Bi-velocity discrete PSO with mutation operator . . Updating rules
To keep the concept of original PSO in a continuous search space, di erent updating rules have been proposed in bi-velocity discrete PSO, which are described in detail as follows: For example, if X = ( , , , , , , , ) and X = ( , , , , , , , ), then, (2): Velocity = Coe cient × Velocity: This equation represents that the Velocity is multiplied by ω or c × r. Because each element of the Velocity is the possibility for the position being 0 or 1, so the element that is larger than 1 is set to 1.
For example, c × r = ( . , . , . , . , . , . , . , . In the continuous searching space, the new position of a particle is calculated by adding the updated velocity to the current position. However, the position and velocity may not be added directly in the discrete searching space. In order to solve the discrete problem, a new updating method has been proposed as Eq.(10) [31] in which α is a random value in [ , ].

. . Mutation operation
When PSO is implemented, each particle moves toward its past best position and the global best position found so far, the exploitation ability is enhanced. However, according to the velocity and particle position updating rules, the numbers of nonzero elements in results of pbest i − X i and gbest − X i decrease and even becomes equal to zero with the increasing iterations. In this case, if the current global best position is not the global optimum, the particles in the swarm may be trapped in the local optima. To prevent premature convergence and enhance the exploration capability, a mutation strategy is adopted to conduct mutation on each new particle. The mutation probability depends on the problem dimension, in other words, it is determined by the number of nodes in BN structures learning problem, because a BN structure is represented by an n×n connectivity matrix whose diagonal elements are all zeros, and the position of a particle is encoded as a binary string according to the connectivity matrix. Thus, we de ne the mutation probability p = (n − n), where n is the number of nodes. The mutation operator of a particle x ij , otherwise , (11) in which, m = , ⋯, n. .

Procedure of the proposed algorithm for BN structures learning
Based on the description above, the pseudo-code of BVD-MPSO-BN is presented in Algorithm 1. It starts with the initial solutions generated by the method described in section 4.2. Each particle in the swarm is encoded as a binary string corresponding to a directed acyclic graph and evaluated using the k metric. During iteration, each particle updates its velocity and position according to the updating rules presented in subsection 4.3.1.
To increase the probability of escaping from a local optimum, the mutation operator is conducted on each new particle. Because each solution should be a direct acyclic graph, the direct cycles are removed from the new particle if it is a invalid solution. In order to detect and remove the cycles, we rst use the depth rst search algorithm to detect all back edges, and then invert or delete them. After removing the cycles, the proposed algorithm is executed to improve the past best position of each particle and the global best position of the population. During the main loop of the algorithm, the velocities and the positions of the particles are iteratively updated until a stopping criterion is met.

Experimental results
In this section, we use several networks to test the behaviour of BVD-MPSO-BN. These networks are available in the software GeNie . In addition, we compare the proposed algorithm with other algorithms on benchmark networks. The experiments have been executed in a personal computer with Pentium(R) Dual-Core CPU, 2.0 GB memory, and Windows 7, all the algorithms have been implemented in the Matlab language.

. Databases and parameter settings of the algorithms
In our experiments, three benchmark networks are selected, namely Alarm, Asia and Credit networks. Alarm network developed for on-line monitoring of patients in intensive care unites [33] contains 37 nodes and 46 arcs; Asia network is useful in demonstrating basics concepts of Bayesian networks in diagnosis [34], it is a simple graphical model and has 8 nodes and 8 arcs; Credit network for assessing credit worthiness of an individual was developed by Gerardina Hernandez as a class homework at the University of Pittsburgh, it is available in the GeNie software and consists of 12 nodes and 12 arcs. The databases used in our experiments are sampled from these benchmark networks by probabilistic logic sampling. In Table 1, the databases, the original networks, the number of cases in each database, the number of nodes in each network, the number of arcs in each network and the k scores for the original networks are listed.
We compare BVD-MPSO-BN with other algorithms. BNC-PSO: structure learning Bayesian networks by particle swarm optimization [26], when BNC-PSO is implemented, the population size is 50, inertia weight ω decreases linearly from 0.95 to 0.4, acceleration coe cient c decreases linearly from 0.82 to 0.5 and acceleration coe cient c increases linearly from 0.4 to 0.83. An arti cial bee colony algorithm for learning Bayesian networks (ABC-B) [24] had the following parameters: weighted coe cients for the pheromone α = http://dslpitt.org/genie/.

Original network Number of cases Number of nodes Number of arcs Score
Alarm- and the heuristic information β = , pheromone evaporation coe cient ρ = . , switching parameter for exploitation versus exploration q = . , maximum number of solution stagnation limit = . , population size is equal to 40. The parameters for BVD-MPSO-BN were chosen as: the population size is 50, inertia weight ω = .

. Metrics of the performance
To measure the performance of proposed algorithm, we evaluate the learned results in terms of the k score and the structural di erence (i.e., the di erences between the learned structure and the original network). The detailed descriptions of the metrics are de ned as below: • HKS: the highest k score resulting from all trials carried out.
• LKS: the lowest k score resulting from all trials.
• AKS: the average k score (including the mean and the standard deviation) resulting from all trials.
• AEA: the average number of edges accidentally added over all trials, it contains the mean and the standard deviation. • AED: the average number of edges accidentally deleted over all trials, it contains the mean and the standard deviation. • AEI: the average number of edges accidentally inverted over all trials, it contains the mean and the standard deviation. • LSD: the largest structural di erence resulting from all trials.
• SSD: the smallest structural di erence resulting from all trials.
• ASD: the average structural di erence (including the mean and the standard deviation) resulting from all trials. • AIt: the average number of iterations needed to nd an optimal solution over all trials, it contains the mean and the standard deviation. • AET: the average execution time over all trials.

. Experimental analysis . . Learning BNs using bi-velocity discrete PSO with mutation operator
To study the performance of BVD-MPSO-BN algorithm for Bayesian networks learning, we use it to recover the structures from databases sampled from the given benchmark networks. k score and structural di erence as the basic metrics of the performance are adopted to evaluate the learned networks. We test the BVD-MPSO- BN algorithm by using the Alarm network with the sample size n = , , , , , , the Asia network with the sample size n = , , and the Credit network with the sample size n = , , . Table 2 reports the experimental results in terms of k score, the number of iterations and the execution time. Table 3 shows the experimental results based on the structural di erence between the learned network and the original network. Each statistic in Table 2 and Table 3 is the average and standard deviation values over ten independent runs of BVD-MPSO-BN algorithm. We mark the best values in bold.
As shown in Table 2, the di erence between HKS and LKS is small on databases Alarm-2000, Alarm-3000, Alarm-4000 and Alarm-5000, except for Alarm-500 and Alarm-1000, which do not have enough samples to correctly learn Alarm networks. The algorithm also returns the small standard deviation, which indicates that the BVD-MPSO-BN algorithm is stable for the network with enough samples. For Asia network, the di erences between HKS and LKS are smaller than 0.6 on database Asia-1000 and 0.4 on database Asia-3000, the algorithm returns the same k score on database Asia-500. In addition, the di erence between AKS and the score of the original network is smaller than 0.6, which means that the score of the Asia network obtained by BVD-MPSO-BN algorithm is very close to that of the original network. For Credit network, the proposed algorithm obtains small standard deviation value and the di erence between HKS and LKS, which indicates that the BVD-MPSO-BN algorithm also performs well in Credit network.
From the view of the structural di erence, as shown in Table 3, the average and standard values in terms of ASD, AEA, AED and AEI are relatively small on databases sampled from Alarm, Asia and Credit networks. For Alarm network, the values of SSD on databases Alarm-3000 and Alarm-4000 are equal to two, which means that only two times of legitimate operations needed to change the learned network to the original one at the best case. The standard deviation values of AED are equal to zeros on databases Alarm-2000, Alarm-3000, Alarm-4000 and Alarm-5000, which means that the proposed algorithm learns the Alarm network with one or two edges accidentally deleted over ten runs. The average structural di erence on database Alarm-500 is larger than 12, which means that the algorithm has poor performance on small databases. For Asia and Credit networks, the average and standard deviation values of AEA approach to zeros, and they are equal to zeros on databases Asia-3000, Credit-500, Credit-1000 and Credit-3000, which means that there are no edges accidentally added when the proposed algorithm learns the Asia network with sample size 3000 and the Credit network with sample size 500, 1000 and 3000. In the best case, the values of AEI are equal to zeros on databases generated from the benchmark networks, that is, there is no accidentally inverted edges in the best case.
The results related to the Alarm, Asia and Credit networks demonstrate that the proposed algorithm is stable for the large networks and able to nd structures very close to the original structures for small networks. The performance of the proposed algorithm improves with the increasing sample size.

. . Learning BNs using di erent algorithms
Next, we compare BVD-MPSO-BN algorithm with BNC-PSO and ABC-B algorithms. The experimental results are presented in Table 4, Table 5 and Table 6. Each entry is the average and standard deviation values over ten independent runs of the di erent algorithms. The performance of the algorithms is evaluated based on the accuracy in terms of AKS, AEA, AED, AEI and AIt. The best values for di erent metrics are marked in bold. Table 4 shows the experimental results of three di erent algorithms on databases sampled from Alarm network. From the perspective of k score, BVD-MPSO-BN achieves the best values of AKS on databases Alarm-3000 and Alarm-5000. Although, BVD-MPSO-BN obtains the higher k score on databases Alarm-1000 compared with BNC-PSO, the standard deviation is larger than that obtained by BNC-PSO. ABC-B algorithm returns the best k score on small database Alarm-500. From the view of structural di erence, the ASD values of BVD-MPSO-BN on databases Alarm-1000 and Alarm-3000 are the smallest among those of three algorithms. The ASD values returned by ABC-B algorithm on databases Alarm-500 and Alarm-5000 are smaller than that returned by other algorithms. Although ABC-B achieves the smallest average value of ASD on database Alarm-5000, the standard deviation is larger than that of BVD-MPSO-BN. It is obvious that BNC-PSO obtains networks with more incorrect edges compared with original networks on database Alarm-500. The values of AEA returned by ABC-B on di erent databases sample from Alarm network are the best among those of three algorithms. BVD-MPSO-BN achieves the smallest values of AED on database Alarm-500 and Alarm-1000, the AED values of BNC-PSO and ABC-B on databases Alarm-3000 and Alarm-5000 are the same as that of BVD-MPSO-BN, and they are equal to ones, which means that each of three algorithms learns the BN structures with one edge accidentally deleted on each trial carried out. BVD-MPSO-BN obtains the best values of AEI on databases Alarm-1000, Alarm-500 and Alarm-5000. From the view of the number of iterations, ABC-B often needs less number of iterations compared with BVD-MPSO-BN on databases Alarm-500, Alarm-3000 and Alarm-5000.
From the experimental results on Asia network with sample size n = , , in Table 5, we observe that the BVD-MPSO-BN and BNC-PSO algorithms achieve the same well k score and structural di erence on databases Asia-500 and Asia-1000, they learn the BNs from database Asia-3000 with no edges accidentally added over ten executions. In comparison to the use of ABC-B algorithm, BVD-MPSO-BN algorithm can obtain higher k score, while the ASD values returned by ABC-B algorithm on databases Asia-500 and Asia-1000 are the smallest among three algorithms. There is no accidentally inverted edges generated by BVD-MPSO-BN and BNC-PSO algorithms on database Asia-500. However, the average number of iterations of BVD-MPSO-BN algorithm is the smallest among three algorithms on databases Asia-1000 and Asia-3000.
The experimental results of three algorithms on the Credit network are presented in Table 6. For k score, BVD-MPSO-BN algorithm does not perform well on databases Credit-500 and Credit-1000, but still obtains relatively good result on databases Credit-3000. From the view of the structure di erence, we observe that BVD-MPSO-BN obtains the best ASD results on databases Credit-500 and Credit-1000. BVD-MPSO-BN and Alarm- Alarm- Alarm- Alarm- Asia- Asia- Asia- BNC-PSO get the same AEA results and BVD-MPSO-BN obtains the best AED results on three databases. ABC-B obtains the best AEI results on databases Credit-1000 and Credit-5000, and BVD-MPSO-BN obtains the best AEI result on database Credit-500. The average number of iteration of BVD-MPSO-BN is smaller or at least not larger than that of the other two algorithms. Credit- Credit- To test the time performance of the proposed algorithm, we evaluate three algorithms on Alarm network with sample size n = , , , Asia network with sample size n = , and Credit network with sample size n = , . Fig.(2) shows the average running time of three algorithms on di erent networks. It is obvious that the searching time of the proposed algorithm is the smallest among three algorithms. For BNC-PSO algorithm, the reason is that BVD-MPSO-BN keeps the advantage of fast convergence of classical PSO, while BNC-PSO was proposed by combining PSO with Genetic Algorithm. For ABC-B algorithm, during each iteration, each employed bee nds a new solution in its neighborhood by testing and comparing the k scores of four operators (addition, deletion, reversion and move). In addition, each onlooker determines a new solution by performing two knowledge-guided operators or four simple operators and comparing their k scores, so it is time consuming to compute the k score. Although the number of iterations of ABC-B is often less than that of BVD-MPSO-BN, ABC-B takes much time to reach the near-optimal solutions. Meanwhile, we analyze the changing of time requirement as the changing of sample size. Fig.(3) shows the average results of BVD-MPSO-BN in comparison to ABC-B and BNC-PSO algorithms on databases sampled from Alarm network. Three algorithms generally take much time on learning BNs from large databases. It is obvious that the execution time of the proposed algorithm increases slowly with the increase of the sample size, whereas ABC-B and BNC-PSO algorithms are sensitive to the sample capacity. The overall results demonstrate that BVD-MPSO-BN algorithm is superior to ABC-B and BNC-PSO algorithms in terms of execution time.   4 shows the convergence characteristics of three heuristic algorithms on database Alarm-5000. It is obvious that the nal solutions of three algorithms are close to each other. However, the proposed algorithm converges to the optimal solutions faster than both BNC-PSO and ABC-B algorithms. BNC-PSO performs better than ABC-B at the beginning because particles in PSO learn from better and best solutions so that the population quickly converges to the optimal solution. Once the particles are close to the best solution, the convergence speed becomes slower. However, with the help of mutation operator, the particles in proposed algorithm are easy to jump out of the likely local optima, and hence the fast convergence speed could be remained through the whole evolutionary progress. Fig. 3. Time performance of three algorithms on Alarm network. Fig. 4. The score convergence of three algorithms on Alarm-5000.
Based on the observations above, we conclude that BVD-MPSO-BN can guarantee to learn good-quality networks. BVD-MPSO-BN not only keeps the powerful searching capability in nding the optimal solution, but also prevents the particles in swarm from trapping in the local optima.

Conclusion
In this paper, we propose a novel score-based algorithm for BNs learning. PSO is a swarm intelligence globalized search algorithm with the advantages of simple computation and rapid convergence capability. However, with the increasing of the number of iterations, the quality of the solution can not be improved, and the algorithm converges to the local optima. In other words, it is easy for PSO to su er from the premature convergence. To overcome the drawback of the PSO and learn BN structures from data, bi-velocity discrete PSO with mutation algorithm has been proposed. We make a proper balance between exploration and exploitation ability of the proposed algorithm. The experimental results on the databases generated from the benchmark networks demonstrate the e ectiveness of our method. Comparing with the BNC-PSO algorithm for BNs learning, the advantage of our algorithm not only lies in its less computation time but also lies in its less error rate between the learned structure and the original network. In the comparison to the use of the ABC-B algorithm, when the number of samples available for structure learning is large, the proposed algorithm performs well and has the better average accurate. The experimental results illustrate the superiority of the proposed algorithm in learning BNs from the data. In this paper, the databases are completely observed, however, there may exist missing data or data with hidden variables in practice. Extending swarm-based algorithms to learn BN structures with incomplete data is our future work. In addition, the performance of the proposed algorithm decreases with the decreasing sample size. Thus, future work will consider the method for structure learning on small databases.