Fault Diagnosis for Rolling Bearings Based on Fine-Sorted Dispersion Entropy and SVM Optimized with Mutation SCA-PSO

Rolling bearings are a vital and widely used component in modern industry, whose condition directly affects the production efficiency and remaining life of a device. An effective and robust fault diagnosis method for rolling bearings can reduce the downtime caused by unexpected failures. Thus, a novel fault diagnosis method for rolling bearings based on fine-sorted dispersion entropy and a support vector machine (SVM) optimized by a mutation sine cosine algorithm and particle swarm optimization (SCA-PSO) is presented to diagnose faults of various sizes and locations under different motor loads. Vibration signals collected from different types of faults are firstly decomposed by variational mode decomposition (VMD) into sets of intrinsic mode functions (IMFs), where the decomposing mode number K is determined by the central frequency observation method, thereby weakening the non-stationarity of the original signals. Later, the improved fine-sorted dispersion entropy (FSDE) is proposed to enhance the perception of the relationship information between neighboring elements and then employed to construct the feature vectors of different fault samples. Afterward, a hybrid optimization strategy combining the advantages of a mutation operator, the sine cosine algorithm, and particle swarm optimization (MSCAPSO) is proposed to optimize the SVM model. The optimal SVM model is subsequently applied to realize pattern recognition for different fault samples. The superiority of the proposed method is assessed through multiple contrastive experiments. Result analysis indicates that the proposed method achieves better precision and stability than several relevant methods, whereupon it is promising in the field of fault diagnosis for rolling bearings.


Introduction
Rolling bearings are a crucial part of modern industrial manufacture and can be found in linear guides, precision machine tools, engine parts, etc.; their failure may result in serious safety accidents and economic loss. Therefore, one of the major topics to be investigated in preventing failures of mechanical systems is recognizing and diagnosing rolling bearing faults [1,2]. Unfortunately, rolling bearings are prone to failure because of complex operating conditions and their structural characteristics [3]. Given this fact, effective and feasible fault diagnosis methods need to be developed for rolling bearings to improve the reliability of mechanical systems [4].
Variational mode decomposition (VMD) is a new adaptive signal preprocessing method [29]. By setting a mode number K in advance, the given signal can be decomposed into K band-limited intrinsic mode functions (IMFs). A cyclic iterative method is used to obtain the optimal solution of the constrained variational problem, through which the frequency center and bandwidth of each mode are determined. The constrained variational problem can be described as follows:

min_{{m_k},{ω_k}} Σ_{k=1}^{K} ‖ ∂_t [ (δ(t) + j/(πt)) * m_k(t) ] e^{−jω_k t} ‖_2^2   s.t.  Σ_{k=1}^{K} m_k(t) = f(t)   (1)
where m_k = [m_1, m_2, ..., m_K] represents the set of K mode functions and ω_k = [ω_1, ω_2, ..., ω_K] represents the set of central frequencies, while ∂_t and δ(t) are the partial derivative with respect to time t and the unit pulse function, respectively. f(t) is the given real-valued input signal.
For the above variational problem, a quadratic penalty term and a Lagrange multiplier are employed to transform it into an unconstrained problem. Then Problem (1) can be specified as follows:

L({m_k}, {ω_k}, β) = α Σ_{k=1}^{K} ‖ ∂_t [ (δ(t) + j/(πt)) * m_k(t) ] e^{−jω_k t} ‖_2^2 + ‖ f(t) − Σ_{k=1}^{K} m_k(t) ‖_2^2 + ⟨ β(t), f(t) − Σ_{k=1}^{K} m_k(t) ⟩   (2)

where α and β(t) represent the penalty factor and the Lagrange multiplier, respectively. The alternate direction method is then utilized to obtain the saddle point of the augmented Lagrangian, that is, solving (2) by optimizing m_k, ω_k, and β alternately. The optimization subproblems of m_k and ω_k are formulated as (3) and (4), respectively:

m_k^{n+1} = argmin_{m_k} { α ‖ ∂_t [ (δ(t) + j/(πt)) * m_k(t) ] e^{−jω_k t} ‖_2^2 + ‖ f(t) − Σ_i m_i(t) + β(t)/2 ‖_2^2 }   (3)

ω_k^{n+1} = argmin_{ω_k} ‖ ∂_t [ (δ(t) + j/(πt)) * m_k(t) ] e^{−jω_k t} ‖_2^2   (4)
Solving problems (3) and (4) in the frequency domain, the iterative equations are deduced as follows:

m̂_k^{n+1}(ω) = [ f̂(ω) − Σ_{i≠k} m̂_i(ω) + β̂(ω)/2 ] / [ 1 + 2α(ω − ω_k)^2 ]   (5)

ω_k^{n+1} = ∫_0^∞ ω |m̂_k^{n+1}(ω)|^2 dω / ∫_0^∞ |m̂_k^{n+1}(ω)|^2 dω   (6)

The Lagrange multiplier can be iterated with Equation (7):

β̂^{n+1}(ω) = β̂^n(ω) + γ [ f̂(ω) − Σ_k m̂_k^{n+1}(ω) ]   (7)
where γ is an updating parameter.
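The frequency-domain updates (5)-(7) can be sketched in a few lines of NumPy. This is a simplified illustration only (no signal mirroring, a naive centroid over the positive half-spectrum, evenly spaced initial central frequencies), not a full VMD implementation:

```python
import numpy as np

def vmd(signal, K=4, alpha=2000.0, gamma=0.0, tol=1e-6, max_iter=500):
    """Minimal VMD sketch: decompose `signal` into K band-limited IMFs
    with the frequency-domain updates (5)-(7). `alpha` is the penalty
    factor; `gamma` is the multiplier step (0 disables the multiplier,
    a common simplification)."""
    n = len(signal)
    f_hat = np.fft.fft(signal)
    freqs = np.fft.fftfreq(n)                # normalized frequency axis
    m_hat = np.zeros((K, n), dtype=complex)  # mode spectra
    omega = np.linspace(0.05, 0.45, K)       # initial central frequencies
    beta_hat = np.zeros(n, dtype=complex)    # Lagrange multiplier spectrum

    for _ in range(max_iter):
        m_prev = m_hat.copy()
        for k in range(K):
            residual = f_hat - m_hat.sum(axis=0) + m_hat[k]
            # Wiener-filter-like mode update, Eq. (5)
            m_hat[k] = (residual + beta_hat / 2) / (1 + 2 * alpha * (freqs - omega[k]) ** 2)
            power = np.abs(m_hat[k]) ** 2
            half = slice(0, n // 2)          # positive half-spectrum
            # Central frequency as power-weighted mean, Eq. (6)
            omega[k] = np.sum(freqs[half] * power[half]) / (np.sum(power[half]) + 1e-12)
        beta_hat = beta_hat + gamma * (f_hat - m_hat.sum(axis=0))  # Eq. (7)
        if np.sum(np.abs(m_hat - m_prev) ** 2) / (np.sum(np.abs(m_prev) ** 2) + 1e-12) < tol:
            break
    imfs = np.real(np.fft.ifft(m_hat, axis=1))
    return imfs, omega
```

Each returned row of `imfs` is one band-limited mode, and `omega` holds its converged normalized central frequency.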

Support Vector Machine
Support vector machine (SVM) is a machine learning model developed by Vapnik [30], which can find the optimal trade-off between model complexity and learning ability based on limited information, thus obtaining the best classification accuracy and showing unique advantages in learning problems with limited samples, non-linearity, and high-dimensional data. For a given sample set (x_i, y_i), i = 1, 2, ..., n, the main idea of SVM is to map the samples into a high-dimensional feature space and find a hyper-plane there to solve the corresponding non-linear problem of the low-dimensional space. The hyper-plane function can be constructed as:

f(x) = w·x + b   (8)

where w is a weight vector and b is a bias parameter, while w·x represents the inner product of w and x. Taking the binary classification issue as an example, to classify samples correctly with a sufficiently large classification interval, all samples are required to meet the following constraint:

w·x_i + b ≥ +1 for y_i = +1;  w·x_i + b ≤ −1 for y_i = −1   (9)

In the sequel, the classification interval can be calculated as 2/‖w‖, and maximizing the interval is equivalent to minimizing ‖w‖^2. To solve the linearly inseparable problem, slack variables ξ_i and a penalty factor C are brought into Equation (9), hence the construction of the optimal hyper-plane is transformed into a constrained optimization problem:

min_{w,b,ξ} (1/2)‖w‖^2 + C Σ_{i=1}^{n} ξ_i   s.t.  y_i(w·x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0   (10)

When mapping samples to the high-dimensional space, different inner product kernel functions form different algorithms. Among them, the radial basis function is widely employed in pattern recognition, which can be described as:

K(x_i, x_j) = exp(−g ‖x_i − x_j‖^2)   (11)

where g is a kernel parameter. The solution of the constrained optimization problem (10) is determined by a saddle point of the Lagrange function, and this quadratic programming problem is transformed into the corresponding dual problem, that is:

max_µ Σ_{i=1}^{n} µ_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} µ_i µ_j y_i y_j K(x_i, x_j)   s.t.  Σ_{i=1}^{n} µ_i y_i = 0, 0 ≤ µ_i ≤ C   (12)

where µ_i is a Lagrange multiplier.
With the optimal value of µ_i obtained from the above dual problem, the optimal classification discriminant function can be ascertained as:

f(x) = sgn( Σ_{i=1}^{n} µ_i y_i K(x_i, x) + b )   (13)
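As an illustration of the RBF-kernel SVM described above, the following sketch uses scikit-learn with synthetic stand-in data (the actual fault feature vectors are introduced later in the paper); the values of C and g here are arbitrary placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the fault feature vectors
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# C is the penalty factor of Eq. (10); gamma is 'g' in the RBF kernel, Eq. (11)
clf = SVC(kernel='rbf', C=10.0, gamma=0.5)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)   # test-set classification accuracy
```

Predictions for new feature vectors are obtained with `clf.predict(...)`, which evaluates the discriminant function (13) over the learned support vectors.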

Dispersion Entropy
For a time series x = [x_1, x_2, ..., x_n] of length n, x_i is firstly normalized by a mapping function, where the standard normal cumulative distribution function is generally used [17]:

y_i = 1/(σ√(2π)) ∫_{−∞}^{x_i} e^{ −(t−µ)^2/(2σ^2) } dt   (14)

where i = 1, 2, ..., n, and σ and µ are the standard deviation and expectation of the normal distribution, respectively. The time series x is thereby normalized to y = [y_1, y_2, ..., y_n], y_i ∈ (0, 1). Further, the phase space is reconstructed for y:

y_j^m = [y_j, y_{j+τ}, ..., y_{j+(m−1)τ}]   (15)

where j = 1, 2, ..., n − (m − 1)τ, and m and τ are the embedding dimension and time delay, respectively. Then y_j^m is mapped to c classes (1 to c):

z_i^c = round(c·y_i + 0.5)   (16)

where z_i^c is the i-th element of the class sequence z^c, while round means rounding. Each z_j^{m,c} corresponds to a dispersion pattern π_{v_0 v_1 ... v_{m−1}}, where z_j^c = v_0, z_{j+τ}^c = v_1, ..., z_{j+(m−1)τ}^c = v_{m−1}. The pattern contains m elements and each element can take c values, so the number of all possible dispersion patterns is c^m.
The relative frequency of π_{v_0 v_1 ... v_{m−1}} can be deduced as:

p(π_{v_0 v_1 ... v_{m−1}}) = Number{ j | j ≤ n − (m−1)τ, z_j^{m,c} has the pattern π_{v_0 v_1 ... v_{m−1}} } / ( n − (m−1)τ )   (17)

Dispersion entropy can then be defined according to information entropy theory:

DE(x, m, c, τ) = − Σ_{π=1}^{c^m} p(π) ln p(π)   (18)

A large DE value means that the time series is more irregular; on the contrary, the smaller the DE value, the lower the irregularity.
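The steps above (normal-CDF mapping, c-class quantization, pattern counting, Shannon entropy) can be sketched as follows; the parameter defaults match the values used later in this paper (m = 3, c = 6, τ = 1):

```python
import numpy as np
from math import erf, sqrt
from collections import Counter

def dispersion_entropy(x, m=3, c=6, tau=1):
    """Dispersion entropy sketch following Eqs. (14)-(18)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu, sigma = x.mean(), x.std()
    # Normal-CDF mapping of x_i into (0, 1), Eq. (14)
    y = np.array([0.5 * (1 + erf((v - mu) / (sigma * sqrt(2)))) for v in x])
    # Quantize to classes 1..c: z_i = round(c*y_i + 0.5), Eq. (16)
    z = np.clip(np.round(c * y + 0.5), 1, c).astype(int)
    # Embed (Eq. (15)) and count dispersion patterns
    total = n - (m - 1) * tau
    patterns = Counter(
        tuple(z[j + k * tau] for k in range(m)) for j in range(total)
    )
    # Relative frequencies (Eq. (17)) and entropy (Eq. (18))
    p = np.array([cnt / total for cnt in patterns.values()])
    return float(-np.sum(p * np.log(p)))
```

For an i.i.d. random signal the result approaches the maximum ln(c^m); for a constant-trend or periodic signal it is markedly lower.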

Fine-Sorted Dispersion Entropy
Dispersion entropy, as an information entropy feature, can effectively monitor the dynamic changes in a vibration signal and measure its irregularity, so it can be effectively applied to the state monitoring of rolling bearings. However, its feature extraction ability for complex non-stationary signals can be further improved in some respects. One possible improvement would be to enhance the perception of neighboring element values. In the theory of dispersion entropy, attention is focused on the class sequence that maps the elements of the time series into positive integers. The relationship between elements, as comparatively critical information, has not been considered enough, which means that the degree of difference between neighboring elements is not taken into account. According to the mapping rule of dispersion entropy, the same dispersion pattern may come from multiple forms of sample vectors. Figure 1 shows an example of possible vectors corresponding to the same dispersion pattern [2 2 3 3].

To enhance the feature extraction ability of dispersion entropy, an improved dispersion entropy called fine-sorted dispersion entropy (FSDE) is proposed. FSDE introduces a factor f to measure the difference between neighboring elements in the vector y_j^m. The factor f is then added to the class sequence z_j^{m,c} as an additional element, so that the relationship information between elements is brought into dispersion entropy.
This operation not only inherits the features of the original DE result but also improves the case in which vectors with distinct trends are assigned to the same class sequence, which may provide a powerful aid to the feature extraction process of FSDE [31,32]. In the calculation of factor f, ⌊·⌋ returns the largest integer that is less than or equal to the value in brackets. As an example, consider three time series with different trends that are all mapped to the same dispersion pattern [4,4,4,4] by the original DE. When the same time series are mapped by the proposed FSDE, factor f is added as an additional element to the end of each class sequence, and the new class sequences become [4,4,4,4,0], [4,4,4,4,1], and [4,4,4,4,2]. In this example, time series originally mapped to the same dispersion pattern are finely mapped to more dispersion patterns. Figure 2 illustrates the way that f is added to the sequence z_j^{m,c} as an additional element.

After obtaining the new class sequence z_j^{m,c}, π_{v_0 v_1 ... v_{m−1} f} denotes the dispersion pattern corresponding to a z_j^{m,c}, and the relative frequency of each new dispersion pattern is calculated in the same way as for DE. Then, similar to the definition of dispersion entropy, the proposed fine-sorted dispersion entropy can be defined as:

FSDE(x, m, c, τ, ρ) = − Σ_π p(π) ln p(π)
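A sketch of FSDE following the description above. Since the exact expression for factor f is not reproduced here, the sketch assumes an illustrative difference-based factor f = ⌊c·|y_{j+(m−1)τ} − y_j|/ρ⌋, which is consistent with the text (a smaller ρ subdivides patterns more finely, and ρ = 2.9 with c = 6 yields f ∈ {0, 1, 2} as in the example) but may differ from the paper's actual equation for f:

```python
import numpy as np
from math import erf, sqrt, floor
from collections import Counter

def fsde(x, m=3, c=6, tau=1, rho=2.9):
    """Fine-sorted dispersion entropy sketch. The trend factor
    f = floor(c * |y_{j+(m-1)tau} - y_j| / rho) is an ASSUMED
    illustrative form, not the paper's exact definition."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu, sigma = x.mean(), x.std()
    # Same normal-CDF mapping and quantization as DE
    y = np.array([0.5 * (1 + erf((v - mu) / (sigma * sqrt(2)))) for v in x])
    z = np.clip(np.round(c * y + 0.5), 1, c).astype(int)
    total = n - (m - 1) * tau
    patterns = Counter()
    for j in range(total):
        pat = tuple(z[j + k * tau] for k in range(m))
        # Trend factor between the first and last embedded elements
        f = floor(c * abs(y[j + (m - 1) * tau] - y[j]) / rho)
        patterns[pat + (f,)] += 1      # append f as an extra element
    p = np.array([cnt / total for cnt in patterns.values()])
    return float(-np.sum(p * np.log(p)))
```

Because appending f refines the pattern partition, the FSDE of a signal is never smaller than its DE under the same (m, c, τ).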

Sine Cosine Algorithm
Sine cosine algorithm (SCA) is a new swarm optimization algorithm, which is simple in structure and easy to implement. Using only sine and cosine functions, SCA achieves a good optimization effect. Exploration and exploitation are the two key phases when SCA processes an optimization problem (see [26] for more details). With an elaborate mechanism of exploration and exploitation, SCA is able to search for feasible solutions quickly in the search space. Let Z_i = (Z_{i1}, Z_{i2}, ..., Z_{id})^T be the position of the i-th individual, whereupon each solution of the optimization problem corresponds to the position of a corresponding individual in the search space, where d is the dimension of individuals. The core updating equation of SCA is as follows:

Z_i^{l+1} = Z_i^l + r_1 · sin(r_2) · | r_3 P_i^l − Z_i^l |,  if r_4 < 0.5
Z_i^{l+1} = Z_i^l + r_1 · cos(r_2) · | r_3 P_i^l − Z_i^l |,  if r_4 ≥ 0.5   (23)

where P_i = (P_{i1}, P_{i2}, ..., P_{id})^T is the best position of individual i. The parameter r_1 = a − l(a/l_max) determines the region of individual i at the next iteration, where l_max, l, and a are the maximum number of iterations, the current number of iterations, and a constant, respectively. The parameter r_2 is a random number in the scope of [0, 2π], which defines the distance that the next movement of an individual should cover towards or away from the destination. To stochastically emphasize (r_3 > 1) or deemphasize (r_3 < 1) the effect of the best position searched so far, a random weight r_3 within [0, 2] is brought into the equation. Finally, the parameter r_4 is a random number in the range of [0, 1] used to switch fairly between the sine and cosine components.
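The core update rule (23) can be sketched as a minimal SCA for a minimization problem; the search bounds [-10, 10] and the constant a = 2 are illustrative choices:

```python
import numpy as np

def sca(fitness, dim=2, n_agents=20, l_max=100, a=2.0, seed=0):
    """Minimal SCA sketch implementing update rule (23)."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(-10, 10, (n_agents, dim))
    best = Z[np.argmin([fitness(z) for z in Z])].copy()
    for l in range(l_max):
        r1 = a - l * (a / l_max)             # shrinks exploration over time
        for i in range(n_agents):
            r2 = rng.uniform(0, 2 * np.pi, dim)
            r3 = rng.uniform(0, 2, dim)
            r4 = rng.uniform(0, 1, dim)
            # Sine branch for r4 < 0.5, cosine branch otherwise
            step = np.where(r4 < 0.5,
                            r1 * np.sin(r2) * np.abs(r3 * best - Z[i]),
                            r1 * np.cos(r2) * np.abs(r3 * best - Z[i]))
            Z[i] = np.clip(Z[i] + step, -10, 10)
            if fitness(Z[i]) < fitness(best):
                best = Z[i].copy()
    return best
```

Run on the sphere function, the agents contract around the best-so-far position as r_1 decays toward zero.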

Particle Swarm Optimization
Particle swarm optimization (PSO) is a classical stochastic optimization algorithm based on swarm intelligence. In the PSO algorithm, each particle corresponds to a potential solution of the problem to be optimized [33]. Particle P_i has its own velocity and position, which are expressed as the d-dimensional vectors v_i = (v_{i1}, v_{i2}, ..., v_{id})^T and s_i = (s_{i1}, s_{i2}, ..., s_{id})^T, respectively. Particle P_i keeps track of the optimal solution it has found so far, namely, the individual extreme value P_ibest. The optimal solution found by the whole population is called the global extreme value P_gbest. Particles of the swarm update themselves by tracking the above two extreme values. PSO generates a group of particles randomly to initiate the optimization and then solves the problem by iterations. In the l-th iteration, particle P_i updates its velocity and position according to the following two equations:

v_i^{l+1} = w·v_i^l + c_1·rand·(P_ibest − s_i^l) + c_2·rand·(P_gbest − s_i^l)   (24)
s_i^{l+1} = s_i^l + v_i^{l+1}   (25)

where w is the inertia weight, which is employed to balance the global and local search ability of the algorithm, c_1 and c_2 are learning factors, rand is a random number between (0, 1), and l is the current number of iterations.
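A minimal sketch of the two update equations; the bounds and coefficient values (w = 0.7, c_1 = c_2 = 1.5) are illustrative:

```python
import numpy as np

def pso(fitness, dim=2, n_particles=20, l_max=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO sketch implementing the velocity/position updates."""
    rng = np.random.default_rng(seed)
    s = rng.uniform(-10, 10, (n_particles, dim))   # positions
    v = np.zeros((n_particles, dim))               # velocities
    p_best = s.copy()
    p_best_fit = np.array([fitness(x) for x in s])
    g_best = p_best[np.argmin(p_best_fit)].copy()
    for _ in range(l_max):
        for i in range(n_particles):
            # Velocity update: inertia + cognitive + social terms
            v[i] = (w * v[i]
                    + c1 * rng.random(dim) * (p_best[i] - s[i])
                    + c2 * rng.random(dim) * (g_best - s[i]))
            s[i] = s[i] + v[i]                     # position update
            fit = fitness(s[i])
            if fit < p_best_fit[i]:                # track individual extreme
                p_best[i], p_best_fit[i] = s[i].copy(), fit
                if fit < fitness(g_best):          # track global extreme
                    g_best = s[i].copy()
    return g_best
```

On a smooth test function such as the sphere, the swarm typically collapses onto the global extreme value within a few dozen iterations.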

Mutation SCA-PSO Optimization
In the updating strategy of SCA, the sine and cosine functions in Equation (23) possess an outstanding ability of exploration but also bring some drawbacks, such as slow convergence and the likelihood of being trapped in a local optimum. The reason for the above problems is that the parameters of SCA are difficult to set properly: a poor combination of parameters will result in weak exploitation, though the exploration phase will not be affected [34]. On the other hand, the PSO algorithm has the advantages of information exchange among particles and good robustness, which leads to a high probability of exploiting a local optimal solution. However, similar to SCA, PSO also suffers from the local optimum problem in the later stage of iterations.
Since SCA and PSO have their own strengths, combining the capabilities of SCA and PSO is expected to improve the convergence performance, which may make the hybrid algorithm powerful in both exploring different regions and exploiting promising regions. As shown in Figure 3, the SCA-PSO approach is constructed in a hierarchical form. The top layer contains M particles which search for solutions according to the PSO strategy. The bottom layer contains M·N individuals which are evenly distributed to the top layer particles (M groups in total) and whose positions are updated by the SCA strategy.
In iterations, the optimal value found by each group in the bottom layer is saved by a top layer particle. Through this operation, the top layer particles focus on exploitation by using the advantages of PSO, while the bottom layer individuals focus on exploration, giving play to the advantages of SCA [34]. The updating strategy of the bottom layer individuals is changed to the following equation:

Z_{ij}^{l+1} = Z_{ij}^l + r_1 · sin(r_2) · | r_3 s_i^l − Z_{ij}^l |,  if r_4 < 0.5
Z_{ij}^{l+1} = Z_{ij}^l + r_1 · cos(r_2) · | r_3 s_i^l − Z_{ij}^l |,  if r_4 ≥ 0.5   (26)

where Z_{ij}^l is the position of the j-th (j = 1, 2, ..., N) bottom layer individual that belongs to the i-th (i = 1, 2, ..., M) top layer particle, the position of the i-th top layer particle is described as s_i^l, and l is the current number of iterations.
Moreover, analysis of PSO shows that the population diversity of PSO gradually decreases with each iteration, making it more likely to fall into a local optimum in the later stage of iterations. To solve this problem, a mutation operator is introduced to enrich the diversity of the PSO particles [35]. The updating rules of the top layer particles containing the mutation operator are given in Equations (27) and (28), where G and T (T < l_max) are the given mutation amplitude and mutation period, respectively, while unif represents a random number that conforms to the uniform distribution U(0, 1). It can be observed that during the updating process, the position s_i mutates periodically, which gives the PSO particles a certain ability to jump out of a local optimum even in the later stage of iterations. The computational complexity of the proposed mutation sine cosine algorithm and particle swarm optimization (SCA-PSO) algorithm is O(l·M·(N·t_sca + t_pso)), where t_sca and t_pso are the computational costs of updating all the search agents in each iteration for SCA and PSO, respectively. Accordingly, the computational complexities of SCA and PSO are O(l·n·t_sca) and O(l·n·t_pso), respectively.
The main steps of the proposed mutation SCA-PSO (MSCAPSO) approach are as follows, and its block diagram is shown in Figure 4:
Step 1: Initialize M particles in the top layer as well as M·N individuals in the bottom layer randomly, and set the mutation parameters;
Step 2: Calculate the fitness values of all searching agents;
Step 3: Update s_i on the basis of the best solution found by each group in the bottom layer according to Equation (26);
Step 4: Update s_ibest in the top layer based on Equations (27) and (28);
Step 5: If the maximum number of iterations is not met, go to Step 3;
Step 6: Return the value of s_gbest as the optimal solution.
Figure 4. Block diagram of the proposed mutation operator, sine cosine algorithm and particle swarm optimization (MSCAPSO).
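The steps above can be sketched as a compact two-layer loop: each of the M top-layer PSO particles guides N bottom-layer SCA individuals, and the top layer mutates every T iterations. Boundary handling is omitted, and the mutation is modeled as a uniform perturbation of amplitude G, an assumption standing in for the exact form of Equations (27) and (28):

```python
import numpy as np

def mscapso(fitness, dim=2, M=5, N=8, l_max=60, w=0.7, c1=1.5, c2=1.5,
            G=1.0, T=5, a=2.0, seed=0):
    """Simplified sketch of the hierarchical mutation SCA-PSO."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(-10, 10, (M, N, dim))          # bottom layer (SCA)
    s = rng.uniform(-10, 10, (M, dim))             # top layer (PSO)
    v = np.zeros((M, dim))
    s_best = s.copy()
    s_best_fit = np.array([fitness(x) for x in s])
    g_best = s_best[np.argmin(s_best_fit)].copy()
    for l in range(l_max):
        r1 = a - l * (a / l_max)
        for i in range(M):
            # Bottom layer: SCA individuals explore around top particle s[i], Eq. (26)
            for j in range(N):
                r2 = rng.uniform(0, 2 * np.pi, dim)
                r3 = rng.uniform(0, 2, dim)
                r4 = rng.uniform(0, 1, dim)
                step = np.where(r4 < 0.5,
                                r1 * np.sin(r2) * np.abs(r3 * s[i] - Z[i, j]),
                                r1 * np.cos(r2) * np.abs(r3 * s[i] - Z[i, j]))
                Z[i, j] = Z[i, j] + step
            # Group best feeds the corresponding top-layer particle
            group_best = Z[i][np.argmin([fitness(z) for z in Z[i]])]
            if fitness(group_best) < fitness(s[i]):
                s[i] = group_best.copy()
            # Top layer: PSO update with periodic mutation (assumed form)
            v[i] = (w * v[i]
                    + c1 * rng.random(dim) * (s_best[i] - s[i])
                    + c2 * rng.random(dim) * (g_best - s[i]))
            s[i] = s[i] + v[i]
            if (l + 1) % T == 0:                   # periodic mutation
                s[i] = s[i] + G * (rng.random(dim) - 0.5)
            fit = fitness(s[i])
            if fit < s_best_fit[i]:
                s_best[i], s_best_fit[i] = s[i].copy(), fit
                if fit < fitness(g_best):
                    g_best = s[i].copy()
    return g_best
```

Because the global best is only replaced on improvement, the periodic mutation restores diversity without discarding the best solution found so far.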

Fault Diagnosis by FSDE and MSCAPSO Optimized SVM
In this section, a novel fault diagnosis method based on FSDE and SVM optimized by MSCAPSO is proposed to improve the fault diagnosis accuracy. The specific steps are detailed as follows:
Step 1: Determine the mode number K of VMD by the central frequency observation method;
Step 2: Decompose the samples into sets of IMFs with VMD;
Step 3: Calculate the FSDE value of each IMF;
Step 4: Construct the feature vectors of different fault samples with the FSDE values;
Step 5: Optimize the parameters C and g of SVM with the proposed MSCAPSO optimization method;
Step 6: Train the SVM model with the optimal parameters C and g;
Step 7: Apply the optimal SVM model to classify different fault samples and evaluate the performance of the model.
The flowchart of the proposed fault diagnosis model is shown in Figure 5.


Data Collection
Vibration signals with different fault locations and sizes gathered from the Bearings Data Center of Case Western Reserve University (CWRU) [36] were employed as the experiment data in this research. The experiment stand was mainly composed of a motor, an accelerometer, a torque sensor/encoder, and a dynamometer. The bearing was a deep groove ball bearing with model SKF6205-2RS. Accelerometers were applied to collect the vibration signal, while the torque sensor/encoder measured power and speed. Using electro-discharge machining, the experiment device simulated three fault states of the rolling bearing: inner race fault, ball element fault, and outer race fault, and the depth of the faults was 0.011 inches. In the experiment, vibration signals collected from the drive end were taken as the research objects, where the sample frequency was 12 kHz. To fully verify the effectiveness of the proposed fault diagnosis method, nine types of samples were applied in this paper, namely inner race fault, ball fault, and outer race fault, each with diameters of 0.007, 0.014, and 0.021 inches. In addition, the experiments were conducted under loads of 2, 1, and 0 hp with motor speeds of 1750, 1772, and 1797 rpm. Further, the vibration signal of each type of fault contained 59 samples, and each sample contained 2048 sample points. The experimental data are listed in Table 1.
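The sample layout described above (59 non-overlapping samples of 2048 points per fault type) can be reproduced with a simple reshape; the helper name is illustrative:

```python
import numpy as np

def segment(signal, n_samples=59, sample_len=2048):
    """Split one vibration record into non-overlapping samples, as in
    the CWRU setup described above."""
    usable = n_samples * sample_len
    assert len(signal) >= usable, "record too short for requested layout"
    return np.asarray(signal[:usable]).reshape(n_samples, sample_len)
```

For example, a 12 kHz drive-end record of roughly 10 s provides the 59 × 2048 = 120,832 points needed for one fault type.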

Application to Fault Diagnosis of Rolling Bearings
The proposed FSDE-MSCAPSO method has been applied to diagnose nine types of faults and compared with six other relevant fault diagnosis methods at the feature extraction and parameter optimization stages. More specifically, fuzzy entropy (FE) [37], permutation entropy (PE) [4], and dispersion entropy (DE) were applied at the feature extraction stage for comparison, while SCA and PSO were applied at the parameter optimization stage. The parameters of the contrastive experiments were set in the same way as for the proposed method.
In the proposed method, the first step is to decompose the signal of each type of fault into a set of IMFs. The mode number K of VMD needs to be determined in advance with the central frequency observation method. If K is too large, the central frequencies of neighboring IMFs may be too close, and mode mixing will emerge. Normalized central frequencies under different K values are listed in Table 2, where K was ascertained using the sample of the ball fault with a diameter of 0.014 inches (L5). As listed in Table 2, when K was set to 5, the first two central frequencies were relatively close. Similarly, as illustrated in Figure 6, during the iterative calculation of VMD, if K was 5 or greater, some central frequencies of IMFs were close to each other, which meant excessive decomposition had occurred. Hence, parameter K was set to 4.
The waveforms of the original signals are shown in Figure 7, and the decomposition results in Figure 8, which shows that the IMFs decomposed from the same types of faults are quite different.

When the IMFs of all samples were obtained, FSDE values were calculated to construct the fault feature vectors, during which four parameters needed to be chosen properly, namely, the embedding dimension m, the number of classes c, the time delay τ, and the precision parameter ρ. Theoretically, m is usually set from two to seven; if m is large, there will be a high computational cost. When dealing with c, it is usually suggested to choose from four to eight; if c is small, it may not be able to distinguish two values with very different amplitudes [38]. ρ is the key parameter that distinguishes FSDE from DE: the closer ρ is to 0, the more new dispersion patterns an original dispersion pattern will be subdivided into; however, FSDE may then also be oversensitive to small changes and increase the computation. Considering all these factors, the embedding dimension m was set at three and the number of classes c at six in our research. Moreover, for practical purposes, it was recommended to assume the time delay τ = 1 [38], while the precision parameter ρ was set at 2.9 to balance precision and computational complexity.
The first three FSDE and DE vectors of different fault samples (L1-L9) are compared in Table 3, which shows that FSDE value of every IMF is different from the corresponding DE value, but the overall trend remains similar. 3D projection of entropy vectors for different types of faults is shown in Figure 9. It can be seen that feature vectors extracted by PE are not clearly distinguished, quite a lot of feature vectors from different types of faults are still mixed together. This situation is improved in the figures of FE and DE to a certain extent, while feature vectors extracted by FSDE are basically well distinguished. In the optimization stage, the feature vectors from different types of faults were divided randomly into two parts with 30 and 29 vectors, where 30 were used for training while the rest, 29, for testing. The proposed MSCAPSO approach was applied to optimize the penalty factor C and the kernel parameter g of SVM. The total number of searching agents and iteration times were set 40 and 100, respectively, while searching ranges of C and g were both [2 −10 , 2 10 ]. Considering the maximum number of iterations and the number of optimization parameters, the mutation period was set at five and mutation amplitude was set at one. Fitness values were calculated by five-fold cross-validation for the training samples. Hence, the optimization effect of the given C and g could be measured by the average accuracy of cross-validation, then the optimal C and g were obtained. in Figure 9. It can be seen that feature vectors extracted by PE are not clearly distinguished, quite a lot of feature vectors from different types of faults are still mixed together. This situation is improved in the figures of FE and DE to a certain extent, while feature vectors extracted by FSDE are basically well distinguished. Table 3. Fine-sorted dispersion entropy and dispersion entropy of intrinsic mode functions (IMFs). 
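As a point of reference for these parameter choices, standard dispersion entropy, which FSDE refines by subdividing dispersion patterns according to ρ, can be sketched as follows with m = 3, c = 6, and τ = 1. The normal-CDF class mapping is the usual DE construction; the fine-sorting step itself is not reproduced here, and the function name is ours.

```python
import math
from collections import Counter
from statistics import mean, pstdev

def dispersion_entropy(x, m=3, c=6, tau=1):
    """Standard dispersion entropy: map the series to c classes via the
    normal CDF, count dispersion patterns of length m with delay tau,
    and return the Shannon entropy normalized by ln(c**m)."""
    mu, sigma = mean(x), pstdev(x)
    # Map each sample into (0, 1) through the normal CDF.
    y = [0.5 * (1 + math.erf((v - mu) / (sigma * math.sqrt(2)))) for v in x]
    # Assign each sample to a class 1..c.
    z = [min(c, max(1, int(round(c * v + 0.5)))) for v in y]
    # Count the dispersion patterns.
    n = len(z) - (m - 1) * tau
    patterns = Counter(tuple(z[i + j * tau] for j in range(m)) for i in range(n))
    probs = [cnt / n for cnt in patterns.values()]
    return -sum(p * math.log(p) for p in probs) / math.log(c ** m)
```

In this application, such an entropy value would be computed for each IMF produced by VMD, giving one feature vector per fault sample.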
In the optimization stage, the feature vectors of each fault type were divided randomly into two parts of 30 and 29 vectors, where the 30 were used for training and the remaining 29 for testing. The proposed MSCAPSO approach was applied to optimize the penalty factor C and the kernel parameter g of the SVM. The total number of searching agents and the number of iterations were set to 40 and 100, respectively, while the searching ranges of C and g were both [2^-10, 2^10]. Considering the maximum number of iterations and the number of optimization parameters, the mutation period was set to five and the mutation amplitude to one. Fitness values were calculated by five-fold cross-validation on the training samples; hence, the optimization effect of a given C and g could be measured by the average cross-validation accuracy, from which the optimal C and g were obtained.
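As a rough illustration of this hybrid strategy, the sketch below blends a PSO velocity step with an SCA sine/cosine move toward the global best and periodically perturbs the worst agent. The coefficient values, the blending rule, and the mutation scheme are our own simplifying assumptions for a toy minimization problem, not the authors' exact formulation.

```python
import math
import random

def mscapso(f, dim, bounds, agents=30, iters=100, mut_period=5, mut_amp=1.0, seed=1):
    """Schematic hybrid SCA-PSO minimizer with a periodic mutation."""
    rng = random.Random(seed)
    lo, hi = bounds
    X = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(agents)]
    V = [[0.0] * dim for _ in range(agents)]
    P = [x[:] for x in X]                        # personal best positions
    Pf = [f(x) for x in X]                       # personal best fitness
    g = min(range(agents), key=lambda i: Pf[i])
    G, Gf = P[g][:], Pf[g]                       # global best
    for t in range(iters):
        r1 = 2.0 * (1 - t / iters)               # SCA amplitude decays over time
        for i in range(agents):
            for d in range(dim):
                # PSO velocity update toward personal and global bests.
                V[i][d] = (0.7 * V[i][d]
                           + 1.5 * rng.random() * (P[i][d] - X[i][d])
                           + 1.5 * rng.random() * (G[d] - X[i][d]))
                # SCA sine/cosine step toward the global best.
                r2, r3 = rng.uniform(0, 2 * math.pi), rng.uniform(0, 2)
                step = r1 * (math.sin(r2) if rng.random() < 0.5 else math.cos(r2))
                X[i][d] += V[i][d] + step * abs(r3 * G[d] - X[i][d])
                X[i][d] = min(hi, max(lo, X[i][d]))
            fi = f(X[i])
            if fi < Pf[i]:
                P[i], Pf[i] = X[i][:], fi
                if fi < Gf:
                    G, Gf = X[i][:], fi
        if (t + 1) % mut_period == 0:            # mutate the worst agent
            w = max(range(agents), key=lambda i: Pf[i])
            X[w] = [min(hi, max(lo, v + rng.gauss(0, mut_amp))) for v in X[w]]
    return G, Gf
```

For SVM tuning, f would evaluate the negative five-fold cross-validation accuracy of an SVM trained with the candidate (C, g), with each coordinate interpreted as an exponent over the range [-10, 10].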
The convergence procedure of MSCAPSO is depicted in Figure 10a, from which it can be observed that the average fitness value of the particles rises rapidly in the early iterations and then tends to be stable; that is, the particles approach the global optimal solution. The shaded part shows the distribution of the convergence curves over 10 runs. A comparison of the different optimization methods is shown in Figure 10b, where all convergence experiments are based on the same FSDE feature vectors and each convergence curve is the average of 10 runs. The figure shows that the convergence curve of PSO fluctuates greatly and its average fitness value is difficult to improve further after 50 iterations. SCA has a good overall convergence trend, but its convergence rate is significantly lower than that of the other two methods. MSCAPSO converges to a high level in the early stage and keeps the best convergence behavior among all methods throughout the iterations. With the optimal parameters C and g, the SVM model optimized by MSCAPSO was trained and applied to recognize the testing samples. For an in-depth verification of the proposed method, the diagnosis results were averaged over 10 runs, with training samples chosen at random in every run; the deviation of each result was also calculated as a reference for evaluation. In this application, four widely used evaluation metrics, the adjusted Rand index (ARI), normalized mutual information (NMI), F-measure (F), and accuracy (ACC), are employed to evaluate the diagnosis results [39,40]; they reflect the matching degree between the classification result and the real distribution of the samples. The range of ARI is [-1, 1] and the ranges of the others are all [0, 1]; a higher value indicates better classification performance. The optimal C and g reported correspond to the best accuracy among the 10 runs.
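Of these metrics, ARI is the least common to compute by hand. It can be obtained from the contingency-table counts of the two labelings, as in the minimal stdlib sketch below (the function name is ours):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings; 1.0 means perfect
    agreement (up to label permutation), 0.0 is chance level."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))  # contingency table cells
    a = Counter(labels_true)                        # row sums
    b = Counter(labels_pred)                        # column sums
    sum_ij = sum(comb(v, 2) for v in pairs.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)           # chance-adjustment term
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)
```

Because ARI is corrected for chance, a classifier that permutes the class labels but preserves the grouping still scores 1.0, which makes it a useful complement to raw accuracy.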
Fault diagnosis results with the different methods under variable load conditions are shown in Table 4 and Figure 11. It can be observed from Table 4 that all four evaluation metrics of the proposed FSDE-MSCAPSO method are the highest when the motor load is 2 hp or 1 hp, i.e., 0.9637, 0.9646, 0.9839, 0.9839 and 0.9615, 0.9598, 0.9827, 0.9828, respectively. Moreover, the deviations of the evaluation values are among the smallest of all methods. When the motor load is 0 hp, although the value of the proposed method is slightly lower than that of FE-PSO in each metric, its performance is still better than that of the DE-based methods before the improvements. Specifically, the comparison of FE-PSO, DE-PSO, and FSDE-PSO indicates that the proposed FSDE has better feature extraction ability than FE and DE in the field of fault diagnosis; this is likewise verified against PE and DE in the comparison among PE-SCA, DE-SCA, and FSDE-SCA. Moreover, it can be concluded from the contrastive experiments between FSDE-SCA/PSO and FSDE-MSCAPSO that the proposed MSCAPSO optimization strategy improves the diagnosis performance of the SVM.
The boxplots of the evaluation values are shown in Figure 12, illustrating the performance of the different diagnostic methods. As shown in the figure, the proposed method achieves better overall performance and stability for fault diagnosis than the other contrastive methods. Furthermore, several statistical metrics widely used for robustness evaluation [10,41,42] are applied to demonstrate the robustness of the diagnosis models, i.e., the mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE). The error metrics of the outputs of the different models under the various loads are listed in Table 5, with the corresponding histograms shown in Figure 13. It can be observed that the error metrics of FSDE-MSCAPSO are the lowest when the motor load is 2 hp or 1 hp. When the motor load is 0 hp, although the metric values of the proposed method are not the lowest among all models, they are still at a low level and in second position for each metric, indicating that the proposed model has good robustness on the whole. The method labels are: 1: Fuzzy entropy-particle swarm optimization (FE-PSO), 2: Permutation entropy-sine cosine algorithm (PE-SCA), 3: Dispersion entropy-particle swarm optimization (DE-PSO), 4: Dispersion entropy-sine cosine algorithm (DE-SCA), 5: Fine-sorted dispersion entropy-particle swarm optimization (FSDE-PSO), 6: Fine-sorted dispersion entropy-sine cosine algorithm (FSDE-SCA), 7: Fine-sorted dispersion entropy-mutation operator, sine cosine algorithm and particle swarm optimization (FSDE-MSCAPSO).
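The three robustness metrics follow their standard definitions, which can be sketched in a few lines (the function name is ours):

```python
import math

def error_metrics(y_true, y_pred):
    """Return (MAE, RMSE, MAPE-in-percent) between targets and outputs."""
    n = len(y_true)
    errs = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errs) / n
    rmse = math.sqrt(sum(e * e for e in errs) / n)
    # MAPE assumes no target value is zero.
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errs, y_true)) / n
    return mae, rmse, mape
```

Since RMSE squares the residuals, it penalizes occasional large misclassification errors more heavily than MAE, which is why the two can rank models differently.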

Discussion
After the detailed comparative analysis above, the superiority of the proposed FSDE and MSCAPSO methods has been effectively demonstrated, and the accuracy of the diagnosis model has been verified by experiments on various loads, fault sizes, and locations. It is worth noting that the proposed FSDE method only considers the relationship between neighboring elements of a given series; the overall information of the series is not taken into account. According to previous references [31,43], improvements using the overall amplitude information of a series have been realized for other entropy methods, which is expected to further improve the performance of the proposed method. Furthermore, some of the parameters of FSDE are still empirical values in the present paper. In future studies, reasonable metrics can be introduced to evaluate the influence of the parameters on the approximation between FSDE and DE, so as to better determine their values. The proposed MSCAPSO algorithm can also be refined with better strategies to further improve its convergence speed and global search ability. In addition, the multi-objective optimization that has been widely utilized in the field of control [44-47] could be applied to fault diagnosis, which is expected to improve the accuracy and reduce the variance of the outputs of the model [48]. Moreover, although the diagnosis experiments considered various load conditions, fault sizes, and locations, the signal-to-noise ratio of the CWRU bearing data is relatively high, so the conclusions obtained may not be fully comprehensive. In future work, we plan to build our own bearing test device to collect many more vibration signals, and the results obtained under such conditions may be more practical.

Considering that the CWRU bearing data set has been selected by many researchers as experimental data to validate their methods and techniques, the diagnosis accuracies of some other methods based on this data set are listed in Table 6 for comparison with the methods presented in this paper. Note that the segmentation and fault types of the experimental samples may not be exactly the same, so the accuracies listed are for general reference only.

Table 6. Comparison among several other fault diagnosis models. The abbreviations are as follows: chaos quantum sine cosine algorithm (CQSCA), hybrid gravitational search algorithm (HGSA), extreme learning machine (ELM), multivariate autoregressive (MAR), time shift multiscale fuzzy entropy (TSMFE), Laplacian support vector machine (LapSVM), improved multiscale dispersion entropy (IMDE), max-relevance min-redundancy (mRMR), back-propagation neural network (BPNN), improved multiscale permutation entropy (IMPE), quantum-behaved particle swarm optimization (QPSO), least squares support vector machine (LSSVM), generalized composite multiscale permutation entropy (GCMPE), semi-supervised kernel marginal Fisher analysis (SSKMFA).

Conclusions
A novel fault diagnosis method for rolling bearings based on fine-sorted dispersion entropy and mutation SCA-PSO optimized SVM is presented in this study to diagnose faults of various sizes, locations, and motor loads. Due to the non-stationarity of the original vibration signals, the signals collected from different types of faults were first decomposed by VMD into sets of IMFs, where the decomposing mode number K was determined by the central frequency observation method. Then an improved dispersion entropy containing an additional factor, termed FSDE, was proposed to enhance the perception of the relationship information between neighboring elements; it was then employed to construct the feature vectors of the different fault samples. Afterward, a hybrid optimization strategy combining the advantages of a mutation operator, SCA, and PSO, called MSCAPSO, was proposed to optimize the SVM model. The optimal SVM model was subsequently applied to classify the different fault samples. The contrastive experiments in Section 4 compared the proposed FSDE-MSCAPSO method with other relevant methods, where nine types of faults with different locations and sizes were successfully diagnosed. The analysis of the results indicates that the proposed method achieves the highest values of 0.9637, 0.9646, 0.9839, and 0.9839 on the four widely used evaluation metrics, exhibiting superior precision and stability over the other methods.