A Rolling Bearing Fault Diagnosis Method Based on the WOA-VMD and the GAT

In complex industrial environments, the vibration signal of a rolling bearing is buried in noise, which makes fault diagnosis inaccurate. To overcome the effect of noise on the signal, a rolling bearing fault diagnosis method based on the WOA-VMD (Whale Optimization Algorithm-Variational Mode Decomposition) and the GAT (Graph Attention Network) is proposed to address the end-effect and mode-mixing issues in signal decomposition. Firstly, the WOA is used to adaptively determine the penalty factor and the number of decomposition layers in the VMD algorithm; the resulting optimal parameter combination is then input into the VMD, which decomposes the original signal. Then, the Pearson correlation coefficient method is used to select the IMF (Intrinsic Mode Function) components that are highly correlated with the original signal, and the selected IMF components are reconstructed to remove the noise in the original signal. Finally, the KNN (K-Nearest Neighbor) method is used to construct the graph-structured data, and a multi-head attention mechanism is used to build the GAT rolling bearing fault diagnosis model that classifies the signals. The results show an obvious noise reduction effect in the high-frequency part of the signal after applying the proposed method, where a large amount of noise is removed. In the diagnosis of rolling bearing faults, the test-set accuracy in this study was 100%, higher than that of the four other compared methods, and the diagnosis accuracy for every fault type also reached 100%.


Introduction
Rolling bearings are widely used in various fields of the machinery industry due to their advantages of high speed, high efficiency and low noise. They are often used in harsh operating environments, which results in a large dispersion of service life and a high failure rate. According to statistics, about 30% of mechanical failures in rotating machinery equipment that uses rolling bearings are related to bearing damage [1]. Using the vibration signals generated during operation for fault diagnosis may not only reduce the probability of accidents related to mechanical equipment, but can also provide reliable decision support for the later maintenance of the equipment [2,3]. Fault detection can be based on various physical quantities (such as stator current [4], stray flux [5], thermal images [6], etc.); because vibration-based measurement is lower in cost, easier to apply than direct observation, less sensitive to external interference, and widely used in practical engineering, a vibration-based fault detection method is adopted in this paper.
In general, the vibration signal collected from the equipment is mixed with a considerable amount of noise. Due to this noise, fault diagnosis and equipment health prediction perform poorly. To achieve optimal noise reduction, it is particularly important to select a suitable noise reduction method. The main contributions of this paper are as follows:

1. By separating the signals with a fixed bandwidth, the problems of mode mixing and end effect are solved to some extent. Two key parameters of the VMD are determined using the WOA optimization algorithm, allowing the model to decompose the signal adaptively.

2. Node classification of graph-structured data is carried out using the attention mechanism, which assigns different attention weights to different neighborhoods so as to identify the more important information.

Signal Decomposition and Reconstruction Based on the WOA-VMD
In variational mode decomposition, each signal component can be estimated by solving a frequency-domain variational optimization problem. It is assumed that the decomposed IMF components are all limited-bandwidth signals, and that the fault characteristic frequency appears at the center frequency of an IMF component. The VMD model is solved iteratively, estimating each mode component u_k together with its center frequency ω_k, which leads to the following constrained problem:

\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k} \left\| \partial_t \left[ \left( \delta(t)+\frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\}, \quad \text{s.t.} \; \sum_{k} u_k = f(t) \quad (1)

where {u_k} = {u_1, · · · , u_k} are the k mode components obtained after decomposition, f(t) is the original signal, and {ω_k} = {ω_1, · · · , ω_k} are the center frequencies of the mode components. To solve Equation (1), it is converted into the unconstrained problem of Equation (2), mainly through the use of Lagrange multipliers and a quadratic penalty term:

L(\{u_k\},\{\omega_k\},\lambda) = \alpha \sum_{k} \left\| \partial_t \left[ \left( \delta(t)+\frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| f(t)-\sum_{k} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, f(t)-\sum_{k} u_k(t) \right\rangle \quad (2)

where α is the penalty factor, and λ(t) is the Lagrange multiplier.
Two parameters of the VMD, the penalty factor α and the number of decomposition layers k, must be set before decomposition, which limits the VMD. Too small or too large a value will degrade the algorithm's performance, so the optimal parameter combination [k, α] needs to be determined. At present, the center-frequency observation method is widely used. It determines the value of k by observing the center frequencies obtained under different values of k, without an accurate basis, and it can only determine the number of modes k, not the penalty factor α, which ultimately leads to poor noise reduction.

Whale Optimization Algorithm
In this paper, the whale optimization algorithm is used to adaptively determine the two parameters mentioned above in order to achieve a better noise reduction effect. In the process of hunting, humpback whales surround their prey in groups and move in a spiral, during which they constantly spit out bubbles, thus forming the spiral "bubble net" shown in Figure 1. This makes the space for the prey's movement smaller and smaller, until it can be swallowed in a single bite [36].
However, the algorithm initially does not know the optimal location, and the WOA is goal-oriented to find the prey. When the whales find the best prey, they also find the best whale location, thus allowing the rest of the whales to move closer to it. This behavior can be expressed as follows:

→D = |→C · →X*(t) − →X(t)| (4)
→X(t + 1) = →X*(t) − →A · →D (5)

where →X is the position vector, →X* represents the optimal position obtained so far, →A and →C are the coefficient vectors, t is the current iteration number and →D is the distance between the prey and the whales. If there is a better solution, →X* is iteratively updated.
Vectors →A and →C can be calculated as follows:

→A = 2→a · →r − →a (6)
→C = 2→r (7)

where →a descends linearly from 2 to 0, while →r is a random vector within [0, 1]. The humpback whale's unique hunting method is simulated through a spiral motion that updates the position and a narrowing ring mechanism, which is often referred to as the bubble-net strategy. The spiral update is

→X(t + 1) = →D′ · e^{bl} · cos(2πl) + →X*(t), →D′ = |→X*(t) − →X(t)| (8)

where b defines the shape of the spiral and l is a random number in [−1, 1]. Assuming that the probability of using each of these two mechanisms is 50%, the total strategy is expressed as follows:

→X(t + 1) = →X*(t) − →A · →D, if p < 0.5; →X(t + 1) = →D′ · e^{bl} · cos(2πl) + →X*(t), if p ≥ 0.5

In addition to the above strategy, a whale will also hunt according to the position of a randomly chosen whale, again through a change in vector →A. When |→A| > 1, it moves away from the prey; conversely, it moves closer to the prey. In this way, the global optimization ability can be effectively improved. The mathematical model is as follows:

→D = |→C · →X_random − →X| (9)
→X(t + 1) = →X_random − →A · →D (10)

where →X_random is a whale position randomly selected from the current group of whales. The minimum value of the envelope entropy is selected as the fitness function; the envelope entropy represents the sparsity of the signal. When an IMF contains more noise and less effective information, its envelope entropy value is larger; otherwise, the envelope entropy value is smaller.
The envelope entropy of the signal X(i) (i = 1, 2, . . . , N) is calculated with the following equation:

E_p = − Σ_{j=1}^{N} p_j lg p_j, p_j = a(j) / Σ_{j=1}^{N} a(j) (11)

where a(j) is the envelope signal of the k mode components decomposed by the VMD after Hilbert demodulation, p_j is the probability distribution sequence obtained by normalizing a(j), N is the number of sampling points, and the entropy of the calculated probability distribution sequence p_j is the envelope entropy E_p.
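As an illustrative sketch (not the authors' code), the envelope-entropy fitness of Equation (11) can be computed with a Hilbert envelope; the function name and the small regularization constant are assumptions:

```python
import numpy as np
from scipy.signal import hilbert

def envelope_entropy(imfs):
    """Envelope entropy E_p summed over the decomposed modes.

    `imfs` is assumed to be an iterable of 1-D mode components (e.g., the
    K outputs of a VMD decomposition). Illustrative sketch of Eq. (11).
    """
    total = 0.0
    for u in imfs:
        a = np.abs(hilbert(u))                   # envelope a(j) via Hilbert demodulation
        p = a / a.sum()                          # normalized probability sequence p_j
        total += -np.sum(p * np.log(p + 1e-12))  # entropy of p_j (small eps for log(0))
    return total
```

A concentrated (sparse) envelope yields a low entropy, while a noisy, spread-out envelope yields a high one, which is why the WOA minimizes this quantity.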

WOA-VMD Parameter Optimization
Since the WOA is simple, easy to implement and requires few parameters, it is used to optimize the penalty factor and the number of decomposition layers in the VMD decomposition. The steps are shown in Figure 2.
(1) Set the number of whales, the maximum number of iterations and the optimization dimension, and initialize the position information. Set the search ranges of the mode number k and the penalty factor α (e.g., k ∈ [3, 10] and α ∈ [100, 2000]);
(2) Use the VMD algorithm to decompose the input signal and obtain each IMF. Calculate the envelope entropy of each IMF according to Equation (11). The envelope entropy is used as the fitness function to find and retain the optimal whale location;
(3) Start the iteration. Generate a random number p in the interval (0, 1). If p < 0.5, go directly to step (4); otherwise, Equation (8) is used for the position update, namely spiral contraction;
(4) Determine the value of |A|. If |A| < 1, update the position according to Equation (5), i.e., shrinking encirclement; otherwise, update the position according to Equation (10), i.e., switch to random exploration;
(5) Calculate the fitness of each whale and compare it with the previously retained optimal position. If it is better, replace the optimal solution with the new one;
(6) Determine whether the iteration terminates. If t ≤ t_max, then t = t + 1 and return to step (3); otherwise, the iteration ends and the optimal parameter combination [k, α] is saved.
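The steps above can be sketched as a generic WOA minimizer. This is an illustrative implementation of Equations (5)-(10), not the authors' code; a toy sphere objective stands in for the envelope entropy of the VMD decomposition obtained with candidate [k, α]:

```python
import numpy as np

def woa_minimize(fitness, bounds, n_whales=10, n_iter=50, seed=0):
    """Minimal Whale Optimization Algorithm sketch. `bounds` is a (dim, 2)
    array of search ranges; in the paper, dim = 2 for [k, alpha] and
    `fitness` would be the envelope entropy of the resulting VMD modes."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    dim = len(bounds)
    X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_whales, dim))
    fits = np.array([fitness(x) for x in X])
    best, best_fit = X[fits.argmin()].copy(), fits.min()
    for t in range(n_iter):
        a = 2.0 - 2.0 * t / n_iter              # a descends linearly from 2 to 0
        for i in range(n_whales):
            A = 2.0 * a * rng.random(dim) - a   # Eq. (6)
            C = 2.0 * rng.random(dim)           # Eq. (7)
            if rng.random() < 0.5:
                if np.linalg.norm(A) < 1:       # Eq. (5): shrink toward the best whale
                    D = np.abs(C * best - X[i])
                    X[i] = best - A * D
                else:                           # Eq. (10): explore around a random whale
                    rand = X[rng.integers(n_whales)]
                    D = np.abs(C * rand - X[i])
                    X[i] = rand - A * D
            else:                               # Eq. (8): spiral bubble-net update (b = 1)
                l = rng.uniform(-1.0, 1.0)
                D = np.abs(best - X[i])
                X[i] = D * np.exp(l) * np.cos(2.0 * np.pi * l) + best
            X[i] = np.clip(X[i], bounds[:, 0], bounds[:, 1])
            f = fitness(X[i])
            if f < best_fit:                    # retain the best position found so far
                best, best_fit = X[i].copy(), f
    return best, best_fit
```

Since the best position is only replaced on strict improvement, the returned fitness never exceeds the best initial value.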



Screening of IMF Component Coefficients
To filter the decomposed mode components, the Pearson correlation coefficient method is adopted in this paper.

Pearson Correlation Coefficient Analysis
The Pearson correlation coefficient method measures the correlation between signals to determine the similarity of two signals. The overall correlation coefficient is defined as follows:

ρ_{x_t x_IMF} = cov(x_t, x_IMF) / (σ_{x_t} σ_{x_IMF})

where ρ_{x_t x_IMF} represents the overall correlation coefficient, σ denotes the standard deviation (defined via the expected value E), and cov(x_t, x_IMF) represents the covariance of x_t and x_IMF. The sample correlation coefficient value r(x_t, x_IMF) is calculated as follows:

r(x_t, x_IMF) = Σ_{i=1}^{N} (x_t(i) − x̄_t)(x_IMF(i) − x̄_IMF) / sqrt( Σ_{i=1}^{N} (x_t(i) − x̄_t)² · Σ_{i=1}^{N} (x_IMF(i) − x̄_IMF)² )

Correlation Component Discrimination
The value range of r(x_t, x_IMF) is [−1, 1]; that is, −1 ≤ r(x_t, x_IMF) ≤ 1. The classification of r(x_t, x_IMF) is shown in Table 1. In this section, the criterion of reference [37] is used to select the mode components. In general, mode components with a strong or higher correlation with the original signal are considered to contain little noise. Mode components whose correlation degree falls below "strong" are removed, and the remaining strongly correlated mode components are reconstructed to complete the noise reduction.
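A minimal sketch of this screening step; the 0.6 cutoff for a "strong" correlation is an assumption standing in for Table 1:

```python
import numpy as np

def select_imfs(signal, imfs, threshold=0.6):
    """Keep only the IMFs strongly correlated with the raw signal and
    reconstruct the denoised signal from them. Illustrative sketch; the
    threshold stands in for the 'strong correlation' class of Table 1."""
    kept = []
    for u in imfs:
        r = np.corrcoef(signal, u)[0, 1]   # Pearson r(x_t, x_IMF)
        if abs(r) >= threshold:
            kept.append(u)
    # sum of retained components = reconstructed (denoised) signal
    return np.sum(kept, axis=0) if kept else np.zeros_like(signal)
```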
The signal decomposition and reconstruction method proposed in this paper uses the VMD to decompose signals, which can effectively avoid the mode mixing that occurs in noise reduction with traditional methods, and can retain the effective information of the original signal. The two key parameters of the VMD, k and α, are determined adaptively using the WOA optimization algorithm to improve the generalization ability of the model. For noise reduction, the WOA parameters are initialized, the envelope entropy corresponding to each whale is calculated and the best value is recorded, the optimal parameter combination is output, and the optimal k and α are used to decompose the original signal. The Pearson correlation coefficient is then used to filter the resulting components. The specific steps are shown in Figure 3.


Fault Diagnosis of Rolling Bearing Based on the GAT
If a fault can be found in time during the actual operation of rolling bearings, and maintenance or remedial measures can be taken, this is of great significance for the reliability of bearing operation. In this section, fault diagnosis methods for rolling bearings are studied based on Section 2. A GAT network model is established to diagnose rolling bearings, and a multi-head attention mechanism is used to perform node classification on graph-structured data. Ways of combining the two methods are investigated to increase the weight of important information and improve diagnostic accuracy. The parameter settings and data sets used in the experiments are also described, and the experimental results are analyzed in detail.

KNN Graph Construction Method for Fault Diagnosis Data
Graph attention neural networks need to build graphs to represent the correlation between different faults as input data. Therefore, for the data set V after noise reduction using the WOA-VMD, the concept of Graph is introduced in order to represent the data itself and the relationship between them. Additionally, the KNN method is used to construct the graph model.
In the KNN graph, the first K nearest neighbor points are found for each node (e.g., each fault data point in V). The neighbors of a node x_i can be expressed as follows:

Ne(x_i) = NN_K(x_i, V) (12)

where NN_K returns the K nearest neighbors of the node x_i in the data set V, and Ne(x_i) represents the neighbors of the node x_i. The edge weight between KNN nodes can be estimated using the Gaussian kernel weight function, defined as follows:

e_ij = exp( −‖x_i − x_j‖² / (2ξ²) ) (13)

where e_ij is the edge weight between nodes x_i and x_j, and ξ is the bandwidth of the Gaussian kernel. An example KNN graph is shown in Figure 4 [29].
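The KNN construction with Gaussian edge weights can be sketched as follows; the function and parameter names are illustrative:

```python
import numpy as np

def knn_graph(X, k=3, xi=1.0):
    """Build a KNN graph over samples X of shape (n, d): connect each node
    to its k nearest neighbours and weight edges with the Gaussian kernel
    e_ij = exp(-||x_i - x_j||^2 / (2 * xi^2)). Illustrative sketch."""
    n = len(X)
    # pairwise squared Euclidean distances ||x_i - x_j||^2
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]        # skip index 0 = the node itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * xi ** 2))
    return np.maximum(W, W.T)                     # symmetrize -> undirected graph
```

Symmetrizing the weight matrix yields the undirected graph that the GAT later consumes.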


The rolling bearing fault diagnosis data in this paper are organized with PyTorch Geometric (PyG) [38], a deep learning framework built on PyTorch, to construct undirected graphs. Firstly, nine different types of data from Case Western Reserve University are input and the number of labels is set; the time-domain vibration signal is loaded as the input, the graph model is constructed using KNN, and weights are assigned. Then, PyG is used for data encapsulation to complete the establishment of the graph model. The composition process is shown in Figure 5.
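As a small NumPy stand-in for the PyG-style encapsulation, the weighted KNN adjacency can be converted into the (2, E) COO `edge_index` plus edge-weight arrays that a `torch_geometric.data.Data(x, edge_index, edge_attr, y)` container would receive; the helper name is hypothetical:

```python
import numpy as np

def adjacency_to_coo(W):
    """Convert a weighted adjacency matrix into the (2, E) COO edge index
    and the matching edge-weight vector, mirroring the layout PyTorch
    Geometric expects (here with NumPy arrays standing in for tensors)."""
    src, dst = np.nonzero(W)             # indices of every nonzero edge
    edge_index = np.vstack([src, dst])   # shape (2, E), one column per edge
    edge_weight = W[src, dst]
    return edge_index, edge_weight
```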


Graph Attention Layer Construction Method
This section starts from the attention layer of a single graph. Firstly, the node information of the constructed graph model is input, and the input features are transformed into higher-order features via a linear transformation W. Attention weights are then allocated to each node through self-attention: a shared attention mechanism a : R^{F′} × R^{F′} → R is used to calculate the attention coefficient

e_ij = a(W h_i, W h_j) (14)

that is, the importance of the features of node j to node i.
The attention calculation above can be carried out between any two nodes in the graph, but attending over all nodes discards the structural information of the graph. Therefore, this section only calculates the correlation between node i and the nodes j ∈ N_i in its neighborhood, where N_i is the neighborhood of node i, and then normalizes over all choices of j with the softmax function:

α_ij = softmax_j(e_ij) = exp(e_ij) / Σ_{k∈N_i} exp(e_ik) (15)
Using LeakyReLU as the activation function, the complete equation for calculating the weight coefficient is:

α_ij = exp( LeakyReLU( a^T [W h_i ∥ W h_j] ) ) / Σ_{k∈N_i} exp( LeakyReLU( a^T [W h_i ∥ W h_k] ) ) (16)

where ∥ indicates the concatenation operation, and T indicates transposition.
The normalized attention coefficients are used to compute the corresponding linear combination of features, and the final output feature vector of each node is obtained after a nonlinear activation function σ:

h_i′ = σ( Σ_{j∈N_i} α_ij W h_j ) (17)
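Equations (14)-(17) can be sketched as a single-head attention layer; ReLU is assumed for the unspecified nonlinearity σ, and self-loops are included so that every neighborhood is non-empty:

```python
import numpy as np

def gat_layer(H, A, W, a, slope=0.2):
    """Single-head graph attention layer sketch: e_ij = LeakyReLU(a^T [Wh_i || Wh_j]),
    alpha_ij = softmax over the neighbourhood N_i (self-loop included),
    h_i' = ReLU(sum_j alpha_ij W h_j). H is (N, F) node features, A the
    binary adjacency, W (F, F') and a (2F',) the learned parameters."""
    Wh = H @ W                                   # linear map to F' features
    N = len(H)
    e = np.full((N, N), -np.inf)                 # -inf masks non-neighbours in softmax
    for i in range(N):
        for j in range(N):
            if A[i, j] or i == j:
                z = a @ np.concatenate([Wh[i], Wh[j]])
                e[i, j] = z if z > 0 else slope * z   # LeakyReLU
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)        # softmax over each neighbourhood
    return np.maximum(att @ Wh, 0.0)             # h' = ReLU(alpha @ Wh)
```

With zero attention parameters, every neighbor gets the same weight and the layer reduces to neighborhood averaging, which makes the mechanism easy to check by hand.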
In addition, in this section, weights are mainly assigned through a multi-head attention mechanism, which is detailed in Figure 6. The outputs of K independent attention mechanisms are stitched together as follows:

h_i′ = ∥_{k=1}^{K} σ( Σ_{j∈N_i} α_ij^k W^k h_j ) (18)

where ∥ represents the concatenation operation, α_ij^k represents the normalized attention coefficient calculated by the k-th attention mechanism, and W^k is the weight matrix of the corresponding input linear transformation. It is worth noting that, in this setup, the final output h′ will consist of KF′ features for each node, not just F′. On the final layer, the heads' outputs are instead summed and averaged to produce h_i′ [39]. The specific flow is shown in Figure 7 (diagram of the GAT model).


Formatting of Mathematical Components
In this section, the graph attention neural network model is constructed, and some notations used are introduced. The graph is represented as G = {V, E}. The graph adjacency matrix is represented as A. The node feature matrix of the graph is represented as F ∈ R N×d . N is used to represent the number of nodes in the graph, and d is used to represent the dimension of the node features. The features of a node in the graph are each a row of F.
The basic framework of the graph attention neural network is the combination of graph filtering layers and nonlinear activation layers. Figure 8 shows two graph filtering layers and activation layers. The output of the i-th graph filter layer is denoted as F^(i). In particular, F^(0) is initialized to the node feature matrix F. The output dimension of the i-th graph filter layer is denoted as d_i. Since the structure of the graph does not change, it follows that F^(i) ∈ R^{N×d_i}. The i-th graph filter layer can be described as follows:

F^(i) = f_i( α_{i−1}( F^(i−1) ), A )

where α_{i−1}(·) denotes the activation function applied element-by-element after the (i − 1)-th graph filter layer. It is worth noting that α_0 denotes the identity function, as in practice the input features are not usually activated.

Depending on the specific downstream task, the final output F^(L) can be used as the input of a particular layer, and the downstream task drives the parameter learning. The GAT model in this paper takes the entire graph as input to generate node representations, which are then used to train a node classifier. Specifically, let GAT_node() represent a GAT model with multiple stacked graph filter layers. The function GAT_node() takes the graph structure and node features as inputs and outputs the learned node features:

F_out = GAT_node(A, F; Θ_1) (24)

where Θ_1 represents the model parameters, A ∈ R^{N×N} is the adjacency matrix, F ∈ R^{N×d_in} represents the input node features, and F_out ∈ R^{N×d_out} indicates the output node features. The output node features are then used for node classification, as shown below:

Z = softmax(F_out Θ_2) (25)

where Z ∈ R^{N×C} represents the output node category probability matrix, and Θ_2 ∈ R^{d_out×C} is the parameter matrix that converts the features F_out into a dimension equal to the number of classes C.
The i-th line of Z represents the predicted category distribution of the node, and the predicted label is usually the label with the highest probability. The whole process can be summarized as follows:

Z = f_GAT(A, F; Θ) (26)

where the function f_GAT() contains the processes of Equations (24) and (25), and Θ contains Θ_1 and Θ_2. The parameter Θ can be learned by minimizing the following objective function:

L(Θ) = Σ_{v_i ∈ V_l} l( f_GAT(A, F; Θ)_i , y_i ) (27)

where f_GAT(A, F; Θ)_i represents the i-th line of the matrix, that is, the category probability distribution of node v_i; y_i represents its corresponding label; V_l denotes the set of labeled nodes; and l(·, ·) represents some kind of loss function, such as the cross-entropy loss. Figure 9 shows the fault diagnosis flow chart of the GAT.
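The classification head and the training objective can be sketched as follows; the cross-entropy choice and the small log-regularization constant are assumptions:

```python
import numpy as np

def node_classify_loss(F_out, Theta2, y):
    """Softmax classification head and cross-entropy objective:
    Z = softmax(F_out @ Theta2) row-wise, loss = mean over labelled nodes
    of -log Z[i, y_i]. NumPy sketch of the training objective, with F_out
    standing for the GAT's learned node features."""
    logits = F_out @ Theta2                       # (N, C) class scores
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    Z = np.exp(logits)
    Z /= Z.sum(axis=1, keepdims=True)             # row-wise category probabilities
    loss = -np.mean(np.log(Z[np.arange(len(y)), y] + 1e-12))
    return Z, loss
```

In training, this loss would be back-propagated through the stacked attention layers to update Θ_1 and Θ_2, as described in Step 4 of the flow chart.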
Entropy 2023, 25, x FOR PEER REVIEW



Step 1: The signal reconstruction and decomposition methods in Part A of the figure are described in Section 2.
Step 2: The GAT model is initialized and relevant model parameters are set.
Step 3: The data after noise reduction are initialized, and the graph model is constructed through the KNN method. The constructed graph model is divided into the test, training and validation sets.
Step 4: The training set data are input into the GAT model, the model is trained, the output error is obtained through the validation set, and the error is back-propagated to update the network model parameters.

WOA-VMD
In this section, the simulation fault signal and the test bench data in our laboratory are used to verify the superiority and effectiveness of the WOA-VMD noise reduction algorithm.

Simulation Verification
Firstly, the simulation fault signal is used for verification. The equation of the simulation vibration signal is as follows: where f_1 = 80 Hz, f_2 = 200 Hz, f_3 = 300 Hz, and the number of sampling points N = 1024. The time and frequency domain diagrams of the simulation signal s_1 are shown in Figure 10, those of s_2 in Figure 11, and those of s_3 in Figure 12. Gaussian white noise n(t) at −10 dB [40] is then added to the signal Y, and the time and frequency domain diagrams after adding the noise are shown in Figure 14. It can be seen in Figure 14 that the simulated signal is more realistic after adding the noise n(t): the range of variation in the time-domain diagram increases, the differences in amplitude likewise increase, and the amplitude interleaving in the frequency domain results from the added high-intensity simulated noise.
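The simulation signals and the −10 dB noise injection can be sketched as follows; the unit amplitudes and the sampling rate are assumptions, since the excerpt does not give them:

```python
import numpy as np

def add_noise_snr(y, snr_db, seed=0):
    """Add white Gaussian noise at a given SNR in dB. At -10 dB, the noise
    power is ten times the signal power, mirroring the paper's n(t)."""
    rng = np.random.default_rng(seed)
    p_signal = np.mean(y ** 2)
    p_noise = p_signal / 10 ** (snr_db / 10)
    return y + rng.normal(0.0, np.sqrt(p_noise), len(y))

fs, N = 2048, 1024                  # assumed sampling rate; N = 1024 as in the text
t = np.arange(N) / fs
s1 = np.sin(2 * np.pi * 80 * t)     # f1 = 80 Hz
s2 = np.sin(2 * np.pi * 200 * t)    # f2 = 200 Hz
s3 = np.sin(2 * np.pi * 300 * t)    # f3 = 300 Hz
Y = s1 + s2 + s3                    # mixed simulation signal
Yn = add_noise_snr(Y, -10)          # noisy signal at -10 dB SNR
```

The noisy signal Yn is what would then be fed to the WOA-VMD model for decomposition and denoising.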
In what follows, the mixed signal with noise is input into the WOA-VMD model for decomposition, and multiple mode components and frequency-domain diagrams are obtained. The number of decomposition layers and the penalty factor in the VMD algorithm are optimized using the WOA, and the best results are obtained. The whale algorithm parameters are set as in Table 2, and the final parameters are obtained after 20 iterations. The iteration curve of the penalty factor α, shown in Figure 15a, converges after the fifth iteration, with a final value of 1652. The iteration curve of the optimal number of decomposition layers k is shown in Figure 15b; after the second iteration, the optimal solution is found to be eight layers. The iterative curve of the envelope entropy, shown in Figure 15c, converges after the thirteenth iteration, with a final envelope entropy value of 9.7633. Figure 16 shows the time and frequency diagrams of the mode components obtained from the decomposition with the optimal parameter combination.
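The WOA search over (α, k) can be sketched as below. For brevity, the demo minimizes a stand-in quadratic fitness; in the paper's method, the fitness would run the VMD with the candidate α and k and return the minimum envelope entropy of the resulting IMFs. The bounds, population size and update details here are illustrative assumptions.

```python
import numpy as np

def woa(fitness, lb, ub, n_whales=10, n_iter=20, seed=0):
    """Minimal Whale Optimization Algorithm (minimization)."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = lb.size
    pos = rng.uniform(lb, ub, size=(n_whales, dim))
    fit = np.array([fitness(p) for p in pos])
    best, best_f = pos[fit.argmin()].copy(), fit.min()
    for t in range(n_iter):
        a = 2 - 2 * t / n_iter                 # a decreases linearly from 2 to 0
        for i in range(n_whales):
            r1, r2 = rng.random(dim), rng.random(dim)
            A, C = 2 * a * r1 - a, 2 * r2
            if rng.random() < 0.5:
                if np.all(np.abs(A) < 1):      # exploitation: encircle the best whale
                    pos[i] = best - A * np.abs(C * best - pos[i])
                else:                          # exploration: move toward a random whale
                    rand = pos[rng.integers(n_whales)]
                    pos[i] = rand - A * np.abs(C * rand - pos[i])
            else:                              # bubble-net spiral update around the best
                l = rng.uniform(-1, 1, dim)
                d = np.abs(best - pos[i])
                pos[i] = d * np.exp(l) * np.cos(2 * np.pi * l) + best
            pos[i] = np.clip(pos[i], lb, ub)
            f = fitness(pos[i])
            if f < best_f:
                best, best_f = pos[i].copy(), f
    return best, best_f

# Stand-in fitness for the demo: a bowl centred on the values reported in the text.
# In the real method, fitness(p) would decompose the signal with VMD using
# alpha = p[0], k = round(p[1]) and return the minimum envelope entropy.
target = np.array([1652.0, 8.0])
best, best_f = woa(lambda p: np.sum((p - target) ** 2), lb=[100, 2], ub=[3000, 12])
```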
Then, the mode components obtained from the decomposition are selected using the Pearson correlation coefficient method, and those whose correlation with the original signal is greater than or equal to 0.6 are reconstructed. The correlation values of the calculated mode components are shown in Table 3. The Pearson correlation coefficients of IMF2, 4, 5 and 6 exceed 0.6, so IMF1, 3, 7 and 8 are discarded, and IMF2, 4, 5 and 6 are reconstructed into the noise-reduced signal. This signal is then compared with the signals obtained after noise reduction by EMD, EEMD, CEEMD and GA-VMD; the respective time and frequency domain diagrams are shown below.

As can be seen in Figures 17-21, the WOA-VMD provides a better noise reduction effect than the other four methods, with an obvious effect across the whole frequency band. As can be seen in Table 4, the root mean square error after WOA-VMD signal decomposition and reconstruction is 0.213, the signal-to-noise ratio is 6.912, and the noise reduction effect is more obvious than with the other algorithms.
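The Pearson-based component selection and the RMSE/SNR metrics of Table 4 can be sketched as follows, using a toy three-component decomposition in place of real VMD output.

```python
import numpy as np

def select_and_reconstruct(imfs, signal, threshold=0.6):
    """Keep the IMFs whose Pearson correlation with the original signal is
    >= threshold, and sum them to form the denoised (reconstructed) signal."""
    r = np.array([np.corrcoef(imf, signal)[0, 1] for imf in imfs])
    kept = np.where(r >= threshold)[0]
    return kept, np.sum(imfs[kept], axis=0)

def rmse(clean, denoised):
    """Root mean square error between a reference and the denoised signal."""
    return np.sqrt(np.mean((clean - denoised) ** 2))

def snr_db(clean, denoised):
    """Signal-to-noise ratio (dB) of the denoised signal w.r.t. a reference."""
    return 10 * np.log10(np.sum(clean ** 2) / np.sum((clean - denoised) ** 2))

# Toy demonstration: two informative components plus one noise-like component
rng = np.random.default_rng(0)
t = np.arange(1024) / 1024
c1, c2 = np.sin(2 * np.pi * 5 * t), np.sin(2 * np.pi * 20 * t)
noise = 0.3 * rng.normal(size=t.size)
imfs = np.stack([c1, c2, noise])
signal = c1 + c2 + noise
kept, denoised = select_and_reconstruct(imfs, signal)
```

Here the two sinusoidal components correlate strongly with the mixture and are kept, while the low-correlation noise component is discarded, mirroring the IMF selection described above.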


Validation of Laboratory Data
As can be seen in Figure 22, the bearing fault diagnosis test bench consists of a touch panel, a motor speed controller, a motor, a radial loading hydraulic system, an ADI150 uniaxial acceleration sensor, an axial loading hydraulic system, the main shaft, two supporting 6210 and 18,720 bearings, the ER-16K bearing to be measured and a force arm beam adjusting device. The bearing type is ER-16K, and its detailed parameters are shown in Table 5. The acceleration sensor was used to obtain the vibration acceleration information of 13 bearing fault states, including 10 single-point faults and 3 compound faults (CF). The experimental data were obtained at a sampling frequency of 25.6 kHz. A total of 10 groups were collected under each fault state, with each group comprising 32,768 sample points. The damage to the faulty bearings was man-made, as shown in Figure 23. The inner and outer ring faults were created with a laser marking machine, which cut an indentation in the ball groove of the ring; the red block represents the position of the laser groove. The rolling body fault was created by punching holes in the rolling body.
In the experiment, as shown in Table 6, three different loads were set, and different fault locations, damage degrees and experimental speeds were set under each of the three loads. In addition, there were three healthy sets of bearing data for each of the three loads.
The WOA-VMD method was used for the signal decomposition and reconstruction of the original data to verify the feasibility of the proposed method. In this experiment, the inner ring fault under a 100 N load is selected as the experimental data, and the sampling frequency is 1024 Hz. The time and frequency domain diagrams of the vibration signals are shown in Figure 24, where the presence of noise in the data can be seen. The WOA-VMD is applied for noise reduction, and the decomposed IMF components and iterative curves are shown in Figures 25 and 26. The correlation coefficient values between the IMF components after the WOA-VMD decomposition and the original signal are calculated, as shown in Table 7.
After the correlation analysis, the correlation coefficients of IMF1 and 2 were higher than 0.6, so IMF1 and 2 were selected for the reconstruction. As can be seen in Figure 27, the noise reduction effect in the high-frequency part of the signal is very obvious, with a large amount of noise having been removed. Effective information was retained, the ideal effect was achieved, and the data were prepared for subsequent fault diagnosis.

GAT
In this section, two sets of data are chosen to verify the superiority and effectiveness of the GAT fault diagnosis algorithm. The data comprise the experimental bearing data from Case Western Reserve University [41] and the test bench data from this laboratory.

Case Western Reserve University Data Verification
The bearing test platform of Case Western Reserve University [42] is shown in Figure 28. The fault sizes were set to 0.007 inches, 0.014 inches and 0.021 inches (1 inch = 2.54 cm). In addition, the rotational speeds were set to 1797 r/min, 1772 r/min, 1750 r/min and 1730 r/min. The specific fault states are shown in Table 8.

In this section, the size of the convolutional kernel and the number of convolutional layers of the GAT are discussed. A total of 60% of the overall samples was randomly selected as the training set, 20% as the validation set and 20% as the test set. The specific GAT parameter settings are shown in Table 9.

(1) The Effect of Different Convolution Kernel Sizes

The size of the convolutional kernel Θ, an important parameter of the network model, affects the accuracy of rolling bearing fault type identification. Its impact on accuracy is discussed by comparing different sizes of convolutional kernels.
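A single attention head of a GAT layer can be sketched in NumPy as below (the multi-head mechanism concatenates the outputs of several such heads). The graph, feature dimensions and LeakyReLU slope are illustrative and do not reproduce the Table 9 settings.

```python
import numpy as np

def gat_layer(h, adj, W, a, alpha=0.2):
    """One graph-attention head: attention logits e_ij are computed from the
    projected features of connected node pairs, passed through LeakyReLU,
    softmax-normalised over each node's neighbourhood, and used to weight
    the aggregation of neighbour features."""
    z = h @ W                                    # (n, f_out) projected features
    f_out = z.shape[1]
    # e_ij = LeakyReLU(a^T [z_i || z_j]) decomposes into two dot products
    e = (z @ a[:f_out])[:, None] + (z @ a[f_out:])[None, :]
    e = np.where(e > 0, e, alpha * e)            # LeakyReLU
    e = np.where(adj > 0, e, -1e9)               # mask non-edges before softmax
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)   # softmax over neighbours
    return att, att @ z                          # attention weights, new features

rng = np.random.default_rng(0)
n, f_in, f_out = 6, 8, 4
h = rng.normal(size=(n, f_in))                   # toy node features
adj = np.eye(n) + (rng.random((n, n)) < 0.4)     # random graph with self-loops
adj = ((adj + adj.T) > 0).astype(float)          # symmetrise
W = rng.normal(size=(f_in, f_out))
a = rng.normal(size=2 * f_out)
att, h_new = gat_layer(h, adj, W, a)
```

Each row of `att` sums to one over the node's neighbourhood, which is what lets the GAT weight informative neighbours more heavily than a plain GCN average.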
As shown in Table 10, when the convolution kernel size increases from Θ ∈ R^(1024×1024) to Θ ∈ R^(2048×2048), the accuracy becomes higher, and the calculation time increases accordingly. When the convolution kernel size increases from Θ ∈ R^(2048×2048) to Θ ∈ R^(4096×4096), the calculation time increases further, but overfitting occurs. Therefore, the convolutional kernel size finally selected in this paper is Θ ∈ R^(2048×2048).

(2) The Effect of Different Convolution Layers

In general, the more convolutional layers there are, the more filters are superimposed to solve the learning problem hierarchically; by deepening the layers, information can be transmitted at different levels. In this section, with the other conditions held the same, the impact of two to six convolutional layers on the diagnostic accuracy is compared, as shown in Figure 29; the loss value is shown in Figure 30. The comparison shows that an increase in the number of layers has little impact on accuracy in this experiment but increases the time needed for the iterations to converge. For the loss value, the fewer the layers, the smaller the loss value, the faster the iteration speed and the more realistic the prediction. Therefore, the number of convolutional layers used in this paper is two.
Next, the effectiveness of the proposed method is verified using t-SNE dimensionality reduction visualization, and the superiority of the GAT model is verified by comparison with several fault diagnosis models. Figure 31a shows the sample distribution of the initial data set. Before the model is trained, the label classification effect is poor: the labels of the same fault are not aggregated, and the labels of different categories are mixed. It can be seen in Figure 31b that the classification of all kinds of labels was completed. The aggregation of the same category is good, and there is no mixing of different category labels. This proves that the rolling bearing fault diagnosis method proposed in this paper is effective.

As can be seen in Figures 32 and 33, the accuracy of the MLP and Attention models reached about 80% after 100 training iterations, but neither had converged. Therefore, the graph neural network models converge faster and have higher diagnostic accuracy than the traditional neural network models. The loss curves likewise show that the loss values of the graph neural networks are lower than those of the traditional neural networks. After 100 training iterations, the accuracy of the GCN, CNN and GAT models reached about 100%, while the iteration time and loss value of the GCN model are higher than those of the CNN and GAT, and the stability of the GCN and GAT is better than that of the CNN. This indicates that the method proposed in this paper is superior, with fast iteration speed, good model stability and strong generalization ability, and can better solve the problem of rolling bearing fault diagnosis.
The comparison confusion matrices obtained by the classification of the test set are shown in Figure 34, and the accuracy of the experiment calculated from them is shown in Table 11. The confusion matrix allows the classification results of the samples to be analyzed in detail: the horizontal coordinates represent the predicted labels of the samples, and the vertical coordinates represent their actual labels. As can be seen in Figure 34, the Attention, MLP, GCN and CNN models all deviate in their diagnostic effects, while the GAT is accurate in diagnosing the various faults, and its diagnostic effect is better than that of the other four methods. It can be seen in Tables 11 and 12 that the test set accuracy of the GAT is 100%, higher than that of the MLP, Attention, GCN and CNN models, which proves that the method proposed in this paper can realize a more accurate diagnosis of the fault data.

The experimental data of the rolling bearings obtained on the laboratory test bench were decomposed and reconstructed using the WOA-VMD to eliminate the noise components and were subsequently used as input. The network model and structural parameters established above were used to divide the input data, and the comparative experimental results are shown in Figures 35 and 36. It can be seen in Figure 35 that the diagnostic accuracy of the MLP and Attention models is lower than that of the GCN, CNN and GAT models; the diagnostic accuracy with the graph structure used in this paper as input is higher, and the diagnostic accuracy of the GAT is higher than that of the GCN and CNN models. It can be seen in Figure 36 that the loss values of the GAT and CNN are lower than that of the GCN model, but the stability of the GAT is better. Therefore, the GAT has a certain stability and better accuracy. According to Table 13, the test set accuracy of the GAT is 100%, which is higher than that of the MLP, Attention, GCN and CNN models. As can be seen in Table 14, the diagnostic accuracy rate of the GAT for the various fault signals reaches 100%, while the diagnosis results of the Attention, MLP, GCN and CNN models are all biased. The diagnostic effect of the graph neural network models is thus better than that of the other models, and the GAT has better stability than the GCN. Therefore, the superiority of the GAT used in this paper is proven via the accuracy and precision rate indexes.