Gearbox Fault Diagnosis Based on Gramian Angular Field and CSKD-ResNeXt

For most rotating mechanical transmission systems, condition monitoring and fault diagnosis of the gearbox are of great significance for avoiding accidents and maintaining stable operation. To extract features more comprehensively and make better use of fault signals, so that the different operating states of a gearbox can be identified accurately, a gearbox fault diagnosis model combining the Gramian angular field (GAF) and CSKD-ResNeXt (channel shuffle and kernel decomposed ResNeXt) is proposed. The original one-dimensional vibration signal of the gearbox is converted into a two-dimensional image by the GAF transformation, and the image is used as the input of the subsequent diagnosis network. To solve the problem of channel independence and incomplete information caused by group convolution, channel shuffle is introduced so that the branches of the group convolution part can exchange information. In addition, to improve the semantic expression ability of the model, the convolutional kernel of the network backbone is split and replaced. The model is verified under different working conditions of the gearbox and compared with other methods. The experimental results show that the diagnostic accuracy of the model reaches up to 99.75%, achieving precise identification of gearbox faults.


Motivation
Rotating machinery is mainly used to drive mechanical equipment and plays a crucial role in it. Therefore, the reliability and safety requirements for rotating machinery are extremely high [1]. At the same time, with the rapid development of intelligent manufacturing, mechanical equipment is trending toward high precision and high reliability. The failure of a single component often causes a chain reaction that leads to a severe accident, which greatly increases the economic cost of equipment operation [2-4]. The gear is one of the three basic components of rotating machinery, and gear failure accounts for a large proportion of mechanical failures [5]. Therefore, it is crucial to accurately identify the gearbox state and to diagnose and predict gearbox faults [6,7].

Analysis of Related Works
At present, fault diagnosis methods are mainly divided into three types: model-based, signal-processing-based, and data-driven fault diagnosis methods. Data-driven methods can be further divided into traditional machine learning methods and deep learning methods.
Model-based fault diagnosis methods utilize the correlation between gearbox fault features and physical model, analyze its fault mechanism to build and optimize the model, and realize real-time fault diagnosis and prediction [8]. However, it is difficult to establish an accurate gearbox model in practical application, which greatly limits the application of model diagnosis methods.
Fault diagnosis methods based on signal processing determine effective diagnostic indicators by analyzing the correlation between signals and faults. Fault diagnosis is achieved by constructing fault features from the dimensional and dimensionless indicators of the signals [9,10]. However, the working conditions of the gearbox are complex and changeable, and features selected for one condition are difficult to reuse in others. Therefore, mining the common patterns of fault data from massive data is an effective means of fault diagnosis.
In recent years, with the large increase in training resources and the rapid development of computing power, data-driven fault diagnosis methods have attracted growing attention [11,12]. The development of machine learning algorithms provides a new path for gearbox fault diagnosis: signal processing techniques are used to analyze the signal and construct a feature vector that effectively expresses the fault, and a machine learning algorithm such as support vector machine (SVM) [13], k-nearest neighbors (KNN) [14], or random forest [15] is then adopted for intelligent fault diagnosis.
However, in traditional machine learning algorithms, the screening and extraction of fault features still rely on manual operation, which brings uncertainty to fault diagnosis and fails to achieve the purpose of real intelligent diagnosis. The deep learning method with a powerful feature-learning ability can realize automatic feature extraction and fault classification, so it is widely used in the fault diagnosis field [16].
The input of a diagnosis model based on deep learning takes one of two fault sample types: the one-dimensional (1D) vibration signal or the two-dimensional (2D) image [17]. The former extracts fault features directly from 1D vibration signals for diagnosis, while the latter combines signal processing technology to convert vibration signals into 2D images. Many studies have used signal preprocessing technology to improve sample quality in the conversion process. Given the abundant data and strong computing power now available in the fault diagnosis field, extracting fault features by feeding fault image samples into a deep learning model is a well-motivated choice for accurate fault identification.
The analysis of the above fault diagnosis methods is shown in Table 1.

Table 1. Analysis of fault diagnosis methods.
[20]
Signal processing-based fault diagnosis method: It does not need to rely on a large amount of data and also performs better for signals with low SNR. However, signal processing methods are localized, and different research objects usually correspond to different fault diagnosis indexes. [25]
Traditional machine learning-based fault diagnosis method: Machine learning algorithms inject intelligence into the field of fault diagnosis, but the feature extraction process and the classification task are two independent stages. How to extract the optimal features is still a problem that many researchers are paying attention to. [29]
Deep-learning-based fault diagnosis method, one-dimensional signal as input: It has low computational complexity and is suitable for real-time and low-cost applications, but the compatibility between one-dimensional signals and most network structures is poor. The internal setup of the model is the main problem to be solved to improve the applicability of one-dimensional diagnostic models.
Deep-learning-based fault diagnosis method, signal converted into a two-dimensional image as input: The model can learn the most representative fault features by combining signal preprocessing technology with algorithms that perform excellently in image recognition, but this method is restricted by the amount of data and the training cost. [37]

Based on current research, common deep learning models include long short-term memory (LSTM) [38], convolutional neural network (CNN) [39], recurrent neural network (RNN) [40], artificial neural network (ANN) [41], etc. Because of its powerful feature extraction and classification ability in the face of complex data, CNN has been widely used in the field of fault identification [42]. However, gradient vanishing/explosion occurs in some networks, such as AlexNet [43] and VGG [44], when the depth of the network is increased.
BatchNorm alleviates the gradient problem to a certain extent [45], but network degradation remains. The problem of network degradation was solved by ResNet [46], proposed by He et al. in 2015, but it increased the difficulty of network design and the cost of calculation. ResNeXt (suggesting the next dimension) [47] adopted the residual module and added the ideas of group convolution and stacking to reduce the number of hyperparameters and the calculation cost while maintaining accuracy.
ResNeXt has been applied in various recognition and classification tasks because of its strong overall performance. Gao et al. [48] used ResNeXt-50 to identify individual underwater fish. Zhang et al. [49] used ResNeXt-50 as a backbone network to detect abnormal objects in X-ray images. Wang et al. [50] identified the degree of maize disease occurrence with ResNeXt-101. Fang et al. [51] realized accurate recognition of dynamic gestures by using ResNeXt. All of the above studies exploited ResNeXt's image recognition ability and achieved good experimental results. Therefore, applying ResNeXt to gearbox fault identification and classification is a promising approach.

Contributions
Since ResNeXt has an excellent learning ability in "vision", it is adopted in this paper as a diagnosis model for gearbox faults, which helps fill the gap in its application in the field of fault diagnosis. In view of this, a diagnostic method based on GAF (Gramian angular field) [52] and CSKD-ResNeXt is proposed in this paper. Using GAF, a signal conversion method that preserves the correlation between signal and time and effectively expresses valid fault information, one-dimensional gearbox vibration data are converted into two-dimensional images as the input of ResNeXt. To enable the diagnosis model to learn the gearbox state information more comprehensively, the structure of ResNeXt is optimized to improve its ability to extract gearbox fault features. The main contributions of this paper are as follows:
i. A Gramian image is used as the sample image for model input. After comparing the performance of GADF (Gramian angular difference field) and GASF (Gramian angular summation field), the better-performing one is selected to process one-dimensional vibration signals, and the output two-dimensional sample image is used to express time-dependent signal characteristics.
ii. The 7 × 7 convolutional kernel in the backbone of the ResNeXt model is decomposed into three 3 × 3 convolutional kernels, which reduces the feature extraction ambiguity caused by a large convolutional kernel and improves the semantic capability of the model. With the GAF images of the vibration signals as input, the convolution kernels can extract more accurate and detailed feature information and improve the diagnostic accuracy.
iii. For the purpose of feature communication, channel shuffle is added to the group convolution part to break the isolation between channels and exchange data. The data flow in the model is enriched to obtain a more competitive feature-mining capability.
In addition, the process of fault identification and classification is demonstrated by using t-SNE visual dimension reduction.
The remaining sections are outlined as follows. Section 2 provides the methods of GAF and CSKD-ResNeXt. The introduction and partitioning of the dataset is illustrated in Section 3. Experimental results are shown in Section 4. Finally, concluding remarks are given in Section 5.

Methods
In this section, the preliminary knowledge of GAF and ResNeXt and their significance are first introduced, and then the problems that arise when ResNeXt is used as a fault diagnosis model are analyzed. Finally, channel shuffle and kernel decomposition are introduced to establish CSKD-ResNeXt.

The GAF
The Gramian angular field first rescales a one-dimensional time series given in the Cartesian coordinate system [53], then encodes the series in polar coordinates so that the correlation between signal and time is maintained, and finally uses trigonometric functions to generate a GAF matrix that is rendered as a two-dimensional image [54]. Suppose that the original time series has n values, $X = \{x_1, x_2, x_3, \ldots, x_n\}$, and that the sequence is normalized to $[-1, 1]$ or $[0, 1]$, denoted as $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \tilde{x}_3, \ldots, \tilde{x}_n\}$, where $\tilde{x}_i$ is a value of the normalized time series. The normalized values for the two ranges are written as $\tilde{x}_{-1}^{i}$ and $\tilde{x}_{0}^{i}$:

$$\tilde{x}_{-1}^{i} = \frac{(x_i - \max(X)) + (x_i - \min(X))}{\max(X) - \min(X)} \tag{1}$$

$$\tilde{x}_{0}^{i} = \frac{x_i - \min(X)}{\max(X) - \min(X)} \tag{2}$$

The time series is then represented in polar coordinates; in Equation (3), $\tilde{x}_i$ is mapped to the angle $\varphi_i$, and the time stamp $t_i$ is mapped to the radius $r_i$:

$$\begin{cases} \varphi_i = \arccos(\tilde{x}_i), & -1 \le \tilde{x}_i \le 1,\ \tilde{x}_i \in \tilde{X} \\ r_i = \dfrac{t_i}{N}, & t_i \in \mathbb{N} \end{cases} \tag{3}$$

where $t_i$ is the time stamp, and the interval [0, 1] is divided into N equal parts so that the span of the polar coordinate system is regularized. The encoding mapping of Equation (3) has two important properties. First, the transformation is bijective because $\cos(\varphi_i)$ is monotonically decreasing for $\varphi_i \in [0, \pi]$: a given time series has a unique representation in the polar coordinate system, and the inverse mapping is also unique. Second, the transformation preserves the time information, and the time value can be determined from the radius coordinate. The correlation between each pair of time points is defined using the trigonometric sum (GASF) or the trigonometric difference (GADF):

$$\mathrm{GASF} = [\cos(\varphi_i + \varphi_j)] = \tilde{X}^{T} \cdot \tilde{X} - \sqrt{I - \tilde{X}^{2}}^{\,T} \cdot \sqrt{I - \tilde{X}^{2}} \tag{4}$$

$$\mathrm{GADF} = [\sin(\varphi_i - \varphi_j)] = \sqrt{I - \tilde{X}^{2}}^{\,T} \cdot \tilde{X} - \tilde{X}^{T} \cdot \sqrt{I - \tilde{X}^{2}} \tag{5}$$

where $\varphi_i$ ($i = 1, 2, \cdots, n$) is the angle of the ith time point in polar coordinates, and $I$ is the unit row vector $[1, 1, \ldots, 1]$. In these formulas, the inner product is redefined as $\langle x, y \rangle = x \cdot y - \sqrt{1 - x^{2}} \cdot \sqrt{1 - y^{2}}$ for GASF and $\langle x, y \rangle = \sqrt{1 - x^{2}} \cdot y - x \cdot \sqrt{1 - y^{2}}$ for GADF, and a penalty term is added to reduce the interference of noise.

Figure 1 shows the conversion of vibration signals into images through GAF: (a) shows a time series containing 1000 vibration signal points, (b) shows the vibration signal mapped into polar coordinates through Equation (3), and (c) shows the two-dimensional image produced by the final GAF transformation. The advantage of the Gramian angular field in converting time-series data into image data is that it not only retains the complete information of the signal but also maintains the dependence of the signal on time. The strengths of ResNeXt in image classification and recognition are then exploited for state recognition.
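The encoding described above (rescaling, arccos mapping, and the pairwise trigonometric sum or difference) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `gramian_angular_field` and its `kind` argument are hypothetical, the series is rescaled to [-1, 1], and the step that brings a 10,000-point segment down to a 224 × 224 image is not covered because the text does not describe it.

```python
import math


def gramian_angular_field(series, kind="difference"):
    """Return the GASF ('summation') or GADF ('difference') matrix of a 1-D series.

    Hypothetical helper for illustration; the result is a list of rows.
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be rescaled")
    # Min-max rescaling of every sample to the interval [-1, 1].
    scaled = [((x - hi) + (x - lo)) / (hi - lo) for x in series]
    # Polar encoding: each rescaled value becomes an angle in [0, pi].
    # Clamping guards against rounding slightly outside [-1, 1].
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]
    if kind == "summation":
        return [[math.cos(a + b) for b in phi] for a in phi]
    if kind == "difference":
        return [[math.sin(a - b) for b in phi] for a in phi]
    raise ValueError("kind must be 'summation' or 'difference'")


# Example: a short synthetic signal standing in for a vibration segment.
example_signal = [math.sin(0.3 * k) for k in range(32)]
example_gadf = gramian_angular_field(example_signal, kind="difference")
example_gasf = gramian_angular_field(example_signal, kind="summation")
```

Both matrices have one row and one column per time point. The GADF diagonal is zero because sin(φ_i − φ_i) = 0, while the GASF is symmetric because cos(φ_i + φ_j) does not depend on the order of i and j.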

ResNeXt
On the basis of the residual structure, ResNeXt proposes a new dimension, cardinality, and uses group convolution [55] to replace the three-layer convolution structure of ResNet, which not only improves the accuracy of the neural network but also reduces the parameter complexity, so that ResNeXt performs better than other neural network models of the same complexity. In addition, building on the ResNet structure, ResNeXt introduces parallel topologies and increases the cardinality to 32, as shown in Figure 2. The residual part of ResNeXt is composed of group convolutions, which makes ResNeXt more accurate and more efficient than ResNet.
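The parameter saving from group convolution can be checked with simple arithmetic: with g groups, each output channel only connects to 1/g of the input channels. The sketch below is illustrative only; the helper name `conv_weight_count` is hypothetical, bias terms are ignored, and the channel width of 128 is an assumed example value, not a figure taken from the text.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights in a 2-D convolution layer, ignoring bias terms."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel sees only in_channels / groups input channels.
    return (in_channels // groups) * out_channels * kernel_size * kernel_size


# Assumed example: a 3 x 3 layer with 128 input and 128 output channels.
dense_weights = conv_weight_count(128, 128, 3)               # ordinary convolution
grouped_weights = conv_weight_count(128, 128, 3, groups=32)  # cardinality 32
reduction = dense_weights // grouped_weights
```

For this example the grouped layer has 4608 weights instead of 147456, a 32-fold reduction that matches the cardinality.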
The increase in cardinality means that the ResNeXt structure contains more parallel topologies, as expressed in Equation (6), where ω is the weight of each topology, C(x) is the output of a branch with the same topology, and n is the number of identical branches in a module. Although the introduction of cardinality improves the computational efficiency and identification accuracy of ResNeXt, it also brings the problem of channel independence. The independence of channels means that each output is derived from only a small part of the input channels. As a result, there is no information flow between channels, and the generated features lack representativeness, which weakens the generalization ability of the model.

In addition, the first convolution layer in the ResNeXt backbone is the first place where features are extracted after sample input. Whether the extraction of sample information is comprehensive and accurate has a great influence on the subsequent processing. A single large convolution kernel provides less nonlinearity than a stack of small kernels. For the same receptive field, multiple small convolution layers contain more nonlinear functions, which makes the decision function more discriminative and acts as implicit regularization. In view of the above problems, this paper makes the following improvements:

(1) Channel Shuffle
ShuffleNet [56], proposed in 2017, addressed the lack of feature-map communication between groups that is caused by channel-sparse connections such as group convolution. Unlike the dense pointwise convolution (which requires considerable complexity) adopted by Xception, MobileNet, and other networks, channel shuffle has neither an expensive calculation cost nor high complexity and can make the input and output channels fully related. Therefore, channel shuffle is used in this paper to solve the group convolution problem of ResNeXt and to support information flow between channels.
The main steps are as follows:
i. Reshape: the input layer is assumed to be divided into g groups, and the total number of channels is g × n. The channel dimension is reshaped into two dimensions (g, n), which represent the number of convolution groups and the number of channels contained in each group.
ii. Transpose: the two expanded dimensions are transposed into (n, g).
iii. Flatten: the transposed channels are flattened back into a single dimension of size g × n, which completes the channel shuffle.
After channel shuffle, the feature maps that the subsequent group convolution layer receives from the previous layer are fully mixed, so every group takes input from all of the preceding groups.
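The reshape, transpose, and flatten steps can be illustrated on a flat list of channel labels. This is a minimal framework-free sketch; the function name `channel_shuffle` and the labels are hypothetical, and in a tensor library the same three steps would be applied to the channel axis of a feature map.

```python
def channel_shuffle(channels, groups):
    """Reorder a flat list of channels via reshape -> transpose -> flatten."""
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    # Reshape: view the flat list as a (groups, per_group) grid.
    grid = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    # Transpose: swap the two axes to obtain a (per_group, groups) grid.
    transposed = [[grid[g][k] for g in range(groups)] for k in range(per_group)]
    # Flatten: read the transposed grid row by row.
    return [item for row in transposed for item in row]


# Two groups (a and b) with three channels each.
labels = ["a0", "a1", "a2", "b0", "b1", "b2"]
shuffled = channel_shuffle(labels, groups=2)
# shuffled == ['a0', 'b0', 'a1', 'b1', 'a2', 'b2']
```

After the shuffle, any contiguous block of channels handed to the next group convolution contains channels from both original groups, which is the information exchange the method relies on.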
(2) Kernel Decomposition
By observing the structure of ResNeXt, it can be seen that the first layer in the input backbone is a 7 × 7 convolutional kernel, whose receptive field is the same as that of three stacked 3 × 3 convolutional kernels. The computational cost of a convolution layer is proportional to the product of the kernel width and the kernel height, so one 7 × 7 kernel (49 weights) costs about 5.4 times as much as a single 3 × 3 kernel (9 weights) and about 1.8 times as much as three stacked 3 × 3 kernels (27 weights) when the channel widths are equal. Meanwhile, adding an activation function between the additional network layers increases the nonlinear representation ability of the network for the same receptive field. Without losing the detail captured by the convolutional layer, the number of network model parameters is reduced, and the mining depth and feature precision of the model are improved.
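The comparison between one 7 × 7 kernel and three stacked 3 × 3 kernels can be verified with a short calculation. The sketch below is an illustration under the assumption of equal channel widths in every layer; the helper names are hypothetical, and weights are counted per input-output channel pair.

```python
def stacked_receptive_field(kernel_sizes, strides=None):
    """Receptive field of a stack of convolution layers along one axis."""
    strides = strides or [1] * len(kernel_sizes)
    field, jump = 1, 1
    for kernel, stride in zip(kernel_sizes, strides):
        field += (kernel - 1) * jump  # each layer widens the field
        jump *= stride                # strides enlarge later contributions
    return field


def weights_per_channel_pair(kernel_sizes):
    """Total kernel weights per input-output channel pair across the stack."""
    return sum(kernel * kernel for kernel in kernel_sizes)


single_7x7 = [7]
three_3x3 = [3, 3, 3]
same_field = stacked_receptive_field(single_7x7) == stacked_receptive_field(three_3x3)
ratio_to_stack = weights_per_channel_pair(single_7x7) / weights_per_channel_pair(three_3x3)
ratio_to_one_3x3 = weights_per_channel_pair(single_7x7) / weights_per_channel_pair([3])
```

Both configurations cover a 7 × 7 receptive field; the single large kernel uses 49 weights against 27 for the stack (a ratio of about 1.8), and 49 against 9 for one 3 × 3 kernel (a ratio of about 5.4). If the stacked layers use wider channels than the layer they replace, the total parameter count can differ from this per-pair comparison.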

Establishing the CSKD-ResNeXt Network
In this paper, the feature maps after group convolution are first "reorganized", i.e., "uniformly shuffled", to ensure that information can flow between different groups. Second, the 7 × 7 convolution kernel in the input backbone is replaced by three 3 × 3 convolution kernels with a stride of 1 and an output channel size of 64, and batch normalization is adopted. Meanwhile, LeakyReLU is adopted as the activation function in the convolution layers to mitigate the neuron "death" and vanishing-gradient problems caused by the ReLU function. The output dimension of the fully connected layer is set to 5, corresponding to the 5 different states of the gearbox. Table 2 shows the details of the CSKD-ResNeXt network structure.

Based on the above, a gearbox fault diagnosis method based on GAF and CSKD-ResNeXt is proposed in this paper, as shown in Figure 3. The time series is converted into GASF and GADF images, which are used as the input of the subsequent convolutional network. On the basis of ResNeXt-50, the backbone convolutional kernel is split and channel shuffle is introduced to obtain CSKD-ResNeXt, which is used to extract features from the input sample images. The samples pass through the three 3 × 3 convolutional kernels and four convolutional stages composed of different numbers of blocks, which perform deep learning and feature mining on samples in different states. The final predicted fault category is output through a softmax classifier after global pooling. The performance of the model is evaluated by the accuracy and loss on the test set. A t-SNE scatter plot and a confusion matrix are used to visually display the diagnostic process and results.

Data Description
In this section, the sources, types, working conditions, and other information of the datasets are introduced; the division of training sets and test sets and the number of sample sets are also shown in detail, and the configuration of the experimental platform and the common parameters of the operating framework are explained.

Datasets
The gearbox dataset in this paper comes from the gearbox experimental setup of Southeast University in China [57], as shown in Figure 4. The dataset includes the 20 Hz-0 V and 30 Hz-2 V load conditions. The gearbox has four fault states and one health state. Each state signal includes the vibration signal of the motor, the motor torque, the vibration of the planetary gearbox in the x, y, and z directions, and the vibration of the parallel gearbox in the x, y, and z directions. The data types are shown in Table 3.

The data were cut into segments of 10,000 points, and each segment was used to generate one three-channel RGB sample image of size 224 × 224. Under the working condition of 20 Hz-0 V, each fault has eight columns of vibration data corresponding to eight positions or directions on the gearbox test stand. Each column contains 1.04 million vibration data points and generates 104 images, so each fault type has 832 images. The data for the 30 Hz-2 V condition are organized in the same way. The gearbox datasets under the two working conditions contain a total of 8320 sample images, which were divided into training and test sets in a ratio of 4:1. Each fault type was composed of 1332 training samples and 332 test samples.
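The sample counts quoted above follow directly from the segment length. The sketch below reproduces the arithmetic and shows one way to cut a signal into fixed-length segments; the helper name `split_into_segments` is hypothetical, and dropping an incomplete trailing segment is an assumption, since the text does not say how leftovers are handled.

```python
def split_into_segments(signal, segment_length=10_000):
    """Cut a 1-D signal into consecutive, non-overlapping segments.

    Assumption: an incomplete trailing segment is discarded.
    """
    count = len(signal) // segment_length
    return [signal[i * segment_length:(i + 1) * segment_length] for i in range(count)]


# Figures taken from the dataset description.
points_per_column = 1_040_000
columns_per_state = 8
gearbox_states = 5
working_conditions = 2

images_per_column = points_per_column // 10_000                     # 104
images_per_state = images_per_column * columns_per_state            # 832
total_images = images_per_state * gearbox_states * working_conditions  # 8320
```

Each image then corresponds to one 10,000-point segment of one sensor column, and the totals match the 104, 832, and 8320 figures stated in the text.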

Experimental Platform Setting
The gearbox fault diagnosis model runs on the PyTorch framework, and the experimental platform is configured as follows: 64-bit Windows 10 operating system, Intel(R) Xeon(R) Gold 6330 @ 2.00 GHz (CPU), RTX 3090 (24 GB) (GPU), and code written in the Python 3.8 environment. The adaptive moment estimation (Adam) algorithm [58] is used to update the network training parameters. The initial learning rate is 0.001. ReduceLROnPlateau is used to decay the learning rate automatically: it takes the accuracy on the test set as the monitored indicator, and its patience was set to 4 based on repeated experiments. The model loss is calculated using cross entropy, and the dropout in the model is set to 0.2.
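The idea behind a plateau-based learning rate schedule can be shown without any framework. The class below is a simplified stand-in written for illustration, not PyTorch's ReduceLROnPlateau and not the authors' code: the class name `PlateauDecay` is hypothetical, the decay factor of 0.1 is an assumption (the text does not state it), and refinements such as improvement thresholds and cooldown periods are omitted. Only the initial learning rate (0.001), the patience (4), and the use of accuracy as the monitored metric come from the text.

```python
class PlateauDecay:
    """Reduce the learning rate when a monitored accuracy stops improving."""

    def __init__(self, lr=0.001, patience=4, factor=0.1):
        self.lr = lr
        self.patience = patience
        self.factor = factor  # assumed value; not given in the text
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, accuracy):
        """Record one epoch's accuracy and return the (possibly reduced) rate."""
        if accuracy > self.best:
            self.best = accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Decay only after more than `patience` epochs without improvement.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


# Synthetic accuracies: improvement for two epochs, then a plateau.
scheduler = PlateauDecay()
accuracy_history = [0.90, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95]
lr_history = [scheduler.step(acc) for acc in accuracy_history]
```

With a patience of 4, the rate stays at 0.001 through four stagnant epochs and drops on the fifth, so only the last entry of `lr_history` is reduced.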

Analysis of Model Results
In this section, performance verification and comparison experiments are performed on the proposed model with accuracy, loss, and other indicators, including comparison between GADF and GASF, comparison between GAF and STFT and CWT, comparison between CSKD-ResNeXt and classical networks, and visualization of the fault classification identification process of key convolution layers.

Model Verification
To compare the effectiveness of the GADF and GASF methods, the images generated by GADF and GASF are each input into CSKD-ResNeXt, as shown in Table 4. In this paper, the GADF image samples are selected as the input of the subsequent models. The accuracy and loss on the test set of the proposed fault diagnosis model under the two working conditions are shown in Figure 5. It can be seen that in the early stage of training the model has not yet converged and the accuracy fluctuates. After about 40 epochs, the accuracy fluctuates between 95% and 100%; after 60 epochs, the accuracy under the two conditions stabilizes and converges to 99.75% and 99.27%, respectively, while the loss gradually approaches zero. The accuracy and loss trends under the two conditions are similar, which suggests that the model has a certain generalization ability. Under the working condition of 20 Hz-0 V, 828 of the 830 gearbox image samples were correctly classified; one miss sample and one root sample were each misjudged as normal. Under the working condition of 30 Hz-2 V, a total of six samples were misjudged, among which miss had the largest number of misjudged samples, four of which were misjudged as root. In addition, all health and surface samples were judged correctly, and the two confusion matrices are shown in Figure 6. In general, CSKD-ResNeXt largely avoids state confusion and identifies the different faults well.
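As a quick consistency check, the accuracies can be recomputed from the sample counts given in the text (830 test samples per condition, with 2 and 6 misjudged samples). The helper names below are hypothetical, and truncating to two decimals is an assumption about how the reported percentages were formatted.

```python
import math


def accuracy(correct, total):
    """Fraction of correctly classified samples."""
    return correct / total


def truncate_percent(fraction, decimals=2):
    """Express a fraction as a percentage truncated (not rounded) to `decimals`."""
    scale = 10 ** decimals
    return math.floor(fraction * 100 * scale) / scale


acc_20hz = accuracy(828, 830)      # 2 misjudged samples
acc_30hz = accuracy(830 - 6, 830)  # 6 misjudged samples
reported_20hz = truncate_percent(acc_20hz)
reported_30hz = truncate_percent(acc_30hz)
```

The counts give about 99.759% and 99.277%, which are consistent with the reported 99.75% and 99.27% when the percentages are truncated to two decimals.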
To demonstrate the advantage of GAF, the following fault diagnosis methods are selected for comparison with the method in this paper.
(1) STFT + CSKD-ResNeXt: a one-dimensional time series is converted into a two-dimensional time-frequency image by STFT, and the two-dimensional image is then used as the input of CSKD-ResNeXt.
(2) CWT + CSKD-ResNeXt: a one-dimensional time series is processed into a two-dimensional time-frequency image by continuous wavelet transform, and the time-frequency image is then used as the input of CSKD-ResNeXt.
The fault diagnosis results of the different methods are shown in Figure 7. The method in this paper (GAF + CSKD-ResNeXt) has the best performance, while the accuracy of the STFT + CSKD-ResNeXt method is the lowest, at 94.06% and 95.85% under the two working conditions. The accuracy of the CWT + CSKD-ResNeXt method reaches 96.75% and 97.69%. Processing the time-series signals with GAF therefore yields higher accuracy, which further indicates that the two-dimensional image transformed by GAF better retains the relevant information of the original time-series data.

To show intuitively the influence of channel shuffle and convolution kernel splitting in CSKD-ResNeXt, ablation experiments were set up to show and compare the degree of improvement brought by each modified part, as shown in Table 5. ResNeXt represents the network without channel shuffle and convolutional kernel splitting; 7 × 7-ResNeXt represents the network with convolutional kernel splitting but without channel shuffle; CSKD-ResNeXt represents the network with both operations, namely, the network proposed in this paper. Other settings are kept consistent, such as LeakyReLU, the initial learning rate, ReduceLROnPlateau (including patience), and Adam. Convolution kernel splitting improves the comprehensiveness and fineness of feature mining, and channel shuffle makes up for the isolated information channels of group convolution. Together, they improve the breadth and depth of feature mining in CSKD-ResNeXt, so that fault features are extracted more fully.

t-SNE Visualization
t-SNE [59] is used to map the outputs of representative stages of the model to a low-dimensional space for visualization. Five colors represent the five gearbox states. The visualized result of the dimensionality reduction is shown in Figure 8. As the layers deepen, data points in the same state gradually gather, while data points in different states gradually separate.

Comparison with Classical Models
To further verify the recognition performance of the proposed GAF + CSKD-ResNeXt model against other fault diagnosis models, the classical convolutional neural networks AlexNet, ResNet50, and DenseNet were selected for comparison experiments. The softmax classifier was used for all comparison models. The comparison results are shown in Figure 9. The accuracy of the GAF + CSKD-ResNeXt model is higher than that of the other models under both working conditions, and its convergence and stability are also better.

Conclusions
In this paper, a gearbox fault diagnosis method combining GAF and an improved ResNeXt is proposed. The performance of the proposed method is verified under two operating conditions of the gearbox, and the accuracy of fault identification reaches up to 99.75%. The comparison between GAF and the time-frequency conversion methods shows that GAF is better able to express the features of different states. After comparing GASF and GADF, GADF is selected to generate the two-dimensional images according to its accuracy and loss performance. The ablation experiment shows that the modified ResNeXt model promotes information exchange in the network and improves the feature-mining ability. The comparison between the proposed model and other classical network models shows that the GAF-CSKD-ResNeXt method has higher recognition accuracy and faster convergence and can effectively classify gearbox faults. Our future work will further improve the feature expression ability of vibration signals, reduce the workload of the feature selection process, and pay more attention to the interpretability of the feature selection process in order to build a model with stronger generalization ability, higher stability, and better interpretability. Further consideration will also be given to gearbox fault diagnosis when load, voltage, speed, and other conditions change simultaneously.

Conflicts of Interest:
The authors declare no conflict of interest.