Bearing fault diagnosis with parallel CNN and LSTM

Intelligent diagnosis of bearing faults is fundamental to machinery automation and intelligent operation. Deep learning-based analysis of bearing vibration data has become a mainstream research direction in fault diagnosis. To enhance the quality of feature extraction from bearing vibration signals and the robustness of the model, we construct a fault diagnostic model based on a parallel network of a convolutional neural network (CNN) and long short-term memory (LSTM), which extracts temporal and spatial features from two perspectives. First, via resampling, the vibration signal is split into equal-sized slices, which are then converted into time-frequency images by continuous wavelet transform (CWT). Second, LSTM extracts the time-correlation features of the 1D signals as one path, and 2D-CNN extracts the local frequency distribution features of the time-frequency images as another path. Third, 1D-CNN further extracts integrated features from the fused features yielded by the two parallel paths. Finally, the fault categories are obtained through the softmax function. According to experimental results, the proposed model achieves satisfactory diagnostic accuracy and robustness in different contexts on two different datasets.


Introduction
Machinery operation and maintenance are becoming increasingly automated and intelligent, which poses challenges for fault diagnosis. As key components of rotating machinery, bearings are prone to failure because their operation tends to be affected by severe working conditions [1]. Once a bearing fails, it can cause serious system collapse and equipment damage, and even great economic losses and casualties. Therefore, effective fault diagnosis and operating state analysis of bearings can help to detect mechanical equipment failures in time, which is the prerequisite for reliable and stable operation of mechanical systems. Bearing faults are reflected in their vibration signals, which are time series of periodic oscillations and contain various information about the operational state [2]. Under different working conditions, the mapping between the vibration signal of a bearing and its fault state becomes more complicated, which increases the difficulty of fault identification [3]. Over the last decades, research on bearing fault diagnosis based on vibration signal analysis has mainly gone through three phases: (i) fault diagnosis based on signal modal analysis; (ii) fault diagnosis based on traditional shallow machine learning; (iii) fault diagnosis based on deep learning.
Research on fault diagnosis based on signal modal analysis has gradually developed into a mature methodology. Time-frequency methods are widely used for pre-processing bearing vibration signals, including empirical mode decomposition (EMD) [4], variational mode decomposition (VMD) [5], short-time Fourier transform (STFT) [6], and wavelet transform (WT) [7]. STFT can extract the frequency-domain components of the signal within a time window, but its fixed window length limits its resolution for time-varying signals. Building on STFT, WT can analyze signals at multiple scales, capturing details and features in different frequency ranges, and has been used extensively in signal analysis [8,9].
The development of machine learning pushes fault diagnosis toward automation and intelligence. Traditional shallow machine learning methods in fault diagnosis are generally applied to signal processing, feature extraction, and final fault classification by a classifier [10][11][12][13].
Commonly used methods in fault diagnosis include support vector machines (SVM) [14], Bayesian classifiers [15], and artificial neural networks (ANN) [16]. Li et al. [17] used ensemble empirical mode decomposition (EEMD) to decompose the signal and extracted the fault feature frequency from the optimized signal by power spectrum to identify the fault. ANN can utilize the features extracted from envelope analysis to realize cross-domain fault diagnosis [18]. Although these traditional machine learning methods have greatly improved fault diagnosis, their diagnostic accuracy depends on manual feature selection and extraction. Thus, some scholars put forward methods for adaptive feature extraction. Li et al. [16] decomposed the signal into sub-signals of multiple scales, extracted their local features with a back-propagation neural network (BPNN), and identified fault features by SVM. However, the diagnostic performance of traditional shallow machine learning models (e.g., SVM, ANN) is limited by the quality of the extracted fault features. Moreover, real-time sensor data are so diverse and fast-growing that these methods cannot fully utilize all the information, and their accuracy drops as the quantity of data increases.
Deep learning networks (e.g., CNN, RNN, DBN, GAN, AE) [19][20][21][22][23], as an extension of traditional machine learning based on neural networks, have shown significant superiority in image classification, natural language processing, and target detection [24][25][26]. They have excellent feature representation capability: by constructing a deep network, they can extract deep features of the data and learn the complex nonlinear mapping between the pre-processed data and the fault states. In fault diagnosis, CNN, as a typical deep learning network, has achieved excellent results [27][28][29][30][31]. Many stacked CNNs have been proposed that directly extract information from the time-domain vibration signal without prior expert knowledge and achieve good performance under variable loads [3,32]. Ye et al. [33] used adaptive variational mode decomposition to preprocess the signal and obtain the desired components, and then used a modified 1D-CNN to extract features and identify faults. Vibration signals of faults are time series and contain temporal features. Although CNN can extract their spatial features, it cannot fully mine the fault features from the perspective of local space. RNN, by contrast, handles time series data effectively and is better suited to vibration signals with time-series correlation [34,35]. An et al. [36] proposed a diagnostic method based on periodic sparse attention and LSTM, where LSTM is used to extract time-correlated features of the signal. Some scholars also proposed a combination of 2D-CNN and LSTM for fault diagnosis [37].
Although neural network-based diagnosis models already achieve high diagnostic accuracy, some fail to fully consider the characteristics of vibration signals and are unable to fully explore the time-domain and frequency-domain features. These inadequacies are reflected in the following two aspects: (i) Vibration signals are periodic, so analyzing them merely in the time domain or merely in the frequency domain cannot adequately extract their fault features. (ii) The combination of CNN and LSTM for fault diagnosis is often used in a single-path, series way. In the series arrangement, the output of one network is taken as the input of the following network, so the features extracted by the latter are inevitably influenced by the former, which negatively affects the feature extraction and fault classification performance of the model. Moreover, as deep networks grow deeper, these diagnostic models tend to overfit [38,39], and extra optimization is needed to solve this problem [33,40].
To avoid over-stacked and dilated CNNs, the attention mechanism has been introduced to dynamically focus on different parts of the input, especially when sequential data or images undergo multi-target, multi-stage processing [41,42]. Wang et al. [43] embedded the attention mechanism into a neural network to achieve high diagnostic performance even when noise is added to the signal. Some scholars suggested converting the original signal into two-dimensional (2D) images to facilitate feature extraction via CNN [37,44,45]. Wen et al. [46] designed an improved LeNet-5 network with high diagnostic accuracy, which converts the original 1D signal into 2D images and then directly extracts features from the latter.
In sum, although existing research provides many intelligent solutions for fault diagnosis, shortcomings remain, especially in the following aspects:
1) Most fault diagnosis models based on traditional shallow machine learning rely on manually extracted features and, due to their structural limitations, cannot be applied to equipment fault diagnosis in the era of big data.
2) Using only time-series data or only frequency-domain data as model input leads to insufficient extraction of spatiotemporal fault features.
3) Existing fault diagnosis models combining CNN and LSTM use the output of one network as the input of the following network. In such a cascade, the features extracted by CNN negatively affect those extracted by LSTM, and CNN cannot correct the deviation resulting from the output of LSTM.
In view of the limitations above, to adequately extract the fault features and improve performance while avoiding overfitting, this paper constructs a shallow neural network with parallel feature extraction from the time and frequency domains, which can extract local frequency distribution features and time-correlated features from the vibration signal. The shallow structure has fewer network parameters, so it stabilizes quickly and converges effectively. The major contributions of this paper are summarized as follows.
1) A dual-input fault diagnosis model with a shallow structure is built based on CNN and LSTM, taking vibration signals and time-frequency images as simultaneous inputs. The model extracts time-correlated and frequency distribution features of the signals in parallel from the time and frequency domains to improve the quality of the features.
2) The features obtained from the parallel network are fused to form more comprehensive features that further enhance the representation ability of the model, thus improving its robustness and generalization.
3) Experiments demonstrate that the proposed model provides excellent robustness and generalizability.
The rest of this paper is organized as follows: Section 2 briefly summarizes the related theoretical methods; Section 3 introduces the diagnostic process and model framework; Section 4 describes the experimental results; and conclusions are drawn in Section 5.

CNN
A convolutional neural network (CNN) is a feed-forward neural network with the distinguishing features of weight sharing, local connectivity, and pooling. The core idea of CNN is to extract features through convolutional and pooling layers and then feed them into a fully connected layer for classification or regression tasks. A typical CNN usually includes input, convolutional, pooling, fully connected, and output layers.

Convolutional layers
The convolutional layer extracts features by convolving the input with kernels. The kernel weights are updated by learning, and kernels of different sizes yield different outputs. The computation of the convolutional layer is as follows:

$$x_j^{l} = f\left(\sum_{i} x_i^{l-1} * w_{ij}^{l} + b_j^{l}\right) \tag{1}$$

where $x_j^{l}$ denotes the j-th feature map of the l-th convolutional layer (so $x_i^{l-1}$ is the i-th feature map of the previous layer), $*$ represents the convolution operation, $w_{ij}^{l}$ and $b_j^{l}$ are the weights and bias of the j-th convolutional kernel in the l-th convolutional layer, and $f$ is the activation function, commonly the ReLU function.

Pooling layers
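The pooling layer reduces the size of the feature map, extracts salient features, and reduces the sensitivity of the model to the input data. Maximum pooling is applied in this paper, whose operation can be expressed as follows:

$$p_j^{l}(k) = \max_{(k-1)W < t \le kW} x_j^{l}(t) \tag{2}$$

where $p_j^{l}(k)$ is the k-th output of the pooled j-th feature map in the l-th layer, $x_j^{l}(t)$ is the t-th element of that feature map, and $W$ is the width of the pooling window.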
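As a rough illustration of the convolution, activation and max pooling operations described above, the following pure-Python sketch applies a single 1D kernel and then pools the result. It is not the authors' implementation: the function names, the toy signal and the fixed (unlearned) kernel are assumptions chosen for readability.

```python
def conv1d_relu(signal, kernel, bias=0.0):
    """Valid 1D sliding dot product (the 'convolution' used in CNNs) plus ReLU."""
    k = len(kernel)
    out = []
    for start in range(len(signal) - k + 1):
        window = signal[start:start + k]
        z = sum(x * w for x, w in zip(window, kernel)) + bias
        out.append(max(0.0, z))  # ReLU activation
    return out


def max_pool1d(feature_map, width):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    n_windows = len(feature_map) // width
    return [max(feature_map[i * width:(i + 1) * width]) for i in range(n_windows)]


signal = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 4.0, 1.0]
kernel = [1.0, -1.0]  # hypothetical fixed kernel, not a learned one
features = conv1d_relu(signal, kernel)
pooled = max_pool1d(features, width=2)
print(features)  # [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 3.0]
print(pooled)    # [0.0, 2.0, 1.0]
```

Pooling halves the length of the feature map here while keeping the strongest response in each window, which is the size reduction and salient-feature selection mentioned above.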

Fully connected layers
The fully connected layer integrates and transforms the features extracted by the previous layers to improve the expressive ability and performance of the network. The classification results are obtained by the softmax function.
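For reference, the softmax function turns the outputs (logits) of the last fully connected layer into class probabilities. The sketch below is a minimal standard-library version; the example logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical outputs of the last dense layer
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely class
```

Subtracting the maximum logit does not change the result but avoids overflow when the logits are large.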

LSTM
The long short-term memory network (LSTM) introduces memory blocks and gate mechanisms to improve feature extraction from the input. The gate units in LSTM control and regulate the flow of information so as to selectively memorize and forget information in the sequence, which addresses the gradient vanishing and explosion issues of traditional RNNs. Figure 1 illustrates a common schematic of an LSTM unit.

The forget gate determines which information is forgotten from the memory state of the previous time step and is implemented by a sigmoid function. It is calculated as follows:

$$f_t = \sigma\left(w_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{3}$$

The input gate controls the extent to which new input information updates the memory state of the current time step, and it is calculated as follows:

$$i_t = \sigma\left(w_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{4}$$

$$\tilde{c}_t = \tanh\left(w_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{5}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{6}$$

The output gate is a gating unit that controls the flow of information from the current state to the output layer. Selectively outputting information in the hidden state at the current time step passes useful long-term dependencies to the final layer. Its calculation is as follows:

$$o_t = \sigma\left(w_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{7}$$

$$h_t = o_t \odot \tanh(c_t) \tag{8}$$

where $\sigma$ is the sigmoid function, $w$ is a weight matrix, $x_t$ is the input of the current time step, $h_{t-1}$ is the short-term state of the last cell, $b$ is the bias vector, $c_t$ and $\tilde{c}_t$ are the memory state and its candidate, and $\odot$ denotes element-wise multiplication.
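The gate computations described above can be sketched as one time step of a standard LSTM cell. This is a textbook formulation written with plain Python lists, not the authors' code; the parameter dictionary `params` and its key names are hypothetical, and the all-zero weights in the example are chosen only so that the result is easy to check by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(w, v, b):
    """Compute w @ v + b for a list-of-rows weight matrix w."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state h and memory state c."""
    v = h_prev + x  # list concatenation, i.e. [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(params["w_f"], v, params["b_f"])]    # forget gate
    i = [sigmoid(z) for z in affine(params["w_i"], v, params["b_i"])]    # input gate
    g = [math.tanh(z) for z in affine(params["w_c"], v, params["b_c"])]  # candidate state
    o = [sigmoid(z) for z in affine(params["w_o"], v, params["b_o"])]    # output gate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


hidden, inputs = 2, 1
zeros_w = [[0.0] * (hidden + inputs) for _ in range(hidden)]
zeros_b = [0.0] * hidden
params = {name: zeros_w for name in ("w_f", "w_i", "w_c", "w_o")}
params.update({name: zeros_b for name in ("b_f", "b_i", "b_c", "b_o")})

# With zero weights every gate equals 0.5 and the candidate is 0,
# so the memory state is simply halved.
h, c = lstm_step([0.3], [0.0, 0.0], [1.0, -2.0], params)
```

In the zero-weight example the forget gate keeps half of the previous memory and the input gate adds nothing, which makes the role of each gate visible in isolation.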

CWT
Continuous wavelet transform (CWT) is a technique for analyzing the time-frequency features of non-stationary signals, which can capture transient changes and local structure in the signal. It provides a time window that changes with frequency and performs adaptive multiscale analysis of signals through scaling and translation. Its mathematical model is formulated as follows:

$$W_x(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} x(t)\, \overline{\psi}\left(\frac{t - b}{a}\right) dt \tag{9}$$

where $a$ is the scale parameter, $b$ is the translation value, $\psi(t)$ is the mother wavelet function (the overline denotes complex conjugation), and $x(t)$ is the input signal.
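To make the transform concrete, the sketch below evaluates the CWT by direct summation for a few scales and returns the magnitudes as the rows of a small time-frequency image. The choice of a complex Morlet mother wavelet with center parameter `w0 = 6`, the toy sampling rate and the test tone are all assumptions for illustration; the paper does not state which wavelet was used. In practice a library such as PyWavelets would be far faster than this quadratic-time loop.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet (assumed choice; w0 = 6 is a common default)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def cwt(signal, scales, dt, w0=6.0):
    """Return |W(a, b)| with one row per scale a and one column per shift b."""
    image = []
    for a in scales:
        row = []
        for b in range(len(signal)):
            acc = 0j
            for k, x in enumerate(signal):
                acc += x * morlet((k - b) * dt / a, w0).conjugate()
            row.append(abs(acc) * dt / math.sqrt(a))
        image.append(row)
    return image


fs = 128.0                      # hypothetical sampling rate in Hz
f0 = 16.0                       # frequency of the toy test tone in Hz
n = 128
signal = [math.sin(2 * math.pi * f0 * k / fs) for k in range(n)]

freqs = [4.0, 8.0, 16.0, 32.0]  # analysis frequencies
# For the Morlet wavelet, scale and frequency are related by roughly a = w0 / (2*pi*f).
scales = [6.0 / (2 * math.pi * f) for f in freqs]
image = cwt(signal, scales, dt=1.0 / fs)

mid = n // 2
energies = [row[mid] for row in image]
best = freqs[energies.index(max(energies))]  # frequency with the strongest response
```

For the pure 16 Hz tone, the row whose scale corresponds to 16 Hz carries by far the largest magnitude, which is the behavior a time-frequency image relies on to localize fault-related frequency content.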

Resampling
Deep learning requires a large number of samples for model training. Models tend to generalize poorly when the training data are insufficient. Data augmentation can expand the sample size, which lays the foundation for good model performance. The data augmentation method employed in this paper is resampling. It is performed by sliding a window of width w along the time axis with step size s. Resampling can be expressed as Eq (10).

$$r = \frac{w - s}{w} \tag{10}$$

where w is the window width, r is the rate of repeated samples, and s is the step size of the sliding window. The process is shown in Figure 2.
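A sliding-window resampler of this kind can be sketched in a few lines. The function names are hypothetical, and the repetition rate is computed under the assumption that it is the overlapping fraction of two consecutive windows, (w - s) / w.

```python
def sliding_windows(signal, width, step):
    """Cut `signal` into slices of `width` points, advancing by `step` points."""
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    return [signal[start:start + width]
            for start in range(0, len(signal) - width + 1, step)]


def overlap_rate(width, step):
    """Fraction of each window shared with the next one (assumed definition of r)."""
    return (width - step) / width


record = list(range(10))  # toy stand-in for a vibration record
slices = sliding_windows(record, width=4, step=2)
# slices -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
rate = overlap_rate(4, 2)  # 0.5
```

A smaller step yields more, more strongly overlapping samples from the same record; a record of L points gives floor((L - w) / s) + 1 slices.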

SNR
In signal processing, the signal-to-noise ratio (SNR) refers to the ratio of signal power to noise power. It generally indicates how much the noise affects the desired signal. Its calculation formula is as follows:

$$\mathrm{SNR_{dB}} = 10 \log_{10}\left(\frac{P_{signal}}{P_{noise}}\right) = 20 \log_{10}\left(\frac{A_{signal}}{A_{noise}}\right) \tag{11}$$

where $A_{signal}$ is the signal amplitude, $A_{noise}$ is the noise amplitude, $P_{signal}$ denotes the signal power, and $P_{noise}$ denotes the noise power. According to Eq (11), the smaller the SNR value, the stronger the noise compared with the signal, and the poorer the signal quality.
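The definition above also shows how noise of a prescribed SNR can be generated: the required noise power is the signal power divided by 10^(SNR/10). The sketch below is one possible way to do this with the standard library; the helper names and the toy sine signal are assumptions. It rescales a Gaussian noise draw so that its empirical power matches the target exactly.

```python
import math
import random


def power(x):
    """Mean squared value of a sequence."""
    return sum(v * v for v in x) / len(x)


def add_noise(signal, target_snr_db, seed=0):
    """Return (noisy signal, noise), with Gaussian noise scaled to the target SNR in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    target_noise_power = power(signal) / (10 ** (target_snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    noise = [scale * n for n in noise]
    return [s + n for s, n in zip(signal, noise)], noise


def measure_snr_db(signal, noise):
    return 10 * math.log10(power(signal) / power(noise))


clean = [math.sin(2 * math.pi * 5 * k / 1000) for k in range(1000)]  # toy signal
noisy, noise = add_noise(clean, target_snr_db=-6)
measured = measure_snr_db(clean, noise)  # approximately -6 dB by construction
```

At -6 dB the noise power is about four times the signal power, which illustrates why low SNR values make fault features hard to distinguish.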

Framework of the diagnostic model
The proposed diagnostic model can extract frequency distribution and time-correlated features in parallel. In addition, the gate structure in LSTM further enhances the generalization performance of the model. The framework of the proposed diagnostic model is shown in Figure 4. It mainly contains three submodules: data pre-processing, feature extraction, and fault classification. These three stages are further divided into five steps.
Step 1: Data pre-processing. The dual-path feature extraction network consists of two different networks, which extract the time-series features and the spatial features of the signal, respectively. In data pre-processing, the raw vibration signals are sliced into short samples. Then, these samples are converted to time-frequency images through CWT.
Step 3: Feature fusion. The extracted temporal and spatial features are passed to the fusion layer, whose output represents the global features of the signal more comprehensively. The fusion process can be expressed as follows:

$$x = \mathrm{concatenate}(\rho_1, \rho_2) \tag{12}$$

where $\rho_1$ and $\rho_2$ denote the feature vectors extracted by LSTM and 2D-CNN, respectively, $x$ denotes the fused feature vector, and concatenate(·) denotes the concatenation of 1D vectors.
Step 4: Comprehensive feature extraction. A batch normalization (BN) layer is embedded; BN can be applied to any neural network layer and reduces the sample differences between layers by reducing the internal covariate shift [47]. The fused features are then input to 1D-CNN for integrated feature extraction.
Step 5: Fault identification. The fault type is obtained by calculating the class probabilities through the softmax function.
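The fusion and normalization steps above can be illustrated on a toy batch. The sketch is a simplification under stated assumptions: the feature values are invented, the helper names are hypothetical, and the batch normalization omits the learned scale and shift parameters as well as the running statistics that a real BN layer maintains.

```python
import math


def concatenate(*vectors):
    """Join 1D feature vectors end to end."""
    fused = []
    for v in vectors:
        fused.extend(v)
    return fused


def batch_norm(batch, eps=1e-5):
    """Normalize every feature to zero mean and unit variance over the batch."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
            for row in batch]


# Hypothetical per-sample outputs of the two paths for a batch of two samples.
lstm_features = [[0.2, 0.8], [0.4, 0.6]]
cnn_features = [[1.0, 3.0, 5.0], [3.0, 1.0, 5.0]]

fused = [concatenate(a, b) for a, b in zip(lstm_features, cnn_features)]
normalized = batch_norm(fused)
```

After fusion each sample carries both paths' features in one vector, and normalization puts them on a common scale before the subsequent 1D convolution.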

Model parameters
In the experiment, the feature extraction network consists of a two-layer LSTM, a two-layer 2D-CNN, and a one-layer 1D convolutional block. The fault classification part consists of a flatten layer and two dense layers. The dropout and mini-batch methods are applied to reduce overfitting as much as possible. The dropout rate is 0.5, the batch size is 128, and the number of epochs is 100. Model parameters are updated through the Adam optimizer. The main parameters of the model are presented in Table 1.
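In fault diagnosis, accuracy, precision, recall, and F1-score can be taken as model evaluation standards:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{13}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{14}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{15}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{16}$$

where TP, TN, FP and FN refer to true positives, true negatives, false positives and false negatives, respectively.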
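For a multi-class problem these quantities are usually read off a confusion matrix, treating one class at a time as the positive class. The sketch below shows that computation; the 3x3 matrix contains invented counts purely for illustration and does not reproduce any result of the paper.

```python
def class_metrics(confusion, k):
    """Precision, recall and F1 of class k; rows are true labels, columns predictions."""
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(row[k] for row in confusion) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(confusion):
    """Share of all samples that lie on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical counts for three classes with 100 test samples each.
confusion = [
    [90, 10, 0],
    [5, 95, 0],
    [0, 0, 100],
]
acc = accuracy(confusion)                    # 285 / 300 = 0.95
p0, r0, f0 = class_metrics(confusion, k=0)   # 90/95, 90/100, 180/195
```

Reporting precision and recall per class, as done for the 1 HP load, reveals which fault types are confused with each other even when the overall accuracy is high.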

Experiments and result analysis
In this paper, two open access datasets, Case Western Reserve University (CWRU) and Jiangnan University (JNU) bearing datasets, are used to evaluate the effectiveness of the method.

CWRU dataset experiment
The CWRU dataset, a standard bearing dataset, was collected from the Simulated Bearing Experimental Platform at Case Western Reserve University. The signals were acquired by acceleration sensors installed on the drive-end housing of the motor (2 hp) under four loads, with a sampling frequency of 12 kHz. Single-point damage of different diameters was introduced through electro-discharge machining (EDM), and bearing failures were categorized as inner ring damage (ID), ball damage (BD) and outer ring damage (OD). Table 2 summarizes the features of the CWRU bearing dataset. Each dataset consists of nine fault types with different characteristics and one normal type. Each type has 600 data samples, and each data sample contains 784 data points. There are 6000 samples in total across the 10 types, which are divided into 4200 training samples, 900 validation samples, and 900 testing samples. The original signal is cut into sub-signals of the same size by the resampling method; then, time-frequency images are obtained through CWT. The time-domain signals and time-frequency images of the four conditions (normal, inner race, ball, outer race) in the CWRU dataset are shown in Figure 5.
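The sample counts above correspond to a 70/15/15 split of the 600 samples of each type (420/90/90). One way to obtain such a split is sketched below; whether the authors split per type (stratified) is an assumption, and the function name and the fixed seed are hypothetical.

```python
import random


def stratified_split(labels, n_train, n_val, seed=0):
    """Shuffle the indices of each class and take n_train / n_val / remaining of them."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test


labels = [c for c in range(10) for _ in range(600)]  # 10 types x 600 samples
train, val, test = stratified_split(labels, n_train=420, n_val=90)
print(len(train), len(val), len(test))  # 4200 900 900
```

Splitting per type keeps every fault type equally represented in the training, validation and testing sets.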

Performance under constant load
The convergence and robustness of the model are investigated under four constant loads (0, 1, 2 and 3 HP), and the accuracy and loss curves of the model training process are shown in Figures 6 and 7. According to the accuracy curves for the four loads in Figure 6, the proposed model converges quickly and stably to 1 under all loads. Furthermore, the loss converges steadily to 0 during model training in Figure 7. By epoch 20, the loss values under the four loads are close to each other. These results show that the model has satisfactory robustness under constant loads.
Figure 8 displays the confusion matrices under the four loads (0, 1, 2 and 3 HP) for the testing set, and the accuracies of the model under the four loads are shown in Table 3. According to Table 3, the model achieves a diagnostic accuracy of at least 99.78%. To analyze in more detail the result under the 1 HP load, whose accuracy is the lowest, the precision, recall, and F1-score of the model under this load are shown in Table 4. Together with the former results, these details demonstrate the high diagnostic accuracy under constant load. The t-SNE method can effectively map features from a high-dimensional to a low-dimensional space. The outputs of the Dense 2 layer on the testing set were visualized via t-SNE in Figure 9. The proposed model shows distinct clusters with clear classification boundaries.
To demonstrate the superiority of the proposed model, five comparison models are selected, and the performance of all six models is evaluated on the same testing dataset. Moreover, 5-fold cross-validation is adopted to ensure experimental reliability. Among the comparison models, CNN-LSTM is a single-path serial fault diagnosis model. Except for SVM, the major parameters of the other four comparison models are almost the same as those of the proposed model. In addition, the input of CWT-CNN is the time-frequency image, while the inputs of the other four models are vibration signals. Accuracy results are depicted in Figure 10. Based on Figure 10, the accuracy of CNN-LSTM is the lowest, that of CNN reaches 97%, and that of CWT-CNN reaches 99%, indicating that the time-frequency image represents fault features better than the time-domain signal. The proposed model adopts the parallel mechanism to extract features from the local spatial information and the temporal dependencies of the signal in both domains, and its diagnostic accuracy is superior to that of the other five models, indicating the effectiveness of the parallel feature extraction network. For the feature extraction network with CNN and LSTM connected in series, the spatial distribution features are extracted by CNN and then the time-correlated features are extracted by LSTM. Although such a model considers the periodicity of the signal frequency distribution, CNN corrupts the temporal features of the original data, resulting in the loss of important temporal features, which affects the quality of the final extracted features. By making up for the shortcomings of the series feature extraction mechanism, the parallel feature extraction network proposed in this paper can significantly improve the quality of the extracted features.
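The 5-fold cross-validation mentioned above partitions the samples into five folds and holds out each fold once. A minimal index generator could look as follows; it is a sketch with a hypothetical function name, and it assumes the samples have already been shuffled, since it assigns contiguous blocks to the folds.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (training indices, held-out indices) for each of the k folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, held_out
        start += size


folds = list(k_fold_indices(6000, k=5))
# Every sample is held out exactly once; each fold holds out 1200 of the 6000 samples.
```

Averaging a model's accuracy over the five held-out folds gives a more reliable estimate than a single split, which is why it is a common choice when comparing several models.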

Performance with variable loads
Based on the CWRU dataset, composite datasets under mixed loads were constructed for the experiment. The name of each dataset indicates its components (e.g., 12HP means that the dataset contains the 1 HP and 2 HP loads). The composite datasets are tested on the proposed model, and the results are displayed in Table 5.
Based on Table 5, the proposed model achieves satisfactory diagnostic accuracy on the composite datasets, indicating that it has excellent adaptability and is suitable for variable load conditions.
Real working environments of bearings are usually harsh, and the vibration signal is inevitably mixed with noise. To simulate signals acquired under noise interference, Gaussian white noise with different SNRs (-6, -3, 0, 3, 6 and 9 dB) is added to the original signal. Figure 11 shows the time-domain and frequency-domain images of the original signal as well as the signals with noise added at these SNRs. The added noise makes it challenging to distinguish certain features within the bearing vibration signal.
Comparison experiments on model accuracy under different SNRs were designed, and their results are summarized in Figure 12. They show that noise does affect the diagnostic accuracy of the model: the more severe the noise interference, the lower the accuracy. However, the proposed model achieves the highest accuracy under the different SNRs compared with the other models.

JNU dataset experiment
To test the generalizability of the proposed model, an experiment was performed on the JNU bearing dataset. Data pre-processing here is the same as that on the CWRU dataset.
The JNU bearing dataset is provided by Jiangnan University. It consists of three bearing vibration sub-datasets at different rotational speeds, with a data acquisition frequency of 50 kHz. Each sub-dataset contains one normal state and three fault states: ID, BD and OD. Based on the differences in working conditions, 12 fault types are classified. The features of the JNU bearing dataset are summarized in Table 6. The confusion matrix for the testing data is presented in Figure 14. The accuracy of the proposed model is 93.80%. The classification result of the testing set was then visualized via t-SNE, as shown in Figure 15. Based on Figure 15, bearing failures of different types have clear classification boundaries. The proposed model has high identification accuracy for most fault types but low accuracy for the types with true labels 2, 3 and 8. For these three labels, the time-frequency images are similar, which may in turn lead to misdiagnosis. In addition, a comparison experiment is conducted, and the accuracy results are shown in Figure 16. They demonstrate that the proposed model offers higher fault classification accuracy on the JNU dataset than the other four models. In sum, comparative analysis of the results from the CWRU and JNU datasets suggests that the proposed model exhibits superior generalizability on different datasets.

Conclusions
Bearing vibration signals are dynamic and periodic data. Traditional shallow machine learning and single deep learning methods can hardly extract temporal and spatial features at the same time. In deep learning-based single-path fault diagnosis methods, the feature extraction of one stage affects that of the next stage, which in turn affects the final diagnostic accuracy. To tackle insufficient feature extraction in bearing fault diagnosis, a parallel feature extraction mechanism is proposed, which extracts time-correlated information and frequency spatial distribution features in parallel by applying CWT, combined with CNN and LSTM, to bearing fault diagnosis. First, the signal is sliced into equal-sized sub-signals via a resampling technique. Then, CWT converts each sub-signal into a time-frequency image, which, together with the sub-signal, forms the two inputs of the proposed model. Second, the LSTM and 2D-CNN dual-path network is employed to extract the temporal and spatial features in parallel. After the fusion of the extracted temporal and spatial features, 1D-CNN is utilized to obtain the comprehensive features. Finally, the softmax function is applied for fault classification. In addition, a BN layer is embedded after the feature fusion layer, and the mini-batch method is applied to reduce the computational complexity. On the CWRU dataset, the model presents satisfactory accuracy and robustness, and performs well under complex conditions with variable loads and SNRs. Moreover, the generalization performance of the model is further validated on the JNU dataset. In sum, a novel fault diagnostic model is proposed that achieves satisfactory diagnostic accuracy under constant and variable loads. However, it is unable to obtain satisfactory results under strong noise, and denoising could be one of the future research directions.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

For nonlinear vibration signals, 1D time-domain signals fail to adequately represent all fault features, and traditional deep learning methods are unable to sufficiently extract the frequency components and time-correlated features. Stacking neural networks might lead to model overfitting and increase training time. Compared with the time-domain signal, time-frequency images can represent more features of non-stationary signals and reveal the complex signal distribution of local impacts. In this paper, based on multi-domain learning, a fault diagnosis model based on parallel feature extraction with CNN and LSTM is proposed, aiming at extracting features from both the time and frequency domains. Furthermore, the proposed model exhibits satisfactory accuracy and robustness with a shallow network structure. The diagnostic procedure of the model is illustrated in Figure 3.

Figure 3. The diagnostic procedure of the proposed model.

Step 2: Dual-path feature extraction. At this stage, a dual-path feature extraction network is constructed, and the 1D vibration data and 2D time-frequency images are taken as the inputs of this step: (i) the 1D vibration data are input into the LSTM network for temporal feature extraction; (ii) the time-frequency images are input into 2D-CNN for local spatial feature extraction.

Figure 4. The framework of the proposed diagnostic model.

Figure 5(a) exhibits the time-domain waveforms of the signals, while Figure 5(b) exhibits their time-frequency images after CWT.

Figure 8. Confusion matrices under four loads (0, 1, 2 and 3 HP) for the testing set.

Figure 12. Diagnostic accuracy of different models under different SNRs.

Figure 13(a),(b) shows the accuracy and loss function of the model training process, where the accuracy on both the training and validation sets gradually converges and the loss function narrows down.

Figure 14. Confusion matrix of testing set.

Figure 15. Classification results of testing set.

Figure 16. Accuracy of different models on JNU dataset.

Table 4. Diagnostic results under the 1HP load.

Table 5. Diagnostic accuracy with composite datasets.