Fault Diagnosis of Induction Motors Using Recurrence Quantification Analysis and LSTM with Weighted BN

. Motor fault diagnosis has gained much attention from academic research and industry to guarantee motor reliability. Generally, there exist two major approaches in the feature engineering for motor fault diagnosis: (1) traditional feature learning, which heavily depends on manual feature extraction, is often unable to discover the important underlying representations of faulty motors; (2) state-of-the-art deep learning techniques, which have somewhat improved diagnostic performance, while the intrinsic characteristics of black box and the lack of domain expertise have limited the further improvement. To cover those shortcomings, in this paper, two manual feature learning approaches are embedded into a deep learning algorithm, and thus, a novel fault diagnosis framework is proposed for three-phase induction motors with a hybrid feature learning method, which combines empirical statistical parameters, recurrence quantiﬁcation analysis (RQA) and long short-term memory (LSTM) neural network. In addition, weighted batch normalization (BN), a modiﬁcation of BN, is designed to evaluate the contributions of the three feature learning approaches. The proposed method was experimentally demonstrated by carrying out the tests of 8 induction motors with 8 diﬀerent faulty types. Results show that compared with other popular intelligent diagnosis methods, the proposed method achieves the highest diagnostic accuracy in both the original dataset and the noised dataset. It also veriﬁes that RQA can play a bigger role in real-world applications for its excellent performance in dealing with the noised signals.


Introduction
An induction motor is one of the most critical components in industrial processes due to its high reliability, low cost, and robust performance. It has been widely used as the key machine dynamical equipment to generate electromagnetic torque. However, the mechanical degradation with natural aging process, coupled with the fact that motors are often exposed to multifarious harsh environments, makes motors vulnerable to various sorts of faults [1]. Any unexpected motor failure may lead to the unexpected downtime and repair expense and even cause human casualties. erefore, accurate, effective, and reliable motor fault diagnosis approaches have received considerable attention from academic research to guarantee the enhancement of security and avoid the downtime losses [2]. Various sensing techniques for signals of vibration, noise, voltage, current, flux, etc. [3][4][5] have been used in the motor fault diagnosis during the past ten decades. Among those techniques, vibration-based sensing is most frequently used due to its easier acquisition of more detailed information [3].
Recently, owing to the significant development of the computing ability [6], massive efforts for motor fault diagnosis have been devoted to the data-driven approaches. For instance, Diazet provided an experimental comparative evaluation of various classifiers including k-NN, Bagging, AdaBoost, and SVM for motor fault detection [7]. Zhang presented a motor fault identification method through sparse representation and achieved good robustness to noise [8]. Pan improved the reliability of motors with feature extraction of entropy and SVM classifier [9]. Currently, several deep learning algorithms, which are thought as effective tools for the high-level feature learning based on the nonlinear transformations through multiple layers, are widely used to discover the underlying feature representations [10]. Several deep learning techniques such as sparse autoencoder (SAE) [11], deep stacking network (DSN) [12], and convolutional neural network (CNN) [13] are used for identifying motor faults. Wang obtained the time-frequency scale-down maps using short-time Fourier transform (STFT) and classified the faulty motors using CNN [13]. Sun proposed a fast diagnosis method through convolutional discriminative learning of a BP network [14]. LSTM, as a significant branch of RNN, capable of addressing data sequences of varying length, encoding temporal information, and capturing long-term dependencies [6], has been successfully used in several machinery fault diagnosis tasks [15][16][17][18]. Recurrence quantification analysis (RQA) is a useful tool to deeply investigate the mechanical vibration dynamic properties, owing to its good capability in dealing with nonstationary data sequences, especially the data sequences involving continuous fluctuations and presenting nonlinear characteristics [19]. RQA has proven its effectiveness in bearing fault detections [20,21].
However, the abovementioned methods still have some shortcomings. e performances of the traditional methods [7][8][9] generally depend on the manual feature extraction techniques, which may be incapable of fully exploiting the raw data and may omit some important information. Besides, they have limited adaptability when facing the new diagnosis issues or application objects. For the deep learning methods [11][12][13][14], the black-box characteristics make the intermediate representations hard to explain, and the lack of domain expertise weakens the interpretability. In some papers like [13,18], the deep learning techniques only act as classifiers, which are overqualified and constrain the deep learning's capability to discover the underlying features in an end-to-end fault diagnosis. LSTM and RQA have proven effective in machinery fault diagnosis, mainly aiming at bearings [15][16][17][18][19][20][21]. So far, neither RQA nor LSTM has been applied on motors. Furthermore, the deeper research about RQA's adaptability and robustness to different levels of noise has never been considered.
To tackle those issues, a deep learning framework is integrated with manual feature learning techniques to preserve the advantages of both sides. RQA is seamlessly embedded into the stacked LSTM architecture. RQA's antinoise capability is verified by a weighted BN layer. e major contributions of this paper are summarized as follows: (i) Propose a new hybrid feature learning approach that combines statistical parameters, RQA and LSTM neural network, for motor fault detection. Neither RQA nor LSTM has been applied on motors before. (ii) A modification of BN named weighted BN is designed to assign dynamic weights to the hybrid feature set and then perform batch normalization. It shares BN's advantages and possesses the capability to evaluate the contributions of the three various feature learning approaches [22]. (iii) Two datasets, respectively, the original dataset and the noised dataset, are acquired from tests to verify the proposal's adaptability and robustness to different levels of noise. e remainder of this paper is organized as follows. In Section 2, tests of 8 three-phase induction motors with different fault types on the drivetrain diagnostics simulator platform are introduced. en, the related theories of RQA and LSTM are presented, and the specific procedures of the proposed method are described in detail in Section 3. en, in Section 4, the performance of the proposed method is illustrated and discussed. Several other methods are tested and contrasted. Finally, Section 5 provides concluding remarks.

Tests Description.
In order to experimentally verify the performance of this proposed approach, tests of 8 three-phase induction motors with different fault types under a uniform operation condition were carried out in a drivetrain diagnostics simulator platform. As shown in Figure 1(a), this system is composed of a three-phase induction motor, a magnetic brake for loading, a two-stage fixed-axis gearbox, and a two-stage planetary gearbox. is test rig meets the all configuration's requirement of vibration analysts and provides a practical and reliable test environment for motor diagnosis. A data acquisition card NI-9234, shown in Figure 1(b), is selected as the dynamic signal acquisition (DSA) module for making high-accuracy measurements from sensors. e ADC resolution is 24 bits, the dynamic range is 102 dB, and the sampling rate per channel is 51.2 kS/s. e selected acceleration sensor BW-BJ14530, shown in Figure 1(c), belongs to the voltage-output type. e sensibility (±5%) is 100 mV/g, and the measurement range is 1-5 kHz. Specifically, the tests were under the operating condition of 33.90 N m (25 lbf. ft.) load. e motor frequency was set to 15 Hz, and thus, the rotation rate was 900 r/min. Under this uniform operating condition, 8 motors with different fault types (1 healthy and 7 faulty) were used to generate the required dataset. Data from the healthy motor are used as a benchmark for comparison with the experimental data from other faulty motors. e faulty types are as follows: (1) builtin broken rotor bars (BB), (2) built-in bowed rotor (BR), (3) rotor misalignment (RM), (4) stator winding faults (SW), (5) voltage unbalance and single phasing (VP), (6) built-in rotor unbalance (RU), and (7) faulted bearings (FB). e faulty types and the corresponding causes are listed in Table 1.
e acceleration sensor was used in the experiment and installed at the shell of motors to measure the vibration signals in radial direction. e acceleration signals were collected by the data acquisition card, and the sampling frequency was set to 10.24 kHz. Each test lasted approximately 120 s, and thus, the number of raw data points acquired from one motor is approximately 1228800. Figure 2 displays a set of time waveforms of the measured acceleration signals.

Datasets Generation.
From the 120 s acceleration signals, the middle 100 s stable signals are selected as the training and test data. A sliding window is used to obtain the samples of the same length. Suppose the length of window as l, the step size s, and the number of data points N. One sample is generated for every step. us, the sample number n is   Figure 2: e waveforms of the measured acceleration signals of 8 motors.

Shock and Vibration
n � N − l s + 1. (1) In this paper, the step size and window length are both set to 1024. It means that every 0.1-second signal which contains 1024 data points (10.24 kHz) represents a sample.
us, there are 1000 samples of each fault type, and those samples form the original dataset named dataset 1.
As the motor frequency is 15 Hz, one sample consists of 1.5 cycle of motor periodic rotation signals. erefore, the difference between samples is more distinct, compared with the samples containing integral cycles of signals in previous works [11,12,14]. In this way, it can be demonstrated that our proposed framework attains the goal of fault diagnosis by revealing the underlying representative features instead of overfitting and memorization of training samples when the samples are almost the same.
For the sake of investigating the antinoise capability of the proposed algorithm and verifying its effectiveness in real-world fault diagnosis applications, Gaussian noise with a signal-to-noise ratio of 5 dB is artificially embedded into the dataset 1.
is noising method is based on the assumption that the acceleration signals of motors in real world contain a higher-level Gaussian noise [23]. e processed dataset is named as dataset 2, which obviously elevates the difficulty level of fault diagnosis.
To obtain a precise diagnostic accuracy, a five-fold crossvalidation method is adopted to split the training and test data. e dataset is segmented into 5 subsets, and the holdout method is repeated 5 times. Each time, one of the subsets is used as the test set and the rest as the training set. 3.1. RQA. RQA is a kind of nonlinear analysis for the dynamical system based on the view of a phase space concept, aiming at quantifying the recurrence plots (RPs) [19]. Supposing a series t 1 , t 2 , . . . , t n with length n, it can be reconstructed to a new phase space T according to the Takens embedding theorem [24] and time-delay approach. T can be described as a reconstructed matrix:

The Proposed Method
where m is the embedding dimension, τ is the time delay, T is an m × m matrix, and T(i) presents the ith row of T. Consequently, the recurrence matrix and RPs can be calculated as follows: where ‖ · ‖ denotes the L2 norm, Θ (·) represents the Heaviside function, ε is the recurrence threshold parameter, τ is the time delay, and R i,j is a square matrix with is formula means that if the L2 distance between T(i) and T(j) is less than ε, then R i,j � 0. Otherwise, R i,j � 1 and a black dot is located at the (i, j) in a two-dimensional space. After all T(i) and T(j) are processed through this function, the drawn graph is a RP. As a visualization tool, RP can graphically describe the dynamic characteristics in a qualitative way and reveal the latent timecorrelated signatures. 8 RPs of 8 samples each with one fault are selected as examples, which are shown in Figure 4.
In RQA, three major parameters, respectively, embedding dimension m, time delay τ, and threshold ε, need to be determined. m is set to 4 based on the false nearest neighbours (FNNs), and τ is set to 5 based on the mutual information [25]. ε is empirically set to 0.8.
It is highly impracticable to directly use RPs to classify faulty types for their low resolution. us, RQA appears as a good tool to quantify RPs with recurrence statistics. e core of RQA is to identify and quantify the transient recurrent patterns which characterize the dynamic change behaviors. In this paper, 10 recurrence parameters named R 1 − R 10 are derived from the distribution of points and vertical lines of the RPs. e recurrence rate (RR) is the simplest parameter of cross-correlation sum. It is defined as where n is the number of RPs points and R i,j is the binary matrix. RR represents the density and trajectory of PRs points and physically indicates the probability of recurring a specific state. Determinism (DET) is a criterion of the predictability of a system. It is given by where l min is the minimum value of the diagonal line length and P(l) denotes the distribution of the diagonal line length. DET is used to distinguish the organized RPs points from the dispersed ones. A high value of DET reflects a stable system while a low value indicates a stochastic system. Shannon entropy (LENTR) denotes the complexity of a system. It is calculated as where l min is the minimum value of the diagonal line length and P(l) denotes the distribution of diagonal line length. LENTR indicates the complexity of RPs and can be used to estimate how much information is needed to recover the RPs. Laminarity (LAM) corresponds with the number of laminar phases of a system. It is defined as where TT estimates the average time in which a state of RPs is trapped.
Average diagonal length (ADL) is the average diagonal line length. It is the mean length of diagonal parallel lines. Similarity, longest diagonal length (LDL) is the longest diagonal parallel line length, longest vertical length (LVL) is the longest vertical line length, and average vertical length (AVL) is the average vertical line length. Besides these 9 parameters from the distribution of RPs, recurrence time is chosen as the 10th parameter, which evaluates the complexity of RPs though the calculation time. (RNN) is one of the deepest neural networks, which can address the data series with arbitrary length and has been applied successfully in many end-to-end tasks [26]. For the sake of avoiding the gradient vanishing or exploding problem, a modified RNN architecture named long short-term memory (LSTM) which involves a memory cell is created [27]. A concept of forget gates is introduced in LSTM to deal with the long-term dependency problem.

LSTM. Recurrent neural network
ese forget gates can screen the useful information of the historical network cells to capture more meaningful and distinct information in data series.
At time step t, the hidden state h t of the LSTM cell is updated with the following terms: the input data x t , the input gate i t , the output gate o t , the forget gate f t , the memory cell c t , and the hidden state h t−1 of the previous LSTM cell at time step t − 1. e schematic diagram of the architecture of a LSTM cell is shown in Figure 5. ose parameter-updating formulas are expressed as follows: where σ is the activation function of sigmoid and ⊙ represents the element-wise product. W, V, and b are, respectively, a d × k matrix, a d × d matrix, and a d-dimensional vector, where d is the input series length and k is the hidden layer size.
In order to establish a better LSTM architecture with higher diagnostic accuracy, several parameters, such as layer number, time steps, and learning rate, need to be determined. e quantification of these parameters is mainly based on a comparative evaluation of the performances of various optional values in dataset 1. e accuracy and cost time at different layer numbers are shown in Figure 6, with the layer size set to 64. It can be observed that as the layer number increases from 1 to 5, the accuracy curve has a distinct rise at first 3 layers and then tends to be stable, while the cost time always increases as the layers get deeper. By comprehensive consideration of both accuracy and efficiency, the optimal layer number is selected as 3.
During the construction of the stacked LSTM, the backpropagation is used for the update of weights, of which the learning rate is the main parameter. Different learning rates and corresponding results of 10 times of repeat tests are visualized in the boxplot in Figure 7(a). It shows that the test accuracy is relatively low as the learning rate is too small or too big. us, 0.001 is selected as the learning rate. en, the selection of time steps is investigated. Figure 7(b) shows that it attains the highest accuracy as the number of time steps is 4 and the performances vary little when the number of time steps is 2 and 4. It means that a sample with 1024 data points is segmented into 4 parts and converted to a tensor S ∈ R 4×256 . e hidden layer size of each stacked layer is empirically, respectively, 256, 128, and 64 in a descending order, and thus, the output size of each layer is 4 × 256, 4 × 128, and 4 × 64. e max-pooling function is used to eliminate the dimensionality curse and retain the most useful information of a region by returning the maximum value. In this method, a max-pooling layer is added after the stacked LSTM layers to convert the 4 × 64 output to a flatten 64-dimensional vector. To prevent model form overfitting, dropout based on early stopping mechanism is adopted in the training process [28]. Probability of dropping out neurons in the convolution layers is set to 0.25.

Statistical Features.
Twenty-nine statistical features, including 16 time-domain features and 13 frequency-domain features, are extracted [29]. It is shown in

Weighted BN.
BN is a popular type of normalization method which can transform the distribution of any neuron's input during a batch of iterations to Gaussian distribution in deep NNs [22]. BN has displayed its impressive impact on reducing overfitting and avoiding gradient dispersion. e weighted BN is a modification of the BN layer, which consists of a weighting layer and a BN layer. It is designed to assign dynamic weights to the hybrid features and then perform batch normalization. It targets not only on reducing overfitting and eliminating the phenomenon of internal covariate shift but also introducing an assessment mechanism of feature importance and facilitating evaluation of the contributions of the three feature learning approaches. e weighting layer can be regarded as a customized neural network layer before BN layer and optimized during the whole training process. e role of the weighting layer is to assign dynamic weights to the hybrid features. For the RQA feature set R 1 , R 2 , . . . , R 10 and statistical feature set TF 1 , . . . , TF 16 , FF 1 , . . . , FF 13 }, a one-by-one corresponding weighting vector W � w 1 , w 2 , . . . , w 39 is employed, which is expressed in equation (10). For the 64-dimensional LSTM output, due to the facts that no general consensus has been reached about the specific meaning of each dimension of LSTM output and there already exists a large amount of intrinsic weights and biases in LSTM, we assign a unique weight w 40 to the 64-dimensional vector, which is expressed in equation (11).
L l j � w l 40 L l−1 j , j � (1, 2, . . . , 64), where F i is the ith term in the combined feature set R 1 , R 2 , . . . , R 10 , TF 1 , . . . , TF 16 , FF 1 , . . . , FF 13 , L j is the jth number of the LSTM output, and l denotes the lth iteration. Note that the update of weights in the weighting layer is a  Shock and Vibration 7 part of the optimization in the training process based on the batch gradient descent algorithm.

Fully Connected Neural Networks.
ree fully connected layers with the size of {64, 32, 8} is added. In these layers, the neurons at different layers are all connected to each other; activation functions of ReLU are used at the first two layers, and SoftMax is used at the last layer. e whole architecture is constructed through minimizing the following cross-entropy loss function between the predicted values and real labels [30]: where y is the model output and y is the real-label value. e batch gradient descent and backpropagation algorithm are used for optimization and to minimize the cost function L.
us, the weights and biases can be updated through the following equations: where w denotes the weight, b represents the bias, and λ is the learning rate. In addition, the programs are performed on a GeForce GTX TITAN X graphics card and a E5-2630 processor with 126 GB memory, using TensorFlow as backend.

Results and Discussion
In this section, the performance of the proposed method is illustrated and discussed. Several other methods are tested and contrasted.

4.1.
e Diagnostic Accuracy of the Proposed Method. Figure 8 displays the training process during 100 epochs with the training and test accuracy and losses from the original dataset 1 and the noised dataset 2. It can be seen that a significant convergence trend occurs in both datasets. Besides, there are less fluctuations in the accuracy and loss curves of dataset 1 compared with the curves of dataset 2. e distinction can be explained by the fact that the difficulty level of fault diagnosis for dataset 2 with 5 dB noise is obviously higher than that of dataset 1. Figure 9 illustrates the visualized outputs of every step using the established model with 8 test samples of different fault types.
max(t(n)) TF 8 min(t(n)) TF 9 ( N n�1 (t(n) − TF 1 ) 2 )/(N − 1) TF 10 TF 7 − TF 8 TF 11 TF 2 /TF 4 TF 12 TF 7 /TF 2 TF 13 TF 7 /TF 3 TF 14 TF 7 /TF 4 TF 15 TF 5 /(TF 2 ) 3 TF 16 TF 6 /(TF 2 ) 4 Frequency Shock and Vibration Figure 10 shows the diagnostic accuracy of each faulty type using confusion matrix. e diagnostic accuracy is calculated as the rate of the correctly classified test sample number to the total test sample number. It can be seen that average classification accuracy reaches 99.3% in dataset 1 and 98.9% in dataset 2.  where w i is the weight of R i . It shows that the proportions of RQA features' weights from dataset 2 are generally higher than dataset 1. It proves that RQA plays a bigger role in dealing with the noised data and indicates its potentials in real-world applications which require strong antinoise capability.

Visualization at Different Layers.
In order to illustrate how the proposed method executes the classification task step-by-step, the different layers' outputs are visualized with 3-dimensional sketches using a nonlinear dimension reduction method, t-distributed stochastic neighbour embedding (t-SNE) [31]. t-SNE is employed to convert the high-dimensional outputs at different layers to 3D representations, which are shown in Figures 11-12. It can be observed that the 8 groups of point clouds are heavily overlapped at the first few layers, and as the layer goes deeper, they become more separable. Although the dimension reduction by t-SNE involves inevitable errors for the loss of information, it demonstrates the effectiveness of the proposal in datasets 1 and 2 in a qualitative manner.

Performance Comparisons.
To verify the superiority of the proposed method, several state-of-the-art intelligent fault diagnosis methods are employed for comparison: (1) RQA + SVM [19]: independent use of RQA for feature learning and optimal binary tree SVM for classification (2) LSTM: independent use of the proposed LSTM architecture on raw data (3) MLP [7]: multilayer neural network on statistical features (4) CNN [32]: one-dimension CNN on raw data (5) SIFT + CNN [13]: short-time Fourier transform for feature learning and CNN for classification (6) CDFL [14]: convolutional discriminative learning of a BP network In (1) and (2), the separate use of RQA and LSTM aims at proving its indispensability in the hybrid method. In (1), 10 RQA features mentioned in Section 3 are extracted and 7 optimal binary tree SVMs with RBF kernels are employed as the binary classifiers. In (2), the same three-layer stacked LSTM architecture is adopted. In (3), since MLP is incapable of addressing sequential raw data, the 29 statistical features mentioned in Table 2 are taken as the input. e hidden layers with layer size {64-128-64} are adopted. In (4), a 1D-CNN architecture, comprised of 3 convolutional and pooling layers, one flatten layer, and 3 fully connected layers, is used ( Figure 13). e kernel length of the convolutional layers is set to 64, and the max-pooling function is utilized in the pooling layers. In (5), 1D data are converted to a 2D time-frequency maps through short-time Fourier transform. en, the maps are compressed into 100 * 100 squares and fed into a deep CNN. In (6), discriminative learning based on BPNN is incorporated into the unsupervised CNN. e training of a BPNN takes the place of the training of a CNN, with the input size equal to filter number 32 and the hidden layer size equal to filter size 64. e classification results of all methods on datasets 1 and 2 are shown in Figure 14.
Several conclusions can be drawn from Figure 14. (1) e proposed hybrid method outperforms the other methods with diagnostic accuracies of 99.3% and 98.9% in the two datasets.

Shock and Vibration
Besides, the difference value 0.04% is the smallest, which indicates its stronger noise resistance. (2) e separate use of RQA and LSTM obtains lower diagnostic accuracy compared with the proposal. It indicates that both RQA and LSTM neural network are indispensable in the hybrid method. In addition, RQA + SVM performs worst, and it can be explained by the limitation of the inadequate number of features and the dependence of manual feature learning. (3) Generally, the deep learning techniques, including CNN, SIFT + CNN, LSTM, and CDFL, perform better than the traditional methods. It can be explained by the fact that the deep learning methods, with stronger feature learning and representation capacity, always present a superior performance to the methods that require manual feature extraction [33]. (4) e major shortcoming of the proposal is the computational cost. e average cost time of 100 epochs is 530 s, only faster than SIFT + CNN's 690 s and slower than most of the comparative methods. It can be interpreted by the hybrid complex architecture which is embedded with three different feature learning approaches. A part of the computational space is occupied by drawing the RPs and extracting various statistical features. Besides, the LSTM neural network contains the deepest layers among those methods.

Conclusions
In this paper, a novel fault diagnosis framework with high accuracy for three-phase induction motors is presented. A hybrid feature learning approach that combines empirical statistical parameters, RQA and LSTM, is proposed to integrate the state-of-the-art deep learning techniques with the manual feature learning approaches to gain a superior and robust performance. In addition, a modification of BN named weighted BN is designed to evaluate the contributions of each feature learning approach and facilitate validating the noise resistance performance. e tests of 8 motors (1 healthy and 7 faulty) are carried out to form datasets 1 and 2. It verifies that the proposed method, with the highest accuracies of 99.3% and 98.9% in fault recognition, performs better than other methods and possesses good noise resistance. More specifically, it yields 18.5% and 8.1% average performance improvements compared with RQA + SVM and MLP; it yields 3.6%, 4.7%, 9.2%, and 2.1% average performance improvements compared with the four deep learning methods, LSTM, CNN, SIFT + CNN, and CDFL. e weight distribution of the weighted BN illustrates RQA is more effective in dealing with the real-world noised data.
In future research, our effort will be devoted to two aspects. (1) Make the proposed method reliable for practical use, which demands a large amount of accumulated industry data.

Data Availability
e .tdms data used to support the findings of this study are available from the corresponding author upon request.