A Novel Hierarchical Algorithm for Bearing Fault Diagnosis Based on Stacked LSTM

Operating under severe conditions, rolling bearings tend to be among the most vulnerable components in mechanical systems. Owing to requirements of economic efficiency and reliability, effective fault diagnosis methods for rolling bearings have long been an active research topic in the field of rotating machinery. However, traditional methods such as the support vector machine (SVM) and the backpropagation neural network (BP-NN), which are composed of shallow structures, run into a dilemma when their accuracy is to be improved further. To overcome the shortcomings of shallow structures, a novel hierarchical algorithm based on stacked LSTM (long short-term memory) is proposed in this paper. Without any preprocessing operation or manual feature extraction, the proposed method constructs an end-to-end fault diagnosis framework for rolling bearings. Benefiting from the memorize-forget mechanism of LSTM, features inherent in raw temporal signals are extracted hierarchically and automatically by stacking LSTM layers. A series of experiments demonstrates that the proposed model not only achieves up to 99% accuracy but also outperforms some state-of-the-art intelligent fault diagnosis methods.


Introduction
Rolling bearings are vital elements in auto-manufacturing and heavy-load mechanical systems [1,2]. Because of harsh working conditions, even a small bearing fault may have fatal consequences for the machine, leading directly to severe economic losses and casualties.
Therefore, there is an urgent and practical demand for detecting and recognizing faults in rolling bearings automatically and as early as possible. Over the past decades, with the rapid development of computer technology, scholars around the world have devoted considerable effort to bearing fault diagnosis, and many excellent intelligent algorithms have been proposed and used in practical applications.
Vibration analysis is one of the prevalent signal processing techniques widely used in fault diagnosis [3]. Machinery fault diagnosis with vibration signal analysis can be cast as a pattern recognition problem consisting of three main steps: feature extraction, feature selection, and classification [4,5]. Pattern recognition algorithms such as the backpropagation neural network (BP-NN) and the support vector machine (SVM) are representative methods in rotating machinery fault diagnosis [6,7]. However, signals collected by vibration sensors are usually nonstationary and complex, and, what is worse, heavy background noise adds to the difficulty of feature extraction. Labor-intensive work or expert knowledge is indispensable before an effective model can be constructed for a given fault diagnosis problem. Therefore, it is a great challenge to learn discriminative fault features effectively and automatically.
To the authors' knowledge, two main families of feature extraction algorithms exist in bearing fault diagnosis: signal processing-based algorithms and artificial intelligence-based algorithms. The former, which draw on prior knowledge of the signal and make full use of signal processing techniques such as time-frequency analysis (TFA) [8], wavelet packet transform (WPT) [9], and the recently prevalent empirical mode decomposition (EMD) with its variants [10], have proven effective in many advanced applications. Thanks to the maturity of signal processing techniques, these methods have been widely used. However, their lack of flexibility limits further improvement in recognition accuracy. Namely, a model whose parameters are tuned for one fault diagnosis problem is applicable to that problem, but variations in load or other factors may degrade its accuracy. The latter, represented by SVM and BP-NN, have sprung up because of their accessibility. Without knowing the internal mechanism, users can design a fault diagnosis system based only on the theory of pattern recognition and without much effort. We hold the view that the easier an algorithm is to use, the more widely it is adopted. Therefore, in this paper, we place emphasis on intelligent methods for bearing fault diagnosis.
Recently, a large body of research has raised intelligent methods to a new level. Chine et al. [11] utilized several features and constructed an ANN-based voltaic system for fault diagnosis. Jamadar and Vakharia [12] extracted 24-dimensional parameters describing bearing working conditions and sent them to a BP neural network for fault recognition. Xia et al. [13] used Volterra series to recognize different working conditions of a rotor-bearing system, with a backpropagation (BP) neural network as the classifier. Batista et al. [14] took 13 statistical features in both the time and frequency domains to describe different bearing conditions and then employed an SVM with a radial basis kernel for fault diagnosis. Zhang et al. [15] combined EEMD permutation entropy as the input feature with an SVM for fault recognition. Zheng et al. [16] employed multiscale fuzzy entropy as input features and an SVM as the classifier.
Although the intelligent algorithms mentioned above have achieved acceptable accuracy and have been widely applied in practical engineering, they have two significant disadvantages. (1) Diagnosis effectiveness depends largely on feature extraction, which is performed mainly by hand according to the knowledge of mechanical engineering experts, and the quality of the extracted and selected features plays a vital role in performance. Features that are manually selected or handcrafted may not optimally characterize vibration signals and thus cannot provide a generic solution usable for any bearing fault data [17]. What is worse, selecting the most sensitive features for different diagnosis problems is time-consuming and labor-intensive, which increases the burden on workers and researchers. (2) Intelligent diagnosis methods such as BP-NN and SVM are shallow learning structures; that is, only one hidden layer is used for nonlinear transformation. Several studies have clearly illustrated that shallow architectures hinder the learning of complex nonlinear relationships in fault diagnosis problems [18,19]. Therefore, it is essential to establish a deep and hierarchical architecture for better feature learning in rolling bearing fault diagnosis.
Deep learning, also known as deep neural networks (DNNs), has attracted increasing attention from scholars in various fields in recent years. The predominant strength of deep learning is its capacity to learn complex nonlinear features: it can discover inherent structures and useful features from raw data through a layerwise learning procedure. A great number of studies have demonstrated its potential in many fields, such as natural language processing (NLP) [20], computer vision (CV) [21], and mechanical fault diagnosis [22]. Chen et al. [23] introduced CNN-based deep learning to gearbox fault diagnosis. Several time-domain and frequency-domain features were extracted and sent to the CNN, and a softmax-based classifier was used for fault diagnosis. Although a CNN is used in that work, it acts more as a classifier than as a feature extractor. Therefore, the capacity of the CNN has not been fully utilized. Guo et al. [24] took a similar approach, using time, frequency, and time-frequency features as the input of a deep neural network and constructing an autoencoder-based DNN for whole-life validation of bearings.
Long short-term memory (LSTM) [25], an important member of the recurrent neural network (RNN) family, has become a research hot spot recently. By utilizing the spatial and temporal information inherent in raw temporal signals, in a way that imitates human memory, LSTM-based structures have the potential for higher accuracy in fault diagnosis. In addition, with its selective memory mechanism, LSTM addresses the long-term dependency problem of conventional RNNs.
In this paper, a novel fault diagnosis method, a hierarchical LSTM-based deep network, is proposed for both feature learning and fault recognition of rolling bearings. Experimental results verify that the proposed method obtains higher accuracy without relying on manual feature extraction or advanced signal processing techniques. To the best of the authors' knowledge, this paper is the first attempt to apply a hierarchical LSTM-based strategy to rolling bearing fault diagnosis, which is meaningful and pioneering. The rest of the paper is arranged as follows: Section 2 briefly reviews LSTM theory, Section 3 describes the proposed method for bearing fault diagnosis, Section 4 presents the experiments, and Section 5 concludes the paper.

2.1. The Origin of LSTM. Recurrent neural networks (RNNs) were first introduced to solve time-sequence learning problems. Unlike traditional neural networks built from multilayer perceptrons, which can only map input data to target vectors, RNNs are in principle able to trace back the whole history of previous inputs. Like many other neural networks, RNNs are trained with the backpropagation algorithm. However, vanishing or exploding gradients during backpropagation greatly limit the performance and potential of RNNs, which means that traditional RNNs cannot capture long-term dependencies. Therefore, LSTM was proposed to remove this limitation. Forget gates, which govern the flow of information among different cell states, are used to avoid the long-term dependency problem [26]. For learning effective representations and nonlinear dynamic features in time-series data, LSTM is superior to traditional RNNs in that it mitigates the vanishing or exploding gradient problem and is therefore capable of capturing long-term dependencies.

Basic Theory of LSTM.
The main idea behind LSTM is that a few gates controlling the information flow along the time axis allow more accurate long-term dependencies to be captured at each time step. Specifically, at each time step $t$, the hidden state $h_t$ is updated by fusing the data at the same step $x_t$, the input gate $i_t$, the forget gate $f_t$, the output gate $o_t$, the memory cell $c_t$, and the hidden state at the previous time step $h_{t-1}$. In the update equations, the model parameters $W \in \mathbb{R}^{d \times k}$, $V \in \mathbb{R}^{d \times d}$, and $b \in \mathbb{R}^{d}$ are learned during training and shared across time steps, $\sigma$ is the sigmoid activation function, $\odot$ denotes the elementwise product, and $k$ is a hyperparameter that characterizes the dimensionality of the hidden layers.
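For reference, the gates listed above correspond to the standard LSTM update, which can be written as follows. This is the textbook formulation, not a transcription of the authors' equations; the per-gate subscripts on $W$, $V$, and $b$, the candidate cell state $\tilde{c}_t$, and the use of $\tanh$ are assumptions.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + V_i h_{t-1} + b_i), \\
f_t &= \sigma(W_f x_t + V_f h_{t-1} + b_f), \\
o_t &= \sigma(W_o x_t + V_o h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_c x_t + V_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
```

In this form the forget gate $f_t$ scales how much of the previous cell state is retained, while the input gate $i_t$ scales how much new information is written, which is the memorize-forget behavior described in the text.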
First, a basic LSTM is used to process the time-series data. The final output, taken at the last time step, is passed through a linear regression layer with weights $W_r \in \mathbb{R}^{k \times z}$ to predict the output, where $z$ is the dimensionality of the output. In the model training phase, the cross-entropy between the predicted label distribution $q(x)$ and the target label distribution $p(x)$ is used as the loss function:

$$\text{loss} = -\sum_{x} p(x) \log q(x).$$

The activation function enables the network to acquire a nonlinear representation of the input signal, which enhances its representation ability and makes the learned features more discriminative. In this paper, the rectified linear unit (ReLU) is adopted for fast convergence of our model. ReLU has the advantage of making the network sparser and the weights easier to train during parameter adjustment. ReLU can be described by the following equation:

$$a_i^{l+1}(j) = \max\{0, y_i^{l+1}(j)\},$$

where $y_i^{l+1}(j)$ and $a_i^{l+1}(j)$ represent the output of the LSTM and the activation value of $y_i^{l+1}(j)$, respectively. The corresponding LSTM unit architecture is shown in Figure 1.

Therefore, stacking several LSTM layers to form a deep LSTM-based neural network is meaningful. The main idea of a deep neural network is that many nonlinear mapping layers between the inputs and outputs are used for hierarchical feature learning. As shown in Figure 2, the output of a hidden layer is not only propagated forward through time but also used as one of the inputs of the next LSTM hidden layer. Therefore, the $l$-th layer is updated with the hidden state of the layer below, $h_t^{l-1}$, taking the place of the input. The input of the first layer is the raw temporal signal, i.e., $h_t^0 = x_t$, while the output of the first layer is an abstraction of the raw signal, which is regarded as a hierarchical feature. The other LSTM layers use the output of the previous layer as input, and the output of the last LSTM layer is sent to a fully connected layer for classification. The advantages of stacked LSTM are twofold: (1) stacking LSTM layers enables the model to learn characteristics of the raw temporal signal from different aspects at each time step; (2) model parameters are distributed over the whole space of the model without increasing memory capacity, which enables the model to accelerate convergence and refine nonlinear operations on the raw data.
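The layer-by-layer flow described above can be illustrated with a minimal pure-Python sketch of a stacked LSTM forward pass. It uses the standard LSTM gate update and is not the authors' implementation: the function names (`init_layer`, `lstm_step`, `stacked_lstm_forward`), the random initialization, and the weight shapes are all illustrative assumptions. Dropout, the ReLU activation, and the final classification layer are omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, V, b, x, h):
    """Return W x + V h + b for one gate (W: k x d, V: k x k, b: length k)."""
    return [
        sum(w * xi for w, xi in zip(W[r], x))
        + sum(v * hi for v, hi in zip(V[r], h))
        + b[r]
        for r in range(len(b))
    ]


def init_layer(d, k, rng):
    """Random parameters for the four gates of one layer (illustrative init)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    # Keys: i = input gate, f = forget gate, o = output gate, c = cell candidate.
    return {g: (mat(k, d), mat(k, k), [0.0] * k) for g in "ifoc"}


def lstm_step(params, x, h_prev, c_prev):
    """One time step of a standard LSTM cell."""
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]
    g = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def stacked_lstm_forward(layers, sequence):
    """Run a sequence through stacked layers; layer l reads layer l-1's outputs."""
    inputs = sequence  # h_t^0 = x_t: the first layer sees the raw signal
    for params in layers:
        k = len(params["i"][2])
        h, c = [0.0] * k, [0.0] * k
        outputs = []
        for x in inputs:
            h, c = lstm_step(params, x, h, c)
            outputs.append(h)
        inputs = outputs  # hidden states become the next layer's input sequence
    return inputs[-1]  # top-layer hidden state at the last time step
```

For example, three layers built with `init_layer(16, 64, rng)`, `init_layer(64, 32, rng)`, and `init_layer(32, 32, rng)` mirror the 64-32-32 hidden sizes reported in the experiment setup. The 16-point input per time step is an assumption, since the text does not state how a 256-point sample is split into time steps.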
Note that an LSTM neural network recalls memory across time steps. For one-dimensional signal processing, a signal of limited length can be reshaped into a matrix whose rows correspond to the input dimension and whose columns correspond to time steps. Intuitively, LSTM imitates the human memory process: it memorizes a signal line by line and catches the important points.
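A small sketch of this reshaping is given below. The helper name `signal_to_steps` and the choice of 16 points per time step are assumptions for illustration; the text does not specify the split. The result is stored as a list of time steps, i.e., the transpose of the rows-by-columns layout described above.

```python
def signal_to_steps(signal, input_dim):
    """Split a 1-D signal into consecutive time steps of `input_dim` points each."""
    if len(signal) % input_dim != 0:
        raise ValueError("signal length must be a multiple of input_dim")
    return [signal[i:i + input_dim] for i in range(0, len(signal), input_dim)]


# A 256-point sample read as 16 time steps of 16 points (assumed split).
steps = signal_to_steps(list(range(256)), 16)
```

Each element of `steps` is what the LSTM reads at one time step, so the network processes the sample line by line.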

Structure of Proposed Method
Based on the study of rolling bearing fault diagnosis in this paper, a hierarchical LSTM structure is proposed. Figure 3 depicts the flow chart of the proposed method, which consists of three parts: (1) Data augmentation: the training dataset is one of the most important factors in all deep learning methods. The raw data of each condition is a one-dimensional temporal signal; hence, the data augmentation strategy enlarges the training dataset by dividing the raw signal into overlapping segments, which helps to reduce computation cost and accelerate model convergence. (2) Model training: based on the constructed model structure, the input data and their corresponding labels are divided into two groups, a training dataset and a testing dataset. In other words, the proposed method is a supervised training process with unsupervised feature learning. A dropout strategy [27] is adopted after each LSTM layer to avoid overfitting and improve generalization. (3) Evaluation with the testing dataset: after the model has been trained, a test dataset is used to validate the effectiveness of the model, and the evaluation indicator is classification accuracy.
All the steps above form the main framework of the proposed hierarchical LSTM neural network. A series of experiments is performed in the following section.

Experiment Analysis and Discussion
As mentioned above, rolling bearings are essential elements of rotating machinery, and recognizing their faults as early as possible has a great effect on the reliability and performance of the machinery on which they are mounted. Therefore, the Case Western Reserve University (CWRU) rolling bearing dataset, which covers different bearing fault conditions, is adopted in our experiments [28].
The performance of the proposed hierarchical LSTM neural network is compared with some existing state-of-the-art diagnosis algorithms, with details given in the following subsections.

Introduction of CWRU Dataset.
The CWRU dataset is regarded as a benchmark for testing algorithms related to vibration signal analysis of rolling bearings. It consists of vibration time series for various rolling bearing conditions, generated by the test rig shown in Figure 4. The test rig is composed of a 2-horsepower (hp) motor for driving a shaft, a control circuit model for controlling various speeds to meet different requirements, and a torque converter for signal processing.
The sampling frequency of the accelerometers is 12 kHz. In our experiments, the vibration data are collected from accelerometers mounted on the housing with magnetic bases and installed at the 12 o'clock position of the bearings.

Experiment Setup.
In our experiments, 13 health conditions under a 1 hp load are considered. The signals of all conditions are segmented into samples of length 256 with 50% overlap, and each condition has 300 samples, half for training and half for testing. Detailed information about the experiment samples is listed in Table 1.
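The overlapping segmentation can be sketched as below. The function name `segment_signal` and its defaults are illustrative assumptions that follow the stated window length of 256 and overlap of 50%; any trailing points that do not fill a whole window are dropped.

```python
def segment_signal(signal, length=256, overlap=0.5):
    """Cut a 1-D signal into fixed-length windows that overlap by `overlap`."""
    step = int(length * (1 - overlap))  # 128 points for length 256, overlap 50%
    if step <= 0:
        raise ValueError("overlap must be smaller than 1")
    return [
        signal[start:start + length]
        for start in range(0, len(signal) - length + 1, step)
    ]
```

With these settings, 300 samples per condition require at least 256 + 299 × 128 = 38,528 consecutive data points of raw signal.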
Other key parameters of the proposed model are as follows: the input layer has 256 units, equal to the dimension of an input sample; the first, second, and third LSTM layers have 64, 32, and 32 hidden units, respectively; and the RMSprop optimization algorithm [29] is used to train the model. Mean square error (MSE) is used as an indicator for performance evaluation. In the output layer, a softmax classifier is used for classification.
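As an illustration of the softmax output layer together with the cross-entropy loss mentioned in the theory section, a minimal sketch follows. The function names and the small constant `eps` (added to avoid taking the logarithm of zero) are assumptions, not part of the original method description.

```python
import math


def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between target distribution p and predicted distribution q."""
    return -sum(p * math.log(q + eps) for p, q in zip(target, predicted))
```

For 13 classes, an untrained model that outputs equal scores for every class yields a loss of ln(13), which is a convenient sanity check at the start of training.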

Experiment Results.
For fair experiment results, a random selection strategy is adopted for all samples. Namely, 150 samples of each health condition are randomly selected for training, and the remaining samples are used for testing. Each raw temporal signal has length 256 to balance information coverage and computing efficiency. It is worth noting that an "early stopping" strategy is introduced during the training phase for better generalization performance. Even though the maximum number of iterations is set to 100, training stops if the loss on the training dataset does not change much for several iterations. The accuracy and loss of our model during training are plotted in Figure 5. The curves show clear convergence after 43 iterations, which greatly reduces training time and cost.
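The early stopping rule can be sketched as follows. The text only says that training stops when the loss "does not change much for several iterations", so the `patience` and `min_delta` values here are assumed placeholders, and `loss_per_epoch` stands in for one real training epoch.

```python
def train_with_early_stopping(loss_per_epoch, max_epochs=100, patience=5, min_delta=1e-4):
    """Run up to `max_epochs` epochs; stop once the training loss stalls.

    `loss_per_epoch` is a callable mapping the epoch number (starting at 1)
    to the training loss of that epoch. Returns the number of epochs run.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without a meaningful improvement
    for epoch in range(1, max_epochs + 1):
        loss = loss_per_epoch(epoch)
        if best - loss > min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return max_epochs
```

A larger `patience` tolerates longer plateaus at the cost of extra training time, so in practice it would be tuned together with the learning rate.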

Shock and Vibration
To better illustrate the experiment results, a multiclass confusion matrix for the third trial of the proposed method is shown in Figure 6. The multiclass confusion matrix visualizes the classification results of all conditions in detail, including both classification accuracy and misclassification error. The vertical and horizontal axes of the confusion matrix refer to the predicted label and the true label, respectively.
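A minimal sketch of how such a matrix and the per-class accuracies can be computed is given below. It follows the axis convention stated above (rows indexed by predicted label, columns by true label); the function names are illustrative.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count predictions; entry [p][t] is the number of class-t samples predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[p][t] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true class (column) that lies on the diagonal."""
    n = len(matrix)
    accuracies = []
    for t in range(n):
        total = sum(matrix[p][t] for p in range(n))
        accuracies.append(matrix[t][t] / total if total else 0.0)
    return accuracies
```

Off-diagonal entries in a column show which classes a given fault is confused with, which is the misclassification error referred to in the text.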
Most faults reach 100% accuracy with our model. The worst accuracy, 94%, occurs for the outer ring fault with 21 inch at the 6 o'clock position. The total average accuracy is up to 98.8%, which demonstrates the efficiency and feasibility of the proposed method. The detailed parameter settings of the other methods in the experiment are as follows: (1) 1-layer LSTM neural network: the architecture is the same as the first layer of the proposed method. (2) BP-NN: it is also an "end-to-end" neural network, with a 32-unit hidden layer. The whole structure is 256-32-13. (3) SVM: the feature set includes time-domain features (RMS, kurtosis, skewness, variance, standard deviation, etc.) and frequency-domain features. (4) CNN: the CNN method used in our experiment is the one in Reference [23], which has an input layer, 2 convolutional layers, and 2 pooling layers. Time-domain and frequency-domain statistical features are sent to the input layer after being transformed into 2D format; the shape of the input feature map is 32 × 32 with 6 kernels, and the max pooling size is set to 2, with learning rate 0.1 and maximum iteration 100.
Note. IF, OF, and BF mean inner, outer, and ball faults, respectively. 7 inch means the diameter of faults, and so on. @6 means the location of faults in outer fault, and so on.
Figure 7 shows the diagnosis results of all methods over 10 trials. The proposed method clearly has the highest recognition accuracy among all methods, with an average accuracy of up to 98.65%. The 1-layer LSTM neural network has the second highest accuracy, partly because it inherits the memory mechanism of LSTM. However, its shallow structure prevents further improvement in accuracy. It is worth noting that BP-NN has the worst accuracy among all methods. BP-NN has no memory mechanism and uses information only for forward propagation, so it lacks the capacity to learn useful information from previous data points. Its shallow structure also limits its performance.
To display the performance of our hierarchical LSTM neural network graphically, t-SNE [30] is used to visualize each layer of our model. After the first two components obtained by t-SNE are selected, the outputs of all layers are shown in Figure 8. Figure 8 leads to a clear and intuitive conclusion: from the input layer to the output layer, the distribution of each category becomes more and more distinct. Namely, in the input layer all categories are mixed together and no category can be distinguished. After the first LSTM layer, categories No. 0, No. 4, No. 10, and No. 12 have converged into their own regions. As the layers go deeper, each category becomes more clearly distinguishable. Finally, in the output of the third LSTM layer, almost every category occupies its own region in the 2D image, which demonstrates the feasibility of our model.

4.5. Influence of Some Hyperparameters. In our proposed model, two hyperparameters need to be discussed, namely the

obvious that (1) samples with a larger input size tend to achieve better diagnosis performance than smaller ones. This is probably because samples with a larger input size contain more fault information, which is essential to feature learning, while samples with fewer data points may miss the local spatial information inherent in the raw data. (2) The time cost increases rapidly with input size, which is not suitable for online applications. We choose an input size of 256 as a compromise. A similar conclusion can be drawn from Figure 9(b), which shows the number of LSTM layers versus accuracy and computation time, and we choose 3 layers in our model to balance accuracy and computation cost.

Generalization Experiments.
To investigate the generalization capacity of the proposed method, we form testing datasets by taking samples under 2 hp and 3 hp loads corresponding to the samples in Section 4.1. As before, we use confusion matrices to illustrate the results, which are shown in Figure 10. Although some accuracies fall under the load variations, the proposed method still achieves 97.8% and 98.4% total accuracy. Notably, the worst fault recognition accuracy in our generalization experiments is 90.0%, which is still higher than that of some intelligent methods under the 1 hp load, such as SVM, BP-NN, and CNN in the comparative experiments above. This clearly demonstrates the efficiency and superiority of the proposed method.

Discussion
Through the various experiments conducted above, we conclude that the proposed self-learning method based on a stacked LSTM neural network is able to adaptively mine inherent fault characteristics and effectively identify faults with high diagnosis accuracy. The prominent advantage of the proposed method is that features are extracted by the deep structure in a more identifiable way than features obtained by hand-engineering or prior knowledge, which makes the method easier to apply to other diagnosis problems.
However, the proposed method also has some shortcomings that need to be addressed in the near future. (1) The computation cost of the proposed method is higher than that of traditional ones such as SVM or BP-NN. Part of the reason is the limitation of the computer hardware used in our experiments. We believe that this defect can be solved by hardware improvement in the future. (2) The parameter selection of our method requires successive trial-and-error experiments. Some experiments need to be conducted before a suitable model can be constructed for a given fault diagnosis problem. So far, no perfect solution to this problem has been proposed. We simply follow a rule introduced by many other scholars: the input length should contain several whole cycles of the raw temporal signal, and the number of hidden units in a layer should be no larger than that of the previous layer. In practice, this principle works well in our model.

Conclusions
The proposed method of fault diagnosis for rolling bearings based on stacked LSTM neural networks is novel and promising. It has three main advantages that traditional methods do not possess: (1) It removes the dependence on handcrafted features and advanced signal processing techniques, which are essential for traditional methods. (2) The learning process of LSTM is performed automatically on the raw temporal signal without any prior knowledge of signal types or inherent mechanisms. Thanks to the memory capability of LSTM, the correlation within the signal is further strengthened. (3) Based on the stacked architecture, features are extracted hierarchically, and the deeper structure gives the learning model more potential for mining inherent characteristics. All of the above demonstrate the efficiency and feasibility of the proposed method. However, the computation cost still has room for improvement, which will be the focus of our future work.

2.3. Hierarchical LSTM. With the rapid development of computer hardware and the introduction of a series of deep learning algorithms, deep architectures have shown a powerful capability for feature self-learning.

Figure 1: The internal structure of an LSTM unit.

4.4. Comparison with Other Methods. For comparison, four other methods, namely a 1-layer LSTM neural network, a backpropagation neural network (BP-NN), SVM, and CNN, are considered in our experiments.

Figure 3: The flow chart of the proposed method.

Figure 9(a) shows the influence of input size on the performance of the proposed method. The size of the input units is set to 32, 64, 128, 256, 512, 768, and 1024, respectively. It is