Machine Learning Based Bearing Fault Diagnosis Using the Case Western Reserve University Data: A Review

Rolling bearings are among the most critical components of rotating machinery. Detecting bearing faults in time prevents failures from affecting the operation of the entire machine. Data-driven bearing fault diagnosis has recently become a research hotspot, and such research typically begins with the acquisition of vibration signals. Many public rolling bearing datasets exist; among them, the most widely used is the Case Western Reserve University (CWRU) bearing dataset. Starting from the CWRU dataset, this paper compares and analyzes basic methods of machine learning based rolling bearing fault diagnosis and summarizes the characteristics of the dataset. First, we give a comprehensive introduction to CWRU and summarize the results achieved with it. Next, we summarize the basic methods and principles of machine learning based rolling bearing fault diagnosis. Finally, we conduct experiments and analyze the results. This paper offers guidance for future use of the CWRU dataset in machine learning based rolling bearing fault diagnosis.


I. INTRODUCTION
Rolling bearings are among the most important parts of rotating machinery, and with industrial development most rotating machinery carries a heavy workload. Under high load, strong impact, and complex environments, rolling bearings often develop faults in the inner race, outer race, or balls. If a fault is not found in time, the equipment may shut down, causing large economic losses and even safety accidents. Fault diagnosis and prediction is the core of prognostics and health management (PHM). The main purpose of PHM in rotating machinery is to reduce use and support costs and to improve the safety and integrity of the machinery, thereby achieving condition-based maintenance with less maintenance investment [1]. Fault diagnosis methods can be divided into three types: model-based methods, statistical reliability-based methods, and data-driven methods [2]. Model-based fault diagnosis presupposes a mathematical model of the target system. Such methods can capture the essential behavior of the system and enable fault prediction, but for complex dynamic systems it is difficult to establish a model with high reliability, which greatly restricts this line of work. The information needed by statistical reliability-based techniques is found in various probability density functions (PDFs); the bathtub curve is a typical example. Data-driven methods achieve fault diagnosis by processing the data collected by sensors and combining it with feature engineering, machine learning, or deep learning [3]. Data-driven rolling bearing fault diagnosis is the focus of current research, and its core issue is obtaining data. The state of a rolling bearing and its vibration signal are closely coupled: when a rolling bearing fails, the failure is often accompanied by an impulse signal [60].
Sensors are installed at different positions of the rolling bearing to collect vibration signals. The state of the bearing can be judged by directly observing these signals.
Machine learning extracts knowledge from data: it learns from given samples and automatically builds a decision rule. As shown in Figure 1, the basic machine learning process can usually be divided into three parts: feature extraction, feature selection, and classification. In the first step, multi-domain features are extracted by processing the vibration signal. In the second step, the dimensionality of the feature set is reduced according to different criteria, extracting the feature subset with the best discriminability and the fewest features, so as to avoid the curse of dimensionality, improve classification accuracy, and reduce classification time. The last step is the classifier stage, in which the feature subset is input into a classifier to label each sample [4][5]. This paper starts from the CWRU dataset, compares and analyzes basic methods of machine learning based rolling bearing fault diagnosis, and summarizes the characteristics of the dataset. The introduction is in Section I. Section II surveys rolling bearing datasets and discusses the characteristics of the CWRU dataset.
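The three-step process just described can be sketched as a scikit-learn pipeline. This is a minimal illustration on synthetic signals; the toy feature extractor, the SelectKBest selector, and the SVC classifier stand in for the specific methods surveyed in later sections.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def extract_features(signals):
    """Step 1: toy multi-domain features per signal segment."""
    feats = []
    for s in signals:
        rms = np.sqrt(np.mean(s ** 2))
        feats.append([s.mean(), rms, s.std(), s.max() - s.min(), np.abs(s).mean()])
    return np.array(feats)

rng = np.random.default_rng(0)
# Synthetic stand-in for vibration segments: 2 classes, 100 signals of 256 points.
signals = np.vstack([rng.normal(0, 1, (50, 256)), rng.normal(0, 3, (50, 256))])
labels = np.array([0] * 50 + [1] * 50)

X = extract_features(signals)                       # step 1: feature extraction
Xtr, Xte, ytr, yte = train_test_split(X, labels, random_state=0, stratify=labels)

model = Pipeline([
    ("select", SelectKBest(f_classif, k=3)),        # step 2: feature selection
    ("clf", SVC(kernel="rbf")),                     # step 3: classification
])
model.fit(Xtr, ytr)
acc = model.score(Xte, yte)
print(round(acc, 2))
```

Here the two synthetic classes differ only in vibration amplitude, so amplitude-sensitive features such as RMS carry most of the discriminative information.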
Section III presents a review of feature extraction. Section IV summarizes feature selection. Section V summarizes the fault diagnosis classifier. Section VI is a brief review of deep learning, and discusses the existing contributions of CNN and RNN models in the field of faults. Section VII is an experiment, which compares and analyzes the algorithms mentioned in the previous parts from multiple dimensions, and discusses the results. Section VIII is the conclusion.

II. REVIEW OF DATASET
The characteristics of the mainstream public rolling bearing datasets are shown in Table 1. The test bench of the CWRU dataset, shown in Figure 2, consists of a 2-horsepower electric motor, a torque sensor, and a power dynamometer. Accelerometers are installed on the housings of the drive end and the fan end to collect vibration signals. The test bench records normal baseline data, drive-end fault data, and fan-end fault data. The sampling frequency of the drive-end fault data is 12,000 sps or 48,000 sps, while the normal baseline data and fan-end fault data are both sampled at 12,000 sps. The dataset therefore contains four categories of data. Each category consists of inner race faults, ball faults, and outer race faults under different loads and fault diameters. The faults are produced by electro-discharge machining, a form of artificial damage. The fault diameters are 0.007, 0.014, 0.021, and 0.028 inches. SKF bearings are used for the 0.007, 0.014, and 0.021 inch faults, and NTN bearings for the 0.028 inch faults; at 0.028 inches, only inner race and ball faults of the drive end at 12,000 sps are recorded. The bearing loads are 0, 1, 2, and 3 HP, each corresponding to a different speed. Outer race faults are divided into 6 o'clock, 3 o'clock, and 12 o'clock positions according to the location of the fault point. Each fault record is saved in a .mat file containing the drive-end and fan-end vibration data as well as the shaft speed: DE denotes drive-end data, FE denotes fan-end data, and RPM denotes speed.
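In practice, each CWRU record is a MATLAB .mat file whose variables follow a naming pattern like X097_DE_time, X097_FE_time, and X097RPM (the numeric prefix is the file ID; these exact key names are an assumption based on common usage of the dataset, not stated above). A minimal loading sketch with SciPy, using a synthetic file in place of a real download:

```python
import os
import tempfile
import numpy as np
from scipy.io import savemat, loadmat

# Build a stand-in for one CWRU record (real files come from the CWRU bearing
# data center; the X<ID>_DE_time / _FE_time / RPM key names are assumed).
fake = {
    "X097_DE_time": np.random.randn(12000, 1),  # drive-end accelerometer (DE)
    "X097_FE_time": np.random.randn(12000, 1),  # fan-end accelerometer (FE)
    "X097RPM": np.array([[1797]]),              # shaft speed (RPM)
}
path = os.path.join(tempfile.gettempdir(), "97.mat")
savemat(path, fake)

data = loadmat(path)
de = data["X097_DE_time"].ravel()   # drive-end signal as a 1-D array
fe = data["X097_FE_time"].ravel()
rpm = int(data["X097RPM"].squeeze())
print(de.shape, rpm)
```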
Wade et al. [16] proposed a CWRU-based benchmark based on three established rolling bearing fault diagnosis methods, through which new bearing fault diagnosis algorithms can be tested. Li et al. [17] benchmarked the CWRU dataset by using various entropy and classifiers, and provided an evaluation method for subsequent new classification methods. Neupane et al. [18] discussed deep learning based bearing fault diagnosis and summarized the latest research results of deep learning based on the CWRU dataset.

VOLUME XX, 2017
In addition, for bearing fault diagnosis, the main purpose of data preprocessing is to address data imbalance and small-sample problems: the ratio of fault data to healthy data is unbalanced, the data of different fault types are unbalanced, and the sample size of each type is small. Wei et al. [19] proposed an oversampling method named SCOTE. SCOTE transforms the multi-class data balance problem into multiple two-class imbalance problems and combines with multi-class LS-SVM to form a new model that addresses rolling bearing fault data imbalance. Ding et al. [20] combined a probabilistic mixture model (PMM) and Markov chain Monte Carlo (MCMC) to expand the dataset, and used a semi-supervised ladder network (SSLN) to address the scarcity of labeled samples.

III. REVIEW OF FEATURE EXTRACTION
Features are essential for machine learning [59]. The current mainstream approach is to extract multi-domain features from the bearing vibration signal to form a multi-domain feature set, assembling as many features as possible from the time domain, frequency domain, and entropy domain.
The variable in the time domain is t, which is used to observe how the vibration signal changes over time. Commonly used time-domain features are the mean, root mean square (RMS), kurtosis, peak-to-peak value, variance (Var), standard deviation (Std), shape factor, crest factor, impulse factor, and margin factor; their calculation formulas are shown in Table 2 [21][22]. Commonly used frequency-domain features are shown in Table 3 [23][24]. Entropy characterizes the uncertainty of a system or signal. The power spectrum describes the distribution of the vibration signal's power in the frequency domain. Singular spectrum entropy is calculated by performing singular value decomposition on the vibration signal and captures its local features; it is a time-domain fault feature. Wavelet energy entropy is a time-frequency-domain fault feature. Bispectral entropy describes faults through the distortion decomposition of the vibration signal in the frequency domain.
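The time-domain features listed above can be computed directly with NumPy. This is a sketch; the definitions of the dimensionless factors (shape, crest, impulse, margin) follow the usual conventions and are an assumption, since Table 2 is not reproduced here.

```python
import numpy as np

def time_domain_features(x):
    """Common time-domain features of one vibration segment x.
    Dimensionless factor definitions follow the usual conventions."""
    x = np.asarray(x, dtype=float)
    abs_mean = np.mean(np.abs(x))
    rms = np.sqrt(np.mean(x ** 2))
    peak = np.max(np.abs(x))
    mu, sigma = x.mean(), x.std()
    return {
        "mean": mu,
        "rms": rms,
        "kurtosis": np.mean((x - mu) ** 4) / sigma ** 4,
        "peak_to_peak": x.max() - x.min(),
        "variance": sigma ** 2,
        "std": sigma,
        "shape_factor": rms / abs_mean,
        "crest_factor": peak / rms,
        "impulse_factor": peak / abs_mean,
        "margin_factor": peak / np.mean(np.sqrt(np.abs(x))) ** 2,
    }

# Sanity check on a pure sine: crest factor ≈ sqrt(2), kurtosis ≈ 1.5.
feats = time_domain_features(np.sin(np.linspace(0, 20 * np.pi, 2048)))
print(round(feats["crest_factor"], 2))
```

Kurtosis and crest factor are particularly popular for bearing diagnosis because the impulses produced by a localized fault raise both values well above their levels for a smooth sinusoidal or Gaussian signal.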
In addition, Zhao et al. [25] addressed the insufficient accuracy of multi-scale entropy by improving its scale factor; the resulting features provide more precise feature vectors for classifiers and thereby improve diagnostic accuracy. Zhu et al. [26] calculated the hierarchical entropy of the vibration signal and used it as the feature vector input to a classifier combining particle swarm optimization (PSO) and SVM, outperforming methods that use multi-scale entropy as the feature vector. Nayana et al. [27] extracted 12 time-domain statistical features and 6 time-dependent spectral features (TDSFs); a feature selection algorithm combining wheel-based differential evolution (WBDE) and PSO processed the initial feature set, and the final subset contained most of the TDSFs. Mezni et al. [28] proposed a fault diagnosis method for identifying the degree of failure of rolling element faults: EMD preprocesses the vibration signal, KLD further processes the IMFs to form a feature vector, and three classifiers, DAG-SVM, KNN, and decision tree (DT), are compared for verification. Wang et al. [29] used ensemble empirical mode decomposition (EEMD) to preprocess the vibration signal and calculated the hierarchical entropy of each sample, with an improved CS-SVM as the model's classifier. In the field of bearing fault diagnosis, many papers on feature selection and feature extraction have used the CWRU dataset. Tang et al. [30] proposed a feature selection model, GL-mRMR-SVM, which uses maximum relevance and minimum redundancy as the selection criterion, takes global frequency- and time-domain features together with local features extracted by an RNN as the original feature set, and uses SVM as the final classifier. Li et al.
[31] proposed a new envelope processing method named ICIE by improving CIE, combined ICIE with local mean decomposition (LMD) to form the ICIE-LMD model, and thereby provided a new idea for rolling bearing feature extraction. Sun et al. [32] combined PCA with a BP neural network to propose a new fault diagnosis model: PCA reduces the dimensionality of a multi-source feature set composed of time-domain, frequency-domain, and entropy features, and the resulting feature subset is input into the BP neural network for fault diagnosis.

V. REVIEW OF CLASSIFIER
Classifiers are the application of machine learning algorithms to classification. Classic supervised classifiers include K-nearest neighbor (KNN), the naive Bayes classifier, the support vector machine (SVM), grey relational degree (GRD), and the decision tree. Classic unsupervised approaches include clustering methods such as k-means, DBSCAN, and agglomerative clustering. The parameters of a classifier are tuned for different problems to improve its generalization ability.

A. Support Vector Machine (SVM)
SVM is a linear classifier that solves binary classification problems. It classifies data by finding the maximum-margin hyperplane, i.e., a hyperplane of the form w^T x + b = 0, as shown in Figure 4. The goal of SVM is to make the distance between the support vectors and the hyperplane as large as possible while classifying the training samples correctly; finding this maximum-margin plane leads to the optimization problem in formula 2.
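The maximum-margin idea can be illustrated with scikit-learn's SVC on two synthetic clusters standing in for two bearing states. This is a generic sketch, not the optimized SVM variants reviewed below.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two linearly separable clusters standing in for two bearing states.
X = np.vstack([rng.normal(-2, 0.5, (40, 2)), rng.normal(2, 0.5, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The separating hyperplane is w^T x + b = 0; the support vectors are the
# training points closest to it, and SVM maximizes the geometric margin.
w, b = clf.coef_[0], clf.intercept_[0]
margin = 2 / np.linalg.norm(w)

preds = clf.predict([[-2, -2], [2, 2]])
print(preds, round(margin, 2))
```

For non-linearly-separable bearing features, a kernel (e.g. RBF) replaces the linear inner product, which is why kernel and penalty parameters are the usual targets of the PSO-based tuning mentioned below.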
Many papers use the CWRU dataset to study the application of SVM to rolling bearing fault diagnosis. Wei et al. [33] proposed an oversampling method named SCOTE and used SVM as the verification classifier; SCOTE transforms the multi-class data balance problem into multiple two-class imbalance problems and combines with multi-class LS-SVM to form a new model for rolling bearing fault data imbalance. Shao et al. [34] used the wavelet packet transform (WPT) to preprocess the data, obtained the energy distribution of the signal, extracted features to form a feature vector, and used the improved particle swarm optimization (IPSO) proposed in the paper to optimize the SVM parameters. Mao et al. [35] proposed a new distance-based feature selection method that introduces a group identification matrix to obtain the coefficient of each feature, with SVM as the verification classifier.

B. K-nearest Neighbor (KNN)
KNN is a supervised algorithm for multi-class classification and requires the labels of existing samples in advance. For an unknown sample, we calculate its distance to all existing samples, select the k nearest samples, and then judge the class of the unknown sample by majority vote among those k samples. The basic principle is shown in Figure 5. The first key point of KNN is to quantify the features of the training set: since distances between existing samples and the unknown sample are computed, every feature of each sample must be expressed numerically. The second key point is data normalization: the value range of each feature directly affects the distance computation, so each feature must be normalized to a common range. The third key point is the choice of distance function; common choices include the Euclidean, cosine, Hamming, and Manhattan distances. The most widely used is the Euclidean distance, shown in formula 5.
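The three key points above (numeric features, normalization, a distance function) can be sketched in a few lines of NumPy; the toy feature values below are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples,
    using Euclidean distance after min-max normalizing all features."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    Xn = (X_train - lo) / (hi - lo)                 # key point 2: normalization
    xn = (x - lo) / (hi - lo)
    dists = np.sqrt(((Xn - xn) ** 2).sum(axis=1))   # key point 3: Euclidean distance
    nearest = y_train[np.argsort(dists)[:k]]
    return np.bincount(nearest).argmax()            # majority vote

# Key point 1: features already quantified as numbers (e.g. RMS, kurtosis).
X_train = np.array([[0.1, 3.0], [0.2, 3.1], [0.9, 7.8], [1.0, 8.1], [0.95, 8.0]])
y_train = np.array([0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([0.15, 3.05]))
print(pred)
```

Without the normalization step, the second feature (range roughly 3 to 8) would dominate the distance and could mask differences in the first feature, which is exactly the problem normalization avoids.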
Many papers use the CWRU dataset to study the application of KNN in bearing fault diagnosis. Wen et al. [36] used singular values as the input of the model, and combined graph theory with SVD, and proposed a new method of graph modeling. Validation with KNN classifier proves the effectiveness of this method in early fault diagnosis. Wang et al. [37] proposed a weighted KNN fault diagnosis method named WKNN, and used the ReliefF feature selection algorithm to process the feature subset formed by the multidomain feature set as the input of WKNN, which solved the poor generalization ability of the classifier under variable working conditions.
Ensemble learning combines multiple base classifiers to complete one classification task, thereby improving overall generalization ability. Boosting is a serial ensemble method: weak classifiers are trained in turn and then weighted to form a strong classifier. Bagging reduces the variance of classification by averaging the results of multiple classifiers [38][39].
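Both schemes can be sketched with scikit-learn on a synthetic feature set. The base-classifier choices here (default decision stumps for AdaBoost, a linear SVM for Bagging) are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a bearing feature set: 200 samples, 10 features.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           random_state=0)

# Boosting: weak learners trained sequentially, each reweighting the samples
# the previous ones misclassified, then combined into a strong classifier.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)

# Bagging: base classifiers (here a linear SVM) trained on bootstrap
# resamples; averaging their predictions reduces variance.
bag = BaggingClassifier(SVC(kernel="linear"), n_estimators=10, random_state=0)

acc_boost = cross_val_score(boost, X, y, cv=3).mean()
acc_bag = cross_val_score(bag, X, y, cv=3).mean()
print(round(acc_boost, 2), round(acc_bag, 2))
```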
Convolutional neural networks (CNNs) are broadly similar to other neural networks in that they are built from stacked layers, but CNNs additionally contain convolutional and pooling layers. Figure 6 shows a simple CNN model. The input of a convolutional layer is an input feature map and its output is an output feature map. The convolutional layer performs convolution operations, similar to filtering in image processing, and with suitable padding can keep the shape and size of the data unchanged. The pooling layer compresses the data along the height and width directions. With the popularization of CNNs, more and more papers have improved rolling bearing fault diagnosis accuracy by improving CNNs. Feng et al. [51] proposed a CNN-based sigmoid multi-task fault diagnosis model; unlike other CNN-based bearing fault diagnosis methods, this model can identify both the type and the extent of the fault, reaching an average accuracy of 96% on the CWRU dataset. Wang et al. [52] combined singular value decomposition with a one-dimensional convolutional neural network (SVD-1DCNN) to obtain a model with good fault diagnosis and classification performance. Another deep learning based approach first uses the continuous wavelet transform to convert vibration signals into time-frequency images, then uses a CNN to extract image features that are input to a gcForest classifier. Yang et al. [56] proposed a 1D-DCNN model that automatically extracts features; the softmax and t-SNE algorithms are used to verify the feasibility of the method.

FIGURE 6. CNN
Recurrent neural networks (RNNs) link past data with current data to achieve "memory persistence". Unlike convolutional neural networks, an RNN has a recurrent structure that allows information to be retained. Parameter settings also differ: in a feedforward network the parameters of each layer are different, whereas an RNN shares the same parameters across all time steps. A special RNN architecture is the long short-term memory network (LSTM). Many papers also apply RNNs to bearing fault diagnosis. Shenfield et al. [57] combined CNN and RNN to propose a model named RNN-WDCNN, which addresses domain adaptation and high-frequency noise under real working conditions. Khorram et al. [58] combined an end-to-end CNN with LSTM to propose a new recurrent network named CRNN; using CWRU data as input, it identifies the bearing fault type with the highest accuracy in a short time.

VII. EXPERIMENTS DATA AND DISCUSSION
This experiment uses three parts of the CWRU dataset: the 12k drive-end data, the 48k drive-end data, and the fan-end data. Machine learning classifiers and deep learning networks are applied to the CWRU data, and the results are compared and analyzed.

A. Data Description
Three datasets are composed from the 12k drive-end data, the 48k drive-end data, and the fan-end data. Each dataset consists of the healthy data under 0 load and the inner race, ball, and outer race (6 o'clock) fault data with fault diameters of 0.007, 0.014, and 0.021 inches. Each dataset therefore contains 10 classes.

B. Experiment 1: The influence of feature selection
This experiment uses the ReliefF feature selection algorithm, a feature weighting algorithm based on Euclidean distance.
The dataset is processed to form a sample set containing 500 samples: each class consists of 50 samples (40 for training and 10 for testing), and each sample contains 2048 sampling points. A multi-domain feature set containing 27 features is extracted from the four dimensions of the time domain, frequency domain, entropy, and energy of the rolling bearing vibration signal, as shown in Table 4. The marginal spectrum energy features are the marginal spectrum energy entropies of the top six IMFs obtained by decomposing the vibration signal with EMD. The three CWRU feature sets are used as inputs to the ReliefF algorithm, and the weight scores of the features of the three sets are shown in Figures 7, 8, and 9. The criterion is the feature weight score, and the sensitivity of each CWRU dataset to the features differs. For the 12k drive dataset, the frequency-domain features score relatively high and the energy features relatively low; the top 11 features are T2, T5, T7, T8, T9, T10, F1, F2, F3, F4, and H2. For the 48k drive dataset, the time-domain features score relatively high; the top 11 features are T3, T4, T5, T7, T10, F3, F4, F5, H1, H2, and E6. For the 12k fan dataset, the energy features score relatively high and the frequency-domain features relatively low; the top 11 features are T1, T5, T6, H3, H4, H5, H6, E1, E2, E3, and E4. The figures also show features with a weight score of 0, indicating that these features are irrelevant. Feature selection is used to remove irrelevant features and reduce the dimensionality of the feature set, thereby avoiding the curse of dimensionality.
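The sample construction described above (50 samples of 2048 points per class, split 40/10) can be sketched as a simple windowing of one vibration record; the non-overlapping windowing and the synthetic record length are assumptions.

```python
import numpy as np

def segment_signal(signal, n_samples=50, length=2048):
    """Cut a long vibration record into n_samples non-overlapping windows of
    `length` points each (one CWRU record is long enough at 12k sps)."""
    assert len(signal) >= n_samples * length, "record too short"
    return signal[: n_samples * length].reshape(n_samples, length)

rng = np.random.default_rng(0)
record = rng.normal(size=121265)            # stand-in for one CWRU record
samples = segment_signal(record)
train, test = samples[:40], samples[40:]    # 40 training / 10 test per class
print(samples.shape, train.shape, test.shape)
```

Each 2048-point window would then be reduced to the 27-dimensional feature vector of Table 4 before ReliefF weighting.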

C. Experiment 2: Performance of fault diagnosis with different machine learning classifiers
The classifiers selected in this experiment are SVM, KNN, grey relational degree (GRD), AdaBoost, Bagging, and GBDT. GRD compares the dynamic changes of rolling bearings: the closeness between a test sample and each fault type is judged by the geometric similarity between the different state data rows of the bearing. The base classifiers of Boosting are set to SVM and DT, forming the AdaBoost and GBDT classifiers respectively, and the base classifier of Bagging is set to SVM. The top 11 features by weight score in each feature set form a fault feature subset, and the full feature set and the selected feature subset are each used as classifier inputs. The classification accuracy of each classifier is shown in Figure 10.

FIGURE 10. THE ACCURACY OF FEATURE SUBSETS AND FEATURE SETS
As shown in Figure 10, machine learning based fault diagnosis classifiers perform well overall on the three CWRU datasets. The SVM classifier achieves the best results, possibly because rolling bearing fault diagnosis is a small-sample problem and SVM is well suited to small-sample classification. AdaBoost and Bagging with SVM as the base classifier also classify faults very well, which further indicates SVM's effectiveness on small-sample classification problems.
In Figure 10, the blue, yellow, and green bars indicate diagnoses that include the ReliefF step. ReliefF improves the classification accuracy of the classifiers, showing that it removes irrelevant features from the feature set.

D. Experiment 3: Performance of fault diagnosis with deep learning
In addition, this article compares the performance of simple deep learning models and machine learning classifiers on the CWRU dataset, using two basic models: LSTM and CNN.
The first convolutional layer has a window size of 64×1 and a stride of 16×1, followed by a pooling layer with a window size of 2×1 and a scale factor of 2. These are followed by several convolutional layers with kernel size 3×1 and stride 1×1, each followed by a pooling layer with window size 2×1 and scale factor 2; the number of these convolutional and pooling layers is determined by a subsequent optimization process. Next come several fully connected layers with 32 neurons each, whose number is also determined by optimization. Finally, a fully connected output layer uses the softmax activation function.
The output dimension of the LSTM model is set to 32; the recurrent step uses the sigmoid function, and the activation function is tanh.
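The architectures described above can be sketched in PyTorch. Layer counts are fixed here for brevity (the paper tunes them by optimization), and the framework choice and the channel counts (16 and 32) are assumptions not stated above.

```python
import torch
import torch.nn as nn

class BearingCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            # first conv: 64x1 window, stride 16x1 (channel counts assumed)
            nn.Conv1d(1, 16, kernel_size=64, stride=16), nn.ReLU(),
            nn.MaxPool1d(2),                       # 2x1 window, scale factor 2
            nn.Conv1d(16, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Flatten(),
            nn.Linear(32 * 30, 32), nn.ReLU(),     # fully connected, 32 neurons
            nn.Linear(32, n_classes),              # softmax applied via the loss
        )

    def forward(self, x):                          # x: (batch, 1, 2048)
        return self.net(x)

class BearingLSTM(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        # output dimension 32; sigmoid gates and tanh activation are the
        # PyTorch LSTM defaults, matching the description above
        self.lstm = nn.LSTM(input_size=1, hidden_size=32, batch_first=True)
        self.fc = nn.Linear(32, n_classes)

    def forward(self, x):                          # x: (batch, 2048, 1)
        _, (h, _) = self.lstm(x)
        return self.fc(h[-1])

x = torch.randn(4, 1, 2048)                        # 4 samples of 2048 points
cnn_out = BearingCNN()(x)
lstm_out = BearingLSTM()(x.transpose(1, 2))
print(cnn_out.shape, lstm_out.shape)
```

Processing all 2048 time steps one by one in the LSTM also illustrates why its running time far exceeds the CNN's, as discussed below.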

FIGURE 11. THE ACCURACY OF CNN AND LSTM
The accuracy on the three datasets under CNN and LSTM is shown in Figure 11, and the running times are shown in Table 5. Both the CNN and LSTM models achieve good overall fault diagnosis accuracy. The CNN model's accuracy is close to 100% on two of the datasets, higher than that of the LSTM model, and not much different from the machine learning classifiers. The advantage of deep learning models in fault diagnosis is that they require no manual feature extraction or feature selection; the disadvantage is their long running time, especially for the LSTM model.

VIII. CONCLUSION
This paper first surveyed the current mainstream public rolling bearing datasets, then summarized the contributions the CWRU dataset has made to machine learning based fault diagnosis, and finally carried out experiments on three CWRU datasets and discussed the results. The conclusions are as follows:
• The mainstream rolling bearing fault diagnosis datasets use two fault generation methods: artificial damage and accelerated lifetime tests. Among them, the CWRU dataset is the most widely used in fault diagnosis. Its advantage is that the data is simple and can be used for initial verification of a fault diagnosis model.
• Machine learning based bearing fault diagnosis can be divided into three steps: feature extraction, feature selection, and classifier recognition. Feature extraction derives the sensitive features of the vibration signal in different domains to form a feature set with strong discriminability. Feature selection algorithms can be divided into different categories according to their selection criteria or their combination with the learner.
• The sample data of the CWRU dataset is evenly distributed and its fault types are all single faults, whereas in actual working conditions the data is often unevenly distributed and most fault types are compound faults. Therefore, the CWRU dataset is only suitable for preliminary verification of fault diagnosis algorithms; further verification requires a full life cycle dataset.