Multitask Convolutional Neural Network for Rolling Element Bearing Fault Identification

As one of the most vital parts of rotating equipment, it is an essential work to diagnose rolling bearing failure. )e traditional signal processing-based rolling bearing fault diagnosis algorithms rely on artificial feature extraction and expert knowledge. )e working condition of rolling bearings is complex and changeable, so the traditional algorithm is slightly lacking adaptability. )e damage degree also plays a crucial role in fault monitoring. Different damage degrees may take different remedial measures, but traditional fault-diagnosis algorithms roughly divide the damage degree into several categories, which do not correspond to the continuous value of the damage degree. To solve the abovementioned two problems, this paper proposes a fault-diagnosis algorithm based on “end-to-end” one-dimensional convolutional neural network. )e one-dimensional convolution kernel and the pooling layer are directly applied to the original time domain signal. Feature extraction and classifier are merged together, and the extracted features are used to judge the damage degree at the same time.)en, the generalization ability of the model is studied under a variety of conditions. Experiments show that the algorithm can achieve more than 99% accuracy and can accurately give the damage degree of the bearing. It has good performance under different speeds, different types of motors, and different sampling frequencies, and so it has good generalization ability.


Introduction
Rotating machinery and equipment is developing in the large-scale, refined, and systematic direction. Rolling bearing is one of the most important components of rotating machinery and equipment. e probability of mechanical equipment fault caused by the failure of rolling bearings is as high as 30%, so reliable condition monitoring of rolling bearings is critical for the normal operation of mechanical equipment [1].
It is difficult to directly monitoring the state of the bearing due to its small size and the concealed mounting position. Usually, the condition of bearing is diagnosed by analyzing the vibration signal [2][3][4][5]. Recently, the machine learning technology [6], especially deep learning [7], is rapidly developing. Many models of deep learning such as Stacked Auto Encoder (SAE), Deep Belief Network (DBN), and Convolutional Neural Network (CNN) have been widely applied to the diagnosis of rolling bearings [8]. Junbo et al. [9] used the stacking self-encoder to construct a deep neural network. e fault features are extracted by stacking the intermediate hidden layers by using the characteristics of the input and output of the encoder. Lu et al. [10] used a denoising self-encoder to improve the robustness of the model. e denoising input from the encoder randomly removed some of the input signals to improve the stability of the hidden layer extraction features. Gan et al. [11] used the Deep Belief Network to propose the Hierarchical Diagnosis Network (HDN). e method first uses DBN to determine the location of the bearing damage and then uses DBN to diagnose the size of the damage.
Convolutional Neural Networks (CNNs) have had many major breakthroughs in the field of image recognition [12,13]. In addition, they are widely used in the field of fault diagnosis [14,15]. Janssens et al. [16] used the shallow convolution neural network model to make discrete Fourier Transform of the original signal, transform the original timedomain vibration signal into the frequency domain, and use the extracted useful features as input to the one-dimensional convolutional neural network. Jia et al. [17] proposed a fourlayer deep neural network, which performs Fast Fourier Transform on 2400 data as the network input signal, then uses the self-encoder to perform unsupervised training layer by layer, and finally carries out supervised training. Wen et al. [18] proposed a convolution neural network model, which was based on LeNet. e original vibration signal was transformed into two-dimensional gray image as input of the model, and the fault diagnosis of the bearing and centrifugal pump was realized. Yuan et al. [19] constructed a two-dimensional CNN model and extracted the time-frequency map of the rolling bearing as the input of the twodimensional CNN. Wei et al. [20] proposed a deep onedimensional convolutional neural network model, using dropout technology and small batch training, to achieve accurate and stable diagnosis based on original vibration signals.
e abovementioned research studies have already achieved good diagnostic results, and even some models have a diagnostic accuracy of 99%. However, they only gave a judgment on the category of the fault, without further considering the extent of the fault. Cococcioni et al. [21] proposed a method, based on classification techniques, for automatic detection and diagnosis of defects of rolling element bearings containing different severities. at method proved the high sensitivity to different types of defects and to different degrees of fault severity. is method is applied to diagnosis bearing fault containing different severities of the same fault as belonging to the same class. e degree of bearing failure is an important indicator to guide equipment maintenance. In order to accurately estimate the degree of failure during fault diagnosis, a multitask convolutional neural network model is proposed in this paper. e proposed one-dimensional convolutional neural network model uses convolution kernels and pooling layers to directly act on the original time domain signal and trains the extraction features together with the classifier to achieve an "end-to-end" model. e model does not rely on artificial feature extraction and expert knowledge and maximizes the use of CNN for feature extraction. We improve on the basis of the model, get the fault degree information, and improve the generalization ability of the network. In the era of mechatronics big data, it is particularly important to improve the generalization ability of models. e proposed model conducts a more in-depth study of the generalization ability of the model. e remainder of this paper is organized as follows. Section 2 gives a brief introduction of CNN. Section 3 explains the proposed fault diagnosis method. Section 4 is concentrated on fault diagnosis experimental investigations. Finally, Section 5 gives the conclusion and future work.

A Brief Introduction of CNN
e CNN is a special structure of the feed-forward neural network, generally including a convolution layer, activation layer, pooling layer, and fully connected layer. Corresponding to the CNN model, the input signal is extracted by the convolution layer, the abstract features are obtained by the convolution and pooling operation of the input signal layer by layer, and finally, the classification or regression is performed through the fully connected layer. e two core ideas of the CNN network are local connection and convolution kernel parameter sharing. A key role of both is to reduce the number of parameters, making the operation simple, efficient, and able to operate on very large data sets. e local connection ensures that the filter, after learning, can maximize the response to local features, and the parameter sharing reduces the risk of overfitting to some extent.

Convolutional Layer.
e convolutional layer uses a convolution kernel to convolve the input signal or the local region of the previous layer output and generates a corresponding feature through a nonlinear activation function. It can acquire richer features with fewer parameters, compared to the fully connected layer. Due to its insensitivity to input dimensions and the ease of training of deep structures, it has become the most common method for complex high-dimensional input feature extraction. e mathematical model of convolutional layer can be described by the following equation: where i is i th convolution core; g(i) is the characteristic graphs obtained by learning the i th convolution kernel; a is input data; w is the weighting factor; b is bias of convolution kernels; and x, y, z are the dimensions of input data.

Activation Layer.
After the convolution operation, the activation function transforms the output of the convolution layer nonlinearly and maps the original linear and nonseparable multidimensional features into another space. e commonly used activation functions are the Sigmoid function, ReLU function, and Tanh function. e ReLU activation function is the most widely used in CNN among them. e mathematical model of the ReLU activation function can be described by the following equation: where g(i) is the output of the convolutional layer; y(i) is the activation value of g(i).

Pooling Layer.
e pooling layer carries out the downsampling operation, which includes maximum pooling and average pooling. e most common pooling method for CNN is maximum pooling. e maximum pooling layer can not only reduce the training parameters but also further extract the features to improve the robustness of the features. e mathematical model of maximum pooling can be described by the following equation: 2 Shock and Vibration where a l(i,t) is the t th neuron of the i th feature map of the l th layer; w is the width of the convolution kernel; and j is the j th pooling layer.

Fully Connected
Layer. e fully connected layer is used to classify the features obtained by the convolution kernel. e output of the last pooled layer is paved into a one-dimensional feature vector as the input of the fully connected layer. So, the input and output are fully connected, and the output layer uses the Softmax activation function. e mathematical model of the fully connected layer can be described by the following equation: where w l ij is the weight between the i th neuron in the l th layer and the j th neuron in the (l + 1) th layer; z l+1 j is the output of the j th neuron in the (l + 1) th layer; b l j is the bias of all neurons in the l th layer to the j th neuron in the (l + 1) th layer.

The Proposed Method
Although CNN has been applied in fault diagnosis, most of the models still need a lot of signal processing technology and rich engineering practice experience and do not make full use of the self-learning ability of convolutional neural networks. Limited by the number of feature extractions, the structure of the network is generally shallow, and the nonlinear features contained in the original signal often require a multilayer network to fully extract. Moreover, the proposed models cannot give a judgment on the degree of fault. e degree of failure is extremely important for the stable operation of the equipment. Different degrees of failure are likely to take different measures. e traditional rolling bearing fault diagnosis algorithm can only give the type of fault, but cannot predict the degree of fault. erefore, this paper proposes a multitask convolutional neural network model. e method proposed in this paper can not only accurately judge the type of fault but also predict the degree of fault, thus giving more accurate guidance.

e Structure of the Network.
e network structure proposed in this paper is shown in Figure 1, and it mainly consists of three parts: an input layer, feature extraction layer, and multitask layer. e input layer inputs the original vibration signal with a length of 1024 samples. e feature extraction layer contains a total of four convolution layers, and each one was followed by a ReLU activation function and a maximum pooling layer. e first convolutional layer has 16 convolution kernels, the second convolutional layer has 32 convolution kernels, the third convolutional layer has 64 convolution kernels, and the fourth convolutional layer has 128 convolution kernels. e kernel of the first convolution layer has a size of 64 * 1, and the kernel of the remaining convolution layers is 3 * 1. e purpose of the first convolution kernel is to obtain enough information from the original vibration signal to expand the receptive field. e network contains three fully connected layers. e last pooling layer is laid flat and connected to the first fully connected layer to obtain the features of the original vibration signal. e fully connected layer has 100 neurons, and it is shared by two tasks. e remaining two fully connected layers connect two tasks, respectively, and each one has 10 neurons. e multitasking layer uses the obtained features to perform classification tasks and regression tasks, respectively, to achieve fault type judgment and fault degree judgment. e purpose of joint training of two tasks is to be able to obtain more abundant features, so that the type of fault and the degree of fault can be predicted simultaneously.

Loss Function Design.
is paper has two tasks, one is the classification task, using the cross-entropy loss function and the other is the regression task, using the minimum mean square error (MSE) loss function. e sum of these two loss functions constitutes the overall loss function of the model. e cross entropy characterizes the distance between the actual output and the expected output, which is the smaller the value of the cross entropy is, the closer the two probability distributions are. e mean square error in mathematical statistics refers to the expected value of the square of the difference between the parameter estimate and the true value of the parameter. e smaller the value of MSE, the better the accuracy of the prediction model describing the experimental data.
Once the loss function is determined, the gradient descent algorithm can be used to update the parameters. e RMSProp optimization algorithm is used in this paper. RMSProp optimizer is suitable for processing nonstationary data, which can effectively prevent premature convergence in deep learning. e basic steps are as follows in Algorithm 1:

Experimental Data.
e data are obtained from the Case Western Reserve University Bearing Fault Database. ere are three kinds of defect locations in the data set: rolling element damage, outer ring damage, and inner ring damage. e damage diameters are 0.007 inch, 0.014 inch, and 0.021 inch. e original input signal is a one-dimensional time series signal. In order to meet the basic needs of deep learning a large number of samples, sample expansion is required. In this paper, the method of overlapping sample segmentation is used to realize sample expansion. Compared with nonoverlapping sample segmentation, the correlation between adjacent elements can be preserved as much as possible to improve the robustness of the model. e overlap rate is set to 0.3. e principle of sample expansion is shown in Figure 2. A total of 4 data sets were prepared for the experiment, as shown in Table 1. Data sets A, B, and C are data sets with rotational speeds of 1797 r/min, 1772 r/min, and 1750 r/min, respectively. We divide each data into two Shock and Vibration 3 parts, one including 1000 samples used as training data and another including 200 samples used as test data. It should be noted that since the model in this paper is multitasking, two labels are required, as shown in Table 1.

Multitask Model
Training. Train the built model with data sets A, B, C, and D. e initial learning rate of RMSProp is 0.001. e full connection layer uses the dropout technique to suppress overfitting, the dropout rate is 0.5, the small batch mini-batch size is 64, and the cutoff training round is set to 200. In order to verify the reliability of the training and test results, the network was run 10 times, and we calculate the average of the ten runs. For each time experiment, the 1000 samples in the training data are reused as the training data set, the test data set including 100 samples which are randomly extracted from the test data by sampling replacement. ose 100 samples are randomly selected from the test set for each run for testing to prevent the network from using the same test samples each time. e experimental results are shown in Table 2. It can be seen from the experimental results that the accuracy on each data set can almost reach 99% for the classification task. Performance on data sets A, B, and C is better than on data set D. For the regression task, the loss function value is the smallest on test set A, which can be reduced to 2.4563. e prediction result on test set A is shown in Figure 3. It can be seen that the model can relatively accurately determine the damage of the fault. Degree, at the same time, can also give the category judgment of the fault, achieving the purpose of multitasking.
In order to see if the model successfully extracted the feature, the first fully connected output is dimensioned using the t-SNE method to 2 dimensions. Figure 4 shows that useful features can be extracted from different data sets.
Require: global learning rate ε, Decay rate ρ Require: initial parameter θ Require: small constant δ, usually set to 10 − 6 Initialize accumulated variable r � 0 while failure to reach the stop criterion do Small batches containing m samples x (1) , . . . , x (m) from the training set, corresponding target is y (i) Calculating the gradient:   When detecting faults online, we should try to minimize errors. For a section of vibration signal, split the vibration signal into n segments, which can overlap, and put it into the network, so we can get n outputs. Voting methods are used for minimize classification errors, which means the category with the highest number of votes is selected as the result. One way to minimize the regression error is using the weighted average method. It weights the average of the results of the n-segment signals, and we get the final degree of fault of the given signal. e mathematical model can be described by the following equation: where y i is the output of the i th segment; w i is the weight of the i th segment.

Multitask Weight Selection.
e model in this paper optimizes the parameters of the network by multitask simultaneous training. e loss function of the whole convolutional neural network is obtained by adding the loss function of the two tasks, but the order of magnitude of the two tasks is different, which will lead to the bias of the training of network parameters. erefore, the selection of the weight of the loss function of the two tasks is one of the key factors affecting the diagnostic accuracy of the model. It can be seen from the training results that the loss of the classification task is smaller than the regression loss, and the weight of the regression loss function is relatively increased for training. e results of different weight ratios for data set A training are shown in Table 3 and Figure 5. It can be seen that when the weight ratio increases gradually, the classification accuracy will gradually increase and eventually stabilize, while the loss of regression tasks will gradually increase.
is paper considers that it is advisable when the weight ratio is 0.8 :1, because the classification accuracy tends to be stable and the loss function is at a turning point.

Multitask Learning Mode.
ere are many kinds of multitask learning modes in deep learning. e one used in this paper is multitask joint training and sharing the same  Train  100  100  100  100  100  100  100  100  100  100  Test  20  20  20  20  20  20  20  20  20  20   B  Train  100  100  100  100  100  100  100  100  100  100  Test  20  20  20  20  20  20  20  20  20     Shock and Vibration layer of hidden layers of the network, but only at the near end of the network to start bifurcation to do different tasks. Another mode is individual training, which trains one of the tasks separately, then fixes the structure and parameters of the network, and uses one layer of the hidden layer of the network as the input of another task to train another task. In order to compare the advantages and disadvantages of the two modes, the first fully connected layer is trained as the input of the second task. e experimental results of the two modes are compared as shown in Table 4.
It can be seen that individual training has better classification accuracy, but it does not perform well in the regression task, while the two tasks joint training mode is smaller in regression loss, which shows that this network structure can extract the features of fault category and fault degree more effectively. Compared with the two tasks, it is more capable of taking into account another task. It can be seen that the network proposed in this paper can extract more features. erefore, the proposed network can extract more abundant features.

Generalization Performance Experiment.
For today's electromechanical big data era, the optimal model must not only be able to adapt to different speeds but also apply to a variety of different conditions. is paper will study the generalization ability of the classification task model on different motors and different sampling frequencies.
e fault diagnosis test platform for rotating machinery is shown in Figure 6. e fault types also include rolling element damage, outer ring damage, and inner ring damage. Vibration signals at 1450 r/min, 1400 r/min, 1350 r/min, and 1300 r/min were measured and each includes 900 samples. e CWRU data is measured at 12 kHz and 24 kHz and each includes 3600 samples. In order to study the effect of  Shock and Vibration different sampling frequencies on the diagnostic results, we collected data of 2560 Hz and 5120 Hz on the fault diagnosis test platform and each also includes 3600 samples. e CWRU data set and the collected data are jointly trained and tested. e samples were divided into training sets and test sets in a 5 : 1 ratio. So, the total data set includes data from different speeds, different types of motors, and different sampling frequencies, as shown in Figure 7. e CWRU data set includes three different damage degrees. However, the fault diagnosis test platform for rotating machinery only has one damage degree, which is not similar to any of the CWRU data sets. e diagnosis of the damage degree is limited because the fault diagnosis test platform only has one damage degree and it even holds half of the data. e structure of the network is modified, the task of damage degree diagnosis in multitasking is removed, leaving only fault category diagnosis, and then the network is trained. e training result is shown in Figure 8. e accuracy of the training set of the model is up to 99.07%, and the loss function value is 0.0059. e accuracy rate on the test set is up to 98.12%, and the loss function value is 0.0120.
In order to verify the model's adaptive feature learning ability for the original input signal, the t-SNE method is used to perform dimensionality reduction visual analysis on the features learned by each convolutional layer. e partial convolutional layer output visualization is shown in Figure 9. e original data is not clearly distinguished after the first convolutional layer, but as the depth of the network increases, the feature separation of the convolutional layer extraction becomes higher and higher, and the classification is getting

Conclusions
e multitask convolutional neural network proposed in this paper can extract effective fault features from the original signal, realize accurate diagnosis of fault types, and give the fault degree. e advantages and disadvantages of the models under different modes are compared, and it is proved that the optimal results can be achieved when using multitask joint training. e proposed method has good generalization ability and can maintain a high fault recognition rate even if the speed, the type of motor, and the sampling frequency are changed. e research in this paper is mainly aimed at the analysis of the current state of the real time input signal, and the fault prediction is also important in rotating machinery. It is more meaningful to predict the state after the current information. e following research can be extended to the fault type and degree prediction.

Data Availability
e CWRU data used to support the findings of this study are obtained from the Case Western Reserve University Bearing Fault Database and have been deposited in the following website: http://csegroups.case.edu/bearingdatacenter/home.

Conflicts of Interest
e authors declare that they have no conflicts of interest. Shock and Vibration 9