A Fault Diagnostic Scheme Based on Capsule Network for Rolling Bearing under Different Rotational Speeds.

Deep learning-based intelligent fault diagnosis methods have attracted increasing attention for their automatic feature extraction ability. However, existing works are usually under the assumption that the training and test dataset share similar distributions, which unfortunately always violates real practice due to the variety of working conditions. In this paper, an end-to-end scheme of joint use of two-direction signals and capsule network (CN) is proposed for fault diagnosis of rolling bearing. With the help of the superior ability of CN in capturing the spatial position information between features, more valuable information can be mined. Aiming to eliminate the influence of different rotational speeds, vertical and horizontal vibration signals are fused as the input to CN, so that invariant features can be extracted automatically from the raw signals. The effectiveness of the proposed method is verified by experimental data of rolling bearing under different rotational speeds and compared with a deep convolutional neural network (DCNN). The results demonstrate that the proposed scheme is able to recognize the fault types of rolling bearing under scenarios of different rotational speeds.


Introduction
Rolling bearing is one of the key components of rotating machines. However, due to the hostile working conditions, the rolling bearing is prone to failure and may cause the breakdown of the machines, or even catastrophic consequences. Therefore, effective fault diagnosis plays an indispensable role to ensure the safety of machinery and reduce unnecessary cost. With the rapid development of fault diagnostic technologies, a large number of effective methods have been proposed in the past few years, including temperature, oil-debris analysis, vibration and acoustic emission, etc. [1]. Among them, vibration analysis is a widely used method for rotating machinery due to its exceptional adaptability and easy-to-get signal. Many methods have been proposed based on signal processing techniques, data-driven techniques and the combination of both [2][3][4][5].
Traditional intelligent methods for fault diagnosis, such as support vector machine (SVM) [6], back propagation neural network (BPNN) [7], and random forest [8], require manual intervention and expert the application of the proposed scheme on rolling bearing is presented in detail. Then the performance is discussed. Section 5 concludes the paper.

Capsule Neural Network
Traditional convolutional neural network (CNN), called LeNet, was first proposed by Lecun in 1998 and showed excellent performance on image recognition [23]. Then, some improved CNNs with deeper layers were consequently designed and achieved satisfactory results on image recognition, including AlexNet, VGGNet, GoogLeNet, ResNet [24,25]. Although these networks may obtain high accuracy for the classification of images, the spatial information in image cannot be perfectly captured, in other words, the changes of positions or angles of objects in an image can lead to an unsatisfactory classification result. To this end, capsule network with a so-called dynamic routing is proposed to solve this very problem in CNN.
A general CNN is mainly composed of convolutional layers, pooling layers, fully connected layer and activation function. The feature of input is first extracted by convolutional layers and pooling layers. Then, a fully connected layer is applied for classification [26]. The application of convolution and pooling operation is the key steps to contribute to a deeper layer of CNN, which reduces the number of trainable parameters by parameter sharing. The pooling layer can reduce the size of data because the features of the image can stay unchanged after pooling. However, Hinton [27] indicates that the usage of max-pooling may also lead to a deficiency that the location relationship of features may be ignored. Focused on this deficiency, Hinton and his assistant proposed a novel neural network in 2017, i.e., capsule network (CN). This network integrates the advantages of CNN in non-linear feature extraction and retains the positional information by using the module length of vector to describe the probability of the predicted result [27].

The Architecture of Capsule Network
A simple CN architecture for image recognition is shown in Figures 1 and 2. It includes four parts, convolution, primary capsule, digit capsule and decoder. Each part is briefly introduced in this section. As an example, the input size is set as 28 × 28 and the number for classification is 10 [27].  (1) Convolution layer The first layer of CN is a traditional convolution layer with "ReLU" activation function, which is applied to extract the preliminary feature. After the initial convolutional operation with kernel size of 9 × 9 and number of 256, the extracted features with size of 20 × 20 × 256 are then used as inputs to the primary capsules. In addition, pooling operation is not appended after convolution to avoid the information loss.
(2) Primary capsule layer In this layer, the capsule is preliminarily developed. Another convolution operation with the kernel size of 9 × 9 and number of 256 is first applied. The output after the convolution is then reshaped into capsules with size of 6 × 6 × 32 × 8, which means the dimensionality of each capsule is 8. Squashing function, which will be interpreted later, is used to squash the vector length within the range of 0-1. This layer can obtain low-level features in a vector form.

(3) Digit capsule layer
This layer is to get high-level features in vector form and is fully connected with primary capsule layer between vectors by a matrix W ij . Each capsule from previous layer is multiplied by W ij with a size of 8 × 16, which will be interpreted in detail later. Therefore, the output size of this layer is 6 × 6 × 32 × 16. In addition, a dynamic routing algorithm, which also will be introduced later, is applied between primary and digit capsules. The weight parameter of each capsule can be determined by dynamic routing instead of backpropagation. After dynamic routing several times, the final capsules with a size of 10 × 16 are then sent to two different branches: one to compute the length for classification and the other to reconstruct the input images.

(4) Decoder
Decoder is an extra layer in CN compared with CNN. It serves as the function very similar to an autoencoder. The capsule with maxlength from digit capsules is selected to input the 3 fully connected layers for reconstruction as described in Figure 2. The reconstruction loss is also considered to update the parameters. Therefore, the total loss of CN contains two parts, namely classification and reconstruction, and is the weighted sum of both.

The Principle of Capsule Network
In CN, many capsules with special significances, including orientation and length, are the key components of the network. The activation of the neuron within the capsule represents a variety of properties present in the image. These properties can include many different types of instantiation parameters in the image, such as posture, deformation, velocity, tone, and texture. A very special property of the capsule is the existence of an instance of a category in the image. Due to these unique capabilities of CN, in this work, we will explore this method to solve the problem of fault classification problem under different rotational speeds.
Compared with the traditional neural network, the output of neuron is a vector rather than a scalar, whose length can represent the estimated probability of the existence of the object. In addition, the non-linear activation function is replaced with a "squashing" function to ensure that short vectors are confined to almost zero length and long vectors are confined to a length slightly below one. Therefore, the length after squashing can be represented as the probability of a given character.
where s j is the total input of capsule j and v j is the its vector output. The computational procedure of vector neuron v j of capsule j between primary capsule layer and digit capsule layer can be summarized as three steps and described in Figure 3. First, the initial vector u i from primary capsules is multiplied by weight matrix W ij to get the predicted vector u j|i . Secondly, a weighted sum over all predicted vectors u j|i is performed to obtain the following vector s j . Finally, the vector s j is transformed into vector v j using the "squashing" function.
where c ij is called coupling coefficients and are determined by the iterative dynamic routing algorithm. The coupling coefficients between capsule i and all the capsules in the layer above sum to 1 and are calculated through a "routing SoftMax" whose initial logits b ij are the log prior probabilities that capsule i should be coupled to capsule j [27].
The key function of CN is to obtain the coupling coefficient c ij and it is determined by the iterative dynamic routing process. The detail of routing algorithm is shown in Algorithm 1, where r is the times of routing. Algorithm 1 Dynamic routing algorithm 1: Procedure Routing(u j|i , r, l) 2: for all capsule i in layer l and capsule j in layer (l + 1):b ij ← 0. 3: for r iterations do 4: for all capsule i in layer l:c i ←So f tMax(b i ) 5: for all capsule j in layer (l + 1):s j ← Σ i c ij u ij 6: for all capsule j in layer (l + 1):v j ←squash(s j ) 7: for all capsule i in layer l and capsule j in layer (l + 1) : b ij ←b ij + u j|i ·v j 8: return v j

The Procedure of the Proposed Scheme
In this paper, an end-to-end scheme based on CN and two-dimensional (2D) input data is proposed for health state identification of the rolling bearing under different rotational speeds. Specifically, the 2D data is constructed by fusing the signals collected from two sensors perpendicular to each other. The detail of data fusion method will be introduced later. Taking the structured 2D data as input, 2D convolution is used in CN. Figure 4 displays the flowchart of the proposed method. In addition, there are four main steps: (1) The vibration signals are collected by multiple sensors; (2) Vibration signals from two-direction sensors, including horizontal and vertical, are normalized and sliced into segments; (3) The segments from two directions are fused and then input to the CN; (4) The CN is trained and tested using the data from different speeds. The test accuracy of the output result is used to evaluate the effectiveness of the proposed scheme.

Case Study and Experiment Results
In this section, the performance of the proposed method is tested. All data are collected from an MFS-MG test rig in University of Electronic Science and Technology of China (UESTC), as shown in Figure 5. The model of the capsule network is constructed in Python 3.6 using Pytorch and carried out on a PC with an Intel i7-8750H CPU, a GTX 1050 Ti GPU and operating system Win10.

Experimental Rolling Bearing Data Description
To test the performance of the proposed scheme, we conduct experiments on the setup. As shown in Figure 5, the test rig mainly consists of a motor, two support bearing and a rotor. Experiments are conducted on the rolling bearing away from motor under four health states, including normal, inner race fault, rolling element fault and outer race fault. For each health state, vibration signals are collected when the drive motor operates under different rotational speeds to form up different domain datasets. The settings of speed include 800 rpm, 1500 rpm and 1800 rpm, resulting in different speed gaps. Therefore, the experimental dataset contains 4 different health states and 3 sets for each health state. In addition, four accelerometers are employed in this experiment. For each rolling bearing, two accelerometers are installed perpendicular to each other to collect vibration data in the horizontal and the vertical synchronously. To display the location of four sensors clearly, the layout is described using a schematic diagram, shown in Figure 6. As sensors 3 and 4 are directly installed on faulty bearing orthogonally, only vibration signals collected from these two sensors are used for follow-up work. The sampling frequency and measurement time are 20 KHz and around 3.2 s respectively, contributing to 65,536 sample points for each collected signal.

Data Processing
The collected signals need to be processed to input the network. First, the collected signals perform global normalization to speed up training. Then the preprocessing data are sliced into segments. Considering that a longer segment could be contain more information but fewer samples will be obtained, which may lead to overfitting issue during the network training, therefore, the size of each segment is set as 2048 to balance the length and the number of samples. In addition, the same vibration signal is sliced with overlap and no overlap respectively. In addition, the shift is set as 300 for overlap. Therefore, vibration signal with 65,536 points can provide 212 segments with overlap and 32 segments with no overlap. The obtained segments are then prepared for postprocessing.
Three different data fusion methods are conducted on the segments from different sensors, as shown in Figure 7, namely no processing, channel overlay [28] and stack banding. No processing indicates that the segments from one single sensor are used and sensor 3 is selected in this study since training and test datasets are from the same rotational speed scenario and measured vibrations from either sensor 3 or 4 can be largely treated as the same input. The format of no processing is (1,2048) and one-dimensional convolution is used in the following studies. Channel overlay and stack banding fuse the two segments, namely the horizontal and the vertical vibration signals, to be an input data with the same period but different arrangement. The segments are first expanded into a three-dimensional representation, i.e., (1,1,2048), to in line with the input format of two-dimensional convolution (in_channel,in_width,in_height). Channel overlay refers to concatenating the segments of sensors 3 and 4 according to the "in_channel" to form up one data sample. In this way, the shape of the fused data sample will be (2,1,2048). Similar to channel overlay, stack banding concatenates the segments of sensors 3 and 4 according to the "in_width", and results in a data of (1,2,2048). The whole dataset in detail is shown in Table 1. It should be noted that no processing, channel overlay, and stack bonding are represented by A, B and C, while three different speeds are represented by a, b and c.

Parameters of the Proposed Model
The model parameters will have a significant influence on the experimental results. Therefore, reasonable selection of model parameters needs to be studied here, including batch size, iteration, the size and number of convolutional kernels. To perform the process of parameter selection, dataset with the "stack banding" (C) is used. Considering the large speed gap scenario between set a and set c in Table 1, the training set is from set a with overlap while the test set is from set c without overlap.
(1) Selection of batch size Batch size refers to a certain number of samples at each training step. Reasonable selection of batch size can reduce training time and improve classification accuracy. Considering the limited memory capacity of computer and the number of training samples, the selected batch size are 8, 10, 16, 20, 24, 32 and 64, respectively. In addition, the number of iterations is preliminarily set as 70. For the convolution layer, two convolution kernels are used. As illustrated in [21], the high frequency noise commonly in vibration signals disturbs the local feature extraction, and the kernels of the first layer should be wide for suppressing the noise. Therefore, a wide kernel is used in the first layer and the size is 2 × 64, following a convolution kernel of 1 × 8 in the second layer. The experiment is repeated for ten times, and the average value is taken as the final classification accuracy. The relationship between the classification accuracy, time and batch size is shown in Figure 8. Balancing the training time and accuracy, the batch size is selected as 24. (2) Selection of iteration times One iteration refers to forward propagation and back propagation once of the network. If the number of iterations is too small, the model is not well trained. When the number of iterations reaches a threshold, the model tends to be stable, which means that the accuracy will not increase, but time cost will increase with the increase of the number of iterations. Therefore, a suitable number of iterations needs to be chosen to reduce the training time. To simplify the complexity, the batch size is set as 24 according to the above and the convolution kernel is same. The relationship between classification accuracy and iteration times is shown in Figure 9. In Figure 9, the classification accuracy increases gradually with the increase of iteration times. When the iteration times exceed 100, the classification accuracy tends to be stable. To have a redundancy for the following parameter selection, the number of iterations is chosen as 120.  The size and number of convolution kernels determines the features extracted preliminarily in CN. Reasonable design of convolutional layer can improve the performance of CN. As mentioned in [25], several layers with small kernels are better than a single layer with large kernels. Therefore, the small kernels of size 3 are used except the first layer. The performance of different numbers of kernels are compared and shown in Figure 10. The kernel size of first layer is fixed at 64 and the latter layers are shown in Figure 10. When the latter layers are set as 3/3/3, the accuracy is the highest. Therefore, the kernels are set as 2 × 64, 1 × 3, 1 × 3, 1 × 3. According to the above discussions, the configurations of the proposed capsule network for experiments are shown in Table 2. In the training process of the proposed model, the total loss is the combination of two outputs: the loss of classification and the loss of reconstruction. To reduce the effect of reconstruction loss, we multiply by a weight λ = 0.5. Adam algorithm with default lr = 0.001 is used to train the model. Batch size is 24 and epoch number sets as 120. In addition, L2 regularization operation is applied to prevent overfitting.

Results and Discussions
In this part, the dataset shown in Table 1 is used to verify the proposed scheme. In addition, WDCNN (a wide first-layer deep convolutional neural network) proposed in [26] is used to compare with CN. The models are trained repeatedly with ten trials for each case.
First, to test the diagnostic performance of CN under the same rotational speed, dataset with the "no processing" (A) in Table 1 is used. It should be noted that the training and test sets are both from data with overlap for each rotational speed. The ratio of training set to test set is 4:1. The results are shown in Table 3 and Figure 11. Average and variance of the test accuracies in each case are listed in a format of "average ± variance". As shown in Table 3, the CN model achieves accuracies of "100 ± 0.00" in all three cases, while WDCNN has misjudgments in all three cases. In particular, The proposed scheme improves by 2.27% of average when the speed is 1500rpm. In addition, comparing the training process of two methods, shown in Figure 11, the proposed scheme remains stable after 20th epoch. However, WDCNN-based method shows slightly turbulence after 30th epoch. Therefore, the CN model has a slight improvement on diagnosis accuracy and stability when the training and test data are from the same rotational speed, while the WDCNN model still is a good method for fault diagnosis under this scenario. Table 3. Accuracies of different methods tested at the same speed.
To reflect the prominent property of the proposed scheme under different rotational speeds, the vibration signals from sensors 3 and 4 are fused as input of CN. For comparison, we replace the data fusion methods and model in turn. Dataset with three fusion method are used in this part. It should be noted that the training set is from the dataset with overlap under one speed while test set is from the dataset with no overlap under another speed. The results are shown in Table 4 and Figure 12. The time is average value of each trial. There is a 300 rpm speed gap between settings b and c, while a 1000 rpm speed gap between settings a and c. Following, two discussions will be made.   (1) Comparisons on the No 1,2,3 in Table 4. They are all performed using the CN model. First, for No 2 and No 3, datasets B and C is inputted into CN, respectively. The only difference is the data fusion method using two sensors. The results show that the test accuracies of these two data fusions have small difference overall. In addition, the training time of two data fusion methods is 154 s and 152 s, and is very close. Secondly, comparing No 1 to No 3, the input data are from one sensor and two sensors, respectively. The results show that the test accuracies using two sensors have a clear enhancement. In particular, for "a→c" and "c→a" scenarios, which have a relatively large speed gap (1000rpm), the average accuracies improve by 27.50% and 22.80% respectively. As for the training time, it reduces by 10s on the contrary. These results demonstrate that the two data fusion methods based on CN model with two sensors achieve satisfactory performance for rolling bearing diagnosis under different rotational speeds.
(2) Comparing the No 2,3 to No 4,5 in Table 4. They are performed using CN and WDCNN model, respectively. For No 2 and No 4, the only difference is the classification network. The test accuracy using CN model have an improvement of 4.53% for "b→c" scenario. As the speed gap increases, namely "a→b" and "a→c", the test accuracies improve by 29.38% and 32.68% respectively. Comparing the training time, CN spend more time due to its special structure. Considering the great accuracy improvement, spending more time is worthy. There is also a similar result between No 3 and No 5. The results show that the CN model is more suitable for fault diagnosis than CNN using two-directional sensors, especially at large speed gap.
Therefore, according to the above experiment and discussions, it demonstrates that joint use of two-direction signals together with CN is the best choice for fault classification under different rotational speeds scenario.

Conclusions
In this paper, we proposed a 2D data input capsule network scheme for fault diagnosis of rolling bearing. We aim to eliminate the influence of speed by taking advantage of capsule network with capturing the spatial relationship from the input. Experimental data from three different speeds are used to verify the performance of the proposed method. According to the results of the experiments, it may conclude that: (1) The CN model can achieve a better performance than WDCNN when the training and test data are from the same rotational speed and only one sensor is used. (2) When two-directional sensors are used, the proposed scheme can largely get rid of the influence of different rotational speeds for a significant improved diagnostic accuracy. For future work, the performance of the proposed scheme under fluctuating speeds will be further studied.