Deep Learning Approach for Vibration Signals Applications

This study discusses convolutional neural networks (CNNs) for vibration signal analysis, including applications in machining surface roughness estimation, bearing fault diagnosis, and tool wear detection. One-dimensional CNNs (1DCNN) and two-dimensional CNNs (2DCNN) are applied to regression and classification using different types of inputs, e.g., raw signals and time-frequency spectra images obtained by the short-time Fourier transform. For regression, namely the estimation of machining surface roughness, the 1DCNN is utilized, and an optimization of the corresponding CNN structure (hyper parameters) is proposed using uniform experimental design (UED), a neural network, multiple regression, and particle swarm optimization. The results demonstrate the effectiveness of the proposed approach in obtaining a structure with better performance. For classification, bearing fault and tool wear classification are carried out by vibration signal analysis with CNNs. Finally, experimental results are shown to demonstrate the effectiveness and performance of our approach.


Introduction
Vibration signals can be applied to machine diagnosis and help discover problems during machining. With signal processing methods, e.g., the fast Fourier transform and the wavelet transform, the signals can be decomposed and transformed into different domains for analysis [1][2][3][4][5][6][7][8]. Statistical features and other characteristics related to physical phenomena are then extracted for applications. Based on data analysis, machine learning approaches model the relationship between features and physical phenomena. The corresponding features are usually extracted by statistical analysis in the time and frequency domains.
In mechanical systems, rolling element bearings (REBs) are among the crucial components, and bearing failures can cause safety problems. Many studies have proposed bearing diagnosis or monitoring systems built on machine learning models, e.g., support vector machines (SVMs) and neural networks (NNs) [9][10][11][12][13][14]. Recently, deep learning approaches were proposed to automatically extract the characteristics of vibration signals for signal analysis [9,12,13,14]. Frequency spectra can also be used for prediction or diagnosis [15,16]. Statistical features are usually used as the inputs of machine learning models for diagnosis [17][18][19]. The convolutional neural network (CNN) discussed in this paper is also widely applied to bearing diagnosis using raw signals or signal spectra [20][21][22][23][24][25][26].
The condition of machine tools directly affects quality and productivity. A blunt tool can cause poor quality since the magnitude of vibration during machining increases. Excessive tool wear can even lead to tool breakage. The diagnosis of tool status by on-line and off-line monitoring has been proposed [27][28][29][30][31]. For off-line monitoring, the tools

Convolutional Neural Network (CNN)
The CNN was first proposed by LeCun et al. [51], and its structure is shown in Figure 1. The three basic building blocks of the CNN are convolutional layers, pooling layers, and fully connected layers. Convolutional and pooling layers perform automatic feature extraction, while the fully connected layers form a general neural network that acts as the classifier or predictor.
First, the convolutional layer is introduced: the inputs are convolved with filters to obtain the corresponding features. The convolutional operation of a single filter can be represented as

$$z_l^k = f_c(\alpha_l * x + b) \tag{1}$$

where $*$ represents the convolutional operation; $x$ denotes the input and $f_c$ denotes the activation function of the convolutional layer; $b$ and $\alpha_l$ are the bias and the corresponding kernel of the $l$th filter, respectively; and $z_l^k$ denotes the corresponding output feature map. Herein, the kernel matrices are obtained by training, and $l = 1, \ldots, N$ is the selected kernel size.

Figure 1. Structure of convolutional neural network. Reprinted from ref. [47].
In pooling layers, the important features are retained and the number of features is reduced by a max-pooling operation. The operation of a single filter can be represented as

$$p_{q,r} = \max_{0 \le i < L_P,\; 0 \le j < W_P} z_{qL_P + i,\; rW_P + j} \tag{2}$$

where $p_{q,r}$ is the pooled feature, $q$ and $r$ are the row and column indices of the features after pooling, and $L_P$ and $W_P$ represent the length and width of the filters in the pooling layers.

The feature maps after feature extraction are flattened into a one-dimensional array and fed into the fully connected layers. The feedforward operation of a single neuron in the fully connected layers is represented as

$$y = f_f\left(\sum_{a=1}^{n} w_a h_a + b\right) \tag{3}$$

where $h_a$ is the input of the neuron, $w_a$ is the weight of $h_a$, $a = 1, 2, \ldots, n$, $b$ is the bias, $f_f$ is the activation function of the neuron in the fully connected layer, and $y$ is the output of the CNN.
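As a minimal illustration of the three operations described above, the following Python sketch implements a one-dimensional convolution, max pooling, and a single fully connected neuron using only the standard library. The input values, the kernel, the weights, and the ReLU activation are assumptions chosen for illustration and are not taken from the models in this study. As in most CNN libraries, the convolution is computed as a cross-correlation, and the pooling windows are assumed not to overlap.

```python
# Illustrative sketch only: values, kernel, and activation are assumptions.

def relu(v):
    """Rectified linear unit, used here as the activation function."""
    return v if v > 0.0 else 0.0

def conv1d(x, kernel, bias, activation=relu):
    """Valid 1D convolution (computed as cross-correlation) plus activation."""
    k = len(kernel)
    return [
        activation(sum(kernel[j] * x[i + j] for j in range(k)) + bias)
        for i in range(len(x) - k + 1)
    ]

def max_pool1d(z, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(z[i:i + size]) for i in range(0, len(z) - size + 1, size)]

def dense_neuron(h, weights, bias, activation=relu):
    """Single fully connected neuron: activation(sum_a w_a * h_a + b)."""
    return activation(sum(w * v for w, v in zip(weights, h)) + bias)

# Hypothetical input signal and parameters.
x = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
feature_map = conv1d(x, kernel=[1.0, 0.0, -1.0], bias=0.0)
pooled = max_pool1d(feature_map, size=2)      # flattened input of the neuron
y = dense_neuron(pooled, weights=[0.5, 0.5, 0.5], bias=0.1)

print(feature_map, pooled, y)
```

In a trained CNN the kernel, weights, and biases are learned from data; here they are fixed by hand so that each stage can be checked by inspection.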

Short-Time Fourier Transform (STFT)
The discrete Fourier transform (DFT) is widely applied to generate the frequency spectra of signals. However, frequency spectra do not contain time-domain information. To present the time domain and the frequency domain at the same time, the STFT is employed [8,52]. In the STFT, the signal is first divided into short-time segments, and the frequency distribution of each segment is computed by the DFT. The time-frequency spectrum of the signal is then obtained by stacking the frequency spectra of the segments. The STFT can be represented as

$$\mathrm{STFT}\{x\}(m, \omega) = \sum_{n=0}^{N-1} x[n]\, w[n - m]\, e^{-j\omega n} \tag{4}$$

where $x$ is the discrete signal with size $N$, $\omega$ is the frequency, $n$ is the index of data points in $x$, $w$ is the discrete window function, and $m$ is the discrete index in the window $w$. The STFT is applied as the preprocessor of the signals in this study. The time-frequency spectra are the inputs of the convolutional neural networks, as introduced in the following section. Note that the axes of the spectra are removed when they are input into the model.
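The segment-and-transform procedure described above can be sketched as follows. This is a minimal illustration, assuming a Hann window of 64 points, a hop of 32 points, a 1 kHz sampling rate, and a 100 Hz test tone; none of these are the settings used in this study. The DFT is evaluated by a direct sum for clarity, whereas a practical implementation would rely on an FFT library.

```python
import cmath
import math

def hann(length):
    """Periodic Hann window of the given length."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / length) for i in range(length)]

def stft_magnitude(x, window, hop):
    """Return one magnitude spectrum (bins 0..L/2) per windowed segment."""
    length = len(window)
    frames = []
    for start in range(0, len(x) - length + 1, hop):
        segment = [x[start + i] * window[i] for i in range(length)]
        spectrum = []
        for k in range(length // 2 + 1):
            acc = sum(
                segment[i] * cmath.exp(-2j * math.pi * k * i / length)
                for i in range(length)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

# Hypothetical test signal: a 100 Hz sine sampled at 1 kHz.
fs = 1000
tone_hz = 100
x = [math.sin(2.0 * math.pi * tone_hz * n / fs) for n in range(512)]

frames = stft_magnitude(x, hann(64), hop=32)
peak_bin = max(range(len(frames[0])), key=frames[0].__getitem__)
peak_hz = peak_bin * fs / 64  # centre frequency of the strongest bin

print(len(frames), len(frames[0]), peak_bin, peak_hz)
```

Stacking the rows of `frames` along the time axis gives the time-frequency spectrum. The window length sets the trade-off between time and frequency resolution: with 64 points at 1 kHz the bins are 15.625 Hz apart, so the 100 Hz tone falls in the bin centred at 93.75 Hz.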

Particle Swarm Optimization (PSO)
Particle swarm optimization (PSO), which simulates the social behavior of fish and birds while foraging, was proposed in 1998 [53]. First, the fitness function and the target of optimization are defined. The fitness function gives the score of each particle. The particles adjust their directions and locations according to the best locations found by the group and by themselves using

$$V_i(t+1) = w V_i(t) + c_1 r_1 \left(P_{pbest} - P_i(t)\right) + c_2 r_2 \left(P_{gbest} - P_i(t)\right) \tag{5}$$

and

$$P_i(t+1) = P_i(t) + V_i(t+1) \tag{6}$$

respectively, where $V_i$ is the direction of the $i$th particle, $t$ represents the index of iteration, $w$ is the weight of inertia, $c_1$ is the weight representing how much $P_{pbest}$ affects the optimization, $c_2$ is the weight representing how much $P_{gbest}$ affects the optimization, $r_1$ and $r_2$ are random numbers drawn uniformly from [0, 1] as in the standard PSO formulation, and $P_i(t)$ represents the location of the $i$th particle at the $t$th iteration. Finally, when the maximum number of iterations is reached or the fitness of $P_{gbest}$ no longer changes, the optimization is complete and $P_{gbest}$ is the optimized result. In this study, the mean absolute percentage error (MAPE) of prediction is adopted as the objective function to be minimized in the optimization of the hyper parameters.
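The velocity and position updates described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the inertia weight and acceleration weights (0.7, 1.5, 1.5), the swarm size, the clamping of positions to the search bounds, and the shifted sphere test function are illustrative choices and not the settings of this study. The random factors `r1` and `r2` follow the standard PSO formulation.

```python
import random

def pso(fitness, bounds, n_particles=30, n_iter=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise `fitness` over box `bounds` with inertia-weight PSO."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + pull to own best + pull to group best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Position update, clamped to the bounds (an added assumption).
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            fit = fitness(pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

def shifted_sphere(p):
    """Hypothetical test objective with its minimum (0) at (1, 1, 1)."""
    return sum((v - 1.0) ** 2 for v in p)

best, best_fit = pso(shifted_sphere, bounds=[(-5.0, 5.0)] * 3)
print(best, best_fit)
```

For hyper parameter optimization, `fitness` would be replaced by the model that predicts the MAPE from a hyper parameter combination, and a stopping rule based on the lack of improvement of the group best could replace the fixed number of iterations.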

Machining Roughness Estimation Application
In this section, machining surface roughness estimation is achieved using the CNN, and the optimization of the CNN structure is discussed. First, the dataset is introduced. Then, the experimental design is carried out and executed. After the experiments are complete, a simple neural network (NN) is applied to model the relation between the hyper parameters and the performance of the model. Optimization using PSO is then discussed. Finally, the optimized results are verified.
The optimization of the model structure is introduced first.

Optimization of Model Structure
Herein, the concept of optimizing the model structure (hyper parameters) is utilized [54]. An improvement based on uniform experimental design (UED) [55], a neural network, and a PSO algorithm is introduced. It preserves the ability of the CNN and optimizes the performance. The flow chart of the optimization procedure is shown in Figure 2. The procedure includes (1) parameter selection of the CNN, (2) experimental design using UED, (3) data acquisition, (4) model development, (5) optimization, and finally, (6) verification.

Optimization Procedure
Step 1. Parameter selection of CNN: Select the main structure (convolution filter size, pooling, fully connected nodes), the optimized hyper parameters, and levels.
Step 2. Design experiments using UED: Choose the appropriate uniform layout (UL) of model structure according to the parameter selection and design experiments.
Step 3. Data acquisition: Complete the experiments. The model with the above structure is trained and the corresponding hyper parameters/trained MAPE are collected as input/output data.
Step 4. Model development: Model the function between hyper parameters and performance using a neural network. The performance measure applied in this study is MAPE.
Step 5. Optimization: Obtain the hyper parameter combination with better performance using PSO. In this study, the goal of optimization is to minimize the MAPE of the CNN.
Step 6. Verification: Verify the performance of the optimized result.
In this study, a simple neural network is applied for modeling and particle swarm optimization (PSO) is adopted for optimization, for comparison with multiple regression (MR) and the full-factorial searching algorithm [54], respectively.
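To make Steps 4 and 5 concrete, the sketch below runs a full-factorial search over a small grid of hyper parameters against a stand-in performance model. Everything in it is hypothetical: the factor names, the levels, and the function `surrogate_mape` are illustrative assumptions and do not correspond to Table 1 or to the fitted NN or MR models of this study.

```python
from itertools import product

# Hypothetical factor levels (not the values used in this study).
levels = {
    "filter_size": [5, 15, 25],
    "pool_size": [2, 4],
    "fc_nodes": [10, 50, 100],
}

def surrogate_mape(filter_size, pool_size, fc_nodes):
    """Stand-in for the fitted performance model of Step 4 (arbitrary form)."""
    return (5.0
            + 0.01 * (filter_size - 15) ** 2
            + 0.5 * abs(pool_size - 4)
            + 0.0004 * (fc_nodes - 50) ** 2)

def full_factorial_search(levels, objective):
    """Evaluate every combination of levels and keep the smallest objective."""
    names = list(levels)
    best_combo, best_value = None, float("inf")
    for values in product(*(levels[name] for name in names)):
        combo = dict(zip(names, values))
        value = objective(**combo)
        if value < best_value:
            best_combo, best_value = combo, value
    return best_combo, best_value

best_combo, best_value = full_factorial_search(levels, surrogate_mape)
print(best_combo, best_value)
```

The number of evaluations is the product of the numbers of levels (18 here), which grows quickly as factors are added. That growth is the motivation for replacing the exhaustive search with PSO in Step 5.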

Surface Roughness Estimation Using CNN
The milling data were provided by Wu et al., who used a tungsten carbide milling cutter to cut S45C steel [34]. Six single-axial accelerometers (Wilcoxon Research 785A) are mounted on the spindle and the vise to measure the X-axial, Y-axial, and Z-axial vibration signals. The signals are acquired using a DAQ NI 9234 with a sampling frequency of 10 kHz. The experimental setup can be found in [34]. The surface roughness is measured using a Mitutoyo SV-C3200S4. The machining parameters and setup values are: spindle speed (rpm)-900, 1000, 1800, 1900, 2000, 2100, 2700, 3000; feed rate-228, 240, 252, 320,
A one-dimensional CNN (1DCNN) with sensor fusion in a parallel structure, shown in Figure 3, is applied for machining roughness estimation. The features of the vibration signals in the X, Y, and Z directions are extracted separately. In order to obtain a CNN structure with better performance, the optimization of the hyper parameter combination is applied [52]. The ranges of the optimized hyper parameters and the structure of the CNN are selected as shown in Table 1. According to Table 1, there are six design factors: $F_C$ for the size of filters in convolutional layers, $F_P$ for the size of filters in pooling layers, $N_{C1}$ for the filter number in the first convolutional layer, $N_{C2}$ for the filter number in the second convolutional layer, $N_{F1}$ for the number of nodes in the first fully connected layer, and $N_{F2}$ for the number of nodes in the second fully connected layer. The feature extraction for the three axial signals is the same. The performance of the model is assumed to be a function of the hyper parameters, which is represented as

$$\mathrm{MAPE} = f(F_C, F_P, N_{C1}, N_{C2}, N_{F1}, N_{F2}) \tag{7}$$

Sensors 2021, 21, 3929 6 of 17

According to UED [49], four levels are selected for all factors, and the corresponding uniform layout applied here is $U_{28}(4^6)$, as shown in Table 2. The final experimental design is introduced in Table 3, together with the corresponding combinations of parameters and the trained MAPE (average testing MAPE of the corresponding experimental CNNs). Every structure has been tested three times and the average MAPEs are computed. The maximum number of epochs of each model is 700. In order to reduce the time needed for the experiments, an early stop criterion is set up according to testing experience: if the loss has not decreased for 15 epochs, the training process is stopped.
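The early stop criterion described above (at most 700 epochs, stop after 15 epochs without a decrease in the loss) can be sketched as follows. The sketch replays a synthetic loss curve instead of training a network; the function name and the loss values are hypothetical, and the text does not state whether the training or the validation loss is monitored.

```python
def train_with_early_stopping(losses, max_epochs=700, patience=15):
    """Replay a loss sequence and return (stopping epoch, best loss).

    Epochs are counted from 1. Training stops when the loss has not
    decreased for `patience` consecutive epochs or at `max_epochs`.
    """
    best = float("inf")
    epochs_without_improvement = 0
    epoch = 0
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epoch, best

# Synthetic loss curve: improves for 40 epochs, then plateaus at a worse value.
losses = [1.0 / e for e in range(1, 41)] + [0.05] * 660
stopped_at, best_loss = train_with_early_stopping(losses)
print(stopped_at, best_loss)
```

With this curve the last improvement occurs at epoch 40, so training stops at epoch 55 instead of running all 700 epochs.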
After the experiments, the function between hyper parameters and average testing MAPE is modeled using MR and NN for comparison. The performance of models, optimization results, and verifications are compared as follows. The data are normalized before modeling.
First, a model is obtained using stepwise MR. The corresponding R-squared ($R^2$) of the MR model is 0.9061 and the normalized root mean squared error (NRMSE) of MR is 0.0634. The objective function (fitness) is selected as the MAPE of each structure, and the optimization target is to minimize the fitness. The hyper parameter combination optimized using the full-factorial searching algorithm is: $F_C$ = 25, $F_P$ = 20, $N_{C1}$ = 20, $N_{C2}$ = 20, $N_{F1}$ = 100, $N_{F2}$ = 10. The testing MAPE predicted by the MR model for this combination is 5.788%. The structure with the optimized hyper parameter combination has been trained three times, and the testing MAPEs are shown in Table 4. The average MAPE is quite different from the prediction, with an error of 147.06%. The combination does not perform better than the experiments.
Then, an NN is applied to model the relation between the factors and the testing MAPE. The structure of the NN is shown in Table 5. The initial learning rate is 0.005, and the optimizer is Adam. The $R^2$ of the NN is 0.9999999996 and the NRMSE of the NN is $3.347 \times 10^{-5}$. The hyper parameter combination optimized using the full-factorial searching algorithm is: $F_C$ = 25, $F_P$ = 11, $N_{C1}$ = 18, $N_{C2}$ = 12, $N_{F1}$ = 100, $N_{F2}$ = 50. The testing MAPE predicted by the NN model for this combination is 10.849%. The combination has also been trained three times, and the testing MAPEs are shown in Table 6. The error between the average MAPE and the prediction of the NN model is much smaller, at 7.337%. The optimized structure improves the performance by 11.3%. The results show that modeling using an NN can produce a better and more stable hyper parameter combination than the best hyper parameter set in the experiments. However, the structure, learning rate, and normalization strongly affect the performance of the modeling and the optimized result. A simple NN with a smaller learning rate is recommended in this case. Normalization is also necessary.
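For reference, the three measures used in this comparison (MAPE, $R^2$, and NRMSE) can be computed as in the sketch below. The sample values are hypothetical, and the NRMSE is normalized here by the range of the observed values, which is an assumption: the text does not state which normalization is used, and normalizing by the mean is also common.

```python
import math

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def nrmse(y_true, y_pred):
    """RMSE divided by the range of the observed values (assumed convention)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (max(y_true) - min(y_true))

# Hypothetical observed and predicted values.
y_true = [2.0, 4.0, 5.0, 8.0]
y_pred = [2.2, 3.6, 5.0, 8.8]

mape_value = mape(y_true, y_pred)
r2_value = r_squared(y_true, y_pred)
nrmse_value = nrmse(y_true, y_pred)
print(mape_value, r2_value, nrmse_value)
```

A high $R^2$ on the experimental points does not by itself guarantee accurate predictions at new hyper parameter combinations, which is why the optimized combinations are retrained and checked against the predicted MAPE.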
Herein, PSO is applied for optimization and compared with the full-factorial searching algorithm, using the NN model. The number of particles is set to 250, and the number of iterations is set to 3000. These values are chosen to ensure that the optimized result is the same as the result of the full-factorial searching algorithm. The weights of updating velocity are  Table 7. If the fitness of $P_{gbest}$ does not improve for 500 iterations, the optimization is stopped. The fitness during optimization using PSO is shown in Figure 4. The optimized result is the same as that of the full-factorial searching algorithm. Moreover, PSO takes 45.435 s to complete the process, while the full-factorial searching algorithm takes 146.87 s. If the numbers of particles and iterations are reduced according to the testing results, the optimization time can be shortened further. When the structure of the optimized CNN is more complex, the computing time of PSO and other optimization methods is much less than that of the full-factorial searching algorithm.

Classification of CWRU Bearing Data
The CWRU bearing data [56] are discussed in many other studies on bearing fault classification [57][58][59]. The signals discussed in this study are collected by the accelerometer mounted at the drive end of the motor. The sampling frequency is 12 kHz. The bearing statuses include normal bearings, bearings with inner ring faults, bearings with outer ring faults, and bearings with ball faults; the faults are introduced artificially using an electrical-discharge machine (EDM). The statuses are labeled as normal: 0; inner ring fault: 1; outer ring fault: 2; and ball fault: 3. There are 64 data in the original dataset. In order to increase the number of data, a sliding window is used to slice the signals into one-second signals. The length and the stride of the window are 12,000 data points (1 s) and 3000 data points, respectively. The length of the window is selected after considering the completeness of the signals in the frequency domain and the testing results. Finally, there are 2368 data; 1657 data (70%) are chosen randomly as training data and the rest (30%) are used as testing data.
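The sliding-window slicing described above can be sketched as follows. The window length (12,000 points) and stride (3000 points) are those given in the text; the record length of 120,000 points (10 s at 12 kHz) is an assumption made for illustration, and the function names are hypothetical.

```python
def window_count(n_samples, length, stride):
    """Number of full windows that fit into a record of n_samples points."""
    if n_samples < length:
        return 0
    return (n_samples - length) // stride + 1

def sliding_windows(signal, length, stride):
    """Slice a signal into overlapping segments of equal length."""
    return [
        signal[start:start + length]
        for start in range(0, len(signal) - length + 1, stride)
    ]

# Hypothetical record: 10 s sampled at 12 kHz (sample indices stand in for data).
record = list(range(120_000))
segments = sliding_windows(record, length=12_000, stride=3_000)

print(len(segments), window_count(len(record), 12_000, 3_000))
```

Under the assumed record length, each record yields 37 segments, and 64 records would yield 64 × 37 = 2368 segments, which agrees with the total reported above. Because consecutive segments overlap by 75%, segments cut from the same record are strongly correlated, which should be kept in mind when the data are split randomly into training and testing sets.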

(a) Bearing Faults Classification Using Vibration Signals
Herein, we introduce the classification of bearing faults using a 1DCNN with vibration signals as inputs. The selected structure of the 1DCNN is introduced in Table 8. The initial learning rate is 0.001, and the optimizer is Adam. The average training and testing accuracies of the model are both 100% after testing three times using different training data. The confusion matrix of the model on the testing data is shown in Figure 5. The result shows that the 1DCNN can provide excellent classification performance using vibration signals directly as inputs. The classifying time of the 1DCNN using an NVIDIA Tesla V100 32 GB GPU is 0.00133 s per data.

(b) Bearing Faults Classification Using STFT Time-Frequency Spectra
The time-frequency spectra after STFT of different bearing conditions are shown in Figure 6. A 2DCNN is applied to classify the bearing faults. The structure of the CNN is shown in Table 9. The initial learning rate is 0.001 with the Adam optimizer. The average training and testing accuracies are both 100% after testing three times. The confusion matrix of the model for the testing data is shown in Figure 7. The result shows that the 2DCNN can also be applied to the classification of bearing faults with great performance. The inputs of the 2DCNN can be other types of two-dimensional arrays, e.g., time-frequency spectra obtained by the wavelet transform. The transformation time using STFT is 0.75258 s per data, and the classifying time of the 2DCNN using an NVIDIA Tesla V100 32 GB GPU is 0.00419 s per data. Classification using the 2DCNN takes more time because of the input size of the model: the 1DCNN uses raw signals as inputs, with an input size of 12,000 × 1, whereas the 2DCNN uses STFT time-frequency spectra as inputs, with an input size of 434 × 558 × 3.



Classification of Tool Wear Using STFT Time-Frequency Spectra
The experimental setup is introduced in Figure 8; the tool wear data of a tri-axial milling machine (CHMER HM4030L, Figure 8a) are used in this study. The cutting tools are tungsten carbide milling cutters with two blades, as shown in Figure 8b. The diameter of the cutters is 6 mm. The workpieces are S45C steel. The tri-axial accelerometer (CTC AC230) is mounted on the spindle, as shown in Figure 8c. The vibration signals are acquired using a DAQ NI PCIe-6361 with a sampling frequency of 100 kHz. The tool wear is measured using a Deryuan RS-500 industrial camera with ImageJ and PhotoImpact for image processing. The tool wear criterion is set to 0.4 mm according to ISO.
A 2DCNN with a small structure (shown in Table 10) is adopted for classifying tool wear using STFT time-frequency spectra. The vibration signals are sliced using a sliding window to increase the size of the data. The length and stride of the window are 100,000 data points (1 s) and 30,000 data points, respectively. The STFT time-frequency spectra of the Y-axial vibration signals of an unworn tool and a worn tool are shown in Figure 9. There are a total of 742 data; half of the data are selected randomly as training data, and the rest are testing data. First, the classification model is trained. The initial learning rate is 0.001 with the Adam optimizer. The average training and testing accuracies are both 100% after testing three times. The confusion matrix of the CNN model on the testing data is shown in Figure 10. The result shows that the 2DCNN can be applied not only to bearing fault classification but also to other classification problems in vibration signal analysis.


Conclusions
In this study, vibration signal analysis using CNNs has been discussed, including an improved optimization method for the structure of a CNN, and the use of a 1DCNN and a 2DCNN with raw signals and STFT images, respectively. The experimental results illustrate that the CNN can be applied to both prediction and classification. In the regression application, a 1DCNN with a parallel feature extracting structure was applied to estimate machining roughness. The optimization of the CNN structure was also introduced and used to demonstrate the effectiveness of the proposed approach in obtaining a structure with better performance. The most important factor in optimizing the structure of a CNN is to choose the correct method and levels for the experimental design. The levels can be understood as the resolution of the experiments. If the levels are too coarse, the experimental results are too few to represent the real situation; if they are too fine, the time cost increases because of the large number of experiments. Other experimental designs can also be applied, for instance, the Taguchi method. In classification, the 1DCNN and the 2DCNN are applied according to the inputs, and both provide excellent performance. The results also show that the CNN can extract features from vibration signals and time-frequency spectra automatically. When raw signals are used as inputs, the signal must be long enough to ensure that the information in the signal is complete. If time-frequency spectra are used as inputs, the resolution of the STFT affects the model, since time-frequency spectra show the distribution of frequency with respect to time. If the resolution is not appropriate, the information in the frequency domain is reduced, which degrades the performance of the model.