Using a Support Vector Machine Based Decision Stage to Improve the Fault Diagnosis on Gearboxes

Gearboxes are mechanical devices that play an essential role in several applications, e.g., the transmission of automotive vehicles. Their malfunction may result in economic losses and accidents, among other consequences. The rise of powerful graphical processing units (GPUs) has spread the use of deep learning-based solutions to many problems, including fault diagnosis in gearboxes. Those solutions usually require a significant amount of data, high computational power, and a long training process, so training deep learning-based systems may not be feasible when GPUs are not available. This paper proposes a solution to reduce the training time of deep learning-based fault diagnosis systems without compromising their accuracy. The solution is based on a decision stage that interprets all the probability outputs of a classifier whose output layer uses the softmax activation function. Two classification algorithms were applied to perform the decision. We reduced the training time by almost 80% without compromising the average accuracy of the fault diagnosis system.


Introduction
Gearboxes are mechanical devices that provide speed and torque conversion from rotating sources of power to other mechanisms. They have a crucial role in several applications, e.g., industrial rotating machines, automotive vehicles, and wind turbines. Their malfunction may not only impair the operation of a given system but also result in economic losses and safety risks [1]. This way, the use of fast and effective fault diagnosis techniques is necessary, since the early detection of failures allows more efficient management of the maintenance activities and leads to safer operation of the system [2].
Gearboxes may present several failure modes. Most of them are related to mechanical components and lubrication conditions. One failure mode that requires attention is the tooth breakage of gears, which is liable to compromise the machine operation in a significant way [3].
Supported by the advent of powerful computational devices, e.g., graphical processing units (GPUs), deep learning-based techniques have become essential tools in fault detection and fault diagnosis research fields. Their superior performance in applications related to classification and object detection tasks has also supported their popularization [4].
Many works relating deep learning and fault diagnosis in gearboxes have appeared in recent years. Zhao et al. [5] proposed a variant of deep residual networks (DRNs) that uses dynamically weighted wavelet coefficients to improve the performance of the diagnostic process. Their work is motivated by the absence of a consensus about which frequency bands carry the most useful information for systems that diagnose planetary gearboxes. Their system finds discriminative sets of features by dynamically adjusting the weights applied to the wavelet packet coefficients. Cabrera et al. [6] proposed the use of a deep convolutional neural network (DCNN) pretrained by a stacked convolutional autoencoder (SCAE) to determine fault severity in gearboxes. Their system performs unsupervised detection of hierarchical time-frequency patterns using the DCNN. The SCAE improves the DCNN performance by capturing a priori patterns.
Furthermore, Deutsch and He [7] use a feedforward deep belief network (DBN) to predict the remaining useful life of mechanical machines. They combine the self-taught feature-learning capability of DBNs with the predicting power of feedforward neural networks to extract features from vibration signals, assess the integrity of the machine, and make the prediction. Jiang et al. [8] proposed the use of stacked multilevel-denoising autoencoders to perform the fault diagnosis on the gearboxes of wind turbines. The features are learned through an unsupervised process, which is followed by a supervised fine-tuning process with the label information for classification. They also use multiple noise levels to train the autoencoder and enhance the feature learning and classification capabilities. Jiang et al. [9] also proposed a gearbox diagnosis system based on multiscale convolutional neural networks. They combined multiscale and hierarchical learning to capture information at different scales, improving the performance of the classifier.
Monteiro et al. [10] proposed a fault diagnosis system based on Fourier transform (FT) spectrograms and deep convolutional neural networks. They also discussed the influence of the model depth and the amount of available training data on the network performance. Shao et al. [11] used transfer learning to perform the fault diagnosis on mechanical machines. A DCNN model pretrained on ImageNet, followed by a fine-tuning process, carried out the fault diagnosis. Other works, e.g., Zeng et al. [12] and Liao et al. [13], use convolutional neural networks associated with the S transform and the wavelet transform, respectively, to classify the gearbox health condition.
One of the major issues with deep learning-based solutions for fault diagnosis systems is their computational burden; e.g., the training process of deep models is often long and demands a large amount of training data. Such a setback is usually overcome by using computers with powerful GPUs, e.g., [12]. However, this sort of hardware is not always available to everyone. Thus, it is necessary to find alternative ways to reduce the computational cost of deep learning-based solutions without compromising their accuracy.
This paper proposes the addition of a decision stage at the output of DCNN-based fault diagnosis systems, which are commonly based on classification algorithms [5,10,12,13]. The outputs of those systems often represent the probabilities that a given input belongs to each failure mode in a given set. The failure mode that presents the highest probability value is chosen. Although this approach has proved to be reasonable for a number of applications, the information provided by the remaining outputs is usually lost.
We believe that this information can also be used to improve the performance of the classifier. The case study is the one analyzed in [6,7], which poses the problem of the fault severity diagnosis related to the gear tooth break failure mode. The decision stage analyzes the outputs of all classes, i.e., severity levels, and decides the severity of the gearbox fault based on their probability distribution. Since this approach improves the classification results, the same model architecture can be trained for fewer epochs without compromising its accuracy, thus reducing the training time. Decision stages, on the other hand, are well-known tools. They are commonly employed in multimodal and committee-based classification systems.
They combine the results obtained by multiple classifiers to improve the accuracy of the whole system [14,15]. The remainder of this paper is organized as follows: Section 3 presents the details of the experiments carried out in this research, Section 4 presents the results obtained and discusses their relevance, and Section 5 explains the main findings and implications of this work.

Convolutional Neural Networks.
The convolutional neural networks (CNNs) are models inspired by biological processes. The pattern of the connections among neurons, i.e., the processing units of neural networks, is similar to that of the animal visual cortex. They perform well in object recognition and classification tasks [16]. Object detection [17], disease detection [18], and fault diagnosis [6,10] are three examples of applications that use CNNs. Their basic structure consists of an input layer, alternating blocks of convolutional and pooling layers, which are followed by fully connected layers, and an output layer [16]. Modifications in this structure may occur, depending on the application. This structure is illustrated in Figure 1. The role of each layer is explained as follows:
(a) Input layer: this layer receives and stores raw input data. It also specifies the width, height, and number of channels of the input data [19].
(b) Convolutional layers: they learn feature representations from a set of input data and generate feature maps. Those maps are created by convolving their inputs with a set of learned weights. An activation function, e.g., the ReLU function, is applied to the output of the convolution step. The following equation shows the general formulation of a convolutional layer:
$$x_j^l = f\left(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\right),$$
in which $x$ denotes a feature map, $f$ is the activation function, $l$ refers to the current layer, $i$ and $j$ are the indices of the elements of the previous and current layers, respectively, $M_j$ is a set of input maps, $k_{ij}^l$ is the weight matrix of the $i$-th convolutional kernel of the $l$-th layer applied to the $j$-th input feature map, and $b_j^l$ is the bias.
(c) Pooling layers: they reduce the spatial resolution of feature maps, improving the spatial invariance to input distortions and translations [19]. Most of the recent works employ a variation of this layer called max pooling [16]. It propagates to the next layers the maximum value from a neighborhood of elements. This operation is defined by
$$y_{jrs} = \max_{(p,q) \in R_{rs}} x_{jpq},$$
in which $y_{jrs}$ is the output of the pooling process regarding the $j$-th feature map and $x_{jpq}$ is the element at location $(p, q)$ contained by the pooling region $R_{rs}$. The pooling process is also known as subsampling [19].
(d) Fully connected and output layers: they interpret feature representations and perform high-level reasoning [16]. They also compute the scores of each output class [19]. The number of output nodes depends on the number of classes [12].
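As a minimal sketch of the convolutional and pooling layers described above, the following NumPy code applies one kernel to a single input map (so the sum over $M_j$ has one term), adds a bias, applies the ReLU function, and then performs non-overlapping max pooling. The function names, the toy input, and the kernel values are illustrative assumptions, and, as is common in CNN implementations, the "convolution" is computed as a cross-correlation.

```python
import numpy as np

def conv2d_valid(x, k, b=0.0):
    """Valid 2-D cross-correlation of one feature map x with kernel k, plus bias b."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k) + b
    return out

def relu(x):
    """Rectified linear unit applied element-wise."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a region are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    trimmed = x[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

x = np.arange(25, dtype=float).reshape(5, 5)    # toy 5x5 input map (illustrative)
k = np.array([[1.0, 0.0], [0.0, -1.0]])         # toy 2x2 kernel (illustrative)
feature_map = relu(conv2d_valid(x, k, b=7.0))   # convolution + bias + activation
pooled = max_pool(feature_map, size=2)          # spatial resolution halved
print(feature_map.shape, pooled.shape)
```

A full layer would repeat this for every output map $j$ and sum the contributions of all input maps in $M_j$ before applying the activation.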

Fourier Transform Spectrograms.
The Fourier transform (FT) is an essential technique in the field of signal analysis. It reveals the frequency composition of a given signal, as well as the magnitude of the contribution of each frequency [20]. Noise filtering, pattern recognition, and signal modulation are some applications that benefit from the Fourier transform and its variants, e.g., the discrete Fourier transform (DFT), suitable for processing digital signals, and the fast Fourier transform (FFT), a more efficient algorithm to calculate the DFT [21]. The Fourier transform spectrograms represent signals using time, frequency, and magnitude information. The short-time Fourier transform (STFT) is an FT variant commonly used to generate this sort of representation because it performs time-dependent spectral analyses [21]. The spectrograms show how the spectrum of frequencies of a given signal varies over time. Spectrograms are also used in fault diagnosis applications [10,22].
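To illustrate how the Fourier transform exposes the frequency composition and the magnitude of each component, the sketch below applies NumPy's FFT to a synthetic two-tone signal. The sampling rate, the tone frequencies, and the amplitudes are arbitrary values chosen for the example and are not related to the gearbox data.

```python
import numpy as np

fs = 1000.0                      # sampling rate in Hz (illustrative)
n = 1000                         # one second of samples
t = np.arange(n) / fs
signal = np.sin(2 * np.pi * 50.0 * t) + 0.5 * np.sin(2 * np.pi * 120.0 * t)

spectrum = np.fft.rfft(signal)                # FFT of a real-valued signal
freqs = np.fft.rfftfreq(n, d=1.0 / fs)        # frequency of each bin in Hz
magnitude = np.abs(spectrum) * 2.0 / n        # scaled to the sinusoid amplitude

dominant = freqs[np.argmax(magnitude)]        # strongest frequency component
print(dominant, magnitude[50], magnitude[120])
```

Because both tones fall exactly on frequency bins, the scaled magnitudes recover the amplitudes of the two sinusoids; real vibration signals would generally show leakage across neighboring bins.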

Support Vector Machines.
The support vector machine (SVM) is a versatile and powerful machine learning technique [23]. It can be used to solve classification (both linear and nonlinear), regression, and even outlier detection problems, making it one of the most popular machine learning algorithms [23,24]. Its use is also popular in fault diagnosis of rotating machinery [25]. This technique aims to identify hyperplanes capable of separating datasets in high-dimensional feature spaces. The separation between the classes is called the margin, and the SVM maximizes it [23].
A linearly separable dataset allows the SVM to define hyperplanes capable of separating the data into categories, regardless of the number of dimensions presented by the feature space. However, in most applications, the information is not linearly separable in feature spaces with a given dimensionality. Thus, it is necessary to map the dataset into a feature space with a higher number of dimensions, in which the data will be linearly separable. This mapping process is performed by using kernels, e.g., polynomial and radial basis function kernels [23,24].
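The idea of mapping data into a higher-dimensional space can be sketched with a toy one-dimensional dataset. The points, labels, and helper names below are made up for illustration, and the mapping is written explicitly, whereas a kernel would evaluate the corresponding inner products implicitly.

```python
# Toy 1-D dataset (made up): inner points are class 0, outer points are class 1.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]

def separable_by_threshold(values, labels):
    """True if a single threshold on one feature splits the two classes."""
    ordered = [lab for _, lab in sorted(zip(values, labels))]
    # Separable iff the labels, sorted by feature value, change at most once.
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1

def feature_map(x):
    """Explicit degree-2 map x -> (x, x**2), standing in for a polynomial kernel."""
    return (x, x * x)

mapped = [feature_map(x) for x in xs]
separable_1d = separable_by_threshold(xs, labels)                      # original space
separable_2d = separable_by_threshold([m[1] for m in mapped], labels)  # threshold on x**2
print(separable_1d, separable_2d)
```

A threshold on the second coordinate is a hyperplane in the mapped two-dimensional space, so the classes that could not be split on the original axis become linearly separable.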

Multilayer Perceptron.
The multilayer perceptron (MLP) is a feedforward neural network. MLPs can distinguish nonlinearly separable patterns.
Those networks consist of several nodes, named "neurons," which are arranged in multiple layers as a directed graph. Each layer is fully connected to the subsequent one. Those layers are usually divided into three types: input, hidden, and output layers. Multilayer perceptrons are considered to be universal approximators: an MLP with one hidden layer and enough neurons can approximate any given continuous function [23,24].

Experimental Setup: Obtaining the Vibration Signals.
We arranged the experimental setup according to Figure 2. It was used to obtain the vibration measurements of the gearbox. The electric motor (M) drives the gearbox, composed of two gears (Z1 and Z2). Those gears are mounted on independent shafts. A magnetic brake (B) is connected to the output shaft. Table 1 lists some features of those components.
Besides, a Danfoss VLT 1.5 kW speed drive drives the electric motor, and a TDK Lambda voltage source (GEN 150-10, 0-150 V, 10 A) drives the magnetic brake. A unidirectional accelerometer (A), which was vertically placed on the gearbox, close to the input shaft, collects the vibration signals. This accelerometer is an IMI Sensor 603C01, 100 mV/g. An NI9234 data acquisition card digitizes the analog signals. This card has a 24-bit resolution and a 50 kHz sampling rate, and it is specific for piezoelectric sensors.
As previously mentioned, the proposed experiment aims to use the vibration signals of the gearbox to evaluate the severity of tooth breakage faults in helical gears. For this purpose, one tooth of the helical gear Z1 was subjected to different damage levels. On the other hand, gear Z2 has not been modified. Ten scenarios were taken into account, i.e., one for the gear Z1 unbroken and the others regarding nine fault severity levels of gear Z1. Those scenarios are listed in Figure 3 and Table 2.
We have also considered the gearbox working under different operating conditions; that is, we took into account different loads and rotation speeds. The rotation speed had five scenarios, in which it was constant in three of them and variable in the others. On the other hand, the load applied by the magnetic braking system had three scenarios, each with a constant load value. Those scenarios are detailed in Tables 3 and 4, for the rotation speeds and the loads, respectively.
We acquired each sample of vibration signal over a time interval of 10 s. Furthermore, we performed each combined scenario three times. This way, the database is originally composed of 45 signals for each fault severity, that is, a balanced database of 450 signals considering all the ten severity levels. The magnitude of those signals was normalized to the range [0, 1], and each signal was divided into 0.25 s excerpts, resulting in 1,800 signals for each fault severity level and 18,000 signals in the whole and balanced database.
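The database sizes quoted above follow directly from the number of scenarios, and the short sketch below reproduces that arithmetic. The variable names are illustrative, and the min-max helper is only one way to normalize to [0, 1]; whether the normalization was applied per signal or over the whole database is an assumption not settled by the text.

```python
speed_scenarios = 5         # rotation speed scenarios
load_scenarios = 3          # load scenarios
repetitions = 3             # each combined scenario was performed three times
severity_levels = 10        # healthy gear plus nine fault severity levels
signal_duration_s = 10.0
excerpt_duration_s = 0.25

signals_per_level = speed_scenarios * load_scenarios * repetitions
raw_signals = signals_per_level * severity_levels
excerpts_per_signal = int(signal_duration_s / excerpt_duration_s)
excerpts_per_level = signals_per_level * excerpts_per_signal
total_excerpts = excerpts_per_level * severity_levels

def min_max_normalize(samples):
    """Rescale a sequence to [0, 1]; per-signal scaling is an assumption."""
    lo, hi = min(samples), max(samples)
    return [(s - lo) / (hi - lo) for s in samples]

print(signals_per_level, raw_signals, excerpts_per_level, total_excerpts)
```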

Experimental Setup: Training the Classification System.
The system proposed to assess the fault severity of the gearbox is based on a deep convolutional neural network architecture.
Thus, a bidimensional representation of the input signals was necessary. We chose to represent them in the time-frequency domain, since this sort of representation allows one to visualize when the specific frequency components related to the failure arise. The short-time Fourier transform was the technique we used to generate the bidimensional representation of the signals, i.e., the Fourier transform spectrograms. The STFT has a lower computational cost than other time-frequency representation techniques [26]. This characteristic is especially important to the proposed system since we are dealing with a real-time application.
The STFT configuration included a Hamming window of size 128 and an overlap of 50%. These choices combine the selective property of the Hamming window with a balance between the smooth variation of the resulting signal and low computational cost.
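A minimal NumPy sketch of an STFT with the stated configuration (Hamming window of 128 samples, 50% overlap) is given below. The helper name `stft_magnitude`, the absence of padding at the signal edges, and the synthetic 5 kHz test tone are assumptions made for illustration; the excerpt length uses the 0.25 s duration and the 50 kHz sampling rate reported for the acquisition card, and the conversion of the result into a 175 × 175 image is not shown.

```python
import numpy as np

def stft_magnitude(x, window_size=128, overlap=0.5):
    """Magnitude STFT with a Hamming window; returns an array (freq bins, frames)."""
    hop = int(window_size * (1.0 - overlap))          # 64 samples for 50% overlap
    window = np.hamming(window_size)
    n_frames = 1 + (len(x) - window_size) // hop      # frames that fit without padding
    frames = np.stack([x[i * hop:i * hop + window_size] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

fs = 50_000                                 # sampling rate in Hz
n = int(0.25 * fs)                          # samples in one 0.25 s excerpt
t = np.arange(n) / fs
excerpt = np.sin(2 * np.pi * 5_000 * t)     # synthetic tone, not gearbox data
spec = stft_magnitude(excerpt)
peak_bin = int(np.argmax(spec.mean(axis=1)))
print(spec.shape, peak_bin * fs / 128)      # frequency of the strongest bin in Hz
```

With these settings each frequency bin spans fs/128, about 390.6 Hz, which makes the trade-off between time and frequency resolution explicit.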
Two experimental scenarios were designed. In the first one, the signal information was condensed into 175 × 175 pixel RGB images.
This sort of data allows more information to be provided to the fault severity assessment system, since artifices such as colormaps can be used. On the other hand, it increases the computational burden of the system because the input has 3 channels. In the second scenario, we used 175 × 175 pixel grayscale images. Unlike the previous scenario, the only information provided by the spectrograms is the magnitude of the Fourier transform. Figure 4 shows an example of a spectrogram obtained by the described process.
The classification system used in this work is composed of three convolutional layers, three max-pooling layers, one fully connected layer, and one output layer. Since the outputs of such a structure must be probability values between 0 and 1, the softmax activation function was used in the neurons of the output layer, and the ReLU activation function was used in the neurons of the remaining layers.
This architecture provided a satisfactory performance in the fault severity assessment of gearboxes, according to Monteiro et al. [10]. It is illustrated in Figure 5. In addition, a support vector machine was used to analyze the output of the system and improve its performance. This algorithm was already used in similar applications, like the one proposed by Li et al. [27], which employed the SVM to merge the results of classifiers that act on multimodal data.
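The softmax output layer and the conventional rule of keeping only the most probable class can be sketched as follows. The ten logits are made-up values, chosen so that two neighboring classes receive similar probabilities; the index-to-class mapping (index 0 for P1, index 5 for P6, and so on) is likewise an assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax: outputs are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up pre-activation values for the 10 severity classes.
logits = [0.2, 0.1, 0.0, 0.3, 0.1, 2.0, 1.9, 0.2, 0.1, 0.0]
probs = softmax(logits)

ranked = sorted(range(len(probs)), key=probs.__getitem__)
predicted = ranked[-1]      # conventional decision: index of the highest probability
runner_up = ranked[-2]      # information discarded by the conventional decision
print(predicted, runner_up, round(probs[predicted] - probs[runner_up], 3))
```

When the two largest probabilities are this close, the remaining nine values carry information that the argmax rule ignores, which is what a decision stage such as the SVM is meant to exploit.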
Regarding the training step, the data were split into three groups: training, test, and validation. They contain, respectively, 50%, 25%, and 25% of the signals, balanced across the fault severity levels. We used the validation dataset to reduce the occurrence of overfitting-related problems. The training process was performed according to 10 and 50 epoch scenarios. The configuration of the computer used to

Results and Discussion
The first discussion regards the training time of a fault diagnosis system based on deep convolutional neural networks. It is well known that computers with GPUs can handle the computational burden of deep learning solutions much better than those with only CPUs. On the other hand, computers with GPUs are more expensive, meaning that they are not always available. Table 5 illustrates this problem. It shows the average training time of 30 DCNNs (in each scenario) with the architecture mentioned in the last section, regarding computers with different configurations and the RGB image dataset. The models were trained for 50 epochs. We trained this number of models to guarantee the statistical relevance of the results. The first computer configuration (GPU computer) was used by Monteiro et al. [10]. It consisted of Ubuntu 16.04 LTS (64 bits) as the operating system, 15.6 GB of memory, an Intel Xeon(R) CPU E5-2609 v3 @ 1.90 GHz × 12 processor, and Gallium 0.4 on NV117 graphics. The second one was presented in the previous section.
One can observe that the training process for the computer without the GPU was much longer than the one for the computer with GPU; i.e., it was about 13 times longer. In some situations, depending on the amount of data or time available, the use of computers without GPUs can be impractical.
Some choices can be made to overcome this problem. Reducing the number of training epochs is one of them. As can be observed in Table 6, reducing the number of training epochs from 50 to 10 decreased the average training time by about 78.7%. On the other hand, such a reduction had a performance cost: the average accuracy dropped by about 1.8%.
This behavior was expected, since the models had fewer iterations to learn the features of the training data. The results in Table 6 were obtained by training 30 DCNNs in each scenario.
Before proposing modifications to the fault diagnosis system, it is necessary to identify the major difficulties of the model. We have taken the model trained for 10 epochs as the reference. Table 7 lists the average and standard deviation values of the accuracy for 30 models. One can observe that for some classes, e.g., P1 and P2, the system presented high accuracy values, i.e., close to 100%. On the other hand, the models presented a poor performance for the input data from classes such as P6 and P7.
This analysis can be deepened by observing the outputs of the classifier. Figure 6 shows how the output probabilities of the models are distributed according to the class of the input image. Regarding the inputs belonging to class P1, one can observe that the output probabilities of the models were very close to 1 for class P1 and very close to 0 for the others. It helps to understand why the model accuracy for this class was 100%. On the other hand, the distribution profiles belonging to classes P6 and P7 show that the outputs of the networks were not as accurate as in the previous case. Indeed, choosing only the output that presented the highest probability value can lead to wrong classifications due to the significant presence of outliers.
To overcome this issue, we proposed a solution based on using the output probabilities of all the ten classes to perform the correct classification. Such a solution can be implemented in several ways. One of them is the use of shallow classifiers, e.g., multilayer perceptron or support vector machine. Those classifiers identify the gear severity class by using the information contained in the output probabilities of the deep convolutional neural network. Thus, the system response is obtained by analyzing a probability distribution, and not only a single value.
We used the support vector machine in this research. It was trained with the outputs of the DCNN regarding the training data previously established. The results of the proposed modification are listed in Tables 8 and 9. Table 8 shows the average results of each class, regarding 30 models, and compares them with the results of the scenario without the additional classifier. Table 9 shows the average results regarding all classes and models, also compared with the scenario without the additional classifier.
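The decision stage can be sketched as an SVM trained on probability vectors instead of on the raw inputs. The example below is only an illustration under several assumptions: it uses scikit-learn's `SVC` with an RBF kernel (the text does not state the kernel or the library), three classes instead of ten for brevity, and synthetic Dirichlet-distributed vectors standing in for real DCNN outputs, built so that one class is often misclassified by the highest-probability rule.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for softmax outputs (NOT results from the paper).
# Class 1 is built so that its largest probability often falls on class 2.
prototypes = np.array([[0.80, 0.10, 0.10],
                       [0.30, 0.30, 0.40],
                       [0.05, 0.25, 0.70]])

def sample_outputs(n_per_class, concentration=60.0):
    """Draw probability vectors scattered around each class prototype."""
    probs, labels = [], []
    for label, proto in enumerate(prototypes):
        probs.append(rng.dirichlet(proto * concentration, size=n_per_class))
        labels.append(np.full(n_per_class, label))
    return np.vstack(probs), np.concatenate(labels)

p_train, y_train = sample_outputs(200)
p_test, y_test = sample_outputs(100)

# Conventional decision: pick the class with the highest probability.
argmax_acc = np.mean(np.argmax(p_test, axis=1) == y_test)

# Decision stage: an SVM fed with the whole probability vector.
decision_stage = SVC(kernel="rbf")          # kernel choice is an assumption
decision_stage.fit(p_train, y_train)
svm_acc = np.mean(decision_stage.predict(p_test) == y_test)
print(f"argmax: {argmax_acc:.3f}  decision stage: {svm_acc:.3f}")
```

In this toy setting the SVM can recover the confused class because its probability profile is distinctive even when its highest value points to the wrong class; the size of the gain on real data depends on how the DCNN errors are distributed.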
From Table 8, one can infer that the inclusion of a classifier improved the model performance regarding all the 10 classes, both concerning average accuracy and standard deviation. Also, from Table 9, one can observe that the average accuracy increased by about 2.56% at the cost of adding less than 1 second to the average training time. These results are even more significant when compared with those obtained from the training process with 50 epochs, seen in Table 5. The average accuracy of the proposed model was only 0.76% higher, but with an average training time 78.64% smaller. This made the training of the fault diagnosis system significantly faster. Furthermore, we employed two additional metrics to ensure the reliability of the obtained results: the F-score and AUC. The first one is the harmonic mean of the precision and recall. The second metric, on the other hand, is defined as the area under the receiver operating characteristic (ROC) curve. Their average values are listed in Table 10, and both of them show an improvement trend aligned to the one observed in Table 9, i.e., the diagnosis system with the classifier presented values for the metrics about 2% higher than that without the classifier.
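For reference, the two metrics can be computed as in the sketch below, written for a single class treated as positive against all others. How the per-class values were averaged in this work is not stated, so the one-vs-rest framing is an assumption, and the labels and scores are made up. The AUC is computed through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one.

```python
def f_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise comparisons (ties count as 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels, hard predictions, and scores for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(f_score(y_true, y_pred), auc(y_true, scores))
```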
Regarding the average time to perform the classification of one single input, the addition of a decision stage did not cause significant changes. Indeed, the average classification time, which was about 0.03 seconds without the decision stage, increased less than 0.001 seconds.
To evaluate how significant the improvements provided by the proposed solution were, the Wilcoxon signed-rank statistical test was applied to the outputs of the fault diagnosis systems with and without the decision stage. The results are listed in Table 11. The Wilcoxon test is a nonparametric hypothesis test that can be used to assess whether two distributions are equivalent or not [28]. If they are not, it means that there was a statistically significant improvement (Λ symbol). Otherwise, the improvement was not significant (-symbol). Table 11 shows that we had significant improvements for classes P4, P5, P6, P7, and P9. Although the remaining classes have also shown some improvement, it was not statistically significant.
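The mechanics of the signed-rank test can be sketched in plain Python as follows. This is a simplified version that relies on the normal approximation without tie or continuity corrections, so its p-values are only indicative for small samples; a library routine such as `scipy.stats.wilcoxon` would normally be preferred. The paired values are made-up integers and do not correspond to any table in this paper.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Return (W, p) for paired samples using the normal approximation."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # zero differences are discarded
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0                # average rank for tied magnitudes
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4.0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w - mean) / std
    p = math.erfc(abs(z) / math.sqrt(2.0))            # two-sided p-value
    return w, p

# Made-up paired scores (e.g., correct classifications out of 100 per model).
with_stage = [95, 97, 96, 98, 94, 99, 97, 96, 98, 95]
without_stage = [92, 96, 93, 94, 95, 93, 92, 94, 91, 90]
w, p = wilcoxon_signed_rank(with_stage, without_stage)
print(w, round(p, 4), p < 0.05)
```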
We also analyzed the performance of the decision stage when a different classification algorithm was applied. This algorithm was the multilayer perceptron, i.e., a neural network. We set the size of the input layer equal to 10, the hidden layer with 21 neurons, and the output layer with one neuron. We deployed the logistic sigmoid as the activation function. We set the number of training epochs equal to 200. The number of training, test, and validation samples remained the same as in the SVM scenario. We trained 30 MLPs to guarantee the statistical relevance of the results. We show the results for the SVM and MLP in Tables 12 and 13. Table 12 lists the average values of F1-score, AUC, and accuracy. Table 13 lists the average training and running times.
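The layer sizes described above can be made concrete with a forward pass written in NumPy. The weights below are random and untrained, the dummy input is a batch of uniform probability vectors, and the way the single output neuron encodes the ten severity levels is not specified in the text, so this sketch only illustrates the shapes and the logistic sigmoid activation.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(10, 21)), np.zeros(21)   # input -> hidden
W2, b2 = rng.normal(scale=0.1, size=(21, 1)), np.zeros(1)     # hidden -> output

def mlp_forward(p):
    """Forward pass of a 10-21-1 MLP with sigmoid activations in both layers."""
    hidden = sigmoid(p @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)

batch = np.full((4, 10), 0.1)       # four dummy probability vectors
out = mlp_forward(batch)
n_params = W1.size + b1.size + W2.size + b2.size
print(out.shape, n_params)
```

The small parameter count is consistent with the decision stage adding little to the overall cost of the system.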
Regarding the metrics, despite the slight advantage presented by the SVM decision stage concerning the accuracy, a Wilcoxon test suggested that the results achieved by both algorithms were equivalent. It means that the proposed solution can be implemented with classifiers other than the SVM. On the other hand, an interesting fact arises from the training and running times. The training process of the MLP decision stage was about 7.1 times longer than that of the SVM, whereas its running time was more than 20 times faster. Which side of this trade-off is preferable depends on the intended application.
Those analyses with the SVM decision stage were also performed regarding the grayscale spectrogram images. They aimed to evaluate how the fault diagnosis system would deal with the reduction of the available information. Table 14 shows the average accuracy and the average training time for 30 models belonging to each scenario. These scenarios regarded models trained with RGB and grayscale images, for 10 epochs. Only the test data were employed in the calculation of the average accuracy. One can observe that the models trained with RGB images provided the best results regarding accuracy, i.e., about 10% higher. On the other hand, the average training time of the models trained with grayscale images was lower. It probably happened due to the smaller amount of information that was processed. Table 15 and Figure 7 help to understand what is happening to the performance of the classification model trained using the grayscale images, regarding all the ten classes. One can observe that, compared with the RGB scenario, even the results of classes like P1 worsened, for both average accuracy and standard deviation. Figure 7 shows how the output probabilities of the DCNN are distributed according to the class of the input image. The number of outliers increased significantly in the grayscale scenario. This behavior occurred for all classes. The results of adding the decision stage in this scenario are listed in Tables 16 and 17. Table 16 shows the average results for each class regarding 30 trained models.
Those results are compared with those of the scenario without the additional classifier. Table 17 shows the average results regarding all classes and models, also comparing them with the scenario without the additional stage. From Table 16, one can infer that the inclusion of a classifier improved the model performance regarding all the ten classes, both regarding average accuracy and standard deviation. On the other hand, even these improved systems were not capable of outperforming those trained with the RGB spectrogram images, as seen in Table 14. Besides, from Table 17 one can observe that the average accuracy increased by about 4.18% at the cost of adding less than 1 second to the average training time.
This relative improvement was superior to the one observed in the previous experimental scenario. Table 18 shows the average F-score and AUC for the systems with and without the decision stage. These two metrics reinforce the improvement trend caused by the addition of a decision stage.
Regarding the average classification time, in this scenario, the addition of a decision stage did not cause significant changes either. It increased from 0.022 seconds to less than 0.023 seconds.
To evaluate how significant the improvement provided by the proposed solution was in this new scenario, the Wilcoxon signed-rank statistical test was also applied to the outputs of the fault diagnosis systems with and without the decision stage. The results are listed in Table 19, which shows that we had significant improvements for classes P4, P5, P6, P7, P8, and P10. Once again, although the other classes have also shown improvements, they were not significant.

Conclusions
We analyzed the use of a decision stage to interpret the outputs of a fault diagnosis system based on deep convolutional neural networks. Those outputs correspond to the probability that an input belongs to the classes of a given set.
This way, instead of using a conventional approach like choosing the class with the highest probability value, we analyzed the output distribution of the deep classifier to perform more reliable fault diagnosis. The results have shown that we could improve the accuracy of the classification system and reduce the training time by almost 80% without compromising the execution time, which increased by about 0.001 seconds. This improvement is especially significant for situations in which powerful hardware, e.g., graphical processing units, is not available.
Thus, a fault diagnosis system with a given accuracy value can be obtained by using only a small fraction of the training time that would be required to perform the complete training. Those results were achieved by using the SVM as the decision maker, which had the output probabilities of the original fault diagnosis system as input information. Similar results were achieved by implementing an MLP as the decision maker. It suggests that the proposed solution can also be implemented with algorithms other than the SVM.
We also assessed the use of RGB and grayscale input spectrograms. Although the addition of a decision stage caused improvements in both scenarios, these improvements presented different magnitudes.
The accuracy of systems operating on grayscale images increased more than that in the other scenario. However, the final accuracy of those systems trained on RGB images was superior. This behavior can be explained by the amount of information available in each kind of image, as previously discussed. Furthermore, we saw that the difference between the execution times in both scenarios was not significant in absolute terms. Thus, it suggests that using RGB images would not compromise the operation of the system in real-time applications.
Future work includes applying this fault diagnosis methodology to other kinds of failures and to problems belonging to different physical domains, e.g., fault diagnosis using acoustics. Moreover, we can evaluate the performance of other algorithms employed as decision makers.

Data Availability
The data used to support the results of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.