Deep Domain Adaptation Model for Bearing Fault Diagnosis with Domain Alignment and Discriminative Feature Learning



Introduction
Traditional machine learning techniques, especially deep learning, have recently made great achievements in the data-driven fault diagnosis field [1][2][3][4][5][6]. Most machine learning methods assume that the training data (source domain) and test data (target domain) must be in the same working condition and have the same distribution and feature space. However, in many real-world working conditions, the distribution of source domain samples is different from that of target domain samples, resulting in performance degradation.
To address this challenge, the main research on domain adaptation techniques focuses on how a machine learning model built in a source domain can be adapted in a different but related target domain, which is necessary to avoid reconstruction efforts. In the field of knowledge engineering, many beneficial and promising examples with domain adaptation have been found, including image classification, object recognition, natural language processing, and feature learning [7][8][9][10].
In recent years, considerable research has been conducted on domain adaptation on the basis of deep architectures. Most published deep domain adaptation works can be roughly divided into three categories [11]: (1) discrepancy-based, (2) adversarial adaptation, and (3) reconstruction-based methods.
Typical discrepancy-based methods are shown in [12,13]. They are usually implemented by adding a loss to minimize the distribution discrepancy between the source and target domains in the shared feature space. For example, Tzeng et al. [12] applied a single linear kernel to one layer for minimizing the maximum mean discrepancy (MMD), whereas Long et al. [13,14] minimized MMD by applying multiple kernels to multiple layers across domains. Another impressive work is Deep Coral [15], which extends CORAL [16] to deep architectures and aligns the second-order statistics of the source and target distributions.
Another increasingly popular line of work is adversarial domain adaptation, which includes adversarial discriminative and generative methods. The former aims to encourage domain confusion through an adversarial objective with respect to a domain discriminator. Tzeng et al. [17] proposed a unified adversarial domain adaptation framework that combines discriminative modeling, untied weight sharing, and a Generative Adversarial Network (GAN) loss [18]. Among the discriminative models, Tzeng et al. proposed a model with confusion loss [19] and also considered an inverted label GAN loss [17], whereas Ganin et al. [20] proposed a model with minimax loss. The latter combines the discriminative model with a generative component on the basis of GANs. Liu and Tuzel [21] developed a Coupled Generative Adversarial Network (CoGAN) that adopts two GANs, each corresponding to one of the domains; the CoGAN learns a joint distribution of multidomain samples and enforces a weight-sharing constraint to limit the network capacity. The method presented in [22] also adopts GANs to generate source domain images that appear as if drawn from the target domain.
Typical reconstruction-based methods can be seen in [23][24][25]. Data reconstruction can be viewed as an auxiliary task to support the adaptation of the label prediction. Ghifary et al. [23] combined the standard convolutional network for source label prediction with a deconvolutional network [26] for target data reconstruction. Bousmalis et al. [24] introduced the notion of private and shared subspaces for each domain. Meanwhile, a reconstruction loss is integrated in the model by using a shared decoder, which learns to reconstruct the input sample from domain-specific and shared features. Tan et al. [25] presented a selective learning algorithm that uses the reconstruction error to select useful unlabeled data from intermediate domains.
Despite the success achieved by domain adaptation, limited research can be found with respect to its application to fault diagnosis. Zhang et al. [27] took raw vibration signals as inputs of a deep convolutional neural network with wide first-layer kernels (WDCNN). They also used adaptive batch normalization (AdaBN) as the domain adaptation algorithm to realize fault diagnosis under different load conditions and noisy environments. Lu et al. [28] introduced a deep CNN model with domain adaptation for fault diagnosis; this model integrates MMD as a regularization term into the loss function to reduce the cross-domain distribution difference. Zhang et al. [29] developed an adversarial domain adaptation model for fault diagnosis, which comprises a source feature extractor, a target feature extractor, a domain discriminator, and a label classifier. Jian et al. [30] proposed a fusion CNN model that combines 1DCNN and Dempster-Shafer evidence theory to enhance the cross-domain adaptive capability for fault diagnosis. Tong et al. [31] proposed a bearing fault diagnosis domain adaptation method to find transferable features across domains, which were obtained by reducing the discrepancies of the marginal and conditional distributions simultaneously based on MMD. Li et al. [32] presented a deep domain adaptation method for bearing fault diagnosis on the basis of the multikernel maximum mean discrepancies between domains in multiple layers, so that representations learned from the source domain can be applied to the target domain. Furthermore, Han et al. [33] proposed a new intelligent fault diagnosis framework, which extends the marginal distribution adaptation to joint distribution adaptation and guarantees an accurate distribution adaptation.
Improving CNN performance by learning additional discriminative features has become a recent trend. For example, contrastive loss [34] and center loss [35] are presented to learn discriminative deep features for face verification and recognition. Furthermore, Liu et al. [36] proposed a large-margin softmax loss to extend the softmax loss to large margin softmax, which leads to a large angular separability between the learned features. Chen et al. [37] proposed two discriminative feature learning approaches, namely, instance-based and center-based discriminative losses, with joint domain alignment and discriminative feature learning.
Inspired by these methods, we propose a novel deep domain adaptation model for bearing fault diagnosis with Deep Coral and center-based discriminative feature learning. By combining domain alignment and discriminative feature learning, the domain-invariant features extracted by the model can be well clustered and separable, which can clearly contribute to domain adaptation and classification. The main contributions of this work include the following:
(1) An end-to-end method with domain adaptation for fault diagnosis is proposed, that is, CACD-1DCNN. This method directly works on raw temporal signals and does not require time-consuming denoising preprocessing and a separate feature extraction algorithm.
(2) By combining domain alignment and discriminative feature learning, CACD-1DCNN aims to extract domain-invariant features with improved intraclass compactness and interclass separability and guarantees high classification performance in the two domains.
(3) Extensive experiments on the Case Western Reserve University (CWRU) bearing datasets demonstrate that CACD-1DCNN achieves superior diagnosis performance over existing baseline methods.
(4) Furthermore, network visualization and loss analysis provide an intuitive presentation of the adaptation results and verify the effectiveness of our method.
The remaining parts are organized as follows. In Section 2, the domain adaptation problem of fault diagnosis is formulated and the basic theories of Deep Correlation Alignment and Center-Based Discriminative Loss are introduced. In Section 3, the proposed intelligent fault diagnosis method, CACD-1DCNN, is presented. The comparison methods, the experiments, and discussion are given in Section 4. Finally, the conclusions are drawn in Section 5.

Formally, domain $D$ is composed of an $m$-dimensional feature space $\mathcal{X} \subset \mathbb{R}^m$ with marginal probability distribution $P(X)$, where $X = \{x_1, \ldots, x_n\} \in \mathcal{X}$. Task $J$ consists of two components: label space $\mathcal{Y}$ and predictive function $f(X)$ corresponding to the labels. $f(X) = Q(Y \mid X)$ is also the conditional probability distribution, and $Y \in \mathcal{Y}$, where $\mathcal{Y}$ represents the possible machine health conditions.

The following is the formal definition of the unsupervised domain adaptation problem for fault diagnosis. Given labeled source domain $D_S = \{(x_s^i, y_s^i)\}_{i=1}^{N_s}$ and unlabeled target domain $D_T = \{x_t^i\}_{i=1}^{N_t}$, where $N_s$ and $N_t$ are the numbers of samples of the source and target domains, respectively. The feature spaces of $D_S$ and $D_T$ (that is, $\mathcal{X}_s = \mathcal{X}_t$), the label spaces (that is, $\mathcal{Y}_s = \mathcal{Y}_t$), and the conditional probability distributions $Q_s(y_s \mid x_s) = Q_t(y_t \mid x_t)$ are assumed to be the same. However, the marginal probability distributions of the two domains are different, that is, $P_s(x_s) \neq P_t(x_t)$. Unsupervised domain adaptation aims to use labeled $D_S$ to learn classifier $f(x_t) \to y_t$ for predicting labels $y_t$ of $D_T$, where $y_t \in \mathcal{Y}_t$.

Deep Correlation Alignment.
To fill the gap between the domains, CORAL loss is adopted by aligning the second-order statistics of the source and target features. In the activations computed at a given layer, $x_s^i$ and $x_t^i$ are $d$-dimensional representations. The domain discrepancy loss measured by CORAL loss ($L_{CA}$) [16], as shown below, minimizes the distribution discrepancy between the second-order statistics (covariances) of the source and target features:

$$L_{CA} = \frac{1}{4d^2} \left\| C_S - C_T \right\|_F^2, \tag{1}$$

where $\|\cdot\|_F^2$ denotes the squared matrix Frobenius norm and $C_S$ and $C_T$ are the covariance matrices of the source and target features, respectively. According to reference [16], $C_S$ and $C_T$ are, respectively, computed as follows:

$$C_S = \frac{1}{N_S - 1} X_S^{\top} J X_S, \qquad C_T = \frac{1}{N_T - 1} X_T^{\top} J X_T, \tag{2}$$

where the rows of $X_S$ and $X_T$ are the source and target features and $J$ is the centering matrix [38]. Taking the source domain as an example, $J$ is a matrix of $N_S \times N_S$, and it is derived as follows:

$$J = I_{N_S} - \frac{1}{N_S} \mathbf{1}\mathbf{1}^{\top}, \tag{3}$$

where $\mathbf{1}$ is the all-ones column vector of length $N_S$. The training process is realized by mini-batch Stochastic Gradient Descent (SGD), in which only a batch of training samples is aligned in each iteration.
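As an illustration of the correlation alignment loss described in this subsection, the following NumPy sketch computes batch covariances through the centering matrix and takes their squared Frobenius distance. It assumes the standard CORAL form with the $1/(4d^2)$ scaling and the unbiased $1/(N-1)$ covariance normalization; the function names `covariance` and `coral_loss` and the random batches are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def covariance(features):
    """Covariance of an (n, d) feature batch via the centering matrix J."""
    n = features.shape[0]
    # J = I - (1/n) * 1 1^T removes the batch mean from every column.
    centering = np.eye(n) - np.ones((n, n)) / n
    return features.T @ centering @ features / (n - 1)

def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1/(4 d^2)."""
    d = source.shape[1]
    diff = covariance(source) - covariance(target)
    return np.sum(diff ** 2) / (4.0 * d * d)

# Hypothetical mini-batches: 32 samples with 8-dimensional features each.
rng = np.random.default_rng(0)
xs = rng.normal(size=(32, 8))          # stand-in for source features
xt = 3.0 * rng.normal(size=(32, 8))    # stand-in for target features with larger variance

print(coral_loss(xs, xs))  # identical batches have identical covariances
print(coral_loss(xs, xt))  # differing second-order statistics give a positive loss
```

Because the loss depends only on the batch covariances, it can be evaluated on each mini-batch without access to target labels, which is what makes it usable in the unsupervised setting.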

Center-Based Discriminative Loss.
To make the deep features learned by the deep CNN model further discriminative, center-based discriminative loss $L_{CD}$ is adopted [37], which is different from center loss [35]. The latter penalizes the distance of each sample to the center of the corresponding class, whereas the former not only has the characteristic of center loss but also enforces large margins among centers across different categories. Center-based discriminative loss is defined as follows:

$$L_{CD} = \beta \sum_{i=1}^{N_s} \max\left(0, \left\| f_i^s - c_{y_i} \right\|_2^2 - m_1\right) + \sum_{\substack{i,j=1 \\ i \neq j}}^{c} \max\left(0, m_2 - \left\| c_i - c_j \right\|_2^2\right). \tag{4}$$

The loss is composed of two items. The first item is used to measure the intraclass compactness, whereas the second item is used to measure the interclass separability. $\beta$ is the trade-off parameter, and $m_1$ and $m_2$ are the two constraint margins. $f_i^s \in \mathbb{R}^d$ represents the deep features of the $i$-th training sample in the fully connected layer, and $d$ is the number of neurons of the fully connected layer. $c_{y_i} \in \mathbb{R}^d$ denotes the class center of the deep features for class $y_i$, with $y_i \in \{1, 2, \ldots, c\}$, where $c$ is the number of classes.
Equation (4) shows that the center-based discriminative loss forces the distance between each sample and its class center to be no more than $m_1$ and the distance between the centers of different classes to be no less than $m_2$. This penalty item therefore makes the deep features further discriminative.
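The two hinge terms of the center-based discriminative loss can be sketched in NumPy as follows. This is a minimal sketch under assumptions: squared Euclidean distances, a trade-off weight applied to the intraclass term (as in [37]), and a sum over ordered pairs of distinct centers. The function name, the parameter name `intra_weight`, and the toy values are hypothetical.

```python
import numpy as np

def center_discriminative_loss(features, labels, centers, intra_weight, m1, m2):
    """Hinge penalty on sample-to-center and center-to-center squared distances."""
    # Intraclass term: penalize samples farther than m1 from their own class center.
    sq_dist = np.sum((features - centers[labels]) ** 2, axis=1)
    intra = np.sum(np.maximum(0.0, sq_dist - m1))
    # Interclass term: penalize pairs of distinct centers closer than m2.
    pairwise = centers[:, None, :] - centers[None, :, :]
    center_sq = np.sum(pairwise ** 2, axis=2)
    off_diagonal = ~np.eye(len(centers), dtype=bool)
    inter = np.sum(np.maximum(0.0, m2 - center_sq[off_diagonal]))
    return intra_weight * intra + inter

# Toy example: two class centers and three 2-D samples.
centers = np.array([[0.0, 0.0], [10.0, 0.0]])
features = np.array([[0.0, 1.0], [10.0, 0.0], [0.0, 3.0]])
labels = np.array([0, 1, 0])

# Only the third sample (squared distance 9 > m1 = 2) is penalized; the centers are
# already farther apart than m2 = 50, so the interclass term vanishes.
loss_far_centers = center_discriminative_loss(features, labels, centers, 0.5, 2.0, 50.0)
# With a larger margin m2 = 120 the two centers are considered too close.
loss_close_centers = center_discriminative_loss(features, labels, centers, 0.5, 2.0, 120.0)
print(loss_far_centers, loss_close_centers)
```

Samples inside the margin $m_1$ and center pairs beyond the margin $m_2$ contribute nothing, so the loss only acts on the samples and centers that violate the constraints.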

Shock and Vibration
Ideally, class center $c_{y_i}$ should be calculated by averaging the deep features of all samples of the class, which is evidently inefficient and unrealistic. In practical applications, we update the central point through the mini-batch training samples. In each iteration, we calculate the central point by averaging the features of the corresponding class. The update formula in each iteration is presented as follows:

$$\Delta c_j = \frac{\sum_{i=1}^{b} \delta(y_i = j)\,(c_j - f_i^s)}{1 + \sum_{i=1}^{b} \delta(y_i = j)}, \tag{5}$$

$$c_j^{t+1} = c_j^{t} - \lambda \cdot \Delta c_j^{t}. \tag{6}$$

When the condition is true, $\delta(\text{condition}) = 1$, and 0 otherwise. $b$ denotes the batch size, and $\lambda$ is the learning rate. Every class center is initialized as the "batch class center" in the first iteration, and it is updated according to (5) and (6) for the next batch of samples in each iteration.
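The mini-batch center update can be sketched as below. The sketch assumes the update rule commonly used with center-based losses, in which each center moves toward the mean of the batch features of its class with a damping term of 1 in the denominator; `update_centers` and the toy values are hypothetical names and data chosen for illustration.

```python
import numpy as np

def update_centers(centers, features, labels, learning_rate):
    """Move each class center toward the batch features of that class."""
    updated = centers.copy()
    for j in range(len(centers)):
        mask = labels == j
        # Sum of (center - feature) over batch samples of class j; the "1 +" in the
        # denominator keeps the step finite when a class is absent from the batch.
        delta = (centers[j] - features[mask]).sum(axis=0) / (1 + mask.sum())
        updated[j] = centers[j] - learning_rate * delta
    return updated

# Toy batch: both samples belong to class 0, class 1 is absent from the batch.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
features = np.array([[2.0, 0.0], [4.0, 0.0]])
labels = np.array([0, 0])

new_centers = update_centers(centers, features, labels, learning_rate=0.5)
print(new_centers)  # class 0 moves toward its samples, class 1 stays where it was
```

A class that does not appear in the current batch receives a zero update, so its center is simply carried over to the next iteration.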

CACD-1DCNN Fault Diagnosis Model.
CACD-1DCNN is proposed to solve the cross-domain learning problem in the bearing fault diagnosis area. As illustrated in Figure 1, taking CNN as the main architecture, a two-stream CNN architecture with shared weights is adopted, and the model employs a domain adaptation layer with correlation alignment and center-based losses before the classifier. As the inputs of the two streams, the labeled source and unlabeled target data are fed into the CACD-1DCNN model during the training process. Subsequently, discriminative, domain-invariant features are extracted from the raw vibration signals through the multiple convolutional and pooling layers. The distribution discrepancy is minimized at the last fully connected layer. Theoretically, correlation alignment can be performed at multiple layers in parallel. Empirical evidence [15,39] indicates that a solid performance is obtained even if this alignment is conducted only once. As a common practice, correlation alignment loss ($L_{CA}$) is applied after the last fully connected layer. Similarly, center-based discriminative loss ($L_{CD}$) is generally placed after the last fully connected layer. Therefore, the two kinds of loss functions in the proposed model are trained on the basis of the features extracted by the last fully connected layer. In addition to the conventional softmax loss function ($L_s$) based on the source domain, the loss function of the CACD-1DCNN model is defined as follows:

$$L = L_s + \alpha L_{CA} + \beta L_{CD}, \tag{7}$$

where $\alpha$ and $\beta$ are the trade-off parameters for balancing the contributions of the domain discrepancy and discriminative losses.
Only the source domain data are used for discriminative learning here. During the training process, the source features are learned discriminatively and aligned with the target features. Joint training with classification, correlation alignment, and center-based discriminative losses between the two domains in the last fully connected layer can adapt the learned representations in the source domain for application to the target domain. This joint training can also guarantee domain-invariant features with improved intraclass compactness and interclass separability. Meanwhile, the extracted features can efficiently improve the cross-domain testing performance.
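To make the joint training objective concrete, the sketch below combines a softmax cross-entropy loss on source labels with weighted alignment and discriminative terms. It assumes the total loss is a plain weighted sum; the default weights 100 and 0.003 are the values held fixed in the sensitivity analysis later in the paper, while the function names and the placeholder loss values are hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of source samples."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def joint_loss(l_s, l_ca, l_cd, alpha=100.0, beta=0.003):
    """Weighted sum of classification, alignment, and discriminative losses."""
    return l_s + alpha * l_ca + beta * l_cd

# Uninformative logits over 4 classes give a cross-entropy of ln(4).
logits = np.zeros((5, 4))
labels = np.array([0, 1, 2, 3, 0])
l_s = softmax_cross_entropy(logits, labels)

# Placeholder values standing in for the alignment and discriminative losses.
total = joint_loss(l_s, l_ca=0.01, l_cd=10.0)
print(l_s, total)
```

The classification and discriminative terms need labels and are therefore computed on source samples only, whereas the alignment term uses features from both streams.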

Architecture Designs of 1DCNN.
Considering that the bearing vibration signals collected by acceleration sensors are usually 1D, using a 1DCNN is reasonable for the processing of vibration signals. In this study, the 1DCNN is adopted to deal with bearing fault diagnosis. The network structure is composed of four convolution and pooling layers, two fully connected layers, and a softmax layer at the end.
The first convolutional layer uses a wide kernel for extracting features and suppressing high-frequency noise. Small convolutional kernels in the following layers are used to deepen the network for multilayer nonlinear mapping and to prevent overfitting [27]. The parameters of the 1DCNN are presented in Table 1. The pooling type is max pooling, and the activation function is ReLU. To minimize the loss function, the Adam stochastic optimization algorithm is applied to train our model, and the learning rate is set to 1e-4. The experiments are conducted using Google's TensorFlow toolbox.

Data Augmentation.
Without sufficient training samples, the model can easily overfit. Data augmentation techniques are commonly used in computer vision to improve the generalization of networks by increasing the number of training samples. In fault diagnosis, the vibration signals collected by the acceleration sensor are 1D, and overlap sampling can easily obtain a large amount of data by slicing the training signals with overlap. Figure 2 illustrates a vibration signal with 120,000 points. We can take 2,048 data points from this signal as a sample. We can also offset the window by a certain amount to obtain the second sample.

In this study, vibration signals of different fault locations and different health states with a sampling frequency of 12 kHz at the driving end of the rolling bearing are selected for experimental research. The detailed description of the datasets is shown in Table 2. Three datasets are acquired …

Methods 1-3 and 6 work with the data transformed through fast Fourier transform, whereas methods 4 and 5 are CNN-based methods that work with normalized raw signals. Notably, the OFNN in [30] is not used here because the OFNN uses the diagnostic result of data fusion between the drive-end and fan-end datasets, which are different from the datasets used in the experiments in this research. By contrast, the datasets used by OFNN-DE are the same as the datasets used in the experiments in this study. Furthermore, the diagnosis accuracy of the OFNN is slightly lower than that of the method used in this research.
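The overlap sampling described under Data Augmentation can be sketched as follows. The signal length of 120,000 points and the sample length of 2,048 points come from the text; the offset between consecutive windows is not stated there, so the stride of 64 used here is purely an assumption, as are the function name `overlap_slices` and the synthetic signal.

```python
import numpy as np

def overlap_slices(signal, length=2048, stride=64):
    """Cut a 1-D signal into windows of `length` points, shifted by `stride` points."""
    count = (len(signal) - length) // stride + 1
    return np.stack([signal[i * stride : i * stride + length] for i in range(count)])

# Synthetic stand-in for one vibration record with 120,000 points.
signal = np.arange(120_000)

augmented = overlap_slices(signal, length=2048, stride=64)     # overlapping windows
disjoint = overlap_slices(signal, length=2048, stride=2048)    # no overlap, for comparison
print(augmented.shape, disjoint.shape)
```

A smaller stride yields more training samples from the same record, at the cost of higher correlation between neighboring samples.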

Experimental Analysis of the
For a fair comparison, we adopt the accuracy reported by other authors with the same setting or conduct experiments by using the source code provided by the authors.
A total of 10 experiments are conducted for each domain transfer scenario to reduce the influence of random factors. These results show that the domain adaptation performance of the proposed method is remarkable and stable. The comparison with other approaches is shown in Figure 4. The average performance of the CACD-1DCNN is better than that of the A2CNN and six other baseline methods. The CACD-1DCNN also achieves the best average domain adaptation accuracy across all domain transfer scenarios.
As illustrated in Figure 4, the performance of SVM, MLP, and DNN in domain adaptation is poor, with average accuracies of 66.63%, 80.40%, and 78.05% in the six scenarios, respectively. These results suggest that the sample distribution differs under varying conditions and the model trained in one working condition is unsuitable for fault diagnosis in another condition.
Compared with recent approaches, such as OFNN-DE and A2CNN, our method achieves an average accuracy of 99.47%, which is higher than those of OFNN-DE and A2CNN with average accuracies of 98.73% and 99.21%, respectively. This result shows that the features learned by the proposed method have better domain invariance and fault discrimination than those learned by other methods.
In five out of six shifts, that is, A ⟶ B, A ⟶ C, B ⟶ C, C ⟶ B, and C ⟶ A, the fault diagnosis accuracy of the proposed method achieves state-of-the-art domain adaptation performance and reaches up to 100% in the first four domain shifts. In the domain transfer scenario of B ⟶ A, the accuracy of the proposed method is 98%, which is 0.18 and 0.5 percentage points lower than those of the A2CNN and OFNN-DE methods, respectively, and far better than the accuracies of the SVM, MLP, DNN, and WDCNN methods. On this basis, the CACD-1DCNN can learn domain-invariant and fault-discriminative features well and effectively solve the domain adaptation problem caused by different loads of bearing data.
Taking the domain shift scenarios of C ⟶ B, C ⟶ A, and B ⟶ A as examples, for the CACD-1DCNN model, we compare the test accuracy of the target domain under the four loss functions of $L_s$, $L_s + L_{CA}$, $L_s + L_{CD}$, and $L_s + L_{CA} + L_{CD}$ (for simplicity, the coefficients of each loss function are omitted), as presented in Table 3. The results of other shifts are similar. We observe that the target test accuracy is worst with the loss function of $L_s$ because no adaptive strategy is adopted between the source and target domains. The target test accuracy is at an intermediate level with the loss functions of $L_s + L_{CA}$ and $L_s + L_{CD}$, and it is the highest with the loss function of $L_s + L_{CA} + L_{CD}$. Therefore, in the cases of $L_s + L_{CA}$ and $L_s + L_{CD}$, the model classification performance is comparable, and only when jointly supervised by the three loss functions can the proposed model achieve the best performance. The results confirm that (1) the CACD-1DCNN is effective in filling the domain gap and (2) when $L_{CD}$ is added on the basis of $L_{CA}$, that is, by combining domain alignment and discriminative feature learning, the proposed model guarantees that the domain-invariant features are extracted with improved intraclass compactness and interclass separability. The gap between the class clusters and the hyperplane is large, which is conducive to the correct classification of target samples near the edge or far from their corresponding class centers.

Table 2: Description of the datasets. Datasets A, B (2 HP), and C (3 HP) each contain 660 training samples and 25 test samples for each of the 10 classes.
Furthermore, taking the domain shift scenario of C ⟶ A as an example, the accuracy of the training and testing stages in the case of the joint loss of $L_s + L_{CA} + L_{CD}$ is illustrated in Figure 5. This approach clearly helps us to achieve improved performance in the target domain while maintaining a strong classification accuracy in the source domain.
Taking the domain shift of C ⟶ A as an example, the domain loss and intraclass loss are analyzed. Notably, the proposed model adopts batch values. Although batch values cannot fully represent the distance between the entire source and target domains, they are a practical and fast approximation. The domain loss under the four loss functions is illustrated in Figure 6. In the case of $L_s$, where only the source domain is trained, the feature representations obtained from the source domain are likely to be different from the target features because the sample features of the target domain are not considered at all during the model training. Therefore, overfitting occurs, and the domain loss is large. In the case of $L_s + L_{CD}$, without considering domain alignment and only considering discriminative learning, the domain loss is smaller than that in the case of $L_s$, within the range of 0-0.55. Only $L_s + L_{CA}$ and $L_s + L_{CA} + L_{CD}$ are compared in Figure 7 for clarity. The domain loss is minimal, and the domain loss curve is smooth in the case of $L_s + L_{CA} + L_{CD}$, indicating that a slight change of weight causes only a slight change of distance between domains. A more stable and accurate domain adaptation model can be obtained by considering both domain alignment and discriminative feature learning.

The intraclass loss under the four loss functions is illustrated in Figure 8. Evidently, the intraclass loss is large in the case of $L_s$ because the sample features of the target domain are not considered. The case of $L_s + L_{CD}$ takes second place. The intraclass losses in the cases of $L_s + L_{CA}$ and $L_s + L_{CA} + L_{CD}$ are small, suggesting that discriminative learning can achieve a small intraclass loss only when the model is trained jointly with domain alignment. The comparison of the intraclass losses of $L_s + L_{CA}$ and $L_s + L_{CA} + L_{CD}$ shows that the curve is smooth in the case of $L_s + L_{CA} + L_{CD}$, indicating that the model is stable and reasonable in this case.

Sensitivity Analysis of Fault.
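For each type of fault detection, we introduce three evaluation indexes, namely, Precision, Recall, and F-Measure, to further analyze the sensitivity of the proposed CACD-1DCNN method. In the multiclassification problem of fault diagnosis, for each fault category $f$, Precision and Recall are defined as follows:

$$\mathrm{Precision}_f = \frac{TP_f}{TP_f + FP_f}, \qquad \mathrm{Recall}_f = \frac{TP_f}{TP_f + FN_f},$$

where $TP_f$ is the number of samples of category $f$ that are correctly classified as $f$, $FP_f$ is the number of samples of other categories that are incorrectly classified as $f$, and $FN_f$ is the number of samples of category $f$ that are classified into other categories.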

F-Measure denotes the weighted harmonic mean of Precision and Recall, with α as the weight. Setting α to 1 indicates that Precision is as important as Recall. When α > 1, Precision is more important; when α < 1, Recall is more important. In this study, α is set to 1, so that the F-Measure equals $2 \cdot \mathrm{Precision} \cdot \mathrm{Recall} / (\mathrm{Precision} + \mathrm{Recall})$; the closer the F-Measure is to 1, the better the fault diagnosis performance. This evaluation method considers both Precision and Recall. The highest F-Measure is 1.
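The per-class indexes can be computed from a confusion matrix, as in the following sketch. It covers only the equally weighted case (α = 1) used in this study; the function name `per_class_metrics` and the toy labels are hypothetical and do not reproduce any result from the paper.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, num_classes):
    """Per-class Precision, Recall, and F-Measure (equal weighting) from label arrays."""
    confusion = np.zeros((num_classes, num_classes), dtype=int)
    for true_label, predicted_label in zip(y_true, y_pred):
        confusion[true_label, predicted_label] += 1
    true_pos = np.diag(confusion).astype(float)
    predicted = confusion.sum(axis=0)  # TP + FP for each class (column sums)
    actual = confusion.sum(axis=1)     # TP + FN for each class (row sums)
    # Classes with no predictions or no samples get a score of 0 instead of a division error.
    precision = np.divide(true_pos, predicted, out=np.zeros_like(true_pos), where=predicted > 0)
    recall = np.divide(true_pos, actual, out=np.zeros_like(true_pos), where=actual > 0)
    denom = precision + recall
    f_measure = np.divide(2 * precision * recall, denom, out=np.zeros_like(true_pos), where=denom > 0)
    return precision, recall, f_measure

# Toy labels for three classes (not data from the paper).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 0])

precision, recall, f_measure = per_class_metrics(y_true, y_pred, num_classes=3)
print(precision, recall, f_measure)
```

Low Precision for a class signals unreliable alarms for that fault type, whereas low Recall signals faults of that type that go undetected.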
The Precision, Recall, and F-Measure of each health state in the comparison method A2CNN and the proposed method CACD-1DCNN are presented in Table 4, and the comparison results of other methods are similar.
For the first (rolling body) and fourth (inner raceway) fault types, both of which have a fault size of 0.007 inch, the CACD-1DCNN method has low Precision values in the domain shift scenario B ⟶ A, which are 89% and 93%, respectively. Thus, approximately 10% of these kinds of fault alerts in this domain shift are unreliable. For the first fault type, the Precision of the proposed method in the domain shift of C ⟶ A is 89%, indicating that 11% of the samples assigned to this fault category actually belong to other categories. In A2CNN, the fault alarms for the first, second, third, and fifth fault types and the normal state are unreliable in some domain shifts.
For the third fault type, the rolling body fault size is 0.021 inch. The Recall values of the CACD-1DCNN method in domain shift scenarios C ⟶ A and B ⟶ A are low, which are 88% and 80%, respectively. That is, 12% of these faults are undetected in the domain shift scenario of C ⟶ A, whereas 20% are undetected in the domain shift B ⟶ A. In A2CNN, …
In general, the Precision, Recall, and F-Measure of the CACD-1DCNN are higher than those of the A2CNN, which means that the CACD-1DCNN has fewer false alarms and missed alarms. Except for a few samples of the third fault type in domain shift B ⟶ A, which are incorrectly classified into the first and fourth fault types, and a few samples of the third fault type in domain shift C ⟶ A, which are incorrectly classified into the first fault type, the CACD-1DCNN method assigns all samples to the correct classes. The results show that, after combining the domain alignment and discriminative feature learning, the classification performance of the proposed method achieves remarkable improvement.

Taking the domain shift of C ⟶ A as an example, the results of α with different values are illustrated in Figure 9. β is fixed at 0.003. Similar trends are observed in other domain transfer scenarios. A large range of α ($\alpha \in [10^{-1}, 10^{4}]$) can be selected to obtain better results than those of the best baseline methods. When the value of α is larger than $10^{4}$, the accuracy rapidly decreases. The effectiveness and robustness of the proposed method are further verified.
With a fixed α of 100, we consider the influence of parameter β, which balances the discriminative loss to increase the intraclass compactness and interclass dispersion. A large β can produce deep discriminative features, whereas a small β is insufficient to improve the discrimination of features. Figure 10 shows the change in accuracy in the domain shift scenario of C ⟶ A when β takes different values (β ∈ {0.0003, 0.003, 0.03, 0.3, 1, 10, 20}). When β is very small, the classification accuracy of the target domain is high. At this time, correlation alignment loss plays a role and the model classification performance is high. With the increase of β, as long as the domain alignment keeps up with the change of the source features under the influence of the discriminative loss, a domain adaptation model with high accuracy is obtained. However, when β exceeds a certain value, which is 10 in this case, the classification performance of the target domain is poor. A possible reason is that the influence of the discriminative loss becomes so large that the source features change faster than the domain alignment can follow. In this case, we can conclude that when β lies in (0, 10], the test accuracy remains high.

The experimental results reveal that appropriate trade-off parameters between domain alignment and discriminative feature learning in the CACD-1DCNN model can improve the domain adaptation performance.

Network Visualizations.
To further describe the effectiveness of the CACD-1DCNN, the t-SNE technology [41] is adopted for visualizing the feature representations of the proposed approach in all convolutional and fully connected layers. Domain scenario B ⟶ C in Figure 11 is taken as an example. Three points are worth noting. (1) As the number of layers of the CACD-1DCNN model increases, the signals become increasingly separable at each layer, indicating the necessity of a deep structure. (2) In the third and fourth convolutional layers, the phenomenon of linear inseparability of feature representations occurs. In the fully connected layer, the feature representations of all faults are linearly separable. Therefore, the nonlinear expression ability of the model increases as the number of layers increases. (3) The feature representation of signals in the first convolutional layer is similar to that of the original signals and fails to show any separability. In the third and fourth convolutional layers, the signal samples gradually show separability. At the fully connected layer, the faults can be well distinguished. The following describes the core idea of the proposed model to implement domain adaptation. The correlation of the data is initially removed, and then recorrelation operations are performed on the basis of the information of the target domain.

Conclusion
In this work, we propose a CACD-1DCNN model for the domain adaptation of bearing fault diagnosis by combining domain alignment and discriminative feature learning. The CACD-1DCNN aims to extract domain-invariant features with improved intraclass compactness and interclass separability and guarantees high classification performance in the two domains. The experimental results on the CWRU bearing datasets confirm the superiority of the proposed method over many existing methods. Future research may focus on two aspects: (1) applying correlation alignment at multiple layers between the two domains in parallel and (2) further reducing the domain shifts in the aligned feature space through other constraints.

Data Availability
The data used in this paper are acquired from the Bearing Data Center of Case Western Reserve University (CWRU), web page: http://csegroups.case.edu/bearingdatacenter/home (accessed October 2015).

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
Jing An conceived the study, participated in the research design, conducted the experiments, and wrote the paper. Ping Ai designed the methodology and reviewed the manuscript critically for important intellectual content. Dakun Liu carried out more supplementary experiments, including the calculation of Precision, Recall, and F-Measure of the comparison methods and the effectiveness verification of the proposed method. All authors read and approved the final manuscript.