Cross-domain intelligent fault diagnosis of rolling bearing based on distance metric transfer learning

Rolling bearings are ubiquitous in mechanical equipment, and timely fault diagnosis is of great significance in guaranteeing the safety of mechanical operation. In real-world industrial applications, the distributions of the training dataset (source domain) and the testing dataset (target domain) are often different and vary with the operating environment, which may lead to performance degradation. In this study, a cross-domain rolling bearing fault diagnosis method based on distance metric transfer learning (DMTL) and wavelet packet decomposition (WPD) is proposed. The Mahalanobis distance is adopted to capture the intrinsic similarity or dissimilarity between instances, and is learned by simultaneously minimizing the intra-class distances and maximizing the inter-class distances for the target domain. The features of the source and target domains are first extracted from the original vibration signals by WPD, a powerful tool for non-stationary signals that provides fine-grained analysis. Then, the DMTL model is adopted to eliminate error propagation across different components, weakening the weight of low-quality instances and enhancing the weight of high-quality samples. Finally, the k-nearest neighbor (KNN) classifier is applied to accomplish cross-domain intelligent fault-type classification. The superiority and effectiveness of the proposed fault diagnosis model are validated by two diagnosis cases. The experimental results demonstrate that the proposed method outperforms the compared methods in recognizing various fault types and can handle complex cross-domain adaptation scenarios.


Introduction
Rolling bearings play an indispensable role in mechanical equipment but are prone to breakdown since they often operate under harsh conditions such as high temperature, heavy loads, and high rotating speed; roughly 45%-55% of rotating machinery failures are rolling bearing faults. [1][2][3] Unexpected failures may raise the cost of operation and maintenance and even lead to catastrophic casualties. 4 To ensure the safety and reliability of rotating machinery, accurate and efficient diagnosis of incipient faults is extremely important.
Conventionally, fault diagnostic techniques collect and process various signals with the goals of recovering from malfunctions or faults and precluding future failures as early as possible. 2,5 Data-driven fault recognition approaches based on artificial intelligence or machine learning techniques, such as the support vector machine (SVM), k-nearest neighbor (KNN), and artificial neural network (ANN), have been extensively studied to deal with complex problems such as varying load effects and noise contamination more accurately and effectively. [6][7][8] Additionally, deep learning methods have been widely used for condition recognition over the past decades. [9][10][11] These intelligent recognition techniques have achieved great success in distinguishing the operating conditions of various machines under complex working environments.
Despite this success, most intelligent recognition methods work well only under two general hypotheses: first, that abundant labeled training samples are available; second, that the training and testing datasets are drawn from an identical probability distribution. However, in real-world scenarios, the performance of these methods may decline dramatically because of variations between the distributions of the training data (source domain) and the testing data (target domain). 12 The distribution of the collected dataset varies with the operating environment, such as the installation conditions of the experimental platform, motor loads, humidity, and temperature, which is known as the cross-domain learning problem. 13 These variations (domain shifts) can cause a great discrepancy between the features extracted from signals obtained in experimental settings and those collected in actual operating situations.
Recently, transfer learning methods, which adapt a machine learning model trained on a source-domain dataset to a different but related target domain, have received extensive study. [14][15][16] The learning strategies of most published transfer learning methods can be roughly divided into three categories: feature-based, instance-based, and metric-based methods. 17 Feature-based learning aims to discover a feature subspace in which the recognition model trained in the source domain is qualified for the target domain. 18,19 Instance-based transfer learning reweights the source samples according to the shared information provided by the target data, so that the reweighted instances can be further analyzed. [20][21][22] Metric-based methods, that is, distance metric learning (DML) algorithms, aim to learn an optimal distance metric for measuring the similarity or dissimilarity of sample pairs by exploiting meaningful correlations between instances in the source and target domains, 17 which can effectively reduce the distribution divergence between domains and extract weakly discriminative information. 23 Cao et al. proposed consistent distance metric learning to estimate instance weights under covariate shift, in which the Euclidean distance metric was utilized to determine sample-pair correlations. 24 Huang and Zhou presented an unsupervised metric transfer learning method (UMTL) to learn domain-invariant features with more discriminative information via the Maximum Mean Discrepancy. 25 Most existing transfer learning algorithms use the Euclidean distance to estimate the dissimilarity or similarity between samples in the source or target domain.
However, adopting the Euclidean distance to measure the dissimilarity and similarity between samples, as most existing transfer learning techniques do, may degrade transfer learning performance, since the Euclidean distance does not maximize the inter-class distances while minimizing the intra-class distances. 26 Ahmadvand and Tahmoresnezhad presented metric transfer learning via geometric knowledge embedding, in which the Mahalanobis distance metric and graph optimization were employed to reweight the source samples for distribution matching. 17 Xu et al. put forward a metric transfer learning framework to encode metric learning in transfer learning. 27 Mahalanobis-distance-based transfer learning can therefore effectively minimize the distance between the source and target domains.
Inspired by the strategy of transferring knowledge from a source domain to a target domain, a cross-domain rolling bearing fault diagnosis method based on distance metric transfer learning (DMTL) and wavelet packet decomposition (WPD) is proposed in this study. Based on metric learning, the Mahalanobis distance instead of the Euclidean distance is adopted in the objective function of cross-domain adaptation for learning the intrinsic similarity or dissimilarity between instances. Then, the intra-class distances of the target domain are minimized and the inter-class distances of the target domain are maximized to improve recognition accuracy by exploiting the intrinsic information among labeled samples from the different domains. Experimental investigations are carried out to demonstrate the feasibility and effectiveness of the proposed method for rolling bearing fault diagnosis.
The rest of the article is organized as follows. In Section II, the principle of DMTL is introduced. Then, the fault diagnosis model based on the DMTL and WPD algorithms is presented in Section III. After that, practical cases are studied to validate the superior performance of the proposed model in Section IV. Finally, some concluding remarks are summarized in Section V.

Principle of distance metric transfer learning
Suppose the domain D = {X, P(X)} is composed of a d-dimensional feature space X and a marginal probability distribution P(X), where X = {x_i}_{i=1}^{n} ∈ R^{d×n} is a dataset consisting of samples x_i ∈ R^{d×1} from this domain. Its corresponding task T = {Y, f(X)} is composed of a label space Y and a prediction function f(X), where Y = {y_i}_{i=1}^{n} ∈ R is the label vector of the feature dataset X, y_i is the label of x_i, and f(X) = Q(Y|X) is the conditional probability distribution. Let D_S = {X_S, P_S(X_S)} and D_T = {X_T, P_T(X_T)} denote the source domain and target domain, respectively, with corresponding feature datasets X_S = {x_Si}_{i=1}^{N_S} and X_T = {x_Tj}_{j=1}^{N_T}; the samples of the source domain are labeled and only a few samples of the target domain are labeled.
Let T_S = {Y_S, f_S(X_S)} and T_T = {Y_T, f_T(X_T)} denote the source task and target task, respectively. This study focuses on homogeneous metric transfer learning, which implies that the feature space and label space of the source and target domains are the same while the marginal and conditional probability distributions are different, that is, X_S = X_T, Y_S = Y_T, P_S(X_S) ≠ P_T(X_T), and Q(Y_S|X_S) ≠ Q(Y_T|X_T). The experimental dataset contains C different operating statuses and each status has n_c samples in its subdomain. Without loss of generality, let Y_S = Y_T = {1, . . . , C} and let y_i = 1 represent the normal status of the rotating machinery in fault diagnosis.
Since there is a discrepancy between the distributions of the source and target domains, the labeled samples of the source domain cannot be directly applied to learn a distance metric for the target domain. To address this issue, the labeled source-domain samples are reweighted while preserving the distance relations among the data in the source and target domains, which provides discriminative information for the target domain. In this study, a reweighting instance strategy called distance metric transfer learning (DMTL) is investigated, and the objective function of the DMTL method consists of three parts:

min_{A, w} R_A + λ R_w + β R_l    (1)

where the first term R_A is the primary objective of distance metric learning, as in standard distance metric learning, 24 and controls the generalization error of the distance metric. The second term R_w is the regularization term on the instance weights of the source-domain labeled samples. The third term R_l is the loss function of the prediction model under the learned distance metric on the target-domain labeled samples along with the reweighted source-domain labeled samples. λ > 0 and β > 0 are trade-off parameters that balance the impact of the three terms in equation (1).

Regularization term of distance metric learning R_A
Since the Mahalanobis distance is learned by information-theoretic metric learning, which is helpful for classification problems, 27 Mahalanobis rather than Euclidean distance metric learning is applied for the target domain in this study. Assuming x_i and x_j are feature vectors, the Mahalanobis distance parameterized by the distance metric M is defined as:

d_M(x_i, x_j) = sqrt((x_i - x_j)^T M (x_i - x_j))    (2)

where M is a symmetric positive semidefinite real-valued matrix, which ensures that d_M satisfies the properties of a pseudo-distance: identity, symmetry, nonnegativity, and the triangle inequality. 23 Obviously, if M is the identity matrix (M = I), d_M reduces to the Euclidean distance. M can be decomposed as M = A^T A. Thus, learning the Mahalanobis distance in terms of M is equivalent to learning the matrix A, and the Mahalanobis distance metric in terms of A can be defined as:

d_A(x_i, x_j) = ||A x_i - A x_j||_2    (3)

Here, the regularization term of distance metric learning R_A controls the generalization error of the Mahalanobis distance metric in terms of A.
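As a concrete illustration, the factorized form d_A(x_i, x_j) = ||A x_i − A x_j|| above can be sketched in a few lines of NumPy (the function name is ours, not the paper's):

```python
import numpy as np

def mahalanobis_dist(xi, xj, A):
    """d_A(xi, xj) = ||A xi - A xj||_2, i.e. the Mahalanobis
    distance with M = A^T A."""
    diff = A @ (xi - xj)
    return np.sqrt(diff @ diff)
```

With A = I the function reduces to the ordinary Euclidean distance, matching the special case M = I noted above.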

Regularization term of instance weights R_w
To avoid potential issues in jointly learning the instance weights w and the Mahalanobis distance metric A, the regularization term on instance weights R_w is applied, which can effectively estimate the instance weights. The R_w term is defined as:

R_w = Σ_{i=1}^{N_S} (w(x_Si) - w_0(x_Si))^2    (4)

where w_0(x) = P_T(x)/P_S(x) are the estimated density ratios, or weights, of the source-domain instances x obtained with the Euclidean distance, w(x) are the ideal instance weights under the Mahalanobis distance metric, and A and A_0 are the ideal distance metrics of the source and target domains, respectively. Obviously, the higher the value of w_0(x), the higher P_T(x) and the smaller P_S(x), which implies that x is closer to the target-domain distribution than to the source-domain distribution. The instance weights of the target domain are set to 1, that is, w_0(x_Ti) = 1 for x_Ti ∈ D_T. The density ratios w_0(x) = P_T(x)/P_S(x) can be estimated by a linear combination of basis functions:

w_0(x) = Σ_{i=1}^{b} α_i φ_i(x)    (5)

where the φ_i are a set of predefined basis functions and the α_i are the corresponding positive parameters to be learned. Since the estimation performance of the density ratios is determined by the choice of φ_i, the Gaussian kernel function centered at c_i is adopted as the basis function:

φ_i(x) = exp(-||x - c_i||^2 / (2σ^2))

Equation (5) can then be transformed into the optimization problem:

max_α Σ_{j=1}^{N_T} log(Σ_{i=1}^{b} α_i φ_i(x_Tj))  s.t.  (1/N_S) Σ_{j=1}^{N_S} Σ_{i=1}^{b} α_i φ_i(x_Sj) = 1, α_i ≥ 0    (6)

Since the logarithmic function is concave, the optimal solution of equation (6) can be calculated by gradient ascent approaches. As seen in equation (6), all samples of the source and target domains, both labeled and unlabeled, are used to deduce the parameters α_i. The initial weights of the source-domain instances can then be obtained through equation (5) and are further refined into more precise weights w(x) via the regularization term R_l in equation (1). Therefore, the distribution discrepancy between the reweighted source-domain instances and the target-domain instances can be minimized.
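The density-ratio estimation above (Gaussian bases centered on target samples, a log-likelihood objective with a unit-mean constraint on the source weights) can be sketched as follows; the solver details here (projected gradient ascent, step size, iteration count, centering the bases at the target samples) are our assumptions rather than the paper's:

```python
import numpy as np

def gauss_basis(X, centers, sigma):
    # phi_i(x) = exp(-||x - c_i||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def estimate_ratio(Xs, Xt, sigma=1.0, iters=500, lr=1e-2):
    """KLIEP-style estimate of w0(x) = P_T(x)/P_S(x) with Gaussian bases
    centered at the target samples. Returns weights for source samples."""
    C = Xt                               # basis centers c_i
    Phi_t = gauss_basis(Xt, C, sigma)    # (N_T, b)
    Phi_s = gauss_basis(Xs, C, sigma)    # (N_S, b)
    alpha = np.ones(C.shape[0])
    for _ in range(iters):
        # gradient of sum_j log( sum_i alpha_i phi_i(x_Tj) ) w.r.t. alpha
        grad = (Phi_t / (Phi_t @ alpha)[:, None]).sum(0)
        alpha = np.maximum(alpha + lr * grad, 0)        # alpha_i >= 0
        alpha /= max(Phi_s.mean(0) @ alpha, 1e-12)      # E_S[w0(x)] = 1
    return Phi_s @ alpha
```

Source samples that fall inside the target distribution receive weights above 1, while samples far from the target support are driven toward 0.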
To improve the performance of knowledge transfer across domains, the Mahalanobis distance metrics in terms of A and A_0 are introduced, and w_0(x) is applied as the initial weights from which to learn w(x). 27

Loss function of prediction model R_l

The loss function of the prediction model with the learned Mahalanobis distance metric A on the labeled target-domain samples, together with the reweighted labeled source-domain samples, is introduced to improve the classification performance. It is defined through the K nearest neighbors along with the instance weights w(x) under the Mahalanobis distance metric A as follows:

R_l = l_w(A, w) - l_b(A, w)    (7)

where l_w(A, w) and l_b(A, w) are the within-class and between-class accumulated weighted distances measured by the metric A, respectively, defined as:

l_w(A, w) = Σ_{(i,j)∈C, y_i=y_j} w(x_i) w(x_j) d_A^2(x_i, x_j),  l_b(A, w) = Σ_{(i,j)∈C, y_i≠y_j} w(x_i) w(x_j) d_A^2(x_i, x_j)    (8)

Substituting equations (3), (4), (7), and (8) into equation (1), the objective function of the DMTL method can be obtained:

min_{A, w} R_A + λ R_w + β Σ_{(i,j)∈C} δ_ij w(x_i) w(x_j) d_A^2(x_i, x_j)  s.t.  w(x_i) ≥ 0    (9)

where δ_ij is an indicator function, δ_ij = 1 for y_i = y_j and δ_ij = -1 for y_i ≠ y_j. As seen in equation (9), the in-class and out-of-class instance pairs, denoted by C, are utilized to estimate the loss function. The constrained problem in equation (9) can be converted into an unconstrained one, denoted equation (10), by introducing a penalty term, where υ is a nonnegative penalty coefficient, e_i = 1 for i ≤ N_S, and e_i = 0 for N_S < i ≤ N_S + N_T. Since equation (10) is non-parametric, an alternating optimization algorithm is introduced to learn the metric A and the instance weights w(x) alternately and iteratively. With the metric A fixed at the t-th iteration, w_{t+1}(x) is obtained from w_t(x) by a gradient descent step on the objective with respect to w.
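A minimal pure-NumPy sketch of the within-class and between-class terms l_w(A, w) and l_b(A, w) accumulated over a set of instance pairs C (function and variable names are ours):

```python
import numpy as np

def weighted_class_losses(X, y, w, A, pairs):
    """Accumulated weighted squared distances under metric A over the
    given instance pairs: within-class l_w and between-class l_b."""
    lw = lb = 0.0
    for i, j in pairs:
        d2 = np.sum((A @ (X[i] - X[j])) ** 2)   # d_A^2(x_i, x_j)
        if y[i] == y[j]:
            lw += w[i] * w[j] * d2              # same-class pair
        else:
            lb += w[i] * w[j] * d2              # different-class pair
    return lw, lb
```

Minimizing lw − lb then pulls same-class pairs together and pushes different-class pairs apart under the learned metric.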
After updating w_{t+1}(x), the metric A_{t+1} is regenerated while w_{t+1}(x) is held fixed. The metric A and the instance weights w(x) are updated alternately and iteratively until the variation of the objective function falls below a default threshold or the number of iterations reaches the maximum value, at which point the metric A and the instance weights w(x) are confirmed. The initial distance metric A_0 is learned from the source domain, and the initial instance weights w_0(x) of the source domain are obtained using the Euclidean distance.
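The alternating scheme described above can be summarized as a generic skeleton; the concrete update functions passed in are placeholders for the problem-specific gradient steps on w (with A fixed) and on A (with w fixed):

```python
def alternating_optimize(A0, w0, update_w, update_A, objective,
                         tol=1e-6, max_iter=100):
    """Alternate the weight update (A fixed) and the metric update
    (w fixed) until the objective change falls below tol or the
    iteration budget is exhausted."""
    A, w = A0, w0
    prev = objective(A, w)
    for _ in range(max_iter):
        w = update_w(A, w)      # step on w with A fixed
        A = update_A(A, w)      # regenerate A with w fixed
        cur = objective(A, w)
        if abs(prev - cur) < tol:
            break               # objective has stabilized
        prev = cur
    return A, w
```

The same skeleton works for any pair of coordinate-wise updates whose joint objective is monitored for convergence.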

Process of fault diagnosis by distance metric transfer learning
The sensitive features of the source and target domains are first extracted from the original vibration signals using a unified feature extractor. The vibration signals of faulty bearings are generally non-stationary, and WPD is a powerful tool for dealing with non-stationary signals that provides fine-grained analysis. 28 WPD can effectively decompose a signal into both high- and mid-frequency information along with the corresponding frequency regions, and is widely used for fault diagnosis. 6,28,29 Therefore, WPD-related features, including the relative energy in a wavelet packet node (REWPN) and the entropy in a wavelet packet node (EWPN), are extracted, where REWPN denotes the normalized energy of a wavelet packet node and EWPN indicates the uncertainty of the normalized coefficients of a wavelet packet node. 6 For a given sample x(n), let C_i^j denote the j-th wavelet packet coefficient of the i-th wavelet packet node. The REWPN and EWPN can then be obtained as:

REWPN_i = E_i / Σ_{n=1}^{N} E_n,  with  E_i = Σ_{j=1}^{K} (C_i^j)^2

EWPN_i = -Σ_{j=1}^{K} p_i^j log p_i^j,  with  p_i^j = (C_i^j)^2 / E_i

where N is the total number of wavelet packet nodes and K is the total number of wavelet packet coefficients in each node. After the construction of the feature set, a transfer learning strategy for improving predictive performance is put forward. Since DMTL can effectively eliminate error propagation across different components, a fault diagnosis model based on DMTL is proposed in this study, which contains a model training stage and a diagnosis stage; the flowchart of the proposed fault diagnosis method is shown in Figure 1.
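Assuming the terminal-node coefficient arrays C_i have already been produced by a wavelet packet decomposition (for example by a WPD library), the REWPN and EWPN features described above can be computed as in this sketch (names and the small eps guard are ours):

```python
import numpy as np

def wpd_features(nodes, eps=1e-12):
    """REWPN and EWPN from wavelet-packet node coefficients.
    `nodes` is a list of 1-D coefficient arrays C_i, one per
    terminal wavelet packet node."""
    energies = np.array([np.sum(c ** 2) for c in nodes])
    rewpn = energies / (energies.sum() + eps)      # relative node energy
    ewpn = []
    for c, e in zip(nodes, energies):
        p = c ** 2 / (e + eps)                     # normalized coefficients
        ewpn.append(-np.sum(p * np.log(p + eps)))  # Shannon entropy
    return rewpn, np.array(ewpn)
```

Each node contributes one REWPN value (the node energies sum to 1 across nodes) and one EWPN value (high entropy when the node's energy is spread across many coefficients).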

Experiment design and datasets
To validate the performance of the proposed method, two cross-domain rolling bearing fault diagnosis scenarios are conducted. Rolling bearings are vulnerable components of rotating machinery, and their frequent faults are inner race faults, outer race faults, and ball faults. 5 In engineering practice, the status of bearings is monitored using vibration signals and system temperature. A real-world run-to-failure bearing fault diagnosis is then conducted to further demonstrate the performance of the proposed method. The detailed description is shown in Table 1. The knowledge transfer performance from public datasets to real-world fault severity diagnosis is verified in this testing scenario. The CWRU and MFPT datasets are adopted as the source domain, while the dataset (Hubt dataset for short) collected on the rotor rolling bearing and gearbox integrated fault test bench of Hubei University of Technology is utilized as the target domain. The experimental system is displayed in Figure 2.

Experimental results
The sensitive features of the source and target domains, comprising REWPNs and EWPNs, are first extracted from the original vibration signals. After extensive experiments on a series of Daubechies wavelets, the wavelet packet node energy features obtained by Daubechies2 (db2) wavelet packet decomposition were found to attain the best recognition performance for bearing fault diagnosis. 6 Herein, db2 is adopted as the mother wavelet function to implement binary WPD for the vibration signals, and the maximum decomposition level is set to 4. After the construction of the feature set, a transfer learning strategy for improving predictive performance is put forward. Additionally, to further investigate the effect of the instance weights of the source-domain labeled samples and of the learned Mahalanobis distance metric on the overall performance, a reduced variant of DMTL denoted DMTL_w adopts only the instance weights, and a reduced variant denoted DMTL_l considers only the learned Mahalanobis distance metric; the metric A is fixed to the identity matrix for DMTL_w, and the instance weights are fixed to one for DMTL_l.
To validate the advantage of the proposed DMTL method, several popular state-of-the-art supervised learning and transfer learning methods are conducted for comparison, including the Support Vector Machine (SVM), 30 k-nearest neighbor (KNN), 31 Transfer Component Analysis (TCA), 32 Geodesic Flow Kernel (GFK), 33 Deep neural network for domain Adaptation in Fault Diagnosis (DAFD), 34 and TrAdaBoost. 20 SVM and KNN are conventional pattern recognition methods. TCA, GFK, and DAFD are subspace-based transfer learning methods: TCA is a representative feature-based transfer learning approach, GFK learns transferable features by constructing a geodesic flow kernel, and DAFD is a deep-neural-network-based transfer learning method widely used in fault diagnosis. TrAdaBoost is a representative instance-based transfer learning approach. The detailed parameter settings of the above methods are as follows. The Gaussian kernel is applied in the SVM classifier, with the trade-off parameter set to 1. For KNN, the number of nearest neighbors ranges from 1 to 63, and the optimal results are selected. For TCA, the optimal hyperparameters are obtained by Bayesian optimization, where the regularization trade-off parameter μ ranges from 10^-3 to 10^3 and the subspace dimension l ranges from 1 to 10. For GFK, the subspace dimension ranges from 1 to 5. For DAFD, a three-layer neural network is employed to extract sensitive features, and the SVM classifier is applied to identify the fault types. For TrAdaBoost, a linear SVM is adopted as the base classifier, and the maximum number of iterations is set to 100. For the proposed DMTL method, the trade-off parameters λ and β are selected in the range [10^-3, 10^3], and the maximum number of iterations is set to 100.
In all the cross-domain adaptation tasks, the labeled samples in the source domain and about 15% of the labeled samples in the target domain are adopted for training, and the remaining samples in the target domain are used for testing. The evaluation metric is the classification accuracy on the testing samples in the target domain, that is, the proportion of correctly classified testing samples, which is widely employed in the literature. 5 (1) Results for bearing fault diagnosis: The classification results of the bearing fault diagnosis tasks are shown in Table 2. The average accuracies of the SVM, KNN, TCA, GFK, DAFD, TrAdaBoost, DMTL_w, DMTL_l, and DMTL methods are 71.1%, 74.2%, 53.9%, 71.7%, 43.5%, 67.7%, 67.9%, 76.0%, and 80.5%, respectively. As seen, none of the comparison methods achieves the best classification results on all tasks. DMTL outperforms the other compared approaches by a significant margin in the cross-domain adaptation scenarios A→C and B→A, and attains the highest average transfer recognition accuracy of 80.5%.

Sensitivity analysis for different parameters in DMTL
Different parameters may influence the classification performance of DMTL, so sensitivity studies on the parameters of DMTL were conducted in this section. (1) Sensitivity analysis on K for the KNN classifier: In DMTL, the KNN classifier is adopted to recognize fault types using the adjusted samples of the new subspace. The transfer learning performance may be affected by the number of nearest neighbors, and the impacts of different K on the accuracy of DMTL in bearing fault severity diagnosis are shown in Figure 3. As seen in Figure 3, a large value of K may reduce the cross-domain recognition accuracy, while a small value of K yields more stable and precise accuracy. Thus, the number of nearest neighbors K is set to a small integer, for example, K ≤ 3.
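For illustration, KNN classification under the learned metric A is equivalent to Euclidean KNN on A-transformed features. A minimal sketch follows (the function name, brute-force distance computation, and majority-vote tie handling are our assumptions, not the paper's implementation):

```python
import numpy as np

def knn_predict(A, X_train, y_train, X_test, k=3):
    """KNN under d_A: map x -> A x, then do Euclidean KNN."""
    Ztr = X_train @ A.T                  # transformed training features
    Zte = X_test @ A.T                   # transformed test features
    d2 = ((Zte[:, None, :] - Ztr[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]  # k nearest training samples
    preds = []
    for row in idx:
        labels, counts = np.unique(y_train[row], return_counts=True)
        preds.append(labels[np.argmax(counts)])  # majority vote
    return np.array(preds)
```

This makes the role of K explicit: the vote is taken over the k nearest neighbors in the metric-adjusted space, so small K keeps the decision local.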
(2) Sensitivity analysis on the number of instance pairs C: The in-class and out-of-class instance pairs C, which are randomly selected for optimization in the recognition task, are used to estimate the loss function as shown in equation (9). The impacts of different values of C on the classification accuracy of DMTL in the bearing fault severity diagnosis tasks are investigated in this section. Classification accuracies with values of C ranging from 50 to 1500 are displayed in Figure 4. Here, the number of nearest neighbors K is set to 3. As seen, the accuracies of DMTL on the nine transfer tasks vary only slightly with increasing C. The accuracies of DMTL on the tasks A→serious, B→serious, and C→serious are higher than those of the other tasks for all values of C, and the recognition rate is stable at 100%. (3) Sensitivity analysis on the trade-off parameters: The trade-off parameters λ and β are used to balance the impacts of the prediction-model loss and the instance-weight terms in the objective function of DMTL. The sensitivity analysis on these two parameters in the B→moderate and C→moderate tasks is conducted in this section, and the experimental results are exhibited in Figure 5. Here, the number of nearest neighbors K and the number of instance pairs C are set to 3 and 150, respectively. The transfer performance decreases sharply when β > 100, while DMTL performs well and stably when β is set to a relatively small value, for example β ≤ 1. The parameter λ has little effect on the recognition performance and can be set within a wide range [10^-3, 10^3].

Conclusions
In this study, a cross-domain fault diagnosis model based on distance metric transfer learning (DMTL) is proposed to recognize the operating condition of rolling bearings when the labeled samples in the target domain are insufficient. DMTL reweights the source-domain samples by minimizing the intra-class distances and maximizing the inter-class distances for the target domain, and the objective function is defined on the basis of the Mahalanobis distance instead of the Euclidean distance. The features of the source and target domains are first extracted from the original vibration signals using wavelet packet decomposition (WPD). Then, the DMTL model is adopted to eliminate error propagation across different components, which weakens the weight of low-quality instances and enhances the weight of high-quality samples. Finally, the k-nearest neighbor (KNN) classifier is applied to accomplish the cross-domain intelligent fault-type classification. The effectiveness and superiority of the proposed DMTL and WPD method are verified through two transfer recognition experiments. Compared with other peer methods, the proposed method achieves better fault diagnosis in cross-domain adaptation tasks, which implies that it attains more accurate recognition in the target domain than the compared methods by using only a few labeled target samples together with massive source samples.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.