Rolling Bearing Fault Diagnosis Based on Sensitive Feature Transfer Learning and Local Maximum Margin Criterion under Variable Working Condition

In real industrial scenarios, the working conditions of bearings are variable, and it is therefore difficult for data-driven diagnosis methods based on conventional machine-learning techniques to guarantee desirable performance, as these models assume that the distributions of the training and testing data are the same. To enhance the performance of the fault diagnosis of bearings under different working conditions, a novel diagnosis framework inspired by feature extraction, transfer learning (TL), and feature dimensionality reduction is proposed in this work, and dual-tree complex wavelet packet transform (DTCWPT) is used for signal processing. Additionally, transferable sensitive feature selection by ReliefF and the sum of mean deviations (TSFRS) is proposed to reduce the redundant information of the original feature set, to select sensitive features for fault diagnosis, and to reduce the difference between the marginal distributions of the training and testing feature sets. Furthermore, a modified feature reduction method, the local maximum margin criterion (LMMC), is proposed to acquire low-dimensional mappings of high-dimensional feature spaces. Finally, bearing vibration signals collected from two test rigs are analyzed to demonstrate the adaptability, effectiveness, and practicability of the proposed diagnosis framework. The experimental results show that the proposed method can achieve high diagnosis accuracy and has significant potential benefits in industrial applications.


Introduction
Rolling element bearings (REBs) are one of the most common machine elements of rotating machinery equipment in modern industry and smart manufacturing [1,2], and the health state of REBs can seriously affect the safe and stable operation of rotary mechanical equipment [2]. REBs often operate in harsh working environments, and their failure probability is therefore higher than that of other components [3,4].
Thus, REB fault diagnosis is of great significance for guaranteeing equipment safety and reducing maintenance costs [4]. In the past decade, because vibration signals usually carry rich information about machine operating conditions, the vibration signals collected from REBs have been commonly used as the analytical signals in many intelligent machine fault diagnosis systems [5]. In recent years, with the rapid development of signal processing, data mining, and artificial intelligence technology, data-driven fault diagnosis has become a popular research topic [5]. Data-driven fault diagnosis consists of four steps, namely, signal collection and processing, feature extraction, feature reduction, and pattern recognition [5][6][7][8], among which feature extraction is the crucial step for extracting more useful information from the original vibration signals for fault pattern recognition. However, most existing data-driven intelligent diagnosis methods have two main limitations that hinder their applicability in real industrial scenarios [5,6,9]: (1) most existing feature extraction and fault classification models assume that the training and testing data have the same distributions. Due to harsh working environments and varying working requirements in industry, working conditions are not consistent, which can lead to differences between the distributions of the training and testing data [6]. (2) The variability of working conditions and the diversity of the types of failure of rotating machines often lead to insufficient labeled target fault data. Therefore, diagnostic models based on conventional machine-learning techniques that are learned with training data cannot guarantee the desired diagnosis performance on testing data collected from industrial scenarios.
To overcome these two limitations, it is necessary to use an improved fault diagnosis framework in which the training data are the labeled data under one working condition, and the resulting model can be applied to the unlabeled data under other working conditions.
Signal processing is the first step in data-driven REB fault diagnosis methods and has been carried out in many previous investigations by numerous scholars. Because vibration signals collected from REBs generally have nonlinear and nonstationary features, a time-frequency analysis can be effective for feature extraction [10]. Some commonly used and representative conventional time-frequency domain analysis approaches include empirical mode decomposition (EMD), short-time Fourier transform (STFT), the Wigner-Ville distribution (WVD), and wavelet transform (WT) [11]. In addition, parameterized time-frequency transform (PTFT) methods [12,13] have been proposed to achieve a more accurate extraction of the instantaneous rotation frequency (IRF) from strongly nonstationary vibration signals. In the work by Wang and Xiang [14], spline-kernelled chirplet transform (SCT), one of the PTFT methods, was employed to calculate the time-frequency distribution and extract the instantaneous rotation frequency for REB fault diagnosis under varying speed conditions. In the work by Wang et al. [15], polynomial chirplet transform (PCT), another PTFT method, was employed to estimate the IRF of REBs from the vibration signals for fault diagnosis.
EMD is a common and effective time-frequency approach in the fault diagnosis of rotating machinery and can automatically decompose nonstationary and nonlinear signals into multiple modal compositions [16,17]. In some previous studies [18][19][20][21], EMD was employed to process the original signals and extract features for REB fault diagnosis. However, there are some limitations of EMD, such as overenveloping, end effects, and mode mixing [17,22]. STFT is also an effective time-frequency analysis approach that can be used to divide the entire time domain into numerous segments of the same length, such that each time period is an approximately stationary process [23,24]. Some researchers [25][26][27][28] have used STFT for fault diagnosis, but its effectiveness is still hampered by the limitation of its single triangular basis [17,29]. The WVD is also a widely used nonlinear time-frequency distribution for signal processing due to its excellent resolution and localization in the time-frequency domain [30]; however, the presence of cross-terms when it is applied to multicomponent signals can result in misleading interpretations [31,32]. WT, including continuous wavelet transform (CWT) and discrete wavelet transform (DWT), is an outstanding and powerful method in rotary machine diagnosis because its multiresolution capability is suitable for the analysis of nonlinear and nonstationary signals [33]. However, CWT can generate redundant data, involves a huge amount of computation, and is very time-consuming [34,35]. DWT can overcome these drawbacks of CWT, but its limitations of shift variance and frequency aliasing may lead to the loss of useful information [36]. To address these shortcomings of DWT, dual-tree complex wavelet transform (DTCWT) was proposed by Tang et al. [37,38] and further investigated by Selesnick et al. [39] in the dyadic case.
DTCWT possesses some advantageous properties [36,37,40], including its (1) near shift-invariance and reduced aliasing, (2) good directional selectivity, which can overcome the lack of the directional selection of DWT, (3) limited redundancy and efficient order, and (4) ability to acquire amplitude information and achieve perfect reconstruction.
These properties are all beneficial for feature extraction in the task of mechanical fault diagnosis [31]. Dual-tree complex wavelet packet transform (DTCWPT) is an extension of DTCWT and can overcome the foremost limitation of DTCWT, namely, that it cannot realize multiresolution analysis in the high-frequency band. In previous research [36,37,40,41], DTCWPT has been employed to process signals and extract features for REB fault diagnosis. In this paper, DTCWPT is introduced to process the original vibration signals collected from REBs.
For the construction of a feature set for fault pattern recognition, the statistical properties of the signals in the time, frequency, and time-frequency domains can be extracted to represent feature information [9,[42][43][44], such as the peak value (PV), root mean square (RMS), variance (V), skewness (Sw), kurtosis (K), energy, and energy entropy. In [42], after the vibration signals were processed by wavelet analysis, single-branch reconstruction signals and the corresponding HHT envelope spectrum (HES) were used to generate 192 statistical features using 6 statistical parameters for bearing fault diagnosis. In [43], vibration signals collected from REBs were decomposed into several different IMFs by EMD. The first four IMFs were selected to obtain the HHT marginal spectrum and HES, which were then used to calculate the original statistical properties. In [9], 29 statistical parameters were selected to extract 29 statistical features, which formed a high-dimensional original feature dataset for REB fault diagnosis. In [44], more than 30 feature indicators of vibration signals were calculated for axle bearings under different conditions, and the features that could more effectively and representatively reflect the fault features were selected for fault detection. In [45], the RMS and K were used to calculate fault features for wind turbine bearing fault diagnosis. It is often difficult to determine which statistical property can best reflect the nature of a fault from the feature space because of the complex mapping relations between some bearing faults and their signals [42,43]. Thus, when unsuitable statistical features are chosen for fault pattern recognition, the accuracy and efficiency of fault diagnosis may decline. According to some previous studies [11,42,43], the selection of a feature subset formed by fault-sensitive features is a crucial step for the achievement of the expected diagnostic accuracy.
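As an illustration, the following minimal sketch computes seven of the statistical indicators named above for one vibration-signal segment (the paper's full feature list is longer, and the sub-band count used for the energy entropy is an illustrative choice):

```python
import numpy as np

def statistical_features(x):
    """Representative statistical indicators for one signal segment:
    PV, RMS, variance, skewness, kurtosis, energy, energy entropy."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    pv = np.max(np.abs(x))                      # peak value
    rms = np.sqrt(np.mean(x ** 2))              # root mean square
    var = np.var(x)                             # variance
    sw = np.mean((x - mu) ** 3) / sigma ** 3    # skewness
    k = np.mean((x - mu) ** 4) / sigma ** 4     # kurtosis
    energy = np.sum(x ** 2)                     # total energy
    # Energy entropy over 8 equal-length sub-bands (an illustrative choice).
    bands = np.array_split(x, 8)
    e = np.array([np.sum(b ** 2) for b in bands])
    p = e / e.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    return np.array([pv, rms, var, sw, k, energy, entropy])

rng = np.random.default_rng(0)
feat = statistical_features(rng.standard_normal(1024))
```

In practice, such a vector would be computed for every single-branch reconstruction signal of every sample, stacking the results into the original high-dimensional feature set.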
As discussed previously, the two foremost limitations of most existing data-driven intelligent diagnosis methods [5,6] are that (1) they assume that the distributions of the training and testing data are the same and (2) the variability of working conditions and the diversity of failures often lead to insufficient labeled target fault data. Recently, these two problems have garnered considerable attention and have been further investigated by some researchers. An et al. [5] proposed a novel three-layer model inspired by a recurrent neural network (RNN) and TL for REB fault diagnosis under different working conditions. Ma et al. [9], aiming at overcoming the first limitation, proposed a transfer diagnosis framework based on domain adaptation for bearing fault diagnosis across diverse domains. In the work by Gao et al. [46], the finite element method (FEM) was employed to simulate samples with different faults to compensate for the lack of sufficient and complete fault samples. In a study by Liu et al. [47], to address the difficulty of obtaining fault samples from real-world running mechanical systems, a personalized fault diagnosis method for the detection of bearing faults was proposed for the activation of smart-sensor networks using FEM simulations. Some existing research has shown that TL [48] has broad application prospects and wide applicability in various fields [49][50][51]. Feature-based transfer, a mainstream branch of TL technology, has been used in image classification [52][53][54] and has inspired a novel idea for overcoming the two limitations of data-driven intelligent fault diagnosis. In this paper, a novel feature extraction procedure, namely, transferable sensitive feature selection by ReliefF and the sum of mean deviations (TSFRS), is proposed.
TSFRS has the following two aspects: (1) the selection of fault-sensitive features, which combines the ReliefF algorithm and the sum of within-class mean deviations (SMD) of the feature data; (2) a feature-based TL method, namely, transfer component analysis (TCA), which is used to reduce the differences between the marginal distributions of the training and testing data.
After the steps of signal processing and feature extraction, a high-dimensional feature set can usually be generated; if this feature set is used directly in fault pattern recognition, it will lead to very high computational complexity and the degradation of fault diagnosis accuracy [42]. Hence, dimensionality reduction is another key step that must be taken before fault pattern recognition. In fact, the dimensionality reduction of features can not only limit storage requirements and increase the algorithm speed, but can also improve the predictive accuracy of the classifier model by removing noisy and redundant features while retaining the most useful information regarding diverse bearing failures [55]. Dimensionality reduction methods can be classified into either linear or nonlinear methods. Principal component analysis (PCA) and linear discriminant analysis (LDA), two classical linear dimensionality reduction methods, have been extensively used for linear data, but they may be invalid for nonlinear data [56]. Therefore, some nonlinear dimensionality reduction methods, namely, kernel principal component analysis (KPCA), Isomap, Laplacian eigenmaps (LE), and local linear embedding (LLE), among others, have presented valid solutions for the dimensionality reduction of nonlinear data [56]. However, nonlinear dimensionality reduction methods have some limitations in practical applications, such as the "out-of-sample" problem caused by the absence of an explicit mapping matrix [57], the problem of the overlearning of locality [58], and high computational complexity. In recent years, some unsupervised manifold learning methods that preserve the local geometric structure on the data manifold using a linear approximation of the nonlinear mappings have been proposed; some representative methods include locality-preserving projections (LPP) [59], neighborhood-preserving embedding (NPE) [60], and orthogonal neighborhood-preserving projection [61].
Among these manifold learning methods, LPP has attracted attention in the fault diagnosis field [62][63][64], but it does not utilize the label information in dimensionality reduction. LDA is a supervised dimensionality reduction method that considers the label information in feature reduction, but it cannot be directly applied when the within-class and between-class scatter matrices are singular because of the small sample size (SSS) problem [65]. Based on the respective dominant attributes of LPP and LDA, a novel dimensionality reduction method, namely, local Fisher discriminant analysis (LFDA), was proposed by Sugiyama [66]. LFDA takes into account the label information of the data while simultaneously preserving the local geometric structures of the feature data. However, LFDA only considers the neighbor relationships between samples of the same class while ignoring those between samples of different classes. To alleviate the SSS problem of LDA, the maximum margin criterion (MMC), a supervised dimensionality reduction method, was proposed [65]. Inspired by the attributes of LFDA and MMC, this paper proposes a novel feature reduction method, namely, the local maximum margin criterion (LMMC), an improved MMC in which both the neighbor relationships between samples of the same class and those between samples of different classes are considered. Therefore, the contributions of this paper are summarized as follows. To solve the problem of fault diagnosis via vibration data that are variably distributed under different working conditions, a novel intelligent fault diagnosis framework for REBs based on multidomain features, which systematically combines statistical feature extraction, feature-based TL, feature reduction, and pattern recognition, is proposed. TSFRS, a novel feature extraction procedure, is proposed for the selection of the transferable fault-sensitive statistical features as the basis of the subsequent fault analysis.
LMMC, an improved feature reduction method, is proposed for the excavation of abundant and valuable information with low dimensionality, which is beneficial for fault diagnosis. The execution of the proposed fault diagnosis framework for REBs is divided into four steps, namely, signal processing, feature extraction, feature reduction, and fault pattern recognition. First, DTCWPT is performed on the raw vibration signals collected from REBs, and different terminal nodes can be obtained. Multidomain statistical features are then extracted from the reconstructed signals of the terminal nodes to construct the original feature set. Secondly, based on the ReliefF algorithm and the mean deviation, a new evaluation index, namely, the ratio of the feature weight value to the SMD, is employed to indicate the sensitivity of the statistical features; the most sensitive features can be selected to form a feature subset that represents the fault peculiarities of REBs. Additionally, TCA is used to reduce the differences between the marginal distributions of the feature datasets under different working conditions. Thirdly, LMMC is performed on the original high-dimensional feature set to acquire a new lower-dimensional projection of it. Finally, vibration signals collected from two test rigs under different working conditions are employed to validate the effectiveness, adaptability, and superiority of the proposed method for the identification and classification of REB faults. The first test rig is from Case Western Reserve University, on which two cases with 12 fault types under different motor loads of 2 hp and 3 hp are employed for validation experiments. The second test rig is an SQI-MFS test rig, on which two cases with 10 fault types under different motor speeds of 1200 rpm and 1800 rpm are employed to further verify the adaptability of the proposed method. The remainder of this paper is organized as follows.
In Section 2, the theoretical backgrounds of the DTCWPT technique, TCA technique, and MMC are summarized. In Section 3, a description of the proposed diagnosis technique is provided, and the fault diagnosis framework of REBs is illustrated. In Section 4, REB fault vibration signals collected from two experimental test rigs are investigated to verify the performance of the proposed method. Finally, the conclusion of this work is presented in Section 5. Some acronyms used in this paper are presented in Table 1.

Dual-Tree Complex Wavelet Packet Transform (DTCWPT)

DTCWT, an enhancement of DWT, is characterized by some important properties, including near shift-invariance and the inhibition of frequency aliasing components [36]. However, DTCWT cannot be used for multiresolution analysis in the high-frequency band, where useful fault feature information usually exists [67]. To address this limitation, DTCWPT, which is composed of two parallel discrete wavelet packet transforms with different low- and high-pass filters, can present a more precise frequency band partition over the entire analyzed frequency band [36,40]. DTCWPT is divided into real- and imaginary-part wavelet packet transforms, which can be regarded as the real and imaginary trees, respectively. The real tree decomposition and the corresponding coefficients can be expressed as follows [37]:

c^{Re}_{l+1,2N}(n) = \sum_{k} h_0(k - 2n)\, c^{Re}_{l,N}(k),
c^{Re}_{l+1,2N+1}(n) = \sum_{k} h_1(k - 2n)\, c^{Re}_{l,N}(k),

where c^{Re}_{l,N} denotes the coefficients in the real tree at scale l and node N, and h_0 and h_1 are the low-pass and high-pass filters, respectively. The imaginary tree decomposition and the corresponding coefficients can be expressed as follows:

c^{Im}_{l+1,2N}(n) = \sum_{k} g_0(k - 2n)\, c^{Im}_{l,N}(k),
c^{Im}_{l+1,2N+1}(n) = \sum_{k} g_1(k - 2n)\, c^{Im}_{l,N}(k),

where c^{Im}_{l,N} denotes the coefficients in the imaginary tree at scale l and node N, and g_0 and g_1 are the low-pass and high-pass filters, respectively. When the scale l is 0, the coefficients of both trees are equal to the original signal x(t), namely, c^{Re}_{0,0} = c^{Im}_{0,0} = x(t). The reconstruction procedure of DTCWPT proceeds analogously, using the wavelet packet synthesis filters \tilde{h}_0, \tilde{h}_1 of the real tree and \tilde{g}_0, \tilde{g}_1 of the imaginary tree, respectively. DTCWPT is characterized by two prominent advantages: (1) it is beneficial to the detection of multiple harmonic signals, and (2) it can help to extract the periodic impact features of signals. Therefore, in this work, DTCWPT is used to process the original vibration signals, and the corresponding single-branch reconstruction signals of the terminal nodes are used to extract the original features.
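As an illustrative sketch of this filter-bank recursion, the toy example below implements a one-level split/merge step with orthonormal Haar filters and verifies perfect reconstruction. Real DTCWPT implementations use two such trees whose filter pairs (e.g., q-shift filters) are designed jointly for near shift-invariance; the Haar filters here are purely illustrative:

```python
import numpy as np

# Orthonormal Haar analysis filters (illustrative; not the q-shift
# filters used in a real DTCWPT).
h0 = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass
h1 = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass

def wp_split(c, h0, h1):
    """One decomposition step: filter and downsample by 2 (periodic)."""
    n = len(c)
    lo, hi = np.zeros(n // 2), np.zeros(n // 2)
    for i in range(n // 2):
        for k in range(len(h0)):
            lo[i] += h0[k] * c[(2 * i + k) % n]
            hi[i] += h1[k] * c[(2 * i + k) % n]
    return lo, hi

def wp_merge(lo, hi, h0, h1):
    """One reconstruction step: upsample and filter with synthesis filters."""
    n = 2 * len(lo)
    c = np.zeros(n)
    for i in range(len(lo)):
        for k in range(len(h0)):
            c[(2 * i + k) % n] += h0[k] * lo[i] + h1[k] * hi[i]
    return c

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
lo, hi = wp_split(x, h0, h1)
x_rec = wp_merge(lo, hi, h0, h1)
err = np.max(np.abs(x - x_rec))  # perfect reconstruction up to round-off
```

Applying `wp_split` recursively to both outputs yields the full wavelet packet tree, whose terminal nodes are the decomposition bands described above.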

Maximum Margin Criterion (MMC) and Local Fisher Discriminant Analysis (LFDA)

Linear discriminant analysis (LDA), one of the most popular methods for dimensionality reduction in statistics [68,69], was proposed by Fisher [70] for binary classification problems and was further extended to multiclass cases by Rao [71]. However, LDA cannot be directly applied when the within-class and between-class scatter matrices are singular because of the SSS problem [65]. To address this drawback, Li et al. [65] and Song et al. [72] used the difference of the between-class and within-class scatter matrices as a discriminant criterion, called the maximum margin criterion (MMC), which avoids the need to compute the inverse of the within-class scatter matrix. Thus, the SSS problem of traditional LDA is alleviated.
Let X = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^M be the input dataset and l_i \in L = \{C_1, C_2, \ldots, C_c\} be the associated class labels, where x_i (i = 1, \ldots, N) is an M-dimensional sample, N is the number of samples, and c is the total number of classes. To reduce the dimensionality of a sample x \in \mathbb{R}^M, some measure is needed to assess similarity or dissimilarity. We want to find a linear transformation W such that, after the dimensionality reduction, the similarity or dissimilarity information is preserved as much as possible. In the work by Li et al. [65], the Euclidean distance was applied to measure the dissimilarity, and the objective of the MMC is for a sample to be close to those in the same class but far from those in different classes. Thus, the MMC can be presented as follows:

J = \frac{1}{2} \sum_{i=1}^{c} \sum_{j=1}^{c} p_i p_j \, d(C_i, C_j),

where p_i and p_j are the prior probabilities of the classes C_i and C_j, respectively. The interclass distance d(C_i, C_j) is first defined as the distance between the mean vectors, that is,

d(C_i, C_j) = d(m_i, m_j),

where m_i and m_j are the mean vectors of the classes C_i and C_j, respectively. However, this definition neglects the scatter of the classes and is therefore unsuitable: even though d(m_i, m_j) is large, it is not easy to separate two classes that have large spreads and overlap with each other. To account for the scatter of the classes, the interclass distance can be redefined as

d(C_i, C_j) = d(m_i, m_j) - s(C_i) - s(C_j),

where s(C_i) measures the scatter of class C_i. With s(C_i) = tr(S_i), this leads to the criterion

J = tr(S_b - S_w),

where S_b is the between-class scatter matrix, S_w is the within-class scatter matrix, and tr(S_b) measures the between-class separation, while tr(S_w) measures the within-class cohesion. Local Fisher discriminant analysis (LFDA), a linear supervised dimensionality reduction method, was proposed by Sugiyama [66].
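Before turning to LFDA, the MMC step just described can be sketched as follows: the projection is given by the leading eigenvectors of S_b - S_w, with no matrix inversion required (function and variable names are illustrative):

```python
import numpy as np

def mmc_fit(X, y, d):
    """MMC sketch: project onto the d leading eigenvectors of S_b - S_w.
    X is (n_samples, n_features); no inverse of S_w is needed, so the
    small-sample-size problem of LDA does not arise."""
    classes = np.unique(y)
    n, m = X.shape
    mu = X.mean(axis=0)
    Sb = np.zeros((m, m))
    Sw = np.zeros((m, m))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)  # between-class scatter
        Sw += (Xc - mc).T @ (Xc - mc)               # within-class scatter
    evals, evecs = np.linalg.eigh(Sb - Sw)          # symmetric eigenproblem
    order = np.argsort(evals)[::-1]                 # largest eigenvalues first
    return evecs[:, order[:d]]                      # (m, d) projection matrix

rng = np.random.default_rng(2)
# Three toy classes, each shifted along a different axis.
X = np.vstack([rng.standard_normal((20, 5)) + 4 * np.eye(5)[c] for c in range(3)])
y = np.repeat(np.arange(3), 20)
W = mmc_fit(X, y, 2)
Z = X @ W  # reduced 2-D features
```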
LFDA can not only maximize between-class separability while preserving the within-class local manifold structure in the reduced-dimensional space, but also inherits an excellent property from LDA; that is, it has an analytic form of the embedding matrix, and the solution can be easily computed by solving a generalized eigenvalue problem [66]. LFDA and LDA share the same optimization framework J(U); furthermore, LFDA incorporates local information into the definition of the weights. The objective of LDA is to maximize the ratio of the between-class scatter matrix S_b to the within-class scatter matrix S_w:

J(U) = \arg\max_U \, tr\big[ (U^T S_w U)^{-1} U^T S_b U \big],

where U is a projection matrix, and S_b and S_w are defined as follows:

S_b = \sum_{l=1}^{c} n_l (\mu_l - \mu)(\mu_l - \mu)^T,
S_w = \sum_{l=1}^{c} \sum_{i: y_i = l} (x_i - \mu_l)(x_i - \mu_l)^T,

where n_l is the number of samples in class l, \mu_l is the mean of the samples in class l, and \mu = (1/n) \sum_{i=1}^{n} x_i is the mean of all samples, with n the total number of samples. According to the literature [66,73], S_b and S_w also have the equivalent pairwise forms

S_b = \frac{1}{2} \sum_{i,j=1}^{n} W^{b}_{ij} (x_i - x_j)(x_i - x_j)^T,
S_w = \frac{1}{2} \sum_{i,j=1}^{n} W^{w}_{ij} (x_i - x_j)(x_i - x_j)^T,

for suitable pairwise weights W^{b}_{ij} and W^{w}_{ij}. LFDA incorporates local information into these weights; thus, S_b is replaced by \bar{S}_b and S_w is replaced by \bar{S}_w, whose weights are presented as follows [66]:

\bar{W}^{b}_{ij} = A_{ij}\,(1/n - 1/n_l) if y_i = y_j = l, and 1/n otherwise,
\bar{W}^{w}_{ij} = A_{ij}/n_l if y_i = y_j = l, and 0 otherwise,

where the affinity A_{ij} \in [0, 1] can be defined as

A_{ij} = \exp\big( -\|x_i - x_j\|^2 / (\sigma_i \sigma_j) \big),

where \sigma_i = \|x_i - x_i^{(k)}\| is the local scaling around x_i and x_i^{(k)} is the k-th nearest neighbor of x_i. If x_i and x_j are close to each other in the feature space, A_{ij} is large; otherwise, it is small [66]. With this weighting, far-apart sample pairs in the same class are down-weighted and have less influence on \bar{S}_b and \bar{S}_w, whereas sample pairs in different classes are not down-weighted [74].
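The local-scaling affinity A_ij described above can be sketched as follows (the neighbor order k and the function name are illustrative choices):

```python
import numpy as np

def local_affinity(X, k=7):
    """Local-scaling affinity A_ij = exp(-||xi - xj||^2 / (sigma_i * sigma_j)),
    where sigma_i is the distance from x_i to its k-th nearest neighbor.
    LFDA uses this affinity to down-weight far-apart within-class pairs."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    # Row-sorted distances include the self-distance 0 at index 0,
    # so index k is the distance to the k-th nearest neighbor.
    sigma = np.sort(D, axis=1)[:, k]
    A = np.exp(-(D ** 2) / (np.outer(sigma, sigma) + 1e-12))
    return A

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 4))
A = local_affinity(X, k=7)
```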

Transfer Component Analysis (TCA)
Transfer component analysis (TCA) [75] is a typical feature-based TL method. Given source-domain data (a training dataset with corresponding labels) and target-domain data (a dataset without corresponding labels), TCA aims to reduce the difference between the marginal distributions of the two datasets by leveraging the transferable features or knowledge of the source domain [75].
A domain D consists of a feature space X and its marginal probability distribution P(X), where X = \{x_1, \ldots, x_n\} is a training dataset; thus, D can be represented as \{X, P(X)\}. A task T consists of a label space Y and a predictive function f(X), where Y = \{y_1, \ldots, y_n\} contains the training labels and f(X) = Q(Y \mid X) represents the conditional probability distribution. There are two learning tasks, namely, the task T_S of the source domain D_S and the task T_T of the target domain D_T. Feature transfer is employed to facilitate the learning of the target predictive function f_T(X) in D_T by using the knowledge and information in D_S and T_S, where D_S \neq D_T or T_S \neq T_T [75]. Given two datasets X_S and X_T with P_S(X_S) \neq P_T(X_T), a transformation \phi is assumed to exist such that

P(\phi(X_S)) \approx P(\phi(X_T)),

where \phi is a nonlinear mapping function into a reproducing kernel Hilbert space H. The learning objective of TCA is to find a domain-invariant feature space in which the distance between the marginal distributions of the source domain and the target domain is minimized. The distribution distance is measured using the maximum mean discrepancy (MMD) criterion, which is defined as follows [76]:

Dist(X_S, X_T) = \Big\| \frac{1}{n_S} \sum_{i=1}^{n_S} \phi(x_{S,i}) - \frac{1}{n_T} \sum_{j=1}^{n_T} \phi(x_{T,j}) \Big\|_H^2 = tr(KL),

where n_S and n_T represent the numbers of source-domain and target-domain samples, respectively, tr(\cdot) denotes the trace of a matrix, and K is the kernel matrix

K = \begin{bmatrix} K_{S,S} & K_{S,T} \\ K_{T,S} & K_{T,T} \end{bmatrix},

in which K_{S,S}, K_{S,T}, and K_{T,T} are the kernel matrices in the source domain, cross domain, and target domain, respectively. The matrix L can be calculated as

L_{ij} = 1/n_S^2 if x_i, x_j \in X_S; \quad 1/n_T^2 if x_i, x_j \in X_T; \quad -1/(n_S n_T) otherwise.

TCA maps the features of the two domain datasets into the same kernel space through a unified kernel function. Introducing W = K^{-1/2}\widetilde{W}, where \widetilde{W} is a low-dimensional transformation matrix, the resultant kernel matrix can be calculated as

\widetilde{K} = K W W^T K,

and the distribution distance between the different domain datasets can be written as

Dist(X_S, X_T) = tr(W^T K L K W).

The complexity of W needs to be controlled by a regularization term tr(W^T W), which is also employed to avoid rank deficiency. Thus, the objective function of TCA can be written as

\min_W \; tr(W^T K L K W) + \mu \, tr(W^T W) \quad \text{s.t.} \quad W^T K H K W = I,

where \mu is a trade-off parameter that guarantees that the optimization objective is well defined, I \in \mathbb{R}^{m \times m} is an identity matrix, and H is the centering matrix. The constraint W^T K H K W = I avoids the trivial solution W = 0.
In summary, the optimization objective of TCA is that the latent space spanned by the learned components preserves the data variance while minimizing the distance between the marginal distributions of the different domain datasets as much as possible. This optimization problem can be efficiently solved as a trace optimization problem, i.e., by computing the m leading eigenvectors of (K L K + \mu I)^{-1} K H K.
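A compact sketch of this eigenproblem-based solution follows (an RBF kernel is assumed as one common, illustrative choice; function and parameter names are not from the paper):

```python
import numpy as np

def tca(Xs, Xt, dim=2, mu=1.0, gamma=1.0):
    """TCA sketch: minimise tr(W^T K L K W) + mu*tr(W^T W) subject to
    W^T K H K W = I, solved via the eigenvectors of (KLK + mu*I)^{-1} KHK."""
    X = np.vstack([Xs, Xt])
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=2)
    K = np.exp(-gamma * sq)                        # RBF kernel matrix
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    L = np.outer(e, e)                             # MMD coefficient matrix
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    evals, evecs = np.linalg.eig(M)
    order = np.argsort(-evals.real)                # leading eigenvectors
    W = evecs[:, order[:dim]].real
    Z = K @ W                                      # transformed features
    return Z[:ns], Z[ns:]

rng = np.random.default_rng(4)
# Toy source and (mean-shifted) target domains.
Zs, Zt = tca(rng.standard_normal((15, 3)), rng.standard_normal((15, 3)) + 2.0)
```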

Feature Extraction Procedure TSFRS (Transferable Sensitive Feature Selection by ReliefF and the Sum of Mean Deviation)

TSFRS has two components: (1) the selection of fault-sensitive features, which combines the ReliefF algorithm and the sum of within-class mean deviations (SMD) of the feature data, and (2) a feature-based TL method, in which TCA is used to reduce the difference between the marginal distributions of the training and testing data.
In this paper, it is suggested that the sensitive statistical features be selected before the implementation of fault pattern recognition. Thus, the ReliefF algorithm [77] and the MD are employed on a dataset that includes different statistical features for the various REB conditions. Each type of statistical feature is evaluated by the ReliefF algorithm to determine its weight value (WV). ReliefF, a supervised algorithm for feature ranking, is usually applied in data preprocessing as a feature subset selection method. The basic concept of ReliefF is to sample instances at random, find their nearest neighbors, and adjust a feature weighting vector to give more weight to features that discriminate each instance from the neighbors of different classes. For each kind of statistical feature, the MD of the feature data samples in each REB condition can be calculated, and the sum of the MD over all REB conditions can be further calculated. For the evaluation of each statistical feature, the higher the WV, the greater the discriminative degree of the feature; the lower the value of the MD, the greater the class cohesion of the feature. Therefore, the ratio of the WV to the SMD is selected to indicate the sensitivity of a statistical feature, based on which the sensitive feature subset can be selected from the original feature set.
Furthermore, the variable working conditions of REBs in industrial scenarios can lead to distribution differences between the training and testing data [6]. Therefore, after the construction of the sensitive feature subset, TCA is employed to reduce the difference between the distributions of the sensitive feature training and testing subsets. The TSFRS procedure is summarized in the following steps.
Step 1. In the training samples, there are M types of REB faults, N vibration signal samples for each type of REB fault pattern, and K types of statistical features. By processing the vibration signals, the original feature sets [FS_1, FS_2, \ldots, FS_K] can be obtained, where FS_k can be expressed as

FS_k = [S^k_{ij}], \quad i = 1, \ldots, M, \; j = 1, \ldots, N,

where S^k_{ij} is the k-th statistical feature of the j-th sample of the i-th type of REB fault. Next, each FS_k is evaluated by the ReliefF algorithm to obtain the corresponding feature WV, which can be used to evaluate the distinguishability of the feature: the higher the WV, the greater the discriminative degree of the feature class.
Step 2. The MD of the feature samples of each statistical feature in each type of REB condition is calculated, i.e., the MD of the elements of each row of FS_k:

MD_k(i) = \frac{1}{N} \sum_{j=1}^{N} \big| S^k_{ij} - \bar{S}^k_i \big|, \quad \bar{S}^k_i = \frac{1}{N} \sum_{j=1}^{N} S^k_{ij}.

Next, SMD(k), the sum of the MD values of the k-th statistical feature over all REB conditions, can be obtained as

SMD(k) = \sum_{i=1}^{M} MD_k(i).

In this paper, it is presumed that the MD can be used to express the cohesion of the data. Thus, a mean deviation sequence SMD = \{SMD(1), SMD(2), \ldots, SMD(K)\} is obtained, which serves as another evaluation index for sensitive feature selection: the lower the value of SMD(k), the greater the class cohesion of the feature.
Step 3. A new sequence WSD = \{WSD(1), WSD(2), \ldots, WSD(K)\} is obtained, where WSD(k) is defined as

WSD(k) = \frac{WV(k)}{SMD(k)}.

In this paper, it is presumed that the greater the value of WSD(k), the better the fault sensitivity of the corresponding feature. Therefore, the sorted ratio sequence of WV and SMD can be obtained by sorting the WSD values in descending order.
Step 4. For the labeled training data under one working condition and the unlabeled testing data under another working condition, the sequence WSD sorted in descending order is acquired from the training data and is used to select the most sensitive statistical features, which construct a sensitive feature set (SFS). The same most sensitive statistical features are then directly applied to extract features from the testing data.
Thus, two sensitive feature sets can be obtained: the SFS of the training data, called SFS_train, and the SFS of the testing data, called SFS_test. Furthermore, SFS_train and SFS_test are used as the input of TCA, which generates a new feature set TSFS_test in which the difference in marginal distributions between SFS_train and TSFS_test is minimized.
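The TCA step above can be sketched minimally as follows, after Pan et al.'s transfer component analysis. The linear kernel, the regularizer mu, and the component count are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def tca(Xs, Xt, dim=2, mu=1.0):
    """Learn transfer components that shrink the MMD between the source
    (Xs) and target (Xt) marginal distributions; returns both sets mapped
    into a shared dim-dimensional space."""
    X = np.vstack([Xs, Xt])
    n1, n2 = len(Xs), len(Xt)
    n = n1 + n2
    K = X @ X.T                                     # linear kernel matrix
    e = np.r_[np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)]
    L = np.outer(e, e)                              # MMD matrix
    H = np.eye(n) - np.ones((n, n)) / n             # centering matrix
    # maximize embedded variance (KHK) while penalizing MMD (KLK) + mu*I
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(A)
    idx = np.argsort(-vals.real)[:dim]
    W = vecs[:, idx].real
    Z = K @ W                                       # transformed features
    return Z[:n1], Z[n1:]
```

Here `Xs` would hold SFS_train and `Xt` SFS_test; the second returned matrix plays the role of TSFS_test.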

Local Maximum Margin Criterion (LMMC).
Although the MMC can avoid the SSS problem of LDA, it may be invalid for nonlinear datasets because it does not consider the local structure of the dataset. Conversely, LFDA considers the neighbor relationships between samples of the same class while ignoring those between samples of different classes. To address this problem, and inspired by the attributes of the MMC and LFDA, this paper proposes a novel feature reduction method, the LMMC, which is an improved MMC. The LMMC naturally inherits the merits of the MMC and LFDA; the underlying idea is that the optimization objective of LFDA can be integrated into the MMC, while the neighbor relationships between samples of different classes are additionally taken into consideration.
Based on the descriptions of the MMC and LFDA provided in Section 2, the optimization objective of the LMMC can be obtained by combining the optimization objectives of the MMC and LFDA; in addition, the LMMC improves the definition of the weights with local information. The LMMC and MMC share the same optimization framework, but S_b and S_w are replaced by the local scatters S_b^L and S_w^L, respectively, so the objective function can be presented as J = tr(S_b^L − S_w^L). According to equations (16) and (17), the local S_b^L and S_w^L are defined as S_b^L = (1/2) Σ_{i,j} P_{ij}^{Lb} (x_i − x_j)(x_i − x_j)^T = X(D^{Lb} − P^{Lb})X^T and S_w^L = (1/2) Σ_{i,j} P_{ij}^{Lw} (x_i − x_j)(x_i − x_j)^T = X(D^{Lw} − P^{Lw})X^T, where P^{Lb} and P^{Lw} are the weight matrices and D^{Lb} and D^{Lw} are the corresponding diagonal matrices with D_{ii} = Σ_j P_{ij}. In P^{Lb}, a nonzero entry P_{ij}^{Lb} means that j is a nearest neighbor of i and that they belong to different classes. According to equations (30)-(37), the local structure of the dataset, including the neighbor relationships between samples of the same class and those between samples of different classes, can thus be incorporated into the dimensionality reduction through the weight values.
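Under a row-sample convention (X has one sample per row, so the Laplacian form becomes X^T(D − P)X), the local scatters can be sketched as follows. The heat-kernel affinity and the symmetric k-NN rule are assumptions, since the paper's exact weight definitions (equations (30)-(37)) are not reproduced in this copy.

```python
import numpy as np

def local_scatters(X, y, k=3):
    """Build S_b^L and S_w^L: only nearest-neighbor pairs get nonzero
    weight; between-class pairs feed S_b^L, same-class pairs feed S_w^L."""
    X = np.asarray(X, float)
    y = np.asarray(y)
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    nbrs = np.argsort(d2, axis=1)[:, 1:k + 1]            # k nearest neighbors, excl. self
    P_b = np.zeros((n, n))
    P_w = np.zeros((n, n))
    for i in range(n):
        for j in nbrs[i]:
            a = np.exp(-d2[i, j])                        # heat-kernel weight (assumed)
            if y[i] == y[j]:
                P_w[i, j] = P_w[j, i] = a                # same-class neighbor pair
            else:
                P_b[i, j] = P_b[j, i] = a                # different-class neighbor pair
    D_b = np.diag(P_b.sum(axis=1))
    D_w = np.diag(P_w.sum(axis=1))
    S_b = X.T @ (D_b - P_b) @ X   # = 1/2 sum_ij P_b[i,j] (xi-xj)(xi-xj)^T
    S_w = X.T @ (D_w - P_w) @ X
    return S_b, S_w
```

Both matrices are graph-Laplacian quadratic forms, so they are symmetric and positive semidefinite by construction.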
Let W ∈ R^{M×L} be a linear transformation that maps the high-dimensional dataset from R^M to R^L, where L ≤ M.
Thus, in the lower-dimensional space, the scatter matrices become W^T S_b^L W and W^T S_w^L W, respectively. It is assumed that W is constituted by unit vectors, that is, W = [w_1, ..., w_L] with w_k^T w_k = 1, k = 1, 2, ..., L. Thus, W can be obtained by solving the constrained optimization max Σ_{k=1}^{L} w_k^T (S_b^L − S_w^L) w_k subject to w_k^T w_k = 1, which can be transformed into the eigenvalue problem (S_b^L − S_w^L) w = λw (41). According to equation (41), W is composed of the eigenvectors of (S_b^L − S_w^L) corresponding to the first L largest nonnegative eigenvalues. Finally, with the use of the LMMC, the low-dimensional feature matrices of the training and testing datasets can be obtained with more sensitive and less redundant information for REB fault diagnosis.
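Given the local scatter matrices, the constrained optimization above reduces to a symmetric eigenproblem; a sketch (`S_b_L` and `S_w_L` stand for the local scatters defined earlier):

```python
import numpy as np

def lmmc_projection(S_b_L, S_w_L, L):
    """W = top-L eigenvectors of (S_b^L - S_w^L). The matrix is symmetric,
    so eigh applies and the returned columns are orthonormal unit vectors."""
    vals, vecs = np.linalg.eigh(S_b_L - S_w_L)
    idx = np.argsort(-vals)[:L]       # L largest eigenvalues
    return vecs[:, idx]               # columns w_1, ..., w_L with w_k^T w_k = 1
```

Under the row-sample convention, the low-dimensional training and testing feature matrices are then obtained as `Z = X @ W`, reusing the same W for both sets.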

System Framework.
The implementation of the proposed fault diagnosis framework is presented in Figure 1, in which statistical analysis and artificial intelligence approaches are systematically blended to detect and diagnose REB faults under different working conditions. The entire fault diagnosis procedure is divided into four steps: signal processing, feature extraction, feature reduction, and fault pattern recognition.
In the signal processing step, vibration signals collected from REBs under different working conditions are decomposed into different wavelet packet nodes by DTCWPT.
The single-branch reconstruction signals of the terminal nodes are employed to extract statistical features. In the feature extraction step, with the proposed TSFRS, the most sensitive statistical features are selected based on the training dataset to construct a sensitive feature subset for training the classifier, and the same features are directly applied to extract features from the testing dataset. The sensitive feature subsets of the training and testing datasets are used as the source domain data and the target domain data, respectively, and TCA is used to reduce the difference in their marginal distributions. For feature reduction, the low-dimensional training feature space is acquired by the proposed LMMC, which generates a projection that can be directly used for the dimensionality reduction of the testing feature dataset; thus, the low-dimensional testing feature dataset is obtained. The WSD and the projection matrix W are obtained by processing the training dataset and are directly applied to the testing dataset. In the final step, the low-dimensional training feature dataset, together with the fault type labels, is employed to train the classifier. In this paper, a support vector machine (SVM) is used as the fault pattern recognition classifier. The trained classifier then conducts fault pattern recognition on the low-dimensional testing feature dataset. Finally, the procedure outputs the fault identification and classification accuracy.

Experimental Setup and Cases.
The REB vibration data from Case Western Reserve University (CWRU) [78], which reproduce several fault scenarios, were used to verify the effectiveness of the proposed methods. The experimental test rig is presented in Figure 2; it was composed of an electric motor (left), a torque transducer/encoder (center), a dynamometer (right), and control circuitry (not shown). An SKF6205-2RS deep-groove REB was used in the test rig, and electro-discharge machining was employed to seed single-point defects with different fault diameters, namely, 0.007, 0.014, 0.021, and 0.028 inches. The collected vibration signals consisted of inner race fault signals, ball fault signals, outer race fault signals, and normal signals. The test rig supported a motor load of 0-3 horsepower (hp), with corresponding motor speeds of 1797 to 1730 rpm. Three accelerometers were placed at the 12 o'clock position of the bearing housings. The sampling frequency was 12 kHz for the drive-end and fan-end bearings. The first group dataset includes two cases (cases 1 and 2), for which the samples of 2 hp are used as training samples: in case 1, the samples of 2 hp are used as testing samples, and in case 2, the samples of 3 hp. The second group dataset includes two cases (cases 3 and 4), for which the samples of 3 hp are used as training samples: in case 3, the samples of 3 hp are used as testing samples, and in case 4, the samples of 2 hp. The detailed information of the two groups of experimental datasets is shown in Tables 2 and 3, respectively.

Analysis
Results. According to the diagnosis framework shown in Figure 1, each sample is decomposed into different wavelet packet nodes by DTCWPT with a decomposition level of 4. Thus, 16 terminal nodes, namely, subband signals, can be obtained, and 16 single-branch reconstruction signals of the terminal nodes are generated; these signals are used to produce 16 Hilbert envelope spectra (HES). Using the 6 statistical parameters shown in Table 4, each single-branch reconstruction signal generates 6 statistical features, and each HES likewise generates 6 statistical features.
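Five of the six statistical parameters in Table 4 are legible in this copy (range, mean value, kurtosis, energy, and energy entropy); a sketch of their computation for one reconstructed signal follows (the sixth parameter is omitted because it is not recoverable here):

```python
import numpy as np

def statistical_features(x):
    """Range, mean value, kurtosis, energy, and energy entropy of a
    single-branch reconstruction signal (or HES magnitude sequence)."""
    x = np.asarray(x, float)
    e = x ** 2
    p = e / e.sum()                                    # normalized energy per point
    rng_ = x.max() - x.min()                           # range
    mean = x.mean()                                    # mean value
    kurt = ((x - mean) ** 4).mean() / x.var() ** 2     # (biased) kurtosis
    energy = e.sum()                                   # energy
    energy_entropy = -(p * np.log(p + 1e-12)).sum()    # energy entropy
    return np.array([rng_, mean, kurt, energy, energy_entropy])
```

Applying such a function to each of the 16 reconstruction signals and 16 HES yields the per-node statistical features that are stacked into the original feature set.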
Thus, the 16 single-branch reconstruction signals and 16 HES generate 192 statistical features, which compose the original feature set (OFS). Then, the TSFRS is performed to select the sensitive statistical features (the legible parameters in Table 4 are the range, mean value, kurtosis, energy, and energy entropy), and the proposed feature reduction method LMMC is further performed to obtain a low-dimensional feature set, which is used as the input of the SVM. To verify the effectiveness of the proposed TSFRS and LMMC, two groups of comparative experiments are performed. In addition, WPT is also used in the experiments, and the results using WPT are compared with those using DTCWPT to verify the superiority of DTCWPT. In this paper, the training dataset is employed to train the fault diagnosis model, the testing dataset is employed to test it, and the accuracy results presented in the tables and figures are the average diagnostic accuracy over 12 bearing conditions. Thus, the average diagnostic accuracy is used for the experimental analysis, which is described in detail as follows.
In the first group of experiments, the TSFRS is not applied. For cases 1 and 2, the performance of the models using DTCWPT is generally better than that of the models using WPT; for cases 3 and 4, the performance of the OFS-SVM, OFS-LFDA-SVM, and OFS-LMMC-SVM models using DTCWPT is better than that of these models using WPT. For the OFS-PCA-SVM, OFS-LDA-SVM, and OFS-MMC-SVM models using DTCWPT, the diagnosis results of case 4 are better than those of the models using WPT. In general, DTCWPT has more advantages than WPT. The detailed experimental results of all models using DTCWPT are presented below. For the testing set of case 1, all models obtain preferable diagnosis accuracy: the maximum accuracy of each model exceeds 98%, and the highest accuracy reaches 100%, obtained by OFS-LMMC-SVM. For the testing set of case 2, the working condition differs from that of the training set; the diagnosis accuracy of OFS-SVM only reaches 83.33%, and compared with OFS-SVM, all other models show enhanced diagnosis accuracy. However, the performance of OFS-LMMC-SVM is better than that of the other models, with a highest accuracy of 93.75% when the dimension size is 11. For the testing set of case 3, the maximum accuracy of each model exceeds 96%, and the highest accuracy reaches 100%, obtained by OFS-LMMC-SVM. For the testing set of case 4, the diagnosis accuracy of OFS-SVM only reaches 78.54%, and the models using PCA, LDA, MMC, and LMMC each show an obvious enhancement in diagnosis accuracy; the highest accuracy reaches 92.08%, obtained by OFS-LMMC-SVM. According to the experimental results of the four testing cases, it is evident that the fault diagnosis model using LMMC achieves preferable diagnosis performance.
In the second group of experiments, the TSFRS is applied before feature reduction and fault pattern recognition. OFS-TSFRS-SVM is an SVM-based diagnosis model in which the most sensitive features are selected from the OFS according to the WSD; in addition, TCA is employed to reduce the difference in distribution between the training and testing sensitive feature subsets. The OFS-TSFRS-PCA/LDA/LFDA/MMC/LMMC-SVM models are also SVM-based diagnosis models that use PCA, LDA, LFDA, MMC, and LMMC, respectively.
According to Tables 11-16 and Figures 6-19, the detailed experimental results of all models using DTCWPT are presented as follows.
For the testing set of case 1, all models achieve preferable performance in terms of diagnosis accuracy: the maximum diagnosis accuracy of each model exceeds 98%, and the highest accuracy reaches 100%, obtained by both OFS-TSFRS-LFDA-SVM and OFS-TSFRS-LMMC-SVM. For the testing set of case 2, compared with the experimental results of the first group, the diagnosis accuracies of all models using TSFRS are enhanced. Among these models, the accuracies of OFS-TSFRS-LMMC-SVM and OFS-TSFRS-LFDA-SVM exceed 98%, and the highest diagnosis accuracy, 100%, is attained by OFS-TSFRS-LMMC-SVM. For the testing set of case 3, the maximum accuracy of each model exceeds 96%, and OFS-TSFRS-LMMC-SVM achieves 100% fault diagnosis accuracy, higher than that of the other models. For the testing set of case 4, compared with the experimental results of the first group, the diagnosis accuracies of all models using TSFRS are enhanced; the performance of OFS-TSFRS-LMMC-SVM is the best, with a highest diagnosis accuracy of 99.79% (when the selected feature number, sfn, is 140).
According to the experimental results of the second group, when a suitable parameter sfn is selected, a desirable improvement in the performance of the fault diagnosis model is achieved, yielding preferable diagnosis accuracy. According to Figures 6-19, it is evident that the model attains better diagnosis performance when a suitable sfn is selected. For example, for the testing sets of cases 1, 2, and 3, the diagnosis accuracy of OFS-TSFRS-LMMC-SVM reaches 100% when the sfn is between 35 and 42, and for the testing set of case 4, it reaches 99.79% when the sfn is between 137 and 146. In summary, the effectiveness of the proposed TSFRS and LMMC is demonstrated, and a diagnosis model trained by the proposed diagnosis framework on data from a single working condition can achieve desirable accuracy on a testing set collected from bearings under different working conditions.

The second experimental test rig is shown in Figure 20; there are 10 bearing conditions, which correspond to 10 patterns for fault diagnosis. The bearing vibration signals are divided into several data segments, and each segment, used as one sample, has 5000 data points. Each bearing condition contains 60 samples. The experimental results of OFS-LMMC-SVM are shown in Table 19: for the testing sets of cases 1 and 3, the maximum accuracy reaches 99.67% and 99.17%, respectively, while for the testing sets of cases 2 and 4, it reaches only 67.67% and 59.50%. When the TSFRS is applied, according to the experimental results of OFS-TSFRS-LMMC-SVM shown in Table 20, it is evident that the diagnosis accuracies are enhanced by the use of TSFRS; the corresponding results are presented in Figures 25 and 26. According to the experimental results, especially for the testing sets of cases 2 and 4, the performance of OFS-TSFRS-LMMC-SVM is better than that of the other models when a suitable sfn is selected.
In summary, the effectiveness of the proposed TSFRS and LMMC can be further demonstrated, and the adaptability of the proposed diagnosis framework using four testing sets collected from different bearings under different working conditions is also verified.

Conclusions
Due to the harsh environments and variable working conditions of real industrial scenarios, data-driven fault diagnosis using traditional machine-learning methods suffers limited performance under different working conditions, in which the distributions of the training data and testing data differ. To address this problem, this paper proposed a novel intelligent fault diagnosis framework for REBs under different working conditions that systematically blends statistical analysis with artificial intelligence. In this framework, DTCWPT is used to process raw vibration signals and extract statistical features. A new feature extraction method, the TSFRS, is used to select the most sensitive features to form a sensitive feature subset and to reduce the difference in distribution between the training and testing sensitive feature subsets. A modified MMC, namely, the LMMC, is used for feature dimensionality reduction, and its advantages over other dimensionality reduction methods are demonstrated. SVM is used as an automated fault pattern recognition classifier. Finally, experimental datasets collected from two test rigs, containing samples of different bearing fault conditions (ball fault, inner race fault, and outer race fault) at different defect diameters, are analyzed. According to the experimental results, the proposed methods for REB fault diagnosis have great potential to be beneficial in industrial applications. For experimental test rig 1, a set of comparative cases, namely, cases 1, 2, 3, and 4, is employed. Cases 1 and 2 both select samples at a motor load of 2 hp as the training sets, with samples at motor loads of 2 hp and 3 hp selected as the testing sets, respectively.
Cases 3 and 4 both select samples at a motor load of 3 hp as the training sets, with samples at motor loads of 3 hp and 2 hp selected as the testing sets, respectively. The experimental results indicate that the fault diagnosis model using the proposed methods achieves desirable performance when a suitable parameter sfn is selected, and the maximum diagnosis accuracies of cases 1, 2, 3, and 4 reach 100%, 100%, 100%, and 99.79%, respectively. Experimental test rig 2 is employed to further verify the adaptability and effectiveness of the diagnosis model using the proposed methods. The experimental results show that the proposed methods help the diagnosis model achieve preferable diagnosis accuracy and, at the same time, demonstrate the desirable adaptability of the diagnosis model.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments

This work was funded by the Special Funds Project for Transforming Scientific and Technological Achievements in