Applications of Unsupervised Deep Transfer Learning to Intelligent Fault Diagnosis: A Survey and Comparative Study

Recent progress on intelligent fault diagnosis (IFD) has greatly depended on deep representation learning and plenty of labeled data. However, machines often operate with various working conditions or the target task has different distributions with the collected data used for training (the domain shift problem). Besides, the newly collected test data in the target domain are usually unlabeled, leading to unsupervised deep transfer learning based (UDTL-based) IFD problem. Although it has achieved huge development, a standard and open source code framework as well as a comparative study for UDTL-based IFD are not yet established. In this paper, we construct a new taxonomy and perform a comprehensive review of UDTL-based IFD according to different tasks. Comparative analysis of some typical methods and datasets reveals some open and essential issues in UDTL-based IFD which are rarely studied, including transferability of features, influence of backbones, negative transfer, physical priors, etc. To emphasize the importance and reproducibility of UDTL-based IFD, the whole test framework will be released to the research community to facilitate future research. In summary, the released framework and comparative study can serve as an extended interface and basic results to carry out new studies on UDTL-based IFD. The code framework is available at \url{https://github.com/ZhaoZhibin/UDTL}.

Index Terms-Intelligent fault diagnosis; Unsupervised deep transfer learning; Taxonomy and survey; Comparative study; Reproducibility

I. INTRODUCTION
With the rapid development of industrial big data and the Internet of Things, Prognostic and Health Management (PHM) for industrial equipment, such as aero-engines, helicopters, and high-speed trains, is becoming increasingly popular, bringing out many intelligent maintenance systems. Intelligent fault diagnosis (IFD) is becoming an essential branch of PHM systems. IFD based on traditional machine learning methods [1], including random forest [2] and support vector machine [3], has been widely applied in research and industry scenarios. However, these methods often need to extract features manually or to combine with other advanced signal processing techniques, such as time-frequency analysis [4] and sparse representation [5], [6]. Meanwhile, with the increment of available data, data-driven methods with representation learning ability are becoming more and more important. Thus, Deep Learning (DL) [7], which can extract useful features automatically from original signals, has gradually become a hot research topic in many fields [8]-[11] as well as PHM [12]-[14]. Effective DL models, such as Convolutional Neural Network (CNN) [15] and Sparse Autoencoder (SAE) [16], have been validated successfully for PHM tasks in current research, and a benchmark study is also given in [17] for better comparison and development.

Z. Zhao, Q. Zhang, X. Yu, C. Sun, S. Wang, R. Yan and X. Chen are with the State Key Laboratory for Manufacturing Systems Engineering, Xi'an Jiaotong University, Xi'an 710049, China. E-mail: zhaozhibin@xjtu.edu.cn; zhangqiyang@stu.xjtu.edu.cn; yxl007@stu.xjtu.edu.cn; ch.sun@xjtu.edu.cn; wangshibin2008@gmail.com; yanruqiang@xjtu.edu.cn; chenxf@mail.xjtu.edu.cn. This work was supported by the Natural Science Foundation of China (No. 52105116) and by the China Postdoctoral Science Foundation (No. 2021M692557 and No. 2021TQ0263). R. Yan is the corresponding author.
Behind the effectiveness of DL-based IFD, there exist two necessary assumptions: 1) samples from the training dataset (source domain) should have the same distribution as those from the test dataset (target domain); 2) plenty of labeled data are available during the training phase. Although labeled data might be generated by dynamic simulations or fault seeding experiments, the generated data are not strictly consistent with the test data in the real scenario. That is, DL models trained on such datasets only possess a weak generalization ability when deployed to test datasets from real applications. In addition, rotating machinery often operates under varying working conditions, such as loads and speeds, which also requires that models trained on the dataset from one working condition can successfully transfer to the test dataset from another working condition. In short, these factors make it hard to generalize or transfer models trained in the source domain to the target domain directly.
Shared features exist in the two domains due to the intrinsic similarity of different application scenarios or different working conditions, which makes this domain shift manageable. Hence, to allow DL models trained in the source domain to transfer well to the target domain, a new paradigm, called deep transfer learning (DTL), should be introduced into IFD. One effective and direct DTL approach is to fine-tune DL models with a few labeled data in the target domain, and then use the fine-tuned model to diagnose the test samples. However, the newly collected data or the data under different working conditions are usually unlabeled, and it is sometimes very difficult, or even impossible, to label these data. Therefore, in this paper, we investigate the unsupervised version of DTL, called unsupervised deep transfer learning-based (UDTL-based) IFD, which makes predictions for unlabeled data in a target domain given labeled data in a source domain. It is worth mentioning that UDTL is sometimes called unsupervised domain adaptation, and in this paper, we do not make a strict distinction between the two concepts.
UDTL is widely used and has achieved tremendous success in computer vision and natural language processing, owing to its application value, open source codes, and baseline accuracies. However, there are few open source codes or baseline accuracies in the field of UDTL-based IFD, although plenty of research has been published for UDTL-based IFD by simply using models that have already been published in other fields. Due to the lack of open source codes, results in these papers are very hard to reproduce for further comparisons. This is not beneficial to identifying the state-of-the-art methods, and furthermore, it is unfavorable to the advancement of this field in the long run. Hence, it is very important to perform a comparative study, provide baseline accuracies, and release open source codes of UDTL-based algorithms. For testing UDTL-based algorithms, the unified test framework, parameter settings, and datasets are three important aspects that affect the fairness and effectiveness of comparisons. Due to the inconsistency of these factors, there are a lot of unfair and unsuitable comparisons. It seems that scholars keep combining new techniques, and the proposed algorithms always perform better than former algorithms, which raises the question: is the improvement beneficial to IFD, or does it just depend on excessive parameter adjustment? Moreover, the open and essential issues in UDTL-based IFD are rarely studied, such as transferability of features, influence of backbones, etc.
There are already some good review papers about transfer learning in IFD. Zheng et al. [18] summarized cross-domain fault diagnosis using the knowledge transfer strategy based on transfer learning and presented some open source datasets, which could be used to verify the performance of diagnosis methods. Yan et al. [19] reviewed recent development of knowledge transfer for rotary machine fault diagnosis using different transfer learning methods and provided four case studies to compare the performance of different methods. Lei et al. [20] reviewed IFD based on machine learning methods with an emphasis on transfer learning theories, which adopt diagnosis knowledge from one or multiple datasets to other related ones, and also pointed out that transfer learning theories might be the essential way to narrow the gap between experimental verification and real applications. However, none of the above review papers focused on UDTL-based IFD or provided an open source test framework for fair and suitable comparisons. They all paid more attention to label-consistent (also called closed set) UDTL-based IFD, which assumes that the source domain has the same label space as the target domain, whereas many recent research papers focus on label-inconsistent or multi-domain UDTL, which is closer to engineering scenarios. Thus, a comprehensive review is still required to cover the development of UDTL-based IFD from the cradle to the bloom and to guide future research.
In this paper, to fill this gap, commonly used UDTL-based settings and algorithms are discussed and a new taxonomy of UDTL-based IFD is constructed. In each category, we also give a comprehensive review of recent development of UDTL-based IFD. Some typical methods are integrated into a unified test framework, which is tested on five datasets. This test framework with source codes will be released to the research community to facilitate research on UDTL-based IFD. With this comparative study and open source codes, the authors try to give an in-depth discussion (it is worth mentioning that results are just a lower bound of the accuracy) of current algorithms and attempt to find the core that determines the transfer performance.
The main contributions of this paper are summarized as follows:
1) New taxonomy and review: We establish a new taxonomy of UDTL-based IFD according to different tasks of UDTL. The hierarchical order follows the number of source domains, the usage of target data in the training phase, the label consistency of source and target domains, the inclusion relationship between label sets of source and target domains, and a transfer methodological level. We also provide the most comprehensive overview of UDTL-based IFD for each category.
2) Various datasets and data splitting: We collect most of the publicly available datasets suitable for UDTL-based IFD and provide a detailed discussion about their adaptability. We also discuss the way of data splitting and explain that it is more appropriate to split data into training and test datasets regardless of whether they are in the source or target domain.
3) Comparative study and further discussion: We evaluate various UDTL-based IFD methods and provide a systematic and comparative analysis from several perspectives to make future studies more comparable and meaningful. We also discuss the transferability of features, influence of backbones, negative transfer, etc.
4) Open source codes: To emphasize the importance and reproducibility of UDTL-based IFD, we release the whole evaluation code framework that implements all UDTL-based methods discussed in this paper. Meanwhile, this is an extensible framework that retains an extended interface for everyone to combine different algorithms and load their own datasets to carry out new studies. The code framework is available at https://github.com/ZhaoZhibin/UDTL.

The rest of this paper is organized as follows: Section II provides the background and definition of UDTL-based IFD. Basic concepts, evaluation algorithms, and a comprehensive review of UDTL-based IFD are introduced in Sections III to V. After that, in Sections VI to VIII, datasets, evaluation results, and further discussions are investigated, followed by the conclusion in Section IX.

II. BACKGROUND AND DEFINITION OF UDTL-BASED IFD

A. The Definition of UDTL
To briefly describe the definition of UDTL, we introduce some basic symbols. It is assumed that labels in the source domain are all available, and the source domain can be defined as follows:

$$\mathcal{D}_s = \left\{ \left( x_i^s, y_i^s \right) \right\}_{i=1}^{n_s}, \quad x_i^s \in X_s, \; y_i^s \in Y_s,$$

where $\mathcal{D}_s$ represents the source domain, $x_i^s \in \mathbb{R}^d$ is the $i$-th sample, $X_s$ is the union of all samples, $y_i^s$ is the label of the $i$-th sample, $Y_s$ is the union of all different labels, and $n_s$ is the total number of source samples. Besides, it is assumed that labels in the target domain are unavailable, and thus the target domain can be defined as follows:

$$\mathcal{D}_t = \left\{ x_i^t \right\}_{i=1}^{n_t}, \quad x_i^t \in X_t,$$

where $\mathcal{D}_t$ represents the target domain, $x_i^t \in \mathbb{R}^d$ is the $i$-th sample, $X_t$ is the union of all samples, and $n_t$ is the total number of target samples.
The source and target domains follow the probability distributions P and Q, respectively. We hope to build a model $\beta(\cdot)$ which can classify unlabeled samples x in the target domain:

$$\hat{y} = \beta(x),$$

where $\hat{y}$ is the prediction. Thus, UDTL aims to minimize the target risk $\varepsilon_t(\beta)$ using source data supervision [21]:

$$\varepsilon_t(\beta) = \Pr_{(x, y) \sim Q}\left[ \beta(x) \neq y \right].$$

Also, the total loss of UDTL can be written as:

$$L = L_c + \lambda L_{\text{UDTL}},$$

where $L_c$ is the Softmax cross-entropy loss shown in (6), $\lambda$ is the trade-off parameter, and $L_{\text{UDTL}}$ represents the partial loss that reduces the feature difference between source and target domains.
The Softmax cross-entropy loss is defined as:

$$L_c = -\mathbb{E}_{(x_i^s, y_i^s) \sim \mathcal{D}_s} \sum_{c=1}^{C} \mathbb{1}\left[ y_i^s = c \right] \log \left( \beta_c\left( x_i^s \right) \right),$$

where C is the number of all possible classes, $\mathbb{E}$ denotes the mathematical expectation, and $\mathbb{1}$ is the indicator function.
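As a minimal sketch (PyTorch, the framework of the released code), the total UDTL objective combines the Softmax cross-entropy of (6) on labeled source data with a weighted transfer loss; the transfer term here is a placeholder, since the concrete choices (MK-MMD, JMMD, CORAL, adversarial losses) are introduced in Section III. The function name and toy sizes are illustrative, not from the released framework.

```python
import torch
import torch.nn.functional as F

def udtl_loss(source_logits, source_labels, transfer_loss, lam=1.0):
    """Total loss L = L_c + lambda * L_UDTL, with lam the trade-off parameter."""
    l_c = F.cross_entropy(source_logits, source_labels)  # Softmax cross-entropy L_c
    return l_c + lam * transfer_loss

# toy usage: 8 source samples, C = 3 classes, zero transfer loss
logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
loss = udtl_loss(logits, labels, transfer_loss=torch.tensor(0.0))
```

With `lam = 0` this degenerates to plain supervised training on the source domain, which is exactly the Basis model used later as a lower-bound reference.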

B. Taxonomy of UDTL-based IFD
In this section, we present our taxonomy of UDTL-based IFD, as shown in Fig. 1. We categorize UDTL-based IFD into single-domain and multi-domain UDTL according to the number of source domains from a macro perspective. In the following, we give a brief introduction to each category, and a detailed description is given in the next part.
1) Single-domain UDTL: These can be further categorized into label-consistent (closed set) and label-inconsistent UDTL. As shown in Fig. 2, label-consistent UDTL means the label sets of source and target domains are consistent. According to Tan et al. [22], label-consistent UDTL can be classified into four categories from a methodological level: network-based, instance-based, mapping-based, and adversarial-based methods. Additionally, we categorize label-inconsistent UDTL into partial, open set, and universal tasks based on the inclusion relationship between label sets. As shown in Fig. 2, partial UDTL means that the target label set is a subspace of the source label set; open set UDTL means that the target label set contains unknown labels; universal UDTL is a combination of the first two conditions. It is worth mentioning that these three tasks can be further divided into the above four methods from a methodological level.
2) Multi-domain UDTL: These can be further categorized into multi-domain adaptation and domain generalization (DG) based on the usage of target data in the training phase. Multi-domain adaptation means that unlabeled samples from the target domain participate in the training phase, while DG is the opposite. Besides, these two conditions can also be further categorized into label-consistent and label-inconsistent UDTL.

C. Motivation of UDTL-based IFD
Distributions of training and test samples are often different due to the influence of working conditions, fault sizes, fault types, etc. Consequently, UDTL-based IFD has been introduced recently to tackle this domain shift problem, since there are some shared features in the specific space. Using these shared features, applications of UDTL-based IFD can be mainly classified into four categories: different working conditions, different types of faults, different locations, and different machines.
1) Different working conditions: Due to the influence of speed, load, temperature, etc., working conditions often vary during the monitoring period. Collected signals may contain domain shift, which means that the distribution of data may differ significantly under different working conditions [23]. The aim of UDTL-based IFD is that the model trained using signals under one working condition can be transferred to signals under a different working condition.
2) Different types of faults: Label difference between source and target domains may exist since different types of faults can happen on the same component. Therefore, there are three cases in UDTL-based IFD. The first one is that unknown fault types appear in the target domain (open set transfer). The second one is that partial fault types of the source domain appear in the target domain (partial transfer). The third one is that the first two cases occur at the same time (universal transfer). The aim of UDTL-based IFD is that the model trained with some types of faults can be transferred to the target domain with different types of faults.
3) Different locations: Sensors installed on the same machine are often responsible for monitoring different components, and sensors located near the faulty component are more suitable to indicate the fault information. However, key components have different failure rates, leading to the situation where signals from different locations have different numbers of labeled data. The aim of UDTL-based IFD is that the model trained with plenty of labeled data from one location can be transferred to the target domain with unlabeled data from other locations.
4) Different machines: Enough labeled fault samples of real machines are difficult to collect due to test cost and security. Instead, plenty of labeled data can be generated from dynamic simulations or fault seeding experiments. Distributions of such data are different from, but similar to, those of real machines, due to the similar structure and measurement situations. Thus, the aim of UDTL-based IFD is that the model trained on simulated or seeded-fault data can be transferred to test data gathered from real machines.

D. The Structure of the Backbone
One of the most important parts of UDTL-based IFD is the structure of the backbone, which acts as the feature extractor and has a huge impact on the test accuracy. For example, in the field of image classification, different backbones, such as VGG [24], ResNet [25], etc., have different abilities of feature extraction, leading to different classification performance.
However, for UDTL-based IFD, different studies use their own backbones, and it is difficult to determine whose backbone is better. Therefore, direct comparisons with results listed in other published papers are unfair and unsuitable due to the different representative capacities of backbones. In this paper, we verify the performance of different UDTL-based IFD methods using the same CNN backbone to ensure a fair comparison.
As shown in Fig. 3, the CNN backbone consists of four one-dimensional (1D) convolutional layers, each followed by a 1D Batch Normalization (BN) layer and a ReLU activation function. Besides, the second combination comes with a 1D Max Pooling layer, and the fourth combination comes with a 1D Adaptive Max Pooling layer to allow inputs of arbitrary length. The convolutional output is then flattened and passed through a fully-connected (Fc) layer, a ReLU activation function, and a Dropout layer. The detailed parameters are listed in Table I.
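A minimal PyTorch sketch of this layer pattern is given below. The sequence of layers follows the description above; the channel counts and kernel sizes are illustrative placeholders, not the exact values of Table I.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Sketch of the 1D-CNN backbone of Fig. 3 (layer sizes are assumptions)."""
    def __init__(self, in_channels=1, out_features=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=15), nn.BatchNorm1d(16), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3), nn.BatchNorm1d(32), nn.ReLU(),
            nn.MaxPool1d(2),               # extra pooling after the 2nd conv block
            nn.Conv1d(32, 64, kernel_size=3), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3), nn.BatchNorm1d(128), nn.ReLU(),
            nn.AdaptiveMaxPool1d(4),       # makes the input length arbitrary
            nn.Flatten(),
        )
        self.fc = nn.Sequential(nn.Linear(128 * 4, out_features), nn.ReLU(), nn.Dropout(0.5))

    def forward(self, x):                  # x: (batch, channels, signal length)
        return self.fc(self.features(x))

x = torch.randn(2, 1, 1024)                # two raw 1D vibration-like signals
feat = Backbone()(x)                       # 256-dimensional features per sample
```

Thanks to the adaptive pooling layer, the same module accepts signals of different lengths without changing the Fc layer.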

III. LABEL-CONSISTENT UDTL
Label-consistent (also called closed set) UDTL-based IFD assumes that the source domain has the same label space as the target domain. In this section, we categorize label-consistent UDTL into network-based, instance-based, mapping-based, and adversarial-based methods from a methodological level.
A. Network-based UDTL

1) Basic concepts: Network-based DTL means that partial network parameters pre-trained in the source domain are transferred directly to serve as partial network parameters of the test procedure, or that network parameters are fine-tuned with a few labeled data in the target domain. The most popular network-based DTL method is to fine-tune the trained model using a few labeled data in the target domain. However, for UDTL-based IFD, labels in the target domain are unavailable. We use the backbone together with a bottleneck layer, consisting of an Fc layer (out_features = 256), a ReLU activation function, a Dropout layer (p = 0.5), and a basic Softmax classifier, to construct our basic model (we call it Basis), which is shown in Fig. 4. The trained model is used to test samples in the target domain directly, which means that source and target domains share the same model and parameters.
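The bottleneck-plus-classifier head of the Basis model (Fig. 4) can be sketched as below. The 512-dimensional backbone output and the 10-class setting are illustrative assumptions; only the bottleneck structure (Fc with 256 output features, ReLU, Dropout p = 0.5, Softmax classifier) follows the text.

```python
import torch
import torch.nn as nn

backbone_out = 512      # assumed feature dimension of the backbone (illustrative)
num_classes = 10        # assumed number of health conditions (illustrative)

# bottleneck: Fc (out_features=256) -> ReLU -> Dropout (p=0.5)
bottleneck = nn.Sequential(nn.Linear(backbone_out, 256), nn.ReLU(), nn.Dropout(0.5))
classifier = nn.Linear(256, num_classes)    # Softmax is applied on the logits

features = torch.randn(4, backbone_out)     # stand-in for backbone(x)
logits = classifier(bottleneck(features))   # same head is reused on target data
probs = torch.softmax(logits, dim=1)
```

Since source and target domains share this model and its parameters, target samples are simply passed through the same head at test time.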
2) Applications to IFD: Pre-trained deep neural networks using the source data were used in [26]-[37] by freezing part of their parameters; part of the network parameters were then transferred to the target network, and the other parameters were fine-tuned with a small amount of target data. Deep neural networks pre-trained on ImageNet were used in [38]-[42] and fine-tuned with limited target data to adapt to the domain of engineering applications. Ensemble techniques and multi-channel signals were used in [43], [44] to initialize the target network, which was fine-tuned with a few training samples from the target domain. Two-dimensional images, such as grey images [45], time-frequency images [46], and thermal images [47], were used to pre-train specifically designed networks, which were transferred to the target tasks via fine-tuning.
Qureshi et al. [48] pre-trained nine deep sparse autoencoders on one wind farm, and predictions on another wind farm were made by fine-tuning the pre-trained networks. Zhong et al. [49] trained a CNN on enough normal samples and then replaced the Fc layers with a support vector machine as the target model. Han et al. [50] discussed and compared three fine-tuning strategies for diagnosing unseen machine conditions: only fine-tuning the classifier, fine-tuning the feature descriptor, and fine-tuning both the feature descriptor and the classifier. Xu et al. [51] pre-trained an offline CNN on the source domain and directly transferred its parameters to the shallow layers of an online CNN, fine-tuning the online CNN on the target domain for online IFD. Zhao et al. [52] proposed a multi-scale convolutional transfer learning network pre-trained on the source domain, and then the model was transferred to other different but similar domains with proper fine-tuning.

B. Instance-based UDTL
1) Basic concepts: Instance-based UDTL refers to re-weighting instances in the source domain to assist the classifier in predicting labels, or using statistics of instances to help align the target domain, such as TrAdaBoost [53] and adaptive Batch Normalization (AdaBN) [54]. In this paper, we use AdaBN to represent instance-based UDTL methods, as it does not require labels from the target domain.
BN, which can be used to avoid the issue of internal covariate shift, is one of the most important techniques in deep learning. BN promotes much faster training since it makes the input distribution more stable. Detailed descriptions and properties can be found in [55]. It is worth mentioning that BN layers are only updated during the training procedure, and the global statistics of the training samples are used to normalize test samples during the test procedure.
AdaBN, a simple and parameter-free technique for the domain shift problem, was proposed in [54] to enhance the generalization ability. The main idea of AdaBN is that the global statistics of each BN layer are replaced with statistics of the target domain during the test phase. In our AdaBN realization, after training, we provide two updating strategies to fine-tune the statistics of BN layers using target data: updating via each batch or via the whole data. In this paper, we update the statistics of BN layers via each batch, considering the memory limit.
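The batch-wise updating strategy can be sketched in PyTorch as below: after training, all learnable parameters are frozen and unlabeled target batches are forwarded in train mode, so that only the BN running statistics are re-estimated on the target domain. The tiny stand-in model and the toy target loader are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.BatchNorm1d(4), nn.ReLU())  # stand-in for a trained model with BN

def adapt_bn(model, target_loader):
    """AdaBN, batch-wise strategy: re-estimate BN statistics on target data."""
    for p in model.parameters():
        p.requires_grad_(False)      # learnable weights and biases stay fixed
    model.train()                    # in train mode, BN updates running_mean/running_var
    with torch.no_grad():
        for x_t in target_loader:    # unlabeled target-domain batches
            model(x_t)
    model.eval()                     # test with the adapted target statistics

# toy target data with a clear mean shift relative to the source domain
target_loader = [torch.randn(32, 4) + 5.0 for _ in range(20)]
adapt_bn(model, target_loader)
```

After adaptation, the BN layers normalize test samples with target-domain statistics instead of the source-domain ones.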
2) Applications to IFD: Xiao et al. [56] used TrAdaBoost to enhance the diagnostic capability of the fault classifier by adjusting the weight factor of each training sample. Zhang et al. [57] and Qian et al. [58] used AdaBN to improve the domain adaptation ability of the model by ensuring that each layer receives data from a similar distribution.

C. Mapping-based UDTL

1) Basic concepts: Mapping-based UDTL refers to mapping features of source and target domains into a shared space and aligning their distributions by minimizing a statistical distance. In this paper, we use MK-MMD, JMMD, and the CORAL loss as representative mapping-based methods.

MK-MMD:
To introduce the definition of MK-MMD, we briefly explain the concept of MMD. MMD was first proposed in [60] and used in transfer learning by many other scholars [66], [67]. MMD, defined in a Reproducing Kernel Hilbert Space (RKHS), is a squared distance between the kernel embeddings of the marginal distributions $P(X_s)$ and $Q(X_t)$. An RKHS is a Hilbert space of functions in which point evaluation is a continuous linear functional; some examples can be found in [68]. The formula of MMD can be written as follows:

$$D_{\mathcal{H}_k}(P, Q) = \left\| \mathbb{E}_{P}\left[ \phi\left( x^s \right) \right] - \mathbb{E}_{Q}\left[ \phi\left( x^t \right) \right] \right\|_{\mathcal{H}_k}^2,$$

where $\mathcal{H}_k$ is the RKHS with kernel k (in general, a Gaussian kernel is used), and $\phi(\cdot)$ is the mapping to the RKHS.
Parameter selection of each kernel is crucial to the final performance. To tackle this problem, MK-MMD, which jointly maximizes the two-sample test power and minimizes the Type II error, was proposed by Gretton et al. [62]. For MK-MMD, scholars often use a convex combination of m kernels $\{k_u\}$ to provide an effective estimation of the mapping:

$$k = \sum_{u=1}^{m} \alpha_u k_u,$$
where $\{\alpha_u\}$ are the weights of the different kernels (in this paper, all $\alpha_u = \frac{1}{m}$). Inspired by the deep adaptation network (DAN) proposed in [21], we design a UDTL-based IFD model by adding MK-MMD into the loss function to realize the feature alignment shown in Fig. 5. The final loss function is defined as follows:

$$L = L_c + \lambda_{\text{MK-MMD}} L_{\text{MK-MMD}},$$

where $\lambda_{\text{MK-MMD}}$ is a trade-off parameter and $L_{\text{MK-MMD}}$ is the multi-kernel version of MMD. Besides, we simply use the Gaussian kernel and set the number of kernels to five. The bandwidth of each kernel is set to the median pairwise distance on the training data, according to the median heuristic [62].

JMMD: MMD and MK-MMD, which are defined to solve the problem $P(X_s) \neq Q(X_t)$, cannot be used to tackle the domain shift generated by joint distributions (e.g. $P(X_s, Y_s) \neq Q(X_t, Y_t)$). Thus, JMMD, proposed in [65], was designed to measure the distance between the empirical joint distributions $P(X_s, Y_s)$ and $Q(X_t, Y_t)$. The formula of JMMD is written as follows [65]:

$$D_{\mathcal{L}}(P, Q) = \left\| \mathbb{E}_{P}\left[ \otimes_{l \in L}\, \phi^l\left( z_l^s \right) \right] - \mathbb{E}_{Q}\left[ \otimes_{l \in L}\, \phi^l\left( z_l^t \right) \right] \right\|^2_{\otimes_{l \in L} \mathcal{H}^l},$$

where $\otimes_{l \in L}\, \phi^l(\cdot)$ is the feature mapping in the tensor product Hilbert space, L is the set of higher network layers, |L| is the number of layers, $z_l^s$ is the activation of the l-th layer generated by the source domain, and $z_l^t$ is the activation of the l-th layer generated by the target domain.
Inspired by the Joint Adaptation Network (JAN), which uses JMMD to align the domain shift [65], we design a UDTL-based IFD method by adding JMMD into the loss function to realize the feature alignment shown in Fig. 6. The final loss function is defined as follows:

$$L = L_c + \lambda_{\text{JMMD}} L_{\text{JMMD}},$$

where $\lambda_{\text{JMMD}}$ is a trade-off parameter. Additionally, the parameter setting of JMMD is the same as that in JAN.
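As a concrete illustration, a single-kernel empirical estimate of the squared MMD described above can be sketched as follows; MK-MMD then simply averages this quantity over m Gaussian kernels with different bandwidths (all $\alpha_u = 1/m$). The fixed bandwidth and toy feature sizes are illustrative; the paper sets bandwidths by the median heuristic.

```python
import torch

torch.manual_seed(0)

def gaussian_kernel(x, y, bandwidth):
    d2 = torch.cdist(x, y).pow(2)                   # pairwise squared distances
    return torch.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2(xs, xt, bandwidth=4.0):
    """Biased empirical estimate of the squared MMD between two feature sets."""
    k_ss = gaussian_kernel(xs, xs, bandwidth).mean()
    k_tt = gaussian_kernel(xt, xt, bandwidth).mean()
    k_st = gaussian_kernel(xs, xt, bandwidth).mean()
    return k_ss + k_tt - 2.0 * k_st

xs = torch.randn(64, 16)         # source-domain features
xt = torch.randn(64, 16) + 3.0   # target features with a clear domain shift
```

Used as $L_{\text{MK-MMD}}$ in the total loss, this term is differentiable with respect to the features, so minimizing it pulls the two feature distributions together.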

CORAL:
The CORAL loss, which aims to align the second-order statistics of the source and target distributions, was first proposed in [69] and further used in UDTL [59]. First of all, following [69] and [59], we give the basic definition of the CORAL loss as:

$$L_{\text{CORAL}} = \frac{1}{4 d^2} \left\| C_s - C_t \right\|_F^2,$$

where $\| \cdot \|_F$ is the Frobenius norm and d is the dimension of each sample. $C_s$ and $C_t$, defined in (13), are the covariance matrices of the source and target features:

$$C_s = \frac{1}{n_s - 1} \left( X_s^{\top} X_s - \frac{1}{n_s} \left( \mathbf{1}^{\top} X_s \right)^{\top} \left( \mathbf{1}^{\top} X_s \right) \right), \quad C_t = \frac{1}{n_t - 1} \left( X_t^{\top} X_t - \frac{1}{n_t} \left( \mathbf{1}^{\top} X_t \right)^{\top} \left( \mathbf{1}^{\top} X_t \right) \right),$$
where $\mathbf{1}$ represents the column vector whose elements are all equal to one. Inspired by Deep CORAL proposed in [59], we design a UDTL-based IFD method by adding the CORAL loss into the loss function to realize the feature transfer shown in Fig. 7. Also, the final loss function is defined as follows:

$$L = L_c + \lambda_{\text{CORAL}} L_{\text{CORAL}},$$

where $\lambda_{\text{CORAL}}$ is a trade-off parameter.

2) Applications to IFD: BDA was used in [70], [71] to adaptively balance the importance of the marginal and conditional distribution discrepancy between feature domains learned by deep neural networks for IFD. The CORAL loss [72], [73] and maximum variance discrepancy (MVD) [74] were used to reduce the distribution discrepancy between different domains. Qian et al. [58], [75] considered higher-order moments and proposed an HKL divergence to adjust domain distributions for rotating machine fault diagnosis. A distance designed to measure source and target tensor representations was proposed in [76] to align tensor representations into an invariant tensor subspace for bearing fault diagnosis.
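For reference, the CORAL loss $L_{\text{CORAL}} = \| C_s - C_t \|_F^2 / (4 d^2)$ described above can be sketched in a few lines; the toy feature sizes are illustrative, and `torch.cov` uses the same unbiased $n - 1$ normalization as the covariance definition.

```python
import torch

torch.manual_seed(0)

def coral_loss(xs, xt):
    """CORAL loss: squared Frobenius distance between feature covariances / (4 d^2)."""
    d = xs.size(1)
    cs = torch.cov(xs.T)            # C_s, unbiased covariance (n_s - 1 denominator)
    ct = torch.cov(xt.T)            # C_t
    return (cs - ct).pow(2).sum() / (4.0 * d * d)

xs = torch.randn(128, 8)            # source features (n_s x d)
xt = 2.0 * torch.randn(128, 8)      # target features with different second-order statistics
```

Only second-order statistics are matched, which makes the loss cheap to compute but blind to higher-order differences between the domains.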
Another metric distance, MMD, has been widely used in the field of intelligent diagnosis [77]-[85]. Tong et al. [86], [87] reduced marginal and conditional distributions simultaneously across domains based on MMD in the feature space by refining pseudo test labels for bearing fault diagnosis. Wang et al. [88] proposed a conditional MMD based on estimated pseudo labels to shorten the conditional distribution distance for bearing fault diagnosis. The marginal and conditional distributions were aligned simultaneously in multiple layers via minimizing MMD [89], [90]. Yang et al. [91] replaced the Gaussian kernel with a polynomial kernel in MMD for better aligning the distribution discrepancy. Cao et al. [92] proposed a pseudo-categorized MMD to narrow the intra-class cross-domain distribution discrepancy. MMD was also combined with other techniques, such as the Grassmann manifold [93], locality preserving projection [94], and graph Laplacian regularization [95], [96], to boost the performance of distribution alignment.
MK-MMD was used in [23], [97]-[101] to better transfer the distribution of learned features in the source domain to that in the target domain for IFD. Han et al. [102] and Qian et al. [103] used JDA to align conditional and marginal distributions simultaneously to construct a more effective and robust feature representation for substantial distribution differences. Wu et al. [104] further used the grey wolf optimization algorithm to learn the parameters of JDA. Based on JMMD, Cao et al. [105] proposed a soft JMMD to reduce both the marginal and conditional distribution discrepancy with the enhancement of auxiliary soft labels.

D. Adversarial-based UDTL
1) Basic concepts: Adversarial-based UDTL refers to methods that use a domain discriminator to reduce the discrepancy between the feature distributions of source and target domains produced by a feature extractor. In this paper, we use two commonly used methods, the domain adversarial neural network (DANN) [106] and the conditional domain adversarial network (CDAN) [107], to represent adversarial-based methods and test the corresponding accuracy.
DANN: Similar to MMD and MK-MMD, DANN is designed to solve the problem $P(X_s) \neq Q(X_t)$. It trains a feature extractor, a domain discriminator distinguishing source and target domains, and a class predictor simultaneously to align the source and target distributions. That is, DANN trains the feature extractor to prevent the domain discriminator from distinguishing differences between the two domains. Let $G_f$ be the feature extractor with parameters $\theta_f$, $G_c$ be the class predictor with parameters $\theta_c$, and $G_d$ be the domain discriminator with parameters $\theta_d$. Then, the prediction loss and the adversarial loss (the binary cross-entropy loss) can be written as follows:

$$L_c\left( \theta_f, \theta_c \right) = -\mathbb{E}_{(x^s, y^s) \sim P} \sum_{c=1}^{C} \mathbb{1}\left[ y^s = c \right] \log \left( G_c\left( G_f\left( x^s \right) \right)_c \right),$$

$$L_d\left( \theta_f, \theta_d \right) = -\mathbb{E}_{x^s \sim P} \log \left[ G_d\left( G_f\left( x^s \right) \right) \right] - \mathbb{E}_{x^t \sim Q} \log \left[ 1 - G_d\left( G_f\left( x^t \right) \right) \right].$$

To sum up, the total loss of DANN can be defined as:

$$L\left( \theta_f, \theta_c, \theta_d \right) = L_c\left( \theta_f, \theta_c \right) - \lambda_{\text{DANN}} L_d\left( \theta_f, \theta_d \right),$$

where $\lambda_{\text{DANN}}$ is a trade-off parameter.
During the training procedure, we need to minimize the prediction loss so that the class predictor predicts true labels as accurately as possible. Meanwhile, we need to maximize the adversarial loss so that the domain discriminator cannot distinguish between the two domains. Thus, solving for the saddle point $(\hat\theta_f, \hat\theta_c, \hat\theta_d)$ is equivalent to the following minimax optimization problem:

$$(\hat\theta_f, \hat\theta_c) = \arg\min_{\theta_f, \theta_c} \mathcal{L}(\theta_f, \theta_c, \hat\theta_d), \qquad \hat\theta_d = \arg\max_{\theta_d} \mathcal{L}(\hat\theta_f, \hat\theta_c, \theta_d).$$

Following [106], this optimization problem can be solved by simply inserting a special gradient reversal layer (GRL), which is parameter-free and flips the sign of the gradient flowing from the subsequent layer.
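The GRL behavior described above can be sketched in a few lines: identity in the forward pass, gradient negated (and scaled by a trade-off factor) in the backward pass. This is a conceptual NumPy illustration; the class name and λ handling are our own, not code from [106]:

```python
import numpy as np

class GradientReversalLayer:
    """Identity in the forward pass; multiplies the incoming gradient
    by -lam in the backward pass, so minimizing the discriminator loss
    w.r.t. the feature extractor effectively maximizes it."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # reverse and scale the gradient

grl = GradientReversalLayer(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
g = np.array([0.1, 0.2, -0.3])
out = grl.forward(x)   # identical to x
rev = grl.backward(g)  # -0.5 * g
```

This one trick lets the whole minimax objective be trained with ordinary gradient descent.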
We design a UDTL-based IFD model by adding the adversarial idea into the loss function to realize feature transfer between source and target domains, as shown in Fig. 8. As in [106], we use a three-layer Fc binary classifier as the domain discriminator. The output dimensions of these Fc layers are 1024 (Fc1), 1024 (Fc2), and 2 (Fc3), respectively, and the Dropout layers use p = 0.5.
CDAN: Although DANN can align the distributions of two domains efficiently, some bottlenecks may remain. As stated in [107], DANN cannot capture complex multi-modal structures, and it is hard to condition the domain discriminator safely. Based on this observation, Long et al. [107] proposed CDAN. To briefly introduce the main idea of CDAN, we first define the multilinear map $\otimes$, which denotes the outer product of multiple random vectors. Given two random vectors $x$ and $y$, the mapping $x \otimes y$ can fully capture the complex multimodal structures inside the data. Besides, the cross-covariance $\mathbb{E}_{xy}[\phi(x) \otimes \phi(y)]$ can successfully model the joint distribution $P(x, y)$. Thus, the conditional adversarial loss is defined as follows:

$$\mathcal{L}_d(\theta_f,\theta_d) = -\frac{1}{n_s}\sum_{i=1}^{n_s}\log G_d\big(f_i^s \otimes g_i^s\big) - \frac{1}{n_t}\sum_{i=1}^{n_t}\log\Big(1 - G_d\big(f_i^t \otimes g_i^t\big)\Big),$$

where $f_i = G_f(x_i)$ is the extracted feature and $g_i = G_c(f_i)$ is the predicted class probability vector; the prediction loss is the same as that in (15).
To reduce the influence of uncertain predictions, the entropy criterion $H(p) = -\sum_{c=0}^{C-1} p_c \log p_c$ is used to quantify the uncertainty of classifier predictions, where $p_c$ is the predicted probability of label $c$. The entropy-aware weight function is defined as

$$w(x) = 1 + e^{-H(p)}, \quad (20)$$

so that hard-to-transfer samples are re-weighted with lower weights in the modified conditional adversarial loss:

$$\mathcal{L}_d(\theta_f,\theta_d) = -\frac{1}{n_s}\sum_{i=1}^{n_s} w(x_i^s)\log G_d\big(f_i^s \otimes g_i^s\big) - \frac{1}{n_t}\sum_{i=1}^{n_t} w(x_i^t)\log\Big(1 - G_d\big(f_i^t \otimes g_i^t\big)\Big). \quad (21)$$

We design a UDTL-based IFD model by embedding the conditional adversarial idea into the loss function to realize feature transfer, as shown in Fig. 9. The final loss function is defined as

$$\mathcal{L}(\theta_f, \theta_c, \theta_d) = \mathcal{L}_c(\theta_f, \theta_c) - \lambda_{\mathrm{CDAN}}\,\mathcal{L}_d(\theta_f, \theta_d),$$

where $\lambda_{\mathrm{CDAN}}$ is a trade-off parameter.
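The entropy-aware weighting above can be read off directly: confident (low-entropy) predictions receive larger weights than uncertain ones. A minimal NumPy sketch (function names and toy probabilities are ours):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H(p) = -sum_c p_c log p_c of a probability vector."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def entropy_weight(p):
    """Entropy-aware weight w = 1 + exp(-H(p)); ranges in (1, 2]."""
    return 1.0 + np.exp(-entropy(p))

confident = [0.98, 0.01, 0.01]     # nearly one-hot prediction
uncertain = [1 / 3, 1 / 3, 1 / 3]  # maximally uncertain prediction

# The confident sample is weighted more heavily than the uncertain one.
assert entropy_weight(confident) > entropy_weight(uncertain)
```

In CDAN's adversarial loss, this down-weights hard-to-transfer (uncertain) samples on both domains.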
2) Applications to IFD: In [108]-[115], the feature extractor was pre-trained with the labeled source data and then used to generate target features. After that, features from source and target domains were trained to maximize the domain discriminator loss, leading to distribution alignment for IFD. Classifier discrepancy [116]-[118], which means using separate classifiers for source and target domains, was introduced into UDTL-based IFD via an adversarial training process. Meanwhile, adversarial training was also combined with other metric distances, such as L1 alignment [119], MMD [120], MK-MMD [121], and JMMD [122], to better match the feature distributions between different domains for IFD. Li et al. [123] used two feature extractors and classifiers trained with MMD and domain adversarial training, respectively, and further utilized ensemble learning to obtain the final results. Qin et al. [124] proposed a multiscale transfer voting mechanism (MSTVM) to improve classical domain adaptation models, and the verified model was trained by MMD and domain adversarial training. In addition, Qin et al. also proposed parameter sharing [125] and multiscale [126] strategies to reduce the complexity of network structures and extract more domain-invariant features; the verified models were trained by domain adversarial training embedded with metric distances such as MMD and CORAL. The Wasserstein distance was used in [127]-[130] to guide adversarial training for aligning the discrepancy of distributions for IFD. Yu et al. [131] combined conditional adversarial DA with a center-based discriminative loss to address both distribution discrepancy and feature discrimination for locomotive fault diagnosis. Li et al. [132] proposed a strategy for bearing fault diagnosis based on minimizing a joint-distribution domain-adversarial loss which embedded pseudo-label information into the adversarial training process. Besides, another adversarial-based strategy is to adopt GANs to generate samples for the target domain [133], [134].

IV. LABEL-INCONSISTENT UDTL
Considering that the label sets of source and target domains are hardly consistent in real applications, it is important to study label-inconsistent UDTL. In this paper, three label-inconsistent transfer settings, including partial UDTL, open set UDTL, and universal UDTL, are studied.
A. Partial UDTL

1) Basic concepts: Partial UDTL, which was proposed in [135], is a transfer learning paradigm where the target label set $\mathcal{C}_t$ is a subspace of the source label set $\mathcal{C}_s$, i.e., $\mathcal{C}_t \subset \mathcal{C}_s$.
2) Partial adversarial domain adaptation (PADA): One of the popular partial UDTL methods, partial adversarial domain adaptation (PADA), was proposed by Cao et al. [136]. The model of PADA is similar to DANN; it further considers that the probabilities of assigning target data to the source-private classes should be small, so the label predictions on all target data are averaged to quantify the contribution of each source class.
The class-level weight is obtained by averaging the label predictions over all target data:

$$\gamma = \frac{1}{n_t}\sum_{i=1}^{n_t} G_c\big(G_f(x_i^t)\big).$$

After normalizing $\gamma$ by its maximum value, $\gamma \leftarrow \gamma / \max(\gamma)$, it serves as a class-level weight. By applying this weight to the losses of the class predictor and domain discriminator, the contributions of source samples belonging to source-private classes can be reduced. The prediction and adversarial losses are rewritten as follows:

$$\mathcal{L}_c(\theta_f,\theta_c) = -\frac{1}{n_s}\sum_{i=1}^{n_s} \gamma_{y_i^s} \sum_{c=0}^{C-1}\mathbb{1}[y_i^s=c]\,\log G_c\big(G_f(x_i^s)\big)_c,$$

$$\mathcal{L}_d(\theta_f,\theta_d) = -\frac{1}{n_s}\sum_{i=1}^{n_s} \gamma_{y_i^s}\log G_d\big(G_f(x_i^s)\big) - \frac{1}{n_t}\sum_{i=1}^{n_t}\log\Big(1 - G_d\big(G_f(x_i^t)\big)\Big),$$

where $y_i^s$ is the true label of source sample $x_i^s$, $\gamma_{y_i^s}$ is the normalized class weight, and $\lambda_{\mathrm{PADA}}$ is the trade-off parameter of the total loss.
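The class-level weight computation is simple enough to show directly. Below is an illustrative NumPy sketch (toy prediction values are ours) of averaging target predictions and normalizing by the maximum:

```python
import numpy as np

def pada_class_weights(target_probs):
    """PADA-style class-level weights: average the class predictions
    over all target samples, then normalize by the maximum so the
    largest weight equals 1."""
    gamma = np.mean(np.asarray(target_probs, dtype=float), axis=0)
    return gamma / gamma.max()

# Toy predictions for 4 target samples over 3 source classes;
# class 2 is an outlier source class the target never predicts.
probs = [[0.7, 0.3, 0.0],
         [0.6, 0.4, 0.0],
         [0.2, 0.8, 0.0],
         [0.5, 0.5, 0.0]]
w = pada_class_weights(probs)
# w[2] == 0: source samples from the outlier class are fully suppressed.
```

Multiplying the source losses by `w[y]` is what down-weights source-private classes during adaptation.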
We design a UDTL-based model by applying the class-level weight to the loss function to reduce the influence of outlier source classes, as shown in Fig. 10. The final loss function is defined as

$$\mathcal{L}(\theta_f, \theta_c, \theta_d) = \mathcal{L}_c(\theta_f, \theta_c) - \lambda_{\mathrm{PADA}}\,\mathcal{L}_d(\theta_f, \theta_d).$$

3) Applications to IFD: In [137], two classification networks were constructed, the class-level weights for the source domain were calculated from the target label predictions of the two networks, and the weights were then applied to the source classification loss to down-weight the influence of outlier source samples. Li et al. [138] added weight modules to the adversarial transfer network and constructed a weighting learning strategy to quantify the transferability of source samples; by filtering out irrelevant source samples, the distribution discrepancy across domains in the shared label space could be reduced. Li et al. [139] proposed a conditional data alignment technique to align the distributions of healthy data and a prediction consistency technique to align the distributions of the other classes in the two domains. In [140], to facilitate the positive transfer of shared classes and reduce the negative transfer of outlier classes, the average domain prediction loss of each source class was used as the class-level weight. To avoid potential negative effects and preserve inter-class relationships, Wang et al. [141] proposed to unilaterally align the target domain to the source domain by adding a consistency loss which forces aligned source features to stay close to pre-trained source features. Deng et al. [142] constructed a sub-domain discriminator for each class to achieve better flexibility, and a double-layer attention mechanism was proposed to assign different attentions to sub-domain discriminators and to samples for selecting relevant samples. Yang et al. [143] proposed to learn domain-asymmetry factors by training a domain discriminator, and source samples were weighted in the distribution adaptation to block irrelevant knowledge.

B. Open set UDTL

1) Basic concepts: In open set UDTL, the target label set contains unknown classes that are absent from the source label set, so the model must both classify shared-class target samples and detect unknown ones.

2) Open set back-propagation (OSBP): OSBP extends the classifier with a (C+1)-th output for the unknown class. To correctly classify source samples, the feature extractor and classifier are trained using the prediction loss $\mathcal{L}_c$ in (6). Moreover, the classifier is trained to recognize target samples as the unknown class by forcing its output $p_{C+1}^t = \tau$, where $\tau$ ranges from 0 to 1, while the feature extractor is trained to deceive the classifier by pushing $p_{C+1}^t$ higher or lower than $\tau$. In this way, a good boundary between known and unknown target samples can be constructed. A binary cross-entropy loss is used for the adversarial training:

$$\mathcal{L}_{adv}(x_i^t) = -\tau \log p_{C+1}(x_i^t) - (1-\tau)\log\big(1 - p_{C+1}(x_i^t)\big).$$

We design a UDTL-based model by introducing the (C+1)-class classifier and adding the adversarial idea to the loss function to form a pseudo decision boundary for the unknown class, as shown in Fig. 11. The saddle point $(\hat\theta_f, \hat\theta_c)$ is solved using the following min-max optimization problem:

$$\hat\theta_c = \arg\min_{\theta_c}\ \mathcal{L}_c + \mathcal{L}_{adv}, \qquad \hat\theta_f = \arg\min_{\theta_f}\ \mathcal{L}_c - \mathcal{L}_{adv}.$$

3) Applications to IFD: Li et al. [145] proposed a new fault classifier to detect the unknown class, and a convolutional auto-encoder model was further built to recognize the number of new fault types in [146]. Zhang et al. [147] proposed an instance-level weighted UDTL method to exploit similarities of target samples during feature alignment; to identify target samples with outlier classes, an outlier classifier was trained using target instances with pseudo outlier labels.
C. Universal UDTL

1) Basic concepts: You et al. [148] proposed universal domain adaptation (UDA), which imposes no prior knowledge on the label sets. In UDA, the given source and target label sets may share a common label set while each also holds a private label set. UDA requires the model to either classify a target sample correctly if it is associated with a label in the common label set, or mark it as "unknown" otherwise. Let $\mathcal{C}_s$ denote the source label set, $\mathcal{C}_t$ the target label set, and $\mathcal{C} = \mathcal{C}_s \cap \mathcal{C}_t$ the common label set. $\overline{\mathcal{C}}_s = \mathcal{C}_s \setminus \mathcal{C}$ and $\overline{\mathcal{C}}_t = \mathcal{C}_t \setminus \mathcal{C}$ represent the source and target private label sets, respectively.
2) Universal adaptation network (UAN): You et al. [148] proposed UAN and designed an instance-level transferability criterion exploiting both domain similarity and prediction uncertainty. The model of UAN is similar to that of DANN; the difference is that UAN adds a non-adversarial domain discriminator $G_{d'}$, whose output $d'(x)$ quantifies the domain similarity of a sample, i.e., the probability that it comes from the source domain. Here $p_{\mathcal{C}_s}$ is the distribution of source data belonging to the label set $\mathcal{C}_s$, $q_{\mathcal{C}_t}$ is the distribution of target data belonging to the label set $\mathcal{C}_t$, and $p_{\mathcal{C}}$ and $q_{\mathcal{C}}$ are the distributions of source and target data belonging to the common label set $\mathcal{C}$, respectively. Considering that entropy can quantify prediction uncertainty, they assumed that predictions on common-class samples are more certain (lower entropy) than those on private-class samples. The instance-level transferability criterion for source and target samples can then be defined as follows:

$$\omega(x_i^s) = \frac{H\big(\hat y(x_i^s)\big)}{\log |\mathcal{C}_s|} - d'(x_i^s), \qquad \omega(x_i^t) = d'(x_i^t) - \frac{H\big(\hat y(x_i^t)\big)}{\log |\mathcal{C}_s|},$$

where $\omega(x_i^s)$ and $\omega(x_i^t)$ indicate the probability of a source sample $x_i^s$ and a target sample $x_i^t$ belonging to the common label set $\mathcal{C}$. The loss of the domain discriminator $G_d$ in (16) is modified to:

$$\mathcal{L}_d(\theta_f,\theta_d) = -\frac{1}{n_s}\sum_{i=1}^{n_s} \omega(x_i^s)\log G_d\big(G_f(x_i^s)\big) - \frac{1}{n_t}\sum_{i=1}^{n_t} \omega(x_i^t)\log\Big(1 - G_d\big(G_f(x_i^t)\big)\Big).$$

The loss of the non-adversarial domain discriminator $G_{d'}$ is:

$$\mathcal{L}_{d'}(\theta_f,\theta_{d'}) = -\frac{1}{n_s}\sum_{i=1}^{n_s}\log G_{d'}\big(G_f(x_i^s)\big) - \frac{1}{n_t}\sum_{i=1}^{n_t}\log\Big(1 - G_{d'}\big(G_f(x_i^t)\big)\Big),$$

where $\theta_{d'}$ denotes the parameters of $G_{d'}$. The saddle point $(\hat\theta_f, \hat\theta_c, \hat\theta_d, \hat\theta_{d'})$ can be solved using the corresponding min-max optimization problem, as in DANN. By training UAN, the distributions of source and target data in the shared label set can be maximally aligned and the category gap can be reduced. In the test phase, a target sample $x_i^t$ is regarded as the unknown class if its $\omega(x_i^t)$ is below the threshold $\omega_0$; otherwise it is classified by its label prediction.
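The instance-level criterion above combines normalized prediction entropy and domain similarity. A small illustrative NumPy sketch (function names and toy values are ours, following the formulas as stated in the text):

```python
import numpy as np

def entropy(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def uan_weights(probs, d_prime, n_source_classes):
    """UAN-style transferability: w_s = H/log|Cs| - d',
    w_t = d' - H/log|Cs|. Larger w_t means the target sample more
    likely belongs to the common label set."""
    h = entropy(probs) / np.log(n_source_classes)  # normalized uncertainty
    return h - d_prime, d_prime - h

# A target sample predicted confidently and judged source-like
# (high d') gets a positive target weight -> likely a shared class.
ws, wt = uan_weights([0.9, 0.05, 0.05], d_prime=0.9, n_source_classes=3)
assert wt > 0
```

By construction the two weights are negatives of each other, which is what makes the weighted adversarial loss push common-class samples from both domains toward alignment.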
3) Applications to IFD: Zhang et al. [149] proposed a selective UDTL method in which class-wise weights were applied to the source domain and instance-wise weights to the target domain, and an outlier identifier was trained to recognize unknown fault modes. Yu et al. [150] proposed a bilateral weighted adversarial network to align the feature distributions of shared-class source and target samples and to disentangle shared-class and outlier-class samples. After model training, an extreme value theory (EVT) model was established on the feature representations of source samples and was further used to detect unknown-class samples in the target domain.

V. MULTI-DOMAIN UDTL
Considering that a single source domain might not be enough for UDTL in real applications, it is also important to consider multi-domain UDTL, which can help learn domain-invariant features. In this paper, two kinds of multi-domain UDTL settings are studied: multi-domain adaptation (using the target data in the training phase) and domain generalization (not using the target data in the training phase).
Because there are multiple source domains in multi-domain UDTL, we first need to redefine some basic symbols. Let $\{\mathcal{D}_{s,n}\}_{n=0}^{n_{sd}-1}$ denote the source domains, where $n_{sd}$ is the number of source domains, and let $\mathcal{D}_t$ denote the target domain. $\mathcal{D}_{s,n}$ is the $n$-th source domain, and $x_{i,n}^s$ and $y_{i,n}^s$ are its $i$-th sample and the corresponding label. Besides, $d_{i,n}^s$ is the domain label of $x_{i,n}^s$, and $d_i^t = n_{sd}$ is the domain label of $x_i^t$.
A. Multi-domain adaptation

1) Basic concepts: Traditional UDTL based on a single source domain cannot make full use of data from multiple source domains, and thus might fail to find private relationships and domain-invariant features. Multi-domain adaptation therefore aims to utilize labeled multi-source domains and the unlabeled target domain to mine these relationships and domain-invariant features.
2) Multi-source unsupervised adversarial domain adaptation (MS-UADA): There are mainly two ways to realize multi-domain adaptation. One is to require features to be domain-invariant [151], i.e., the gap between different domains, including source and target, in the feature space should be as small as possible. The other is to find the source domain most similar to the target domain [152], [153], which requires a distance to measure the similarity among domains. In this paper, we use the method proposed in [151], called multi-source unsupervised adversarial domain adaptation (MS-UADA), to realize multi-domain adaptation; its structure is shown in Fig. 13. The loss of MS-UADA for the domain discriminator $G_d$, which classifies each sample into one of the $n_{sd}+1$ domains, is defined as follows:

$$\mathcal{L}_d(\theta_f,\theta_d) = -\frac{1}{n_s}\sum_{n=0}^{n_{sd}-1}\sum_{i}\log G_d\big(G_f(x_{i,n}^s)\big)_{d_{i,n}^s} - \frac{1}{n_t}\sum_{i}\log G_d\big(G_f(x_i^t)\big)_{d_i^t}.$$

To reduce the gap, the features produced by $G_f$ should confuse $G_d$ so that $G_d$ cannot perform domain classification. Thus, the training process can be seen as a minimax game with total loss

$$\mathcal{L}(\theta_f, \theta_c, \theta_d) = \mathcal{L}_c(\theta_f, \theta_c) - \lambda_{\text{MS-UADA}}\,\mathcal{L}_d(\theta_f, \theta_d),$$

where $\lambda_{\text{MS-UADA}}$ is the trade-off parameter. The way to optimize $\mathcal{L}(\theta_f, \theta_c, \theta_d)$ is consistent with DANN.
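The multi-domain discriminator loss is just a multi-class cross-entropy over domain labels (one label per source domain plus one for the target). An illustrative NumPy sketch with made-up softmax outputs:

```python
import numpy as np

def domain_ce(probs, domain_labels):
    """Multi-class cross-entropy of a domain discriminator that must
    assign each sample to one of n_sd source domains or the target
    domain (label n_sd). probs: (N, n_sd + 1) softmax outputs."""
    probs = np.asarray(probs, dtype=float)
    idx = np.arange(len(domain_labels))
    return float(-np.mean(np.log(probs[idx, domain_labels] + 1e-12)))

# Two source domains (labels 0, 1) and one target domain (label 2).
p = [[0.8, 0.1, 0.1],
     [0.2, 0.7, 0.1],
     [0.1, 0.2, 0.7]]
loss = domain_ce(p, [0, 1, 2])
# The discriminator minimizes this loss; the feature extractor is
# trained (via the GRL) to increase it, confusing the discriminator.
```

The same loss restricted to the source domains only is what IAN uses in the domain generalization setting below.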
3) Applications to IFD: Zhu et al. [154] proposed an adversarial learning strategy in multi-domain adaptation to capture the fault feature representation. Rezaeianjouybari et al. [155] proposed a novel multi-source adaptation framework which realizes alignment at both the feature and task levels. Zhang et al. [156] proposed an adversarial multi-domain adaptation based on a classifier alignment method to capture domain-invariant features from multiple source domains. He et al. [157] proposed a method based on K-means and space transformation in multi-source domains. Wei et al. [158] proposed a multi-source adaptation framework to learn domain-invariant features on the basis of distributional similarities. Zhang et al. [74] proposed an enhanced transfer

B. Domain generalization (DG)
1) Basic concepts: Domain generalization (DG) aims to learn shared knowledge from multiple source domains and generalize it to a target domain which is unseen in the training phase. The biggest difference of DG is that unlabeled samples in the target domain only appear in the test phase. Based on the discussion in [161], the core idea of DG is that learned domain-invariant features should satisfy two properties: 1) features extracted by $G_f$ should be discriminative; 2) features extracted from different source domains should be domain-invariant. More detailed information can be found in [161].
2) Invariant adversarial network (IAN): According to the above description, the performance of DG depends on discriminative and domain-invariant features. Domain-invariant features require diagnosis models to reduce the feature gap among different domains, and as described in the previous section, adversarial training can do so. In this paper, a simple adversarial training method called the invariant adversarial network (IAN) [162], [163], based on DANN, is used in DG to help $G_f$ extract domain-invariant features by aligning the marginal distributions. The structure of IAN is shown in Fig. 13, and the loss of IAN for the domain discriminator $G_d$, which classifies samples among the $n_{sd}$ source domains, is defined as follows:

$$\mathcal{L}_d(\theta_f,\theta_d) = -\frac{1}{n_s}\sum_{n=0}^{n_{sd}-1}\sum_{i}\log G_d\big(G_f(x_{i,n}^s)\big)_{d_{i,n}^s}.$$

$G_f$ should confuse $G_d$, so the total loss of IAN is a minimax game:

$$\mathcal{L}(\theta_f, \theta_c, \theta_d) = \mathcal{L}_c(\theta_f, \theta_c) - \lambda_{\mathrm{IAN}}\,\mathcal{L}_d(\theta_f, \theta_d),$$

where $\lambda_{\mathrm{IAN}}$ is the trade-off parameter. The way to optimize it is consistent with DANN.

3) Applications to IFD: Zheng et al. [164] proposed a DG network based on a priori diagnosis knowledge and preprocessing techniques for IFD. Liao et al. [165] proposed a deep semi-supervised DG network that uses unlabeled and labeled source data via the Earth-mover distance. Li et al. [162] proposed a DG method for IFD combining data augmentation, adversarial training, and a distance-based metric. Yang et al. [166] used a center loss to learn domain-invariant features across various source domains to realize DG. Zhang et al. [167] proposed a conditional adversarial DG method based on a single discriminator for better transfer and lower computational complexity. Han et al. [168] proposed a DG-based hybrid diagnosis network for deployment to unseen working conditions via the triplet loss and adversarial training.

A. Open source Datasets
Open source datasets are very important for the development, comparison, and evaluation of different algorithms. In this comparative study, we mainly test five datasets to verify the performance of different UDTL methods. The detailed descriptions of the five datasets are given as follows:

1) Case Western Reserve University (CWRU) dataset: The CWRU dataset, provided by the Case Western Reserve University Bearing Data Center [169], is one of the most famous open source datasets in IFD and has already been used in numerous published papers. Following other papers, this paper also uses the drive-end bearing fault data, whose sampling frequency is 12 kHz; the ten bearing conditions are listed in Table II. In Table II, one normal bearing (NA) and three fault types, including inner fault (IF), ball fault (BF), and outer fault (OF), are classified into ten categories (one health state and nine fault states) according to different fault sizes.
Besides, as shown in Table III, CWRU consists of four motor loads corresponding to four operating speeds. For the transfer learning tasks, this paper considers these working conditions as different tasks denoted 0, 1, 2, and 3. For example, task 0 → 1 means that the source domain with a motor load of 0 HP transfers to the target domain with a motor load of 1 HP. In total, there are twelve transfer learning tasks.
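The twelve tasks are simply all ordered pairs of the four load conditions, which can be enumerated in one line (an illustrative snippet, not code from the released framework):

```python
# CWRU transfer tasks: every ordered pair of the four motor-load
# conditions (tasks 0-3, i.e., 0-3 HP) forms one source -> target task.
tasks = [(src, tgt) for src in range(4) for tgt in range(4) if src != tgt]

print(tasks[0])    # (0, 1): train on 0 HP, adapt and test on 1 HP
print(len(tasks))  # 12
```

The same enumeration pattern yields the six JNU tasks (three speeds) and the two SEU tasks (two conditions) described below.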
2) Paderborn University (PU) dataset: The PU dataset, acquired from Paderborn University, is a bearing dataset [170], [171] which consists of artificially induced and real damages. The sampling frequency is 64 kHz. By changing the rotating speed of the drive system, the radial force on the test bearing, and the load torque on the drive train, the PU dataset covers four operating conditions, as shown in Table IV. Thirteen bearings with real damages caused by accelerated lifetime tests [170] are used to study transfer learning tasks among different working conditions (twenty experiments were performed for each bearing code, and each experiment lasted four seconds). The categorization information is presented in Table V (the meaning of the contents is explained in [170]). In total, there are twelve transfer learning settings.
3) JiangNan University (JNU) dataset: The JNU dataset is a bearing dataset acquired by Jiangnan University, China. JNU can be downloaded from [172], and readers can refer to [173] for more detailed information. Four kinds of health conditions, including NA, IF, OF, and BF, were tested. Vibration signals were sampled under three rotating speeds (600 rpm, 800 rpm, and 1000 rpm) with a sampling frequency of 50 kHz. The three rotating speeds are considered as different tasks denoted 0, 1, and 2. In total, there are six transfer learning settings.
4) PHM Data Challenge on 2009 (PHM2009) dataset: The PHM2009 dataset is a generic industrial gearbox dataset provided by the PHM Data Challenge competition [174]. The sampling frequency is 200/3 kHz (about 66.7 kHz). Fourteen experiments (eight for spur gears and six for helical gears) were performed. In this paper, we utilize the helical gear data (six conditions) collected from accelerometers mounted on the input shaft retaining plates. PHM2009 contains five rotating speeds and two loads, but only data collected at the first four shaft speeds under the high load are considered. The four rotating speeds of 30 Hz, 35 Hz, 40 Hz, and 45 Hz are considered as different tasks denoted 0, 1, 2, and 3. In total, there are twelve transfer learning settings.

5) Southeast University (SEU) dataset:
The SEU dataset is a gearbox dataset provided by Southeast University, China [33], [175]. It consists of two sub-datasets, a bearing dataset and a gear dataset, both collected from a Drivetrain Dynamics Simulator. Eight channels were collected, and we use the data from channel 2. As shown in Table VI, each sub-dataset consists of five conditions: one health state and four fault states.
Two working conditions, with rotating speed-load configurations of 20 Hz-0 V and 30 Hz-2 V, are considered as different tasks denoted 0 and 1. In total, there are two transfer learning settings.

B. Data preprocessing and splitting
Data preprocessing and splitting are two important aspects for the performance of UDTL-based IFD. Although UDTL-based methods often possess automatic feature learning capabilities, some preprocessing steps can help models achieve better performance, such as the short-time Fourier transform (STFT) in speech signal classification and data normalization in image classification. Besides, there often exist pitfalls in the training phase, especially test leakage, i.e., test samples being inadvertently used in the training phase.
1) Input types: Two input types are tested in this paper: the time domain input and the frequency domain input. For the former, signals are used as the input directly, and the sample length is 1024 without any overlap. For the latter, signals are first transformed into the frequency domain, and the sample length is 512 due to the symmetry of the spectral coefficients.
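The 1024 → 512 relationship follows from the conjugate symmetry of the spectrum of a real signal; a short NumPy sketch (random data stands in for a vibration segment):

```python
import numpy as np

# Time-domain input: a raw segment of 1024 points (random stand-in).
rng = np.random.default_rng(0)
segment = rng.standard_normal(1024)

# Frequency-domain input: FFT magnitude. The spectrum of a real
# signal is conjugate-symmetric, so only the first half
# (512 coefficients) carries distinct information.
spectrum = np.abs(np.fft.fft(segment))[: len(segment) // 2]

assert segment.shape == (1024,)
assert spectrum.shape == (512,)
```

Either array is then fed to the backbone as one training sample.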
2) Normalization: Data normalization is a basic procedure in UDTL-based IFD which keeps input values within a certain range. In this paper, we use Z-score normalization.
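Z-score normalization subtracts the mean and divides by the standard deviation, giving each sample zero mean and unit variance; a minimal per-sample sketch:

```python
import numpy as np

def zscore(x):
    """Z-score normalization: (x - mean) / std, per sample."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

x = np.array([1.0, 2.0, 3.0, 4.0])
z = zscore(x)
# z has (numerically) zero mean and unit standard deviation
```

Whether the statistics are computed per sample or over the whole training set is an implementation choice; the per-sample form above is only one option.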
3) Data splitting: Since this paper does not use a validation set to select the best model, the splitting of a validation set is ignored here. In UDTL-based IFD, data in the target domain are used both in the training procedure (to realize domain alignment) and as the test set. In fact, the data in these two roles should not overlap; otherwise test leakage occurs. Therefore, as shown in Fig. 15, we take 80% of the total samples as the training set and 20% as the test set in both the source and target domains to avoid this leakage.
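The leakage-free split can be sketched as follows: the target samples used for alignment during training are disjoint by construction from those used for testing (function name and seed are ours):

```python
import numpy as np

def split_80_20(samples, seed=0):
    """80/20 train/test index split, applied separately to the source
    and target domains, so target samples used for domain alignment
    during training never appear in the test set (no test leakage)."""
    idx = np.random.default_rng(seed).permutation(len(samples))
    cut = int(0.8 * len(samples))
    return idx[:cut], idx[cut:]

target = np.arange(100)  # stand-in for 100 target-domain samples
train_idx, test_idx = split_80_20(target)

assert set(train_idx).isdisjoint(test_idx)
assert len(train_idx) == 80 and len(test_idx) == 20
```

The disjointness assertion is exactly the property that rules out the test-leakage pitfall described above.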

VII. COMPARATIVE STUDIES
We will discuss the evaluation results, which are shown in Appendix A. To make the accuracy more readable, we use some visualization methods to present the results.

A. Training details
We implement all UDTL-based IFD methods in PyTorch and put them into a unified code framework. Each model is trained for 300 epochs, and model training and testing alternate during the training procedure. We adopt the mini-batch Adam optimizer, and the batch size is 64.

B. Label-consistent UDTL
For MK-MMD, JMMD, CORAL, DANN, and CDAN, we train models with source samples for the first 50 epochs to obtain a so-called pre-trained model, and then the transfer learning strategies are activated. For AdaBN, we update the statistics of BN layers with each batch for 3 extra epochs.
1) Evaluation metrics: For simplicity, we use the overall accuracy, i.e., the number of correctly classified samples divided by the total number of samples in the test data, to verify the performance of different models. To account for randomness, we perform each experiment five times, and the mean as well as maximum values of the overall accuracy are used to evaluate the final performance, because the variance over five experiments is not statistically meaningful. In this paper, we report the mean and maximum accuracy in the last epoch, denoted Last-Mean and Last-Max, to represent the test accuracy without any test leakage. Meanwhile, we also list the mean and maximum accuracy, denoted Best-Mean and Best-Max, in the epoch where models achieve their best performance.
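These metrics reduce to a few NumPy operations; a small sketch with made-up run accuracies (the toy numbers are ours):

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Correctly classified samples / total samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Five repeated runs: "Last" accuracies taken from the final epoch.
last_acc = [0.95, 0.93, 0.96, 0.94, 0.92]
last_mean = float(np.mean(last_acc))  # Last-Mean
last_max = float(np.max(last_acc))    # Last-Max

acc = overall_accuracy([0, 1, 2, 1], [0, 1, 1, 1])  # 3 of 4 correct
```

Best-Mean and Best-Max are computed the same way, but from the epoch with the highest accuracy per run rather than the final epoch.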
2) Results of datasets: To make comparisons clearer, we summarize the highest average accuracy of different datasets among all methods; the results are shown in Fig. 16. We can observe that CWRU and JNU achieve accuracies over 95%, while the other datasets only achieve accuracies of around 60%. It is also worth mentioning that these accuracies are only lower bounds, since it is very hard to fine-tune every parameter in detail.
3) Results of models: Results of different methods are shown in Fig. 17 to Fig. 21; Fig. 21 is not a radar chart because two transfer tasks are not suitable for this visualization. For all datasets, the methods discussed in this paper improve the accuracy over Basis, except CORAL, which only improves the accuracy on CWRU with the frequency domain input or on some individual transfer tasks. For AdaBN, the improvement is much smaller than for the other methods.
In general, the results of JMMD are better than those of MK-MMD, which indicates that the joint-distribution assumption for source and target domains helps improve performance. The results of DANN and CDAN are generally better than those of MK-MMD, which indicates that adversarial training is helpful for reducing the domain shift.
4) Results of input types: Accuracy comparisons of the two input types are shown in Fig. 22. The time domain input achieves better accuracy on CWRU, JNU, and SEU, while the frequency domain input performs better on PU and PHM2009. Besides, the accuracy gap between the two input types is relatively large, and we cannot simply infer which one is better due to the influence of backbones. Thus, for a new dataset, we should test both input types instead of applying more advanced techniques to a single input type, because switching the input type might improve the accuracy more efficiently than using advanced techniques.
5) Results of accuracy types: As mentioned in Section VII, we use four kinds of accuracy, Best-Mean, Best-Max, Last-Mean, and Last-Max, to evaluate performance. As shown in Fig. 23, the fluctuation across different experiments is sometimes large, especially for those datasets whose overall accuracy is not very high, which indicates that the tested algorithms are not very stable and robust. Besides, the fluctuation of the time domain input appears smaller than that of the frequency domain input; the reason might be that the backbone used in this paper is more suitable for the time domain input.
As shown in Fig. 24, the fluctuation across different experiments is also large, which is dangerous for evaluating the true performance. Since Best uses the test set to choose the best model (a kind of test leakage), Last may be more suitable for representing the generalization accuracy.
Thus, on the one hand, the stability and robustness of UDTL-based IFD need more attention, rather than merely improving the accuracy. On the other hand, as analyzed above, the accuracy of the last epoch (Last) is more suitable for representing the generalization ability of algorithms when the fluctuation between Best and Last is large.

C. Label-inconsistent UDTL
In these methods, the transfer learning strategies are activated from the beginning. For UAN, the trade-off parameter of the loss of the non-adversarial domain discriminator is fixed at 1. The value τ of OSBP and the threshold ω_0 of UAN are both set to 0.5 for all tasks.
1) Evaluation metrics: For partial-based transfer learning, the evaluation metrics are the same as those of label-consistent UDTL, including Last-Mean, Last-Max, Best-Mean, and Best-Max. For open set and universal transfer learning, the overall accuracy alone is not sufficient for evaluating model performance due to the existence of unknown classes. To clearly explain the evaluation metrics, several mathematical notations are defined; for instance, M_S and M_U denote the numbers of correctly classified known-class and unknown-class samples, respectively.

2) Dataset settings: CWRU is selected for testing the performance. Following recent work in [150], different classes are randomly selected to form transfer learning tasks to validate the effectiveness of models on different label sets. The fault diagnosis tasks for partial, open set, and universal transfer learning are presented in Table VII, respectively.
3) Results of partial UDTL: For simplicity, as shown in Fig. 25, we only list the Best-Mean and Last-Mean of PADA with the time domain input, given the similarity between the time and frequency domain inputs. We can observe that PADA achieves good performance on most tasks over the overall training phase. But for tasks 3-1, 2-3, and 3-2, Last-Mean is clearly lower than Best-Mean, indicating that the negative transfer caused by extra source labels cannot be fully addressed by PADA and that an overfitting problem exists during the training procedure.
Similar to the results of OSBP, the performance of UAN also degrades at the later stage because of the overfitting problem and wrong feature alignment. In addition, the shared-class classification accuracy still needs to be improved. It remains difficult for the model to separate extra source classes and detect unknown classes in the target domain.
6) Results of the multi-criterion evaluation metric: Since we have five evaluation metrics for open set UDTL and universal UDTL, it is better to combine them into a final score for a clearer understanding of the results. Thus, we use the Technique for Order Preference by Similarity to an Ideal Solution (TOPSIS), a well-known multi-criterion evaluation method, as the final score; TOPSIS has also been widely applied in the field of fault diagnosis [178]-[180]. In this paper, we use ALL*, UNK, OS, ALL, and H-score to calculate TOPSIS for the multi-criterion evaluation. For the sake of simplicity, the weight of every index is set to 0.25 in TOPSIS. As shown in Fig. 28, we can observe the TOPSIS comparisons of OSBP and UAN for different transfer learning tasks. Unsurprisingly, the evaluation using TOPSIS is similar to the individual metrics in Fig. 26 and Fig. 27. The overall performance also degrades at the later stage due to the overfitting problem and wrong feature alignment, the shared-class classification accuracy still has relatively large room for improvement, and separating extra source classes and detecting unknown classes in the target domain are still not well solved.
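For readers unfamiliar with TOPSIS, the following is a generic benefit-criteria sketch in NumPy (vector normalization, weighting, and closeness to the ideal solution); it is an illustration of the method, not the authors' exact implementation, and the toy scores are ours:

```python
import numpy as np

def topsis(matrix, weights):
    """TOPSIS score: normalize the decision matrix, weight it, and
    rank alternatives by relative closeness to the ideal (best)
    solution versus the anti-ideal (worst) one. All criteria are
    treated as benefit criteria (higher is better), as with the
    accuracy-style metrics used here."""
    m = np.asarray(matrix, dtype=float)
    norm = m / np.sqrt((m ** 2).sum(axis=0))   # vector normalization
    v = norm * np.asarray(weights, dtype=float)
    ideal, anti = v.max(axis=0), v.min(axis=0)
    d_best = np.sqrt(((v - ideal) ** 2).sum(axis=1))
    d_worst = np.sqrt(((v - anti) ** 2).sum(axis=1))
    return d_worst / (d_best + d_worst)         # closeness in [0, 1]

# Two models scored on three toy metrics with equal weights.
scores = topsis([[0.9, 0.8, 0.7],
                 [0.6, 0.5, 0.4]], [1 / 3] * 3)
# The first row dominates every criterion and gets the higher score.
```

A model that dominates on every metric receives a closeness of 1, and one dominated on every metric receives 0, which is what makes the single TOPSIS score a convenient summary of the five metrics.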

D. Multi-domain UDTL
For multi-domain UDTL, the evaluation metrics are the same as those of label-consistent UDTL, including Last-Mean, Last-Max, Best-Mean, and Best-Max.

1) Dataset settings: Similar to label-inconsistent UDTL, CWRU is selected to test the performance of multi-domain UDTL, including MS-UADA and IAN. The input types consist of the time and frequency domain inputs. The fault diagnosis tasks for multi-domain UDTL are listed in Table VIII. For example, 123-0 T means that tasks 1, 2, and 3 (shown in Table III) are used as the multiple source domains, task 0 is used as the target domain, and the time domain input is used as the model input. It should be mentioned that we do not use the target data in the training phase when testing the performance of DG.
2) Results of multi-domain adaptation: As shown in Fig. 29 (a) and (b), we can observe that MS-UADA always improves the accuracy on CWRU compared with Basis, which directly transfers the model trained with multiple source domains to the target domain. The performance of the time domain input is slightly better than that of the frequency domain input, but the overall difference is very small. 3) Results of DG: As shown in Fig. 29 (a) and (b), we can observe that the performance of IAN on CWRU is similar to that of Basis in most tasks. However, for the task 012-3 F, the accuracy of IAN decreases greatly. The main reason might be that IAN only uses multiple sources to find domain-invariant features, which is not sufficient for the unseen target domain. Thus, more sophisticated DG methods should be designed to extract discriminative and domain-invariant features.

A. Transferability of features
The reason why DL models embedded with transfer learning methods can achieve breakthrough performance in computer vision is that many studies have shown that DL models learn more transferable features for these tasks than traditional hand-crafted features [181], [182]. Despite this ability to learn general and transferable features, DL models also exhibit a transition from general to specific features, and their transferability drops significantly in the last layers [182]. Therefore, fine-tuning DL models or adding various transfer learning strategies to the training process needs to be investigated to realize valid transfer.
However, for IFD, there is no research on how transferable the features in DL models are, and answering this question is the most important cornerstone of UDTL-based IFD. Since the aim of this paper is to provide a comparative study and release a code library, we simply assume that the bottleneck layer is the task-specific layer and that its output features are constrained by various transfer learning strategies. Thus, it is imperative for scholars to study the transferability of features and answer the question of how transferable features are learned. To make the transferability of features more interpretable, we suggest that scholars visualize neurons to analyze the learned features using existing visualization algorithms [183], [184].

B. Influence of backbones and bottleneck
In the field of computer vision, many strong CNN models (also called backbones), such as VGG [24] and ResNet [25], can be reused directly without worrying about model selection. Scholars often use the same backbones to test the performance of proposed algorithms and can thus pay more attention to constructing specific algorithms to align the source and target domains.
However, the backbones of published UDTL-based IFD methods are often different, which makes results hard to compare directly, and the influence of different backbones has never been studied thoroughly. Nevertheless, backbones of UDTL-based algorithms do have a huge impact on the results, as shown by comparing CWRU with the frequency domain input against "Table II" in [111] (the main difference being the backbones used in this paper and in [111]). We can observe that the accuracy related to task 3 in CWRU with the frequency domain input is much worse than that in "Table II" of [111]. However, the backbone used in this paper can achieve excellent results with the time domain input, and some accuracies are even higher than those in [111].
To make a stronger statement, we also use the well-known backbone ResNet18 (we modify the structure of ResNet18 to adapt to one-dimensional input) to test the SEU and PHM2009 datasets, which confirms the huge impact of backbones. From the comparisons on PHM2009 shown in Fig. 30, ResNet18 improves the accuracy of each algorithm significantly. Besides, from the comparisons on SEU shown in Fig. 31, ResNet18 with the time domain input actually reduces the accuracy, while, on the contrary, ResNet18 with the frequency domain input improves the accuracy significantly. In summary, different backbones behave differently with different datasets and input types.
Therefore, finding a strong and suitable backbone, which can learn more transferable features for IFD, is also very important for UDTL-based methods (sometimes choosing a more effective backbone is even more important than using a more advanced algorithm). We suggest that scholars first find a strong backbone and then use the same backbone to compare results to avoid unfair comparisons.
In the above comparison, we discussed the influence of backbones. However, in our designed structure, the bottleneck layer in the source domain also shares parameters with that in the target domain. Thus, it is necessary to discuss the influence of the bottleneck layer during the transfer learning procedure. For the sake of simplicity, we only use CWRU with two different inputs to test two representative UDTL methods.

We use Type I to represent the original models in this paper, Type II to represent models without the bottleneck layer, and Type III to represent models whose backbone parameters are fixed (pretrained with the source data) when transfer learning starts (only the parameters of the bottleneck layer are updated during the transfer learning procedure). The comparison results are shown in Fig. 32. We can observe that for the time domain input, the results are almost the same with and without the bottleneck layer. Likewise, for the frequency domain input, it is also difficult to judge which one is better. Thus, choosing a suitable network (according to the dataset, transfer learning method, input type, etc.), which can learn more transferable features, is very important for UDTL-based methods. In addition, it is clear that when the parameters of the backbone are fixed during the transfer learning procedure, the accuracy in the target domain decreases dramatically, which means that backbones trained with the source data cannot be transferred directly to the target domain.
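Conceptually, the three configurations differ only in which parameter groups receive gradient updates during transfer. A toy sketch in plain Python (the group names and dictionary representation are illustrative, not the structure of the released framework):

```python
def trainable_params(model, variant):
    """Select which parameter groups are updated during transfer learning.

    model   : dict mapping group name -> list of parameters
    variant : 'I'   -> backbone and bottleneck both updated (original model)
              'II'  -> model has no bottleneck layer; backbone is updated
              'III' -> backbone frozen (pretrained with the source data);
                       only the bottleneck layer is updated
    """
    if variant == 'I':
        groups = ['backbone', 'bottleneck']
    elif variant == 'II':
        groups = ['backbone']      # the bottleneck layer is absent
    elif variant == 'III':
        groups = ['bottleneck']    # backbone parameters stay fixed
    else:
        raise ValueError(variant)
    return [p for g in groups if g in model for p in model[g]]

model = {'backbone': ['conv1.w', 'conv2.w'], 'bottleneck': ['fc.w']}
print(trainable_params(model, 'III'))  # only the bottleneck parameters
```

In a real framework the same idea is usually expressed by disabling gradients on the frozen groups (e.g., excluding them from the optimizer).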

C. Negative transfer
As discussed in Section IV, there are mainly four kinds of scenarios for UDTL-based IFD, but all experiments with the five datasets concern transfer between different working conditions. To show that such scenarios do not always produce positive transfer, we use the PU dataset to design another transfer task considering the transfer between different methods of generating damage. Each task consists of three health conditions, and detailed information is listed in Table IX. There are two transfer learning settings in total.
The transfer results are shown in Fig. 33 and in Appendix A under PU-Types. We can observe that each method suffers negative transfer with the time or frequency domain inputs, which indicates that this constructed task may not be suitable for transfer learning. Actually, some published papers also design transfer learning tasks that transfer gear samples to bearing samples (which may not be a reliable transfer task) or transfer experimental data to real data (if the structures of the two machines are different, this also may not be a reliable transfer task). Thus, it is very important to first figure out whether a task is suitable for transfer learning and whether the two domains do share features.
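The accuracy biases reported in Fig. 33 amount to per-task differences against the Basis baseline, with negative values flagging negative transfer. A minimal sketch (task names and accuracies are hypothetical, not measured results):

```python
def transfer_bias(method_acc, basis_acc):
    """Per-task accuracy bias of a UDTL method w.r.t. the Basis baseline.

    A negative bias flags negative transfer: the transfer learning
    strategy performs worse on the target task than direct transfer.
    """
    return {task: method_acc[task] - basis_acc[task] for task in basis_acc}

# Hypothetical per-task target accuracies.
basis = {'A-B': 0.95, 'B-A': 0.90}
method = {'A-B': 0.97, 'B-A': 0.85}
bias = transfer_bias(method, basis)
negative_tasks = [t for t, b in bias.items() if b < 0]  # ['B-A'] here
```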

D. Physical priors
In the fields of computer vision and natural language processing, new transfer learning methods often use existing knowledge or laws to provide a meaningful explanation, such as the attention mechanism [185] and multi-modal structures [107].
However, for UDTL-based IFD, many scholars only introduce methods that already exist in other fields to perform IFD tasks and pay less attention to the prior knowledge behind the data (i.e., they make little use of special phenomena or rules of the physical system). Therefore, we suggest that scholars learn from the core ideas in the field of transfer learning (not just use the existing methods) and introduce prior knowledge of physical systems into the proposed methods to construct more targeted and suitable diagnostic models with higher recognition rates in industrial applications.

E. Label-inconsistent transfer
Recently, some scholars have considered the label-inconsistent scenario and proposed specific methods that allow the model to adapt to this situation (detailed references can be found in the above review). However, as discussed in the comparative results of label-inconsistent transfer, the selected methods often face the risk of overfitting. That is, although the best average accuracy is acceptable, the last average accuracy often drops sharply. The main reason might be that the models cannot focus on shared classes effectively, leading to poor domain alignment.
Hence, more attention should be paid to the label-inconsistent scenario to realize effective separation of extra source classes and detection of unknown classes in the target domain. A possible solution is to combine other valid open set recognition algorithms for better unknown-class detection [150], [186]. For example, an EVT model using deep features of source samples was applied to detect unknown-class samples [150].

F. Multi-domain transfer
Most published papers are based on a single source domain, but in real applications, the labeled data might come from multiple source domains. These domains often follow different distributions, but shared or related features exist among them. A common step is to align the shared features via multi-domain adaptation or DG. However, how to balance the contributions of multiple source domains is still not well solved. For example, in our comparative analysis, we simply assume that each domain contributes equally to transfer learning. Thus, suitable weights should be carefully designed and added into the process of multi-domain transfer.
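The equal-contribution assumption corresponds to a uniform weighting of the per-source losses; replacing it with designed or learned weights only changes the weight vector. A minimal sketch (the function name and values are illustrative):

```python
def combined_transfer_loss(source_losses, weights=None):
    """Weighted combination of per-source-domain transfer losses.

    source_losses : list of scalar losses, one per source domain
    weights       : contribution of each source domain; uniform if
                    omitted (the equal-contribution setting used in
                    our comparative analysis)
    """
    if weights is None:
        weights = [1.0 / len(source_losses)] * len(source_losses)
    assert abs(sum(weights) - 1.0) < 1e-8, "weights should sum to 1"
    return sum(w * l for w, l in zip(weights, source_losses))

# Uniform weights reduce to the plain average of the source losses;
# a designed weight vector emphasizes the more relevant sources.
uniform = combined_transfer_loss([0.9, 0.6, 0.3])
weighted = combined_transfer_loss([0.9, 0.6, 0.3], [0.5, 0.3, 0.2])
```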
Additionally, to make better use of data in unlabeled source domains, semi-supervised multi-domain learning [165] might also be worth focusing on. To further improve the accuracy, minimizing the gap between conditional distributions might be an effective way to align shared features [163], [167].

G. Other aspects
Although a large amount of data can be collected in different conditions, fault data in some conditions are still scarce. Since most machines operate in the normal condition, the class-imbalanced problem naturally exists in real applications. Thus, imbalanced learning or few-shot learning combined with transfer learning methods [187] might also be an important direction for getting the constructed algorithms off the ground.
Federated transfer learning (FTL) [188] provides a safer and more reliable approach for specific industries. Owing to the characteristics of transfer learning, FTL participants can keep their own feature spaces without requiring all participants to own or use the same feature data, which makes FTL suitable for more application scenarios. FTL has been initially applied to IFD [189], and more in-depth research is required.
Uncertainty quantification plays a critical role in assessing the safety of DL models during the construction, optimization, and decision-making procedures. Bayesian networks [190] and ensemble learning techniques [191] are two widely used uncertainty quantification methods, and their effectiveness has been verified in different kinds of applications, such as bioinformatics and self-driving cars. Thus, uncertainty can be used as an auxiliary term to further correct inappropriate predictions or results during transfer learning. For example, the prediction uncertainty can be explicitly estimated during training to rectify pseudo-label learning for UDTL of semantic segmentation [192].
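As a concrete instance of ensemble-based uncertainty quantification, the predictive entropy of the averaged member probabilities can flag low-confidence predictions that may deserve correction. A minimal sketch (the two-member, two-class example is illustrative):

```python
import math

def ensemble_uncertainty(member_probs):
    """Average the class probabilities of the ensemble members and
    return the predictive entropy of the mean (higher = less certain).

    member_probs : list of per-member probability vectors
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(p[j] for p in member_probs) / n for j in range(k)]
    entropy = -sum(p * math.log(p) for p in mean if p > 0)
    return mean, entropy

# Agreeing members give low entropy; disagreeing members give high
# entropy, so their predictions can be down-weighted or re-examined.
_, low = ensemble_uncertainty([[0.9, 0.1], [0.9, 0.1]])
_, high = ensemble_uncertainty([[0.9, 0.1], [0.1, 0.9]])
```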

IX. CONCLUSION
In this paper, we construct a new taxonomy and perform a comprehensive review of UDTL-based IFD according to different tasks of UDTL. Five publicly available datasets are gathered to perform a comparative analysis of different UDTL-based IFD methods from several perspectives. Based on the systematic comparative study, we conclude with some results that might be helpful for further research. First, the accuracy on CWRU and JNU is higher than 95%. Second, the results of different methods indicate that the assumption of joint distributions and adversarial training are two helpful techniques for improving the accuracy. Third, different input types often behave differently on each dataset, and choosing a suitable input type might also be important to improve the accuracy. Finally, the stability and robustness of UDTL-based IFD need to be taken seriously. To sum up, it might be useful for scholars to consider these results before developing new models. Also, we release the code library at https://github.com/ZhaoZhibin/UDTL and try to give a basic performance of current algorithms, aiming to find the core factors that determine transfer performance and to guide future research.

B. Open set UDTL
1) Basic concepts: Considering that the label space of the target domain is uncertain for UDTL, Saito et al. proposed open set domain adaptation (OSDA), in which the target domain can contain samples of classes that are absent in the source domain [144], i.e., $\mathcal{C}_s \subset \mathcal{C}_t$. The goal of OSDA is to correctly classify known-class target samples and recognize unknown-class target samples as an additional class. 2) Open set back-propagation (OSBP): Saito et al. [144] proposed an adversarial-based UDTL method, named OSBP, which aims to construct a pseudo decision boundary for the unknown class. The model of OSBP is composed of a feature extractor $G_f$ and a $(C+1)$-class classifier $G_c$, where $C$ denotes the number of source classes. The outputs of $G_c$ are fed into Softmax to obtain the class probabilities. The probability of $x$ being classified into class $c$ is defined as $p_c^t = \exp(G_c(G_f(x))) / \sum_{k=1}^{C+1} \exp(G_k(G_f(x)))$. The $1 \sim C$ and $C+1$ dimensions indicate the probabilities of the known and unknown classes, respectively.
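The probability above is an ordinary softmax over the $C+1$ logits, with the last dimension used to reject a sample as unknown. A minimal, numerically stable sketch (the 0.5 rejection threshold is an illustrative choice; OSBP itself trains this boundary adversarially):

```python
import math

def openset_predict(logits, threshold=0.5):
    """Softmax over C+1 logits; reject the sample as 'unknown' when the
    last (unknown-class) probability exceeds the threshold.

    logits : C known-class scores followed by one unknown-class score
    """
    m = max(logits)                          # stabilize the exponentials
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    if probs[-1] > threshold:
        return 'unknown', probs
    known = probs[:-1]                       # argmax over known classes
    return known.index(max(known)), probs

# A large unknown-class logit triggers rejection.
label, _ = openset_predict([0.0, 0.0, 10.0])  # -> 'unknown'
```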

Fig. 16. The highest average accuracy of different datasets among all methods.

Fig. 17. The accuracy comparisons of different methods in CWRU.

Fig. 18. The accuracy comparisons of different methods in PU.

Fig. 21. The accuracy comparisons of different methods in SEU.

4) Results of open set UDTL: The Best-Mean and Last-Mean accuracies of OSBP with the time domain input are shown in Fig. 26. From Fig. 26 (a), it can be seen that OSBP achieves relatively good performance on most transfer tasks. However, as shown in Fig. 26 (b), the performance obviously degrades at the later stage, especially for UNK, which reveals that the model overfits the source samples and thus degrades unknown-class detection.

Fig. 22. The accuracy comparisons of two input types with different datasets. (F) means the frequency domain input, and (T) means the time domain input.

Fig. 23. The difference between Max and Mean according to Best average.

Fig. 24. The difference between Best average and Last average according to Mean.
Fig. 25.

Fig. 26. The overall accuracy of OSBP with the time domain input: (a) Best-Mean and (b) Last-Mean.

Fig. 27. The overall accuracy of UAN with the time domain input: (a) Best-Mean and (b) Last-Mean.
Fig. 28.

Fig. 29. The overall accuracy of multi-domain UDTL (T and F mean the time domain and the frequency domain inputs, respectively): (a) Best-Mean and (b) Last-Mean.

Fig. 32. Comparisons of three conditions related to the bottleneck layer.

Fig. 33. The accuracy biases of the five methods corresponding to Basis. (F) means the frequency domain input, and (T) means the time domain input.

TABLE II. THE DESCRIPTION OF CLASS LABELS OF CWRU.

TABLE IV. THE TRANSFER LEARNING TASKS AND OPERATING PARAMETERS OF PU.

TABLE V. THE INFORMATION OF BEARINGS WITH REAL DAMAGES.

TABLE VII. THE FAULT DIAGNOSIS TASKS OF CWRU.

TABLE IX. THE INFORMATION OF BEARINGS WITH ARTIFICIAL DAMAGES.
* Best average: the average of Best; Last average: the average of Last; Best bias: the Best bias between Basis and other methods; Last bias: the Last bias between Basis and other methods. The maximum Mean value of each row is bolded. (This footnote accompanies the result tables for JNU with the frequency domain input; CWRU with the time and frequency domain inputs for PADA; CWRU with the time domain input for UAN; CWRU with the frequency domain input for UAN; CWRU with the time and frequency domain inputs for multi-domain transfer; and CWRU with the frequency domain input.)