Partial Transfer Learning Method Based on Inter-Class Feature Transfer for Rolling Bearing Fault Diagnosis

Rolling bearing fault diagnosis methods based on transfer learning typically assume that the sample classes in the target domain are consistent with those in the source domain during the training phase. However, it is difficult to collect all fault classes in the early stage of mechanical application. The more likely situation is that the training data in the target domain contain only a subset of the health states, which leads to label imbalance relative to the source domain. The outlier classes in the source domain, which have no corresponding target domain samples for feature alignment, interfere with the feature transfer of the other classes. To address this challenge, this study introduces an inter-class feature transfer fault diagnosis approach. By leveraging label information, the method computes the distribution discrepancies only among shared classes, thereby avoiding the harmful influence of outlier classes on the transfer procedure. Experiments on two rolling bearing datasets, covering multiple partial transfer tasks, show that the proposed method outperforms other approaches, offering an effective solution for intelligent bearing fault diagnosis.


Introduction
Rolling bearings are an indispensable component of modern machinery, playing a crucial role in mechanical transmission systems by reducing friction, supporting rotating parts, and ensuring precise operation of equipment. The efficient functioning of rolling bearings is vital to the performance of the entire mechanical system, with their health status directly affecting the stability, reliability, and overall production efficiency of the equipment. A rolling bearing failure can not only shut down or slow the production line, impacting the production schedule and economic benefits of the enterprise, but also damage other components through knock-on effects. In some cases, it may even trigger serious safety accidents. Therefore, ensuring the healthy operation of rolling bearings and timely detection of potential anomalies and faults is crucial for avoiding these risks.
To effectively monitor and diagnose the health status of rolling bearings, researchers and enterprises have invested substantial resources in developing intelligent monitoring technologies. The rapid development of artificial intelligence technologies, especially breakthroughs in the field of machine learning, has opened new possibilities for intelligent fault diagnosis of rolling bearings. Machine learning methods identify patterns and regularities in data by training algorithmic models; they can learn fault characteristics from a vast amount of historical data and predict future performance changes [1][2][3][4][5].
Widely used machine learning methods include decision trees [6], random forests [7], support vector machines [8], artificial neural networks [9], and deep learning. In particular, deep learning can automatically extract complex features and process high-dimensional and unstructured data, such as raw vibration signals, by constructing deep neural networks. This technology eliminates the need for manual feature design, enabling it to learn useful information directly from raw data, greatly improving the accuracy and efficiency of fault detection. In addition, deep learning models have demonstrated exceptional capabilities in classification problems, accurately distinguishing between normal states and various types of fault conditions, which helps maintenance personnel rapidly locate faults and implement appropriate repair measures. Algorithms frequently used in the field of deep learning include convolutional neural networks (CNN), deep belief networks [10], deep autoencoders [11], and long short-term memory networks [12], among others. Li et al. [13] researched a rolling bearing fault diagnosis algorithm using derivatives and enhanced discrete analytical wavelets. They obtained a morphological spectrum of the fault signal through multi-scale morphological opening operations. The morphological spectrum entropy was calculated from the morphological spectrum curve to describe the morphological characteristics of different signals of rolling bearings. The signal's morphological feature decomposition was carried out by approximate analytical complex wavelet transformation, wavelet packet space rearrangement, and cross-combination of wavelet packet spaces. Chen et al. [14] analyzed the lower bound of the two-dimensional data input size by considering the relationship between the characteristic vibration frequency and the short-time Fourier transform window size, guiding the determination of the minimum input size. Then, they designed a universal adaptive CNN structure for different datasets. Hu et al. [15] proposed a method combining multi-scale autoencoders with generative adversarial networks to extract deep sensitive features of signals, which were then combined with a classifier for fault diagnosis. Gu et al. [16] proposed a multi-sensor fault diagnosis model based on long short-term memory and discrete wavelet transform. Initially, detailed fault information at different frequencies and time scales was obtained using wavelet transformation; subsequently, a long short-term memory network characterized the long-term dependencies hidden in the fault information time series. This method fully leveraged the advantages of feature extraction based on expert experience and deep network learning, enabling the discovery of complex patterns from a vast amount of data. Therefore, in the context of Industry 4.0, rolling bearing health monitoring and fault diagnosis technologies based on deep learning have become an essential tool for enhancing the level of equipment intelligence, ensuring production safety, and increasing production efficiency.
Despite this, applying these methods to real-world scenarios still encounters significant challenges. Ideally, the training and test sets should follow the same data distribution; however, in actual work environments, variability and noise often cause distribution discrepancies between datasets, leading to domain shift [17] and severely compromising diagnostic accuracy. To address this issue, deep transfer learning (DTL) [18], a strategy that allows knowledge learned from other domains to be applied to new diagnostic tasks, is garnering increasing attention in the field of bearing fault diagnosis and has become an effective tool for intelligent diagnosis of bearings under variable conditions. Currently, researchers have developed various DTL models [19], which can roughly be categorized into model-based DTL [20], instance-based DTL [21], mapping-based DTL [22], and adversarial-based DTL [23]. Model-based DTL assumes that a portion of the parameters from the model trained using source domain data can be fine-tuned and applied to classification tasks in the target domain. Cao et al. [24] proposed a transfer learning architecture composed of two parts; the first part consists of a pre-trained deep neural network capable of automatically extracting features from inputs, and the second part is a fully connected network that classifies the extracted features using gear fault experimental data. Instance-based DTL involves selecting specific samples from the source domain using a particular weight adjustment strategy, assigning appropriate weights to them, and then integrating these weighted samples into the target domain dataset for model training in the target domain. Chen et al. [25] introduced auxiliary labeled data and target labeled data to form a combined training set and applied heterogeneous distribution weighted random sampling to reorganize the training set, obtaining an optimized combined training set that is "approximately the same distribution" as the test set, achieving accurate classification of bearing faults under different working conditions. Mapping-based DTL projects features from the source and target domains into a shared space to identify and confirm the common latent attributes between the two domains, which then act as a bridge linking different domains. Maximum mean discrepancy (MMD) is a commonly used mapping method. Yang et al. [26] proposed a polynomial kernel-based MMD distance measure to overcome the shortcomings of the widely applied MMD. Adversarial-based DTL encourages the model to recognize transferable features applicable to both the source and target domains through the principle of adversarial learning. Han et al. [27] proposed a new deep adversarial convolutional neural network. Adding a discriminative classifier forces the feature extractor to extract common features shared by both domains, enhancing the model's generalization ability.
However, despite the notable advantages of deep transfer learning in the field of intelligent fault diagnosis, its application still faces significant constraints. These methods often rely on a core assumption: that the source and target data domains should share the same label set. This is frequently difficult to achieve in practice, especially during the initial operation phase of new equipment or under new working conditions, where there is a lack of sufficient representative monitoring data to depict the various possible health states, making it challenging to construct a complete target label space. Consequently, when the label spaces of the source and target domains are not consistent, models based on the shared label assumption encounter difficulties in adapting to the actual working environment.
To address this practical challenge, a partial transfer learning method for bearing fault diagnosis based on inter-class feature transfer is proposed in this paper, which uses label information from the target domain to perform partial domain adaptation. This enables the model to comprehensively cover and monitor all health states under new working conditions, adapting more effectively to the various needs and challenges of real-world environments.
The organizational structure of this paper is as follows. Section 2 describes the theoretical methods of JS divergence and AM-Softmax. The proposed approach is described in detail in Section 3. In Section 4, two datasets are used to verify the effectiveness of the proposed method. Finally, the conclusion is summarized in Section 5.

Preliminaries

Jensen-Shannon Divergence
The Jensen-Shannon divergence (JS) is a method for measuring the similarity between two probability distributions and is based on the Kullback-Leibler divergence (KL). However, unlike KL, it is symmetric and always bounded. Given two probability distributions $P$ and $Q$, the JS is defined as the average of the KL between each distribution and their average $M = \frac{1}{2}(P + Q)$:

$$JS(P \,\|\, Q) = \frac{1}{2} D_{KL}(P \,\|\, M) + \frac{1}{2} D_{KL}(Q \,\|\, M)$$

where $D_{KL}$ represents the KL divergence, and the formula for $D_{KL}(P \,\|\, Q)$ is as follows:

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

Therefore, the specific formula for the JS is as follows:

$$JS(P \,\|\, Q) = \frac{1}{2} \sum_{x} P(x) \log \frac{P(x)}{M(x)} + \frac{1}{2} \sum_{x} Q(x) \log \frac{Q(x)}{M(x)}$$

In summary, the JS was introduced to address the limitations of the KL. It holds significant value in measuring the differences between probability distributions, particularly within the field of machine learning. As a loss function, it aids in optimizing models to better match the target distribution.
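The definition can be checked numerically. The following is a minimal sketch for discrete distributions given as plain Python lists; the function names `kl_divergence` and `js_divergence` are illustrative and are not taken from the paper. It uses the natural logarithm, under which the JS divergence is bounded above by log 2.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute 0 by convention. The caller must ensure
    q_i > 0 wherever p_i > 0, otherwise the divergence is undefined.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    # M is positive wherever P or Q is positive, so both KL terms are finite.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
print(js_divergence(p, q), js_divergence(q, p))            # same value: JS is symmetric
print(js_divergence([1.0, 0.0], [0.0, 1.0]), math.log(2))  # disjoint supports reach the bound
```

Unlike the KL divergence, the JS divergence stays finite even when the two distributions have disjoint supports, which is what makes it usable as a loss term between feature distributions that may initially be far apart.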

Additive Margin Softmax
The traditional Softmax loss function is widely used in classification problems; it is essentially a combination of the Softmax function and the cross-entropy loss function. The expression for this loss function can be written as follows:

$$L_S = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{e^{W_{y_i}^T f_i}}{\sum_{j=1}^{c} e^{W_j^T f_i}}$$

where $n$ is the number of samples, $c$ is the number of classes, $f_i$ denotes the input of the $i$-th sample in the last fully connected layer, $W_j$ represents the $j$-th column of the last fully connected layer, $e^{W_{y_i}^T f_i}$ is the predicted score for the $i$-th sample being the correct class, and $\sum_{j=1}^{c} e^{W_j^T f_i}$ is the predicted score for all classes. The equation below is an expansion of the dot product formula from the above, where $\theta_j$ is the angle between $W_j$ and $f_i$:

$$L_S = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{e^{\|W_{y_i}\| \|f_i\| \cos\theta_{y_i}}}{\sum_{j=1}^{c} e^{\|W_j\| \|f_i\| \cos\theta_j}}$$

A key advantage of the Softmax loss function is its ability to effectively differentiate between different classes. However, it performs poorly when dealing with distances between samples within the same class, sometimes even leading to intra-class distances that exceed inter-class distances. This can significantly affect the alignment of marginal distributions during feature transfer. The Additive Margin Softmax (AM-Softmax) [28] improves upon the standard Softmax by normalizing both the weights and input features and incorporating two hyperparameters. The resulting loss function expression is as follows:

$$L_{AMS} = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{e^{s(\cos\theta_{y_i} - m)}}{e^{s(\cos\theta_{y_i} - m)} + \sum_{j=1, j \neq y_i}^{c} e^{s \cos\theta_j}}$$

where $s$ is the scaling factor, and $m$ is the additive margin that determines the degree of clustering within each class. Figure 1 compares the impact of AM-Softmax with standard Softmax on data distribution in feature space. With traditional Softmax, the decision boundary is at $p_0$, whereas for AM-Softmax, the boundaries of class 1 and class 2 are at $p_1$ and $p_2$, respectively, with a margin area in between ensuring discrimination between the two classes. Using the AM-Softmax loss can significantly enhance the aggregation of fault signals in feature space, thereby increasing the distributional differences between intra-class samples and inter-class samples. This improvement helps facilitate feature transfer for the same class within different distributions.
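To make the role of the margin concrete, here is a minimal single-sample sketch of an AM-Softmax loss in plain Python. It assumes class weights are stored as one vector per class; the default values `s=30.0` and `m=0.35` and the toy vectors are illustrative choices, not the settings used in this paper.

```python
import math

def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(feature, weights):
    """cos(theta_j) between the feature and each class weight vector."""
    f = _normalize(feature)
    return [sum(a * b for a, b in zip(f, _normalize(w))) for w in weights]

def am_softmax_loss(feature, weights, label, s=30.0, m=0.35):
    """AM-Softmax loss for one sample: the margin m is subtracted from the
    target-class cosine only, then all cosines are scaled by s."""
    cos = cosine_logits(feature, weights)
    logits = [s * (c - m) if j == label else s * c for j, c in enumerate(cos)]
    # Numerically stable log-sum-exp.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]

feature = [1.0, 0.2]                              # toy 2-D feature (hypothetical)
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # one weight vector per class
loss_plain = am_softmax_loss(feature, weights, label=0, s=10.0, m=0.0)
loss_margin = am_softmax_loss(feature, weights, label=0, s=10.0, m=0.35)
print(loss_plain, loss_margin)
```

With `m=0.0` the function reduces to a Softmax loss on normalized weights and features. A positive margin raises the loss for the same sample, so the network has to push the feature closer to its class direction before the loss becomes small, which is the compactness effect described above.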

Problem Formulation
In the field of transfer learning, one of the core challenges is dealing with the differences in marginal distributions between the source domain and the target domain. The goal is to enable a model trained using a large amount of labeled data in the source domain to generalize more effectively in the target domain. Traditional transfer learning generally assumes that the source domain and the target domain share the same label space, meaning that the classes contained in the datasets are consistent, so that each class in the source domain can be matched with the corresponding class in the target domain during training. However, in scenarios of partial transfer learning (PTL), unique outlier classes in the source domain may adversely affect the distribution matching between shared classes.
Figure 2 provides an intuitive example. Figure 2a shows a typical scenario of aligning marginal distributions in traditional transfer learning, where each source domain class can find a corresponding target domain class. However, in the situation shown in Figure 2b, "Class 4" and "Class 5" in the source domain are outlier classes that, if left unchecked, may adversely affect the transfer learning process of other classes, which is undesirable. Under the PTL framework, it is often assumed that the source domain dataset is complete, or even over-complete, meaning that it contains a wide range of classes, a large number of samples, and accurate labels, forming an idealized dataset. In contrast, the label space of the target domain is merely a subset of the source domain's label space. Some research in PTL focuses on enhancing the model's classification ability for certain specific classes in new scenarios, which usually occurs when source domain information is so complex that it leads to negative transfer effects, and the target domain lacks labels, making it impossible to manually select appropriate source domain data. Other studies focus on improving the model's classification ability across all classes, aiming to use an informationally complete source domain and an informationally incomplete target domain to enable the model to accurately classify all classes on the target domain distribution. This paper addresses the latter setting.
This research is conducted under the following assumptions:
1. The health status labels in the source domain are sufficient to cover all possible states in the target domain;
2. Each class within the source domain has an ample amount of labeled samples;
3. Each class within the target domain has an ample amount of labeled samples;
4. There are differences between the source and target domains in terms of data distribution and label space.
Given two datasets $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ and $D_t = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}$, representing the source domain and target domain datasets, respectively, $y^s$ and $y^t$ are the labels of the source domain and target domain, respectively, while $n_s$ and $n_t$ represent the sample sizes of the source domain and target domain. $P_s(x, y)$ and $P_t(x, y)$ are the data distributions of the source domain and target domain, with $P_s(x, y) \neq P_t(x, y)$. $C_s$ and $C_t$ are the sets of classes for the source domain and target domain, with $C_t \subseteq C_s$.
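The relation between the two label spaces can be illustrated with a short sketch. It uses the class indices from the experimental setup described later in the paper (10 source classes, with classes 0, 3, 7, and 9 labeled in the target training data); the function name `split_label_space` is hypothetical.

```python
def split_label_space(source_classes, target_classes):
    """Split the source label space C_s into shared classes (also in C_t)
    and outlier classes (present only in the source domain)."""
    c_s, c_t = set(source_classes), set(target_classes)
    if not c_t <= c_s:
        raise ValueError("partial transfer assumes the target classes are a subset of the source classes")
    return sorted(c_s & c_t), sorted(c_s - c_t)

shared, outlier = split_label_space(range(10), {0, 3, 7, 9})
print("shared:", shared)    # classes that can be aligned across domains
print("outlier:", outlier)  # source-only classes with no target samples in training
```

Only the shared classes have samples in both domains during training, so only they can take part in feature alignment; the outlier classes must still be classified correctly at test time.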

Inter-Class Feature Transfer Module
During the training process, since the sample labels of the target domain are available, the outlier classes can be identified directly. The inter-class feature transfer module (ICFT) fully utilizes the label information from both domains to mitigate the potential negative effects of outlier classes from the source domain during the transfer process. ICFT utilizes the Jensen-Shannon divergence to describe the degree of similarity between pairs of classes from the two domains. The ICFT loss $L_{ic}$ is defined over the shared classes: $k$ refers to the number of classes shared between the source and target domains, $n_i$ represents the total number of samples for a specific class (i.e., class $i$), and $X_i^s$ and $X_i^t$ represent the sets of samples from class $i$ in the source and target domains, respectively. When calculating the ICFT loss, only the Jensen-Shannon divergence between the same classes in both domains is considered, ignoring the relationships between different classes.
The main purpose of this approach is to prevent confusion between features of different classes during the feature transfer process. In practice, to capture the similarity between two classes more accurately, the sample pairs from both domains are randomly shuffled each time the ICFT loss is calculated. The logical structure of ICFT is shown in Figure 3.
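The class-wise computation can be sketched as follows. This is only an illustration under stated assumptions, not the paper's exact loss: it assumes each feature vector is turned into a probability vector with a softmax, that shuffled source and target samples of the same class are paired one to one, and that the JS divergence is averaged first over pairs and then over shared classes. All function names are hypothetical.

```python
import math
import random

def softmax(v):
    """Turn a feature vector into a probability vector (an assumption of this sketch)."""
    top = max(v)
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    kl = lambda u, w: sum(a * math.log(a / b) for a, b in zip(u, w) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def icft_loss(src_feats, src_labels, tgt_feats, tgt_labels, rng=random):
    """Average same-class JS divergence over the classes shared by both domains.

    Source samples of outlier classes never enter the sum.
    """
    shared = sorted(set(src_labels) & set(tgt_labels))
    if not shared:
        return 0.0
    total = 0.0
    for c in shared:
        xs = [softmax(f) for f, y in zip(src_feats, src_labels) if y == c]
        xt = [softmax(f) for f, y in zip(tgt_feats, tgt_labels) if y == c]
        rng.shuffle(xs)  # re-pair samples randomly on every call
        rng.shuffle(xt)
        pairs = list(zip(xs, xt))  # truncates to the smaller class size
        total += sum(js_divergence(p, q) for p, q in pairs) / len(pairs)
    return total / len(shared)

src_f = [[1.0, 0.0], [1.0, 0.0], [5.0, -5.0]]  # last sample belongs to an outlier class
src_y = [0, 0, 1]
tgt_f = [[0.0, 1.0], [0.0, 1.0]]
tgt_y = [0, 0]
with_outlier = icft_loss(src_f, src_y, tgt_f, tgt_y, random.Random(0))
without_outlier = icft_loss(src_f[:2], src_y[:2], tgt_f, tgt_y, random.Random(0))
print(with_outlier, without_outlier)
```

In this toy case the two values coincide, because the source sample of class 1 has no target counterpart and is skipped, which is the behavior the module is designed to guarantee.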

General Framework
The overall framework of the method is shown in Figure 4. The original rolling bearing vibration signals are first processed through a fast Fourier transform and converted into two-dimensional grayscale images, which serve as the input data for the model. The feature extractor within the model utilizes a CNN to extract features. Subsequently, the feature extractor derives high-dimensional feature vectors from samples of both the source and target domains, which are then passed to the inter-class feature transfer module and the classifier, respectively. When addressing the image classification problem of bearing health conditions, it is important to recognize that the similarity between different health states is generally higher than in other types of image classification tasks. This implies that in the feature space, the distributions of different health state classes are closer to each other. Therefore, to effectively distinguish these states, it is crucial to increase the margin between classes and enhance the compactness within the same class. The loss $L_{AMS}$ is calculated using AM-Softmax. Compared to the traditional cross-entropy loss, AM-Softmax can significantly enhance the clustering effect within each class. The designed inter-class feature transfer module utilizes label information to select samples of shared classes from both the source and target domains, and independently calculates the Jensen-Shannon divergence between the two domain samples for each pair of classes, thereby obtaining the loss $L_{ic}$. This process eliminates the interference of outlier classes in the source domain, helping the model to more accurately align the relevant class features between the source and target domains.
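The preprocessing step can be illustrated with a small sketch. The paper does not state the segment length, image size, or normalization here, so the choices below (a 128-point segment, the first 64 magnitude bins, min-max scaling to 0..255, an 8 x 8 image) are assumptions for illustration only. A direct DFT is used to keep the example dependency-free; a real pipeline would use an FFT routine.

```python
import cmath
import math

def dft_magnitude(signal):
    """Magnitude of the first half of the discrete Fourier spectrum.

    For a real signal the second half mirrors the first, so it is dropped.
    """
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2)
    ]

def spectrum_to_image(signal, side):
    """Reshape the magnitude spectrum into a side x side grayscale image (0..255)."""
    mags = dft_magnitude(signal)
    if len(mags) != side * side:
        raise ValueError("signal length must equal 2 * side * side")
    lo, hi = min(mags), max(mags)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat spectrum
    pixels = [round(255 * (v - lo) / span) for v in mags]
    return [pixels[r * side:(r + 1) * side] for r in range(side)]

# Synthetic stand-in for a vibration segment: a pure tone at frequency bin 5.
signal = [math.sin(2 * math.pi * 5 * t / 128) for t in range(128)]
image = spectrum_to_image(signal, side=8)
for row in image:
    print(row)
```

For this synthetic tone, all spectral energy falls into bin 5, so the brightest pixel sits in the first row of the image; fault-related frequency components of a real signal would appear as bright pixels in the same way.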

Loss Function and Optimization
The overall loss $L$ is obtained by combining $L_{AMS}$ and $L_{ic}$ from Equations (5) and (6).
To minimize the loss function (7) and update the parameters of the network model, backpropagation combined with the Adam optimizer is employed. Let $\theta_f$ and $\theta_c$ represent the parameters of the feature extractor and classifier, respectively; both are updated from the gradients of $L$, with the learning rate set to a default value of $10^{-3}$.
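For reference, a single Adam update can be sketched in plain Python. The learning rate of 1e-3 follows the text; the remaining hyperparameters (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the commonly used defaults and are an assumption here, as is the toy quadratic objective standing in for the network loss.

```python
import math

def init_state(n):
    """Optimizer state: step counter plus first and second moment estimates."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}

def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a flat list of parameters and return the new list."""
    state["t"] += 1
    t = state["t"]
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = state["v"][i] / (1 - beta2 ** t)  # bias-corrected second moment
        updated.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return updated

# Toy objective (theta - 3)^2 standing in for the overall loss L.
theta = [0.0]
state = init_state(1)
for _ in range(2000):
    grad = [2.0 * (theta[0] - 3.0)]
    theta = adam_step(theta, grad, state)
print(theta[0])
```

In the method described above, the same kind of update would be applied to both the feature extractor and classifier parameters, with gradients obtained by backpropagating the overall loss.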

Dataset Description
Experiments were conducted on two datasets to verify the proposed method's effectiveness. Dataset I: The dataset from reference [29] is used, collected from the bearing test bench shown in Figure 5. The test bench mainly consists of a drive motor, coupling, bearings, load cell, and accelerometer. The bearing model used in the experiment is 6205-2RS (Svenska Kullager-Fabriken Co. Ltd., Gothenburg, Sweden). Different types of fault data were obtained by replacing bearings with various faults. The load size can be changed by adjusting the nut and measured by the load cell. The motor controls the rotation speed, which can be adjusted within 800 to 1200 revolutions per minute. To mimic partial failure conditions of the bearing, cracks were cut on the rollers, inner rings, and outer rings of the bearing using wire cutting technology, as shown in Figure 8, with crack widths of 0.2 mm, 0.4 mm, and 0.6 mm. In collecting experimental data, the rotation speed was set at 896.1 revolutions per minute, and the sampling frequency was 10.6 kHz. By adjusting the nut, loads were set at 1 kN, 2 kN, and 3 kN, respectively. The experimental design included 10 different sample classes: normal condition (H), inner ring fault (IF), ball fault (BF), and outer ring fault (OF), with each fault type further divided into three levels based on crack width: 0.2 mm, 0.4 mm, and 0.6 mm. The detailed information on the fault classes and labels of the two datasets is shown in Table 1. In this study, different domains are divided according to different load conditions. Dataset I covers six load domains, with 12 transfer learning tasks conducted between them for experimental research. Dataset II covers three load domains, with six transfer learning tasks conducted between them for experimental research. In the experiments, the source domain dataset contains all 10 sample classes, with 200 training samples provided for each class. However, the target domain dataset is limited to accessing only four classes of samples, specifically classes 0, 3, 7, and 9, with each class also having 200 training samples. During the testing phase, the remaining 200 samples from all 10 classes in the target domain are used for test evaluation. For network training, this study chose the Adam algorithm to update the network parameters. To ensure a fair comparison, all comparative experiments utilized the same CNN as the backbone network and set a unified random seed to ensure consistency in the initialization parameters of the network, thereby reducing random errors in the experimental results. Each transfer task was independently repeated 10 times, and the final results were taken as the average of these 10 experiments to ensure the stability and reliability of the results. Detailed information on all transfer tasks on both datasets is listed in Table 2.

Results and Analysis
To verify the effectiveness and superiority of the designed methods, several classic methods were selected for comparison, as shown in Table 3. M1 corresponds to a conventional supervised learning model that uses only source domain data and applies the standard cross-entropy loss function. M2 includes target domain samples in the training set while continuing to use the cross-entropy loss. M3, building on M2, adds a domain adaptation loss by incorporating MMD to match the feature distributions of the two domains. M4 is a traditional adversarial-based transfer learning method. M5 and M6 are classical methods for partial domain adaptation. M7 is an advanced method for domain adaptation, and M8 eliminates the target domain labels based on M7. A1 replaces the loss function with AM-Softmax based on M2. ICFT integrates all the strategies proposed in this paper.
Tables 4 and 5 compare the performance of different methods on the two datasets. It can be seen that the proposed method achieves leading results in most transfer tasks and reaches the highest average accuracy across all tasks on the two datasets, 84.55% and 96.23%, respectively. The A1 algorithm, which only adopts the AM-Softmax loss function without using the inter-class feature transfer module, has an average performance of 83.00% and 94.70%, respectively, thereby confirming the importance of the inter-class feature transfer module in enhancing the overall model performance. The M2 method, which can access target domain data, outperforms the M1 method, which can only use source domain data, by 11.25% and 5.1% in performance, highlighting the necessity of addressing the domain shift problem. However, compared to M2 without feature transfer, M3 and M4 show no significant performance improvement, and in some cases perform worse. Especially in Dataset I, M4's performance decreases by 20.42% compared to M2. This may be due to the adverse effects of outlier class samples from the source domain during the transfer process. M8 is a newer and more advanced domain adaptation method than M3 and M4, yet it lags behind M3 by 5.18% in performance in Dataset I. The training process of M7 does not involve target domain labels and also suffers a significant performance loss compared to M8.
Notably, although M5 and M6 are deep transfer learning methods optimized for label imbalance problems, they perform the worst among all methods. The reason for this is that the partial transfer issues focused on by these two methods differ from those defined in this paper. They focus solely on the model's ability to classify within the target domain's label space, overlooking its capability to classify outlier classes in the source domain. The strategy of M5 and M6 is to identify outlier classes during training and assign them lower weights to reduce their impact on the transfer process. However, when the input data include labels, the model quickly identifies outlier class samples and assigns them extremely low weights, causing the model to fail to learn the features of these classes. The information about outlier classes learned from the source domain in the early training stage suffers from catastrophic forgetting and is quickly overridden by information related to the shared classes. This contradicts the problem studied in this paper, where precise classification of even outlier classes in the target domain distribution is equally important. Figure 9 visualizes the features of the M5 and M6 models using t-SNE and presents their confusion matrices. It can be observed that since classes 0, 3, 7, and 9 in the target domain are labeled during the training phase, their classification performance during the test phase is also quite good, as expected. However, the performance of the two models in classifying other classes is less than satisfactory. The t-SNE plots show varying degrees of overlap and confusion among the features of these classes. Despite having a sufficient number of source domain samples for these classes during training, the poor classification performance seems to indicate that they were not adequately trained.

Conclusions
This paper addresses the challenge of rolling bearing diagnosis under different working conditions in real-world scenarios. Starting from practical needs, it analyzes the shortage of data in real scenes, which reveals the limitations of current bearing fault diagnosis methods based on transfer learning. The paper explores the scenario of label imbalance fault diagnosis, which involves common challenges such as domain shift and label imbalance in bearing fault diagnosis that collectively reflect the complexity of practical application scenarios. Faced with these challenges, an inter-class feature transfer fault diagnosis method is proposed, which aims to achieve efficient fault diagnosis performance across the entire label space by using a source domain dataset and a target domain dataset facing the label imbalance problem. Compared with a variety of methods on two datasets, ICFT achieved the highest performance on both (84.55%, 96.23%). At the same time, the poor performance of IWAN and PADA in this paper (64.52%, 54% and 71.91%, 50.51%) shows that these methods target a different partial transfer learning setting, one that does not retain outlier class information, which differs from the issue studied in this paper. The experimental results confirm the effectiveness and superiority of the proposed method, which provides a new effective solution for the field of rolling bearing intelligent fault diagnosis.
Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. The difference between AM-Softmax and traditional Softmax lies in the introduction of a margin by AM-Softmax, which enhances the separability between classes and makes the samples within the same class more compact.

Figure 2. Negative impact of the outlier class. (a) Traditional transfer learning. (b) Partial transfer learning.

Figure 4. The framework of the proposed method.

Figure 5. Rolling bearing test bench.
The bearing model used in the test rig is N205EU. Cracks were cut on the bearing rollers, inner rings, and outer rings using wire-cutting technology to simulate partial failures of the bearing, as shown in Figure 6. The experimental design included 10 different sample classes: normal condition (H), inner ring fault (IF), ball fault (BF), and outer ring fault (OF), with each fault type further divided into three levels based on crack width:

Figure 6. Three bearing failure types.
Dataset II: Collected from the rolling bearing test bench made in the laboratory, as shown in Figure 7.

Figure 9. Feature distribution and confusion matrix of PADA and IWAN.

Table 1. Fault classes and labels.

Table 2. Detailed information on each domain as the source and target domains.

Table 3. The methods involved in the experiments.

Table 4. Fault diagnosis accuracy on Dataset I (%). Bold indicates the best performance.