Semi-supervised learning for industrial fault detection and diagnosis: A systemic review

The automation of Fault Detection and Diagnosis (FDD) is a central task for many industries today. A myriad of methods are in use, although the most recent leading contenders are data-driven approaches and especially Machine Learning (ML) methods. ML algorithms fall into two main categories: supervised and unsupervised methods, depending on whether or not the instances are labeled with the expected outputs. However, a new approach called Semi-Supervised Learning (SSL) has recently emerged that uses a few labeled instances together with other unlabeled instances for the training process. This new approach can significantly improve the accuracy of conventional ML models for industrial environments where labeled data are scarce. SSL has been tested as a promising solution over the past few years for several FDD problems, although there have been no systemic reviews of this sort of approach up until the present review. In this study, an attempt to organize the existing literature on SSL for FDD using the taxonomy of van Engelen & Hoos is reported. The most and the least frequently used SSL algorithms are identified and considered in terms of different fault detection tasks and their most common dataset structure. Moreover, a set of best practices are proposed in the conclusions of this work for implementation under real industrial conditions, so as to avoid some of the most common faults.


Introduction
Nowadays, data are routinely gathered as part of the increasingly digitized industrial processes of Industry 4.0 [1]. Industry can therefore benefit when these data are used to assist with the automation or semi-automation of routine industrial processes and tasks, such as online process monitoring [2], fault detection and diagnosis [3], and condition monitoring [4], among others.
ML algorithms are usually grouped into supervised learning, when datasets have feature(s) of interest, commonly referred to as class(es), and unsupervised learning, when datasets have no class feature(s). Clustering is a popular unsupervised learning task, whereas classification and regression are common supervised learning tasks. On the one hand, the main drawback of supervised learning is that data with the right features are needed: there must be sufficient data, and they must be properly classified (labeled) in accordance with the model learning process and the purpose of the model, e.g., normal and abnormal (faulty) conditions for industrial-process monitoring, or the different types of faults for Fault Detection and Diagnosis (FDD). On the other hand, unsupervised learning yields less accurate models, due to an absence of labeling. Somewhere in between, there is Semi-Supervised Learning (SSL).
Within the context of the FDD industrial problem, a set of articles describing methods capable of detecting and diagnosing faults were selected. Among this sample of articles, there are several industrial applications, as will be discussed in Section 6.4. Most of the articles were focused on the utilization of bearing datasets with faults such as ball, inner race, and outer race faults. Additionally, there were articles that addressed faults related to chemicals, surface defects, and wind turbines, among others.
The recent literature on SSL applied to FDD is reviewed in this paper. To do so, an in-depth search of related articles from 2011 to October 2022 was conducted. The Scopus and Google Scholar databases were used to find articles related to the topic. The Scopus database was queried for articles containing the words 'fault detection' and 'semi-supervised' or 'semisupervised' in the title, abstract, or keywords. Google Scholar was also queried using similar terms that yielded most of the studies on SSL in the context of FDD (in addition, all the cited articles for each paper were checked, so as not to overlook any studies).
More than 300 articles were obtained in an initial search. Their abstracts were read in a preliminary filtering process. Conference papers, articles that were not about industrial FDD, and articles not in English were omitted, leaving 200 articles, all of which were read in full. In that step, a further 50 articles were discarded, either because they were not on the subject of industrial FDD or because SSL methods were not mentioned. Finally, 137 articles were included in the final study, but more than 137 methods were considered, because various methods were presented in some articles.
The main purpose of this article is to address the increasing adoption of SSL methods in several industrial applications, including FDD. Despite their widespread use, the organization of and information regarding these SSL methods are often confused. Furthermore, several instances of improper application of SSL methods, including to FDD problems, have been identified. This misuse has sometimes led to misguided outcomes, which emphasizes the need to rectify how these methods are used.
Taking into account the aforementioned motivations, a set of objectives has been established. The first one is to clarify the structure of SSL methods applied to FDD, categorizing each method within a well-established taxonomy. Once every method is categorized, a study of the use of every type of SSL method becomes necessary, in order to assess the relevance of each SSL method type. Furthermore, defining a set of best practices to assist the development of future applications of SSL methods in FDD, as well as other industrial fields, was also a primary objective. Finally, a further objective of this article is to identify future trends in the application of SSL methods to FDD, highlighting areas where further research can be conducted to address gaps in the current state-of-the-art.
The remainder of this review is organized as follows: firstly, the main theoretical concepts, namely SSL, FDD, active learning, transfer learning, and safe SSL are described in Section 2; secondly, the reference taxonomy is introduced in Section 3; the articles related to the topic are reviewed, and a proposal for their classification is offered in Sections 4 and 5; the main results and a discussion of the questions that arise are presented in Section 6; and, finally, the conclusions, a set of best practices, and some promising future trends are discussed in Section 7.

Background
In this section, the necessary background for following the paper is presented. In particular, fault detection and diagnosis (FDD) and the different learning approaches relevant to the rest of the study are briefly described: i.e., semi-supervised learning (SSL), active learning, transfer learning, and safe SSL.

Fault detection and diagnosis (FDD)
Nowadays, FDD is one of the cornerstones for ensuring the proper operation of complex industrial processes and equipment [8,9]. FDD is one of several processes that can be automated or semi-automated in industrial environments [10]. Condition monitoring, predictive maintenance, and process monitoring are other examples. Automated FDD is focused on detecting when an error or malfunction occurs within a process or system; once it occurs, the type of fault and even the specific part of the process that is failing may be identified using FDD [11].
FDD is a difficult task because: (1) fault conditions data are rare compared with normal conditions data; (2) the normal variability of the industrial processes may be responsible for the observed deviations in the features; (3) the complex interrelations between process inputs can make it difficult to identify the source and type of some faults [12]. For all these reasons, the problem has been solved in different ways.
For 20 years, FDD approaches were split into three categories: the model-based approach [13], the knowledge-based approach [14], and the history-based approach [15]. A more recent review of FDD techniques for industrial maintenance [16] keeps this division, while changing the name of the third approach to data-driven [12] and focusing on the industrial requirements of Industry 4.0. In addition, recent reviews have been limited to specific processes where FDD plays a major role: chemical processes [9] and machinery failures [17], among others. In comparison with these previous works, the focus of this systemic review is placed on a promising new set of algorithms: the SSL techniques.
The FDD approaches are classified on the basis of data availability, because data structure, characteristics, and size all play major roles in FDD. The use of data collected for diagnosis from the process or system during its operation means that normal operating conditions may be interrupted and can vary when unexpected events occur, such as failure and even normal wear and tear [10]. A model can therefore be constructed that associates the behavior observed in some features with normal operating conditions and with one or more failure conditions.
As previously noted, ML algorithms have normally been divided into two main approaches, namely supervised and unsupervised learning, depending on whether or not the instances are labeled with information on the expected outcome. However, output availability is often mentioned as a key factor when choosing the most suitable and accurate ML algorithm. For example, approaches using neural networks often produce very accurate models, but they often require a large amount of labeled data with sufficient fault data examples to achieve that accuracy, a rare case in real industrial conditions, as Jiang et al. [18] previously outlined.
In general, it can be said that a key point in the whole ML process is to gather a proper set of representative examples that are labeled with accurate outputs, which can then be used as inputs for training ML algorithms. Today, data from many processes can easily be captured and stored at little or no cost. However, the process of labeling instances with the expected output is usually laborious, time-consuming, and expensive [19]. This industrial limitation is the key issue driving the introduction of SSL for use in FDD tasks.

Semi-supervised learning (SSL)
Zhou [20] defined supervised learning as a situation where there are sufficient accurately labeled instances available to train a model, while alternative situations are globally classified as weakly supervised learning. The author distinguished three main types of weakly supervised learning: incomplete supervision, inexact supervision, and inaccurate supervision. Incomplete supervision refers to a dataset with an insufficient number of accurately labeled instances, though unlabeled instances are available; in this case, active learning or SSL can be used, depending on whether an accurate label can be requested from a human operator. Inexact supervision refers to a dataset with accurately labeled instances, though the labels are coarse-grained, and not as precise as needed. Finally, inaccurate supervision refers to a training dataset with some erroneous labels, which must be taken into account during the learning process, usually by updating those erroneous labels.
SSL, an intermediate approach between supervised and unsupervised algorithms, has been proposed as a possible solution to the problem of not having enough labeled instances for training. The goal of SSL is to obtain an accurate model using a limited number of labeled instances together with unlabeled instances, in order to improve on the model that could be obtained from the labeled instances alone. SSL is based on four main assumptions [7,21]:
• Smoothness assumption: two instances that are close to each other in the input space have the same label.
• Low-density assumption: the decision boundary should pass through a low-density space.
• Manifold assumption: the high-dimensional input space is composed of multiple lower-dimensional sub-spaces (manifolds). Instances on the same lower-dimensional sub-space must have the same label.
• Cluster assumption: instances that are located in the same cluster should have the same label.
However, the semi-supervised approach is not often considered in literature reviews on applications of ML-based solutions within industrial environments.For example, Mowbray et al. [22] presented a review of ML approaches in the chemical process industry that included: supervised, unsupervised, and reinforcement learning, although SSL was not considered.
Unlabeled data can be used in several ways and at various stages of the process. For example, a common step before the training phase is to perform feature selection or dimensionality reduction of the available data. A common unsupervised method for this purpose is Principal Component Analysis (PCA; see footnote 2), which allows direct use of the unlabeled data in that step. A supervised alternative is Fisher Discriminant Analysis (FDA; see footnote 3), which can only use labeled instances. However, several semi-supervised versions of FDA have been proposed, e.g., Semi-supervised FDA (SFDA) [5], Ensemble Semi-supervised FDA (ESFDA) [23], and others with which both labeled and unlabeled instances can be used at the same time. Similarly, a multitude of algorithms have been developed to include the unlabeled instances in the training step of the model using different approaches.
SSL methods can initially be divided into transductive and inductive methods [7]. On the one hand, transductive algorithms usually develop no model in the training phase and the goal is to label the set of unlabeled instances that are already available. Therefore, the transductive approach is not used with new and unseen instances, for example, in an on-line diagnostic system that performs detection and diagnosis as new data are collected. On the other hand, inductive algorithms usually develop a model during the training phase that can be used later on to label unseen data. The goal in this case is to use the unlabeled instances already available during the training stage to improve the model that could have been obtained, had only labeled instances been used for training. It has been addressed in many ways, as can be seen in [7,24]. A complete explanation of SSL taxonomy will be detailed in Section 3.
2 PCA aims to transform the feature space by creating a new set of orthogonal variables, known as principal components, which capture the maximum variance in the data and improve the identification of data patterns.
3 FDA aims to find a linear combination of features to maximize the separability between classes in a dataset.

Active learning
Active learning uses both labeled and unlabeled instances for the training, but instead of using an ML model to pseudo-label the unlabeled instances as SSL does, it selects the most relevant unlabeled items to be queried for labeling, e.g., by a human expert.
Typical criteria for selecting the unlabeled instances to be labeled are to choose the instances about which the model is least certain, or those that provide the largest expected error reduction, among others [25,26].
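The least-certainty criterion can be illustrated with a minimal sketch. It assumes that a previously trained model has already produced class probabilities for each unlabeled instance; the function name `least_confident` and the probability values are hypothetical and are not taken from any of the reviewed articles.

```python
def least_confident(probabilities, n_queries):
    """Return the indices of the n_queries instances with the lowest top-class probability."""
    # Confidence of an instance = probability of its most likely class.
    confidence = [max(p) for p in probabilities]
    # Rank indices by ascending confidence, so the least certain come first.
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:n_queries]


# Hypothetical class probabilities for five unlabeled instances
# (columns: normal, inner race fault, outer race fault).
probs = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.10, 0.85, 0.05],
    [0.34, 0.33, 0.33],
    [0.05, 0.15, 0.80],
]
to_label = least_confident(probs, n_queries=2)  # instances to send to the human expert
```

In this sketch, only the selected instances would be passed to the expert for labeling; other criteria, such as expected error reduction, would replace the ranking step.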
Several active learning applications have been developed for FDD problems, such as in [27], where bearing faults are predicted using a residual network with active learning, enabling results similar to those obtained with 1,200 instances to be achieved with only 150.
Active learning is, nonetheless, often seen as a different, perhaps even contrary, approach to semi-supervised learning. Despite these different approaches towards learning, active learning can be applied together with SSL (e.g., [28-30]), in which case an expert labels the most uncertain unlabeled instances.

Transfer learning
Transfer Learning [31], also known as Cross Domain, Domain Generalization, or Cross Machine, is an ML technique that is designed to improve models by adding information from another domain. Two different domains are then considered: the domain of the problem that is to be solved, called the target domain, and the domain of a similar problem where information can be obtained, called the source domain.
All instances of the source domain are usually labeled in transfer learning applied to SSL problems. So, if the source and target domains are similar, significant amounts of information may be obtained.
Several transfer learning applications have been developed for FDD due to the limited availability of labeled data in most datasets. Examples of transfer learning applied to FDD include [32], where different modes from the Tennessee Eastman Process dataset [33] were utilized as distinct views for transfer learning, and [34], where transfer learning was employed to enhance the information in chiller FDD problems.

Safe semi-supervised learning (SSL)
One of the problems of some SSL techniques is that they do not guarantee that the performance of the model using both labeled and unlabeled data is better, or at least not worse, than the model trained with labeled data alone [35]. This possible side effect has meant that SSL is not as popular as other ML techniques in industrial applications.
Bearing the above in mind, the idea behind safe SSL is to ensure that a model trained with unlabeled data will not perform worse at classification than a model trained only with labeled examples. Safe SSL is described in far greater detail in [36].
Studies on Safe SSL for FDD are very scarce, though a detailed explanation of one such study, [37], can be found later in this manuscript.

Taxonomy of semi-supervised learning
In this section, a taxonomy of SSL is presented for categorizing the literature on FDD, which is applied in the following sections. An up-to-date review of SSL algorithms can be found in [7], in which the taxonomy used in this paper is developed to classify those algorithms.
The taxonomy used as a reference for the classification is depicted in Fig. 1. In addition, three new elements have been added to the taxonomy that are related to SSL in several of the articles that were reviewed: active learning, transfer learning, and safe semi-supervised learning. In the literature, active learning and safe SSL have only been related to inductive learning methods, while transfer learning has been applied to inductive and transductive learning. Nonetheless, active learning, safe SSL, and transfer learning are transversal to SSL methods and can be applied to more than one method or type of method.
Fig. 1. Taxonomy for semi-supervised classification methods. The original taxonomy proposed in [7], in addition to the three additional techniques found in the review.
Fig. 2. The different articles under review are sorted by year and classified according to the taxonomy of Engelen & Hoos [7]. The inductive methods are further grouped into the three subcategories: wrapper methods, unsupervised pre-processing, and intrinsically semi-supervised.
Although each semi-supervised category can be divided into further subcategories, as was done by Triguero et al. [24], the more general and well-established taxonomy proposed by Engelen & Hoos [7] was used in this review, as the taxonomy proposed in [24] was partial and referred only to wrapper methods. In Fig. 2, the articles are sorted by year and grouped under their corresponding categories according to the aforementioned taxonomy. The evolution of articles in terms of category distribution will be further explained in subsequent sections.
The following sections compile and arrange the extensive literature on FDD in which SSL is used. To aid comprehension and organization, transductive and inductive methods are covered in separate sections (Sections 4 and 5, respectively).

Transductive
The first group of reviewed SSL methods comprises transductive methods. As previously noted, transductive algorithms can only predict the labels of the unlabeled instances already available in the training phase; they are generally based on graphs. These methods usually comprise three phases [7]: graph creation, graph weighting, and inference (label propagation). Among the articles under review, the transductive proposals for FDD are summarized below.
Several authors have researched various ways of building a graph: Zhao et al. [38] used Non-negative Sparse Coding (NSC) to construct their graphs; k-Nearest Neighbors (kNN), which is based on the smoothness assumption, was used to build the graph in [39]; while Wang et al. [40] used Non-negative Matrix Factorization (NMF).
Some research was focused on imbalanced datasets, and how to deal with this problem. Qian & Li [47] used the Synthetic Minority Oversampling TEchnique (SMOTE). Likewise, Fang et al. [48] used a Generative Adversarial Network (GAN) in a labeled dataset to correct its proportions. Some other research is also of interest. Chen et al. [49] applied, in addition to the transductive graph, a Random Forest (RF) classifier to verify the graph predictions.
Transfer Learning between datasets was used in [50] to improve graph accuracy. Zhao et al. [51] used a transductive graph method, called Semi-supervised Local Kernel Density Estimation (SLKDE), to classify historical off-line data and then to create a supervised classifier to detect and to diagnose on-line faults within wireless sensor networks.
As can be seen, few works apply transductive methods to FDD. This can be explained by the fact that industrial applications usually require the ability of inductive methods to label new instances that are not available at the training stage. This makes inductive methods more versatile and useful.
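The three phases of a graph-based transductive method mentioned at the start of this section (graph creation, graph weighting, and inference) can be illustrated with a minimal sketch. It assumes a kNN graph, Gaussian edge weights, and a simple iterative label propagation with clamped labels; the function names, the toy data, and the parameters `k`, `sigma`, and `n_iter` are hypothetical and do not reproduce any of the reviewed proposals.

```python
import math


def knn_graph(X, k):
    """Graph creation: link each instance to its k nearest neighbours (undirected)."""
    edges = set()
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]
        edges.update((min(i, j), max(i, j)) for j in nearest)
    return edges


def weight_graph(X, edges, sigma=1.0):
    """Graph weighting: Gaussian kernel of the Euclidean distance along each edge."""
    W = [[0.0] * len(X) for _ in X]
    for i, j in edges:
        W[i][j] = W[j][i] = math.exp(-math.dist(X[i], X[j]) ** 2 / (2 * sigma ** 2))
    return W


def propagate(W, labels, classes, n_iter=100):
    """Inference: average neighbour scores repeatedly; known labels stay clamped."""
    n = len(W)
    F = [[1.0 if labels[i] == c else 0.0 for c in classes] for i in range(n)]
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            degree = sum(W[i])
            if labels[i] is not None or degree == 0.0:
                updated.append(F[i])  # labeled or isolated node: keep current scores
                continue
            updated.append([sum(W[i][j] * F[j][c] for j in range(n)) / degree
                            for c in range(len(classes))])
        F = updated
    return [classes[max(range(len(classes)), key=lambda c: F[i][c])] for i in range(n)]


# Toy data: two groups of instances, one labeled instance per group (None = unlabeled).
X = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [5.0, 0.0], [5.5, 0.0], [6.0, 0.0]]
labels = ["normal", None, None, "fault", None, None]
W = weight_graph(X, knn_graph(X, k=2))
predicted = propagate(W, labels, classes=["normal", "fault"])
```

Note that the sketch only labels the instances in `X`; no model is produced for unseen instances, which is the limitation of transductive methods discussed above.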

Inductive
Attention now turns to inductive SSL methods, which were used in most of the articles under review. Inductive algorithms generate a model which, once trained, can be used to predict the labels of new instances. Inductive algorithms can be further divided into three subcategories: wrapper methods (Section 5.1), unsupervised pre-processing (Section 5.2), and intrinsically semi-supervised (Section 5.3).

Wrapper methods
Wrapper methods are the first inductive method type to be reviewed. These methods first train a classifier (some wrapper methods train more than one classifier) using only the labeled instances for generating the predictions of unlabeled instances. Then, the classifier is (or the classifiers are) retrained using both the original labeled instances and the new labeled (also called pseudo-labeled) instances for improving the model, in a process that can be performed several times. With regard to the wrapper methods, Engelen & Hoos [7] identified three main groups that are listed below.

Self-training
Self-training is the simplest way to obtain an SSL algorithm from a supervised one. Self-training first trains a model with the labeled instances alone, and then the unlabeled instances are pseudo-labeled with the prediction of the trained model. Pseudo-labeled instances whose prediction confidence is above a certain threshold are added to the labeled dataset [52]. The training and pseudo-labeling processes are repeated until either there are no unlabeled data or a predefined limit is reached. The work of Triguero et al. [24] is recommended for an exhaustive literature review of self-training methods.
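The self-training loop can be illustrated with a minimal sketch. A nearest-centroid classifier is assumed as the base supervised model, with a softmax over negative distances as the confidence measure; these choices, the stopping rule (no new pseudo-labels or `max_iter` rounds), and all names and values are hypothetical and serve illustration only.

```python
import math


def fit_centroids(X, y):
    """Base supervised model: mean feature vector of each class."""
    grouped = {}
    for x, label in zip(X, y):
        grouped.setdefault(label, []).append(x)
    return {c: [sum(col) / len(rows) for col in zip(*rows)] for c, rows in grouped.items()}


def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    weights = {c: math.exp(-math.dist(x, mu)) for c, mu in centroids.items()}
    label = max(weights, key=weights.get)
    return label, weights[label] / sum(weights.values())


def self_training(X_lab, y_lab, X_unl, threshold=0.8, max_iter=10):
    X_lab, y_lab, X_unl = list(X_lab), list(y_lab), list(X_unl)
    for _ in range(max_iter):
        if not X_unl:
            break  # every instance has been pseudo-labeled
        model = fit_centroids(X_lab, y_lab)
        remaining = []
        for x in X_unl:
            label, confidence = predict_with_confidence(model, x)
            if confidence >= threshold:
                X_lab.append(x)      # confident prediction: keep as pseudo-label
                y_lab.append(label)
            else:
                remaining.append(x)  # uncertain: try again in the next round
        if len(remaining) == len(X_unl):
            break  # no instance passed the threshold, so further rounds change nothing
        X_unl = remaining
    return fit_centroids(X_lab, y_lab), X_unl


# Toy data: one labeled instance per class and four unlabeled instances.
model, leftover = self_training(
    X_lab=[[0.0, 0.0], [4.0, 4.0]],
    y_lab=["normal", "fault"],
    X_unl=[[0.5, 0.2], [3.8, 4.1], [1.0, 0.5], [2.0, 2.0]],
)
```

In this toy run, the instance halfway between the two classes never reaches the confidence threshold and therefore remains unlabeled, which shows how the threshold limits the propagation of doubtful pseudo-labels.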
Self-training has been applied in various ways, modifying different parts of its implementation. For example, Zheng & Zhao [53] changed the way confidence is calculated for pseudo-labels. The authors used a Temporal-Spatial Confidence Measure (TSCM) to obtain temporal and spatial information from unlabeled and pseudo-labeled data. In [54], the same authors modified confidence, so that it could be calculated as a distance.
In several papers, the threshold itself was changed. Zhang et al. [55] determined the threshold using Monte Carlo and Liu et al. [56] used a dual-threshold, in order to avoid overfitting.
Other authors modified the way their pseudo-labels are selected to be added to the labeled dataset. Active Learning Semi FDA (ALSemiFDA), a special self-training method whose pseudo-labeled instances are not fixed and can be corrected when mislabeled, was used in [2,29]. Long et al. [57] presented two selection strategies for the self-training method: Gradually Exploiting Mechanism (GEM) and Distance-Based Sampling Criterion (DBSC).
Other authors have focused on datasets of a specific type: imbalanced datasets were investigated in [58], the focus was on noisy datasets in [59], while datasets with new faults that were not available in the training phase were examined in [60].

Co-training
Co-training is the adaptation of self-training from the perspective of an ensemble. Co-training methods are composed of two or more supervised models trained with the same (single view) or different (multiple views) labeled instances [62]. The models then predict the labels of the unlabeled instances, and their predictions are combined (mainly by voting or by the sum of confidences). As in self-training, the process is repeated until all unlabeled instances have been pseudo-labeled in all views, or a predefined limit has been reached.
An example of single view co-training applied to FDD can be found in [23], where a GAN was used to generate unlabeled data which was then used in the co-training, combining the prediction, by voting, of a conditional Deep Convolutional GAN (cDCGAN) and a Residual Network (ResNet18). Another example of the co-training method with two base estimators, Decision Trees (DT) and kNN, is presented in [63].
In Tri-training [64], as its name suggests, three classifiers are trained. It is a multi-view co-training method that has been extensively used with FDD, as described in [30,65-67].
Other authors have proposed co-training methods with multiple (undefined) base estimators. On the one hand, Liu et al. [68] used Extreme Learning Machines (ELM) as a base estimator to detect and to diagnose milling faults. On the other, Huang et al. [69] proposed a co-training method based on fuzzy rules for demagnetization faults.
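A multiple-view variant of this procedure can be illustrated with a minimal sketch. It assumes that each view is a single feature, that each base model is a nearest-mean classifier, and that predictions are combined by the sum of confidences into a shared pseudo-labeled set; these simplifications, together with all names and values, are hypothetical.

```python
import math


def fit_means(values, labels):
    """One-feature nearest-mean model: average feature value of each class."""
    grouped = {}
    for v, c in zip(values, labels):
        grouped.setdefault(c, []).append(v)
    return {c: sum(vs) / len(vs) for c, vs in grouped.items()}


def class_confidences(means, v):
    """Softmax over negative distances to each class mean."""
    weights = {c: math.exp(-abs(v - m)) for c, m in means.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def co_training(X_lab, y_lab, X_unl, views=(0, 1), threshold=0.8, max_iter=10):
    """Co-training with one model per view (feature index in `views`)."""
    X_lab, y_lab, X_unl = list(X_lab), list(y_lab), list(X_unl)
    for _ in range(max_iter):
        if not X_unl:
            break
        models = [fit_means([x[v] for x in X_lab], y_lab) for v in views]
        remaining = []
        for x in X_unl:
            # Combine the views by summing their per-class confidences.
            combined = {}
            for v, model in zip(views, models):
                for c, p in class_confidences(model, x[v]).items():
                    combined[c] = combined.get(c, 0.0) + p
            label = max(combined, key=combined.get)
            if combined[label] / len(views) >= threshold:
                X_lab.append(x)
                y_lab.append(label)
            else:
                remaining.append(x)
        if len(remaining) == len(X_unl):
            break  # nothing was pseudo-labeled in this round
        X_unl = remaining
    return X_lab, y_lab, X_unl


# Toy data: the last unlabeled instance looks normal in view 0 and faulty in view 1.
X_out, y_out, left = co_training(
    X_lab=[[0.0, 0.0], [3.0, 3.0]],
    y_lab=["normal", "fault"],
    X_unl=[[0.2, 0.1], [2.9, 3.2], [0.1, 2.9]],
)
```

In the toy run, the instance on which the two views disagree is left unlabeled, because its averaged confidence stays close to 0.5.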

Boosting
Boosting [70] is an ensemble method that aims to distribute weights among the model predictions, so that greater weight is assigned to those instances that are mislabeled. When boosting is applied, several weak models are trained, which together generate a strong (good) classifier. The different base classifiers are combined based on weights that depend on their accuracy. Nonetheless, boosting has not been widely used in SSL for FDD.
Razavi-Far et al. [71] compared multiple SSL methods for feature extraction and for classification, noting that the ASSEMBLE [72] classifier achieved the best results. In [73], multiple feature extraction methods and SSL classification methods, including SemiBoost [74], were also compared.

Unsupervised pre-processing
Unsupervised pre-processing methods are the second inductive method type to be reviewed. They use unlabeled instances for different purposes, such as extracting features from the unlabeled data, pre-clustering the data, and setting the initial parameters of a supervised learning model in an unsupervised way. It must be noted that methods of this type perform these actions prior to the training of the final model. Unsupervised pre-processing can be further divided into three subcategories: feature extraction, cluster-then-label, and pre-training.

Feature extraction
Feature extraction is an attempt to transform the available data, either to improve the accuracy of the model, or to make its construction more efficient. These reasons are more important for the approaches that do not rely on neural networks to obtain the trained model and, therefore, a common step is to perform a semi-supervised feature extraction or dimensionality reduction, in an attempt to take advantage of both labeled and unlabeled observations. Different kinds of Auto-Encoders (AE) have been used for dimensionality reduction and for extracting features from high dimensional datasets, as in [18,75-78]. Because AEs can be used for dimensionality reduction and are unsupervised in nature, they provide an easy way to account for unlabeled items.
The different types of FDA variations, as applied in [5,79-84], form another typical approach for semi-supervised feature extraction and dimensionality reduction. Although FDA is a supervised dimensionality reduction algorithm and is therefore based on the output feature(s), there have been several proposals to obtain semi-supervised versions of FDA so that dimensionality reduction can be performed, taking into account both labeled and unlabeled data.
Yet another common approach is to use a manifold-based approach, such as the ones applied in [71,85-89]. Manifold learning methods try to reduce the dimensionality of the data while preserving the nonlinearities present in the data.
Some additional strategies to semi-supervised dimensionality reduction or feature extraction have been tested, such as the Support Vector Machine (SVM), as described in [90,91], and hypergraphs, as described in [92].

Cluster-then-label
A popular practice in SSL is to perform clustering first, either unsupervised or semi-supervised, hoping that the result of the clustering will serve as a guide in the subsequent classification process.
Proposals can be arranged into two main groups that use either any kind of semi-supervised approach or unsupervised clustering.
Semi-supervised clustering is a kind of clustering that at some point uses information (from labeled instances or constraints) that improves its performance. Semi-supervised clustering was used in the following proposals: [4,45,93-97]. On the other hand, there are several proposals using an unsupervised clustering algorithm (e.g., k-Means, density peak clustering, among others) [98-102]. In this case, the unsupervised clustering process is usually a prior step that clusters the data before any semi-supervised labeling process is performed, or the process is used as a pseudo-labeling step performed after a semi-supervised process such as feature extraction or a dimensionality reduction step.
It is worth mentioning that new types of faults, which were not present in the training dataset, may be detected with some of these proposed methods [95,96,100].
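The cluster-then-label idea can be illustrated with a minimal sketch. It assumes plain k-Means as the clustering step, with the initial centroids taken from labeled instances, followed by a majority vote of the labeled members of each cluster; the function names, the seeding choice, and the toy data are hypothetical.

```python
import math
from collections import Counter


def k_means(X, centroids, n_iter=20):
    """Plain k-Means starting from the given initial centroids; returns cluster indices."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(n_iter):
        assignment = [min(range(len(centroids)), key=lambda k: math.dist(x, centroids[k]))
                      for x in X]
        for k in range(len(centroids)):
            members = [x for x, a in zip(X, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def cluster_then_label(X, labels, initial_centroids):
    """Cluster all instances, then give each cluster the majority label of its labeled members."""
    assignment = k_means(X, initial_centroids)
    votes = {}
    for a, label in zip(assignment, labels):
        if label is not None:
            votes.setdefault(a, Counter())[label] += 1
    result = []
    for a, label in zip(assignment, labels):
        if label is not None:
            result.append(label)  # known labels are kept as they are
        elif a in votes:
            result.append(votes[a].most_common(1)[0][0])
        else:
            result.append(None)   # cluster without any labeled member
    return result


# Toy data: one labeled instance per operating condition (None = unlabeled).
X = [[0.0, 0.0], [0.3, 0.2], [0.1, 0.4], [4.0, 4.0], [4.2, 3.9], [3.8, 4.3]]
labels = ["normal", None, None, "fault", None, None]
pseudo_labels = cluster_then_label(X, labels, initial_centroids=[X[0], X[3]])
```

A cluster that contains no labeled instance is left without a label (`None`) in this sketch; such clusters are one way in which previously unseen operating conditions could become visible.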

Pre-training
Pre-training uses unlabeled data primarily for approximating weights in deep learning networks, such as Deep Belief Networks (DBN) and stacked autoencoders, before applying supervised learning for fine tuning those weights.
Ding et al. [103] used a modified ResNet for incipient fault detection using vibration signals. The neural network was firstly pre-trained and then a semi-supervised fine tuning was performed. Finally, it was able to classify under either normal or (incipient) fault conditions.
Liao et al. [104] proposed the use of an optimized DBN through the particle swarm approach for avoiding the optimization problems associated with DBN. The DBN must be pretrained in an unsupervised way before fine tuning using labeled data.

Intrinsically semi-supervised
The last category of inductive methods to be examined is that of intrinsically semi-supervised methods. These methods incorporate unlabeled instances into the objective or optimization function of the learning method. Intrinsically semi-supervised methods can be further divided into: maximum-margin, perturbation-based, manifolds, and generative models.

Maximum-margin
Maximum-margin methods represent an attempt to maximize the distance between the data points and the decision boundary, based on the low-density assumption.As in a supervised learning approach to margin maximization, SVM adaptations are often used.Several adaptations of SVM to SSL have been proposed throughout the past few decades [105], in which semi-supervised SVMs leverage information from unlabeled data to achieve better class separation.
Wang et al. [106] used a tree structure created with the Semi-Supervised Gaussian Mixture Model (SGMM) in combination with a Semi-Supervised SVM (S3VM). Jia et al. [37] proposed a Dynamic Active Safe Semi-Supervised SVM (DAS4VM) with active learning and safe SSL to detect faults using PCA as pre-processing. Mao et al. [107] also applied an S4VM, in this case from an online perspective, to distinguish between faulty and normal states, among others. Other authors applied the low-density assumption to their neural networks, in order to separate data of different classes [108,109].

Perturbation-based
Based on the smoothness assumption, perturbation-based methods add noise to the data or directly to the model: if the noisy data and the real data are similar, they should have the same label.
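The principle can be illustrated with a consistency term computed on unlabeled data: the squared gap between the predictions for a clean input and for a noisy copy of it. The sketch below is an assumption-laden toy (a logistic model, Gaussian input noise with a hypothetical standard deviation `sigma`, and hand-picked weights), not the loss of any specific reviewed network. In practice, a term of this kind is typically added to the supervised loss with a weighting factor.

```python
import math
import random


def predict(x, w, b):
    """Toy logistic model returning P(fault | x)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def consistency_loss(unlabeled, w, b, sigma=0.1, seed=0):
    """Mean squared gap between predictions on clean and perturbed inputs."""
    rng = random.Random(seed)
    total = 0.0
    for x in unlabeled:
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        total += (predict(x, w, b) - predict(noisy, w, b)) ** 2
    return total / len(unlabeled)


U = [(0.1, -0.1), (0.0, 0.2), (-0.3, 0.1)]

# A constant model is perfectly consistent (and useless as a classifier).
flat = consistency_loss(U, (0.0, 0.0), 0.0)
# A model that changes sharply around these points is penalized.
sharp = consistency_loss(U, (5.0, 5.0), 0.0)
print(flat, sharp)
```

No labels are needed to compute this term, which is what allows unlabeled instances to influence training.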
Neural networks are one of the most commonly used techniques in perturbation-based methods. For example, in [110] a Semi-Supervised Deep Ladder Network (SSDLN) with information fusion was tested on a gear failure dataset. Chen et al. [111] also used the Ladder Network, this time applied to photovoltaic systems.
Another type of neural network that has been modified to be an SSL perturbation-based method is Long Short Term Memory (LSTM). Zhang & Qiu [112], for example, applied LSTM to a chemical dataset, while Tang et al. [113] used LSTM with transfer learning to address bearing faults. Some authors used Data Augmentation/Generative methods to improve dataset information and then used the improved dataset to train a perturbation-based neural network [114][115][116].
Other types of modifications have been proposed in various studies, in order to apply perturbation-based methods to the FDD problem. Hu et al. [117] used two networks, one for inter-instance information and another for intra-temporal information, which were then combined in the final loss function to classify bearing faults. Bearing faults were also taken into account in [118], where a Deep AutoEncoder (DAE) was used to preprocess bearing data, and in [119], where an attention method called squeeze-and-excitation was used. Shim et al. [120] proposed a three-phase framework for wafer semiconductor manufacturing. The framework is distinctive in that different types of ML methods (supervised, unsupervised, and semi-supervised) are applied in each phase, depending on the amount of labeled data available.

Manifolds
Centering on the manifold assumption, manifold methods usually modify the input feature space to calculate distances between data points, which can also be achieved with graphs. Manifold methods have been widely applied to different FDD problems.
Most of the research on manifold SSL applied to FDD is graph-based. It is common to generate a graph and then assign classes based on some type of distance [121][122][123][124][125]. Some proposals use multiple graphs to represent the information [126][127][128], while others use different techniques to add more information to the graph [129]. Some research can be grouped by the way in which the graph that represents both labeled and unlabeled data is generated. The most common graph creation method, kNN, is used in several papers, such as [130][131][132].
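As an illustration of the graph-based approach, the sketch below builds a symmetric kNN graph over labeled and unlabeled instances and then spreads the known labels along its edges, keeping the labeled nodes fixed. It is a minimal label-propagation example on toy one-dimensional data; the function names `knn_graph` and `propagate`, the value of `k`, and the number of iterations are assumptions chosen for brevity.

```python
import math


def knn_graph(points, k):
    """Symmetric kNN adjacency: i and j are linked if either is among
    the other's k nearest neighbours."""
    n = len(points)
    adj = [set() for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        for j in order[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def propagate(points, labels, classes, k=2, n_iter=50):
    """labels[i] is a class name, or None for an unlabeled instance."""
    adj = knn_graph(points, k)
    # One score per class and node; labeled nodes start one-hot.
    scores = [[1.0 if l == c else 0.0 for c in classes] for l in labels]
    for _ in range(n_iter):
        new = []
        for i, l in enumerate(labels):
            if l is not None:
                new.append(scores[i])  # labeled nodes are clamped
            else:
                new.append([sum(scores[j][c] for j in adj[i]) / len(adj[i])
                            for c in range(len(classes))])
        scores = new
    return [classes[max(range(len(classes)), key=lambda c: s[c])]
            for s in scores]


X = [(0.0,), (0.5,), (1.0,), (4.0,), (4.5,), (5.0,)]
y = ["normal", None, None, None, None, "fault"]
predicted = propagate(X, y, classes=["normal", "fault"])
print(predicted)
```

With a single labeled instance per group, every unlabeled instance inherits the label of the labeled node it is connected to through the graph.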
Other methods are characterized in different ways. Kernel functions were used to modify input data, so that the data were distributed over the different manifolds, as described in [133] for the production process of fused magnesia and in [3] for the Hot Galvanizing Pickling Waste Liquor Treatment Process (HGPWLTP). Fan & Zhang [134] also approached the HGPWLTP problem, though they used Laplacian Regularization.
In some other studies, techniques such as active learning were applied to improve the original dataset [135]. In yet others, transfer learning was used to add information from the source domain to improve the classification of the target domain [136][137][138][139].
Some other investigations used notable types of pre-processing, which consisted of transforming the vibration data into images [19,128,140].
Liu et al. [141] focused on the Riemannian space instead of the Euclidean space that is typically used. Gao et al. [142] created a manifold method that combines a Convolutional Neural Network (CNN) and an AE into a Pseudo-Label CNN (PLCNN) for FDD. Razavi-Far et al. [73] proposed Semi-Supervised Smooth Alpha Layering (S3AL) and performed a comparative study between different SSL algorithms.

Generative models
Generative models are designed to create new instances from labeled data. A discriminative function of the generative model classifies most of the new unlabeled data. Several generative methods have been proposed to deal with SSL for FDD, all of which fall into one of the three subsets described in [7]: mixture models, generative adversarial networks (GANs), and variational AEs.
Mixture models are very useful when the distribution of the data is known, so that the model can be built as a combination of several distributions. Several pieces of research have applied mixture models to FDD [143][144][145][146].
The most commonly used model is nevertheless the semi-supervised GAN. Typical unsupervised GANs are mainly composed of a generator model, which creates new data from training data, and a discriminator that serves to detect which data are real (original data) and which were created by the generator. In semi-supervised versions of GAN, the discriminator is modified, so that it can predict the data labels [147][148][149][150][151][152][153][154][155][156][157][158][159][160][161].
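One widely used way of modifying the discriminator, assumed here for illustration and not necessarily the design of the cited works, is to give it one output per fault class plus one extra output for "generated" data. The sketch below shows only how such an output vector can be read: the probability of being real is the complement of the extra class, and the predicted label is the most likely real class. The logits and class names are invented.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator_outputs(logits, classes):
    """logits has len(classes) + 1 entries; the last one is the
    extra 'generated' class."""
    probs = softmax(logits)
    p_fake = probs[-1]
    class_probs = probs[:-1]
    best = max(range(len(classes)), key=lambda i: class_probs[i])
    return {"p_real": 1.0 - p_fake, "label": classes[best]}


# Hypothetical discriminator logits for one vibration sample.
out = discriminator_outputs([2.0, 0.1, -1.0], ["normal", "fault"])
print(out)
```

Labeled instances can then train the class outputs, while unlabeled and generated instances train the real-versus-generated decision.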
A Variational AE (VAE) [162] is a latent variable model in which data are treated as if they were generated from a vector of latent variables [163][164][165][166].
As has been seen in other SSL method types, several works transform their data, mainly vibration-based, into images in order to use CNNs in the generation process; some examples can be found in [148,151,154,155,157,160].
It is also notable that some of the methods proposed in these articles are able to detect new types of faults, not seen in the training phase [143,146,152].

Results and discussion
Following the thorough review of papers on SSL for FDD, it was found that the results could not be directly compared, due to differences between the experiments and the implementations. Instead, a journal analysis was performed to show the top journals in which articles on SSL for FDD have been published (Section 6.1). In addition, the evolution of publications on SSL for FDD was studied to observe the patterns over time and the possible future of the research field (Section 6.2). Finally, a popularity comparison was performed to find out which methods appear to be the most widely used (Section 6.3).

Journals
The papers under review were published in different journals, some of which were more generic while others were more specific. The journals were divided into three different groups: industry, artificial intelligence, and mixed.
As can be seen in Fig. 3, industry-themed journals are the most popular (59.85%), which may indicate that the industrial problem weighs more than the ML method itself when selecting a journal in which to publish. This preference also explains some limitations in the use of SSL in industrial environments, a field where most of the researchers involved are closer to industry than to computer science.
IEEE Transactions on Instrumentation and Measurement is the industry journal with the most publications (6.57%). At the other extreme, artificial intelligence journals pay little attention to real solutions for industry, restricting themselves to publications on the development of ad-hoc SSL solutions for FDD. Therefore, a major role is expected from hybrid journals, which can build a bridge between these two disciplines in the near future. It is worth mentioning that IEEE Transactions on Industrial Informatics and Chemometrics and IEEE Access are the journals with the most FDD SSL publications (8.76% and 8.03% respectively).

Publications throughout time
At the beginning, from 2011 to 2015, several SSL method types were used for FDD, as can be seen in Fig. 2. Something that stands out from these years is the absence of wrapper methods, despite the fact that those methods are the easiest transition from supervised learning to SSL. This gap may be due to the late application of SSL to FDD, after the development of other types of methods [21].
From 2015 to 2018, feature extraction methods were the most popular. However, intrinsically semi-supervised methods have been the most widely used since 2018, mainly methods using the manifold approach and generative methods, the latter because of their ability to generate new data and their good performance and popularity in other fields [167]. Furthermore, transductive and self-training methods have also gained popularity quite recently.
It is worth noting that there are some types of methods, such as manifolds and cluster-then-label, that have been widely used since first introduced.
Overall, focusing on Fig. 4, it is clear that the application of SSL methods to FDD problems has been a growing topic from its first application up until today. This growth is expected to continue over the coming years.

Popularity
As previously noted, inductive methods are by far more frequently used than transductive ones. Over 90% of the articles reviewed for this study are related to inductive methods. The main reason why inductive methods are more popular than transductive ones is probably their capability to predict new data that were not used when training the model. For this reason, inductive methods need no retraining whenever new unlabeled data are processed.
As can be seen in Fig. 5, wrappers were the least used inductive methods (18.18%), closely followed by unsupervised pre-processing (25.62%). Finally, intrinsically semi-supervised methods were the most popular among the inductive methods (56.20%). The reason why intrinsically semi-supervised models are so popular is that they are the most complex and can be adapted to different problems.
Regarding the wrapper methods, self-training and co-training were the most used (10.74% and 6.61% respectively). It was somewhat surprising that self-training was more popular than co-training, because the use of ensembles should bring better results [168]. Nonetheless, boosting methods could be applied more often to FDD problems, to obtain better results and to add to the popularity of wrapper methods, as was shown in [71].
Not all unsupervised pre-processing methods have been equally applied to FDD, as few journal publications using pre-training methods to approach this problem were found (1.65%). The scarce use of pre-training methods is probably due to the low impact that can be achieved compared to training a whole model based on SSL. The other unsupervised pre-processing methods are widely used: cluster-then-label methods were used in 10.74% of the articles, while feature extraction methods, the third most used subgroup, were slightly more popular at 13.22%. The main reason for the popularity of feature extraction methods is their ease of application, in preparation for the use of SSL.
Intrinsically semi-supervised methods are the most widely used, but not all of their different types were applied with the same frequency. On the one hand, the maximum-margin and the perturbation-based methods are the least used (4.13% and 9.09% respectively). On the other hand, generative models and manifolds are widely used. The capability to generate new unlabeled data is the main reason why generative models are so popular (19.83%). Manifold methods, which are based on the manifold assumption, are the most popular subgroup (23.14%), due to their ability to modify the data space and to classify.
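The shares quoted in this section can be cross-checked with simple arithmetic: the three group totals should add up to 100% of the inductive methods, and the subgroup shares should add up to their group total. The short script below uses only the percentages stated in the text; the dictionary layout and names are choices made for this example.

```python
# (group total, subgroup shares), all in % of inductive-method papers,
# copied from the figures quoted in the text.
groups = {
    "wrappers": (18.18, [10.74, 6.61]),                      # self-training, co-training
    "unsupervised pre-processing": (25.62, [1.65, 10.74, 13.22]),
    "intrinsically semi-supervised": (56.20, [4.13, 9.09, 19.83, 23.14]),
}

# Share of each group not accounted for by the subgroups quoted in the text.
residuals = {name: round(total - sum(parts), 2)
             for name, (total, parts) in groups.items()}
overall = round(sum(total for total, _ in groups.values()), 2)

print(overall)
for name, value in residuals.items():
    print(f"{name}: {value:.2f} percentage points unaccounted for")
```

The group totals add up to 100%, and the 0.01 differences in two of the groups are consistent with rounding. The remaining 0.83 points within wrappers are not attributed to a subgroup in this passage; boosting, the other wrapper type mentioned above, is a plausible candidate, but that is an inference rather than a figure reported here.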
The results on the popularity of different SSL types are not random; instead, they are aligned with the advantages and disadvantages associated with each type.
A significant distinction arises between inductive and transductive methods. As mentioned earlier, over 90% of the articles in this review are based on inductive methods. The primary reason for such a strong imbalance is the key advantage of inductive methods over transductive methods: the ability to predict new data that was not seen during the training phase.
Focusing more deeply on the popularity of the different types of inductive methods, a more comprehensive analysis can be obtained. To do so, a summary of the advantages and disadvantages of these types is presented in Table 1. As mentioned earlier, wrapper methods are the least used inductive methods, primarily because most of them are prone to add noise: they may misclassify pseudo-labels, add them to the labeled dataset, and be unable to correct them afterwards. On the other hand, the extent to which unsupervised pre-processing methods are used is surprising, given their lower potential impact when utilizing unlabeled data. Finally, the popularity of intrinsically semi-supervised methods is to be expected, considering their ease of development from supervised methods and the ability of unlabeled data to directly influence the objective function or the optimization process.
Taking into account the aforementioned popularity and considering the possible reasons (including the different pros and cons of the SSL methods), certain gaps may be identified in the application of SSL to FDD. One potential gap lies in the gain that can be obtained by applying boosting methods to wrappers, which can improve their performance through the use of ensembles and the weight distribution of these models. Another potential gap is the limited application of maximum margin methods within intrinsically semi-supervised methods, particularly the underuse of semi-supervised SVM. A broader use of semi-supervised SVM could be beneficial, given its good performance in applications across various fields [169].
Unsupervised preprocessing methods are not widely used, primarily due to their limited potential impact when utilizing unlabeled data. However, one approach to applying deep learning methods from an SSL perspective is to employ a pre-training method that enables the unlabeled data to influence the pre-configuration of the deep network.
One additional gap, irrespective of the method type, pertains to the application of methods capable of predicting new fault types that have not been observed in labeled data.

Industrial issues
Fig. 5. Distribution of inductive SSL FDD publications found in the review, grouped by type of method as per the taxonomy in [7].

Table 1
Pros and cons of inductive methods.

Method | Pros | Cons
Wrapper methods | Easy to implement; configurable, easy to change base estimator/s; can be used with almost any supervised method. | Prone to add noise; dependent on supervised methods.
Uns. preprocessing | Can be used with almost any supervised method. | Less impact of unlabeled data.
Intrinsically semi-supervised | Unlabeled data is used on the lower level (objective function or optimization procedure); usually easy to develop from its supervised version. | More complex models, harder to train; most of them require large amounts of data.

The industrial applications of FDD collected in this review are varied, but most of them are focused on two main topics: chemical and mechanical processes. The mechanical processes are more common and form part of different fields, from energy generation to machining. They all share a common basis: mechanical chain-based problems that are conventionally evaluated in terms of vibration analysis (e.g., bearing and gearbox wear and tear). More than 80% of the research has been directed at vibration-based problems, due to the high capability of vibrations to collect useful information on machine performance levels. In these cases, after an ordinary vibration analysis, it is very common to apply Fourier transform or discrete wavelet transform techniques to analyze vibration information. However, some new approaches that convert vibration behavior from numbers into images use a different solution, such as the Continuous Wavelet Transform. As both approaches are based on the extraction of information from vibrations, the richest source of information on mechanical chain dysfunctions, both should be considered optimal.
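The frequency-domain analysis of vibration signals mentioned above can be illustrated with a small discrete Fourier transform. The sketch below uses a synthetic signal, an assumed sampling rate, and a hypothetical fault frequency; real bearing data and characteristic frequencies would differ, and a library FFT would normally replace the direct (slow) transform written here for self-containment.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude spectrum (first half) computed with a direct DFT."""
    n = len(signal)
    mags = []
    for k in range(n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(coeff) / n)
    return mags


fs = 256       # sampling rate in Hz (assumed)
n = 256        # one second of samples, so bin k corresponds to k Hz
f_shaft = 5    # hypothetical shaft rotation frequency in Hz
f_fault = 37   # hypothetical fault-related frequency in Hz

signal = [math.sin(2 * math.pi * f_shaft * t / fs)
          + 0.5 * math.sin(2 * math.pi * f_fault * t / fs)
          for t in range(n)]

mags = dft_magnitudes(signal)
# The two largest spectral peaks recover both components.
peaks = sorted(sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2])
print(peaks)
```

Spectral magnitudes of this kind, or statistics derived from them, are the sort of features that can then be passed to an SSL method.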
As previously noted in Section 2.1, FDD data are difficult to obtain. Most industrial FDD datasets are imbalanced, as data on the normal/healthy state are easier to obtain than data on failures. Some datasets do not even cover all possible failures, but only some of the most common ones. Other datasets only contain data on the early degradation states associated with catastrophic failures (e.g., data on high levels of imbalance in a wind-turbine gearbox are extremely rare). As the industry is reluctant to produce datasets with significant numbers of failures, researchers are obliged to generate those data in laboratories. Considering this industrial fact, the datasets tested in each piece of research can be classified into three types: open access benchmark, laboratory, and industrial datasets.
The first type, open access benchmark datasets, are publicly available datasets that refer to different industrial processes and include different types of failures. They are useful not only for the validation of new SSL techniques, but also for comparing their accuracy with that of state-of-the-art SSL methods. Some of the most common benchmark datasets are:
• CWRU
• TEP (Tennessee Eastman Process): a chemical industry dataset on a process that involves 5 main elements, namely a reactor, a vapor-liquid separator, a recycle compressor, a product stripper, and a product condenser. The dataset is composed of 52 features, and there are 21 different fault types, which are unevenly represented in the dataset, 5 of which are unknown [33,172].
• NEU (Northeastern University): a dataset of steel surface defect images. There are 1 800 images with 6 different fault types (crazing, patches, rolled-in scale, inclusion, pitted surface, and scratches). Despite the presence of multiple fault types, it is notable that the classes are balanced.
It should be noted that the first three datasets refer to the degradation of bearings in mechanical chains, analyzed on the basis of vibration data. This kind of industrial problem is usually decoupled (the degradation of one bearing will not generate degradation in other bearings of the mechanical chain) and each bearing will have a natural frequency that is easily isolated within the FFT spectrum. More complex problems in mechanical chains, like axis misalignment, are not considered in these datasets. Only the TEP dataset includes failures in a continuous non-rotary process, such as chemical reactions, with interconnected failures that reveal a more complex relation between them, although only a limited set of possible failures is taken into account. Finally, the NEU dataset, which is based on images, represents a very different approach to FDD and cannot be directly compared with the other benchmark datasets. These types of datasets are mostly close to balanced between the different classes, as was noted in their presentation.
The second type, laboratory datasets, are designed to obtain data, mainly from a testbed, covering as many machine states as possible. This dataset type is very useful in FDD problems where some kinds of fault are very rare because, under laboratory conditions, all kinds of faults can be induced and different levels of severity can be characterized and tested (e.g., imbalance in a rotor can be tested at different well-characterized levels). These datasets have three main drawbacks that limit the direct use of the results under real industrial conditions. First, they are extracted from very well instrumented, even over-sensorized, testbeds, far removed from industrial conditions where sensorization is much more limited (e.g., an accelerometer placed close to the cutting tool was used to evaluate cutting-tool wear in a machining process, and a Kistler table was included in the experiment to measure cutting forces; under industrial machining conditions, coolant fluids and chip production make it necessary to distance the accelerometer from the tool tip, while a Kistler table is rarely used, due to its cost). Second, the testbeds are scaled reproductions of real machines (e.g., a lab testbed for a windmill mechanical chain is usually around 1/10 of an industrial one) and their performance might differ from that of an industrial one. Finally, the cost of setting up these experiments complicates the repeatability of the research. Besides, it should be noted that these types of datasets are mainly balanced because, under laboratory conditions, failure conditions can be tested more easily than under real industrial conditions. This fact creates an additional difference between industrial datasets and laboratory datasets.
The third type, real industrial datasets, are the least common, because they are the most difficult to obtain and industries are very reluctant to let them be used for research. In addition, this type of dataset will often not have all possible failures represented, or will have a very limited number of failure levels, which complicates the creation of a model capable of predicting states across the entire data space (e.g., mechanical models usually use experimental data from no failure and light failure levels to fix thresholds, but ML models also require data from more severe failure states to achieve highly accurate models). Besides, these datasets are usually strongly unbalanced because, under industrial conditions, testing failure conditions is difficult and very expensive. In most FDD research, it is usual to employ at least one benchmark dataset and, in a second stage, one laboratory or real industrial dataset, to overcome the limitations of the different types of datasets. If no extensive public industrial dataset is available for research within a specific field, this solution might be considered best practice.
When considering the level of labeled instances, the first two types usually have all their instances labeled, while the industrial ones are the only ones that have many unlabeled instances. Most researchers therefore only use some of the labels when using open access benchmark or laboratory datasets under SSL conditions. These datasets are not actually semi-supervised, because all the instances are labeled (labels are removed from some instances, usually chosen at random). A rare counterexample is found in [78], which used a completely unlabeled set of data collected over several years relating to the operation of an electrical network.
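The label-removal procedure described above can be sketched as follows. The example keeps a given fraction of the labels and hides the rest; sampling per class (so that every class keeps at least one label) and the fixed random seed are assumptions of this sketch, since the reviewed papers are only said to choose instances at random. The function name `mask_labels` and the toy class proportions are hypothetical.

```python
import random
from collections import defaultdict


def mask_labels(labels, labeled_fraction, seed=0):
    """Keep about `labeled_fraction` of the labels of each class
    (at least one per class) and replace the rest with None."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    keep = set()
    for indices in by_class.values():
        n_keep = max(1, round(labeled_fraction * len(indices)))
        keep.update(rng.sample(indices, n_keep))
    return [label if i in keep else None for i, label in enumerate(labels)]


# Fully labeled, imbalanced toy dataset: 90 normal and 10 faulty instances.
y_full = ["normal"] * 90 + ["fault"] * 10
y_ssl = mask_labels(y_full, labeled_fraction=0.1)

n_labeled = sum(label is not None for label in y_ssl)
print(n_labeled, y_ssl.count("fault"))
```

Because the hidden labels are still known, the predictions made for the masked instances can later be checked against them.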
Fig. 6 shows the papers under review classified by the dataset used for either training or testing purposes. As can be seen, the most commonly used dataset for testing is CWRU. However, articles using other datasets are the most numerous, with 113 appearances. Among these articles, some refer to publicly available datasets, while others use their own dataset for training and testing. Table A.2 in the Appendix shows the references classified by the dataset in use.
When analyzing the use of SSL methods in terms of the industrial problem to be solved, no clear relationship can be seen between any particular method and a specific problem, a fact that might be expected, due to the novelty of SSL techniques and their application to FDD problems. The popularity of the methods was almost identical for each industrial issue, which shows that the decision to apply SSL methods to FDD problems is based on the most commonly used FDD methods, and not on specific methods for each industrial problem. Moreover, no baseline method has been found with which they may all be compared. These two conclusions demonstrate that the use of SSL techniques for FDD problems is still far from being a standard solution and will require extensive research in coming years to establish reliable reference solutions for each industrial process against which to test the new proposals. At the same time, however, that fact highlights the interest and possibilities of future research into the application of SSL to FDD.
Finally, two main industrial impediments have been found in this research. Firstly, neither the SSL methods nor the use of unlabeled data are clearly explained in several research papers, even though such an explanation is essential for an understanding of SSL methods. Secondly, several pieces of research use high labeled percentages, which hardly correspond to the real state of an industrial problem. SSL is meant to be applied in situations where labels are scarce, and industrial FDD problems are one of these situations, so testing with a high labeled percentage makes little sense. For these reasons, neither practice should be followed in future research.

Conclusions, best practices, and future trends
In this section, the analysis of SSL for FDD is concluded and a set of best practices are proposed, which have been established as successful solutions within this field over the last decade.The future within this field is promising and some research trends that are likely to gain attention over coming years are also described.

Best practices
The first attempt to apply a new Machine Learning technique is often far removed from the optimal application method. Therefore, during this first decade of using SSL for FDD, some of the proposed strategies have shown inconsistencies. The most common ones are: (1) presenting as SSL some ML techniques and strategies that cannot be considered as such (e.g., mixing supervised and unsupervised methods and referring to them as semi-supervised); (2) the use of too high a percentage of labeled instances, which makes it difficult to justify the use of any SSL method (for instance, in [88] 25% of the observations were unlabeled, and in [86] only 10% of the observations were unlabeled, whereas in other proposals fewer than 50% of observations were unlabeled); and (3) the presentation of new though somewhat vague SSL methods, which reduces the chance of experimental repeatability.
The following set of good practices (see Fig. 7) is proposed to overcome these limitations in future works:
• Dataset related: Use of common datasets. The use of publicly available standard or reference datasets, such as those discussed in Section 2.1 (CWRU, TEP, NEU, Paderborn. . .), makes it possible to compare the results obtained with different models and to replicate the experiments.
As different approaches and methods are usually proposed, and as no one algorithm is better than all the others, the only way to determine in which specific cases or types of problem one proposal may be better or worse than another is to have a reference for their comparison. A natural benchmark would be the use of common datasets (problems). This can be especially important in industrial environments, where the presence of noise in the data, insufficient variables available in the observations, and the need to perform certain transformations on the data can ruin the learning capabilities of some algorithms.
Use of a reasonable percentage of labeled data. Ideally, using a small number of labeled instances along with unlabeled instances (presumably easier to obtain and less expensive) is one way in which SSL can, to some extent, alleviate the problems associated with labeling the instances needed to train the models. However, some of the proposals under review use high percentages of labeled instances for training, e.g., in [81], the authors used 50% labeled instances for training, making it difficult to justify the use of SSL.

Use of a realistic
Although the size of the datasets for training can greatly affect the accuracy of the model obtained, it should be considered whether the use of an SSL algorithm makes sense or not. With small but almost completely labeled datasets, or with a number of labeled instances large enough to train a supervised learning algorithm (even if it represents a small percentage of a large dataset), the main reason for using the SSL approach, increased accuracy while reducing the number of labeled instances, may no longer hold.
Use of labeled datasets as a starting point. A fully labeled dataset can usefully serve as a starting point. Although the process of selecting observations for label removal may introduce random behavior or be biased, it will allow the results predicted by the model to be checked against known results. This aspect is especially important in industrial processes, where similar systems may behave very differently and where identifying the presence of a fault, the specific way in which a system may fail, or the degree of failure might even be impossible for people with no industrial background.
• Comparative methods: Provide comparisons between the proposed method and supervised methods.
For reference purposes, the results should be obtained from a supervised version of the proposed method or from one or more typical supervised methods trained with only the corresponding percentages of labeled data. In this way, the improvement of incorporating unlabeled data over the use of labeled data alone can be tested.
Using the same amount of labeled data to compare. Using the fully labeled dataset to train a supervised model and comparing its results with a semi-supervised model trained with a low percentage of labeled data is not a fair comparison, and valid conclusions cannot easily be reached from it.

Quality testing of the methods by using cross-validation and repetitions.
The use of k-fold cross-validation and repeated experiments are common ML practices that yield better estimations of the results that the model can achieve in a real environment. Unfortunately, these practices are not usually found in industry publications.
Assess the influence of the percentage of labeled data on the SSL method.
The number of labeled items can greatly affect the performance of learning methods, which often happens with industrial process datasets. It is therefore advisable to vary the number/percentage of labeled data, in order to evaluate how the learning method behaves.
• Explanation of the method: Publish the source code. The availability of the source code is paramount in any method that is either proposed or compared. It facilitates any use of the methods and comparisons between proposals. Publishing the source code is a common practice in other research fields [173].
Explain the pseudo-code. Pseudo-code is a good tool for explaining new methods, but it must be accompanied by a proper explanation.
Use a taxonomy to categorize the proposed SSL method. Classifying the proposed methods according to a taxonomy facilitates the search for specific methods, helps readers to categorize the proposals, and facilitates the classification of the methods once reviewed. Some proposed methods may be difficult to classify and help from the authors might be appreciated. The use of a taxonomy (such as the one proposed in [7]) may assist authors with the selection of references and researchers with the selection of relevant articles.
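Several of the comparative practices listed above (a supervised baseline trained on the same labeled instances, repeated k-fold cross-validation, and a sweep over the percentage of labeled data) can be combined in one evaluation loop. The sketch below is only an outline of such a protocol under strong simplifications: the data are synthetic, and a nearest-centroid classifier with a naive self-training wrapper stands in for whatever supervised and SSL methods are actually being compared. All function names are hypothetical.

```python
import math
import random
from statistics import mean


def centroids(X, y):
    """Class centroids from labeled instances only (None = unlabeled)."""
    groups = {}
    for x, label in zip(X, y):
        if label is not None:
            groups.setdefault(label, []).append(x)
    return {c: [mean(col) for col in zip(*pts)] for c, pts in groups.items()}


def predict(cents, x):
    return min(cents, key=lambda c: math.dist(cents[c], x))


def fit_self_training(X, y, rounds=5):
    """Naive self-training: pseudo-label all unlabeled points, refit, repeat."""
    cents = centroids(X, y)
    for _ in range(rounds):
        pseudo = [l if l is not None else predict(cents, x) for x, l in zip(X, y)]
        cents = centroids(X, pseudo)
    return cents


def accuracy(cents, X, y):
    return mean(1.0 if predict(cents, x) == t else 0.0 for x, t in zip(X, y))


def evaluate(X, y, label_rates, k=5, repeats=3, seed=0):
    rng = random.Random(seed)
    results = {}
    for rate in label_rates:
        sup, ssl = [], []
        for _ in range(repeats):                 # repeated experiments
            idx = list(range(len(X)))
            rng.shuffle(idx)
            folds = [idx[f::k] for f in range(k)]
            for f in range(k):                   # k-fold cross-validation
                train = [i for g in range(k) if g != f for i in folds[g]]
                by_class = {}
                for i in train:
                    by_class.setdefault(y[i], []).append(i)
                labeled = set()
                for members in by_class.values():
                    n_keep = max(1, round(rate * len(members)))
                    labeled.update(rng.sample(members, n_keep))
                X_tr = [X[i] for i in train]
                y_tr = [y[i] if i in labeled else None for i in train]
                X_te = [X[i] for i in folds[f]]
                y_te = [y[i] for i in folds[f]]
                # Both models see exactly the same labeled instances.
                sup.append(accuracy(centroids(X_tr, y_tr), X_te, y_te))
                ssl.append(accuracy(fit_self_training(X_tr, y_tr), X_te, y_te))
        results[rate] = (mean(sup), mean(ssl))
    return results


data_rng = random.Random(1)
X = ([(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(60)]
     + [(data_rng.gauss(4, 1), data_rng.gauss(4, 1)) for _ in range(60)])
y = ["normal"] * 60 + ["fault"] * 60
results = evaluate(X, y, label_rates=[0.05, 0.2, 0.5])
for rate, (acc_sup, acc_ssl) in results.items():
    print(f"labeled {rate:.0%}: supervised {acc_sup:.3f}, self-training {acc_ssl:.3f}")
```

The point of the design is that the supervised baseline and the semi-supervised method share the same folds and the same labeled subset, so any difference between the two columns can be attributed to the use of the unlabeled instances.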

Future trends
On the one hand, as has been previously discussed in Section 6, the most popular type of methods are generative, manifolds, and feature extraction.Methods of this type will continue to be widely used because of their proven effectiveness in the field.Nonetheless, other methods that are rarely used nowadays, might become more popular in the future.As with supervised learning [174], SSL can take advantage of the improvement offered by boosting to enhance the performance of co-training methods.An interesting boosting technique that could be applied to SSL is statistical boosting [175] where boosting is no longer a black-box method.
A potential future trend could be the extensive use of semisupervised SVM, which falls under the category of maximum margin methods.This would be a plausible development considering the relatively low popularity of maximum margin methods compared to other intrinsically semi-supervised methods, alongside the growing popularity of SVM.
Moreover, some other SSL methods that have not been applied to the field, such as pre-training, represent a research pathway that, coinciding with the growing popularity of deep learning methods, could yield promising results. A recent and up-to-date review of Deep Semi-Supervised Learning can be found in [176].
Recently, the use of meta-learning in SSL for FDD has been presented in some works [119,137]. Nevertheless, its use has not been properly explored, and FDD can benefit from meta-learning via deep learning [177], among other approaches.
Besides, most of the reviewed articles are concerned with FDD in bearings and gearboxes. The conventional approach to such problems is vibration measurement, frequency transformation, and statistical analysis, although complex failures and industrial conditions with unlabeled data impose a natural limit on this solution that new SSL-based approaches can overcome. As previously outlined, bearing analysis using the frequency spectrum might be the easiest FDD task in mechanical chains, where coupled failures (e.g., axis misalignment) might lead to failure states that are more complex to diagnose. However, SSL might be a suitable solution, due to the lack of labeled instances for those complex failures. Besides, extending the process information to other signal sources that are easy to measure under industrial conditions (e.g., electrical or power signals), rather than vibration data, which are usually more complex to measure in industry, could open the way to a more generalized use of machine learning techniques and SSL in Industry 4.0. As rotatory-based processes seem to be the most interesting and complex in industry, solutions to rotatory-based problems where failures are more difficult to detect also have a promising future (e.g., surface defect detection or breakage of machining cutters). Any industrial process working under conditions that are far from stable (e.g., windmill electrical generation under real, variable wind conditions and machining processes such as milling where the cutting paths cannot assure a stable material removal rate) is a promising field of application for SSL techniques. Those characteristics converge in other industrial tasks where SSL has recently been tested for the first time, such as failures in air separation units [28], automotive assembly [80], stirred tank heater processes [81], solar photovoltaic arrays [45], laser powder-bed fusion [145], chemical batch processes [143], and a wireless sensor network [51]. This shows that other industrial processes with highly variable working conditions, complex behaviors, and industrial limitations on the labeling process can benefit from the use of SSL for FDD. Furthermore, the consideration of data imbalance combined with unlabeled data will be a promising research line for FDD, due to the high industrial demand for solutions under these conditions, especially if costly and sensitive sensors, such as accelerometers, can be avoided and more reliable and easy-to-measure signals can be used in their place.
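The conventional bearing pipeline mentioned above (vibration measurement, frequency transformation, and statistical analysis) can be sketched as follows. This is only an illustrative sketch under stated assumptions: a synthetic sine wave stands in for a measured vibration record, the chosen statistics (RMS, kurtosis, crest factor) are common time-domain indicators rather than the feature set of any reviewed article, and the function names are hypothetical. A naive discrete Fourier transform is used so that the example needs no external library; the signal is assumed to be non-constant.

```python
import cmath
import math


def sine(freq_hz, fs_hz, n):
    """Synthetic stand-in for a vibration record: a unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * i / fs_hz) for i in range(n)]


def time_features(signal):
    """Statistical analysis step: RMS, kurtosis, and crest factor of the signal."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    var = sum(c * c for c in centered) / n
    rms = math.sqrt(sum(s * s for s in signal) / n)
    kurtosis = (sum(c ** 4 for c in centered) / n) / (var ** 2)
    crest_factor = max(abs(s) for s in signal) / rms
    return {"rms": rms, "kurtosis": kurtosis, "crest_factor": crest_factor}


def dominant_frequency(signal, fs_hz):
    """Frequency transformation step: frequency (Hz) of the largest non-DC DFT bin.

    Uses a naive O(n^2) DFT, which is adequate for short illustrative records.
    """
    n = len(signal)
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(signal))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * fs_hz / n


if __name__ == "__main__":
    record = sine(freq_hz=50, fs_hz=1000, n=200)
    print(time_features(record))
    print(dominant_frequency(record, fs_hz=1000))
```

Feature vectors of this kind are a typical input to both supervised and semi-supervised classifiers; when only a few records carry fault labels, the unlabeled records can still be transformed in the same way and passed to an SSL method.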
Finally, a problem that has frequently arisen in this research is the difficulty of establishing a category for some methods. This suggests that a further revision of the van Engelen & Hoos [7] taxonomy may be necessary for further development of the topic in this field.

Fig. 3. Distribution of papers, according to their aim and scope by journal of publication: industry, artificial intelligence, and a mixture of both.

Fig. 4. Evolution of the number of publications of SSL applied to FDD over the years.

Fig. 6. Frequency of use of the different datasets in the articles under analysis. The diagram depicts the most frequently used datasets, in addition to the other databases referenced in 113 articles considering the nature of the industrial problem.

Fig. 7. Proposed set of best practices for Semi-Supervised Learning (SSL) in industrial scenarios.
• CWRU (Case Western Reserve University): a bearing defect dataset whose features are vibration signals recorded in different study cases (different motor loads and fault levels). There are 4 classes in total with balanced numbers of instances: 3 faulty (ball, inner race, and outer race) and the normal/healthy state [170].
• IMS 10 (Intelligent Maintenance Systems): a bearing database available at the University of Cincinnati, consisting of 3 different datasets. As in CWRU, these datasets focus on vibration data and have 4 classes with the same 3 fault states as CWRU, but in this case the classes are more unbalanced than in the CWRU dataset [171].
• Paderborn University Dataset 11: another bearing dataset with vibration features. There are 32 different classes, of which 6 are healthy states, and they are roughly balanced. The damaged bearing classes can be grouped into inner faults and outer faults.
• TEP

Combine open access benchmark datasets with industrial/laboratory datasets to validate new SSL techniques.