A Review of Statistical-Based Fault Detection and Diagnosis with Probabilistic Models

Abstract: As industrial processes grow increasingly complex, fault identification becomes challenging, and even minor errors can significantly impact both productivity and system safety. To manage this challenge, fault detection and diagnosis (FDD) has emerged as a crucial strategy for maintaining system reliability and safety through condition monitoring and abnormality recovery. Statistical-based FDD methods that rely on large-scale process data and their features have been developed for detecting faults. This paper overviews recent investigations and developments in statistical-based FDD methods, focusing on probabilistic models. The theoretical background of these models is presented, including Bayesian learning and maximum likelihood estimation. We then discuss various techniques and methodologies, e.g., probabilistic principal component analysis (PPCA), probabilistic partial least squares (PPLS), probabilistic independent component analysis (PICA), probabilistic canonical correlation analysis (PCCA), and probabilistic Fisher discriminant analysis (PFDA). Several test statistics are analyzed for evaluating the discussed methods. In industrial processes, these methods require complex matrix operations and incur a considerable computational load. Finally, we discuss the current challenges and future trends in FDD.


Introduction
Modern industry has brought about more complex and high-dimensional industrial processes. There is less tolerance for potential safety hazards, which lead to performance degradation and productivity losses. FDD is a significant task for ensuring product quality and process reliability in modern industrial systems. Traditional FDD methods are based on experience and have met challenges with the expansion of plant scale and the large number of process variables. Methods based on statistical analysis have become a trend in industrial applications. Recently, probabilistic models based on statistical methods have broadened industrial applications to cases of high dimensionality, non-Gaussian distributions, nonlinear relationships, and time-varying variables. This article aims to overview statistical FDD methods, especially under the probabilistic framework.

Background
Compared to passive fault-tolerant control, which treats a fault as a system perturbation, FDD is an active strategy for detecting and identifying potential abnormalities and faults, providing early warnings, and recommending corrective actions to prevent failures. Compared to prognostics, which deals with fault prediction before a fault occurs, diagnostics is a posterior event analysis and is required after a fault occurs. In harsh working environments, such as extreme temperatures, high pressure, and underwater, sensors are prone to faults; since the sensor is an essential component of data acquisition systems, sensor faults, including incipient and abrupt failures, affect accuracy, stability, and reliability. FDD has become essential for industrial applications, especially for engineering systems such as mechanical engineering [1][2][3][4], electric vehicle dynamics [5,6], power electronic systems [7][8][9][10], electric machines [11][12][13][14], and wind energy conversion systems [15][16][17][18]. The task of FDD is to spot process abnormalities promptly and identify their early causes [19]. The elements of a general structure for a fault diagnosis system and control system are shown in Figure 1. It demonstrates the different components in a control loop; failures can exist in actuators, dynamic plants, sensors, and feedback controllers.

Evolution of Fault Detection and Diagnosis
Generally, approaches to detecting and diagnosing faults can be divided into three categories, as shown in Figure 2.

Model-Based Methods
Model-based methods include state estimation, parameter estimation, and parity space. They utilize physical and mathematical knowledge and possess explainability, making decisions in a transparent way rather than as a black box [20][21][22]. However, noisy operating environments hinder physics-based modeling and degrade the accuracy of complex dynamic models [23,24].

Knowledge-Based Methods
Knowledge-based methods include symptom-based and qualitative methods. They are usually implemented by experts, and the fault diagnosis relies on accumulated prior information and logical inference [25]. These methods are efficient within the scope of existing knowledge but struggle to tackle unexpected failures.

Data-Driven Methods
Data-driven methods include statistical-based and transform-based methods. Compared to model-based and knowledge-based methods, data-driven approaches are efficient when confronted with high-dimensional data. They require a sufficient quantity of data and enhance accuracy by extracting information or features from large-scale datasets [26,27]. Yet the lack of an accurate mathematical model can make them inadequate for detecting incipient faults.
The Internet of Things era revolutionizes industrial processes by collecting large amounts of information via a network of terminals, leading to a data explosion and rocketing complexity of model construction. Troubleshooting highly complex systems involves multiple processes and multiple anomalies, making conventional physics-based deterministic modeling more challenging. Moreover, implementing knowledge-based FDD methods relies heavily on expertise or prior information, which is time-consuming and labor-intensive, especially when dealing with a high-dimensional process. Data-driven approaches are targeted at addressing large-scale data and extracting features from them.
The statistical approach is a branch of data-driven schemes; it extracts process information from measurements or observations. Its significant advantage is that it can tackle many highly correlated variables without complex mathematical forms or costly design efforts. Statistical methods [28] have gained popularity in practical applications for their ability to directly analyze input and output data, particularly PCA [29][30][31][32][33] and PLS [34][35][36][37]. PCA, PLS, and independent component analysis (ICA) are traditional statistical analysis methods based on deterministic models. The underlying theoretical foundation of traditional multivariate analysis is linear algebra.
In practice, industrial process data are polluted by missing data and outliers, which can significantly influence the accuracy of features and control thresholds. The probabilistic extensions of traditional statistical methods employ distributions to describe states, enhancing their ability to process sampling data with disturbances, outliers, and missing values. In addition, the probabilistic form of statistical analysis can handle nonlinear data and thus be applied in industrial processes. Recently, probabilistic counterparts of PCA [38] and factor analysis (FA) [39] have been generated for fault diagnosis. Furthermore, their mixture forms have been generalized to multiple operation modes [40].
The study of statistical-based fault diagnosis methods draws remarkable research attention. Due to their similarity in collecting, processing, and extracting information from data, these strategies are viewed as statistical-based methods. In this context, these schemes are mainly characterized as follows: (1) without complex model construction, a statistical-based FDD design can extract information and make decisions directly on the sampling data; (2) these strategies are designed to address FDD in static or dynamic systems in a stable state, with flexible application of statistical tests and their mixed indices.

Motivation and Contribution
Statistical-based fault diagnosis has attracted attention in industrial applications and the academic community. Sensor technology gives rise to a data explosion, and data quality significantly impacts process modeling and thus the performance of fault diagnosis. Probabilistic extensions of conventional statistical methods have sprung up due to their robustness and advantages in treating outliers, disturbances, and missing values. However, there is no comprehensive review of statistical FDD methods under a probabilistic framework. Therefore, statistical methods with probabilistic models are an unavoidable element that needs to be addressed in industrial applications and the academic community. Different from other reviews on FDD [41][42][43], this review focuses on detailed explanations so that readers can understand the principle of each method and save time searching through references. The purpose of this review is to provide the theoretical background and recent application instances of probabilistic statistical fault diagnosis.

Organization of This Paper
The remainder of this paper is organized as follows. The theoretical background of the probabilistic model is presented in Section 2. Section 3 gives a brief overview of the probabilistic extensions of statistical methods and their practical applications. The challenges and perspectives are demonstrated in Section 4. Conclusions are finally drawn in Section 5.

Theoretical Background
Maximum likelihood estimation (MLE) and Bayesian theory are employed when the probabilistic model is introduced into statistical methods. This section briefly introduces the principles of MLE and Bayesian inference.

Maximum Likelihood Estimation
In statistics, MLE estimates the parameters of an assumed probability distribution from observed measurements by maximizing the likelihood [44]. MLE seeks the probability density function (PDF) that is most likely to have produced a given data sample. The data y = (y_1, y_2, ..., y_m) is a random sample from an unknown population. In practice, the model usually involves abundant parameters, and the likelihood function is often nonlinear, making it difficult to obtain an analytic solution. For example, a nonlinear model was established to estimate the remaining useful life of a system, where the unknown parameters are estimated with the help of MLE [45]. The parameter v = (v_1, ..., v_k) is a vector defined on a multi-dimensional parameter space, and p(y|v) denotes the probability of y given v. The likelihood function is defined by [46]

L(v|y) = p(y|v). (1)

The MLE estimate is obtained by maximizing the log-likelihood function. Assuming that the log-likelihood ln L(v|y) is differentiable, if v_MLE exists, it must satisfy Equation (2):

∂ ln L(v|y)/∂v |_{v = v_MLE} = 0. (2)

Note that L(v|y) and p(y|v) are defined on different axes: p(y|v) is a function of the data given the parameters, defined on the data scale, whereas L(v|y) is defined on the parameter scale.
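As a minimal sketch of Equations (1) and (2), the code below fits a Gaussian model by numerically maximizing the log-likelihood and checks the result against the closed-form sample mean; the model choice, data, and variable names (mu, sigma) are illustrative assumptions, not taken from the cited works.

```python
# A minimal MLE sketch for an assumed Gaussian model (illustrative only).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=500)   # observed sample

def neg_log_likelihood(v):
    mu, log_sigma = v                          # optimize log(sigma) to keep sigma > 0
    return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_mle, sigma_mle = res.x[0], np.exp(res.x[1])
# For the Gaussian, the MLE of the mean coincides with the sample mean:
assert np.isclose(mu_mle, y.mean(), atol=1e-3)
print(mu_mle, sigma_mle)
```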

Bayesian Learning
Bayesian learning provides a rigorous framework for complex nonlinear systems whose internal state variables are inaccessible to direct measurement. Consider a general discrete-time state transition equation and measurement function:

x_k = g(x_{k-1}) + e_{k-1},
y_k = h(x_k) + v_k,

where x_k ∈ R^n is the state at time step k, e_k ∈ R^n is the process noise, and g : R^n → R^n denotes the transition function; y_k ∈ R^p is the measurement, v_k ∈ R^p is the measurement noise, and h : R^n → R^p is the measurement function. e_k and v_k are independent. Bayesian learning recursively estimates the PDF of x_k given the measurements y_{1:k}. The initial density is determined beforehand, and p(x_k|x_{k-1}) denotes the transition probability density. The inference of the state x_k relies on the marginal density p(x_k|y_{1:k}). The predictive density of x_k at step k is estimated by [47]

p(x_k|y_{1:k-1}) = ∫ p(x_k|x_{k-1}) p(x_{k-1}|y_{1:k-1}) dx_{k-1}. (5)

Then, the marginal filtering density is computed as

p(x_k|y_{1:k}) = p(y_k|x_k) p(x_k|y_{1:k-1}) / p(y_k|y_{1:k-1}),

where p(y_k|y_{1:k-1}) is the normalizing parameter.
A Bayesian network is a probabilistic graphical model that illustrates the relationships between variables, and the joint probability distribution for a Bayesian network with nodes a = {a_1, ..., a_n} is given by

p(a_1, ..., a_n) = ∏_{i=1}^{n} p(a_i | parents(a_i)), (7)

where parents(a_i) is the parent set of node a_i.
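As a toy illustration of Equation (7), the sketch below evaluates the joint probability of a three-node chain network a1 → a2 → a3 with binary nodes; all conditional probability tables are invented for illustration.

```python
# A toy Bayesian network sketch of Equation (7); tables are illustrative.
p_a1 = {0: 0.7, 1: 0.3}
p_a2_given_a1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # p(a2 | a1)
p_a3_given_a2 = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.3, 1: 0.7}} # p(a3 | a2)

def joint(a1, a2, a3):
    """p(a1, a2, a3) = p(a1) * p(a2|a1) * p(a3|a2), per Equation (7)."""
    return p_a1[a1] * p_a2_given_a1[a1][a2] * p_a3_given_a2[a2][a3]

# The joint distribution must normalize to one over all configurations:
total = sum(joint(i, j, k) for i in (0, 1) for j in (0, 1) for k in (0, 1))
assert abs(total - 1.0) < 1e-12
```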
Research based on Bayesian learning has attracted huge attention in the field of fault diagnosis. Zhao proposed advanced Bayesian estimation algorithms to monitor faulty sensors [48], then improved the algorithms for online capability and for correlated signals in nonlinear processes, respectively [49,50]. Bayesian networks have been adopted in fault diagnosis [51][52][53][54][55][56][57][58][59][60][61]. Multivariate statistical analysis has been combined with Bayesian inference for fault detection and isolation [62]. A Bayesian maximum likelihood classifier has been validated as accurate for induction machine and stator short-circuit fault diagnosis [63]. Probabilistic Bayesian deep learning frameworks exploit risk-aware models to identify unknown faults and enhance the trustworthiness of diagnostic results [64,65].

Probabilistic Statistical-Based Approaches
This section discusses different kinds of static statistical-based approaches. Five probabilistic models applied in the field of FDD are illustrated: probabilistic PCA, probabilistic PLS, probabilistic ICA, probabilistic canonical correlation analysis (CCA), and probabilistic Fisher discriminant analysis (FDA). PCA extracts the principal components that are retained to explain the majority of the variability in the data by maximizing the variance. Compared to PCA, FDA maximizes the separation between classes while minimizing the scatter within classes. The components after PCA decomposition are orthogonal and therefore uncorrelated, but independence is not guaranteed. Compared to PCA, ICA can find the original components in the observed mixtures via a linear transformation of the data in the original feature space. PCA involves only one set of variables, while CCA extends to the interdependence between two sets of variables, measuring the correlation between them. The probabilistic approaches to traditional statistical analysis are shown in Figure 3. They differ in the variable distribution, the application scenario, and whether the dataset is labeled. PPCA and PICA share common characteristics: they can deal with non-Gaussian data and stationary processes. PFDA and PPLS are supervised methods. PCCA excels over the other methods in tackling dynamic processes. The probabilistic extensions of traditional multivariate statistical analysis retain the original characteristics and broaden the range of applications.

Probabilistic Principal Component Analysis
PCA is a technique targeted at dimensionality reduction, and it has wide applications, including data compression, image processing, data analysis, and pattern recognition [66][67][68][69]. The probabilistic model of PCA is given by

y = Wt + m + ϵ,

where y ∈ R^d is the observation, the independent unobservable variable t ∈ R^q ∼ N(0, I), and q < d. The transition matrix is W ∈ R^{d×q}, and the mean vector is m ≠ 0. The key assumption of PPCA is that the noise in this probability model is likewise Gaussian, ϵ ∼ N(0, Ψ), and the covariance Ψ = σ²I is constrained to be an isotropic diagonal matrix, so that the components of y are conditionally independent given the values of t. The conditional PDF of y and the marginal PDF of y obtained by integration are

p(y|t) = N(Wt + m, σ²I),
p(y) = N(m, A),

where A = WW^T + σ²I. The log-likelihood function is then

ln L(W, σ²) = -(N/2) [ d ln(2π) + ln|A| + tr(A^{-1}S) ], (12)

where S denotes the sample covariance matrix. Estimates for W and σ² are obtained by iteratively maximizing Equation (12) with the EM algorithm [70].
The maximum likelihood estimate of W also admits the closed form

W_ML = U_q (Λ_q - σ²I)^{1/2} R,

where the first q column vectors in U_q ∈ R^{d×q} are the q principal eigenvectors of S, with corresponding eigenvalues λ_1, ..., λ_q in the diagonal matrix Λ_q ∈ R^{q×q}, and R ∈ R^{q×q} is an arbitrary orthogonal matrix. Note that U_q and Λ_q can be obtained by performing the singular value decomposition of S.
The conditional distribution of the latent variable t given the observed y is obtained with the help of Bayesian inference:

p(t|y) = N( B^{-1}W^T(y - m), σ²B^{-1} ), (16)

where B = W^TW + σ²I. From Equation (16), a point-wise estimate can be employed once the conditional distribution is generated. Then, the high-dimensional observed data y can be condensed into a new latent representation t whose distribution is Gaussian.
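A minimal sketch of the closed-form ML fit (the Tipping-Bishop solution above, taking R = I) and the posterior mean of Equation (16) follows; the data, dimensions, and noise level are synthetic assumptions.

```python
# A minimal PPCA sketch: closed-form ML estimates and the latent posterior.
import numpy as np

rng = np.random.default_rng(1)
d, q, N = 10, 3, 2000
W_true = rng.normal(size=(d, q))
Y = rng.normal(size=(N, q)) @ W_true.T + 0.1 * rng.normal(size=(N, d))

m = Y.mean(axis=0)
S = np.cov(Y, rowvar=False)                        # sample covariance
eigval, eigvec = np.linalg.eigh(S)
idx = np.argsort(eigval)[::-1]                     # sort descending
eigval, eigvec = eigval[idx], eigvec[:, idx]

sigma2 = eigval[q:].mean()                         # ML noise variance (discarded eigenvalues)
W = eigvec[:, :q] @ np.diag(np.sqrt(eigval[:q] - sigma2))  # W_ML with R = I

# Posterior mean of the latent t for one observation y, per Equation (16):
B = W.T @ W + sigma2 * np.eye(q)
t_hat = np.linalg.solve(B, W.T @ (Y[0] - m))
print(sigma2, t_hat)
```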
The probabilistic PCA model has abundant modifications and extensions applied to fault diagnosis. Choi et al. proposed a fault detection scheme based on a maximum-likelihood PCA mixture model [40]. To address the challenge of separating several factors that together cause a failure, probabilistic PCA was created [71]. An aligned mixture probabilistic PCA was proposed by Yang for fault detection of multimode chemical processes [72]. Then, a reconstruction-based multivariate contribution analysis was applied to the PPCA mixture model for fault isolation [73].
The robust version of probabilistic PCA was modified to deal with outliers and missing data during the modeling stage [74]. In addition, a variational inference process based on the Bayesian PCA model structure provides the foundation of a defect reconstruction method [75]. Additionally, the hidden Markov model framework temporally extends the static mixture probabilistic PCA model-based classifier to a dynamic form [76,77]. A hybrid framework combining moving-window PCA and Bayesian networks was suggested to cope with barely accessible data in a fault state [78].
By thoroughly analyzing the principle and implementation of probabilistic PCA, the significant advantages can be summarized as follows.
(1) Enhanced Robustness: In practical applications, disturbances are unavoidable in a complex working environment. Probabilistic PCA addresses the problem of sampling data mixed with outliers and missing values by modeling these data with distributions, enhancing robustness. (2) Handling of Complex Data: The introduced latent variables enable probabilistic PCA to process nonlinear data, improving the performance of dimensionality reduction. (3) Probabilistic Inference: Probabilistic PCA is a dimensionality reduction method based on probability models. It provides quantitative information on uncertainty and probabilistic inferences, yielding more accuracy and effectiveness. Ultimately, the ability to interpret data is substantially intensified.

Probabilistic Partial Least Squares
The core of the probabilistic PLS model is to use part of the latent variables to explain the observed data sets. The probabilistic PLS model is formulated by [79]

x = P_s t_s + P_b t_b + m_x + ϵ_x,
y = Q t_s + m_y + ϵ_y,

where t_s denotes the latent variables shared by x and y, and t_b denotes the latent variables specific to x; P_s, P_b, and Q are the corresponding loading matrices; m_x and m_y are the means of x and y; and ϵ_x and ϵ_y denote the measurement noises of x and y, respectively.
In the probabilistic PLS model, t_s ∼ N(0, I), t_b ∼ N(0, I), ϵ_x ∼ N(0, Σ_x), and ϵ_y ∼ N(0, Σ_y). Different from the probabilistic PCA model, which assumes the error covariance matrix to be a diagonal matrix with a constant value, different noise variances are assumed for different variables, i.e., Σ_x = diag(σ²_{x,1}, ..., σ²_{x,m}) and Σ_y = diag(σ²_{y,1}, ..., σ²_{y,p}). The optimal values of the parameters are determined by the EM algorithm. The supervised PPLS model builds a regression model between two sets of variables. For further applications, the probabilistic PLS model has been modified. On this basis, the validity of the classification of an unknown item was assessed [80]. Zheng adapted the probabilistic PLS model to a semi-supervised version for the creation of soft sensors [81]. Data-driven fault identification and diagnosis techniques were proposed based on a novel locally weighted probabilistic kernel PLS [82]. Botella described an improvement to discriminant partial least squares that uses the kernel trick and Bayes' rule to implement data classification [83]. To further decompose the PPLS model, a concurrent probabilistic PLS approach was suggested, and monitoring statistics were created for assessment [84].
Compared with traditional PLS, probabilistic PLS treats independent and dependent variables as random variables and assumes that they follow Gaussian distributions. The probabilistic model is capable of dealing with disturbances, outliers, and missing data, thereby improving the stability and prediction accuracy of the model.
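As a minimal illustration of the estimation procedure, the sketch below runs EM on a simplified variant of the model in which only the shared latent block t_s is kept: dropping t_b and stacking z = [x; y] reduces the model to factor analysis with diagonal noise. All dimensions and data are synthetic assumptions, not taken from [79].

```python
# A simplified EM sketch for a PPLS-style model with only shared latents t_s.
import numpy as np

rng = np.random.default_rng(2)
mx, my, q, N = 6, 2, 2, 3000
P, Q = rng.normal(size=(mx, q)), rng.normal(size=(my, q))
T = rng.normal(size=(N, q))
Z = np.hstack([T @ P.T + 0.1 * rng.normal(size=(N, mx)),
               T @ Q.T + 0.1 * rng.normal(size=(N, my))])
Z -= Z.mean(axis=0)
d = mx + my
W = rng.normal(size=(d, q))
psi = np.ones(d)                                   # diagonal noise variances

for _ in range(200):                               # EM iterations
    # E-step: posterior moments of t_s given z.
    iPsiW = W / psi[:, None]
    M = np.linalg.inv(np.eye(q) + W.T @ iPsiW)     # posterior covariance
    Et = Z @ iPsiW @ M                             # posterior means, N x q
    Ett = N * M + Et.T @ Et                        # sum of E[t t^T]
    # M-step: update loadings and noise variances.
    W = (Z.T @ Et) @ np.linalg.inv(Ett)
    psi = np.mean(Z**2, axis=0) - np.einsum('ij,ij->i', W, (Z.T @ Et) / N)

print(psi[:mx].mean(), psi[mx:].mean())            # both approach the true 0.01
```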

Probabilistic Independent Component Analysis
ICA separates the dataset into linear combinations of statistically independent non-Gaussian sources. It is a significant application of the blind source separation method. The probabilistic ICA model is [85]

x_n = A s_n + ϵ_n, (26)

where x_n is the observation, s_n is a statistically independent non-Gaussian source, ϵ_n ∼ N(0, β^{-1}I) denotes the noise vector, and A represents a linear transformation. The likelihood is given by

p(X|θ) = ∏_n ∫ p(x_n|s_n, θ) p(s_n) ds_n.

Due to its adaptive tails, the Student's t distribution can approach the distribution of non-Gaussian sources. The sources modeled by the Student's t distribution can be described hierarchically as

p(s_{jn}|u_{jn}) = N(s_{jn}|0, u_{jn}^{-1}), p(u_{jn}) = Ga(u_{jn}|ν_j/2, ν_j/2),

where Ga(•) represents the Gamma distribution. To estimate the non-Gaussian variables, one can use the variational Bayesian EM approach. Defining F = {S, U} for the latent variables, the log-likelihood is given by

log p(X|θ) = log [ p(F, X|θ) / p(F|X, θ) ]. (29)

By introducing an auxiliary distribution q(F) as the approximation distribution, we have

log p(X|θ) = ∫ q(F) log [ p(F, X|θ) / q(F) ] dF + ∫ q(F) log [ q(F) / p(F|X, θ) ] dF ≥ F(q(F), θ),

where F(q(F), θ) denotes the variational lower bound. For variational Bayes, the latent distributions are assumed independent, q(F) ≈ q(S)q(U). The updates are then obtained by taking derivatives of the lower bound with respect to q(S), q(U), q(A), q(β), and q(ν_j). On the basis of the conjugate exponential family, q(S) takes a Gaussian form N(s_n|s̄_n, Σ^s_n) and q(U) takes a Gamma form, whose parameters follow from the stationarity conditions. The remaining parameters are derived by differentiating with respect to A and β, where x^i_n is the ith element of the observation and a_i is the ith column of A, while the degrees of freedom ν_j are induced by solving a nonlinear equation involving the expectations ⟨u_{jn}⟩ and ⟨log u_{jn}⟩.

Traditional ICA methods rely on fixed assumptions about the source distributions, which limits their ability to represent varied non-Gaussian signals and can result in inaccurate decompositions. Probabilistic ICA uses explicit probability distributions to solve this problem: it can better process non-Gaussian signals because of its ability to model different probability distributions. In addition, probabilistic ICA employs the variational Bayesian method to estimate the uncertainty of the separated variables, improving the robustness and interpretability of the model.
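For intuition about what the model in Equation (26) recovers, the sketch below separates two synthetic non-Gaussian sources with scikit-learn's FastICA, the deterministic counterpart of probabilistic ICA; the full variational Bayesian treatment sketched above is considerably more involved. The mixing matrix, sources, and noise level are illustrative assumptions.

```python
# A source-separation sketch using FastICA (deterministic counterpart of PICA).
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(3)
N = 5000
S = np.column_stack([rng.standard_t(df=3, size=N),     # heavy-tailed source
                     np.sign(rng.normal(size=N))])     # binary (non-Gaussian) source
A = np.array([[1.0, 0.5], [0.4, 1.0]])                 # mixing matrix
X = S @ A.T + 0.05 * rng.normal(size=(N, 2))           # noisy observations

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                           # recovered sources (up to scale/order)
# Cross-correlations between true and recovered sources (one entry per row near 1):
print(np.abs(np.corrcoef(S.T, S_hat.T))[0:2, 2:4].round(2))
```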

Probabilistic Canonical Correlation Analysis
Given two random vectors, canonical correlation analysis (CCA) is concerned with finding projections such that the components within one set of projections are maximally correlated with components in the other set. The probabilistic extension of CCA is given by [86]
x_1 = W_1 z + m_1 + ϵ_1,
x_2 = W_2 z + m_2 + ϵ_2,

where the variables x_1 ∈ R^{m_1}, x_2 ∈ R^{m_2}, the latent variable z ∈ R^d ∼ N(0, I), and min{m_1, m_2} ≥ d ≥ 1. The conditional distributions are

x_1|z ∼ N(W_1 z + m_1, Ψ_1), x_2|z ∼ N(W_2 z + m_2, Ψ_2).

The parameter set Θ = {W_1, W_2, m_1, m_2, Ψ_1, Ψ_2} can be determined by maximizing the likelihood, where the log-likelihood follows from the Gaussian marginal of (x_1, x_2) implied by the model. After implementing the EM algorithm, the optimal solution of the parameters is given by

Ŵ_1 = Σ̂_11 U_{1d} M_1, Ŵ_2 = Σ̂_22 U_{2d} M_2, μ̂_1 = m̂_1, μ̂_2 = m̂_2, Ψ̂_1 = Σ̂_11 - Ŵ_1 Ŵ_1^T, Ψ̂_2 = Σ̂_22 - Ŵ_2 Ŵ_2^T,

where M_1, M_2 ∈ R^{d×d} are arbitrary matrices with M_1 M_2^T = J_d; J_d is the diagonal matrix of the first d canonical correlations; U_{1d} and U_{2d} are the first d canonical directions; m̂ denotes the sample mean; and Σ̂ = [Σ̂_11 Σ̂_12; Σ̂_21 Σ̂_22] denotes the sample covariance matrix obtained from the data x^j_1, x^j_2. According to Bayesian inference, the posterior expectations and variances of z given x_1 and x_2 are

E(z|x_i) = M_i^T U_{id}^T Σ̂_ii^{-1} (x_i - m̂_i), var(z|x_i) = I - M_i M_i^T, i = 1, 2.

Unlike traditional CCA, which is sensitive to noise and missing data, probabilistic CCA employs probabilistic models to describe the data-generation process, naturally dealing with noise and missing data and thus securing robustness.
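Because the ML solution above is built from the classical canonical directions U_{1d}, U_{2d} and the correlations in J_d, one quick way to see the connection is to compute classical CCA on data generated from the probabilistic model; the sketch below uses scikit-learn's CCA as the deterministic counterpart, with illustrative dimensions.

```python
# A sketch relating the probabilistic model to classical CCA quantities.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(4)
N, d = 1000, 2
Z = rng.normal(size=(N, d))                        # shared latent variables z
X1 = Z @ rng.normal(size=(d, 5)) + 0.2 * rng.normal(size=(N, 5))
X2 = Z @ rng.normal(size=(d, 4)) + 0.2 * rng.normal(size=(N, 4))

cca = CCA(n_components=d)
U1, U2 = cca.fit_transform(X1, X2)                 # canonical variates
# Canonical correlations (the diagonal of J_d):
corrs = [np.corrcoef(U1[:, i], U2[:, i])[0, 1] for i in range(d)]
print(np.round(corrs, 3))                          # close to 1 given strong shared z
```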

Probabilistic Fisher Discriminant Analysis
FDA attempts to characterize or distinguish between two classes of objects by using a linear combination of features. Many machine learning and pattern recognition applications use this strategy [87][88][89][90]. To identify mixed errors, FDA was integrated with a hybrid kernel extreme learning machine [91]. The Fisher discriminant criterion is [92]

J(w) = (w^T S_B w) / (w^T S_W w),

where S_W is the covariance matrix within classes and S_B is the covariance matrix between classes:

S_W = Σ_k Σ_{i∈C_k} (y_i - m_k)(y_i - m_k)^T, S_B = Σ_k n_k (m_k - m)(m_k - m)^T,

where n_k is the number of observations in the kth class, m_k denotes the mean of the observed column vectors y_i in class k, and m is the mean column vector of all observations. The probabilistic framework of FDA is

x = m + A u,

where u ∼ N(u|v, I), v ∼ N(v|0, Ψ), and v represents the class center. The corresponding graphical model is displayed in Figure 4. In the PFDA model, m, Ψ, and A are unknown. The log-likelihood is built from the joint distribution p(x_1, ..., x_n) of a set of n patterns, provided they belong to the same class, and is obtained by integrating out the latent variables. If the within-class and between-class covariances Φ_w and Φ_b are positive definite and positive semi-definite, respectively, the likelihood L can be maximized under these constraints; without these limitations, basic matrix calculus provides the unconstrained optimum. The EM method is then used to update the parameters m, A, and Ψ to maximize the PFDA model's likelihood.
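As a minimal sketch of the Fisher criterion underlying PFDA, the code below computes S_W and S_B on synthetic two-class data and solves the associated generalized eigenproblem for the leading discriminant direction; the class means and dimensions are illustrative assumptions.

```python
# A compact sketch of the Fisher criterion on synthetic two-class data.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(5)
Y = np.vstack([rng.normal(0.0, 1.0, size=(200, 3)),
               rng.normal(2.0, 1.0, size=(200, 3))])
labels = np.repeat([0, 1], 200)

m = Y.mean(axis=0)
S_W = np.zeros((3, 3)); S_B = np.zeros((3, 3))
for k in (0, 1):
    Yk = Y[labels == k]
    mk = Yk.mean(axis=0)
    S_W += (Yk - mk).T @ (Yk - mk)                  # within-class scatter
    S_B += len(Yk) * np.outer(mk - m, mk - m)       # between-class scatter

# Directions maximizing w^T S_B w / w^T S_W w solve S_B w = lambda S_W w:
eigval, eigvec = eigh(S_B, S_W)
w = eigvec[:, -1]                                   # leading discriminant direction
print(eigval[-1], w)
```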

Test Statistics
Fault diagnosis differs from a classification problem in that it must detect abnormalities from sampling data. Test statistics construct a threshold for this judgment.

T 2 Test Statistic
The false alarm rate (FAR) is an elementary concept in fault detection; it expresses the probability of a false alarm signal and is given by

FAR = Pr(J > J_th | fault-free), (76)

where J denotes a test statistic and J_th denotes the threshold. Equation (76) captures the possibility that the decision logic may sound an alert for a malfunction even when one has not occurred. Then, the general formulation of the fault detection problem is provided by the model

y = μ + ε, (77)

where the mean E(ε) and the covariance Σ are unknown, and assuming that sampling data y_1, ..., y_i (i = 1, ..., N) are available, find a corresponding threshold J_th based on online measurement data y_k, ..., y_{k+j} such that FAR ≤ α. Under the framework of the model in Equation (77), the T² test statistic is defined by

T² = (y_k - ȳ_N)^T Σ̂^{-1} (y_k - ȳ_N), ȳ_N = (1/N) Σ_{i=1}^N y_i,

where Σ̂ is the sample covariance estimated from the N fault-free samples, and the corresponding threshold is represented by

J_th,T² = ( m(N² - 1) / (N(N - m)) ) F_α(m, N - m),

where F_α(m, N - m) denotes the F-distribution with m and (N - m) degrees of freedom. After obtaining each new measurement y_k, the test statistic is checked; the alarm is triggered to indicate the fault when T² > J_th,T².
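A short sketch of the T² statistic and its F-based threshold follows; the training data, dimension m, and significance level α are illustrative assumptions.

```python
# A sketch of the T^2 statistic with its F-distribution threshold.
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(6)
m_dim, N, alpha = 4, 500, 0.01
Y_train = rng.normal(size=(N, m_dim))              # fault-free training data

y_bar = Y_train.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(Y_train, rowvar=False))
J_th = (m_dim * (N**2 - 1)) / (N * (N - m_dim)) * f.ppf(1 - alpha, m_dim, N - m_dim)

def t2(y_k):
    d = y_k - y_bar
    return d @ Sigma_inv @ d

print(t2(rng.normal(size=m_dim)) > J_th)           # fault-free sample: usually False
print(t2(rng.normal(size=m_dim) + 5.0) > J_th)     # shifted sample: usually True
```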

SPE or Q Statistic
The inverse of Σ is necessary for the T² statistic, and numerical trouble may be incurred in the computation by a high-dimensional or ill-conditioned Σ. The Q statistic can alternatively be chosen for detecting the fault:

Q = (y_k - ȳ_N)^T (y_k - ȳ_N).

The threshold J_th,Q can be computed offline with the process data; then, J_th,Q is set for a given significance level α.
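A brief sketch of the residual-based SPE form of the Q statistic follows (one common choice under the "SPE or Q" heading): it builds a PCA residual subspace and uses an empirical quantile of the offline data as a simple stand-in for analytic threshold approximations. All data and dimensions are illustrative assumptions.

```python
# An SPE (Q) sketch on a PCA residual subspace with an empirical threshold.
import numpy as np

rng = np.random.default_rng(7)
d, q, N, alpha = 8, 3, 1000, 0.01
Y = rng.normal(size=(N, q)) @ rng.normal(size=(q, d)) + 0.1 * rng.normal(size=(N, d))
Y -= Y.mean(axis=0)

_, _, Vt = np.linalg.svd(Y, full_matrices=False)
P = Vt[:q].T                                       # principal loading matrix

def spe(y):
    r = y - P @ (P.T @ y)                          # residual after projection
    return r @ r

J_th_Q = np.quantile([spe(y) for y in Y], 1 - alpha)  # offline threshold
print(J_th_Q)
```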

KL Divergence
Kullback-Leibler divergence (KLD) is well known for measuring the divergence between two PDFs. The KL divergence of two continuous PDFs p(x) and q(x) is given by

KL(p(x), q(x)) = ∫ p(x) log [ p(x) / q(x) ] dx. (90)
The KL divergence is non-negative and zero if and only if p(x) equals q(x). For the PDFs p(x) and q(x), assume the random variable x is Gaussian with p(x) = N(µ_p, Σ_p) and q(x) = N(µ_q, Σ_q); Equation (90) can then be further written as

KL(p, q) = (1/2) [ ln( |Σ_q| / |Σ_p| ) - d + tr(Σ_q^{-1}Σ_p) + (µ_q - µ_p)^T Σ_q^{-1} (µ_q - µ_p) ],

where d is the dimension of x. Lei discussed the detection of an incipient fault condition in complex dynamic systems using the KL distance [93]. A methodology that can detect incipient anomalous behaviors based on KL divergence was proposed in [94].
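The Gaussian closed form above can be written out directly; the sketch below implements it, with identity-covariance inputs as an illustrative check.

```python
# Closed-form KL divergence between two multivariate Gaussians.
import numpy as np

def kl_gauss(mu_p, S_p, mu_q, S_q):
    d = len(mu_p)
    S_q_inv = np.linalg.inv(S_q)
    dm = mu_q - mu_p
    return 0.5 * (np.log(np.linalg.det(S_q) / np.linalg.det(S_p)) - d
                  + np.trace(S_q_inv @ S_p) + dm @ S_q_inv @ dm)

I = np.eye(2)
print(kl_gauss(np.zeros(2), I, np.zeros(2), I))    # identical PDFs -> 0.0
print(kl_gauss(np.zeros(2), I, np.ones(2), 2 * I)) # shifted/scaled -> positive value
```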

Hellinger Distance
The traditional statistical tests cannot be effectively applied to detect incipient faults. The Hellinger distance was first proposed [95] to measure the similarity of two probability distributions. Assuming that p(x) and q(x) represent two continuous PDFs, the Hellinger distance (HD) is defined through

HD²(p, q) = (1/2) ∫ ( √p(x) - √q(x) )² dx.

HD is a symmetric, bounded distance, and its possible values are between 0 and 1, i.e., 0 ≤ HD(p, q) = HD(q, p) ≤ 1. Based on the Lebesgue measure, the square of HD can equivalently be expressed as

HD²(p, q) = 1 - ∫ √( p(x) q(x) ) dx.

Given two PDFs that obey the normal distributions p(x) ∼ N(µ_p, σ²_p) and q(x) ∼ N(µ_q, σ²_q), HD²(p, q) of p(x) with respect to q(x) is given by

HD²(p, q) = 1 - √( 2σ_p σ_q / (σ²_p + σ²_q) ) exp( -(µ_p - µ_q)² / (4(σ²_p + σ²_q)) ).

One work combined HD, Bayesian inference, and ICA to monitor a multiblock plant-wide process [96]. HD and KLD were combined to explore the isolation capability of an FDI test by Palmer [97]. Chen introduced HD into a multivariate statistical analysis framework to detect incipient faults for high-speed trains [98].
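The univariate Gaussian closed form can likewise be coded directly; in the sketch below, the small mean shift in the second call mimics an incipient fault, and the parameter values are illustrative.

```python
# Closed-form Hellinger distance between two univariate Gaussians.
import numpy as np

def hellinger_gauss(mu_p, s_p, mu_q, s_q):
    bc = np.sqrt(2 * s_p * s_q / (s_p**2 + s_q**2)) \
         * np.exp(-(mu_p - mu_q)**2 / (4 * (s_p**2 + s_q**2)))
    return np.sqrt(1.0 - bc)

print(hellinger_gauss(0.0, 1.0, 0.0, 1.0))   # identical PDFs -> 0.0
print(hellinger_gauss(0.0, 1.0, 0.1, 1.0))   # incipient shift -> small but nonzero
```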

Recent Applications on Statistical Fault Diagnosis
Although statistical schemes for fault diagnosis have been widely applied, there are still many challenges in their practical applications. Abundant modifications have been mounted on traditional statistical methods to ensure better performance in industrial applications with harsh working environments. Recent work on and applications of statistical fault diagnosis are presented in several aspects in this section.

Approaches Targeted for Data with Outliers and Missing Values
Advanced sensor techniques provide abundant information; however, multirate sampling can lead to incomplete data entries, and disturbances or measurement errors can induce outliers [99,100]. Although traditional statistical inference studies for process modeling and monitoring assume no missing data or outliers, industrial process data typically contain missing values, out-of-range values, and outliers. This greatly influences statistical FD strategies for process modeling and monitoring.
Traditional statistical analysis assumes that the process data are clean, since traditional statistics, such as the mean and variance, are sensitive to outliers [101,102]. Additionally, the data gathered from industrial processes tend not to be normally distributed. The representative methods of multivariate statistical analysis usually sink into poor performance when the quality of the process data is insufficient. Because the most frequently used test statistics work under the presumption that the sample data meet Gaussian distribution requirements, data quality also affects the determination of whether a fault exists [103]. The improved weighted k-neighborhood standardization (WKNS)-PCA is applied to detect process outliers, and its advantage lies in employing a single model for multi-mode industrial processes [104]. Multi-PCA models are trained and integrated with a modified exponentially weighted moving average (EWMA) control chart to improve robustness to outliers and sensitivity to small and sudden abnormalities [105]. A novel dual-robustness projection to latent structures method based on the L1 norm (L1-PLS) has been proposed and shown to be insensitive to outliers [106]. The robust mixture PPCA model is incorporated with a Bayesian soft decision fusion strategy to handle the missing data problem. The mixture of probabilistic principal component analysis (MPPCA) models was trained under multiple operational conditions, such as healthy conditions and anomalies with missing measurement data [107].

Modifications Designed for Non-Gaussian and Nonlinear Processes
The retrieved latent variables are assumed to be Gaussian in traditional statistical methods like PCA and PLS. Additionally, these methods require a linear correlation between the variables.
However, multiple manufacturing phases or operating conditions often lead to non-Gaussianity [108,109]. The control threshold and the boundary of normal operation may not be correct in non-Gaussian situations. Therefore, the non-Gaussianity of industrial process data may cause traditional statistical analysis to raise false alarms [110]. ICA has been used to extract components that are non-Gaussian and statistically independent [111,112], but this method is cumbersome for some practical applications due to a number of drawbacks, including unstable monitoring results and uncertain selection of the number of retained independent components [110,113,114].
The nonlinearity of industrial processes mostly shows in two aspects: on the one hand, nonlinearity is embodied in the relationship along the time series, x_{k-1} → x_k; on the other hand, a nonlinear relationship exists between different variables, x_k → y_k. With the ubiquity of nonlinearity in practical applications, both aspects deserve attention. Compared with techniques for classifying a linear dataset [115], the feasibility of support vector machines [116] and proximal support vector machines [117] has been illustrated. Probabilistic ICA serves as a statistical method for blind source separation and constructs a probabilistic model for uncertainty; it was extended to a variational Bayesian form to improve simplicity and robustness [118]. A semi-supervised learning framework was delivered for ICA [119]. Gaussian processes provide a probabilistic approach for dealing with nonlinearity; the model was derived by Lawrence [120], and process monitoring was implemented by Ge and Song [121]. The support matrix machine, as an extension of the SVM, was developed under a probabilistic framework [122]. Weighted difference principal component analysis (WDPCA) eliminates the multimodal and nonlinear characteristics of the original data by using the weighted difference method [123]. As an extension of CVA, canonical variate analysis was integrated with a dissimilarity-based index for incipient fault detection in nonlinear dynamic processes under varying operating conditions [124]. To overcome the poor prediction of PCA when inputs and outputs are nonlinear, a formalism integrating PCA and generalized regression neural networks (GRNNs) was introduced; it is a one-step procedure, which helps in the faster development of nonlinear input-output models [125].

Approaches for Non-Stationary Processes
PCA and CCA are two types of multivariate analysis that are commonly utilized on processes with a single stationary condition. At the same time, their performance may degrade when confronted with highly dynamic systems. When these methods are applied in practice, highly autocorrelated and time-dependent measurement data can compromise troubleshooting accuracy.
For online operating systems, one should consider moderate adjustments because of the sequential relationships in dynamic process data. Given that time-wise sample points are auto-correlated, one of the most important qualities of industrial systems is their dynamic characteristics. As a result, extending statistical modeling from static to dynamic representations is preferred [100]. The state-space formulation presents discrete-time dynamic systems for monitoring dynamic processes [126]. FIR-smoothing techniques obtain satisfactory performance for measurements with time delay [127,128]. The methods in [129][130][131][132] also improve immunity to disturbances using Bayesian inference. Dynamic independent component analysis (DICA) is applied to an augmented matrix with time-lagged variables to deal with dynamic processes [133]. By combining the advantages of KPCA (Gaussian part) and KICA (non-Gaussian part), Zhang [134] developed a nonlinear dynamic approach to detect faults online, compared to other nonlinear approaches.

Work on Robustness
Industrial process data are often mixed with disturbances, and measurement dimensions vary widely between scales. Data preprocessing involves normalization to adjust the ranges of values, but preprocessing methods impose a large computational load, especially for voluminous industrial data. When processes are affected by disturbances, a robust strategy can tolerate unstable measurement quality. Different robust mechanisms of PCA have been researched as fundamental statistical tools for processing data and dimensionality reduction. By combining projection pursuit (PP) with robust scatter matrix estimation, Hubert proposed a robust PCA [135]. Li and Chen first created a robust PCA with PP [136], and a generalized simulated annealing algorithm was carried out by Xie [137]. Furthermore, improvements were implemented to obtain stable numerical accuracy in high-dimensional settings [138,139]. The methods mentioned above are deterministic, whereas Bayesian methods can be flexible alternatives. The modification of PCA under a probabilistic framework can represent uncertainty and has thus become a popular method [140]. To handle heavy-tailed datasets, Bayesian PCA has employed the Student's t distribution [141] and the Laplace distribution [142]. Recently, successful industrial applications have been implemented [143][144][145].

Artificial Intelligence Approaches
The construction of system models and feature extraction can be replaced by network-based strategies [146]. Neural networks (NNs) excel at coping with nonlinearity and non-Gaussianity. Being equipped with the ability to discover dynamic behaviors, NNs are promising for FD systems. This promising trend for FD techniques is built firmly on NNs' enhanced computational power and explainability. By altering their weights based on input and output data, artificial NNs mimic the organization of the human brain. The use of counterpropagation NNs and recurrent NNs can be seen in [147,148]. A convolutional NN combined with the fractional Fourier transform and recurrence plot transform has been applied under variable working conditions [149]. A membrane-learnable residual spiking NN for autonomous vehicle sensor defects was proposed by Wang and Li [150].

Challenges and Open Problems
In practice, challenges such as data preprocessing, the real-time ability of FDD methods, and multichannel data from multiple sensors need further research. The existing methods can also be enhanced to deal with non-Gaussianity and nonlinearity.

Preprocessing High-Dimensionality Data
Data sampled in a harsh working environment with disturbances are possibly of poor quality and will reduce the accuracy and effectiveness of FDD schemes. Preprocessing the sampling data to remove outliers is beneficial for subsequent FDD steps. Modern industrial processes consist of various components, and each part can have abundant measured variables. This benefits real-time monitoring but is accompanied by problems in storing, managing, and preprocessing big data. Statistical methods generally use matrices to compute the corresponding statistics, and the dimension of these matrices increases with burgeoning data size. The dimensional explosion problem induces a large computational load, placing high demands on hardware facilities.

Statistical FD Schemes Developed without Real-Time Ability
Unlike a simple classification problem, the design of an FDD scheme needs to consider the dynamic behaviors of the practical application. For any online system, FDD schemes are ultimately designed to detect and diagnose faults in real time. A fast and effective decision-making process for judging whether a fault is reflected in the collected data is significant, especially for high-sample-frequency systems. Most methods reviewed in this paper apply only to static processes and lack the ability for online implementation. Online implementation therefore remains a research gap for statistical FDD methods.

Enhancement on Existing Methods
Further enhancements of the works presented in this review are suggested: (i) using different statistical methods to improve fault diagnosis; (ii) enhancing statistical methods not explored by the authors in their original works; and (iii) modifying methods to handle non-Gaussian and nonlinear data. These methods should confirm their robustness when implemented in industrial processes (mainly chemical, mechanical, and bioengineering).

Development on Fault Diagnosis
Most methods reviewed in this paper focus on fault detection or isolation. This points to a research niche: the development of complete fault diagnosis methods.

Processes with Multichannel Data from Multiple Sensors
Research on the monitoring and diagnosis of general multichannel profiles is still limited to processing a single profile's data. In industrial applications, product quality is often characterized by profile data collected from multiple channels. By taking cross-correlations among multichannel profiles into consideration, profile monitoring is expected to become more sensitive to a variety of shifts.

Conclusions
Statistical methods have received increasing attention in recent years as an emerging and active research area in fault detection and diagnosis. This paper has sketched the probabilistic extensions of traditional statistical-based FDD schemes (such as PCA, ICA, CCA, and FDA) along with the relevant test statistics. A brief review of challenges and perspectives on statistical FDD strategies has been presented. The review shows that each of the existing probabilistic approaches has its own strengths and limitations. In addition, several key open problems (such as non-Gaussianity, nonlinearity, non-stationarity, and robustness) are discussed to indicate potential future research.

Figure 1 .
Figure 1. A fault-diagnosis system is targeted at detecting failures from collected information and improving robustness and accuracy.

Figure 2 .
Figure 2. Classification of fault diagnosis methods.


Figure 4 .
Figure 4. In the latent space, which is the space where the variables are independent, PFDA models the class center v and the examples u. The transformation A links the example x to its latent representation.