Data-Driven Fault Diagnosis Analysis and Open-Set Classification of Time-Series Data

Fault diagnosis of dynamic systems is done by detecting changes in time-series data, for example residuals, caused by system degradation and faulty components. The use of general-purpose multi-class classification methods for fault diagnosis is complicated by imbalanced training data and unknown fault classes. Another complicating factor is that different fault classes can result in similar residual outputs, especially for small faults, which causes classification ambiguities. In this work, a framework for data-driven analysis and open-set classification is developed for fault diagnosis applications using the Kullback-Leibler divergence. A data-driven fault classification algorithm is proposed which can handle imbalanced datasets, class overlapping, and unknown faults. In addition, an algorithm is proposed to estimate the size of the fault when training data contains information from known fault realizations. An advantage of the proposed framework is that it can also be used for quantitative analysis of fault diagnosis performance, for example, to analyze how easy it is to classify faults of different magnitudes. To evaluate the usefulness of the proposed methods, multiple datasets from different fault scenarios have been collected from an internal combustion engine test bench to illustrate the design process of a data-driven diagnosis system, including quantitative fault diagnosis analysis and evaluation of the developed open set fault classification algorithm.


Introduction
Fault diagnosis of technical systems deals with the problem of detecting and isolating faults by comparing model predictions of nominal system behavior with data from sensors mounted on the monitored system [1]. Early detection of faults and identification of their root causes are important to improve system reliability and to select suitable countermeasures. Two common approaches to fault diagnosis are model-based and data-driven [2].
Model-based fault diagnosis relies on a mathematical model describing the nominal system behavior, where the model is derived based on physical insights about the system. Residuals are computed by comparing model predictions and sensor data to detect inconsistencies caused by faults [3,4]. Data-driven fault diagnosis uses training data from different operating conditions and faulty scenarios to capture the relationship between a set of input and output signals [5]. The output signal could be a feature or sensor value to be predicted, which is referred to as regression, or the class label that input data belong to, referred to as classification [6].
Fault detection and isolation are complicated by prediction inaccuracies and measurement noise [7]. For nonlinear dynamic systems, the sensor data distribution varies due to different operating conditions which requires complex data-driven classifiers to capture the distribution of different classes to distinguish between faults. One solution, see for example [8], is to use residual generators to compute residual outputs as features to filter out the system dynamics before classifying faults.
Even though system dynamics can be filtered out in feature data, the distribution of faulty data depends not only on the fault class but also on the realization of the fault, e.g., fault magnitude and excitation. Thus, collecting representative training data from the various fault scenarios that can occur in the system is a complicated task, especially when developing a diagnosis system early in the system's life when failures are still rare [9]. General-purpose models for supervised multi-class classification, such as Random Forests and Neural Networks, have a risk of misclassifications when training data are limited since these methods assume that training data are representative of all classes. Another important task of diagnosis systems is to track system degradation, for example by continuously estimating the size, or severity, of a fault that is present in the system. Accurate information about fault size is important for predictive maintenance [16], prognostics [17], and fault reconstruction algorithms [18]. Several model-based techniques for fault size estimation have been proposed, see e.g. [17]. Still, it is a non-trivial problem to estimate the fault size when no fault models are available.
The main objective of this work is to develop a framework for data-driven analysis and classification for fault diagnosis of dynamic systems. The first contribution is a quantitative analysis method for data-driven fault classification using the Kullback-Leibler (KL) divergence to model data from different fault classes and to analyze fault detection and isolation performance for a given set of features. The second contribution is a data-driven open set classification algorithm designed for fault diagnosis applications that can handle: limited training data, unknown fault classes, and overlapping regions of different classes. A third contribution is a data-driven fault size estimation algorithm, using the same modeling framework, when training data are available with known fault sizes. As a case study, real data are collected from an internal combustion engine test bench which is operated during both nominal and faulty operation [1].
The outline of this paper is as follows. First, the problem statement is presented in Section 2. Related research is summarized in Section 3 and some background on fault diagnosis and quantitative fault detection and isolation analysis is given in Section 4. Then, the proposed framework for data-driven quantitative analysis is presented in Section 5, the proposed open set fault classification algorithm in Section 6, and the fault size estimation algorithm in Section 7. The internal combustion engine case study is described in Section 8 and the results of the experiments are presented in Section 9. Finally, some conclusions are summarized in Section 10.

Problem Statement
The main objective is to develop a data-driven framework for supervised machine learning and data-driven analysis to systematically address the complicating factors of data-driven fault diagnosis of non-linear dynamic systems. The characterizing properties of the data-driven fault diagnosis problem are: imbalanced training data, fault classes not represented in training data, and overlapping feature distributions of different fault classes. Even though there has been significant work done in data-driven fault diagnosis and machine learning, it is still an open question how to address all these complicating factors in a common framework. In the problem formulation, it is assumed that a set of n features has been developed to monitor a non-linear dynamic system. The set of computed features is time-series data, for example residuals computed from sensor outputs, that can be used to detect faults in the system. It is assumed that a set of training data has been collected from different nominal and faulty scenarios where the samples are correctly labelled. Labelling mainly considers which fault class each sample belongs to, but it will also be investigated what can be done when information about fault sizes is available as well. It is assumed that the available training data fulfill the characterizing properties listed above.
The first objective of this work is to develop a data-driven framework for modeling of different fault classes. To analyze the properties of training data, a quantitative method will be developed that is able to analyze the ability to classify the different faults. The proposed method should be able to quantify how easy it is to distinguish between different fault classes which can give valuable insights about the nature of different faults.
The second objective is to develop a fault classification algorithm that is designed to detect and classify known fault classes that can explain the observations but also identify scenarios with unknown faults. Because feature data is assumed to overlap between different classes, the classifier should be designed such that the computed fault hypotheses are based on the characterizing properties of training data to avoid misclassifications.
The third objective considers the case when training data are collected from different fault realizations with known fault magnitudes, e.g., a given sensor bias or leakage diameter. The purpose is to use the proposed framework to estimate the magnitude of a detected known fault to track the system health when no mathematical model is available to estimate the fault.

Related Research
There has been an increasing focus on data-driven classification algorithms that can handle unknown classes, which is referred to as open set classification, see e.g. [19], and overlapping classes, see e.g. [13]. However, it is still an open problem how to address all these mentioned characteristics of data-driven fault diagnosis problems. Open set classification has been used in, e.g., computer vision, to deal with unknown classes not covered by training data [19]. Different algorithms have been developed to solve the open set classification problem, for example Weibull-calibrated support vector machines [20] and extreme value machines [21].
Different data-driven approaches have been proposed for open set fault classification to handle both known and unknown fault scenarios, for example, one-class support vector machines [1,22], conditional Gaussian network [23], ensemble methods [9], and Hidden Markov Models [24]. Several papers have proposed open set classification algorithms for machinery condition monitoring, see for example [25,26,27]. The authors of [25] propose a hybrid approach combining different data-driven methods and subspace learning. In [26], a neural network with residual learning blocks is proposed and [27] proposes a deep learning neural networks-based approach to handle the situations where training data and test data are collected from different operating conditions. In [28], a domain adaptation open set classification approach is proposed combining auto-encoders, a proposed homothety loss, and an origin discriminator, to monitor a system using training data from other similar systems. With respect to previous works, the proposed data-driven framework can also be used to evaluate fault diagnosis performance given a set of features. The same framework is used to design an open set fault classification algorithm that handles overlapping classes, including scenarios with unknown faults, and estimates the fault sizes of known fault classes by evaluating the probability distribution of feature data.
Quantitative fault detection and isolation analysis has been considered in, for example, [7,29]. In [7], the KL divergence is used to analyze time-discrete linear descriptor models. Similar approaches have also been used in e.g. [30] and [31]. In [29], diagnosis performance is measured based on the distance between different kernel subspaces. One application of quantitative analysis is sensor selection, see for example [32,33]. The KL divergence has also been used for fault detection, see for example [24,34]. With respect to these mentioned works, a data-driven framework is proposed here for quantitative fault diagnosis performance analysis, open set fault classification, and fault size estimation. Developing a mathematical model of complex systems with sufficient accuracy for fault diagnosis purposes is a time-consuming process [35,36]. This has motivated the use of machine learning and data-driven fault diagnosis methods to instead learn the system behavior from collected data. Still, model-based techniques can be used to derive useful features for system monitoring, see e.g. [16]. One example is model-based residuals that can filter out system dynamics and improve signal fault-to-noise ratio which reduces the need for complex data-driven classification models to distinguish between different fault classes [1].
Several recent papers consider hybrid diagnosis system designs for dynamic systems combining, e.g., model based residual generators and machine learning. In [37], model-based residuals and observers are used for fault detection in an automotive braking system where a hybrid fault isolation approach is proposed combining model-based fault isolation and support vector machines. The same braking system case study is also considered in [38] where an ensemble classifier is proposed combining both model-based and data-driven fault detectors. In [39], both sensor data and residual data are used as input to a tree augmented naive Bayes fault classifier. In [40], feature selection using neural networks is applied before training the fault classifiers. In [41], model-based residuals and sensor data are used as inputs to a Bayesian network to perform fault classification and in [42] model data features are extracted and fed into a neural network classifier. With respect to previous work, the proposed diagnosis system can identify unknown fault scenarios and estimate the fault size of known faults.
Fault size estimation is important for tracking system degradation [16], prognostics [17], and fault reconstruction [18]. In [43], a robust linear receding horizon model-based approach is proposed for fault estimation in linear discrete-time state-space models with additive faults. In [44], a data-driven fault estimation algorithm is evaluated on an aircraft gas turbine case study where a linear state-space model is estimated from data and then used to estimate the fault size by assuming additive fault models. In these mentioned works, the faults are included as parameters in the system model and estimated using model-based techniques. With respect to these works, a data-driven fault size estimation algorithm is proposed when no model of the fault is available.
Some work on data-driven fault size estimation has been done, see for example [45,46]. In [45], faults are assumed to appear as pulses in the time domain data, which is inherently tied to the bearing case. In [46], Paris' formula [47], estimating crack growth in bearings, is used to interpolate between distributions from known fault sizes. A hybrid method is proposed in [48] for structural health monitoring combining finite element models and machine learning to estimate fault location and severity. In [49], a fault detection and estimation scheme is proposed for incipient faults using the Jensen-Shannon divergence measure. With respect to previous work, fault size estimation is formulated as an optimization problem using the KL divergence and training data from different types of fault realizations.

Background
To address the fault diagnosis problem, a data-driven framework is proposed to model data from different fault classes and quantify fault diagnosis performance. A summary of the relevant results and definitions from previous work [7] is presented here.

Modeling Fault Classes
Let r = (r_1, r_2, ..., r_n) denote a set of n signals or features, for example residuals. A sample of r at time index t, denoted r_t, belongs to one of m known fault classes {f_1, f_2, ..., f_m}. The fault-free class is denoted NF (No Fault). To capture the impact of model uncertainties and measurement noise, each dataset {r_1, r_2, r_3, ...} is partitioned into batches of size N, for example R = {r_{t-N+1}, r_{t-N+2}, ..., r_t}, where the distribution of data in each batch is modeled as a probability density function (pdf) p = p(R). An illustration of the one-dimensional case is shown in Figure 2. The figure shows partitioned time-series data and the estimated Gaussian distribution for each batch. For comparison, the distribution of each batch is also estimated using a kernel density estimator. It is assumed that all samples in one batch belong to the same fault class.
The pdf p(R) of each batch R varies depending on the system operating conditions, such as operating point and fault realization. Let Ω_i = Ω_i(r) denote the set of pdfs, based on the feature set r, that can be explained by fault f_i, where p(R) ∈ Ω_i(r) denotes one pdf p(R) in the set. For example, if p ~ N(μ, Σ) is a multivariate normal distribution, the mean estimate μ̂ and covariance estimate Σ̂ can be obtained from a batch of N samples, R = {r_1, r_2, ..., r_N}, as

μ̂ = (1/N) Σ_{t=1}^{N} r_t,   Σ̂ = (1/(N-1)) Σ_{t=1}^{N} (r_t - μ̂)(r_t - μ̂)^T.   (1)

The following definition is used to model each fault class.
Definition 1 (Fault mode). Let {r_1, r_2, r_3, ...} be a set of time-series data, when fault f_i is present, that is partitioned into a set of consecutive batches where each batch is represented by a pdf p(R). A fault mode is defined by the set Ω_i(r) with corresponding pdfs p(R) ∈ Ω_i(r) for each batch that can be explained by fault class f_i.

Figure 2: An illustration of a one-dimensional residual which is partitioned into consecutive batches and the estimated data distribution of each batch. The distribution is estimated using a Gaussian distribution and a kernel density estimation. The upper plot shows the estimated Gaussian distribution using all samples in each batch and the lower plot shows the estimated distribution after removing 10% of the outliers in each batch.
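As a concrete illustration of the batch partitioning and Gaussian estimation described above, the following sketch (a minimal Python illustration, not the implementation used in this work; the function name and batch-handling choices are my own) partitions a feature time series into consecutive batches and fits a Gaussian to each:

```python
import numpy as np

def batch_estimates(r, N):
    """Partition a feature time series r (T x n) into consecutive batches of
    size N and estimate a Gaussian (mean, covariance) for each batch.
    Trailing samples that do not fill a complete batch are discarded."""
    r = np.asarray(r, dtype=float)
    T, n = r.shape
    batches = []
    for start in range(0, T - N + 1, N):
        R = r[start:start + N]                   # one batch of N samples
        mu = R.mean(axis=0)                      # sample mean estimate
        Sigma = np.cov(R, rowvar=False, ddof=1)  # unbiased covariance estimate
        batches.append((mu, np.atleast_2d(Sigma)))
    return batches

# Example: 1000 samples of a 2-dimensional feature set, batch size 100.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))
pdfs = batch_estimates(data, N=100)
print(len(pdfs))  # 10 batches
```

Each batch is then represented by its (μ̂, Σ̂) pair, which is the representation the rest of the framework operates on.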
To simplify notation, p and Ω_i are used where the dependence on R and r, respectively, is omitted. Different fault classes f_i are represented by different fault modes Ω_i where Ω_NF is used to denote the fault-free mode. Note that each fault mode Ω_i is modeled independently and there can be pdfs that belong to multiple fault modes, i.e., different fault realizations can result in the same distribution of data in one batch. Detectability and isolability of different fault classes depend on whether there are observations (pdfs) that can be explained by one fault class but not another, i.e., a fault class f_i can be isolated from another fault f_j if there is a p ∈ Ω_i such that p ∉ Ω_j.
Definition 2 (Fault isolability). A fault class f_i is isolable from a fault class f_j if there is a pdf p ∈ Ω_i such that p ∉ Ω_j. If a fault f_i is isolable from the fault-free class, then the fault is said to be detectable.
Using Definition 2, it is possible to analyze fault detection and isolation performance by comparing the different sets Ω i where i = 1, 2, . . . , m [50]. Note that, even though fault modes are isolable from each other it does not mean that all pdfs p ∈ Ω i can distinguish f i from f j . These ambiguities are caused by, for example, model uncertainties and sensor noise or lack of fault excitation. One illustration of fault excitation is leakage detection where the flow through the orifice, and thus detection performance, depends on the pressure difference since a small difference will result in a small leakage flow which can be difficult to detect.

Fault Classification by Rejection of Fault Hypotheses
Assuming that faults can be small or have varying impact on the feature set r, each fault class f_j is modeled such that the nominal mode Ω_NF ⊆ Ω_j. An implication is that it is possible to distinguish faults from nominal behavior but not vice versa. This is consistent with the principles of consistency-based fault isolation algorithms, such as [51], where p ∈ Ω_NF can be explained by a fault-free system but also by a faulty system (in case the fault has not yet been detected).
Instead of selecting the most likely target class, the set of plausible fault hypotheses is computed by rejecting fault classes that cannot explain data. A fault is detected when the fault-free class is rejected, i.e., when p ∉ Ω_NF. Similarly, fault classification is not performed by selecting the most likely fault class but by rejecting fault hypotheses, e.g., a fault class f_j is rejected if p ∉ Ω_j. This principle will be used here when performing data-driven fault isolation.

Quantitative Fault Diagnosis Analysis
The similarities between the different fault modes {Ω_1, Ω_2, ..., Ω_m} can be used to analyze fault diagnosis performance for a given system. However, only analyzing qualitative performance, such as fault isolability in Definition 2, does not give sufficient information regarding how easy it is to detect and isolate different faults. If feature distributions are used to model each fault class, one way to quantify fault diagnosis performance is to use the KL divergence to measure the similarity between two pdfs [7]. The KL divergence is defined as [52]

K(p||q) = E_p[ log( p(r) / q(r) ) ]   (2)

where E_p[·] denotes the expected value given the pdf p. From a fault diagnosis perspective, (2) can be interpreted as the expected value of a log-likelihood ratio test determining if r is drawn from a distribution with pdf p or q when p is the true density function. If p and q are two pdfs representing two different fault realizations, the larger the value of K(p||q) the easier it is to distinguish p from q when p is true. However, since each fault can have different realizations, quantitative isolation performance of fault f_i with realization p from another fault f_j is defined by the smallest value of K(p||q) for all q ∈ Ω_j. This measure, proposed in [7] and called distinguishability, is defined as

D*_{i,j}(p) = min_{q∈Ω_j} K(p||q).   (3)

If the distribution of a batch of r is described by pdf p when fault f_i is present, i.e., p ∈ Ω_i, the distinguishability measure D*_{i,j}(p) quantifies how easy it is to isolate f_i from f_j. A large value of (3) corresponds to an easier isolation problem [32]. Fault detection performance is denoted D*_{i,NF}(p). The distinguishability measure D*_{i,j}(p) is non-negative and equal to zero if and only if p ∈ Ω_j, i.e., when it is not possible to isolate from fault class f_j.
Another property of the distinguishability measure is that if Ω_NF ⊆ Ω_j, then [7]

D*_{i,NF}(p) ≥ D*_{i,j}(p).   (4)

This result can be interpreted as that it is easier to detect a fault f_i than to isolate it from another fault f_j.
If p ~ N(μ_p, Σ_p) and q ~ N(μ_q, Σ_q) are two n-dimensional multivariate normal distributions with known mean vectors, μ_p, μ_q ∈ R^n, and covariance matrices, Σ_p, Σ_q ∈ R^{n×n}, K(p||q) can be computed analytically as [53]

K(p||q) = (1/2) [ tr(Σ_q^{-1} Σ_p) + (μ_q - μ_p)^T Σ_q^{-1} (μ_q - μ_p) - n + ln( det Σ_q / det Σ_p ) ].   (5)

While (2) can be computed analytically if p and q are multivariate normally distributed, other probability distributions, such as Gaussian mixture models [53], can be used to model the distribution of r. Gaussian mixture models are more flexible than multivariate normal distributions but have no analytical expression for the KL divergence. Even though there are methods to numerically approximate (2), see for example [54], it is a tradeoff between model fit and evaluation cost of D*_{i,j}(p). For illustration, Figure 2 shows an example of residual data from the case study. The upper plot shows that the normal distribution can approximate the mean and variation in batch data, except when there are outliers, in which case the variance is overestimated. One solution is to remove the extreme values, e.g., 10% of the samples in each batch, by treating them as outliers before estimating the normal distribution. The lower plot in Figure 2 shows the resulting normal distributions where the estimated variance better captures the variation in data represented by a kernel density estimator. In this work, it is assumed that data in each batch can be represented by a multivariate normal distribution.
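The closed-form Gaussian KL divergence can be transcribed directly. A minimal sketch (illustrative names, not the authors' code):

```python
import numpy as np

def kl_gauss(mu_p, Sig_p, mu_q, Sig_q):
    """Closed-form KL divergence K(p||q) for p ~ N(mu_p, Sig_p) and
    q ~ N(mu_q, Sig_q), both n-dimensional multivariate normals."""
    mu_p, mu_q = np.asarray(mu_p, float), np.asarray(mu_q, float)
    Sig_p, Sig_q = np.atleast_2d(Sig_p), np.atleast_2d(Sig_q)
    n = mu_p.size
    Sq_inv = np.linalg.inv(Sig_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(Sig_p)   # numerically stable log-det
    _, logdet_q = np.linalg.slogdet(Sig_q)
    return 0.5 * (np.trace(Sq_inv @ Sig_p) + diff @ Sq_inv @ diff
                  - n + logdet_q - logdet_p)

# K(p||p) = 0, and the divergence is asymmetric in general.
print(kl_gauss([0, 0], np.eye(2), [0, 0], np.eye(2)))  # 0.0
print(kl_gauss([1], [[1]], [0], [[2]]))
```

Note the asymmetry: K(p||q) measures how easy it is to reject q when p is true, which is exactly the directed question the distinguishability measure asks.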

Approximated Distinguishability Measure Using Training Data
In many applications, the sets Ω j are either partially, or completely, unknown because training data only consist of a limited number of fault realizations. This means that the distinguishability measure (3) cannot be used. Instead, an approximated distinguishability measure is proposed where each fault mode is estimated from training data.

Distinguishability Measure for Data-Driven Analysis
If training data are correctly labelled, partitioning the data into batches can be used to estimate pdfs belonging to different fault classes. The estimated pdfs belonging to fault class f_j can be used to form a lower approximation of the true fault mode Ω_j, denoted Ω̂_j ⊆ Ω_j. Then, an approximation of (3) can be computed as

D_{i,j}(p) = min_{q∈Ω̂_j} K(p||q),   (6)

i.e., distinguishability is computed based on the set of already observed realizations of each fault f_j available in training data. The approximate distinguishability measure (6) is illustrated in Figure 3 where three pdfs, p_1, p_2, and p_3, are compared to a set of pdfs that is used to represent a fault mode Ω̂_j. The dashed lines show each pdf q that minimizes (6) for each p_k. The lower plot shows the computed KL divergence from one pdf, p_2, to all q ∈ Ω̂_j which, in general, increases for pdfs q that are located further away from p_2.
Since Ω̂_j ⊆ Ω_j, the relation between (3) and (6) is given by the inequality

D_{i,j}(p) ≥ D*_{i,j}(p).   (7)

The approximation (6) gives an upper bound of how easy it is to reject f_j for f_i given p. Note that (4) also holds for (6), i.e., D_{i,NF}(p) ≥ D_{i,j}(p).
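The approximation (6) is simply a minimum over the training pdfs. A sketch for the univariate Gaussian case (the function names and the training set are illustrative; the univariate KL formula is the n = 1 special case of the multivariate Gaussian expression):

```python
import math

def kl_1d(p, q):
    """KL divergence K(p||q) between univariate Gaussians given as
    (mean, standard deviation) tuples."""
    (mp, sp), (mq, sq) = p, q
    return math.log(sq / sp) + (sp**2 + (mp - mq)**2) / (2 * sq**2) - 0.5

def distinguishability(p, omega_j_hat):
    """Approximate distinguishability D_ij(p), eq. (6): the smallest KL
    divergence from p to any training pdf q in the estimated fault mode."""
    return min(kl_1d(p, q) for q in omega_j_hat)

# Training pdfs (batches) observed for fault class f_j.
omega_j_hat = [(0.0, 1.0), (0.5, 1.0), (1.0, 1.2)]
print(distinguishability((2.0, 1.0), omega_j_hat))
```

If p coincides with one of the training pdfs the measure is zero, consistent with D_{i,j}(p) = 0 when p ∈ Ω̂_j.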

Open-Set Fault Classification Using Distinguishability
Since different faults can have similar impact on system operation, it is relevant to not only select the most likely fault class but to identify all fault classes that can explain a set of observations. Here, a set of m one-class classifiers is used to model data from each of the m fault classes to determine whether each class can explain the observation or not. If a pdf p cannot be explained by any of the known fault classes, i.e., p ∉ Ω̂_j for all f_j, it belongs to an unknown fault class. Note that there can be two types of unknown faults [20]:
• Data belong to a new unknown fault class, or
• Data are a new realization of a known fault class f_j, i.e., p ∈ Ω_j \ Ω̂_j.
These unknown fault scenarios require extra attention, for example by a technician, to identify the root cause and correctly label data to be used for updating the corresponding fault mode.

Figure 3: The upper plot shows an illustration of the approximate distinguishability measure (6) from a set of pdfs, p_1, p_2, and p_3, to a fault mode Ω̂_j. The dashed lines show each pdf q that minimizes (6) for each p_i. For illustration, the computed KL divergence for each q ∈ Ω̂_j is shown in the lower plot given p_2.

One-class Classification
Here, a one-class classifier is proposed using the approximate distinguishability measure (6) to test whether a new pdf p can be explained by a fault mode f_j or not. The proposed classifier is denoted D_j(p) = min_{q∈Ω̂_j} K(p||q), in accordance with (6), where the index i is dropped to emphasize that the class label of p is unknown. Hereafter, the classifier modeling a fault mode f_j will be referred to in the text as D_j.
An open set classification algorithm for fault diagnosis is then formulated using one D_j classifier for each known fault class f_j. Fault classification is then performed such that the fault hypothesis f_j is rejected if

D_j(p) > J_j   (8)

where J_j is a threshold. A small threshold J_j increases the risk of falsely rejecting the true fault class, while a large threshold J_j means that fault f_j is less likely to be rejected even when it is a wrong hypothesis, increasing fault classification ambiguity.

Tuning of The One-Class Classifier Thresholds Using Within-Class Distinguishability
A tuning strategy for the threshold J_j in (8) is proposed such that most of the pdfs p ∈ Ω̂_j should be explained by fault class f_j. For each p ∈ Ω̂_j, the measure is evaluated with p removed from the training set, i.e.,

D_{j,j}(p) = min_{q∈Ω̂_j\{p}} K(p||q),   (9)

which is here referred to as within-class distinguishability. Analyzing the distribution of D_{j,j}(p) for all p ∈ Ω̂_j can be used to select a threshold. The distribution has non-negative support and is here approximated using a kernel density estimation method [6], as illustrated in Figure 4. Note that (9) does not state anything about the relation between the sets Ω̂_j and Ω_j but rather gives information about how scattered training data from that fault class are. Let Φ(x) denote the cumulative density function (cdf) of the estimated distribution and let α denote a desired false alarm rate. Then, the threshold J_j is selected such that Φ(J_j) = 1 − α. The lower plot in Figure 4 illustrates selecting a threshold corresponding to α = 5%.
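The tuning strategy can be sketched as follows, here using the empirical (1 − α) quantile of the leave-one-out within-class distinguishability values (9) in place of the kernel-density cdf (a simplifying assumption of mine; names are illustrative):

```python
import math

def kl_1d(p, q):
    """KL divergence between univariate Gaussians (mean, std) tuples."""
    (mp, sp), (mq, sq) = p, q
    return math.log(sq / sp) + (sp**2 + (mp - mq)**2) / (2 * sq**2) - 0.5

def tune_threshold(omega_j_hat, alpha=0.05):
    """Within-class distinguishability D_jj(p) for each training pdf p,
    eq. (9), thresholded at the empirical (1 - alpha) quantile."""
    d_within = []
    for i, p in enumerate(omega_j_hat):
        rest = omega_j_hat[:i] + omega_j_hat[i + 1:]   # leave p out
        d_within.append(min(kl_1d(p, q) for q in rest))
    d_within.sort()
    idx = min(len(d_within) - 1,
              int(math.ceil((1 - alpha) * len(d_within))) - 1)
    return d_within[idx]

omega_j_hat = [(0.1 * k, 1.0) for k in range(20)]   # 20 training pdfs
J_j = tune_threshold(omega_j_hat, alpha=0.05)
```

With a kernel density estimate of the D_{j,j} values, the threshold would instead be read off the estimated cdf at 1 − α, which smooths the empirical quantile used here.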

Computing Fault Hypotheses
If D_j(p) is large, it means that p is not likely to be explained by fault f_j. If D_j(p) is large for all fault classes f_j, no known fault class can explain p, indicating the occurrence of an unknown fault class. These results give a systematic approach to compute fault hypotheses, including the unknown fault case, by evaluating and comparing D_j(p) for each known fault class f_j.
The classifier (diagnosis) output D given an observation p is determined using the following decision rule:

D = {NF}                    if D_NF(p) ≤ J_NF,
D = {f_j : D_j(p) ≤ J_j}    if D_NF(p) > J_NF and at least one f_j is not rejected,
D = {f_x}                   otherwise,

i.e., if D_NF(p) ≤ J_NF, the output of the classifier is that the system is fault-free. An advantage of testing the fault hypothesis for each fault class individually is that there is no bias in the diagnosis output if the fault models are trained using imbalanced datasets. A fault f_j is a diagnosis candidate if D_j(p) does not exceed the threshold J_j. Note that when the output D contains at least one known fault class, it is still plausible that an unknown fault type has occurred, thus f_x is always a plausible fault hypothesis. However, f_x is only explicitly stated as a diagnosis output if D_j(p) is large for all f_j. If the NF class is not rejected, i.e., if D_NF(p) ≤ J_NF, there could still be an undetected fault present in the system since Ω̂_NF ⊆ Ω̂_j for all f_j. However, if the NF class is not rejected, it is more likely that the system is fault-free than faulty. Therefore, the only diagnosis output in this case is that the system is fault-free.
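The decision logic above can be sketched directly (the class labels, dictionary layout, and threshold values are illustrative assumptions, not the authors' interface):

```python
def diagnose(D_values, thresholds):
    """Compute the diagnosis output from the one-class classifier outputs.
    D_values[c] is D_c(p) for each known class c; 'NF' is the fault-free
    class and 'fx' denotes an unknown fault."""
    if D_values["NF"] <= thresholds["NF"]:
        return ["NF"]                               # NF not rejected
    candidates = [c for c in D_values
                  if c != "NF" and D_values[c] <= thresholds[c]]
    return candidates if candidates else ["fx"]     # no class fits: unknown

thresholds = {"NF": 1.0, "f1": 1.0, "f2": 1.0}
print(diagnose({"NF": 0.2, "f1": 5.0, "f2": 5.0}, thresholds))  # ['NF']
print(diagnose({"NF": 9.0, "f1": 0.3, "f2": 0.8}, thresholds))  # ['f1', 'f2']
print(diagnose({"NF": 9.0, "f1": 7.0, "f2": 6.0}, thresholds))  # ['fx']
```

The second example shows the deliberate ambiguity of the scheme: both f1 and f2 remain as hypotheses because neither can be rejected.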

Data-driven Fault Size Estimation
The method presented in Section 6 provides a means to classify new data, but it does not give any information about the severity of the detected faults. If each pdf q_k ∈ Ω̂_j has a known fault size θ_k, this information can be utilized to estimate the size of new realizations of fault f_j by comparing how similar the distribution of new data is to training data. One approach, suggested in [55], is to model fault severity using qualitative classes, such as {normal, slight, large}. Another way, which is largely unexplored, is to compute a quantitative severity estimate θ̂. Here, it is assumed that if two pdfs p and q from the same fault class f_i have similar fault sizes θ_p ≈ θ_q, then they should also be similar in a KL divergence sense, i.e., K(p||q) should be small. Fault size estimation is formulated as a convex optimization problem by using the KL divergence as a dissimilarity measure.
The fault size of a pdf p is estimated by finding a representation of p using a set of training distributions. Assume that for a given p, f_j ∈ D is a fault hypothesis. Then, the fault size θ of the corresponding fault class f_j is estimated from a linear combination of pdfs {q_1, q_2, ..., q_M} = Ω̂_j, where M = |Ω̂_j|, by solving the following optimization problem:

λ* = arg min_{λ_1,...,λ_M} K(p || p̂), where p̂ = Σ_{k=1}^{M} λ_k q_k, s.t. λ_k ≥ 0 and Σ_{k=1}^{M} λ_k = 1.   (10)

The fault size estimate θ̂ is then computed as a weighted sum of the fault sizes θ_1, θ_2, ..., θ_M, corresponding to the pdfs q_1, q_2, ..., q_M, as θ̂ = Σ_{k=1}^{M} λ*_k θ_k. The estimate p̂ in (10) is obtained by using all pdfs q_k ∈ Ω̂_j in the optimization. This is an inefficient strategy since it increases the computational cost by adding numerous distributions q_k likely to correspond to λ_k = 0. If multiple datasets have been collected from a variety of severities and conditions, it is unlikely that batches of new data would have a distribution that is similar to all pdfs in the training set. Following this line of reasoning, only a small subset of all pdfs is of interest when formulating the optimization problem (10).
Let {q_(1), q_(2), ..., q_(M)} = Ω̂_j denote an ordered set of all elements such that K(p||q_(1)) ≤ K(p||q_(2)) ≤ ... ≤ K(p||q_(M)). Then Ω̂_j^l ⊆ Ω̂_j is defined as the first l elements in the ordered set to reduce the number of parameters in (10). The parameter l ≤ M is the cardinality of Ω̂_j^l and is calibrated to include the subset of elements in Ω̂_j^l that are "similar" to p, i.e., K(p||q) is relatively small and similar in value to K(p||q_(1)).
If q_(1), q_(2), ..., q_(l) are multivariate normal distributions, then Σ_{k=1}^{l} λ_k q_(k) is a Gaussian mixture model [6]. Here, Monte Carlo sampling is used to estimate K(p||q), where q = Σ_{k=1}^{l} λ_k q_(k), as

K_MC(p||q) = (1/v) Σ_{γ=1}^{v} log( p(x_γ) / q(x_γ) )   (11)

by generating v samples {x_γ}_{γ=1}^{v} from p to approximate the integral in (2). By the law of large numbers, lim_{v→∞} K_MC(p||q) = K(p||q). A closer examination of the upper and lower bounds of this approximation is found in [56].
Applying (11) in (10) gives the updated algorithm:

λ* = arg min_λ K_MC(p ‖ Σ_{k=1}^l λ_k q^(k)), subject to λ_k ≥ 0 and Σ_{k=1}^l λ_k = 1, (12)

where {q^(1), q^(2), ..., q^(l)} = Ω̂_j^l and the fault size is estimated as θ̂ = Σ_{k=1}^l λ*_k θ_k. Note that if the diagnosis output D contains multiple fault hypotheses f_j, a fault size is estimated for each fault hypothesis by solving (12) for each f_j ∈ D.
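A possible implementation sketch of (12) is to draw the Monte Carlo samples once from p and then optimize the mixture weights with an off-the-shelf solver; here the simplex constraint is handled by normalizing bounded weights. All names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

def estimate_fault_size(p, pdfs, sizes, v=1000):
    """Sketch of (12): fit mixture weights lambda over the l nearest
    training pdfs by minimizing a Monte Carlo KL estimate, then return
    the weighted fault-size estimate.  p and each element of pdfs are
    (mean, covariance) pairs; sizes holds the known fault magnitudes."""
    x = rng.multivariate_normal(p[0], p[1], size=v)   # samples from p
    log_p = multivariate_normal.logpdf(x, p[0], p[1])
    # per-component densities evaluated once, reused in every cost call
    dens = np.column_stack(
        [multivariate_normal.pdf(x, mu, S) for mu, S in pdfs])

    def cost(w):
        lam = w / w.sum()                  # project onto the simplex
        return np.mean(log_p - np.log(dens @ lam + 1e-300))

    w0 = np.ones(len(pdfs)) / len(pdfs)
    res = minimize(cost, w0, bounds=[(1e-9, 1.0)] * len(pdfs))
    lam = res.x / res.x.sum()
    return float(lam @ np.asarray(sizes))
```

Reusing the same samples across solver iterations keeps the cost function deterministic, which gradient-based solvers require.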

Case study
The diagnostic framework is evaluated using experimental data collected from an engine test bench, see Figure 5. The engine is a commercial, turbocharged, four-cylinder, internal combustion engine from Volvo Cars. The sensor and actuator setup is the standard commercial configuration for the engine [1]. Figure 6 shows a schematic view of the engine along with the monitored signals, where y denotes sensor measurements and u denotes actuator signals. The system represents the air path through the engine and is an interesting case study for fault diagnosis because of its non-linear dynamic behavior and wide operating range. In addition, the coupling from the exhaust flow to the air intake via the turbocharger complicates fault isolation, since a fault anywhere in the system will affect the behavior of many other components in the whole system.

Data Collection
Engine sensor data are collected from various operating scenarios including different types of faults and fault magnitudes. The fault classes include four multiplicative sensor faults, a leakage in the intake manifold after the throttle, as well as nominal system operation, see Table 1. The sensor faults are introduced by altering the sensor output gain in the engine control system. Since the faults are injected in this way, the faulty signal output is used in the engine control scheme, which gives a more realistic fault realization compared to simulating the error in the data using post-processing. Each sensor fault is injected by multiplying the measured variable x in each sensor by a factor such that the resulting output is given as y = (1 + θ)x, where θ is the fault size and θ = 0 corresponds to the nominal case. The leakage in the intake manifold is introduced by opening valves with different diameters during operation. Data are collected while the engine is operated according to a driving cycle [58]. The cycle is shown in Figure 7 and is used since it covers a variety of operating conditions. One dataset has been collected for each fault class and fault size in Table 2, resulting in 26 datasets (24 fault scenarios and two fault-free datasets). Each fault is introduced in the dataset after approximately two minutes of the driving cycle.

Table 2: Fault classes and known magnitudes (i.e., θ for multiplicative sensor faults and leakage diameters) represented in training data. Data from the leakage f_iml have been collected from two known diameters of the orifice.
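The multiplicative fault model y = (1 + θ)x can be emulated on recorded data as a one-line operation. This is a simplified offline sketch for illustration only; as noted above, the real faults were injected in the engine control system, which an offline manipulation does not capture:

```python
import numpy as np

def inject_sensor_fault(x, theta, t_fault):
    """Multiplicative sensor fault: output y = (1 + theta) * x from
    sample index t_fault onward; theta = 0 gives the nominal output."""
    y = np.asarray(x, dtype=float).copy()
    y[t_fault:] *= (1.0 + theta)
    return y
```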

Residual Generation
The proposed method can be applied to any set of features used for fault diagnosis. In dynamic systems operated in varying transient operating conditions, such as the engine, using raw sensor data as features requires that a data-driven classifier captures these dynamics, since the signals can vary significantly over time. Here, a set of four residual generators r = (r_1, r_2, r_3, r_4) is constructed by comparing predictions from a set of Recurrent Neural Networks (RNN) with the corresponding sensor outputs, see Figure 8; these residuals are used as features for fault diagnosis in the case study. A summary of the set of residual generators is presented here. For the interested reader, a more detailed description is given in [8].
The prediction performance of two of the four residual generators is shown in Figure 9 and Figure 10, respectively. The figures show that the residual generators filter out most of the system dynamics and have a small relative prediction error. To show the impact of different faults on the residual output, three of the four residuals are plotted against each other for different fault classes in Figure 1. The different faults are projected into different directions in the residual space, which indicates that it is possible to distinguish between these faults. However, some fault classes are partially overlapping, e.g., a fault in the sensor measuring pressure after the intake manifold, f_ypim, and a leakage in the intake manifold, f_iml. It is expected to be more difficult to distinguish between these two faults since they are both related to the pressure in the intake manifold.

Evaluation
The proposed methods for quantitative fault diagnosis analysis and open-set fault classification are evaluated using data from the engine case study. Residual data from all fault scenarios in Table 2 are partitioned into batches, where each batch is used to estimate a multivariate normal distribution. Different batch sizes are tested to evaluate the effect on classification performance. First, the distinguishability measure (6) is used to evaluate fault detection and isolation performance. Then, the proposed D_j classifier is evaluated, including classification of unknown faults and fault size estimation.

Data Processing
To analyze how the batch size impacts fault diagnosis performance, outputs from the four residual generators are partitioned into batches of various lengths in the interval of 50-300 samples. For each batch, the mean and covariance matrix of a four-dimensional multivariate normal distribution are estimated.
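The partitioning and estimation step can be sketched as follows. Names are illustrative; the residual matrix is assumed to be N samples by four residuals:

```python
import numpy as np

def batch_pdfs(residuals, batch_size):
    """Partition an (N, d) residual matrix into non-overlapping batches
    and estimate a multivariate normal (mean, covariance) for each full
    batch; a trailing partial batch is discarded."""
    n = len(residuals) // batch_size
    pdfs = []
    for k in range(n):
        batch = residuals[k * batch_size:(k + 1) * batch_size]
        pdfs.append((batch.mean(axis=0), np.cov(batch, rowvar=False)))
    return pdfs
```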
To evaluate the modeling assumption that batch data are multivariate normally distributed, the distribution of residual data is compared to the estimated normal distribution. In Figure 11, samples from the four residuals are plotted pairwise against each other, together with an ellipse representing the estimated covariance at a 95% confidence level. Each column represents one batch of data and each row represents one combination of residual outputs. The blue ellipses represent the covariance estimated from all samples. To avoid an overestimation of the covariance due to outliers, the orange ellipses show the estimated covariances after removing 10% of the outliers in each batch. These two approaches to estimating the pdf of each batch are discussed further in later sections. It is visible that the multivariate normal assumption is an approximation, especially when there are model uncertainties and outliers in residual data. Still, it captures information about the data distribution and the correlation between features which can be used for fault classification.
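One way to obtain the trimmed estimates mentioned above is to discard the samples with the largest Mahalanobis distance before re-estimating the parameters. The paper does not specify the exact trimming rule, so the following is a sketch under that assumption:

```python
import numpy as np

def trimmed_gaussian(batch, trim=0.10):
    """Estimate (mean, covariance) after discarding the trim fraction of
    samples with the largest Mahalanobis distance, which reduces the
    influence of outliers on the covariance estimate."""
    mu = batch.mean(axis=0)
    S = np.cov(batch, rowvar=False)
    diff = batch - mu
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
    keep = d2 <= np.quantile(d2, 1.0 - trim)
    inliers = batch[keep]
    return inliers.mean(axis=0), np.cov(inliers, rowvar=False)
```

A single trimming pass is shown; iterating the estimate-and-trim step would approach more formal robust estimators such as the minimum covariance determinant.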
Before designing the set of D_j classifiers, the set of estimated pdfs for each fault class is randomly split into a training and a validation set, where 67% are used for training. The training set from each fault class is used to model the fault mode Ω̂_i for each of the fault classes represented in training data, see Table 2. The training set covers different realizations and magnitudes of each fault. Note that pdfs estimated from the fault-free case are included in all fault modes, i.e., Ω̂_NF ⊆ Ω̂_j for all j = 1, 2, ..., m.

Evaluating Fault Diagnosis Performance
The first step of the analysis is to evaluate the set of modeled fault modes to quantify how easy it is to distinguish between the different fault classes. In the analysis of fault diagnosis performance using the distinguishability measure, all available datasets from the different fault classes are used. The distinguishability measure is evaluated for all pdfs p ∈ Ω̂_i with respect to all other fault modes, and the distributions of D_i,j(p) values for different fault magnitudes when the batch size is N = 100 are plotted in Figure 12. The subplot at position (i, j) shows the distribution of D_i,j(p) for all p ∈ Ω̂_i. The marks on each vertical line represent the 10%, 25%, 50%, 75%, and 90% quantiles. The results show that detection and isolation performance, in general, improves with increasing fault magnitude and that all faults are distinguishable from each other. In addition, fault f_ypic should be the easiest of the three sensor faults to distinguish, while, e.g., f_ywaf is more difficult since its distinguishability measure is significantly smaller. Another observation is that the distinguishability measure is not symmetric, i.e., it might not be as easy to distinguish mode f_i from f_j as the other way around [7]. For example, it is easier to distinguish f_ypic from f_ypim than vice versa, which is shown by the larger distinguishability measure, see Figure 12. Also, it is easier to distinguish each fault mode from the fault-free mode, see the leftmost column in Figure 12, than from the other fault modes, which is consistent with (4).
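The asymmetry noted above is inherited from the KL divergence itself, which is easy to verify in the scalar Gaussian case using the closed-form expression. This is a self-contained illustration, not part of the paper's analysis:

```python
import numpy as np

def kl_normal_1d(mu0, s0, mu1, s1):
    """Closed-form K(N(mu0, s0^2) || N(mu1, s1^2)) for scalar normals."""
    return (np.log(s1 / s0)
            + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5)

# K(p||q) and K(q||p) generally differ, even for equal means:
a = kl_normal_1d(0.0, 1.0, 0.0, 2.0)   # ≈ 0.318
b = kl_normal_1d(0.0, 2.0, 0.0, 1.0)   # ≈ 0.807
```

Distinguishing the wide distribution from the narrow one is "easier" than the reverse, mirroring how distinguishing f_i from f_j can differ from the opposite direction.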
Some of the results are summarized in Table 3, corresponding to the detection performance of the different fault classes. The table shows the mean value of D_i,j(p) when p is estimated from batches of different sizes (between 50 and 300 samples). It is visible that D_i,j(p) increases for fault sizes with higher distinguishability values when increasing the batch size, while it is stable or slightly decreasing when the distinguishability measure is small. In general, longer batch sizes make it easier to distinguish fault f_ypic, while it becomes more difficult for the other three fault classes. One explanation is that shorter batches can make it easier to distinguish the impact of the fault when fault excitation and residual noise level vary over time.

Fault Classification
The open set fault classification algorithm described in Section 6 is implemented, where a D_j classifier is trained for each known fault class f_j. A threshold is selected based on the distribution of the within-class distinguishability measure (9), using kernel density estimation, such that 5% of the training data are classified as outliers. The calibrated thresholds for each fault class are presented in Table 4. For comparison, the kernel density estimates of both training and validation data for one fault class are shown in Figure 4.
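The threshold calibration can be sketched as fitting a kernel density estimate to the within-class divergence values and picking the point where the right tail mass equals the target outlier rate. This is illustrative; the paper does not specify this exact numerical procedure:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_threshold(divergences, outlier_rate=0.05):
    """Calibrate a rejection threshold J_j from within-class divergence
    values: fit a KDE and return the point whose right tail mass equals
    the target outlier rate (divergences are assumed non-negative)."""
    kde = gaussian_kde(divergences)
    grid = np.linspace(min(divergences), max(divergences) * 2, 2000)
    pdf = kde(grid)
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]                  # normalize the discretized CDF
    return grid[np.searchsorted(cdf, 1.0 - outlier_rate)]
```

Compared to simply taking the empirical 95% quantile, the KDE smooths the tail, which matters when the number of training pdfs per class is small.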

Classification of known fault classes
First, the set of D_j classifiers is evaluated using pdfs from the known fault classes in the validation set. Since each batch of residual data is modeled as a multivariate normal distribution, estimation of the covariance matrix can be sensitive to outliers in the residual outputs. Therefore, an additional set of D_j classifiers is trained using trimmed estimates of the covariance matrix, obtained by first removing 10% of the outliers, denoted D_j^trim. For comparison, two sets of one-class support vector machines (1SVM) [59] are trained. The 1SVM classifiers are implemented using the function fitcsvm in Matlab, and their kernel parameters are fit to training data using a subsampling heuristic [60]. When analyzing the results of the D_j classifier, the false alarm rate (< 2%) was lower than the 5% outlier rate used when selecting the threshold J_j. Therefore, an outlier rate of 2% was selected when training the 1SVM classifiers to give more comparable results. The first 1SVM classifier uses the mean of the pdfs as input, referred to as 1SVM-µ, and the second uses the raw residual data as input, referred to as 1SVM-r. Results are presented when the pdfs are estimated using two different batch sizes, the smaller one being 50 samples.

Ideally, the probability of rejection should be 100% for all non-zero fault sizes, except when classifying data from the same fault class that the classifier has been trained on. In that case, the probability of rejection should be as small as possible, since a rejection would otherwise mean that the true fault class is rejected. Classification performance is consistent with the previous results in Figure 12, showing that f_ypic is the easiest fault to diagnose, since the probabilities of rejecting the wrong fault hypotheses are higher compared to the other fault scenarios. The most difficult fault to diagnose is f_ywaf. The results in Table 3 are also consistent with the analysis in Figure 12, where the D_j classifier is better for smaller batch sizes.
When comparing the results of the four algorithms in Figure 13 and Figure 14, the 1SVM-r classifier has the overall worst performance, while 1SVM-µ improves with increasing batch size. However, using trimmed estimates of the covariance matrix in the D_j^trim classifiers improves classification performance for both shorter and longer batch sizes. Table 5 shows the detection performance values comparing D_j, D_j^trim, and 1SVM-µ for the two different batch sizes. These results indicate that outliers in residual data could explain the reduced performance of the original D_j classifiers for longer batch sizes. Thus, one solution to improve classification performance is to apply robust estimation of the normal distribution parameters.

Another interesting observation is that D_j and D_j^trim are significantly better at distinguishing fault f_iml from f_ypim compared to the 1SVM classifiers. These faults are expected to be difficult to distinguish from each other since data from the two classes are overlapping, as illustrated in Figure 1. However, classifying distributions, instead of only sample means, makes it possible to distinguish between the two classes, which is expected from the quantitative analysis in Figure 12. The results from the experiments show that approximating residual data using a multivariate normal distribution is sufficient in this case study to distinguish between the different fault classes. However, it is likely that classification accuracy can be improved by using a more flexible model to estimate the distribution, at the expense of increased computational cost. Another approach to improve classification accuracy, especially detection of small faults, is to compare the diagnosis output over consecutive batches.
Since more data will be collected over time, the classification accuracy in Figure 13 and Figure 14, respectively, can most likely be improved by allowing a longer time to detect.
An advantage of using the D_j classifiers is that less memory is needed to store information, since it is sufficient to store distribution parameters rather than the original batch data. If remote diagnosis solutions are used, which have access to more computational capabilities for data analysis than an on-board diagnosis system, it is also relevant to minimize the amount of transmitted data [61,62].
A limitation of the D_j classifier is that evaluating (6) corresponds to a nearest neighbor search problem, which can be computationally heavy when the cardinality of the set Ω̂_j is large. Parallelization and the use of different heuristics could help prune the search space and thus significantly reduce the computation time, see for example [63].

Classification of unknown fault class
Fault classification is here performed by rejecting fault hypotheses, as described in Section 6.3, where each D_j classifier is used to determine whether fault class f_j can explain the distribution of batch data or not. Note that each pdf in the validation set is evaluated against each known fault class, independently of the other fault classes. Thus, the performance of classifying unknown fault classes is based on the probabilities that all known fault classes are rejected. This can be evaluated by using the previous results in Figures 13 and 14. A scenario where one of the fault classes is assumed to be unknown can be evaluated by ignoring the results in the column that corresponds to the D_j classifier which models the unknown fault class.
In the first example, an unknown fault scenario is evaluated using validation data from fault f_ypic, i.e., it is assumed that there are no training data from that fault. To evaluate the probability that the known fault classes, NF, f_ypim, and f_ywaf, are correctly rejected in this scenario, the first row is analyzed in Figure 13 (and Figure 14) while ignoring the second column, which corresponds to the D_j classifier modeling fault f_ypic. In this case, all classifiers have more than 80% probability of rejecting the NF class for all realizations of f_ypic in validation data. Similarly, the other known fault classes have a high probability of correctly being rejected, except for the 1SVM-r classifier modeling fault f_ypim, which has a low probability of rejecting f_ypim when f_ypic is of size 5%. This shows that f_ypic is likely to be correctly classified as an unknown fault, since all known fault classes will be rejected with high probability. Note that the probabilities of rejecting each fault class are evaluated by classifying one pdf. The probability of correctly classifying smaller faults can be significantly improved by evaluating multiple pdfs corresponding to consecutive batches and comparing the rate at which each fault class is rejected with the probability that it is falsely rejected when it is the true fault class.
As a second example, fault f_iml is treated as an unknown fault while the other fault classes in the training set are known. Then, the last row is analyzed in Figure 13 (and Figure 14) while ignoring the last column. When comparing the probability of correctly rejecting each known fault class in the last row, 1SVM-µ performs better than the D_j and D_j^trim classifiers at detecting the fault (rejecting the NF class). The 1SVM-r classifier modeling the NF class has significantly worse performance, especially for the 4 mm leakage. However, when analyzing the probability of rejecting the other known fault classes, f_ypic, f_ypim, and f_ywaf, the D_j and D_j^trim classifiers have an overall better performance compared to 1SVM-µ and 1SVM-r, especially for the 4 mm leakage. The 1SVM-r classifier is not able to reject f_ypic and f_ypim for any of the leakage sizes, since the probabilities of correctly rejecting those fault classes are not significantly higher than the probability that each classifier falsely rejects the true fault class. For the larger batch size, 1SVM-µ performs significantly better for the 6 mm leakage compared to the 4 mm leakage but is not able to reject f_ypim.
The two examples show the ability of the proposed open set fault classification algorithm, using D_j or D_j^trim classifiers, to identify unknown fault classes. The examples also illustrate the computational benefits: there is no need to recalibrate the whole set of D_j classifiers for the existing known fault classes when updating one classifier D_j with new training data or when including a new classifier for a new fault class.

Fault Size Estimation
The fault size estimation algorithm (10) described in Section 7 is applied to the validation set, where the corresponding fault size of each pdf in the training set is known. The numerical KL divergence (11) is evaluated using 1000 Monte Carlo samples, and the cost function is minimized using the 10 pdfs in training data with the smallest D_j(p) values. The fault size estimate for each pdf p using (10) is denoted θ̂_KL.
The prediction results for each fault class are shown in Table 6 by presenting the 10% and 90% quantiles of the fault size predictions. To evaluate the proposed data-driven fault size estimation algorithm (10), it is compared to an estimate θ̂_mean, computed as the mean fault size of the 10 pdfs with the smallest D_j(p) values. For the proposed algorithm, the true fault sizes are within the intervals, while for the estimate θ̂_mean, the true fault sizes are outside the intervals for the largest realizations of f_ywaf. The algorithm (10) in general gives a narrower interval compared to the mean estimate. Note that a limitation of the evaluated methods is that they are not capable of estimating fault sizes beyond what is available in training data.
The prediction error correlates with the analysis of distinguishability between different fault modes in Figure 12. The intervals are smaller for f_ypic, while f_ywaf has the largest intervals. One solution to improve the estimation accuracy over time is to consider the distribution of the estimated fault size over consecutive batches.

Concluding Remarks and Future Works
A data-driven framework for fault diagnosis of technical systems and time-series data is proposed that can handle imbalanced training data and unknown faults. The KL divergence is used as a similarity measure when evaluating whether new data can be explained by a given fault class or not. An advantage of the proposed D_j classifier, with respect to other black-box models, is interpretability, where the quantitative performance analysis and modeling of different fault modes can give valuable insights about the nature of different faults, as well as the ability to integrate fault size estimation within the proposed framework. The open set fault classification algorithm consists of a set of one-class classifiers modeling each fault class, which makes it possible to identify all plausible fault hypotheses, including scenarios with likely unknown faults. Instead of sample-by-sample classification, the KL divergence is used to classify whether a batch of data can be explained by a given fault class or not. Experiments using real datasets from an internal combustion engine test bench show that the proposed framework can predict which faults are easy to classify. They also show that the D_j classifiers can classify faults that were difficult to distinguish using the 1SVM classifiers, and that it is possible to give an accurate estimation of the fault size without the need for a parametric model of the fault.

Table 6: Results from using the fault size estimation algorithm (10) on engine residual data. For each fault size θ, the intervals represent the 10% and 90% quantiles of the estimates θ̂_KL, computed using (12), and θ̂_mean, the mean fault size computed from the 10 pdfs with smallest D_j(p) values.

Estimating pdfs from training data becomes complicated when the number of features grows. For future work, the objective is to adapt the proposed methods to large-scale problems by using distributed fault diagnosis techniques, but also to reduce the computational complexity of, e.g., (6) when training data grows, by using different heuristics or systematic search algorithms. Another interesting continuation of this work is to extend the proposed methods to applications in condition monitoring and prognostics, using batch data to track changes in fault size in order to predict system degradation rate.