A Deep Adversarial Approach Based on Multi-Sensor Fusion for Semi-Supervised Remaining Useful Life Prognostics

Multi-sensor systems are proliferating in the asset management industry. Industry 4.0, combined with the Internet of Things (IoT), has ushered in the requirements of prognostics and health management systems to predict the system’s reliability and assess maintenance decisions. State of the art systems now generate big machinery data and require multi-sensor fusion for integrated remaining useful life prognostic capabilities. When dealing with these data sets, traditional prediction methods are not equipped to handle the multiple sensor signals in unison. To address this challenge, this paper proposes a new, deep, adversarial approach to any remaining useful life prediction in which a novel, non-Markovian, variational, inference-based model, incorporating an adversarial methodology, is derived. To evaluate the proposed approach, two public multi-sensor data sets are used for the remaining useful life prediction applications: (1) CMAPSS turbofan engine dataset, and (2) FEMTO Pronostia rolling element bearing data set. The proposed approach obtains favorable results when against similar deep learning models.


Introduction
Reliability is defined as the ability of a product or system to perform its required functions without failure for a specified time, and when used under specified conditions. Therefore, reliability engineering has long been tasked with predicting the remaining useful life of systems by incorporating all available data. Reliability engineering has been given technologies incorporating cheap sensing with the Internet of Things (IoT) generating multi-dimensional data sets through Industry 4.0 [1]. With this new data at the engineer's fingertips, more sophisticated methodologies to handle this data have been developed and expanded within the prognostics and health management (PHM) field.
These data sets are often costly and time-consuming to label [2]. The engineer therefore must make an economic decision on how much data to label. Therefore, the greatest economic benefit would be to take advantage of unsupervised learning-based methodologies. To understand relevant system health states without labeling, deep learning methodologies have been shown to be a technique employed without the need for previous knowledge of degradation processes [3].
Most recently, remaining useful life (RUL) research focused on fully supervised deep learning methodologies has had success in RUL prediction [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19]. These models depend on the analyst having access to a fully labeled dataset. Therefore, these RUL prediction accuracies require the use of accurate training data labels. Moreover, this previous research does not attempt to develop the underlying generative or inference model. A reliability engineer does not always have the resources to label all the data necessary to train a deep learning model. A valuable methodology would provide the flexibility to include a small percentage of labeled data as it becomes available and as resources allow. Generative modeling is a class of modeling techniques which provides the ability to predict RUL without having to label what could be massive multidimensional sensor data.
There have been recent efforts in generative modeling research, although it has yet to be adapted and applied to reliability and machine health prognostics. Indeed, Bayer and Osendorfer [20] and Chung et al. [21] both employed the variational autoencoder (VAE) principles to times series observations. Krishnan et al. [22] encodes the state space assumptions from within their proposed structure inference deep Kalman filter-based methodology. Karl et al. [23] proposes a VAE principled state-space filtering methodology in which the latent space is forced to fit the transition. Mescheder et al. [24] present VAEs based on adversarial training, and they achieve the flexibility to represent families of conditional distributions over latent variables. Hu et al. [25] combine generative adversarial networks (GANs) with VAE by proposing a new interpretation of adversarial domain adaptation (ADA) and a unifying generative modeling framework named through comparisons with the wake-sleep algorithm [26]. These methods, while suited for their applications in computer science, lack the requirements for RUL predictions, such as time series applications. The Markovian assumption is also utilized, where it is assumed that all information of past observations is contained within the last system state; however, for PHM, this is insufficient. Multiple operating conditions increase the degradation complexity of the RUL predictions, and some degradation paths are inherently non-Markovian (e.g., crack growth). VAE on their struggle with low probability events, like curb strike events, inherent in large systems [27]. Additionally, for PHM applications with unsupervised RUL, these methods lack the VAE combined with the adversarial training of a GAN on time-series data to provide predictions.
To address these problems, this paper proposes a deep generative state-space modeling methodology for the remaining useful life prognostics of physical assets. The mathematical framework underpinning the proposed methodology delivers the following novel contributions for RUL predictions: (i) Non-Markovian transitions from multi-dimensional sensor data by generalizing a deep generative filtering approach for remaining useful life estimation of the system; (ii) a modeling approach that incorporates both variational and adversarial mechanisms; (iii) flexibility to handle both unsupervised and semi-supervised learning for the estimation of the remaining useful life. This method has vast applications for RUL predictions on both new and existing system assets.
The rest of the paper is organized as follows. Section 2 provides a brief overview of GAN, VAE and state-space modeling. Section 3 presents the proposed methodology and the underlying mathematical framework. Section 4 overviews the experimental results. Section 5 concludes with discussions and future work.

Background
The generative modeling research as mentioned above [20][21][22][23][24][25], all aims to tackle the problem of both a generative manifold space and inference modeling for prediction. There are slight differences between generative and inference modeling, but fundamentally they aim to solve the same problem: black-box neural transformations for implicit distribution modeling between the latent and visible spaces. For RUL estimation, reliability and PHM, this is equivalent to modeling the underlying degradation space, z, that is a result of the acquired observed sensor data set, x.
Traditional generative modeling approaches tend to distinguish between latent and visible variables clearly, and to treat them differently. However, a key aspect of generative modeling is that a clear boundary between the latent and visible variables (as well as generation and inference) is not necessary. Instead, viewing generative modeling as a symmetric pair helps in modeling and understanding as shown in Figure 1.

Generative Adversarial Networks
Generative Adversarial Networks (GANs) are a class of generative modeling techniques where two neural networks compete via a minimax game [28]. This game's objective is to develop/learn a generator distribution ( ) able to generate fake data identical to the real data distribution ( ). However, the generator does not directly have access to the real data. Instead, the generator distribution, ( ), transforms a vector of random noise, ( ), with an objective function, ( ). The generator is then trained against an adversarial discriminator network parameterized by a separate neural network whose objective, ( ), is to classify the data as real or fake, as shown in Figure 2.

Real Data
Fake Data z~p z (z) There is no mechanism within the GAN training to constrain and control the Nash Equilibrium point; however, the optimal discriminator ( ) should converge to equilibrium [29]. Formally, Equation (1) shows this value function: where G(z) is the generator objective function, D(x) is the discriminator objective function, Pdata(x) is the data distribution and Pz(x) is the noise distribution.

Variational Autoencoders
Variational autoencoders (VAEs) are a class of generative models which develop both an inference and a generative model [27]. VAEs attempt to develop a model of latent variables, , which can generate the observed data, . Formally, this is expressed as:

Generative Adversarial Networks
Generative Adversarial Networks (GANs) are a class of generative modeling techniques where two neural networks compete via a minimax game [28]. This game's objective is to develop/learn a generator distribution P G (x) able to generate fake data identical to the real data distribution P data (x). However, the generator does not directly have access to the real data. Instead, the generator distribution, P G (x), transforms a vector of random noise, P z (z), with an objective function, G(z). The generator is then trained against an adversarial discriminator network parameterized by a separate neural network whose objective, D(x), is to classify the data as real or fake, as shown in Figure 2.

Generative Adversarial Networks
Generative Adversarial Networks (GANs) are a class of generative modeling techniques where two neural networks compete via a minimax game [28]. This game's objective is to develop/learn a generator distribution ( ) able to generate fake data identical to the real data distribution ( ). However, the generator does not directly have access to the real data. Instead, the generator distribution, ( ), transforms a vector of random noise, ( ), with an objective function, ( ). The generator is then trained against an adversarial discriminator network parameterized by a separate neural network whose objective, ( ), is to classify the data as real or fake, as shown in Figure 2.

Real Data
Fake Data z~p z (z) There is no mechanism within the GAN training to constrain and control the Nash Equilibrium point; however, the optimal discriminator ( ) should converge to equilibrium [29]. Formally, Equation (1) shows this value function: where G(z) is the generator objective function, D(x) is the discriminator objective function, Pdata(x) is the data distribution and Pz(x) is the noise distribution.

Variational Autoencoders
Variational autoencoders (VAEs) are a class of generative models which develop both an inference and a generative model [27]. VAEs attempt to develop a model of latent variables, , which can generate the observed data, . Formally, this is expressed as: There is no mechanism within the GAN training to constrain and control the Nash Equilibrium point; however, the optimal discriminator D(x) = P data (x)/[P data (x) + P G (x)] should converge to equilibrium [29]. Formally, Equation (1) shows this value function: where G(z) is the generator objective function, D(x) is the discriminator objective function, P data(x) is the data distribution and P z(x) is the noise distribution.

Variational Autoencoders
Variational autoencoders (VAEs) are a class of generative models which develop both an inference and a generative model [27]. VAEs attempt to develop a model of latent variables, z, which can generate the observed data, x. Formally, this is expressed as: Sensors 2020, 20, 176 4 of 18 It is common for p(x|z) ≡ p θ (x|z) to be developed and parameterized by a neural network with parameters θ. For most cases, the posterior distribution p(z|x) is intractable.
However, an approximate posterior distribution, q φ (z|x), can be used to maximize the evidence lower bound (ELBO) on the marginal data log-likelihood: From this, the VAE objective is equivalent to minimizing the Kullback-Liebler (KL) divergence between q φ (z|x) and p θ (x|z), where p θ (x|z) and q φ (z|x) are parameterized by two neural networks with parameters φ and θ, as shown in Figure 3. It is common for ( | ) ≡ ( | ) to be developed and parameterized by a neural network with parameters . For most cases, the posterior distribution ( | ) is intractable.
However, an approximate posterior distribution, ( | ), can be used to maximize the evidence lower bound (ELBO) on the marginal data log-likelihood: From this, the VAE objective is equivalent to minimizing the Kullback-Liebler (KL) divergence between ( | ) and ( | ) , where ( | ) and ( | ) are parameterized by two neural networks with parameters and θ, as shown in Figure 3.

Encoder Decoder
Input Hidden Output q ϕ (z|x) p θ (x|z) The training of a VAE involves the training of two neural networks, the encoder, sometimes referred to as the recognition model, and the decoder, ( | ) sometimes referred to as the generative model. The encoder learns the relevant features of the input data and compresses the information to the latent hidden space. The decoder then attempts to generate signals (e.g., images) identical to the input data, and the reconstruction error is then minimized. Within the computer vision community, VAEs tend to produce blurred images that are not as sharp as those produced by other generative models. Within an engineering context, VAEs on their own can result in a common issue with particle filtering algorithms: Without a fully expressive generative model capable of handling extremely low probability events or sensor reading interactions, the resulting prognosis model may not have considered these non-Markovian events.

Proposed Methodology
Given the complexities and associated uncertainty of the fault diagnostic and prognostic problem, a proposed methodology would be one that is flexible enough to include new sets of information as they become available. Expert opinion, black swan events, abnormal operating conditions, knowledge of the underlying failure modes, physics of failure models and partially relevant information, can all be included within the remaining useful life estimation. While this information can be valuable, the methodology should also adequately generalize this data. For example, extracting relevant features, which may be known, may not be able to account for noisy sensor signals or operating conditions outside the norm. With this end, we propose the methodology shown in Figure 4. The training of a VAE involves the training of two neural networks, the encoder, q φ (z|x) sometimes referred to as the recognition model, and the decoder, p θ (z|x) sometimes referred to as the generative model. The encoder learns the relevant features of the input data and compresses the information to the latent hidden space. The decoder then attempts to generate signals (e.g., images) identical to the input data, and the reconstruction error is then minimized.
Within the computer vision community, VAEs tend to produce blurred images that are not as sharp as those produced by other generative models. Within an engineering context, VAEs on their own can result in a common issue with particle filtering algorithms: Without a fully expressive generative model capable of handling extremely low probability events or sensor reading interactions, the resulting prognosis model may not have considered these non-Markovian events.

Proposed Methodology
Given the complexities and associated uncertainty of the fault diagnostic and prognostic problem, a proposed methodology would be one that is flexible enough to include new sets of information as they become available. Expert opinion, black swan events, abnormal operating conditions, knowledge of the underlying failure modes, physics of failure models and partially relevant information, can all be included within the remaining useful life estimation. While this information can be valuable, the methodology should also adequately generalize this data. For example, extracting relevant features, Sensors 2020, 20, 176 5 of 18 which may be known, may not be able to account for noisy sensor signals or operating conditions outside the norm. With this end, we propose the methodology shown in Figure 4.  The methodology has two distinct phases: (1) Unsupervised learning assessment of RUL, (2) Semi-supervised learning assessment of RUL. It starts with the raw data signal fed into the unsupervised variational adversarial filter. Without knowledge of labeling (e.g., the system health states) at the start of operation of the system, this stage of development requires the use of unsupervised remaining useful life estimation. Once the system has had operational time, the engineer can start labeling data in a semi-supervised iterative loop, i.e., to identify the system's health states with corresponding input sensor data patterns. As it may not be feasible (time and cost-wise) to do so for all the available data, experiments have shown that semi-supervised methodologies with only a few percentages of the data set labeled can substantially improve the unsupervised methods [30]. Therefore, as the engineer labels data, the framework is robust enough to handle this percentage of labeled data, as shall be demonstrated later in Section 4.

Unsupervised Remaining Useful Life Formulation
In this work, we propose a mathematical formulation that encapsulates the following features: Both unsupervised and semi-supervised feature learning, adversarial-variational state-space modeling with non-Markovian transitions (i.e., it is not assumed that all information regarding past observation is contained within the last system state), adversarial training mechanism on the training of the recognition ( | : ), and variational Bayes for the inference and generative model ( | : ). As shown in Figures 5 and 6, we set as the observed sensor data, as the latent system health state (e.g., crack length, degradation), and is the target domain relevant to the adversarial training ∈ 0,1, … , . Blue lines represent the adversarial mechanism, dashed lines indicate inference processes and solid lines indicate a generative process. The transition parameters, , are inferred via a neural network. Past observations are directly included in the inferential model output. The proposed mathematical framework does not assume that all the information relevant to parameters is encoded within . The methodology has two distinct phases: (1) Unsupervised learning assessment of RUL, (2) Semi-supervised learning assessment of RUL. It starts with the raw data signal fed into the unsupervised variational adversarial filter. Without knowledge of labeling (e.g., the system health states) at the start of operation of the system, this stage of development requires the use of unsupervised remaining useful life estimation. Once the system has had operational time, the engineer can start labeling data in a semi-supervised iterative loop, i.e., to identify the system's health states with corresponding input sensor data patterns. As it may not be feasible (time and cost-wise) to do so for all the available data, experiments have shown that semi-supervised methodologies with only a few percentages of the data set labeled can substantially improve the unsupervised methods [30]. Therefore, as the engineer labels data, the framework is robust enough to handle this percentage of labeled data, as shall be demonstrated later in Section 4.

Unsupervised Remaining Useful Life Formulation
In this work, we propose a mathematical formulation that encapsulates the following features: Both unsupervised and semi-supervised feature learning, adversarial-variational state-space modeling with non-Markovian transitions (i.e., it is not assumed that all information regarding past observation is contained within the last system state), adversarial training mechanism on the training of the recognition q φ (z t |x 1:t ), and variational Bayes for the inference and generative model p θ (x t |z 1:t ). As shown in Figures 5 and 6, we set x t as the observed sensor data, z t as the latent system health state (e.g., crack length, degradation), and y t is the target domain relevant to the adversarial training y ∈ 0, 1, . . . , RUL. Blue lines represent the adversarial mechanism, dashed lines indicate inference processes and solid lines indicate a generative process. The transition parameters, θ t , are inferred via a neural network. Past observations are directly included in the inferential model output. The proposed mathematical framework does not assume that all the information relevant to parameters φ t is encoded within z t .  ... z 1 z 2 z T z t Figure 6. Inference training model.
To establish the training optimization, we denote the latent sequence z ∈ ⊂ ℝ as a set of real numbers and observations as x ∈ ⊂ ℝ . Now can be, but is not limited to, a multidimensional sensor data set from a large asset. The observations, x , are not constrained to a Markovian transition assumption. For engineering problems (e.g., crack growth and environmental effects on RUL) these transitions can be complex non-Markovian. Therefore, the degradation sequence ( | : ) generated by the discrete multi-dimensional sensor data sequences = ( , , … , ) and latent sequences : = ( , , … , ) are of interest to the engineer. This is shown in Equation (4): where : , ∈ ⊂ ℝ denotes the latent sequence. The basis of the latent dynamical system is assumed to have an emission model ( | : , : ) and transition model ( | : ) . Two assumptions are classically imposed on the emission and transition models as shown in Equations (5) and (6), These equations capture the assumption that the current state, , holds complete information for the observations , and the subsequent state . For noisy multidimensional sensor data sets with complex non-Markovian transition, this assumption is insufficient. The proposed mathematical formulation characterizes the state-space model without these assumptions.   ... z 1 z 2 z T z t Figure 6. Inference training model.
To establish the training optimization, we denote the latent sequence z ∈ ⊂ ℝ as a set of real numbers and observations as x ∈ ⊂ ℝ . Now can be, but is not limited to, a multidimensional sensor data set from a large asset. The observations, x , are not constrained to a Markovian transition assumption. For engineering problems (e.g., crack growth and environmental effects on RUL) these transitions can be complex non-Markovian. Therefore, the degradation sequence ( | : ) generated by the discrete multi-dimensional sensor data sequences = ( , , … , ) and latent sequences : = ( , , … , ) are of interest to the engineer. This is shown in Equation (4): where : , ∈ ⊂ ℝ denotes the latent sequence. The basis of the latent dynamical system is assumed to have an emission model ( | : , : ) and transition model ( | : ) . Two assumptions are classically imposed on the emission and transition models as shown in Equations (5) and (6), These equations capture the assumption that the current state, , holds complete information for the observations , and the subsequent state . For noisy multidimensional sensor data sets with complex non-Markovian transition, this assumption is insufficient. The proposed mathematical formulation characterizes the state-space model without these assumptions. To establish the training optimization, we denote the latent sequence z t ∈ Z ⊂ R n z as a set of real numbers n z and observations as x t ∈ X ⊂ R n x . Now X can be, but is not limited to, a multi-dimensional sensor data set from a large asset. The observations, x t , are not constrained to a Markovian transition assumption. For engineering problems (e.g., crack growth and environmental effects on RUL) these transitions can be complex non-Markovian. Therefore, the degradation sequence p(x t |z 1:t−1 ) generated by the discrete multi-dimensional sensor data sequences x t = (x 1 , x 2 , . . . , x t ) and latent sequences z 1:t−1 = (z 1 , z 2 , . . . , z t−1 ) are of interest to the engineer. This is shown in Equation (4): where z 1:t−1 , z t ∈ Z ⊂ R n z denotes the latent sequence. The basis of the latent dynamical system is assumed to have an emission model p(x t |x 1:t−1 , z 1:t ) and transition model p(z t |z 1:t−1 ). Two assumptions are classically imposed on the emission and transition models as shown in Equations (5) and (6), Sensors 2020, 20, 176 7 of 18 These equations capture the assumption that the current state, z t , holds complete information for the observations x t , and the subsequent state z t+1 . For noisy multidimensional sensor data sets with complex non-Markovian transition, this assumption is insufficient. The proposed mathematical formulation characterizes the state-space model without these assumptions.
Therefore, to derive the proposed mathematical framework of the proposed methodology, we first put forward the variational lower bound objective function from Equation (4), given that we do not make the Markov assumption from Equations (5) and (6). Thus, we have: As we know, Substituting into Equation (7) we get, Rearranging (9) we get, Applying the product rule on (10) we have, Applying the quotient rule on (11) we have, Separating (12) we have, However, we know that, Therefore, we have, where we simultaneously want to minimize the Kullback-Liebler (KL) divergence and maximize the variational (evidence) lower bound (ELBO), L(θ, φ; x 1:t ), as shown in Equation (16): Sensors 2020, 20, 176 8 of 18 Now, rearranging Equation (16), we have the non-Markovian variational lower bound derived for time series data in Equation (17): To add in adversarial training, we follow Goodfellow, et al. [28] and rewrite the optimization function from Equation (17) to Equation (18) as follows: We now have an objective function which gives us an expressive q φ (z t |x t , z 1:t−1 ), that is, we have a mathematical framework the characterizes the state-space model without the restrictive assumptions outlined in Equations (5) and (6). Additionally, this mathematical framework contains both the generative and inference models of the system state that allows us to perform fault diagnostics and prognostics as well as the RUL of the system assessment.

Semi-Supervised Loss Function
Semi-supervised initialization involves training of the chosen model's architecture with an incrementally increasing set of labeled data. This is an important aspect to explore, because as the engineer gains more knowledge about a new system, one can label small sets of data, which are known to be system degradation versus healthy operation to increase the system's health state prediction [29]. This approach can improve the quality of the results via a semi-supervised loss, L, function given by Equation (19): In the context of the proposed adversarial framework, during the unsupervised training, the discriminator learns features to avoid classifying the generated data as real data, but these features might not be the best representation. To improve the discriminator and develop more meaningful features for the system's health states over time, labels are used. This is possible by writing the loss function, L, within training to some predetermined number of epochs as follows: L supervised = −E x 1:t ,y 1:t ∼p data (x 1:t ,y 1:t ) logp model (y 1:t x 1:t , y 1:t < K + 1) (20) where x and y are the same as defined previously. p model corresponds to the trained model. This cost function adds a cross-entropy loss for the first K discriminator outputs. The unsupervised cost is the same as the original GAN (see Equation (2)). However, there is a slight change, as now K + 1 corresponds to the probability of the sample being false [31]. The discriminator is used as a competent classifier, given a subset of the dataset. In the context of the proposed mathematical framework, the discriminator will be used as a feature extractor, given a subset of the dataset to improve the system's health state identification results.

Results and Discussion
The Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) multi-sensor data set was used to evaluate the proposed methodology [32]. C-MAPSS is a simulation tool developed in MATLAB and Simulink environment for commercial turbofan engines. The model outputs multiple sensor signal values corresponding to the input parameters of an engine component degradation level or health indicator. In order to adjust to a specific problem that the user is trying to solve, operational profile, closed-loop controllers and environmental conditions, can all be adjusted. The Sensors 2020, 20, 176 9 of 18 term "multi-sensor" for this data set references multiple types of sensors, operating conditions, environmental conditions, flight numbers and trajectories. The 90,000-pound thrust class engine and the CMAPSS simulation package allows operations at (1) Mach numbers from 0 to 0.90, (2) Altitudes measuring sea level to 40,000 feet, and (3) sea-level temperatures measuring −60 to 103 • F. Figure 7 shows the engine's main elements with the following abbreviations: fan speed (N1), low-pressure turbine (LPT), low-pressure compressor (LPC), high-pressure compressor (HPC), core speed (N2), high-pressure turbine (HPT). The PHM 2008 competition data set developed from using the C-MAPSS program is used as an example of application in this paper [33]. Four data sets, FD001 through FD004, are available, and have the properties shown in Table 1  The four data sets have a combination of two fault conditions: high-pressure compressor (HPC) degradation and fan degradation. The data set includes the true RUL to measure prediction performance against. The data set is separated into training and test sets consisting of 26 different sensor measurements, three conditions of operation, flights and mission times. Each of the engines within the dataset initiates with different levels of manufacturing variation and initial degradation. This information is hidden from the engineer, and is not considered a fault condition. The three operational settings do have a substantial effect upon engine performance. These settings are known. Finally, the sensor data is contaminated with noise. Figure 8 shows an example of one of the sensor measurements for one trajectory with its RUL. The PHM 2008 competition data set developed from using the C-MAPSS program is used as an example of application in this paper [33]. Four data sets, FD001 through FD004, are available, and have the properties shown in Table 1  The four data sets have a combination of two fault conditions: high-pressure compressor (HPC) degradation and fan degradation. The data set includes the true RUL to measure prediction performance against. The data set is separated into training and test sets consisting of 26 different sensor measurements, three conditions of operation, flights and mission times. Each of the engines within the dataset initiates with different levels of manufacturing variation and initial degradation. This information is hidden from the engineer, and is not considered a fault condition. The three operational settings do have a substantial effect upon engine performance. These settings are known. Finally, the sensor data is contaminated with noise. Figure 8 shows an example of one of the sensor measurements for one trajectory with its RUL.
To avoid unnecessary duplications, the following sections only use FD001 and FD004 for the sake of brevity. To successfully predict the remaining useful life of the engine, a methodology that can fuse the twenty-six different sensor signals is necessary. sensor measurements, three conditions of operation, flights and mission times. Each of the engines within the dataset initiates with different levels of manufacturing variation and initial degradation. This information is hidden from the engineer, and is not considered a fault condition. The three operational settings do have a substantial effect upon engine performance. These settings are known. Finally, the sensor data is contaminated with noise. Figure 8 shows an example of one of the sensor measurements for one trajectory with its RUL.

Example of Application: Turbofan Engines
To evaluate the proposed semi-supervised methodology, two types of labeling were used: (1) fixed interval and (2) random interval. The fixed interval consists of labeling one out of every x number of labels (i.e., 5% equals labeling 1 out of every 20 data points.) Random interval labeling consisted of taking a random sample of the complete data set for labeling (i.e., 5% of 15,680 data points equals 784 randomly labeled data points). This was done because, as the time interval between labels is decreasing, the RUL estimation error improvements reduce. As one will notice in the rest of this section, this did affect RUL prognostics.
To evaluate the effects of adding a small subset of labeled data to the training procedure, semi-supervised learning was also conducted on the C-MAPSS dataset. There are two parts of the algorithm to evaluate this effect of labeling on the results: (1) feature learning and (2) regression. When it is stated "semi-supervised feature learning", it implies that the percentage of labels were fed into the feature learning phase of the model. When results are reported as "unsupervised feature learning", zero labels were used in the feature learning portion of the model.
The proposed methodology is evaluated against the true RUL via the root mean square error (RMSE). To not sway these results in a more positive light, the authors chose to train the model ten times and take the average results from all ten.
First, FD001 is evaluated from one percent to one hundred percent labels. The RMSE results can be found in Tables 2 and 3. As one can see from the results, there is an effect on the RUL prognostics with both types of labeling (fixed vs. random) and adding labels to both parts of the model. This can also be viewed in Figure 9 There are two observations to note when looking at the results: (1) adding labels to feature learning improves the RUL prediction, and (2) as more labels are added to the feature learning and regression parts of the modeling, the prediction performance (in terms of RMSE) improvement tends to taper off after twenty percent. The increase in prediction performance from adding labels to the feature learning portion of the model shows that feeding labels to the generative model help extract more degradation-related features present in the data. The appropriate percentage of labeling could be inferred or determined based on the evolution of the RMSE according to Figure 9. In this case, the RMSE marginally improves for FD001 beyond twenty percent labeling (1.5% improvement for 50% labeling and 2.7% improvement for 100% labeling). This is important because labeling data is expensive and time consuming. Therefore, increasing the prediction performance (i.e., reducing RMSE) beyond twenty percent of labels becomes increasingly more expensive for a smaller benefit.  To evaluate the effects and differences of modeling operating conditions and additional fault modes, FD004 was also examined. This data set is more applicable for cases that include fleets of vehicles operating in different conditions. Based on the results reported in Figure 10, Tables 4 and 5, this data set had a larger improvement in results by adding labels into both feature learning and regression parts of the model. One can argue that this is because of the non-homogeneity of the data resulting from the inclusion of additional operating conditions and fault modes.  26 Note that FD004 needed an increased percentage of labels given the inherent non-homogeneity of the data set. With six operating conditions and two failure modes, there is a higher degree of uncertainty, and therefore the model performance benefits from an increasing percentage of labels. Compared to the FD001 results in Figure 9, there is still a noticeable reduction of RMSE up to 100% labeling. This reflects the model taking advantage of the increased knowledge of the RUL evolution To evaluate the effects and differences of modeling operating conditions and additional fault modes, FD004 was also examined. This data set is more applicable for cases that include fleets of vehicles operating in different conditions. Based on the results reported in Figure 10, Tables 4 and 5, this data set had a larger improvement in results by adding labels into both feature learning and regression parts of the model. One can argue that this is because of the non-homogeneity of the data resulting from the inclusion of additional operating conditions and fault modes.

Ablation Study and Comparison Results
An ablation study was conducted on the FD001 data set to understand the effects and advantages of integrating variational inference with an adversarial approach, as it is done in the proposed mathematical framework. To this end, both VAEs and GANs were applied separately to FD001 and RUL estimates were performed. Unsupervised feature learning with semi-supervised regression was performed to evaluate the effects of the generative modeling without labels for feature learning. These results can be seen in Tables 6 and 7  These results allow one to see the effects of the variational and adversarial approach of the proposed methodology. Even though the VAE and GAN models provide acceptable results, the proposed methodology outperformed both on their own. The VAE model's RUL prediction performance in terms of RMSE was slightly better with fixed interval labeling, while the GAN model's performance was better with random intervals for the labeling. VAE did not perform as well as the GAN and the proposed methodology. The VAE model also did not benefit from labeling more data after adding 10% labels.
The authors suspect the VAE model did not perform as well due to the possibility of modeling the Gaussian priors of the VAE model sequentially in the training portion of the model [32]. These Note that FD004 needed an increased percentage of labels given the inherent non-homogeneity of the data set. With six operating conditions and two failure modes, there is a higher degree of uncertainty, and therefore the model performance benefits from an increasing percentage of labels. Compared to the FD001 results in Figure 9, there is still a noticeable reduction of RMSE up to 100% labeling. This reflects the model taking advantage of the increased knowledge of the RUL evolution granted by the known labels during the training stage.
Moreover, both FD001 and FD004 RUL prediction benefited from random interval labeling during semi-supervised feature learning. This is can be attributed to the proposed model's ability to better generalize the underlying generative model or lower-dimensional manifold space. The output of the proposed framework also includes a semi-supervised model that gives the engineer the ability to continuously add labels as more information about the degradation process becomes available. From a practical point of view, this is an important characteristic of the model: the engineer can weigh the economic impacts of labeling more data.

Ablation Study and Comparison Results
An ablation study was conducted on the FD001 data set to understand the effects and advantages of integrating variational inference with an adversarial approach, as it is done in the proposed mathematical framework. To this end, both VAEs and GANs were applied separately to FD001 and RUL estimates were performed. Unsupervised feature learning with semi-supervised regression was performed to evaluate the effects of the generative modeling without labels for feature learning. These results can be seen in Tables 6 and 7     Additionally, the proposed methodology and corresponding mathematical framework were assessed against the deep generative modeling technique outlined in Krishnan, et al. [22]. This modeling technique incorporates a Deep Markov Model (DMM) state-space system utilizing structured inference architecture without an adversarial mechanism. Additionally, Krishnan's methodology was not developed for, or applied to, the PHM context. It is, however, a state-of-the-art deep generative modeling technique on time series data. For this paper, it was applied to the CMAPSS FD001 and FD004 data sets as a comparison method. These results can be found in Table 8.   Additionally, the proposed methodology and corresponding mathematical framework were assessed against the deep generative modeling technique outlined in Krishnan, et al. [22]. This modeling technique incorporates a Deep Markov Model (DMM) state-space system utilizing structured inference architecture without an adversarial mechanism. Additionally, Krishnan's methodology was not developed for, or applied to, the PHM context. It is, however, a state-of-the-art deep generative modeling technique on time series data. For this paper, it was applied to the CMAPSS FD001 and FD004 data sets as a comparison method. These results can be found in Table 8. These results allow one to see the effects of the variational and adversarial approach of the proposed methodology. Even though the VAE and GAN models provide acceptable results, the proposed methodology outperformed both on their own. The VAE model's RUL prediction performance in terms of RMSE was slightly better with fixed interval labeling, while the GAN model's performance was better with random intervals for the labeling. VAE did not perform as well as the GAN and the proposed methodology. The VAE model also did not benefit from labeling more data after adding 10% labels.
The authors suspect the VAE model did not perform as well due to the possibility of modeling the Gaussian priors of the VAE model sequentially in the training portion of the model [32]. These results show the value of the combination of the non-Markovian adversarial and variational capabilities within the proposed methodology.
Additionally, the proposed methodology and corresponding mathematical framework were assessed against the deep generative modeling technique outlined in Krishnan, et al. [22]. This modeling technique incorporates a Deep Markov Model (DMM) state-space system utilizing structured inference architecture without an adversarial mechanism. Additionally, Krishnan's methodology was not developed for, or applied to, the PHM context. It is, however, a state-of-the-art deep generative modeling technique on time series data. For this paper, it was applied to the CMAPSS FD001 and FD004 data sets as a comparison method. These results can be found in Table 8. As shown in Table 8, the proposed methodology provides superior results when compared with DMM. Additionally, the DMM is restricted to unsupervised learning, and does not provide a mechanism for semi-supervised learning and labeling. To compare the methodology to a more traditional fully supervised RUL approach, a basic neural network was also applied to the data set within this table. This gives a baseline for the engineer to make an economic decision about labeling the complete data set. This further demonstrates the benefits of the proposed methodology for RUL assessment.
The analysis presented was completed with an nVidia TitanXP GPU processor, and each set of training results took approximately fifty-five minutes to complete. Currently, it takes 1.1 s on average to feed a batch set of data through the model after training. Further research would be required and is beyond the scope of this paper; however, once training is complete it would be reasonable to envision the regression side of the proposed methodology as capable of operation within an online real time system.

FEMTO Dataset Results
For an additional point of experimental validation, this dissertation uses the PHM 2012 Challenge dataset incorporating the PRONOSTIA platform for accelerated aging The term "multi-sensor" for this data set means multiple bearing trajectories, load conditions and rotational speeds. The experimental setup is shown in Figure 13.
The platform's goal is to provide a sensor data output that characterizes the realistic degradation processes of rolling element bearings throughout their life. This data set consists of a run to failure data set for seventeen bearings at different load cases and rotational speeds. The information for each bearing is outlined in Table 9.

FEMTO Dataset Results
For an additional point of experimental validation, this dissertation uses the PHM 2012 Challenge dataset incorporating the PRONOSTIA platform for accelerated aging The term "multisensor" for this data set means multiple bearing trajectories, load conditions and rotational speeds. The experimental setup is shown in Figure 13. The platform's goal is to provide a sensor data output that characterizes the realistic degradation processes of rolling element bearings throughout their life. This data set consists of a run to failure To evaluate this data set, sixteen of the seventeen bearings were used as the training set, while the seventeenth bearing is used as the test set. Data augmentation in the form of spectrogram images was done to ensure a consistent signal and degradation path. Figures 14 and 15 show the raw signal and spectrogram signals for bearing 1-3 prior to data normalization. Data was normalized for each bearing signal to maintain a consistent scale for the data, and it is a necessary step in preparation to be input into the proposed methodology.
Sensors 2020, 20, x FOR PEER REVIEW 15 of 18 data set for seventeen bearings at different load cases and rotational speeds. The information for each bearing is outlined in Table 9. To evaluate this data set, sixteen of the seventeen bearings were used as the training set, while the seventeenth bearing is used as the test set. Data augmentation in the form of spectrogram images was done to ensure a consistent signal and degradation path. Figures 14 and 15 show the raw signal and spectrogram signals for bearing 1-3 prior to data normalization. Data was normalized for each bearing signal to maintain a consistent scale for the data, and it is a necessary step in preparation to be input into the proposed methodology.    The results of the FEMTO data set within the proposed methodology show good performance against research incorporating this dataset, where the published fully supervised RMSE results for bearings 1-3, 2-4 and 3-1 are 9.0, 8.9 and 24.2, respectively [35]. As shown in Table 10, the proposed methodology outperformed these results. Similarly, to the CMAPSS results, increasing the percentage of the labels reduces the RMSE results of the prediction accuracy. This again demonstrates that the model's ability to extract system health features improves with more knowledge about the system. Due to the nature of this data set, only fixed interval labeling was used. Random interval labeling was explored; however, unlike the CMAPSS results, it had mixed results. The results show the robust ability of the proposed methodology to generalize the underlying machine health state degradation process.

Conclusions
In this paper, a novel deep learning-enabled, adversarial-variational methodology, and corresponding non-Markovian mathematical framework, for remaining useful life estimation, was proposed. This was then applied to both a public multi-sensor turbo fan dataset and a public multi-sensor rolling element bearing data set. The proposed methodology achieved superior RUL prediction performance and demonstrated its ability to predict the RUL even with a small percentage of labeled data.
Within the ablation study, the proposed framework proved higher RUL prediction performance with a combined generative modeling methodology. The prediction performance was further enhanced with the addition of labels to the data set. Moreover, the type of labeling (random versus fixed-interval) was explored, and it was uncovered that the method with which one labels time series data can be beneficial towards RUL prediction. Preliminary results indicate that the proposed methodology could be implemented as an online system; however, further research is required to explore this possibility.
A limitation of the proposed methodology is inherent when trying to quantify the variability, while not specifically calculating the uncertainty. This is a potential drawback to this model supporting PHM risk decision making given the computational complexity. Therefore, a suggestion would be to expand the method in terms of a Bayesian framework so the uncertainty on RUL can be explicitly calculated. This approach would be useful with the implication of a quantifiable uncertainty metric, where a loss function could find a relation between the percentage of labeling and the uncertainty on RUL. From this, one could then find an optimal labeling percentage for a given physical asset's dataset based on the risk surrounding a failed prediction.

Conflicts of Interest:
The authors declare no conflict of interest.