Deep reinforcement learning for predictive aircraft maintenance using probabilistic Remaining-Useful-Life prognostics

The increasing availability of sensor monitoring data has stimulated the development of Remaining-Useful-Life (RUL) prognostics and maintenance planning models. However, existing studies focus either on RUL prognostics only, or propose maintenance planning based on simple assumptions about degradation trends. We propose a framework to integrate data-driven probabilistic RUL prognostics into predictive maintenance planning. We estimate the distribution of RUL using Convolutional Neural Networks with Monte Carlo dropout. These prognostics are updated over time, as more measurements become available. We further pose the maintenance planning problem as a Deep Reinforcement Learning (DRL) problem where maintenance actions are triggered based on the estimates of the RUL distribution. We illustrate our framework for the maintenance of aircraft turbofan engines. Using our DRL approach, the total maintenance cost is reduced by 29.3% compared to the case when engines are replaced at the mean-estimated-RUL. In addition, 95.6% of unscheduled maintenance is prevented, and the wasted life of the engines is limited to only 12.81 cycles. Overall, we propose a roadmap for predictive maintenance: from sensor measurements to data-driven probabilistic RUL prognostics, and from these prognostics to maintenance planning.


Introduction
Modern aircraft are equipped with multiple sensors that generate large volumes of health monitoring measurements for aircraft systems and components. For example, for a Boeing 787, approximately 1000 parameters are continuously monitored for the engine, amounting to a total of 20 terabytes of data per flight hour [1]. Such data are the basis for Remaining-Useful-Life (RUL) estimation [2] and predictive aircraft maintenance planning [3]. In this paper, we are interested in integrating RUL prognostics into predictive aircraft maintenance. Below we discuss relevant, recent studies on RUL prognostics and maintenance planning.

Relevant studies on RUL prognostics
In recent years, many studies have focused on developing RUL prognostics for aircraft components and systems [4]. For example, RUL prognostics for aircraft landing gear brakes are developed using stochastic regression models in [5]. The RUL of aircraft cooling units is estimated using particle filtering in [6]. RUL prognostics for electromechanical actuators are obtained using Gaussian process regression in [7]. RUL prognostics have also been developed for components such as batteries [8,9] and fuel cells [10]. Several studies have developed RUL prognostics for turbofan engines using the C-MAPSS data set [11]. Because the degradation of a turbofan engine is non-linear, many such RUL prognostics models are based on deep learning models, such as convolutional neural networks (CNNs) [12], deep convolutional neural networks (DCNNs) [13], multi-scale DCNNs [14], and CNNs with pooling [15,16]. Recently, deep learning models have been improved further by using a hybrid approach of deep learning models and physics-based models [17]. All these studies, however, predict RUL as a point estimate, i.e., a single value of RUL. For predictive maintenance planning, quantifying the uncertainty associated with the estimated RUL is a prerequisite [18].
In this paper, we develop RUL prognostics for the turbofan engines of the C-MAPSS data set [11] using neural networks, together with quantifying the uncertainty of the estimated RUL. In general, Bayesian learning, Gaussian processes, deep ensembles, and Monte Carlo dropout are approaches used to estimate the uncertainty of the output of neural networks [19]. In [20], Bayesian learning is applied to quantify the uncertainty of the model parameters of the RUL prognostics. In [21], deep ensemble models are applied to quantify the uncertainty of RUL predictions. However, both deep ensembles and Bayesian learning are computationally intensive, especially for deep neural networks with a very large number of parameters [19]. Another uncertainty quantification approach is deep Gaussian process learning, which estimates the confidence interval of RUL predictions [22]. Finally, Monte Carlo dropout is an efficient approach for uncertainty quantification [23]. Moreover, Monte Carlo dropout approximates Bayesian learning at a much lower computational cost [24]. Thus, in this paper, we use neural networks with Monte Carlo dropout to estimate the probability distribution of the RUL.

Relevant studies on maintenance planning
Maintenance planning optimisation has been extensively studied for various assets [25]. However, most of these studies propose advanced maintenance planning models with simple assumptions about the degradation of systems/components [26]. For instance, the degradation process of components is often assumed to follow a Gamma process [5,27], a Wiener process [28], a non-homogeneous Poisson process [29], or a Markov process [30,31].
Only a few studies integrate data-driven RUL prognostics into maintenance planning [32]. In [33], for example, the replacement of aircraft brakes is scheduled taking into account data-driven RUL prognostics. In [12], data-driven RUL prognostics for aircraft engines are obtained. Based on these prognostics, alarms are triggered and maintenance actions are specified. In [34], component inspections are scheduled based on the epistemic uncertainty of the estimated RUL. In [35], the replacement of airframe panels is scheduled based on the crack size predicted by an extended Kalman filter. Even when these studies consider RUL prognostics for maintenance planning, planning relies on fixed thresholds for the degradation of the components. For instance, in [12], all engines are replaced as soon as their RUL is estimated to be 44 days or less. In [35], the replacement of airframe panels is triggered by a fixed threshold of 47.4 mm crack size.
Recently, studies have proposed deep reinforcement learning (DRL) for adaptive maintenance planning [32], where no fixed thresholds are needed to schedule maintenance. In [36], DRL is used to schedule the replacement of a component whose degradation is represented by 4 discrete states. In [37], the maintenance of multi-component systems is optimised using DRL, where the degradation is assumed to follow a compound Poisson process and a Gamma process. In [38], a DRL approach is applied to a railway maintenance case study. These studies apply DRL to cases where the system degradation is modelled as a discrete set of states, or where the degradation process is modelled as a stochastic process. In contrast with these studies, we develop a threshold-free DRL approach for maintenance planning that considers the distribution of the RUL. Thus, our DRL approach adaptively considers the uncertainty associated with the degradation level of the assets.
In this paper, we propose a deep reinforcement learning (DRL) approach for predictive maintenance that adaptively schedules maintenance considering probabilistic RUL prognostics. The overview of the proposed framework is shown in Fig. 1. Based on sensor measurements, the probability distribution of the RUL is estimated using Convolutional Neural Networks (CNNs) with Monte Carlo dropout. We further develop a DRL approach that uses these probabilistic RUL prognostics to plan maintenance. The estimated distribution of the RUL directly specifies the states of the DRL. Using DRL, maintenance actions are planned adaptively, without relying on fixed thresholds for the degradation level of the components. We illustrate our approach for maintenance planning of aircraft turbofan engines.
The main contributions of this paper are as follows:
• An integrated framework for predictive maintenance is proposed, where probabilistic Remaining-Useful-Life (RUL) prognostics and deep reinforcement learning (DRL) are used to plan the maintenance of aircraft engines. Here, probabilistic RUL prognostics (the estimated RUL distribution) are directly used to construct the states of the DRL.
• Probabilistic RUL prognostics are obtained using Convolutional Neural Networks (CNNs) and Monte Carlo dropout. Using probabilistic RUL prognostics, we show that the number of unscheduled maintenance events is lower than when using point-RUL estimates. This shows the benefit of quantifying the uncertainty of RUL estimates for maintenance planning.
• We pose the problem of predictive maintenance planning as a DRL problem. This approach adaptively proposes maintenance actions based on the trends of the estimated RUL prognostics.
The remainder of this paper is organised as follows. In Section 2, we propose probabilistic RUL prognostics using a CNN with Monte Carlo dropout, and validate this proposed model for turbofan engines. In Section 3, we formulate a DRL problem for predictive maintenance planning taking into account probabilistic RUL prognostics. In Section 4, we illustrate our DRL approach for the maintenance planning of turbofan engines. In Section 5, we compare our DRL approach against other maintenance strategies. Finally, we provide conclusions in Section 6.

Estimating the distribution of RUL using CNN with Monte Carlo dropout
In this section, we propose a multi-channel convolutional neural network (CNN) with Monte Carlo dropout for probabilistic RUL prognostics. We validate our model for aircraft engines. These RUL prognostics are updated after every flight cycle, as new degradation data become available.

Data description and pre-processing
We consider the degradation data of aircraft turbofan engines obtained by NASA using the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) [11]. This data set consists of data subsets FD001, FD002, FD003, and FD004, each considering a specific number of fault modes and operating conditions (see Table 1) [39]. The training instances of the subsets have run-to-failure data of sensor measurements, while the testing instances have sensor measurements up to some moment prior to failure. Each instance consists of time-series of 21 sensor measurements per flight cycle. Following [12,13], we select for our analysis 14 non-constant sensor measurements. We discard the remaining 7 constant sensor measurements.
We pre-process the raw data as follows. First, using the clustering of operational settings proposed by [40], 6 operating conditions are identified. Let $c_t$ denote the operating condition of an engine during the $t$th flight cycle, $c_t \in \{1, \ldots, 6\}$.
We also consider the history of the operating conditions. Let $h_{c,t}$ denote the number of cycles that an engine has been operated under operating condition $c$, up to the $t$th flight cycle.
Next, the measurements of sensor $m \in \{1, \ldots, 14\}$ are normalised with respect to operating condition $c$ as follows [12,15]:
$$\tilde{z}^{c}_{m,t} = \frac{z^{c}_{m,t} - \mu^{c}_{m}}{\sigma^{c}_{m}},$$
where $z^{c}_{m,t}$ is the raw measurement of sensor $m$ at flight cycle $t$ under operating condition $c$, and $\mu^{c}_{m}$ and $\sigma^{c}_{m}$ are the mean and the standard deviation of the measurements of sensor $m$ under operating condition $c$. For each flight cycle $t$, the input sample $x_t$ stacks the normalised sensor measurements and the operating-condition features over the last $N_{tw}$ cycles. Here, $N_{tw}$ is selected based on the number of cycles available for the shortest testing instance in each data subset [12]. We use $N_{tw} = 30$ cycles for FD001 and FD003, $N_{tw} = 21$ for FD002, and $N_{tw} = 19$ for FD004.
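As an illustration, the per-condition normalisation and the construction of sliding time windows can be sketched as follows. This is a minimal NumPy sketch; the helper names and the small epsilon guard are ours, not from the original implementation:

```python
import numpy as np

def normalize_per_condition(sensor, cond):
    """Z-score each measurement using the mean/std of its own
    operating condition (per-condition normalisation sketch)."""
    out = np.empty_like(sensor, dtype=float)
    for c in np.unique(cond):
        mask = cond == c
        mu, sigma = sensor[mask].mean(), sensor[mask].std()
        out[mask] = (sensor[mask] - mu) / (sigma + 1e-8)  # guard against std == 0
    return out

def sliding_windows(series, n_tw):
    """Stack the last n_tw cycles into one input sample per cycle."""
    return np.stack([series[i:i + n_tw] for i in range(len(series) - n_tw + 1)])
```

With $N_{tw} = 30$, each engine trajectory of length $T$ yields $T - 29$ input samples.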

Architecture of the multi-channel CNN with Monte Carlo dropout
To obtain probabilistic RUL prognostics, we propose a neural network architecture combining multi-channel convolutional layers, linear layers, and Monte Carlo dropout (see Fig. 2 and Table 2).
In contrast to [12], where a common 1D kernel is applied to all features, we apply one 1D kernel per time-series of a feature, i.e., each column of the input $x_t$ is convolved with a different 1D kernel. Such multi-channel 1D convolutional layers are shown to be effective for multi-variate time-series [41], which is also the case for the C-MAPSS data set. Since an independent kernel is used for the time-series of each feature, the convolutional layers are able to learn the patterns of each feature.
A multi-channel 1D convolutional layer is defined by the size (length) $k_l$ of the kernel, and the number of output channels $F_l$. The $l$th convolutional layer gets input $x^{(l-1)}$ from the $(l-1)$th layer, where $x^{(l-1)}$ has $F_{l-1}$ channels. Then, the output of channel $i$ of the $l$th convolutional layer is obtained as follows:
$$x^{(l)}_{i} = \sigma_c\Big(\sum_{i'=1}^{F_{l-1}} w^{(l)}_{i,i'} * x^{(l-1)}_{i'} + b^{(l)}_{i}\Big),$$
where $*$ is the convolutional operator, $w^{(l)}_{i,i'}$ is the kernel for input channel $i'$ and output channel $i$, $b^{(l)}_{i}$ is the bias of output channel $i$, and $\sigma_c(\cdot)$ is the activation function of the convolutional layer. Here we use the rectified linear unit (ReLU) activation function.
We use 5 convolutional layers. For the first convolutional layer, the input $x^{(0)}$ is the input sample $x_t$ defined in Eq. (2), and the number of input channels $F_0$ is equal to the number of features, $N_f = 21$. Table 2 shows the number of output channels ($F_l$) and the size of the kernels ($k_l$) for all convolutional layers. For all convolutional layers we use zero padding to ensure the same size of the output. These hyper-parameters are selected based on a grid-search, where the hyper-parameters suggested in [12,13] are used as a starting point. After the convolutional layers, we apply two intermediate linear layers, and one output linear layer with a single neuron without activation (see Fig. 2). The $l$th linear layer gets the (flattened) input $x^{(l-1)}$. Then, its output is obtained as follows:
$$x^{(l)} = \sigma\big(W^{(l)} x^{(l-1)} + b^{(l)}\big),$$
where $W^{(l)}$ is the weight matrix, $b^{(l)}$ is the bias, and $\sigma(\cdot)$ is the ReLU activation function. We denote the number of output neurons of the linear layers as $n_l$ (see Table 2).
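The idea of one 1D kernel per feature time-series can be sketched as a depthwise convolution. Below is a simplified NumPy sketch of a single such layer (one kernel per input channel, "same" zero padding, ReLU); like deep-learning libraries, it actually computes cross-correlation, and it is not the authors' implementation:

```python
import numpy as np

def depthwise_conv1d(x, kernels, bias):
    """One 1D kernel per feature time-series, 'same' zero padding, ReLU.
    x: (n_tw, n_features); kernels: (k, n_features), k odd; bias: (n_features,)."""
    k = kernels.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))      # zero padding keeps output length
    out = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):
        # slide the window and apply each feature's own kernel
        out[t] = np.sum(xp[t:t + k] * kernels, axis=0) + bias
    return np.maximum(out, 0.0)               # ReLU activation
```

In a framework such as PyTorch, the same effect is obtained with a grouped `Conv1d` where the number of groups equals the number of features.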
Using the Adam optimiser [42], we optimise the kernels $w^{(l)}_{i,i'}$ and the biases $b^{(l)}_{i}$ of the convolutional layers, as well as the weights $W^{(l)}$ and the biases $b^{(l)}$ of the linear layers. The loss function considered here is the mean-squared-error. We train the network using a fixed learning rate of 0.001, a mini-batch of 256 samples, and a maximum of $10^3$ training epochs.

Monte Carlo dropout
Typically, dropout is used only during training, to prevent overfitting [23]. In this paper, we use Monte Carlo dropout (i) during training, to prevent overfitting of the model, and (ii) during testing, to obtain the probability distribution of the RUL [24]. We apply Monte Carlo dropout after each layer. The dropout rate is set to 0.5 after a grid-search to minimise the test loss of data subset FD002 [12].
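Conceptually, test-time Monte Carlo dropout amounts to repeating stochastic forward passes with the dropout masks kept active, and treating the spread of the outputs as the RUL distribution. A minimal sketch, with a stand-in `forward` function instead of the trained CNN:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(forward, x, n_samples=100, p=0.5):
    """Monte Carlo dropout at test time: keep dropout ACTIVE and run the
    network n_samples times. `forward(x, mask)` stands in for the trained
    CNN; `mask` is an inverted-dropout mask (scaled by 1/(1-p))."""
    preds = []
    for _ in range(n_samples):
        mask = (rng.random(x.shape) >= p) / (1.0 - p)  # inverted dropout
        preds.append(forward(x, mask))
    preds = np.array(preds)
    return preds.mean(), preds.std(), preds
```

The collection `preds` is the sample from which the mean-estimated RUL, its standard deviation, and the empirical RUL distribution are obtained.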

Probabilistic RUL prognostics for turbofan engines -Validation
We validate our CNN with Monte Carlo dropout for probabilistic RUL prognostics using the C-MAPSS data set for turbofan engines [11].

Comparing the RMSE of our estimated RUL against other RUL prognostics models
Our prognostics estimate the distribution of the RUL. We determine the Root Mean Squared Error (RMSE) of our prognostics based on the mean of the estimated distribution of RUL and the true RUL of the testing instances of the C-MAPSS data sets (see Table 3). We compare these results against the RUL prognostics models in [12-16]. Since these models estimate RUL as a point estimate, the RMSE of the prognostics in [12-16] is determined based on the estimated point RUL and the true RUL (see Table 3). For all prognostics models, the RMSE of subsets FD002 and FD004 is higher than that of FD001. This is due to the multiple operating conditions considered in FD002 and FD004 (see Table 1). Also, FD002 and FD004 have the shortest time windows of the input data compared to FD001 and FD003 ($N_{tw} = 21$ for FD002 and $N_{tw} = 19$ for FD004). Table 3 shows that our multi-channel CNN with Monte Carlo dropout outperforms several other studies that employ CNNs for RUL prognostics. In fact, we obtain the lowest RMSE for subsets FD002 and FD004. For subsets FD001 and FD003, only MS-DCNN achieves a slightly smaller RMSE compared to our approach [14]. In general, the accuracy of our prognostics is higher than or comparable to that of other existing studies.
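For a probabilistic model, the RMSE in Table 3 is computed from the mean of each engine's estimated RUL distribution. A small sketch of this computation (the helper name is ours):

```python
import numpy as np

def rmse_from_mc_samples(rul_samples, true_rul):
    """RMSE of probabilistic prognostics: the mean of each engine's Monte
    Carlo RUL samples serves as the point estimate compared to the true RUL."""
    mean_rul = np.asarray(rul_samples).mean(axis=1)   # one point estimate per engine
    return float(np.sqrt(np.mean((mean_rul - np.asarray(true_rul)) ** 2)))
```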

Estimating the distribution of the RUL
We illustrate the probabilistic RUL prognostics for three turbofan engines in the testing instances of subset FD002. Fig. 3 shows the evolution of the estimated RUL distribution over time, as more sensor measurements become available. Fig. 4 shows the estimated RUL distributions of the three engines after they are operated for 136, 96, and 107 flight cycles, respectively. For Engine 148 (Fig. 3(a)), the standard deviation of the RUL distribution decreases after the 126th cycle. After the 136th flight cycle (Fig. 4(a)), the error between the mean-estimated RUL and the true RUL is small (1.00 cycle), and the RUL distribution is concentrated around the true RUL (the standard deviation is 6.31 cycles).
For Engine 173 (Fig. 4(b)), the RUL distribution is right-skewed and the error between the mean-estimated RUL and the true RUL is small (2.87 cycles). Although the error of the estimated point (mean) of the RUL is small, having the distribution of the RUL provides additional support for maintenance planning. Should we consider only the mean prediction of the RUL (16.87 cycles) for Engine 173 to schedule a replacement, then we would be inclined to schedule a replacement at the 16th cycle from now. However, this maintenance decision would lead to an engine failure, since the true RUL of Engine 173 is 14 cycles. Should we consider the estimated distribution of the RUL for Engine 173 (Fig. 4(b)), then we would observe the high probability (more than 45%) that Engine 173 fails in less than 14 cycles. In fact, the probability that Engine 173 fails at the 12th cycle is the highest (8.0%). Observing the RUL distribution, we would be inclined to replace the engine at the 12th cycle from now, avoiding an engine failure.
For Engine 021 (Fig. 4(c)), the error between the mean RUL prediction and the true RUL is large (26.93 cycles), and the standard deviation of the RUL distribution is large (11.44 cycles). This is also informative for maintenance decision-making, i.e., it indicates that the accuracy of the RUL prognostic is low. In fact, the variance of the RUL distribution is large across a sequence of flight cycles (see Fig. 3(c)), suggesting that maintenance decision-making should remain conservative about the moment of engine replacement.
As shown in Figs. 3 and 4, the distribution of RUL prognostics provides valuable information that can lead to more efficient maintenance decisions. In the next section, we propose a deep reinforcement learning approach to specify the moment of engine replacement based on the estimated distribution of RUL.

Quality of the estimated RUL distribution
We analyse the quality of the estimated RUL distributions using calibration plots [43]. Let $F(r \mid x)$ be the cumulative distribution function (CDF) of the estimated RUL $R$, given sensor measurements $x$. Let $F^{-1}(p \mid x)$ be the quantile function of $F$ such that $F^{-1}(p \mid x) = \inf\{r : p \le F(r \mid x)\}$. We say that the probabilistic RUL prognostics model is perfectly calibrated if
$$\mathbb{P}\big(R^{\mathrm{true}} \le F^{-1}(p \mid x)\big) = p, \quad \forall p \in [0, 1],$$
where $R^{\mathrm{true}}$ is the true RUL. For a perfectly calibrated model, the probability that the true RUL is less than or equal to the $p\%$ quantile of the estimated distribution is $p\%$.
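An empirical calibration curve can be estimated directly from the Monte Carlo RUL samples: for each probability level, count how often the true RUL falls at or below the corresponding per-engine quantile. A minimal sketch (our helper, not the authors' code):

```python
import numpy as np

def calibration_curve(rul_samples, true_rul, quantile_levels):
    """For each level p, the fraction of engines whose true RUL falls at or
    below the p-quantile of their estimated RUL distribution.
    Perfect calibration gives fraction == p for every level."""
    rul_samples = np.asarray(rul_samples)        # (n_engines, n_mc_samples)
    true_rul = np.asarray(true_rul)
    observed = []
    for p in quantile_levels:
        q = np.quantile(rul_samples, p, axis=1)  # per-engine p-quantile
        observed.append(np.mean(true_rul <= q))
    return np.array(observed)
```

Plotting the observed fractions against the quantile levels yields a calibration plot like Fig. 5; the diagonal corresponds to a perfectly calibrated model.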
Fig. 5 shows the calibration plots of the four C-MAPSS data subsets (FD001, FD002, FD003, FD004). The dashed, black line in Fig. 5 shows a perfectly calibrated model. Fig. 5 shows that our probabilistic RUL prognostics models are well calibrated, i.e., the deviation from the perfectly calibrated model is small. For the case of FD002 (Fig. 5(b)), the probabilities that the true RUL is less than or equal to the 10%, 50%, and 90% quantiles of the estimated RUL distributions are 14%, 43%, and 86%, respectively.

Planning predictive maintenance using DRL and probabilistic RUL prognostics
In this section, we propose a deep reinforcement learning (DRL) approach for predictive maintenance of turbofan engines taking into account probabilistic RUL prognostics (the estimated RUL distribution). These probabilistic RUL prognostics are updated periodically, as more measurements become available.

Scheduling engine replacements taking into account updated probabilistic RUL prognostics
The maintenance schedule of the engines is updated every $K$ flight cycles. In other words, every $K$ cycles, we need to decide whether to replace an engine during the next $K$ cycles (a decision step). Some existing studies assume that maintenance schedules can be updated every cycle/day ($K = 1$) [30]. However, this assumption would be unrealistic for the maintenance of aircraft engines. In practice, several days are needed to prepare the required equipment before replacing an engine [12]. Thus, we assume $K > 1$ and make a maintenance plan for the next $K$ cycles.
Our aim is to minimise the total maintenance cost while avoiding engine failures and minimising the wasted life of the engines. If an engine is replaced too late and as a result this engine fails before the scheduled replacement, then we have to perform a very costly unscheduled replacement [33]. On the other hand, if we schedule a replacement too early, we waste the life of this engine. The long-run maintenance cost also increases when engines are replaced too often. Our goal is to propose an approach to optimally schedule engine replacement taking into account probabilistic RUL prognostics (estimates of the RUL distribution).
Fig. 6 illustrates the maintenance planning at decision step $t$. At the start of decision step $t$, we use sensor measurements $X_t$, i.e., the measurements available up to decision step $t$. Using $X_t$, we estimate the distribution of the RUL using a CNN with Monte Carlo dropout (see Section 2). Let $P_{j,t}$ denote the estimated cumulative probability that the RUL of the engine is less than or equal to $j$ cycles, given $X_t$. Formally,
$$P_{j,t} = \mathbb{P}\big(\hat{R}_t \le j \mid X_t\big),$$
where $\hat{R}_t$ is the predicted RUL at the start of decision step $t$. By the definition of RUL, the engine fails at the $j$th cycle if $(j-1) < R_t \le j$, where $R_t$ denotes the true RUL. Since $\hat{R}_t$ in Eq. (6) is an estimate of $R_t$, we can interpret $P_{j,t}$ as the estimated probability that the engine fails within $j$ cycles, given $X_t$. Based on $P_{j,t}$, at decision step $t$, we decide whether, and at which cycle, to replace the engine during the next $K$ cycles. For example, at the start of decision step $t$, we have the estimated $P_{j,t}$ given in Fig. 7. We need to decide when to schedule an engine replacement based on this estimate. Fig. 7 shows the estimated RUL distribution for FD002 Testing Engine 148 of the C-MAPSS data set. This cumulative probability $P_{j,t}$ is estimated after the engine has already been used for 136 cycles. Our prognostics model predicts that this engine fails within 10 and 15 cycles with probabilities 37% and 82%, respectively. In fact, the true RUL of this engine at this moment is 11 cycles. In the next section, we propose a deep reinforcement learning approach to optimally replace the engine based on this estimated distribution of RUL.
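Given Monte Carlo samples of the RUL, the cumulative failure probabilities of Eq. (6) can be estimated with the empirical CDF. A minimal sketch:

```python
import numpy as np

def failure_cdf(rul_samples, horizon):
    """Estimated probability that the engine fails within j cycles, for
    j = 1..horizon, as the empirical CDF of the Monte Carlo RUL samples."""
    rul_samples = np.asarray(rul_samples)
    return np.array([np.mean(rul_samples <= j) for j in range(1, horizon + 1)])
```

With `horizon` set to the decision window $K$, the resulting vector is exactly the information available to the decision-maker at the start of a decision step.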

Predictive maintenance planning as a deep reinforcement learning problem
We formulate the predictive maintenance planning of an engine as a deep reinforcement learning (DRL) problem (see Fig. 8). The hidden state $R_t$ denotes the true RUL of the engine at decision step $t$. The observed state $s_t$ denotes the RUL distribution estimated based on the sensor measurements and the CNN. Given $s_t$, an agent (decision-maker) takes an action $a_t \in A$ based on a policy $\pi$. Then, reward $r_t$ is obtained based on the hidden state $R_t$ and action $a_t$. Finally, the system transits from state $s_t$ to $s_{t+1}$ at the next decision step $(t+1)$. We formalise our DRL problem as follows. The observed state $s_t$ is the estimated distribution of the RUL, $P_{j,t}$, over the next $K$ cycles, i.e., $j \in \{1, \ldots, K\}$. Formally,
$$s_t = \big(P_{1,t}, P_{2,t}, \ldots, P_{K,t}\big),$$
where $P_{j,t}$ is the probability that the RUL is less than or equal to $j$ cycles (see Eq. (6)). The reward $r_t$ obtained at decision step $t$ is defined for 4 cases, considering action $a_t$ and the hidden state $R_t$: (1) a replacement is scheduled earlier than the engine failure; (2) a replacement is scheduled later than the engine failure; (3) we decide to do nothing in the next $K$ cycles, but the engine fails within the next $K$ cycles; and (4) we decide to do nothing, and the engine does not fail in the next $K$ cycles. Formally,
$$r_t = \begin{cases} -c_{\mathrm{sch}}(a_t), & \text{replacement scheduled at cycle } a_t \text{ and } a_t \le R_t, \\ -c_{\mathrm{uns}}, & \text{replacement scheduled at cycle } a_t \text{ and } a_t > R_t, \\ -c_{\mathrm{uns}}, & \text{do nothing and } R_t \le K, \\ 0, & \text{do nothing and } R_t > K. \end{cases}$$
Here, $c_{\mathrm{sch}}(j)$ denotes the cost of a scheduled replacement at cycle $j$ ($j \in \{1, \ldots, K\}$), which is defined as follows:
$$c_{\mathrm{sch}}(j) = c_0 - c_1 \cdot j,$$
where $c_0$ is a fixed cost of replacement ($c_0 > 0$), and $c_1$ is a penalty for an early replacement ($c_1 > 0$). We assume that a too early replacement is expensive because we have less time to prepare the required equipment [33]. Also, we assume $c_{\mathrm{sch}}(j)$ is positive for all $j$, i.e., $c_0 - c_1 K > 0$.
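The four reward cases and the scheduled-replacement cost above can be sketched as follows. This sketch assumes the linear cost form $c_0 - c_1 \cdot j$, which reproduces the numeric example of Fig. 9 (0.75 at cycle 25, 0.95 at cycle 5); the tie-breaking when the true RUL equals the replacement cycle is our assumption:

```python
def c_sch(j, c0=1.0, c1=0.01):
    """Scheduled-replacement cost: cheaper the later in the window (Eq. (10))."""
    return c0 - c1 * j

def reward(action_cycle, true_rul, horizon, c0=1.0, c1=0.01, c_uns=2.0):
    """Reward for one decision step, covering the four cases of Eq. (9).
    action_cycle is None for 'do nothing', else the replacement cycle j."""
    if action_cycle is None:
        # cases (3)/(4): do nothing; fail only if true RUL <= K
        return -c_uns if true_rul <= horizon else 0.0
    if true_rul < action_cycle:
        return -c_uns                          # case (2): engine failed first
    return -c_sch(action_cycle, c0, c1)        # case (1): scheduled replacement
```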
In Eq. (9), $c_{\mathrm{uns}}$ denotes the cost of an unscheduled replacement. We assume $c_{\mathrm{uns}} > c_0$, since an unscheduled replacement is generally more expensive [33]. Fig. 9 shows the cost model of a scheduled and an unscheduled engine replacement for $c_0 = 1$, $c_1 = 0.01$, and $c_{\mathrm{uns}} = 2$.
The goal of the DRL agent is to choose an optimal moment to schedule a replacement such that the expected reward is maximised (or the expected cost is minimised). When scheduling an engine replacement, the DRL agent considers only the observed state $s_t$ defined in Eq. (7). The training of the DRL agent is based on the observed state $s_t$, the action taken $a_t$, the obtained reward $r_t$, and the observed next state $s_{t+1}$. Although the reward $r_t$ is calculated using the true RUL (hidden state $R_t$) in Eq. (9), the DRL agent does not observe the true RUL $R_t$ directly.
In general, it is not trivial to choose an optimal moment to replace an engine, given the estimated RUL distribution (Fig. 7) and the cost model (Fig. 9). As an example, let us consider the cost in Fig. 9 and Engine 148 with the estimated RUL distribution shown in Fig. 7. Should we decide to replace Engine 148 at the 25th cycle, and this engine does not fail by the 25th cycle, then the cost of this replacement would be 0.75. Should we decide to replace Engine 148 at the 25th cycle, but this engine fails before the 25th cycle, then an unscheduled replacement would be performed at cost 2.0. For Engine 148, there is a 97% estimated probability that this engine fails before the 25th cycle (see Fig. 7). Should we decide to replace the engine at the 5th cycle, then the cost would be 0.95. Although a cost of 0.95 is higher than the cost of a scheduled replacement at the 25th cycle (0.75), this maintenance action reduces the risk of an expensive unscheduled replacement. For Engine 148, there is a 7% estimated failure probability by the 5th cycle, so a low risk of unscheduled maintenance. Overall, deciding at which cycle the replacement of Engine 148 should be performed is non-trivial.
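The trade-off above can be made explicit by comparing expected costs. A sketch, assuming a linear scheduled cost consistent with the Fig. 9 example:

```python
def expected_cost(j, p_fail_by_j, c0=1.0, c1=0.01, c_uns=2.0):
    """Expected cost of scheduling a replacement at cycle j, given the
    estimated probability p_fail_by_j that the engine fails before then."""
    scheduled_cost = c0 - c1 * j   # assumed linear form matching Fig. 9
    return p_fail_by_j * c_uns + (1.0 - p_fail_by_j) * scheduled_cost
```

With the numbers from the text, replacing Engine 148 at the 5th cycle has a much lower expected cost (about 1.02) than at the 25th cycle (about 1.96), in line with the risk argument above.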
Once the DRL agent chooses an action, the hidden state $R_{t+1}$ and the observed state $s_{t+1}$ are updated accordingly. If the engine is replaced at decision step $t$, then the next decision step considers a new engine from the C-MAPSS data set. Otherwise, we further obtain sensor measurements $X_{t+1}$ during the next $K$ cycles and update the distribution of the RUL (the next state $s_{t+1}$) by generating new RUL prognostics using a CNN with Monte Carlo dropout.

Training the DRL agent for predictive maintenance
The DRL agent chooses action $a_t$ (maintenance decision) for a given state $s_t$ (estimated distribution of RUL) based on a policy $\pi(a_t \mid s_t): S \times A \to [0, 1]$, which is the probability of choosing action $a_t$ for a given state $s_t$. The optimal policy $\pi^*$ is defined as a policy that maximises the expected reward:
$$\pi^* = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[\gamma^t \, r(s_t, a_t)\big],$$
where $\gamma$ is a discount factor, and $\rho_\pi(s_t, a_t)$ is the state-action trajectory distribution induced by a policy $\pi$ [44].

Soft-actor-critic algorithm to train the DRL agent for predictive maintenance planning
We train the DRL agent using the Soft-Actor-Critic (SAC) algorithm [44]. The SAC algorithm is an actor-critic algorithm where a policy (actor) is trained to choose actions that maximise the estimated state-action value (critic). Compared to traditional actor-critic algorithms, SAC uses a stochastic policy and maximises a soft objective to explore new policies.
We consider a stochastic policy $\pi_\phi(a_t \mid s_t)$ that determines the mean $\mu_\phi(s_t)$ and the standard deviation $\sigma_\phi(s_t)$ of an action for a given state $s_t$, where $\phi$ are the trainable parameters of $\mu_\phi$ and $\sigma_\phi$. Then, action $a_t$ is chosen as follows:
$$a_t = \mu_\phi(s_t) + \sigma_\phi(s_t) \cdot \epsilon_t,$$
where $\epsilon_t$ is sampled from a standard Gaussian distribution.
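This reparameterised sampling of Eq. (12) can be sketched in a few lines (NumPy, illustrative only); separating the Gaussian noise from the network outputs is what lets gradients flow through $\mu_\phi$ and $\sigma_\phi$ during training:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(mu, sigma, n=1):
    """Reparameterised action sampling: a = mu(s) + sigma(s) * eps,
    with eps drawn from a standard Gaussian distribution."""
    eps = rng.standard_normal(n)
    return mu + sigma * eps
```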
The considered soft objective includes the expected entropy of the policy $\pi_\phi$. Formally,
$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\big],$$
where $\alpha$ is the temperature parameter determining the relative importance between the entropy term and the reward term. Thus, the SAC algorithm simultaneously maximises the expected reward and the entropy of the policy, allowing the exploration of new policies.
Considering the soft objective in Eq. (13), the state-action value (Q function) is modified as the soft Q function $Q: S \times A \to \mathbb{R}$. This soft Q function is then obtained by iteratively applying the following modified Bellman backup operator $\mathcal{T}^\pi$ [44]:
$$\mathcal{T}^\pi Q(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p}\big[V(s_{t+1})\big],$$
where $p$ is the distribution of $s_{t+1}$, given $s_t$ and $a_t$, and $V(s_t)$ is the soft state value function $V: S \to \mathbb{R}$ defined as follows:
$$V(s_t) = \mathbb{E}_{a_t \sim \pi}\big[Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t)\big].$$
For the SAC algorithm, we train three functions: the policy ($\pi$), the soft Q function ($Q$), and the soft value function ($V$). We model these functions by means of three deep neural networks, $\pi_\phi$, $Q_\theta$, and $V_\psi$, where $\phi$, $\theta$, and $\psi$ are the trainable parameters of each neural network.
During training, we collect the replay buffer $D = \{(s_t, a_t, r_t, s_{t+1})\}$ based on the current policy $\pi_\phi$. Then, we update the trainable parameters to minimise the following loss functions.
The policy net $\pi_\phi$ is updated using the Kullback-Leibler (KL) divergence, which guarantees the improvement of the policy in terms of its soft value [44]. We minimise the expected KL divergence as follows:
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim D}\bigg[ D_{\mathrm{KL}}\bigg(\pi_\phi(\cdot \mid s_t) \,\bigg\|\, \frac{\exp\big(Q_\theta(s_t, \cdot)\big)}{Z_\theta(s_t)}\bigg)\bigg],$$
where $Z_\theta(s_t)$ is the partition function that does not contribute to the gradient with respect to $\phi$, and $a_t$ is sampled from the current policy $\pi_\phi$ using Eq. (12). For the value net $V_\psi$, we minimise the residual of the value function calculated based on the critic net $Q_\theta$:
$$J_V(\psi) = \mathbb{E}_{s_t \sim D}\Big[\tfrac{1}{2}\big(V_\psi(s_t) - \hat{V}(s_t)\big)^2\Big], \quad \text{with} \quad \hat{V}(s_t) = \mathbb{E}_{a_t \sim \pi_\phi}\big[Q_\theta(s_t, a_t) - \alpha \log \pi_\phi(a_t \mid s_t)\big].$$
For the critic net $Q_\theta$, we minimise the modified Bellman residual:
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t)\big)^2\Big], \quad \text{with} \quad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}}\big[V_{\bar{\psi}}(s_{t+1})\big].$$
Here, we use the target value net $V_{\bar{\psi}}$, where $\bar{\psi}$ is an exponentially moving average of the value net parameters [45]. Also, we adopt the double Q-learning approach: we simultaneously train two critic nets ($Q_{\theta_1}$ and $Q_{\theta_2}$), and we use $Q_\theta(s_t, a_t) = \min\{Q_{\theta_1}(s_t, a_t), Q_{\theta_2}(s_t, a_t)\}$ [46]. Both the target value net and the double Q-learning approach are known to stabilise the training process [45,46]. The gradients of the loss functions in Eqs. (16), (17), and (19) are obtained by backward propagation. Given the gradients of the corresponding objectives, the parameters $\phi$, $\psi$, and $\theta$ are updated using the Adam optimiser with learning rates $\lambda_\pi$, $\lambda_V$, and $\lambda_Q$, respectively.

Algorithm 1 Soft-Actor-Critic algorithm for predictive maintenance planning.
1: Initialise parameters $\phi$, $\psi$, $\bar{\psi}$, $\theta_1$, $\theta_2$, and replay buffer $D \leftarrow \emptyset$
2: for each episode do
3:   Sample initial state $s_0$ from the DRL episode data set
4:   for each decision step $t$ do
5:     Sample action $a_t \sim \pi_\phi(a_t \mid s_t)$
6:     Obtain reward $r_t$ (Eq. (9))
7:     Get next state $s_{t+1}$ (Eq. (7))
8:     $D \leftarrow D \cup \{(s_t, a_t, r_t, s_{t+1})\}$
9:     for each learning step do
10:      Sample mini-batch $B$ from $D$
11:      $\psi \leftarrow \psi - \lambda_V \nabla_\psi J_V(\psi)$
12:      $\theta_i \leftarrow \theta_i - \lambda_Q \nabla_{\theta_i} J_Q(\theta_i)$, for $i \in \{1, 2\}$
13:      $\phi \leftarrow \phi - \lambda_\pi \nabla_\phi J_\pi(\phi)$
14:      $\bar{\psi} \leftarrow \tau \psi + (1 - \tau)\bar{\psi}$
15:    end for
16:    $s_t \leftarrow s_{t+1}$
17:  end for
18: end for
We train the DRL agent for predictive maintenance of aircraft engines using the SAC algorithm (see Algorithm 1). We first initialise the parameters of the neural network models, $\phi$, $\psi$, $\bar{\psi}$, $\theta_1$, and $\theta_2$ (line 1). We train these networks for $N_{ep}$ episodes (line 2). An episode is initialised with observation $s_0$, which is the initial distribution of the engine's RUL sampled from a DRL episode data set (line 3). Here, the RUL distribution is estimated using the CNN that is already trained on an independent training data set. The episode continues for $T_{ep}$ decision steps (line 4). At each decision step $t$, for the observed state $s_t$, we sample the action $a_t$ using the policy net $\pi_\phi(a_t \mid s_t)$ (line 5). Based on this action, we obtain a reward $r_t$ and the next state $s_{t+1}$ (lines 6-7). We add $(s_t, a_t, r_t, s_{t+1})$ to the replay buffer $D$ (line 8). Then, for each learning step, we sample a mini-batch $B$ from the replay buffer $D$ (lines 9-10), and use this mini-batch to calculate the loss functions in Eqs. (16), (17), and (19). We next update the policy net, the value net, and the critic nets such that the corresponding objectives are minimised (lines 11-13).
Here, $\lambda_\pi$, $\lambda_V$, and $\lambda_Q$ are the learning rates of each network. Also, the parameters of the target value net $\bar{\psi}$ are updated with an exponential moving average of $\psi$, where $\tau$ is a smoothing factor (line 14). For the next decision step, we update the current state (line 16).
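The targets entering the value and critic losses can be sketched numerically. The following assumes the soft value target $\mathbb{E}[Q - \alpha \log \pi]$ (with the element-wise minimum of the two critics as $Q$) and the Bellman target $r + \gamma V_{\bar{\psi}}(s')$, as described above:

```python
import numpy as np

def soft_value_target(q_values, log_probs, alpha=0.01):
    """Target for the value net: E_a[ Q(s,a) - alpha * log pi(a|s) ],
    estimated over actions sampled from the current policy. q_values is the
    element-wise minimum of the two critics (double Q-learning)."""
    return float(np.mean(q_values - alpha * log_probs))

def soft_q_target(reward, next_value, gamma=0.9):
    """Bellman target for the critic nets: r + gamma * V_target(s')."""
    return reward + gamma * next_value
```

The squared differences between the networks' outputs and these targets are the residuals minimised by the value and critic updates in lines 11-12 of Algorithm 1.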

Design of the architecture of the neural networks
We design the architectures of the policy net $\pi_\phi$, the value net $V_\psi$, and the critic net $Q_\theta$ as shown in Fig. 10 and Table 4.
The policy net $\pi_\phi$ has input $s_t$, which is a vector of size $K$, and returns two scalar values corresponding to the mean $\mu_\phi(s_t)$ and the standard deviation $\sigma_\phi(s_t)$ of the action. These two outputs, $\mu_\phi(s_t)$ and $\sigma_\phi(s_t)$, are used to sample action $a_t$ for a given state $s_t$ (see Eq. (12)). We consider hidden layers shared by $\mu_\phi(s_t)$ and $\sigma_\phi(s_t)$ to facilitate learning from the shared features (see Fig. 10(a)). Following these shared hidden layers, we consider separate hidden layers for each of $\mu_\phi(s_t)$ and $\sigma_\phi(s_t)$. The value net $V_\psi$ has input $s_t$ and returns a scalar value $V_\psi(s_t)$. We consider two hidden, fully-connected layers. The same architecture is also used for the target value net $V_{\bar{\psi}}$.
The input of the critic net Q_θ is a vector of size (n + 1), which is the concatenation of the state s_t and the action a_t. Its output is a scalar value Q_θ(s_t, a_t). We consider two hidden, fully-connected layers (see Table 4). Since we use a double Q-learning approach, we consider two critic networks Q_θ1 and Q_θ2 with the same architecture but different parameters θ_1 and θ_2.
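The shared-trunk design of the policy net (Fig. 10(a)) can be illustrated with a toy forward pass in pure Python. Layer widths and initialisation below are illustrative, not the paper's Table 4 values: a shared hidden layer feeds two separate heads producing μ_φ(s_t) and σ_φ(s_t).

```python
import math
import random

def linear(x, w, b):
    """Fully connected layer: returns w @ x + b for list-based vectors."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def init(n_out, n_in, rng):
    """Small random weights, zero biases (illustrative initialisation)."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)]
         for _ in range(n_out)]
    return w, [0.0] * n_out

def make_policy_net(n_state, n_hidden, rng):
    shared = init(n_hidden, n_state, rng)       # trunk shared by both heads
    head_mu = init(1, n_hidden, rng)            # head for mu_phi(s_t)
    head_sigma = init(1, n_hidden, rng)         # head for sigma_phi(s_t)
    def forward(s):
        h = relu(linear(s, *shared))            # shared hidden features
        mu = linear(h, *head_mu)[0]             # mean of the action
        log_sigma = linear(h, *head_sigma)[0]   # head outputs log std
        return mu, math.exp(log_sigma)          # exp keeps sigma > 0
    return forward
```

The critic net would follow the same pattern with the concatenated vector `s + [a]` of size (n + 1) as input and a single scalar output.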

Case study: DRL for predictive maintenance of turbofan engines with probabilistic RUL prognostics
This section shows how the probabilistic RUL prognostics for turbofan engines (Section 2) are integrated into maintenance planning using the DRL approach discussed in Section 3.

Training the probabilistic RUL prognostics
We consider the maintenance of a turbofan engine whose sensor measurements are given in subset FD002 of the C-MAPSS data set (see Table 1) [11]. From the 260 training instances of FD002, we randomly sample 130 engines to obtain the data subset FD002-Prog, which is used to train the probabilistic RUL prognostics model (CNN with Monte Carlo dropout, Section 2). The remaining 130 engines, referred to as FD002-DRL, are used to generate the episodes of the DRL problem.
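The split described above can be sketched as follows. Engine identifiers and the seed are illustrative; only the 130/130 split of the 260 FD002 training instances is taken from the text.

```python
import random

def split_engines(engine_ids, n_prog=130, seed=42):
    """Randomly split engines into a prognostics-training subset
    (FD002-Prog) and a DRL-episode subset (FD002-DRL)."""
    rng = random.Random(seed)
    prog = set(rng.sample(engine_ids, n_prog))
    drl = [e for e in engine_ids if e not in prog]
    return sorted(prog), drl

# FD002 has 260 training engines; IDs 1..260 are assumed here.
fd002_prog, fd002_drl = split_engines(list(range(1, 261)))
```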

Training the DRL agent
The DRL agent considers maintenance episodes generated from the data set FD002-DRL. At each episode, we sample an engine from FD002-DRL. Using its sensor measurements and the trained RUL prognostics model, we generate the RUL distribution p_{k,t} of the sampled engine. This estimated RUL distribution is the state s_t observed by the DRL agent. If the DRL agent decides to do nothing, the sensor measurements of the sampled engine are updated at the next decision step t + 1. If the DRL agent decides to replace the engine, a new engine is sampled from FD002-DRL.
We train the DRL agent for E = 5000 episodes using Algorithm 1. Each episode consists of at most T = 100 decision steps, and each decision step covers K = 30 flight cycles (see Fig. 6 for the definition of a decision step). As a reward (cost) model, we assume c_uns = 2, c_0 = 1, and c_1 = 0.01 for the cost parameters defined in Eqs. (9)-(10) (see also Fig. 9 for the cost model). The hyper-parameters of the SAC algorithm are as follows: discount factor γ = 0.9, temperature parameter α = 0.01, learning rates λ_π = 10^−5, λ_V = 10^−4, λ_Q = 10^−4, smoothing factor of the target value net τ = 10^−3, maximum size of the replay buffer |D| = 10^6, and size of the mini-batch |B| = 4096. Fig. 11 shows the learning curve of the DRL agent, i.e., the total reward per episode. The total reward rapidly increases during the first 500 episodes and converges to around −18 after 1000 episodes. After 1000 episodes, the total reward of each episode still varies because the considered DRL problem is stochastic. However, the moving average of the total reward stabilises after 1000 episodes. Moreover, 5 independent training curves show the same trend. Thus, the training is stopped after 5000 episodes.
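Since Eqs. (9)-(10) are not reproduced in this section, the sketch below shows one plausible reading of the cost model implied by the parameters above: an unscheduled replacement costs c_uns = 2, and a scheduled replacement costs c_0 = 1 plus a penalty of c_1 = 0.01 per cycle of wasted engine life. The exact functional form is an assumption.

```python
def replacement_cost(scheduled, wasted_life_cycles,
                     c_uns=2.0, c_0=1.0, c_1=0.01):
    """Assumed cost model: flat cost c_uns for an unscheduled replacement
    (engine failed first); c_0 plus c_1 per wasted cycle for a scheduled
    replacement. This is a hypothetical reconstruction of Eqs. (9)-(10)."""
    if not scheduled:
        return c_uns
    return c_0 + c_1 * wasted_life_cycles

def reward(scheduled, wasted_life_cycles):
    """The reward passed to the DRL agent is the negative cost."""
    return -replacement_cost(scheduled, wasted_life_cycles)
```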

Evaluation of the DRL agent: Predictive maintenance using DRL
Following training, we evaluate the trained DRL agent for 1000 episodes generated by our CNN model and the data set FD002-DRL. During evaluation, the DRL agent chooses an action a_t for a given state s_t from the mean action μ_φ of the trained policy π_φ [44], i.e., a_t = μ_φ(s_t). Below we discuss the benefits of our DRL approach for predictive maintenance by presenting several decision steps (the estimated RUL distributions and the associated maintenance actions taken by the DRL agent).

Maintenance decision based on updated RUL distribution
The estimated RUL distributions are updated at every decision step t (every K cycles), as more sensor measurements become available. This ensures that the maintenance decision is always based on the most recent RUL prognostics. Fig. 12 shows 2 consecutive decision steps (t = 80 and t = 81) for the maintenance of Engine 247, FD002 of the C-MAPSS data set. At decision step t = 80, the engine has been operated for 149 cycles. Our CNN model predicts the probability that the engine will fail within 30 cycles to be p_{30,80} = 0.005. The entire distribution p_{k,80} for k ∈ {0, …, 30} is given in Fig. 12(a). Such a distribution quantifies the uncertainty of the RUL prognostics, and provides the basis for the maintenance decisions of the DRL agent. The DRL agent observes the RUL distribution and decides to do nothing, i.e., no replacement is scheduled at decision step t = 80. Since no replacement is scheduled for the next K = 30 cycles, the engine continues to operate until the next decision step t = 81, and more sensor measurements are collected. Based on the new sensor measurements, we update the distribution of the RUL again using the CNN with Monte Carlo dropout (see Fig. 12(b)). At decision step t = 81, the probability that the engine will fail within 30 cycles is estimated to be p_{30,81} = 0.807. Given the updated distribution of the RUL, the DRL agent schedules a replacement after 7 cycles (see the blue vertical line in Fig. 12(b)). The probability that the engine will fail within 7 cycles is p_{7,81} = 0.091. In fact, the (hidden) true RUL is 18 cycles at decision step t = 81, i.e., the DRL agent schedules a replacement 11 cycles before the engine fails.
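The failure probabilities p_{k,t} discussed above (e.g. p_{30,80} = 0.005) can be estimated from Monte Carlo dropout in a straightforward way: each stochastic forward pass of the CNN yields one RUL sample, and p_{k,t} is the fraction of samples at or below k cycles, i.e. an empirical CDF. The sketch below assumes the RUL samples are already available; the values used are illustrative.

```python
def failure_probability(rul_samples, k):
    """Empirical estimate of P(RUL <= k) from Monte Carlo dropout
    RUL samples (one sample per stochastic forward pass of the CNN)."""
    return sum(1 for r in rul_samples if r <= k) / len(rul_samples)

def rul_cdf(rul_samples, k_max=30):
    """The distribution p_{k,t} for k in {0, ..., k_max}, which serves
    as the state s_t observed by the DRL agent."""
    return [failure_probability(rul_samples, k) for k in range(k_max + 1)]
```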

Adaptive maintenance decision using deep neural network
Using a deep neural network model (the policy net), our DRL agent adaptively considers the updated RUL distribution of each individual engine, instead of relying on one fixed threshold for all engines. As a result, our DRL agent can identify the optimal moment of engine replacement taking into account different trends of the RUL distributions. For example, Fig. 13 shows the distinctive RUL distributions of three engines, estimated during different episodes. In Fig. 13(a), there is a very high chance that the engine will fail within the next 30 cycles (p_{30,t} = 0.896). In this case, the DRL agent schedules a replacement after 5 cycles, when the probability that the engine will fail within 5 cycles is estimated to be p_{5,t} = 0.113. In Fig. 13(b), the estimated p_{k,t} increases from p_{1,t} = 0.011 to p_{30,t} = 0.748, and the DRL agent schedules a replacement after 9 cycles, when p_{9,t} = 0.073. In the last case, in Fig. 13(c), the probability that the engine will fail within 30 cycles is smaller than in the two previous cases (p_{30,t} < 0.15), but the trend increases rapidly. In this case, the DRL agent schedules a replacement after 18 cycles, when p_{18,t} = 0.011, effectively preventing the engine failure.
In contrast to our DRL approach, existing predictive maintenance approaches often use fixed thresholds for all components of the same type to trigger maintenance. For example, in [12], an alarm is triggered when the estimated RUL of an engine falls below a threshold (44 days). Similarly, in [35], airframe panels are replaced when the predicted crack size exceeds a threshold (47.4 mm). Since one fixed threshold value is applied to all components, differences between the RUL prognostics of individual components may not be taken into account by these traditional approaches.
The benefit of our adaptive maintenance planning using a deep neural network becomes evident when trying, unsuccessfully, to find one fixed threshold that is optimal for all three cases in Fig. 13. Let us assume that we use a fixed threshold of 0.11 and always schedule an engine replacement after k cycles if p_{k,t} > 0.11, irrespective of the RUL distribution. Using such a fixed threshold of 0.11 effectively prevents the failure in the first case (Fig. 13(a)). Using the same threshold for Figs. 13(b) and 13(c), engine replacements are scheduled after 12 and 28 cycles, respectively. However, in both cases, the engine replacements are later than the true RUL (11 and 21 cycles, respectively), leading to unscheduled replacements at a higher cost. Along the same lines, let us assume that we set a much lower fixed threshold of 0.01, i.e., we always schedule an engine replacement after k cycles if p_{k,t} > 0.01. Using this threshold avoids an unscheduled engine replacement in the last case (Fig. 13(c)). However, with such a low threshold, replacements are scheduled too early in the other two cases (Figs. 13(a) and 13(b)), wasting the useful life of the engines. This example shows that finding one fixed threshold that is optimal for all cases is challenging. In contrast, our DRL agent adaptively considers the different trends of the RUL distributions, without using fixed thresholds. As a result, our approach leads to less unscheduled maintenance.
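The fixed-threshold baseline discussed above can be sketched as follows: replace after the smallest k whose estimated failure probability p_{k,t} exceeds the threshold. The example CDF below is illustrative, not the data of Fig. 13; it only shows why one threshold triggers late and another early for the same distribution.

```python
def threshold_replacement_cycle(p, threshold):
    """p[k] = estimated P(fail within k cycles). Returns the first k
    with p[k] > threshold, or None if the threshold is never exceeded
    within the horizon (i.e., do nothing at this decision step)."""
    for k, pk in enumerate(p):
        if pk > threshold:
            return k
    return None

# A slowly rising CDF: a high threshold schedules late, a low one early.
cdf = [0.0, 0.005, 0.02, 0.05, 0.09, 0.13, 0.2]
late = threshold_replacement_cycle(cdf, 0.11)   # triggers at k = 5
early = threshold_replacement_cycle(cdf, 0.01)  # triggers at k = 2
```

A single threshold therefore cannot adapt to distributions with different shapes, which is exactly what the learned policy net avoids.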

Scheduling replacements with small wasted life of engines
Using updated RUL distributions and adaptive maintenance decisions, our DRL agent schedules engine replacements without wasting much of the useful life of the engines. Fig. 14 shows the distribution of the wasted life of the engines at the moment of replacement when using our DRL approach. Replacements are scheduled when the true RUL of an engine is, on average, 12.81 cycles. This is only 6% of the average life of the engines in subset FD002. Moreover, more than 82% of the engines are replaced with a wasted life of less than 20 cycles.

Predictive maintenance using DRL vs other maintenance strategies
In this section, we compare the performance of our DRL approach for predictive maintenance against three other, traditional maintenance strategies: (1) Predictive maintenance at mean-estimated-RUL: This strategy schedules engine replacements at the mean RUL predicted by the CNN model in Section 2. This strategy uses a point estimate of the RUL, while our DRL approach uses a distribution of the RUL. With this strategy, we aim to evaluate the impact of using the distribution of the RUL for maintenance planning, rather than just a point estimate of the RUL.
(2) Corrective maintenance: This strategy replaces engines as soon as they fail. Under this strategy, we always perform unscheduled replacements, which is the most undesirable case.
(3) Ideal maintenance at true RUL: This strategy assumes that the true RUL is known in advance by an oracle, and engine replacements are scheduled exactly at this true RUL. Under this strategy, there are no unscheduled maintenance tasks and the wasted life of the engines is always zero, i.e., it is an ideal maintenance strategy.
Table 5 shows the performance of these traditional maintenance strategies vs. our DRL approach, using the following three performance indicators: (i) The total cost: the cost of both scheduled and unscheduled replacements during 3000 cycles of engine operations (i.e., 100 decision steps). The cost (reward) model is given in Eqs. (9)-(10).
(ii) The number of unscheduled replacements: a direct metric for maintenance reliability. We aim to minimise the number of unscheduled engine replacements.
(iii) The total number of replacements: the number of both scheduled and unscheduled replacements during 3000 cycles of engine operations. Since we consider a fixed period of cycles, a lower total number of replacements implies that we utilise the engines for a longer duration.
Table 5 shows that our DRL approach using RUL distributions outperforms the other maintenance strategies, especially in terms of total maintenance cost and number of unscheduled replacements. Our DRL approach saves 36.3% of the total cost compared to corrective maintenance. Moreover, it achieves more reliable maintenance planning by preventing 95.6% of the unscheduled replacements. The total number of replacements (both scheduled and unscheduled) is slightly (6.4%) larger for our DRL approach, since engines are replaced before their end-of-life to prevent unscheduled replacements. However, this slight increase in the total number of engine replacements is offset by the large cost savings and the higher maintenance reliability (lower number of unscheduled replacements) that our DRL approach achieves.
The benefit of using probabilistic RUL prognostics instead of a point estimate of the RUL is evident when comparing our DRL approach against predictive maintenance at the mean-estimated-RUL (see Table 5). Both strategies make use of RUL prognostics obtained with a CNN (see Section 2), but our DRL approach uses probabilistic RUL prognostics (the estimated RUL distribution) to plan engine maintenance. As a result, predictive maintenance based on the mean-estimated-RUL reduces the total cost by only 9.8% and the unscheduled replacements by only 22.3%, while our DRL approach achieves far larger reductions in total cost (36.3%) and unscheduled replacements (95.6%).
The cost savings obtained by our DRL approach are further explained in Fig. 15. Since we assume that unscheduled replacements are 2 times more costly (see the cost model in Fig. 9), even a small number of unscheduled replacements can account for a large portion of the total cost. For this reason, although all maintenance strategies perform a similar number of total replacements, the total maintenance costs differ significantly. In the case of predictive maintenance at the mean-estimated-RUL, 85% of the total cost is associated with unscheduled replacements. In contrast, for our DRL approach, only 7% of the total cost is associated with unscheduled replacements.

Conclusions
In this paper, we propose a deep reinforcement learning (DRL) approach to plan predictive maintenance for aircraft engines. This maintenance planning takes into account the estimated distribution of the engines' Remaining-Useful-Life (RUL).
We first estimate the RUL distribution of the engines using Convolutional Neural Networks with Monte Carlo dropout. These estimates are periodically updated as more sensor measurements become available. Such estimates of the RUL distribution provide useful information about the uncertainty associated with the RUL prognostics and enable more effective maintenance planning.
With the estimated RUL distribution, we schedule maintenance for turbofan engines using DRL. Maintenance actions are specified adaptively, based on the trends of the RUL prognostics. In contrast to existing studies, we do not use fixed thresholds to trigger maintenance actions. Our DRL approach thus enables adaptive, flexible, threshold-free maintenance planning.
The results show that our DRL approach with probabilistic RUL prognostics leads to lower maintenance costs and fewer unscheduled maintenance events than several other maintenance strategies. Compared to maintenance planning at the mean-estimated-RUL, our DRL approach reduces the total maintenance cost by 29.3%. Moreover, it prevents 95.6% of unscheduled engine replacements. The engines are replaced just before their end-of-life, with an average wasted life of only 12.81 cycles. Overall, our DRL approach outperforms the other traditional maintenance strategies in terms of both cost and reliability indicators.
Overall, this study proposes a generic framework to integrate data-driven, probabilistic RUL prognostics into predictive maintenance. This framework is readily applicable to other aircraft components whose health is continuously monitored.
As future work, we plan to extend the proposed DRL approach to the predictive maintenance of multiple components. In addition, we will consider more realistic inputs and constraints of aircraft maintenance, such as limited hangar capacity, the logistics of spare parts, and dynamic flight conditions.

Fig. 1 .
Fig. 1. Overview of the proposed predictive maintenance framework using probabilistic RUL prognostics and DRL.

Fig. 2 .
Fig. 2. Proposed multi-channel CNN architecture. The blue lines visualise how multiple channels are convoluted into the next layer (see Eq. (3)). The red lines visualise a forward pass of a linear layer (see Eq. (4)).

Fig. 3 .
Fig. 3. Evolution of the RUL distribution over time. (a) The mean-estimated RUL gets closer to the true RUL, and the variance decreases. (b) The mean-estimated RUL gets closer to the true RUL, and the variance decreases, though the distribution is skewed. (c) Neither the error of the mean-estimated RUL nor the variance decreases.

Fig. 4 .
Fig. 4. Probabilistic RUL prognostics. (a) The error between the mean-estimated RUL and the true RUL is small, and the standard deviation of the RUL distribution is small. (b) The mean-estimated RUL is slightly larger than the true RUL, and the RUL distribution is right-skewed. (c) The error between the mean-estimated RUL and the true RUL is large, and the standard deviation of the RUL distribution is large.

Fig. 5 .
Fig. 5. Calibration plot of the estimated CDF of RUL for four test data sets.

Fig. 8 .
Fig. 8. Illustration of states, actions, rewards and transitions of the DRL problem for predictive maintenance planning.
Given state s_t, the agent chooses an action a_t: either schedule a replacement of the engine at cycle k (k ∈ {1, …, K}), or do nothing. Formally,

a_t = { k, 0 < k ≤ K : schedule replacement at cycle k,
        a, a > K : do nothing.  (8)

If a_t = a > K, we do not schedule an engine replacement in the next K cycles and postpone the replacement decision to the next decision step (t + 1).
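The action interpretation in Eq. (8) can be sketched as follows. Rounding a continuous action up to an integer cycle is our assumption; the paper's exact mapping from the continuous SAC action is not reproduced here.

```python
import math

def interpret_action(a_t, K=30):
    """Interpret a continuous action a_t following Eq. (8): values above
    the horizon K mean 'do nothing' (postpone to the next decision step);
    values in (0, K] schedule a replacement at cycle k. Rounding up to an
    integer cycle via ceil is a hypothetical choice for illustration."""
    if a_t > K:
        return ("do_nothing", None)
    return ("schedule_replacement", max(1, math.ceil(a_t)))
```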

Fig. 11 .
Fig. 11. Learning curve of the DRL approach during 5000 episodes. The thin grey lines are 5 learning curves, and the solid lines are the moving averages over 100 episodes of each learning curve.

Fig. 13 .
Fig. 13. Three different RUL distributions and the adaptive maintenance decisions of the DRL agent.

Fig. 14 .
Fig. 14. Wasted life of engines at the moment of replacement under our DRL approach.

Fig. 15 .
Fig. 15. Cost of scheduled/unscheduled replacements of the proposed DRL approach and other maintenance strategies.
where x̃_{i,j} is the normalised measurement of sensor i at the jth flight cycle, and x_{i,j} is the raw measurement generated by sensor i at the jth flight cycle. This jth flight cycle is performed under operating condition c.

Table 2
Architecture of the proposed CNN, where the Conv1D layers are specified by their number of output channels and kernel length, and the Linear layers by their number of output neurons. A dropout rate of 0.5 is used for all layers.

Table 3
RMSE of the RUL predictions for the C-MAPSS data subsets using the proposed multi-channel CNN with Monte Carlo dropout and other studies. STD: standard deviation; N/A indicates that the STD is not available in the original paper.

Table 4
Architecture of the deep neural network models for the policy net π_φ, the value net V_ψ, and the critic net Q_θ (see also Fig. 10).

Table 5
Comparison of the proposed DRL approach using the RUL distribution with other maintenance strategies. Percentages in parentheses indicate the ratio relative to corrective maintenance.