A data-driven predictive maintenance strategy based on accurate failure prognostics

Maintenance is fundamental to ensuring the safety, reliability and availability of engineering systems, and predictive maintenance is the leading approach in maintenance technology. This paper aims to develop a novel data-driven predictive maintenance strategy that can make appropriate maintenance decisions for repairable complex engineering systems. The proposed strategy includes degradation feature selection and degradation prognostic modeling modules to achieve accurate failure prognostics. For maintenance decision-making, the optimal time for taking maintenance activities is determined by evaluating online the maintenance cost, which takes into account the failure prognostic results of performance degradation. The feasibility and effectiveness of the proposed strategy are confirmed using the NASA data set of aero-engines. Results show that the proposed strategy outperforms the two benchmark maintenance strategies: classical periodic maintenance and emerging dynamic predictive maintenance.


Introduction
Many modern engineering systems operate in highly demanding environments. During long-term continuous operation under extreme conditions, their operating performance inevitably deteriorates over time [1]. Once degradation reaches a critical level, underperforming components or subsystems may fail and jeopardize system safety [7]. Well-timed maintenance is therefore a core requirement in all engineering systems.
Maintenance strategies can be categorized into two types: preventive maintenance and corrective maintenance [8]. Preventive maintenance schedules proactive maintenance activities routinely, while corrective maintenance is an unscheduled strategy that attempts to restore the system after failures [4]. For systems with stringent demands on safety and reliability, preventive maintenance is the mainstream choice. Traditional preventive maintenance is based on the service time and the probability distribution of the trouble-free operation time span of the system, so it is also termed time-based maintenance (TBM). Its limitations are obvious. On the one hand, intensive preventive maintenance results in excessive maintenance; on the other hand, preventive maintenance with a fixed time span cannot avoid unexpected faults or faults for which prior knowledge is insufficient [10]. To improve the cost-effectiveness of preventive maintenance, condition-based maintenance (CBM), which takes into account the actual operating conditions of the system over time, has been proposed and has received considerable attention from both academia and industry over the last decade [19].
In the existing CBM strategies, the degrading system condition is often described by stochastic modeling, such as a Markov chain with multiple discrete states [13,14,15,16] or a stochastic process model with a continuous degradation state [5,6,22]. These stochastic-model-based CBM strategies require either that the transition probabilities of the system states are known in advance or can be learned from historical reliability data, or that there exists a stochastic process characterizing the system degradation mechanism. However, in practice, it is difficult or even impossible to obtain, at affordable cost, the accurate probability distributions of all possible transitions of system states and the accurate degradation mechanism of a complex engineering system. To avoid these difficulties, machine learning based methods that are independent of the system degradation mechanism have in recent years been applied to the field of prognostics and health management (PHM) [18]. In this emerging field, a trend of maintenance technology is to make maintenance decisions based on multivariate condition monitoring and failure prognostics [2]. For example, a deep neural network structure called the long short-term memory (LSTM) network was used to discover the underlying time series patterns for predicting the system remaining useful life (RUL) [21]. In [3], the authors adopted a restricted Boltzmann machine to pre-train the abstract features for LSTM input. Moreover, a two-dimensional grid LSTM was designed to improve the prediction accuracy of fuel cell performance degradation [9].
The above machine learning based research focuses only on life prediction and does not consider maintenance decision-making. Recently, a novel dynamic predictive maintenance (PdM) framework using an LSTM network for failure prognostics has been developed [12]. The authors discussed in detail the advantages of PdM over other maintenance strategies and gave a complete framework from data-driven prognostics to maintenance decisions. In our past work, an effective data-driven degradation prognostic technique was developed with good verification results for the aero-engine system [20]. The work of this paper is a follow-up of [12] and [20], and the main contribution is a data-driven PdM strategy that makes long-term, reliable maintenance decisions for engineering systems. In detail, we design a module of degradation feature selection, which gives the failure prognostics and maintenance decision-making a lower computing load, faster convergence and better robustness in the presence of uncertainties. More accurate failure prognostics are realized via a multivariate LSTM network whose inputs are the selected degradation features. The prognostic model provides the future degradation trend online for failure prognosis. For maintenance decision-making, the optimal time for taking maintenance activities is determined by evaluating the maintenance cost online based on the failure prognostic results of performance degradation. Correspondingly, long-term, reliable maintenance decisions can be realized, which is crucial for planning maintenance, inventory and production activities in advance.
The remainder of this paper is organized as follows. In Section 2, an enhanced data-driven PdM strategy is presented, including implementation details and performance evaluation, under the framework of [12]. In Section 3, the feasibility and effectiveness of the proposed PdM strategy are confirmed using the NASA data set of aero-engines. Conclusions and future work are discussed in Section 4.
2. An enhanced data-driven PdM strategy

Key idea
A novel data-driven dynamic PdM framework has been proposed in [12], which has provided a complete process from data-driven prognostics to maintenance decisions. The entire process, as shown in Fig. 1, functionally includes three parts: LSTM modeling, online failure prognosis and maintenance decisions.
The LSTM step includes training an LSTM classifier and using it to determine the degradation label of online measurements. It deals with the multivariate raw data directly, and all data are used as inputs of the LSTM model. This may cause a heavy computing load, slow convergence and low robustness of the LSTM modeling, and ultimately reduce the accuracy of failure prognosis. Also, the LSTM network only provides the probabilities of system failure at the current moment, which limits the decision-making to be instantaneous. Instantaneous decision-making only answers whether or not the system needs maintenance activities at the current moment. It cannot give the exact time when the system must undergo preventive maintenance. In practice, long-term, reliable decision-making is more valuable for industrial organizers, who need to plan maintenance, inventory and production activities in advance.
To overcome the above issues, this paper proposes an enhanced dynamic PdM strategy that achieves future failure prognosis and long-term, reliable maintenance decision-making. The main steps are shown in Fig. 2. Compared with the original PdM framework in Fig. 1: (1) in the data preprocessing step, the multivariate raw data are first preprocessed to extract the features that reflect the degradation trends; (2) in the LSTM modeling step, an extra LSTM regression model is introduced for predicting the future degradation trends of the system; (3) in the decision-making step, the predicted failure probabilities at different future moments are used to make long-term maintenance decisions, e.g., to decide when the system needs maintenance activities and when the spare parts should be ordered. Fig. 3 illustrates the difference between the dynamic instantaneous and long-term decision-making processes. At the current moment, instantaneous decision-making answers whether or not the system needs maintenance activities, while long-term decision-making gives the exact time when the system must undergo preventive maintenance. The long-term decision-making therefore looks further ahead. As the operation time of the system increases, the sensors collect more condition monitoring data, making the decision-making results more accurate.

Degradation feature selection and improved failure prognosis via LSTM
In practice, sensor measurements are often contaminated with noise, which may conceal a weak degradation trend. Data de-noising should therefore be conducted in the data pre-processing phase. To do so, the simple but effective moving average method is employed to extract the system degradation trends [20]. This process is briefly described as follows. Firstly, all available historical condition monitoring data are arranged into a three-dimensional array $X(I \times J \times K)$, where $I$ denotes the number of samples, $J$ denotes the number of measured variables and $K$ denotes the number of operation cycles. The $k$th value of the $j$th variable in the $i$th sample is denoted as $x_{ij}(k)$. The degradation values $\tilde{x}_{ij}(k)$ are then obtained by applying a moving average to $x_{ij}(k)$, where $n$ is the size of the moving window.
Then, Z-score normalization is used to handle the different ranges of the sensor measurements: each degradation value is centered by $\mu$ and scaled by $\delta$, where $\mu$ and $\delta$ denote the mean and standard deviation of these degradation values, respectively.
In addition, eliminating useless data is necessary before LSTM network modeling, since it generally improves the performance of modeling, failure prognosis and decision-making. Therefore, a module of degradation feature selection is included in the proposed maintenance strategy. In this paper, the correlation and trend indicators are adopted for degradation feature selection due to their effectiveness. The correlation indicator $\rho_{ij}$ is computed from $d$, the difference between the ranks of each $\tilde{x}_{ij}(k)$ and $k$, and the trend indicator $T_j$ is built on the indicator function $\delta(x)$, i.e., $\delta(x) = 1$ when $x$ is true and $\delta(x) = 0$ otherwise. According to the two indicators, the crucial features are selected by the criterion $\rho_{ij} \geq 0.5$ && $T_j = 0$ or $1$ [20].
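The pre-processing and feature selection steps can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the moving average is taken as a trailing window, the correlation indicator is taken as a Pearson correlation between the ranks of the smoothed values and the cycle index (a Spearman-type coefficient, set to 0 for a constant series), the trend indicator is taken as the fraction of samples with a positive correlation, and the threshold is applied to the absolute value of the correlation so that monotonically decreasing sensors can also be retained. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def moving_average(x, n):
    """Trailing moving average (assumed form); early points use the data seen so far."""
    out = []
    for k in range(len(x)):
        window = x[max(0, k - n + 1): k + 1]
        out.append(sum(window) / len(window))
    return out


def z_score(x):
    """Z-score normalization; a constant series maps to all zeros."""
    mu, sd = mean(x), pstdev(x)
    return [0.0 if sd == 0 else (v - mu) / sd for v in x]


def ranks(x):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for m in range(i, j + 1):
            r[order[m]] = (i + j) / 2 + 1
        i = j + 1
    return r


def correlation_indicator(x):
    """Rank correlation between a series and its cycle index (assumed definition)."""
    rx = ranks(x)
    rk = [float(k) for k in range(1, len(x) + 1)]
    mx, mk = mean(rx), mean(rk)
    sxy = sum((a - mx) * (b - mk) for a, b in zip(rx, rk))
    sxx = sum((a - mx) ** 2 for a in rx)
    skk = sum((b - mk) ** 2 for b in rk)
    if sxx == 0 or skk == 0:
        return 0.0  # constant series: no correlation with operating time
    return sxy / sqrt(sxx * skk)


def select_features(samples, n=20, threshold=0.5):
    """samples[i][j] is the raw series of variable j in sample i; returns kept indices j."""
    selected = []
    for j in range(len(samples[0])):
        rhos = [correlation_indicator(z_score(moving_average(s[j], n)))
                for s in samples]
        trend = sum(1 for r in rhos if r > 0) / len(rhos)  # assumed trend indicator
        if all(abs(r) >= threshold for r in rhos) and trend in (0.0, 1.0):
            selected.append(j)
    return selected
```

Under these assumptions, a variable is kept only if it is strongly and consistently monotonic across all samples; a constant variable, or one that rises in some samples and falls in others, is discarded.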

Fig. 3. Dynamic instantaneous and long-term decision-making processes
Next, to obtain the failure probabilities at different future moments, a multivariate LSTM regressor for degradation trend prediction is first trained with historical data (see Algorithm 1). Note that the multivariate LSTM network can exploit the nature of the evolving degradation trend [23], and that in Algorithm 1, $X(I \times F \times K)$ denotes the pre-processed data with $F$ important features. Fig. 4 shows a schematic diagram of the degradation trend prediction. The online condition monitoring data (cycles 1 to $t$) are pre-processed in the same way and then fed into the well-trained multivariate LSTM regressor, which predicts the future degradation trends of the system.

Fig. 4. Schematic diagram of degradation trend prediction
Similar to [12], a multivariate LSTM classifier for failure probability estimation is trained with historical data (see Algorithm 2). In Algorithm 2, $R(I \times 1 \times K)$ denotes the RUL data, which hold one RUL value for the $k$th cycle of the $i$th sample. The degradation data are labeled with two classes: Deg1 and Deg2. Deg1 represents the case where the system RUL is greater than or equal to the time window $w_0$, i.e., $RUL \geq w_0$. Deg2 means $RUL < w_0$. The two labels can be regarded as two degradation states of different severity, such as allowable degradation and intolerable degradation. Algorithm 2 returns the well-trained network parameters.
In practice, due to technical and logistical constraints, maintenance activities cannot be carried out at any time or place. As an illustration, the maintenance activities for train or aircraft engines cannot be carried out during their journeys. Maintenance activities can be performed only at the inspection moments. It is assumed that the inspection interval $\Delta T$ between two successive inspections is constant. If the RUL of the system at some future inspection moment $h$ is less than $\Delta T$, the system will have failed by the next inspection moment $h + \Delta T$. Hence, the time window is set equal to the inspection interval, i.e., $w_0 = \Delta T$.
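The labeling rule with $w_0 = \Delta T$ can be illustrated with a short sketch. The helper names are hypothetical, and the RUL of a complete run-to-failure record is assumed here to be the number of cycles remaining until the last recorded cycle.

```python
def rul_from_run_to_failure(n_cycles):
    """RUL at cycles 1..n_cycles, assuming RUL = cycles left until the last one."""
    return [n_cycles - k for k in range(1, n_cycles + 1)]


def label_degradation(rul, w0):
    """Label 1 (Deg1) when RUL >= w0, label 2 (Deg2) when RUL < w0."""
    return [1 if r >= w0 else 2 for r in rul]
```

For a record of 15 cycles and $w_0 = 10$, the first 5 cycles are labeled Deg1 and the last 10 cycles are labeled Deg2.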
The predicted degradation trends are ultimately fed into the well-trained LSTM classifier, and thus the failure probabilities at different future moments are obtained.

Improved maintenance decision-making method
The following long-term maintenance strategy determines the exact future moments at which to take maintenance activities and to order spare parts. The optimal maintenance moment is determined by choosing, based on the predicted failure probabilities, the lower-cost option between the expected preventive-maintenance (PM) cost and the no-PM cost.
The expected-PM cost is defined as follows. At a future moment $h$ ($h = t + \Delta T, t + 2\Delta T, \ldots$), all the costs associated with the preventive maintenance actions, such as replacing worn parts with new ones, system cleaning and adjustment, and the inventory cost of spare parts, are summed up to give the expected-PM cost, denoted as $C_p$. An important assumption is that the system is restored to an "as good as new" state after the PM actions; in other words, perfect maintenance is considered in this paper.
If no PM actions are taken at the moment $h$, there will be no PM cost from the current moment $t$ to the future moment $h$, but the running system is exposed to a failure risk between $h$ and $h + \Delta T$. In this case, one must consider the no-PM cost, which includes the corrective maintenance cost $C_c$ incurred by unexpected failures and the out-of-stock cost $C_{os}$ incurred when spare parts are unavailable. Thus, the expected cost of the no-PM decision is defined from $C_c$, $C_{os}$ and the probability of an unexpected failure within the inspection period $[h, h + \Delta T)$. Fig. 5 shows the decision process based on the above-mentioned maintenance costs. If the expected-PM cost is lower than or equal to the no-PM cost, PM activities should be taken. Otherwise, no maintenance activity is required in the inspection period $[h, h + \Delta T)$. The optimal maintenance moment $t^*_{maintenance}$ is obtained from this comparison. Ordering of spare parts should be implemented before the maintenance activities: if the longest advance ordering time is $L$, the optimal ordering moment $t^*_{order}$ is determined from $t^*_{maintenance}$ and $L$.
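The cost comparison can be sketched as follows. This is a hedged illustration rather than the paper's exact formulation: the expected no-PM cost is assumed to be $(C_c + C_{os})$ multiplied by the predicted failure probability for the period $[h, h + \Delta T)$, the optimal maintenance moment is assumed to be the earliest inspection moment at which the expected-PM cost does not exceed the no-PM cost, and the ordering moment is assumed to be the maintenance moment minus the lead time $L$. The function names and any cost values passed to them are hypothetical.

```python
def no_pm_cost(p_fail, c_c, c_os):
    """Expected cost of skipping PM (assumed form): failure costs times failure probability."""
    return (c_c + c_os) * p_fail


def plan_maintenance(inspections, p_fail, c_p, c_c, c_os, lead_time):
    """Return (t_maintenance, t_order), or None if PM is never the cheaper option.

    inspections: future inspection moments h, in increasing order
    p_fail:      predicted failure probability for each period [h, h + delta_T)
    """
    for h, p in zip(inspections, p_fail):
        if c_p <= no_pm_cost(p, c_c, c_os):
            return h, h - lead_time
    return None
```

With illustrative values `c_p=100`, `c_c=500`, `c_os=100` and a lead time of 20 cycles, probabilities of 0.05, 0.1 and 0.6 at inspections 40, 50 and 60 give no-PM costs of about 30, 60 and 360, so PM is planned at cycle 60 and the order is placed at cycle 40.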

Implementation and performance evaluation
With the historical condition monitoring data and the real-time condition monitoring data of the system, the optimal preventive maintenance and ordering moments are obtained online according to the following procedure:
(1) Obtain crucial degradation features according to the correlation and trend criteria;
(2) Obtain future degradation trends by feeding the crucial degradation features into the network in Algorithm 1;
(3) Obtain failure probabilities at different future moments by feeding the predicted degradation trends into the network in Algorithm 2;
(4) Calculate the expected-PM cost and no-PM cost according to Eq. (7);
(5) Obtain the optimal maintenance time $t^*_{maintenance}$ and the optimal ordering moment $t^*_{order}$ according to Eq. (8) and Eq. (9).
To evaluate the maintenance strategy, the maintenance cost rate (MCR) [12] is considered. It is defined as the ratio between the total maintenance cost and the total life cycle duration. The strategy with the lower MCR is considered to have better performance. It is worth noting that there are two possible scenarios in real-world maintenance activities.
If the scheduled preventive maintenance moment is ahead of the actual failure moment of the system, the preventive maintenance activities will be performed. In this case, the spare parts arrive in time thanks to the scheduled order moment. The corresponding MCR with no system failure is denoted by $MCR_p$. Conversely, if the system fails before the scheduled preventive maintenance moment, corrective maintenance has to be carried out. In this case, no spare parts are available, and both the corrective maintenance cost $C_c$ and the out-of-stock cost $C_{os}$ have to be paid. The MCR with system failure is denoted by $MCR_c$; its definition involves $T_F$, the actual failure moment of the system, and $[x]^+$, the smallest integer greater than or equal to a real number $x$.
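The two scenarios can be sketched as below. The exact expressions are not reproduced in this text, so the forms used here are assumptions built only on the stated definition (total maintenance cost divided by total life cycle duration): $MCR_p$ is taken as $C_p$ divided by the maintenance moment, and $MCR_c$ is taken as $(C_c + C_{os})$ divided by the first inspection moment at or after the failure, i.e., $[T_F / \Delta T]^+ \cdot \Delta T$. Function names are hypothetical.

```python
from math import ceil


def mcr_preventive(c_p, t_maintenance):
    """Cost rate when PM is performed before failure (assumed form)."""
    return c_p / t_maintenance


def mcr_corrective(c_c, c_os, t_failure, delta_t):
    """Cost rate when the system fails first (assumed form).

    The life cycle is assumed to end at the first inspection at or after the failure.
    """
    duration = ceil(t_failure / delta_t) * delta_t
    return (c_c + c_os) / duration


def maintenance_cost_rate(t_maintenance, t_failure, c_p, c_c, c_os, delta_t):
    """Pick the scenario: PM applies only if it is scheduled before the failure."""
    if t_maintenance is not None and t_maintenance < t_failure:
        return mcr_preventive(c_p, t_maintenance)
    return mcr_corrective(c_c, c_os, t_failure, delta_t)
```

Under these assumptions, scheduling PM earlier than necessary shortens the life cycle in the denominator and raises the MCR, while scheduling it too late triggers the much larger corrective and out-of-stock costs.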

Data description
To verify the feasibility and effectiveness of the proposed maintenance strategy, the Turbofan Engine Degradation Simulation Data Set [11] provided by the NASA Ames Prognostics Data Repository is used. The data set is generated by the C-MAPSS tool, which simulates the degradation process of the main components of turbofan engines, e.g., the fan, low-pressure compressor (LPC), high-pressure compressor (HPC), high-pressure turbine (HPT) and low-pressure turbine (LPT). Twenty-one sensors are installed inside the engine to monitor its condition. The first nine sets of data are obtained by direct measurement (sensors #1 to #9), while the remaining data are obtained by soft measurement (sensors #10 to #21) [17].
In the experiment, the data set "FD001", which describes the gradual degradation process of the HPC under a constant working condition, is selected to demonstrate the proposed maintenance strategy. The data set contains "train_FD001.txt", composed of 100 complete run-to-failure records $X(100 \times 21 \times K)$, "test_FD001.txt", composed of 100 incomplete run-to-failure records $X'(100 \times 21 \times K')$ ($31 \leq K' \leq 303$), and "RUL_FD001.txt", which provides the actual RUL information. Fig. 6 shows part of the results of degradation feature selection. For sensor #1 (see Fig. 6(a)), the correlation indicator in each engine training sample is always 0, which means that the monitored variable remains constant during engine operation. Such a variable contributes nothing to the system failure prognosis and should be eliminated. For sensor #4 (see Fig. 6(b)), the correlation indicator in each training sample is always greater than 0.5, which means that this variable is positively correlated with operating time (flight cycle). In addition, its trend indicator value is 1, indicating a monotonic upward trend. Thus, sensor #4 is retained. Sensors #9 and #13 (see Fig. 6(c) and (d)) are not proper degradation features either, since their correlation indicators are not consistently positive or negative across samples. Finally, only seven sensors are selected, i.e., sensors #4, #7, #11, #12, #15, #20 and #21. After some experiments, a moving window size of 20 is chosen because it gives the best performance on the test data set. Then, the data are normalized using the Z-score method (see Eq. (2)) so that they have the same means and variances.
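As a hedged illustration, the selected sensors can be extracted from the FD001 text files along the following lines. The column layout is an assumption taken from the public description of the data set rather than from this paper: each whitespace-separated row is taken to hold the engine unit number, the cycle index, three operational settings and then the 21 sensor readings. The function name is hypothetical, and rows are assumed to be ordered by cycle within each engine.

```python
SELECTED_SENSORS = (4, 7, 11, 12, 15, 20, 21)  # sensor numbers retained in the text


def parse_cmapss(lines, sensors=SELECTED_SENSORS):
    """Group rows by engine unit and keep only the selected sensor columns.

    Assumed row layout: unit, cycle, 3 operational settings, 21 sensor readings.
    Returns {unit: [[selected readings at cycle 1], [cycle 2], ...]}.
    """
    engines = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        unit = int(fields[0])
        readings = [float(v) for v in fields[5:26]]  # sensors #1..#21
        engines.setdefault(unit, []).append([readings[s - 1] for s in sensors])
    return engines
```

A typical call would pass an open file handle, e.g. `parse_cmapss(open("train_FD001.txt"))`, and the per-engine lists can then be smoothed and normalized as described above.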

Offline modeling
With the reduced degradation feature data, the next step is to train the degradation prognostic model and the failure prognostic model using LSTM networks. The degradation prognostic model is used to obtain the evolving degradation trends, while the failure prognostic model is used to obtain the failure probabilities at different future moments based on the predicted degradation trends. In the LSTM network, the number of iterations is set to 50, the dropout rate is set to 0.2, the number of units in the first LSTM layer is set to 100 and the number of units in the second LSTM layer is set to 50 [12]. Using Algorithm 1, the degradation prognostic model is built. Fig. 7 shows the offline degradation trend prediction results for training Engines #1, #2 and #3. For all three engines, the offline predicted degradation trend values are very close to the actual degradation trend values. The offline training root-mean-square errors (RMSEs) of the three engines are 0.50, 0.43 and 0.47, respectively, which indicates that the degradation prognostic model has been well built.
Given the inspection interval $\Delta T = 10$, the failure prognostic model can be built based on Algorithm 2. Fig. 8 shows the offline failure probability estimation results for training Engines #1, #2 and #3. The abscissa represents the operation cycle of the engine, while the ordinate values "1" and "2" represent the two categories Deg1 and Deg2, respectively.
For training Engine #1, the predicted cycles of label "2" are cycles 1-185, whose corresponding probabilities satisfy $P(RUL < 10) < 0.5$, while the actual cycles are cycles 1-183. For training Engines #2 and #3, the predicted cycles of label "2" are cycles 1-277 and 1-173, while the actual cycles are cycles 1-278 and 1-170, respectively. These results show that the failure prognostic model has been well built.

Online maintenance scheduling
As an example, the testing Engine #1 is used to illustrate the online prognostics. The online prognostics contain the online degradation trend prediction and online failure probability estimation. Fig. 9 shows the online trend prediction results for testing Engine #1. The condition monitoring data collected up to present are 31 cycles for the testing Engine #1. It can be seen that the conditions of the engine are gradually deteriorating over time.
Next, these predicted trend values are fed into the well-trained failure prognostic model. Fig. 10 shows the online failure probability estimation results for testing Engine #1. As the operation cycle of the engine increases, the failure probability increases. Beyond Cycle 133, the failure probability stabilizes at a high value (0.8278). Note that the predicted failure probability first crosses 0.5 at Cycle 128, indicating that the engine is predicted to survive for only 10 more cycles. Thus, the estimated end of life (EOL) of testing Engine #1 is Cycle 138, while the actual EOL is Cycle 143 according to "RUL_FD001.txt". This indicates that the failure prognosis is reasonably accurate.
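The EOL arithmetic above (128 + 10 = 138) can be written as a small sketch. The function name is hypothetical, and the rule is an interpretation of the text: the first cycle at which $P(RUL < \Delta T)$ reaches 0.5 is taken as the point where the remaining life drops to $\Delta T$ cycles.

```python
def estimate_eol(cycles, p_fail, delta_t, threshold=0.5):
    """Estimated end of life: first cycle whose failure probability reaches
    the threshold, plus the time window delta_t. Returns None if never reached."""
    for c, p in zip(cycles, p_fail):
        if p >= threshold:
            return c + delta_t
    return None
```

With a probability curve that first reaches 0.5 at Cycle 128 and $\Delta T = 10$, the function returns 138, which is 5 cycles earlier than the actual EOL of Cycle 143 reported for testing Engine #1.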
Suppose that the preventive maintenance cost of the aero-engine is given. According to Eq. (7), the expected-PM cost and no-PM cost can be calculated, as shown in Table 1. Before the 129th cycle, the expected-PM cost is higher than the no-PM cost, while at the 129th cycle, the expected-PM cost is lower than the no-PM cost. Hence, theoretically, the optimal maintenance moment is the 129th cycle. However, in practice, maintenance activities can be carried out only at the inspection moments, so the actual maintenance activities will be taken at the 120th cycle. If the logistics service department requires a lead time of 20 cycles for ordering the spare parts, the optimal order moment will be the 100th cycle.
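The scheduling arithmetic in this example can be checked with a short sketch. It assumes that inspection moments fall on multiples of $\Delta T$ and that maintenance is moved to the last inspection moment not later than the theoretical optimum; the function name is hypothetical.

```python
def schedule_on_inspection_grid(crossover_cycle, delta_t, lead_time):
    """Snap the theoretical optimum down to an inspection moment
    (assumed to be a multiple of delta_t) and subtract the ordering lead time."""
    t_maintenance = (crossover_cycle // delta_t) * delta_t
    t_order = t_maintenance - lead_time
    return t_maintenance, t_order
```

With a theoretical optimum at Cycle 129, $\Delta T = 10$ and a lead time of 20 cycles, this gives maintenance at Cycle 120 and ordering at Cycle 100, matching the values stated above.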

Comparative results and discussion
In this section, the proposed maintenance strategy is compared with three benchmark maintenance strategies [12]: the original dynamic PdM strategy, the classical periodic maintenance (PeM) strategy and the ideal predicted maintenance (IPM) strategy. Note that the original dynamic PdM strategy focuses on instantaneous decision-making, while the PeM and IPM strategies can handle the long-term decision-making problem.
Firstly, the original dynamic PdM strategy is compared with the enhanced one. Table 2 lists the decision-making results of the original PdM and the enhanced PdM. For the PdM strategy presented in [12], the decision is that no maintenance and no ordering of spare parts are carried out at Cycle 31 (the current cycle). This strategy thus provides only an instantaneous decision. For the enhanced PdM strategy (the method of this paper), the scheduled maintenance time is Cycle 120 and the ordering time of spare parts is Cycle 100.
Given that the actual failure time is Cycle 143, the planned maintenance time and the ordering time of spare parts are reasonable. The enhanced PdM strategy gives the exact time when the system must undergo preventive maintenance, which helps to plan inventory and production activities in advance.
Secondly, the PeM strategy and the IPM strategy are compared with the proposed strategy. Considering that the PeM and IPM strategies are also aimed at long-term decision-making, we use the maintenance cost rate (MCR) presented in Section 2.4 to illustrate the superiority of the proposed strategy. Testing Engines #1-20 are taken as an example. Fig. 11 shows the MCRs of the three maintenance strategies for testing Engines #1-20. Compared with the PeM strategy, the MCRs of the proposed maintenance strategy are lower in most engine instances. This can be explained by the fact that, to ensure engine safety, the PeM strategy is relatively conservative, resulting in excessive maintenance and poor economic efficiency. As for the IPM strategy, perfect prediction information is only an ideal hypothesis that cannot be attained in practice. From the figure, the MCRs of the proposed maintenance strategy are close to those of the IPM strategy with perfect predictions. More specifically, the average MCRs of the three maintenance strategies are 1.9513 for the PeM strategy, 1.1515 for the enhanced PdM strategy, and 0.5270 for the IPM strategy. These results show that the proposed enhanced PdM strategy works well and significantly reduces the maintenance cost rate relative to periodic maintenance.

Conclusions
As an important input of maintenance activities, the precision of failure prognosis directly affects the effectiveness of maintenance strategy formulation. Therefore, from the perspective of engineering applications, data-based failure prognosis needs to be considered jointly with maintenance decision-making to ensure system safety and reliability. In this work, an enhanced data-driven predictive maintenance strategy has been developed. It provides a complete solution from failure prognosis to maintenance decision-making. The proposed strategy can obtain effective features reflecting the degradation trends. Also, it can achieve accurate failure prognostics and provide the failure probabilities at different future moments. In particular, the proposed strategy overcomes the limitation of instantaneous decision-making and gives the exact time when the system must undergo preventive maintenance.
The verification results using the NASA data repository demonstrate the feasibility and effectiveness of the proposed maintenance strategy. The performance of the proposed strategy is highlighted when compared with the decision-making results of the emerging dynamic predictive maintenance, the classical periodic maintenance and the ideal predicted maintenance. However, one limitation of the proposed strategy is that only perfect maintenance is considered. Further work will focus on the investigation of imperfect maintenance with different levels. Another goal is to develop flexible maintenance strategies by estimating the residence time of different health states.