Spatio-temporal graph convolutional neural network for remaining useful life estimation of aircraft engines

Accurate remaining useful life (RUL) estimation is crucial for the maintenance of complex systems, e.g. aircraft engines. Thanks to the proliferation of sensors, data-driven methods, especially deep learning approaches, are widely used to evaluate the RULs of systems. Though remarkably capable at non-linear modeling, deep learning-based prognostics techniques lack powerful spatio-temporal learning ability. For instance, convolutional neural networks are restricted to grid structures rather than general domains, while recurrent neural networks neglect spatial relations between sensors and struggle with long-term dependency learning. To solve these problems, we construct a graph structure on the sensor network using Pearson correlation coefficients among sensors and propose a method that combines the spatial learning power of graph convolutional networks with the sequence learning success of temporal convolutional networks. We evaluate the proposed method on the aircraft engine dataset provided by NASA. The experimental results demonstrate that the established graph structure is appropriate and that the proposed approach models spatio-temporal dependency accurately and improves the performance of RUL estimation.


Introduction
In view of the major consequences and huge expense usually associated with system failures, prognostics and health management (PHM) has received increasing attention in academia and industry. As a challenging task in PHM, prognostics aims to estimate how much time remains before a likely failure, in other words the remaining useful life (RUL), so as to promote reliability and safety. Methods to evaluate RUL fall into three major categories: model-based, experience-based and data-driven methods [1]. The first requires extensive expert knowledge about the system to build a physical model [2][3][4], which is usually unavailable in practice, e.g. for aircraft engines. Experience-based methods leverage historical failure cases to build hidden relationships among current system states, current lives and recorded failure instances. However, methods in this category, e.g. instance-based methods [5,6], require more computation as reference cases accumulate and depend heavily on run-to-failure cases.
Data-driven methods, especially deep learning techniques, have emerged as a research hotspot with the availability of sensor data from numerous systems. Clipped sensor readings and their corresponding RULs are used as inputs to train a deep neural network, and the network can discover intricate relations between them with its powerful non-linear modeling capability. For example, Babu et al. [7] treated segmented sensor readings as 2-dimensional (2D) inputs and utilized 2D convolution to extract more informative features. Li et al. [8] stacked 1-dimensional (1D) convolution layers to extract features from each sensor signal and predicted RUL with a fully connected layer. Malhi et al. [9] proposed a learning-based method for long-term prognostics of rolling bearings with a recurrent neural network (RNN). Wu et al. [10] combined a long short-term memory (LSTM) based model with a dynamic difference method to remove the influence of operating conditions. However, the methods mentioned above neglect spatial relationships between different sensors, and RNN-based structures tend to accumulate errors step by step and have difficulty capturing long-term dependencies. In this work, we adopt two crucial strategies to solve these two problems. First, we compute the weighted adjacency matrix of the sensor network with Pearson correlation coefficients (PCCs) among sensors. Second, graph convolutional networks (GCNs) and temporal convolutional networks (TCNs) constituted by stacked dilated causal convolutions work together to capture spatio-temporal dependencies, aided by a gating mechanism and skip connections.
The rest of the paper is organized as follows. Section 2 introduces the related work, including convolution on graph structure and spatio-temporal graph neural networks. Section 3 presents how to build graph structure of sensor network via PCCs and introduces the proposed framework. Section 4 shows the experimental results on C-MAPSS dataset. The last section draws the conclusion and discusses the future work.

Convolution on graph structure
Convolutions in the graph domain can be categorized as spatial approaches and spectral approaches [11]. The former defines convolutions directly on the graph, which makes it challenging to deal with differently sized neighborhoods [12][13][14]. Li et al. [12] model spatial dependency with a diffusion process:

$$x \star g_\theta = \sum_{k=0}^{K-1}\left(\theta_{k,1} A_1^{k} + \theta_{k,2} A_2^{k}\right)x, \tag{1}$$

where $x \in \mathbb{R}^N$ is the input; $g_\theta$ is a filter parameterized by $\theta \in \mathbb{R}^{K \times 2}$; $A_1$ and $A_2$ are the transition matrices of the diffusion process and its reverse; $k$ is the diffusion step, and in [12] a finite $K$-step truncation of the diffusion process is adopted. The latter operates in the Fourier domain [15], where the convolution is defined as:

$$g_\theta \star x = U g_\theta(\Lambda) U^{\top} x, \tag{2}$$

where $g_\theta$ is a filter parameterized by $\theta \in \mathbb{R}^N$; $x \in \mathbb{R}^N$ is the signal; $L = I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^{\top}$ is the normalized graph Laplacian ($D$ is the degree matrix and $A$ is the adjacency matrix of the graph); $\Lambda$ is the diagonal matrix of eigenvalues of $L$; $U$ is the matrix of eigenvectors of $L$.
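As an illustrative sketch (not code from the paper), Eq. (2) can be evaluated directly by eigendecomposing the normalized Laplacian; the function name and the NumPy formulation are our own:

```python
import numpy as np

def spectral_graph_conv(x, A, theta):
    """Spectral graph convolution g_theta * x = U g_theta(Lambda) U^T x.

    x     : (N,) graph signal
    A     : (N, N) symmetric, non-negative adjacency matrix
    theta : (N,) filter parameters in the spectral domain
    """
    N = A.shape[0]
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized graph Laplacian
    _, U = np.linalg.eigh(L)                      # eigenvectors of L (columns)
    # filter in the spectral (Fourier) domain, then transform back
    return U @ (theta * (U.T @ x))
```

With an all-ones filter the operation reduces to $U U^{\top} x = x$, a quick sanity check that the transform pair is correct. The full eigendecomposition is what makes this form expensive for large graphs, motivating the approximations below.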
To reduce the high computational cost of Eq. (2), researchers have proposed several methods. Hammond et al. [16] approximate $g_\theta(\Lambda)$ with a truncated expansion in terms of Chebyshev polynomials up to the $K$-th order:

$$g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda}), \tag{3}$$

where $\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I_N$ and $T_k$ is the Chebyshev polynomial of order $k$. Further efforts to overcome overfitting and numerical instabilities are made in [17], where a first-order approximation of the Chebyshev expansion and a renormalization trick are introduced:

$$g_\theta \star x \approx \theta\, \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} x, \tag{4}$$

where $\theta = \theta_0 = -\theta_1$ ($\theta_0$ and $\theta_1$ being the two free parameters of the first-order expansion); $\tilde{A} = A + I_N$ is the adjacency matrix with self-connections, $I_N$ is the identity matrix; and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.
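The first-order propagation rule of Eq. (4) is cheap enough to write out directly. Below is a minimal NumPy sketch of one such layer; following [17], we use a matrix-valued weight W in place of the single scalar θ (the function name is ours):

```python
import numpy as np

def gcn_layer(X, A, W):
    """First-order GCN propagation (Eq. 4): Z = D̃^{-1/2} Ã D̃^{-1/2} X W.

    X : (N, F) node features; A : (N, N) adjacency; W : (F, F') weights.
    """
    N = A.shape[0]
    A_tilde = A + np.eye(N)                     # add self-connections: Ã = A + I_N
    d = A_tilde.sum(axis=1)                     # D̃_ii = Σ_j Ã_ij  (>= 1, so safe)
    D_inv_sqrt = np.diag(d ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # renormalized adjacency
    return A_hat @ X @ W
```

Note that for a graph with no edges the renormalized adjacency collapses to the identity and the layer reduces to an ordinary linear map, which matches the intuition that without edges no neighbor information is aggregated.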

Spatio-temporal graph neural networks
Spatio-temporal graph neural networks have a wide range of applications, e.g. traffic forecasting, action forecasting and wind speed forecasting [18][19][20][21][22]. In these tasks, the key is to determine the optimal combinations of spatial information and temporal dynamics under specific settings. Li et al. [12] replaced the matrix multiplications in gated recurrent units (GRU) with diffusion convolutions to extract spatio-temporal features of traffic flow, which was modeled as a directed graph. Seo et al. [19] combined convolutional neural network (CNN) on graphs to identify spatial structures and RNN to find dynamic patterns for predicting structured sequences of data, e.g. frames in videos.
However, RNN-based approaches are inefficient for long sequences and are more prone to exploding gradients when combined with GCN [20]. Yu et al. [21] computed the distances among stations as the adjacency matrix of the road graph and then used fully convolutional structures to extract spatio-temporal features of the traffic network. Yan et al. [22] designed a generic representation of skeleton sequences and extended graph neural networks with 1D CNNs to model dynamic skeletons effectively. Zhang et al. [23] used adaptive graph convolution to learn spatio-temporal features for RUL prediction. Different from their work, we explore a different graph convolution process and a different network structure.

RUL estimation based on spatio-temporal graph convolutional network
In this work, the graph structure of the sensor network is constructed with PCCs, and a method combining GCNs and TCNs is proposed to estimate RUL. Figure 1 shows the structure of our model, in which spatio-temporal convolutional blocks (ST-Conv blocks) are the crucial components. Each ST-Conv block consists of two GCNs and two TCNs followed by a gating mechanism that fuses spatial and temporal features. Residual connections are applied within blocks and skip connections are used throughout the network to speed up convergence. A fully connected layer produces the final RUL estimate.

Problem definition
Generally, RUL estimation is a typical time-series prediction problem: given $T$ running observations, predict that the system will fail after $\hat{y}_t$ time steps:

$$\hat{y}_t = f(x_{t-T+1}, \ldots, x_t), \tag{5}$$

where $\hat{y}_t$ is the estimated RUL and $f(\cdot)$ is a function that maps $T$ running observations to the RUL at time $t$.

Figure 2 shows a simplified diagram of an engine. Various subsystems are assembled together, and the engine's status can be monitored by collecting data from its different parts, such as the pressure at the fan inlet, the HPC stall margin, the temperature at the LPT outlet, and so on. Clearly, each part is affected by the other parts and even by the operating environment, rather than being independent. Considering these external and internal influences, a graph structure suits this situation well, since it can aggregate information from neighboring nodes. In this work, we define the sensor network as a graph in which sensors are nodes, their relations are edges, and the magnitude and sign of their mutual influence define the weighted adjacency matrix. The schematic diagram is shown in Fig. 3.

Fig. 2 Simplified diagram of engine [24]

Combined with the graph structure, the original Eq. (5) can be expanded to:

$$\hat{y}_t = f(x_{t-T+1}, \ldots, x_t;\, G), \tag{6}$$

where $G = (V, E, A)$ is the graph structure of the sensor network; $V$ is the set of nodes (i.e. sensors), with a total of $N$; $E$ is the set of edges; $A \in \mathbb{R}^{N \times N}$ is the weighted adjacency matrix encoding the influence each node receives from the others.
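To make the formulation concrete, a hypothetical helper can slice one run-to-failure trajectory into the (window of T observations, RUL) pairs that Eq. (5) maps between; the function name and data layout below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def make_windows(readings, T):
    """Slice a run-to-failure sensor series into (window, RUL) training pairs.

    readings : (L, N) array -- L time steps for N sensors, last row = failure.
    T        : window length (number of past observations per sample).
    Returns X of shape (L-T+1, T, N) and y of shape (L-T+1,), where y[i] is
    the number of time steps remaining after the last step of window i.
    """
    L = readings.shape[0]
    X = np.stack([readings[i:i + T] for i in range(L - T + 1)])
    # the last row of window i is time step i+T-1; failure is at step L-1
    y = np.array([L - 1 - (i + T - 1) for i in range(L - T + 1)], dtype=float)
    return X, y
```

For a 5-step trajectory and T = 3 this yields three samples with RUL labels 2, 1 and 0, the last window ending exactly at failure.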

Graph convolution for spatial information
Different from traffic forecasting, where the adjacency matrix can be computed from the Euclidean distances among stations, or action recognition, where the intra-body connections of joints can be represented according to partitioning strategies, the relationships between sensors are not provided explicitly in RUL estimation tasks. Since the proximity of two sensors in the Euclidean domain does not mean they are closely related, Euclidean distance is not the optimal metric for relationship modeling. Hence, we construct the weighted adjacency matrix based on PCCs among sensors. The PCC of two sequences $X$ and $Y$ is:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y}, \tag{7}$$

where $\operatorname{cov}(X, Y)$ is the covariance of $X$ and $Y$; $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$; $\mu_X$ and $\mu_Y$ are the means of $X$ and $Y$; $\rho_{X,Y}$ is a number between −1 and 1 indicating the extent to which the two variables are related. The weighted adjacency matrix is then defined as:

$$A_{ij} = \rho_{X_i, X_j}, \tag{8}$$

where $X_i$ and $X_j$ are the reading sequences of sensors $i$ and $j$. Based on the research of Li et al. [12] and Eq. (4), the GCN on sensor readings can be represented as:

$$Z = \sum_{k=0}^{K} P^{k} X W_k, \tag{9}$$

where $P_{ij} = A_{ij} / \sum_j A_{ij}$ is the transition matrix and $W_k$ are trainable weights.

Fig. 3 Graph structure of sensor network
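A minimal sketch of this adjacency construction, assuming the raw PCC is used directly as the edge weight (keeping its sign, as the text suggests); `np.corrcoef` computes the PCC above for all sensor pairs at once, and the zeroed diagonal is our choice for presentation:

```python
import numpy as np

def pcc_adjacency(signals):
    """Weighted adjacency matrix of the sensor graph from Pearson
    correlation coefficients (a sketch of the construction in the text).

    signals : (T, N) array -- T time steps for N sensor channels.
    Returns A (N, N) with A[i, j] = rho(sensor_i, sensor_j), zero diagonal.
    """
    A = np.corrcoef(signals, rowvar=False)  # pairwise PCCs in [-1, 1]
    np.fill_diagonal(A, 0.0)                # self-loops handled later via Ã = A + I
    return A
```

The resulting matrix is symmetric, and two perfectly linearly related channels receive an edge weight of 1 regardless of their scale or offset, which is exactly why PCC is preferable to Euclidean distance here.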

Spatio-temporal convolutional block
The ST-Conv block takes advantage of CNNs and RNNs while avoiding their disadvantages. Due to the defects of RNN-based structures, we choose dilated causal convolution [25] as the temporal convolutional layer to extract temporal dynamics. Dilated causal convolution can learn long-term dependencies in a non-recursive manner and increases the receptive field without greatly increasing the computational cost or stacking many layers [26]. The dilated causal convolution is formulated as:

$$x \star f(t) = \sum_{k=0}^{K-1} f(k)\, x(t - d \times k), \tag{10}$$

where $x \in \mathbb{R}^T$ is a 1D sequence input; $f(\cdot) \in \mathbb{R}^K$ is a filter; $d$ is the dilation factor, which controls the skipping distance. LSTM outperforms the basic RNN in sequence modeling thanks to its gating mechanism, which effectively controls the flow of information. Learning from the success of LSTM, and to enrich convolution operations with nonlinearity, 1D dilated causal convolution and a gating mechanism work together to extract features along the time axis [27]. Furthermore, to fuse information from the spatial and temporal domains simultaneously, we embed the GCN into the ST-Conv block. The ST-Conv block can thus be formulated as:

$$h = \tanh\left(\theta_{1,1} \star Z + \theta_{1,2} \star X\right) \odot \sigma\left(\theta_{2,1} \star Z + \theta_{2,2} \star X\right), \tag{11}$$

where $\odot$ is the Hadamard product; $\sigma(\cdot)$ is the sigmoid activation function; $\theta_{1,1}$, $\theta_{1,2}$, $\theta_{2,1}$ and $\theta_{2,2}$ are convolution parameters; $Z$ is the output of the GCN; $X$ is the output of the TCN.
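The temporal part of the block can be sketched in a few lines. The toy NumPy version below implements the dilated causal convolution of Eq. (10) with zero padding, plus the tanh/sigmoid gating described above; it is a single-channel illustration under our own naming, not the paper's multi-channel implementation:

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """1-D dilated causal convolution: y[t] = sum_k f[k] * x[t - d*k].
    Terms with t - d*k < 0 are zero-padded, so output length equals len(x);
    y[t] never depends on future inputs (causality)."""
    T, K = len(x), len(f)
    y = np.zeros(T)
    for t in range(T):
        for k in range(K):
            if t - d * k >= 0:
                y[t] += f[k] * x[t - d * k]
    return y

def gated_tcn(x, f1, f2, d):
    """Gated temporal convolution: a tanh branch modulated by a sigmoid gate."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return np.tanh(dilated_causal_conv(x, f1, d)) * \
        sigmoid(dilated_causal_conv(x, f2, d))
```

With filter length K and dilation d, each output sees (K − 1) · d + 1 past steps, so stacking layers with exponentially growing d enlarges the receptive field without a matching growth in depth or cost, as noted above.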

Dataset description
Saxena et al. [24] modeled a series of turbofan engines with the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) and recorded their run-to-failure data as the C-MAPSS dataset. As shown in Table 1, the dataset consists of four subsets (FD001–FD004): FD001 and FD003 are collected under a single operating condition, while FD002 and FD004 involve multiple operating conditions.

Data preprocess
In practical applications, the degradation of a system tends to be negligible in the initial stage and accelerates as the system approaches failure. Hence, we use a piece-wise linear degradation model to obtain the RUL label of each sample, whose maximum RUL is set to 130 following the research of Zheng et al. [28].
Though each observation is 26-dimensional, some channels provide no valuable information about degradation, i.e. the records of sensors #1, #5, #10, #16, #18, #19 and setting3. Each selected channel is normalized by z-score normalization:

$$x' = \frac{x - \mu}{\sigma}, \tag{12}$$

where $\mu$ is the mean and $\sigma$ is the corresponding standard deviation of the channel.
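The piece-wise linear labeling and z-score normalization described above can be sketched as follows (function names are ours; the 130-cycle cap follows [28]):

```python
import numpy as np

def piecewise_rul(cycle_count, max_rul=130):
    """Piece-wise linear RUL labels for one run-to-failure trajectory:
    constant at max_rul in the early phase, then decreasing linearly
    to 0 at the failure cycle."""
    linear = np.arange(cycle_count - 1, -1, -1, dtype=float)  # L-1, ..., 1, 0
    return np.minimum(linear, max_rul)

def z_score(x):
    """Z-score normalization of one sensor channel (Eq. 12)."""
    return (x - x.mean()) / x.std()
```

The cap reflects the assumption that early-life sensor readings carry no information distinguishing, say, 200 from 150 remaining cycles, so all such samples share the same label.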
In RUL estimation, predicting failure early is better than predicting it late. A late prediction may lead to accidents because condition-based maintenance can no longer be performed in time, while an early failure warning does not pose a life-threatening risk. Hence, an unbalanced evaluation indicator, the average score (AS), is employed to penalize late predictions:

$$AS = \frac{1}{n} \sum_{i=1}^{n} s_i, \tag{13}$$

where $n$ is the number of test samples and $s_i$ is the score of the $i$-th sample:

$$s_i = \begin{cases} e^{d_i / a_1} - 1, & d_i \ge 0 \\ e^{-d_i / a_2} - 1, & d_i < 0 \end{cases} \tag{14}$$

where $d_i = RUL_i^{true} - RUL_i^{pre}$ is the RUL estimation error of the $i$-th engine, and $a_1 = 13$ and $a_2 = 10$ are the default factors for the two cases.
We can see from Eq. (14) that if there are instances whose $|d_i|$ is large enough, the score grows exponentially. In addition, we adopt a symmetric evaluation index, the root mean square error (RMSE):

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d_i^{2}}, \tag{15}$$

where $d_i$ is the estimation error of the $i$-th engine and $n$ is the number of test samples.
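Both metrics are straightforward to implement. The sketch below follows the AS and RMSE definitions above, with the error taken as true minus predicted RUL so that late predictions fall in the more heavily penalized branch:

```python
import numpy as np

def average_score(rul_true, rul_pred, a1=13.0, a2=10.0):
    """Asymmetric average score: late predictions (d < 0) are penalized
    more heavily (exponent 1/a2) than early ones (exponent 1/a1)."""
    d = np.asarray(rul_true, float) - np.asarray(rul_pred, float)
    s = np.where(d >= 0, np.exp(d / a1) - 1.0, np.exp(-d / a2) - 1.0)
    return float(s.mean())

def rmse(rul_true, rul_pred):
    """Root mean square error of the RUL estimates."""
    d = np.asarray(rul_true, float) - np.asarray(rul_pred, float)
    return float(np.sqrt((d ** 2).mean()))
```

Note the asymmetry: an early error of 13 cycles and a late error of only 10 cycles produce the same score contribution of e − 1, which is exactly the bias toward early warnings described above.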

Experiment results
To show the effectiveness of our approach, we compare its performance with other methods conducted under the same data division and the same evaluation metrics. The Adam optimization algorithm and the RMSE loss function are used during training. The comparison results are shown in Table 2. Our proposed method achieves competitive performance in both RMSE and AS, especially on the subsets with more complex operating conditions. Compared with previous approaches, whose AS on FD002 and FD004 is more than fifteen times their score in a single-condition environment, our method reaches a much more similar level across different environments. This shows that the GCN is effective at aggregating spatial information from different orders of neighboring nodes. Figure 4 shows RUL prediction samples for the four subsets. Figure 5 shows the weighted adjacency matrices of FD001 and FD002 based on PCCs. Since the data in FD001 are collected under a single operating condition, the 0th and 1st observations (i.e. setting1 and setting2) are unrelated to the other sensor readings, while in FD002 the sensors are influenced by them. Thus, the weighted adjacency matrices not only identify the relations among running data but also model the influence of operating conditions.

Analysis of the effectiveness of GCN
To further prove the effectiveness of our approach, we compare our method with its non-GCN version. Figure 6 shows their performance differences with respect to RMSE and AS.
We can see that the non-GCN method performs slightly worse in RMSE and AS on FD001 and FD003 but much worse on FD002 and FD004, where the environments are more complex. This demonstrates that our method can cope with multiple operating conditions and that RUL estimation tasks benefit from spatial information.

Effects of sequence length on performance
To explore the effect of sequence length on model performance, we take the FD001 dataset as an example. Figure 7 shows the results: the histogram represents RMSE and the line represents average training time. It can be observed that the average training time tends to increase with the growth of sequence length.

Conclusion
This paper constructs a suitable graph structure for the sensor network and proposes an effective method to estimate RUL, combining GCN and gated TCN to fuse spatial and temporal features. Experiments on the C-MAPSS dataset show that our method extracts spatio-temporal features adequately and improves the accuracy of RUL estimation. Analysis of the weighted adjacency matrix indicates that the constructed graph is qualified for multi-environment tasks. Furthermore, the effects of sequence length on prognostic performance are investigated.
Further improvements are still possible. For instance, a more precise adjacency matrix and a more effective spatio-temporal structure deserve further exploration.