A Spatio-Temporal Attention Graph Convolutional Networks for Sea Surface Temperature Prediction

Sea surface temperature (SST) is an important index for detecting ocean changes, predicting SST anomalies, and preventing natural disasters caused by abnormal changes; its dynamic variation has a profound impact on the whole marine ecosystem and on climate change. To better capture the dynamic changes of ocean temperature, it is essential to predict future SST. This paper proposes a new spatio-temporal attention graph convolutional network (STAGCN) for SST prediction, which captures spatial dependence and temporal correlation by integrating a gated recurrent unit (GRU) model with a graph convolutional network (GCN) and introducing an attention mechanism. The STAGCN model adopts the GCN to learn the topological structure between ocean location points and thereby extract spatial characteristics from the network of ocean position nodes. To capture temporal correlation, a GRU model learns the dynamic variation of the SST time series and handles prediction over long time series; its input is the SST data with spatial characteristics produced by the GCN. To capture the significance of SST information at different times and increase forecast accuracy, the attention mechanism is used to obtain the spatial and temporal characteristics globally. The proposed STAGCN model was trained and tested on the East China Sea. Experiments with different prediction lengths show that the model can capture the spatio-temporal correlation of regional-scale SST series and almost uniformly outperforms other classical models across different sea areas and prediction horizons, with the root mean square error reduced by about 0.2 compared with the LSTM model.


I. Introduction
The dynamics of the oceans, which cover about two-thirds of the planet, have an extremely important impact on climate, marine ecology, and the lives of the people around them. Sea surface temperature (SST) is an important index for detecting ocean changes, predicting SST anomalies, and preventing natural disasters caused by abnormal changes. SST also plays an indispensable role in ocean-atmosphere interaction, that is, the exchange of matter, energy, and momentum between the ocean and the atmosphere [1] [2] [3]. As a result, changes in SST have far-reaching impacts on global climate and marine ecosystems [4] [5] [6] [7]. SST prediction also has implications for ocean-related applications such as weather forecasting, fisheries, and marine environmental protection. It is therefore necessary to predict the future dynamic changes of SST, both to help people identify and prevent severe weather events such as drought in advance [8] [9] and to support scientific research and applications [10]. However, because SST is influenced by many complex factors, such as sea surface heat flow, radiation, and solar wind, its prediction is uncertain and challenging.
In recent years, SST prediction methods have been widely applied and have attracted much attention, becoming an active field of marine research. Three kinds of methods are generally used to predict SST: the numerical method based on a mathematical model, the data-driven method that uses historical data to predict future SST, and the method combining the two [11]. Numerical methods generally use kinetic and thermodynamic equations to describe the dynamic changes of SST and then solve a series of differential equations, which is difficult because the equations are usually sophisticated and require a large amount of computation. The data-driven method predicts the future SST value from the perspective of data: it builds a model by learning relationships and patterns from historical SST observations and then uses the learned model to approximate future SST. The data-driven method is less complicated than the numerical method and is suitable for SST prediction in high-resolution areas. It draws mainly on statistical data analysis, machine learning, and artificial intelligence algorithms. Among them, statistical data analysis techniques primarily include the Markov model [12] [13], empirical canonical correlation analysis [14], and regression models [15] [16]. Classical machine learning methods, including linear regression, support vector machine (SVM) [17], nonlinear regression models [18], and artificial neural networks [19], are also used to forecast future SST. SVM is a generalized linear classifier that classifies data based on supervised learning. The particle swarm optimization (PSO) algorithm is a random, parallel, population-based optimization algorithm. These two kinds of artificial intelligence methods are commonly used in SST prediction [17] [20].
The numerical method and the machine learning method can also be combined [11] to better predict SST, but the prediction effect is similar to that of the numerical method alone.
With the continuous development and innovation of deep learning, deep learning methods have been widely used in SST prediction because of their powerful ability to learn and model relationships in data [21] [22] [23] [24]. A recurrent neural network (RNN) can effectively deal with time series prediction problems, but serious vanishing or exploding gradient problems occur when it processes long time series. As a variant of the RNN, the long short-term memory (LSTM) network, with its recurrent structure and gating mechanism, was proposed to solve the long-term dependence problem; it can remember information over longer time series and obtain better prediction results [25]. To simplify the complex structure of the LSTM network, the gated recurrent unit (GRU) [26], which has a simpler gate structure, was proposed. As an improved variant of the LSTM network, the GRU model retains the advantage of LSTM in long-term series memory and also has high computational efficiency, which alleviates network overfitting and underfitting. These models have been widely used in SST prediction. The work in [27] proposed a fully connected LSTM (FC-LSTM), which is composed of an LSTM layer and a fully connected layer. The work in [26] designs an adaptive mechanism based on deep learning and an attention mechanism to predict SST; it uses a GRU encoder-decoder to obtain the static change of SST and applies a dynamic influence link to acquire the dynamic variation, realizing long-term prediction of future SST. The work in [9] proposes an integrated learning model (LSTM-AdaBoost) that combines a deep LSTM neural network with the Adaptive Boosting (AdaBoost) algorithm to predict daily SST in the short and medium term. Feng et al. used a time-domain convolutional network to achieve short-term, small-scale SST prediction in the Indian Ocean [28]. Han et al. used a convolutional neural network to achieve regional prediction of sea surface temperature, sea surface height, and ocean salinity in the Pacific [29].
However, although these SST prediction models have achieved a good prediction effect, they consider only the temporal correlation and ignore the spatial dependency, so they cannot achieve high accuracy in predicting the dynamic changes of SST sequence data. Besides, the association structures constructed when capturing the spatial influence of adjacent nodes on a central node are not always standard grid structures. For example, when missing values exist, a topology based on spatial association can still be constructed. Therefore, to capture spatial dependencies from complex topologies, this paper proposes an original SST prediction method based on a network of ocean location points, named the spatio-temporal attention graph convolutional network (STAGCN). Specifically, the GCN is applied to capture spatial correlation from the topological structure of the ocean positions network. The GRU is used to capture the temporal dependence from the dynamic changes of the SST time series. In addition, the STAGCN model introduces an attention mechanism to learn global correlation and to adjust and integrate the global temporal information of SST, so as to realize accurate SST prediction.
The contributions of this paper can be summarized in the following two aspects:
(1) A STAGCN model is designed to capture the global spatial and temporal dependence simultaneously for accurate SST prediction; it combines the GCN deep learning model with the GRU model and introduces an attention mechanism.
(2) The concept of a graph is applied to the field of SST prediction, and a topological network of ocean location points is constructed so that the GCN model can obtain spatial characteristics from the SST time series.
Using 38 years of satellite time-series data from part of the East China Sea, the experiments show that the STAGCN model achieves better prediction results than the GRU model and the GCN model. This demonstrates that the STAGCN model can capture temporal and spatial correlation from SST series data simultaneously and can achieve a desirable effect for the short-term prediction of future SST.
The rest of the paper is organized as follows. Section II presents the proposed STAGCN model for SST prediction in detail. Section III describes the experiments and analyzes the results. Finally, Section IV concludes the paper.

A. Problem Clarification
In this study, SST prediction means forecasting the sea surface temperature over a certain future period from historical SST time-series information.
An undirected, unweighted graph G = (V, E) is used to describe the topological network composed of oceanic observation points. Each location is a node in the graph, and V = {v_1, v_2, ..., v_N} is the set of all oceanic positions, which correspond to the N vertices of the graph. E represents the correlations between ocean points, which correspond to the edges between nodes and reflect the connections between nodes at different positions. The connections between nodes in the whole graph are represented by the adjacency matrix A, whose numbers of rows and columns both equal the number of nodes. Each entry of A is either 0 or 1: 0 indicates that there is no direct connection between the two nodes, and 1 indicates that there is a link between them.
The feature matrix X of size P*N holds the SST value of each location point in the network of ocean positions, where P is the length of the time dimension of the oceanographic satellite data and N is the number of nodes, and X_t denotes the SST values of the N nodes at time t.
The SST prediction problem can be regarded as finding a mapping function f that maps the SST values over the n historical time steps to the SST values in the next T periods, given the topology graph G of the ocean location points network and the feature matrix X. Equation (1) expresses this prediction process: the left side is the historical SST sequence of length n, from which the model learns the variation trend of SST, and the right side is the predicted future SST sequence of length T under the mapping.
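Equation (1) is not reproduced in this text. Following the description above, with the history on the left and the forecast on the right, one consistent way to write it is the following; the exact notation is an assumption chosen to match the indexing i = t−n, ..., t−1, t used later for the model input:

```latex
f\bigl(G;\,(X_{t-n}, \ldots, X_{t-1}, X_{t})\bigr) = \bigl[X_{t+1}, \ldots, X_{t+T}\bigr] \tag{1}
```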

B. The STAGCN Model
To capture the global dynamic changes of SST information, this paper proposes a novel STAGCN model that combines a graph convolutional network, which captures spatial correlation, and a gated recurrent unit, which obtains temporal dependence, with an attention model. In the STAGCN model, a single GCN layer is used to obtain a better prediction effect, and Equation (2) gives the specific convolution process. In it, f(X, A) is the final output of the GCN model, X is the feature matrix, and A is the adjacency matrix of the graph convolution. ReLU(·) is an activation function that adds nonlinearity to improve the expressive power of the model. Â is the adjacency matrix after renormalization, which avoids gradient explosion. W_0 is the weight matrix to be trained in the graph convolutional network; its first dimension F is the time series length of the ocean data, and its second dimension G is the number of neural units in the output layer.
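The single-layer convolution described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it assumes Equation (2) has the common form f(X, A) = ReLU(Â X W_0) with Â = D̃^(-1/2) (A + I) D̃^(-1/2), and it assumes X is arranged as nodes by time steps (N by F) so that it multiplies with the F by G weight matrix. The function names and the toy graph are hypothetical.

```python
import numpy as np

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, the renormalized adjacency (assumed form)."""
    a_tilde = adj + np.eye(adj.shape[0])      # add self-connections
    deg = a_tilde.sum(axis=1)                 # node degrees, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(x, adj, w0):
    """One graph convolution: ReLU(A_hat @ X @ W0)."""
    a_hat = normalize_adjacency(adj)
    return np.maximum(a_hat @ x @ w0, 0.0)

# Toy example: 4 nodes in a chain, F = 3 time steps, G = 2 output units.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))     # node features (N x F)
w0 = rng.normal(size=(3, 2))    # trainable weights (F x G), random here
out = gcn_layer(x, adj, w0)     # spatially mixed features (N x G)
```

Each output row mixes a node's own series with those of its direct neighbours, which is how the layer injects spatial information before the recurrent step.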
Specifically, the study area is modeled as a graph of ocean location points determined by longitude and latitude, and the topological connections between location points can be captured by the GCN. The shaded area enclosed by the dotted line in Fig. 1 shows a simulated topological graph around one ocean location point. Assume that red dot 1 is the central node in the topology and that the green dots around it (those inside the shaded area) are its adjacent nodes. The degree of interaction between the central node and the adjacent nodes can be obtained by the GCN model. The GCN model thus captures the SST attributes of the topological structure and acquires the spatial correlation of the location points network. The GRU model is adopted to obtain the temporal dependence of the SST data. The attention model is used to identify which moments of SST data are relevant, that is, to distinguish the importance of data at different moments, which improves prediction accuracy and realizes SST prediction based on the structure of the ocean location points graph. The spatio-temporal prediction process of the STAGCN model is shown in Fig. 2, where the distance function (Equation (7)) is applied to calculate the adjacency matrix A_1 that represents the connections between the position points. First, the obtained adjacency matrix and the SST feature data X_i (i = t−n, ..., t−1, t) of the n historical time steps are taken as input, from which the GCN model captures the spatial information. Then, the output of the GCN model is used as the input of the GRU model to obtain the temporal characteristics of the SST data. Equations (3) to (6) give the update gate, reset gate, cell state, and output state at time t in the STAGCN model, respectively.
Here u_t and r_t are the update gate and reset gate, h_{t−1} is the output at the previous time step, h_t is the state output at the present moment t, and c_t represents the information retained from the previous and present moments. The function f_gcn(·) denotes the graph convolution; W_u, W_r, and W_c are the connection weights applied to the output of the graph convolution and the previous output h_{t−1}; and b_u, b_r, and b_c are the corresponding biases. The final hidden state information of the GRU model is then used as the input of the attention model, which obtains the importance of the changes in the SST series data. Finally, the prediction is produced by the fully connected layer. In the attention model, a multi-layer perceptron is used as the scoring function, in which w_i (i = t−n, ..., t−1, t) is the weight matrix of the perceptron. The scores e_i (i = t−n, ..., t−1, t) output by the perceptron are passed to the Softmax function to obtain the attention distribution. The hidden states are then weighted by their attention weights and combined to obtain the final context vector C.
To sum up, the proposed STAGCN model can obtain both the global spatial dependence and the temporal dynamic changes, which leads to a better SST prediction effect. The GCN model captures spatial information by building the structure of the interrelations between position nodes. The GRU model obtains the temporal dependence from the SST series data with spatial characteristics. Moreover, the attention model captures the global variation trends of the SST information, which is important for accurate SST prediction.
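Equations (3) to (6) are not reproduced in this text. The sketch below shows one recurrent step under the assumption that they follow the standard GRU form, with the graph-convolved input f_gcn(A, X_t) concatenated with h_{t−1}: u_t and r_t use a sigmoid, c_t uses tanh on the reset-gated state, and h_t = u_t * h_{t−1} + (1 − u_t) * c_t. The function names, shapes, and random weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(g_t, h_prev, params):
    """One GRU update (assumed standard form).

    g_t:    graph-convolved input f_gcn(A, X_t), shape (N, G)
    h_prev: previous output h_{t-1}, shape (N, H)
    params: (W_u, b_u, W_r, b_r, W_c, b_c), each W of shape (G + H, H)
    """
    w_u, b_u, w_r, b_r, w_c, b_c = params
    zh = np.concatenate([g_t, h_prev], axis=1)
    u_t = sigmoid(zh @ w_u + b_u)                  # update gate
    r_t = sigmoid(zh @ w_r + b_r)                  # reset gate
    zc = np.concatenate([g_t, r_t * h_prev], axis=1)
    c_t = np.tanh(zc @ w_c + b_c)                  # candidate state
    return u_t * h_prev + (1.0 - u_t) * c_t        # new output h_t

# Toy run: 5 nodes, G = 2 graph-conv units, H = 3 hidden units, 4 time steps.
rng = np.random.default_rng(0)
n_nodes, g_dim, h_dim = 5, 2, 3
params = tuple(
    p for _ in range(3)
    for p in (rng.normal(size=(g_dim + h_dim, h_dim)), np.zeros(h_dim))
)
h = np.zeros((n_nodes, h_dim))
for g_t in rng.normal(size=(4, n_nodes, g_dim)):
    h = gru_step(g_t, h, params)
```

With this form, an update gate close to 1 carries the previous state forward unchanged, which is what lets the unit retain information over long SST sequences.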
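The attention step described above (perceptron scores, Softmax, weighted sum into a context vector C) can be sketched as follows. This is a minimal illustration under assumptions: the scoring perceptron is taken to have one tanh hidden layer, and the weight names w1, b1, w2, b2 are hypothetical rather than the paper's notation.

```python
import numpy as np

def softmax(e):
    e = e - e.max()                 # shift for numerical stability
    w = np.exp(e)
    return w / w.sum()

def attention_context(hidden, w1, b1, w2, b2):
    """hidden: (n_steps, d) GRU hidden states, one row per time step.

    Returns the attention weights alpha (n_steps,) and context vector C (d,).
    """
    scores = np.tanh(hidden @ w1 + b1) @ w2 + b2   # e_i, one score per step
    alpha = softmax(scores)                        # attention distribution
    context = alpha @ hidden                       # weighted sum of states
    return alpha, context

# Toy example: 6 time steps, hidden size 3, scoring layer of width 4.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 3))
w1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
w2, b2 = rng.normal(size=4), 0.0
alpha, context = attention_context(hidden, w1, b1, w2, b2)
```

The weights sum to one, so C is a convex combination of the hidden states; time steps with higher scores contribute more to the vector passed on to the fully connected layer.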

A. Research Area and Data
Rich in natural resources, the East China Sea is the confluence of many rivers and covers a wide area, including China's Bohai Sea in the north and the Taiwan Strait in the south; it is a strategic maritime area for China, Japan, South Korea, and other countries. Studying the dynamic changes of SST in the East China Sea is therefore extremely important for national marine transportation and for production and daily life in the surrounding countries. The research area selected in this study is the part of the East China Sea spanning 26.875°N-32.125°N and 123.125°E-127.125°E, which covers most of the East China Sea and contains no land area, as shown in Fig. 3. This choice facilitates the acquisition of grid-based SST data and the construction of the topological structure of the ocean position points graph for SST prediction.
The data used in the experiments were derived from the daily Optimum Interpolation SST product (OISST Daily, version 2.1), with a spatial resolution of 1/4°, from the National Oceanic and Atmospheric Administration (NOAA) platform. The AVHRR-only daily SST data, 13,879 days in total from 1982/01/01 to 2019/12/31 and covering 26.875°N-32.125°N, 123.125°E-127.125°E (most of the East China Sea), are used as the experimental dataset, with a total of 22*17 position points as the study nodes.
The experimental data consist of two parts: the adjacency matrix and the feature matrix. The former is a matrix of size 374*374, which describes the spatial dependence between position points. As given by Equation (7), each value in the adjacency matrix W is derived by scaling the distance between the corresponding pair of position points in the ocean location point network.
Here W_ij is the weight of the edge in the position graph, determined by the distance d_ij between positions i and j. σ^2 and ε are thresholds that control the distribution and sparsity of the adjacency matrix W; they are set to 0.1 and 0.4 respectively after testing in the experiments. The latter part of the experimental data is the feature matrix, which describes the dynamic changes of the SST value over time at the position nodes.
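The dataset dimensions quoted above can be checked with a few lines of arithmetic. The sketch assumes that both date endpoints are inclusive and that grid cell centres sit every 0.25° between the stated bounds.

```python
from datetime import date
import numpy as np

# Number of daily fields from 1982-01-01 to 2019-12-31, both ends inclusive.
n_days = (date(2019, 12, 31) - date(1982, 1, 1)).days + 1

# Grid cell centres at 1/4 degree spacing over the stated bounds.
lats = np.arange(26.875, 32.125 + 1e-9, 0.25)
lons = np.arange(123.125, 127.125 + 1e-9, 0.25)
n_nodes = lats.size * lons.size

print(n_days, lats.size, lons.size, n_nodes)   # 13879 22 17 374
```

The 22 latitude rows and 17 longitude columns give the 374 nodes that set the size of the adjacency matrix.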
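Equation (7) is not reproduced in this text. A thresholded Gaussian kernel is a common choice that matches the description of σ^2 and ε, so the sketch below assumes W_ij = exp(−d_ij^2 / σ^2) when i ≠ j and that value is at least ε, and 0 otherwise. It further assumes that d_ij is the Euclidean distance in degrees between grid points; the paper's actual distance measure may differ, and the function name is hypothetical.

```python
import numpy as np

def gaussian_adjacency(coords, sigma2=0.1, eps=0.4):
    """Thresholded Gaussian kernel weights (assumed form of Equation (7)).

    coords: (N, 2) array of (lat, lon) in degrees.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)          # squared pairwise distances
    w = np.exp(-d2 / sigma2)
    w[w < eps] = 0.0                       # sparsify weak links
    np.fill_diagonal(w, 0.0)               # no self-loops (i != j)
    return w

# 22 x 17 grid at 0.25 degree spacing over the study area.
lats = 26.875 + 0.25 * np.arange(22)
lons = 123.125 + 0.25 * np.arange(17)
grid = np.array([(la, lo) for la in lats for lo in lons])
W = gaussian_adjacency(grid)               # shape (374, 374)
```

Under these assumptions, points 0.25° apart get weight exp(−0.625) ≈ 0.54 and are kept, while diagonal neighbours get exp(−1.25) ≈ 0.29 and are dropped, so each interior node is linked to its four nearest neighbours.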

B. Experiment Setup
To demonstrate the superiority of the proposed STAGCN model in predicting SST, we chose four comparative models: the autoregressive integrated moving average model (ARIMA), the linear support vector regression model (SVR), the graph convolutional network model (GCN), and the gated recurrent unit model (GRU). In the experiments, we divided the dataset into a training set and a test set at a ratio of 8:2. The Adam optimizer is used to train the model. The STAGCN, GCN, and GRU models use TensorFlow 1.5.0 (GPU version) as the runtime environment during training and testing. ARIMA and SVR are implemented using Statsmodels 0.12.2 and Scikit-learn 0.24.1 respectively [31].
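The 8:2 split and the construction of input and target sequences can be sketched as follows. The text does not say how the split is made, so the sketch assumes a chronological split (earliest 80% for training) and sliding windows of n historical steps followed by T future steps; the function names and the toy series are hypothetical.

```python
import numpy as np

def split_train_test(series, train_ratio=0.8):
    """Chronological split of a (time, nodes) array (an assumption)."""
    cut = int(series.shape[0] * train_ratio)
    return series[:cut], series[cut:]

def make_windows(series, n_hist, horizon):
    """Slide over time to build (history, future) pairs.

    Returns inputs of shape (S, n_hist, nodes) and targets (S, horizon, nodes).
    """
    n_samples = series.shape[0] - n_hist - horizon + 1
    xs = np.stack([series[i:i + n_hist] for i in range(n_samples)])
    ys = np.stack([series[i + n_hist:i + n_hist + horizon]
                   for i in range(n_samples)])
    return xs, ys

# Toy series: 100 days, 3 nodes; 12-day history, 7-day forecast.
series = np.arange(300, dtype=float).reshape(100, 3)
train, test = split_train_test(series)
x_train, y_train = make_windows(train, n_hist=12, horizon=7)
```

Splitting before windowing keeps every test window strictly later than the training data, which avoids leaking future values into training.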
The real SST and the predicted SST of the different nodes at time t are denoted by Y_t and Ŷ_t respectively. During network training, the loss function should be minimized as far as possible, so that the predicted SST value at each ocean position node fits the actual SST value more closely. The loss function defined in the STAGCN model has the form loss = ||Y_t − Ŷ_t|| + λL_reg. The first term ||⋅|| is the 2-norm of the difference between the real value and the predicted value, and the second term is the regularization term λL_reg with hyperparameter λ, which improves the prediction performance of the model and prevents overfitting during training. To assess the prediction performance of the STAGCN model, five measurement criteria are used to compare the proposed model with other models: root mean square error (RMSE), mean absolute error (MAE), accuracy (Accuracy), coefficient of determination (R^2), and explained variance score (Var).
Equations (8) to (12) define RMSE, MAE, Accuracy, R^2, and Var respectively, where T is the length of the SST time series and N is the total number of position points, so that TN is the number of recorded temperature values at all ocean locations over a time length of T. y_t^n denotes the real SST value and ŷ_t^n the predicted SST value at position point n at moment t. The entire sets of y_t^n and ŷ_t^n are denoted by Y and Ŷ respectively, and Ȳ is the average value of the set Y. The five criteria measure the merits of the model from the perspective of error, accuracy, and goodness of fit. The F in the accuracy indicator refers to the Frobenius norm.
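The loss and the five criteria can be sketched in NumPy. Equations (8) to (12) are not reproduced in this text, so the definitions below are the commonly used ones and should be read as assumptions: Accuracy = 1 − ||Y − Ŷ||_F / ||Y||_F, R^2 = 1 − Σ(y − ŷ)^2 / Σ(y − Ȳ)^2, and Var = 1 − Var(Y − Ŷ) / Var(Y). The regularization term is taken to be the sum of squared weights, which is also an assumption.

```python
import numpy as np

def training_loss(y_true, y_pred, weights, lam):
    """2-norm of the error plus lam * L_reg (L_reg assumed: sum of squared weights)."""
    l_reg = sum(np.sum(w ** 2) for w in weights)
    return np.linalg.norm((y_true - y_pred).ravel()) + lam * l_reg

def evaluate(y_true, y_pred):
    """Five criteria for (T, N) arrays of real and predicted SST."""
    err = y_true - y_pred
    return {
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAE": np.mean(np.abs(err)),
        "Accuracy": 1.0 - np.linalg.norm(err, "fro") / np.linalg.norm(y_true, "fro"),
        "R2": 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2),
        "Var": 1.0 - np.var(err) / np.var(y_true),
    }

# Toy data: 30 days, 4 nodes, temperatures around 20 degrees.
rng = np.random.default_rng(0)
y_true = 20.0 + rng.normal(size=(30, 4))
biased = evaluate(y_true, y_true + 0.5)    # constant 0.5 degree offset
```

With a constant offset, RMSE and MAE both equal 0.5 and R^2 falls below 1, while Var stays at 1 because the error has no variance. This illustrates why the explained variance score is reported alongside the error measures rather than instead of them.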
In a neural network model, the setting of model parameters is crucial to training. One example is the number of neurons in the hidden layer, which determines the computational complexity and predictive performance of the whole model. Therefore, to improve training efficiency and prediction accuracy, we ran experiments on the test set with different numbers of hidden units and selected the number that gave the best prediction effect. We set the number of neurons in the hidden layer to each value in [8, 16, 32, 64, 128] in turn. The variation of the error indicators with the number of hidden units is shown in Fig. 4, from which we can see that the RMSE and MAE reach their minimum when the number of hidden units is 32. Fig. 4 also shows that RMSE and MAE follow a similar decreasing trend as the number of hidden units increases from 8 to 32 and then increase once it exceeds 32. The prediction accuracy shows the opposite trend, rising first and then falling. This behaviour indicates that there is a critical value for the number of hidden units: beyond it, the complexity of the model keeps growing with the number of hidden units while its performance declines. Since the model performs best with 32 hidden units, the following experiments are carried out with the number of hidden units set to 32.

C. Experiment Results and Discussion
We performed prediction experiments for SST values 1 day, 7 days, 14 days, and 30 days into the future and measured the performance of the proposed STAGCN model and of the ARIMA, SVR, GCN, and GRU models with the five performance metrics. The SST prediction results of the STAGCN model are analyzed from the perspectives of prediction accuracy, temporal and spatial prediction ability, and long-term prediction ability. Table I shows the five metrics for the STAGCN model and the other models on the East China Sea dataset for the 1-day, 7-day, 14-day, and 30-day prediction tasks, with boldface marking the best value of each metric. An entry of "-" means that the value is negative, indicating that the model cannot predict well; such entries can be ignored. Compared with the other models, the proposed STAGCN model achieves excellent prediction performance under almost all conditions, indicating that it can capture the global spatio-temporal correlation and thus achieve accurate SST prediction. For the different prediction lengths, the prediction effects of the STAGCN model are better than those of the other models, demonstrating that it can predict SST accurately in both the short and the long term.
As can be seen from Table I, the prediction performance of the GRU model is better than that of the GCN model. For one-day-ahead SST prediction, the RMSE of the GRU model is approximately 0.07 lower than that of the GCN model, and its accuracy is 0.4% higher. The GCN model probably performs worse because the data have obvious time series features, whereas the GCN model captures only the spatial dependency without considering the temporal characteristics of the SST data. To better illustrate that the proposed STAGCN model captures both the temporal and the spatial dependence in SST data and obtains a satisfactory prediction effect, we visualized the RMSE of the STAGCN, GRU, and GCN models for SST prediction 1 day, 7 days, 14 days, and 30 days ahead and analyzed their prediction performance. The results are shown in Fig. 5. Fig. 5 (a) and Fig. 5 (b) compare the RMSE of the STAGCN model with that of the GCN model and the GRU model respectively. As the bar charts show, the RMSE of all models increases with the prediction length, but the error of the STAGCN model is smaller than that of the other two models, demonstrating that the STAGCN model can obtain the spatio-temporal correlation from SST series data. For example, the RMSE values of the STAGCN model are about 0.16, 0.21, 0.23, and 0.35 lower than those of the GCN model for 1-day, 7-day, 14-day, and 30-day SST forecasting, indicating that the STAGCN model also captures the temporal features that the GCN model alone misses. Compared with the STAGCN model, the RMSE values of the GRU model, which considers only temporal characteristics, are higher by about 0.07, 0.14, 0.14, and 0.17 for 1-day, 7-day, 14-day, and 30-day SST prediction respectively, indicating that the STAGCN model also captures the spatial features that the GRU model alone misses.
To reflect the prediction effect of the STAGCN model, we visualized the variation of the RMSE and the accuracy in SST forecasting for the next 1 day, 7 days, 14 days, and 30 days. The results are shown in Fig. 6. As can be seen from the figure, as the prediction length increases, the error of the STAGCN model increases considerably while the accuracy decreases only slightly. Although the RMSE of the STAGCN model is not stable across horizons, its prediction accuracy is relatively stable, indicating that the STAGCN model can also achieve long-term SST prediction, although it predicts short-term SST better than long-term SST. Fig. 7 (a) and Fig. 7 (b) show the RMSE and Accuracy of SST prediction by the different methods at the different prediction horizons. The STAGCN model achieves the lowest RMSE and the highest Accuracy for all prediction lengths.
To better explain the prediction performance of the STAGCN model, an ocean location point was randomly selected from the East China Sea dataset. At this point we predicted the SST for the next 1 day, 7 days, 14 days, and 30 days, and visualized the prediction effect over all the selected days and over the next 90 days. Fig. 8, Fig. 9, Fig. 10, and Fig. 11 show the visualization results for the 1-day, 7-day, 14-day, and 30-day forecast intervals.
The prediction results of the STAGCN model for prediction lengths of 1 day, 7 days, 14 days, and 30 days show that the model predicts poorly at peaks and troughs. The main reason may be that the GCN component of the STAGCN model captures spatial features by repeatedly applying a smoothing filter, which leads to excessive smoothing of the peaks in the overall prediction. The model also shows a corresponding delay in the overall prediction. Although the prediction performance of the STAGCN model decreases as the prediction length increases, the predicted values still fit the real values closely, and good prediction results can be obtained. The STAGCN model can capture the temporal correlation and the regional spatial characteristics of the SST time series and obtain the global spatio-temporal trend by using the attention mechanism, which reduces the prediction error, improves prediction accuracy, and realizes both long-term and short-term prediction of regional-scale SST.

IV. Conclusion
Sea surface temperature is an important index for detecting ocean changes, predicting SST anomalies, and preventing natural disasters caused by abnormal changes; its dynamic variation has a profound impact on the whole marine ecosystem and on climate change. It is therefore essential to predict future sea surface temperature. To achieve accurate SST prediction, this paper proposes the STAGCN model, which combines the GCN model with the GRU model and introduces an attention mechanism. We use a graph to model the network of ocean location points: nodes represent the ocean location points, and edges represent connections between them. The GCN model obtains the spatial correlation from the SST time series by constructing a spatial topology on the ocean points graph, which is derived from the distance function between nodes. The STAGCN model uses the GRU model to capture temporal dependence by filtering and retaining historical and current SST information. Meanwhile, the attention model captures the importance of the SST information in the output states and combines the global spatio-temporal characteristics of the SST information. The experimental results for short-term and long-term SST prediction on the dataset indicate that the STAGCN model achieves better prediction performance than the ARIMA, SVR, GCN, and GRU models. In conclusion, the STAGCN model obtains better forecasting results for both short-term and long-term SST prediction by capturing the global spatial characteristics and temporal dependence of SST series data.