SPATIOTEMPORAL CONVOLUTIONAL LSTM WITH ATTENTION MECHANISM FOR MONTHLY RAINFALL PREDICTION

FREDYAN, KUSUMA


INTRODUCTION
Rainfall forecasting is a crucial analysis for regulating water resources, and it often involves several variables because rainfall is part of a broader set of meteorological phenomena. Prediction becomes more complicated under climate change in tropical areas such as Indonesia, which lies on the equator and is influenced by conditions to both the north and the south.
Furthermore, climate change has affected rainfall patterns, causing natural disasters such as heavy rains that result in flooding or prolonged absences of precipitation that result in droughts [1]. Drought Management Plans (DMPs) are regulatory instruments that establish priorities among different water uses and define more stringent constraints on access to publicly available water during droughts and periods of reduced water supply, reflecting the vulnerability to drought events under climate change. To deal with this problem, an accurate rainfall prediction method is needed [2]. Precise rainfall forecasts, both short and long term, have significant benefits in water resource management, flood control, disaster reduction, and agricultural management [3]. However, rainfall is a complicated nonlinear atmospheric system that depends on space and time, and many factors can influence rain in a given area [4]. Therefore, the complexity and uncertainty of rainfall predictability make it difficult to produce precise and accurate rainfall forecasts [5]. Forecasts of rainfall, the beginning of the rainy season, the duration of the rain, and the end of the rainy season are determined on a monthly basis, often using a three-month system known as the SPI (Standardized Precipitation Index) method [6]. In addition, monthly rainfall can provide a more accurate distribution of the mean intra-year rain than seasonal rainfall [7]. Hence, it is vital to periodically estimate rainfall on a monthly time scale, for which rainfall predictions are usually made using physically based models and deep learning methods [8]. Climate Hazards Group Infrared Precipitation with Stations (CHIRPS) is a quasi-global (50° S-50° N), high-resolution (0.05°) environmental record that provides daily, pentadal, and monthly rainfall datasets.
These datasets cover the land surface spatially and span 1981 to 2020 temporally, which makes it possible to visualize rainfall conditions at every location on land. Scientists from various countries developed CHIRPS to support the United States Agency for International Development Famine Early Warning Systems Network (FEWS NET) [9]. The approach builds on thermal infrared (TIR) precipitation estimates, which have been successful in products such as the National Oceanic and Atmospheric Administration's (NOAA) Rainfall Estimate. CHIRPS uses the Tropical Rainfall Measuring Mission Multi-Satellite Precipitation Analysis version 7 to calibrate global Cold Cloud Duration (CCD) rainfall estimates. In addition, CHIRPS employs a state-of-the-art 'intelligent interpolation' approach that works with anomalies from a high-resolution climatology [10].
CHIRPS is spatiotemporal data, which requires specific consideration when applying deep learning predictors such as LSTM and GRU. While most deep learning approaches build models from temporal data only, spatiotemporal data calls for a different choice of algorithm. It is necessary to understand the features of such data, which are high-dimensional and temporally correlated, meaning that the data are indexed by up to two dimensions in space and one in time [11]. In general, spatiotemporal data exhibit spatial correlation between nearby locations, like neighboring pixels in a photo, and temporal correlation between adjacent timestamps [12].
To handle spatiotemporal data, Tao et al. [3] used an attention mechanism to enhance the prediction model. The attention mechanism was introduced by Bahdanau et al. [13] to improve the accuracy of machine translation. Another attention model, Multi-Head Attention, comes from Vaswani et al. [14] and underlies the transformer architecture.
Deep learning has spread across many areas of prediction and classification. For sequential or time-series data, Recurrent Neural Networks (RNNs) and their derivatives maintain state vectors that are updated at every step and trained with backpropagation through time [15]. However, a traditional RNN has trouble with long data sequences, where training over long lags leads to the vanishing gradient problem [16]. LSTM addresses this by using a memory cell to improve on the RNN and avoid vanishing gradients, and it has been extended with modifications such as encoder-decoder structures and attention mechanisms [17].
In this study, the authors propose a Convolutional LSTM with an additional attention layer (Convolutional LSTM-AT) to improve the accuracy of monthly rainfall prediction using gridded CHIRPS data. Hyperparameters are tuned manually and kept the same across models so that the comparison is fair. We compare and analyze each model's loss and its performance according to the evaluation metrics most used in hydrology and deep learning. The results indicate that the proposed Convolutional LSTM-AT model performs best among the models compared. We also analyze the spatial and temporal behavior to interpret the physical causality of our model.

RELATED WORKS
In this section, the authors review relevant research that inspired the construction of the Convolutional LSTM-AT model, including fundamental studies in rainfall forecasting, LSTM-based models for sequential data, and attention-based models originating in machine translation.

Rainfall Forecasting
Seasonal prediction models are commonly used to predict rainfall so that government agencies can issue early warnings before hydrological extremes affect people. According to climatologists, climate prediction models can be classified into three approaches: first, the physical or numerical approach; second, the empirical or statistical approach; and third, a mix of the physical and empirical approaches [18]. Rainfall depends on numerous coupled processes involving the land, the oceans, and the atmosphere. Physical models are generally developed based on interpretations of atmospheric processes, but they frequently show weak predictability in providing good information on annual climate variability [19]. In general, physical-empirical prediction models, which are the most used by climatologists, are developed using traditional statistical approaches. For instance, Zhu and Li [20] (2017) applied a regression method to predict the wet season in East Asia. Li and Wang [21] (2018) studied the forecast capability for summertime extreme rainfall days in eastern China using a stepwise regression model. However, traditional regression-based methods are inadequate for forecasting highly nonlinear and nonstationary behavior. Therefore, the connection of local climate in a specific area with ocean-atmospheric variables such as sea surface temperature (SST) or sea level pressure cannot be described by traditional regression models [22].

LSTM-Based Methods
Long Short-Term Memory (LSTM), introduced by Hochreiter and Schmidhuber [23], is a modified version of the recurrent neural network designed to resolve the vanishing gradient problem and thereby capture long-range dependencies in sequential (time) data. Yuan et al. [24] proposed an LSTM network model of building occupancy for energy simulation, operation, and management.
ElSaadani et al. [25] used an LSTM model to predict soil moisture and fill gaps between observations. Further, Zhou et al. [26] combined the LSTM model with an attention mechanism adapted from machine translation to recognize skeleton-based abnormal behavior. Their results indicated that an attention-based LSTM recognizes behavior better than a plain LSTM model.

Attention-Based Methods
In deep learning, one way to increase accuracy during model learning is the attention mechanism, which is inspired by selective human vision: it chooses which information to pay special attention to and which to ignore. Attention mechanisms have been applied in various areas of research and industry, such as machine translation, image captioning, and video motion recognition. Song et al. [27] proposed an end-to-end spatiotemporal attention model for recognizing and predicting human actions in video frames. In addition, Chen et al. [28] proposed a model that combines spatial and channel attention with a convolutional neural network for image labeling, which achieved good results on their data set. Ding et al. [29] proposed a spatiotemporal LSTM to predict floods in three basins in China. Tao et al. [3] also proposed an LSTM with an attention mechanism to improve monthly rainfall prediction, which performed well at most spatial points. Inspired by the models above, we propose a multi-head attention LSTM to optimize monthly rainfall prediction with spatiotemporal data.

STUDY AREA AND DATASET
In this study, Kalimantan Timur (East Kalimantan) was selected as the study area to evaluate and compare the performance of several LSTM models in forecasting monthly rainfall. The monthly rainfall data is CHIRPS data accessible from https://data.chc.ucsb.edu/products/CHIRPS-2.0/. The 40 years of data from January 1981 to December 2020 were used as the dataset for the models; Figure 1 shows a sample for December 2020. CHIRPS rainfall data is distributed as worldwide raster data, whereas this research focuses only on the Kalimantan Timur region, so the data needs to be clipped. First, a boundary map of the Kalimantan Timur area is required from https://tanahair.indonesia.go.id/. Because that source provides city and district data, the boundaries need to be combined using the ArcGIS application.
Furthermore, once the boundary of the East Kalimantan region is obtained, the worldwide rainfall raster is clipped using the SAGA application. The clipping process requires the longitude and latitude extent and a grid size that matches the worldwide raster, which is 0.05° x 0.05°; the result can be seen in Figure 2. In Figure 2, the data is visualized in black and white, where black represents the sea surface and white the land surface. Raster is one of the best data formats for representing surface areas, since a raster can hold multiple bands to describe complex spatial conditions. CHIRPS contains a single band that gives monthly precipitation values without additional variables. It can be seen in Figure 2 that the data has three dimensions. As shown in Figure 3, it has spatial dimensions of 89 x 89 and a temporal dimension of 480, in this case months. This three-dimensional structure makes the research more complex, since a specific method is needed so that the spatial and temporal dimensions are not biased or lost.

PROPOSED METHOD
Rainfall is critical in supporting human life, and various policies often treat rainfall as the main factor. Based on rainfall data, climate classification can be done according to the ratio between the average number of dry months and the average number of wet months. A dry month occurs when the monthly rainfall is less than 60 mm/month, while a wet month occurs when the monthly rainfall is above 100 mm/month. A humid month lies between the two, with monthly rainfall of 60-100 mm/month.
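The thresholds above can be illustrated with a minimal Python sketch. The function names `classify_month` and `dry_wet_ratio` are hypothetical, and treating exactly 60 mm and exactly 100 mm as humid is an assumption, since the text only states "less than 60" and "above 100".

```python
def classify_month(rain_mm):
    """Label one month from its rainfall total in mm/month."""
    if rain_mm < 60.0:
        return "dry"
    if rain_mm > 100.0:
        return "wet"
    return "humid"  # assumption: the 60 and 100 mm boundaries count as humid


def dry_wet_ratio(monthly_rain_mm):
    """Ratio of dry months to wet months in a rainfall series.

    For a series covering whole years, the ratio of counts equals the
    ratio of the per-year averages, because both share the same number
    of years.
    """
    labels = [classify_month(r) for r in monthly_rain_mm]
    n_dry = labels.count("dry")
    n_wet = labels.count("wet")
    return n_dry / n_wet if n_wet else float("inf")
```

A lower ratio indicates a wetter climate, because wet months outnumber dry months.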

Overview
The variety of data and models keeps growing, offering many perspectives for understanding data and building the best alternative model. The authors reviewed the literature to identify the newest and best models and their strengths and weaknesses. Still, rainfall forecasting models struggle to predict rainfall accurately and precisely. Hence, the authors built the proposed Convolutional LSTM-AT model as an alternative solution to optimize monthly rainfall prediction with the spatiotemporal dataset.

Data Preprocessing
The data preprocessing stage is a data selection stage that aims to obtain relevant data for use. Raw data often contains missing values, misrecorded values, poor sampling, and other defects. Because this research uses secondary data rather than raw data, preprocessing is limited to handling the spatial and temporal structure of the data. In addition, preprocessing focuses only on cells that hold a value, so cells with no data are not used.

FIGURE 4. Illustrated spatiotemporal data using the sliding window in spatial perspective
In this study, focal operation theory is implemented: a spatial function that calculates the output value of each cell using the values in its neighborhood, similar to the k-nearest neighbors (K-NN) machine learning algorithm, as shown in Figure 4 [30]. This idea is also commonly used in convolutions, kernels, and moving windows in deep learning algorithms such as CNN or RNN. A moving window can be imagined as a square arrangement of cells of a specific size, 3 x 3 in this study, that shifts its position in fixed steps. Because the operation is applied to each cell of the moving window, the values in the raster tend to become smoother. It was adopted in this study to smooth the predicted values spatially.
Spatiotemporal data generally lie in a continuous space, while classical data sets such as images or video are usually in a discrete space. Spatiotemporal patterns usually present very complex spatial and temporal properties, and correlations between data are challenging to explain with traditional methods. Finally, a standard statistical assumption is that samples are obtained independently. This does not hold in spatiotemporal analysis, because spatiotemporal data tend to be highly correlated, so the spatial and temporal components cannot be studied separately.
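The 3 x 3 focal operation can be sketched in NumPy as follows. This is an illustrative focal mean, not necessarily the exact operator used in the study; the function name `focal_mean` and the use of NaN to mark no-data cells are assumptions.

```python
import numpy as np


def focal_mean(raster, size=3):
    """Mean of each cell's size x size neighborhood, ignoring no-data (NaN) cells."""
    raster = np.asarray(raster, dtype=float)
    pad = size // 2
    # Pad with NaN so that edge cells simply have fewer valid neighbors.
    padded = np.pad(raster, pad, mode="constant", constant_values=np.nan)
    rows, cols = raster.shape
    total = np.zeros((rows, cols))
    count = np.zeros((rows, cols))
    # Accumulate every shifted copy of the raster (one per window offset).
    for dr in range(size):
        for dc in range(size):
            window = padded[dr:dr + rows, dc:dc + cols]
            valid = ~np.isnan(window)
            total += np.where(valid, window, 0.0)
            count += valid
    out = np.full((rows, cols), np.nan)
    np.divide(total, count, out=out, where=count > 0)
    # Cells that had no data stay no-data in the output.
    out[np.isnan(raster)] = np.nan
    return out
```

Averaging over the neighborhood is what makes the raster values smoother, as described above.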
As explained earlier, the data at each time step is an 89 x 89 grid, and there are 480 time steps, as shown in Figure 4. For modeling, the data is taken spatially in 3 x 3 windows over 13 months, and the window slides along the temporal axis.
Spatially, the window moves to the right one step at a time; after the last window at the right edge, it continues from the left of the next row. This is shown by the blue area in Fig. 4 and continues to the end of the spatial data at the bottom-right corner.
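The windowing described above can be sketched with NumPy's `sliding_window_view`. The function name `make_windows`, the NaN convention for no-data cells, and the choice of the full 3 x 3 patch of month 13 as the target are assumptions made for illustration; the text does not state which cells form the target.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(cube, n_in=12, n_out=1, size=3):
    """Cut a (time, rows, cols) cube into stride-1 spatiotemporal samples.

    Returns X with shape (N, n_in, size, size) and y with shape
    (N, n_out, size, size). Windows containing no-data (NaN) are dropped.
    """
    depth = n_in + n_out  # 13 months: 12 inputs plus 1 target
    views = sliding_window_view(cube, (depth, size, size))
    # Flattening order: time offset first, then row, then column (left to right).
    windows = views.reshape(-1, depth, size, size)
    keep = ~np.isnan(windows).any(axis=(1, 2, 3))
    windows = windows[keep]
    return windows[:, :n_in], windows[:, n_in:]
```

For a 480 x 89 x 89 cube this would yield at most (480 - 13 + 1) x (89 - 3 + 1)^2 = 3,542,292 windows before no-data windows are removed, so in practice the cube would likely be processed in chunks. Each input can then be flattened to 12 time steps of 9 features with `X.reshape(len(X), 12, 9)`.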

Data Clustering
Clustering is a data mining technique for finding groups of data with similar characteristics. It is part of traditional machine learning and is an unsupervised method, which requires only training data without target data [31]. In theory, cluster analysis groups data based on variables or features so as to maximize the similarity of characteristics within a cluster and maximize the differences between clusters [32]. A popular algorithm is K-means clustering, which groups data based on the distance between each data point and the cluster centroid, obtained through an iterative process [33]. The analysis requires the number of clusters K as input to the algorithm.
$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (1)$$
In Eq. (1), $J$ is the objective function, $k$ is the number of clusters, $n$ is the number of cases, $x_i^{(j)}$ is case $i$ in cluster $j$, and $c_j$ is the centroid of cluster $j$. In K-means clustering, this distance can be measured using the Euclidean distance, the Manhattan distance, the squared Euclidean distance, or the cosine distance. The choice of distance measure affects how the algorithm calculates similarity within a cluster and therefore the shape of the clusters.
Nevertheless, a problem arises when determining the number of clusters K, because no theory states how best to choose it, even though K is essential to finding the clusters.
The researchers solve this problem using the Elbow Method, which relies on a visual assessment of a line graph where the x-axis is the number of clusters K and the y-axis is the Within-Cluster Sum of Squares (WCSS) value.
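A minimal NumPy sketch of K-means with the squared Euclidean distance and the WCSS curve used by the Elbow Method is given below. The names `kmeans` and `elbow_curve`, the random initialization, and the iteration limit are illustrative assumptions; the features used to describe each spatial point in the study are not specified here, so any `points` array of shape (samples, features) stands in for them.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns labels, centroids, and WCSS."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # WCSS: squared distance of every point to its nearest final centroid.
    dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)
    wcss = float(dist[np.arange(len(points)), labels].sum())
    return labels, centroids, wcss


def elbow_curve(points, k_values):
    """WCSS for each candidate K; plot it against K and look for the bend."""
    return [kmeans(points, k)[2] for k in k_values]
```

Calling `elbow_curve(points, range(1, 11))` gives the values for the y-axis of the elbow plot; the elbow itself is still picked by visual inspection.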

Convolutional LSTM-AT
LSTM is derived from the RNN for sequential data and has three gates: the input, output, and forget gates. These gates allow the LSTM to store and access information or characteristics of the data over time, which, according to Hochreiter and Schmidhuber [23], mitigates the vanishing gradient problem. The model parameters are the weights $W$ or $U$ applied to the inputs and the bias term $b$; $i_t$, $o_t$, $f_t$, $C_t$, and $\tilde{C}_t$ respectively represent the input gate, output gate, forget gate, and memory (cell state and candidate state); $h_t$ is the hidden state and $\sigma$ is the sigmoid activation function. The activation depends on the data and can sometimes be replaced by the hyperbolic tangent or ReLU [29] [34] [35].
$$\tilde{C}_t = \tanh(W_C h_{t-1} + U_C x_t + b_C) \qquad (3)$$
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \qquad (4)$$
The attention mechanism is often used to optimize sequence models in deep learning. Hard attention refers to selecting a single input feature, which means the attention weight can only be 0 or 1. Soft attention refers to weights between 0 and 1, so the range of weight selection is more flexible [36]. Several attention models have been proposed, including the attention of Bahdanau et al. [13] and the Multi-Head Attention of Vaswani et al. [14]; empirically, additive attention can improve the performance of the model and the attention layer and makes the weight of each unit visible.
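The cell update and soft attention can be sketched in NumPy as below. This is a plain (non-convolutional) LSTM step with a simplified additive-style scoring function, shown only to make the notation concrete; it is not the Convolutional LSTM-AT architecture of the study. The gate equations for $f_t$, $i_t$, $o_t$, and $h_t$ follow the standard LSTM formulation, and all function names, shapes, and the initialization are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_in, n_hidden, rng):
    """Hypothetical small random initialization for the f, i, o, c blocks."""
    gates = "fioc"
    return {
        "W": {g: rng.normal(0, 0.1, (n_hidden, n_hidden)) for g in gates},
        "U": {g: rng.normal(0, 0.1, (n_hidden, n_in)) for g in gates},
        "b": {g: np.zeros(n_hidden) for g in gates},
    }


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: gates, candidate memory, cell state, hidden state."""
    W, U, b = params["W"], params["U"], params["b"]
    f_t = sigmoid(W["f"] @ h_prev + U["f"] @ x_t + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ h_prev + U["i"] @ x_t + b["i"])      # input gate
    o_t = sigmoid(W["o"] @ h_prev + U["o"] @ x_t + b["o"])      # output gate
    c_tilde = np.tanh(W["c"] @ h_prev + U["c"] @ x_t + b["c"])  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                          # cell state update
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


def run_sequence(seq, params):
    """Run the cell over a (time, features) sequence; returns all hidden states."""
    n_hidden = params["b"]["f"].shape[0]
    h, c = np.zeros(n_hidden), np.zeros(n_hidden)
    states = []
    for x_t in seq:
        h, c = lstm_step(x_t, h, c, params)
        states.append(h)
    return np.stack(states)


def soft_attention(hidden_states, v, W_a, b_a):
    """Soft attention: weights lie in (0, 1) and sum to 1 over time steps."""
    scores = np.tanh(hidden_states @ W_a.T + b_a) @ v  # simplified additive score
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ hidden_states  # weighted sum of hidden states
    return context, weights
```

Hard attention would correspond to replacing `weights` with a one-hot vector at the highest-scoring time step.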

FIGURE 5. Spatiotemporal using Convolutional LSTM Attention Layer based
Modifying the original LSTM with an attention mechanism is necessary to fully utilize the spatiotemporal input information. The authors take rainfall as the input feature, and the output of the model is the rainfall prediction for the next n steps. Spatial and temporal attention weights affect the input and output of the LSTM cells [37]. With the help of the spatiotemporal attention module, the attention weights can be adjusted dynamically, which improves the performance of the LSTM cells [38]. The model, shown in Figure 5, is trained with the Adam optimizer [39]. Before training, the network architecture must be determined, including the number of layers, the number of neurons in each layer, the activation function, and other parameter values; these are listed in Table 1. The input layer follows the features used: 9 spatial features serve as input neurons, where the number 9 comes from the 3 x 3 spatial window. Temporally, each sample has 12 time steps as input and one time step as the target.
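For readers unfamiliar with the optimizer, the Adam update rule can be sketched as follows. The default values shown (learning rate 0.001, beta1 = 0.9, beta2 = 0.999) are the commonly cited defaults and are assumptions here, since the study's actual settings are those in Table 1; the helper `minimize_quadratic` is a hypothetical toy problem, not part of the model.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


def minimize_quadratic(start=1.0, steps=500, lr=0.1):
    """Toy usage: minimize f(x) = x^2, whose gradient is 2x."""
    theta = np.array([start])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=lr)
    return float(theta[0])
```

Because the step is scaled by the running gradient magnitude, the first update has a size close to the learning rate regardless of how large the gradient is.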

Experimental Design
Five models are built in full so that the proposed model can be compared with the others; the whole architecture is described in Table 1. Postprocessing aims to produce better rainfall predictions than "raw" (unprocessed) hydrological simulations. To this end, it is important to evaluate the performance of each model and compare the models with each other to conclude which is best. Several metrics are used to evaluate predictions for different lead times. Since accurate and reliable predictions are crucial during rainfall events, the primary accuracy measure for a deterministic forecast is the root-mean-square error (RMSE) in equation (8):
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2} \qquad (8)$$
where $\hat{y}_i$ denotes the $i$-th predicted rainfall, $y_i$ denotes the corresponding observed rainfall, and $N$ represents the total number of monthly rainfall predictions. Compared with the mean absolute error (MAE), RMSE penalizes large errors more heavily [40], which is desirable for high-rainfall forecasts. Unlike RMSE, which gives a relatively high weight to large errors, MAE is a linear measure and is more applicable when the overall impact of errors is proportional to the size of the error. MAE can be formulated as [40] in equation (9):
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right| \qquad (9)$$
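Both metrics are straightforward to compute; a minimal NumPy sketch follows, with the function names `rmse` and `mae` chosen for illustration.

```python
import numpy as np


def rmse(y_pred, y_true):
    """Root-mean-square error between predicted and observed rainfall."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def mae(y_pred, y_true):
    """Mean absolute error between predicted and observed rainfall."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))
```

For example, a single miss of 4 mm among four otherwise perfect predictions gives an MAE of 1 but an RMSE of 2, which shows how RMSE weights large errors more heavily.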

RESULTS AND DISCUSSION
Seven models have been built to forecast rainfall in the Kalimantan Timur area, using 12 monthly time steps to predict one month ahead. Those models are:
• RNN: Recurrent Neural Network, which allows previous outputs to be used as inputs while maintaining hidden states.
• GRU: Gated recurrent unit (GRU) is a gating mechanism implemented in recurrent neural networks.
• LSTM: Long Short-Term Memory Network is a famous variant of RNN having three gates.
• Convolutional LSTM-AT: Combination of Convolutional and LSTM with attention layer, as shown in Figure 5.

Clustering Result
Every spatial point has a different statistical distribution, so different models should be trained for different clusters of spatial points with similar characteristics. For this reason, we use K-means clustering to group the spatial points, and we use the Elbow method to find the optimal number of clusters. This paper evaluates all clusters as input candidates for building the proposed model, since every cluster has its own characteristics that determine which spatial points belong to it. This work may be a first step toward exploring other clustering approaches.
In the Elbow method, the number of clusters K is varied from 1 to 10. For each value of K, the WCSS (Within-Cluster Sum of Squares) is calculated, which is the sum of squared distances between each point and the centroid of its cluster. When the WCSS is plotted against K, the plot looks like an elbow. The WCSS value is largest when K = 1 and decreases as the number of clusters increases. At a certain point the graph changes rapidly, creating the elbow shape, and from this point on it runs almost parallel to the x-axis. The K value corresponding to this point is the optimal number of clusters. Figure 6 shows that the best number of clusters is 4, the largest number of clusters that still gives a significant reduction in distance; the clusters are Cluster 0, Cluster 1, Cluster 2, and Cluster 3. Table 2 lists the WCSS values plotted in Figure 6, and Table 3 shows the locations assigned to each of the 4 clusters.

Convolutional LSTM-AT Result
The first step of this experiment is building the proposed method, an LSTM modified with an attention mechanism layer. The challenge in building the model is finding the best hyperparameter values. As shown in Table 1, we use constant hyperparameters and build all models with the same hyperparameters but different architectures. The results showed that the proposed method still performs better than the other methods in the average value over spatial points. The attention-based models are more accurate and robust than the original LSTM model and reduce the error significantly. This shows that the proposed method remains stable in achieving the minimum value over spatial points. Its smaller errors suggest that this model performs much better on CHIRPS data. The performance of all models was satisfactory in terms of average MAE, since the average is computed over test results for all spatial points, which have different characteristics.
The RMSE and MAE of the predictions from the models in the experiments are shown in Table 4. On the CHIRPS dataset, the proposed Convolutional LSTM-AT model has the lowest error, even when the maximum value over all spatial targets is used. However, we can infer that the dataset is already split by

Table 5

For future work, we will look for other ways to reduce the error of the model. Possible directions include how to preprocess and train on three-dimensional data without losing spatial information. Besides, we will investigate more architectures and develop spatiotemporal approaches. We will consider further improving the performance of the model by utilizing graph information about the area being predicted. Moreover, it would be valuable to add flood data augmentation and a physical interpretation of the model to bring predictions closer to the ground truth.