Development of the GSTARIMA(1,1,1) model order for climate data forecasting

The space-time model combines spatial and temporal elements. One example is the Generalized Space-Time Autoregressive (GSTAR) Model, which improves the Space-Time Autoregressive (STAR) model. The GSTAR model assumes that each location has heterogeneity characteristics, and that the data is stationary. In this research, the moving average component is calculated by involving the relationship between variable values at a certain time and residual values at a previous time


Introduction
The time series is the realization of a stochastic process defined as a sequence of ordered observations over time indices (Wei, 2006).Referring to Box et al. (2015), there is a procedure for modelling time series based on the Box-Jenkins approach, which involves three stages: model identification, parameter estimation, and diagnostic checking.The identification stage aims to determine a suitable model for forecasting, including deciding the model order.The order in time series data serves to describe the number of lags or previous periods, and it also refers to the complexity level of the model used to represent the data.A higher order in a time series implies a more complex model.The appropriate order is crucial for producing a good and accurate model for forecasting or modelling time series.While a higher-order model may capture intricate patterns in time series data, estimating and interpreting can be more challenging.The principle of parsimony is applied to use a simpler model for a more straightforward interpretation.However, time series data may sometimes have intrinsic complexities requiring more complex models, such as in marine ecosystems, climate, global economy, and human social behavior.Therefore, the principle of parsimony is not an absolute rule but rather a guideline to be considered in time series modelling.
Time series can be combined with spatial and is commonly referred to as space-time series.An example of a space-time model is the Space-Time Autoregressive (STAR) model.The assumption in the STAR model is homogeneous location characteristics.According to Wei (2019), the STAR model is a special case form of the Vector Autoregressive (VAR) model.The STAR model was developed by Pfeifer & Deutsch (1980) applied to stationary data to analyze crime rates in Boston. Ruchjana (2002) extended the STAR model to the Generalized STAR (GSTAR) model for forecasting oil production in the Jatibarang field well locations.In the GSTAR model, locations are assumed to have heterogeneous characteristics, and the data is stationary.Borovkova et al. (2008) estimated the parameters of the GSTAR model using Ordinary Least Squares (OLS) and applied it to tea production.Nurhayati et al. (2012) compared GSTAR models of order 2 and order 1 to forecast the Gross Domestic Product (GDP) in Western European countries.Based on Mean Squared Forecast Error (MSFE) results, the GSTAR model of order 2 performed better.Prillantika et al. (2018) used GSTAR order 2 to compare parameter estimates with OLS and Kalman Filter for forecasting inflation rates, finding that parameter estimation using Kalman Filter was superior.Huda and Imro'ah (2023) modelled GSTAR of orders 1, 2, and 3, comparing five weight matrices, including uniform, queen contiguity, Minimum Spanning Tree (MST), inverse distance, and modified by railroad for forecasting Covid-19 in West Java Province.The results showed that the GSTAR model of order 3 using the MST weight matrix was the best forecasting model.Susanti et al. (2018) applied the GSTAR model to non-stationary data, resulting in the Integrated GSTAR model (GSTARI) for forecasting assets of Rural Credit Banks (BPR).Monika et al. (2022) utilizes a data mining approach to predict climate phenomena by combining the GSTARI model with exogenous variables and Autoregressive Conditional Heteroskedasticity (GSTARI-X-ARCH) to overcome heteroscedasticity conditions.
Di Giacinto ( 2006) developed the Generalized Space-Time Autoregressive Moving Average (GSTARMA) model of order 1 for autoregressive and moving average to analyze unemployment in Italy, using Maximum Likelihood Estimation (MLE) for parameter estimation.Akbar et al. (2020) applied the GSTARMA(1,0,1) model to forecast air pollution in Surabaya using two weight matrices: inverse distance and uniform.Parameter estimation methods included OLS and Seemingly Unrelated Regression (SUR).The research also discussed how the GSTARMA model can improve forecast errors compared to the GSTAR model.Andayani et al. (2017) used the GSTARMA model with added exogenous variables (GSTARMA-X).When applied to non-stationary data, the GSTARMA model requires differencing, resulting in the Generalized Space-Time Autoregressive Integrated Moving Average (GSTARIMA) model.Min and Hu (2010) developed the GSTARIMA(1,1,1) model and applied it to urban traffic network modelling and short-term traffic flow forecasting using the Least Square method for parameter estimation.Mubarak et al. (2022) applied the GSTARIMA(1,1,1) model to missing data to forecast gold prices.Sukarna et al. (2023) used the GSTARIMA (1,1,1) model to forecast COVID-19 in Sulawesi, comparing it with the GSTARI(1,1,0) and GSTIMA(0,1,1) models; both models showed similar accuracy based on Mean Absolute Percentage Error (MAPE).
One of the spatiotemporal phenomena is climate, as it can be observed based on location and time.According to the World Meteorological Organization (WMO) (2022), climate characterizes the average weather conditions for a specific area over a long period, requiring a minimum of 30 years for description.The European Union (2023) states that climate change affects all regions globally across various sectors, including agriculture, health, fisheries, tourism, and more.According to Ahlonsou et al. (2018), climate variables directly impacting daily life include rainfall, which has wet and dry seasons in tropical or subtropical regions.Wet months are typically characterized by higher rainfall and more frequent rain events, while dry months experience lower rainfall and less frequent rain.Wet months often occur in December, January, and February (DJF), and dry months in June, July, and August (JJA).The amount of climate data being observed is increasing rapidly, leading to the emergence of "Big Data" -huge and complex data sets that cannot be easily managed, processed, or analyzed using conventional techniques.Based on Dietrich et al. (2015) Big Data is characterized by 3Vs: volume, variety, and velocity.Volume refers to the size of the data, which can be in billions of rows and millions of columns.Variety describes the different forms in which data is produced, including structured and unstructured data.Velocity refers to the speed at which new data is created and grows.According to a survey by Transforming Data With Intelligence, organizations that implemented Big Data Analytics saw improvements in marketing focus, business insights, and client-based segmentation.A study by McKinsey Global Institute (2011) discovered that data holds comparable significance for organizations as both labor and capital.Those organizations adept at efficiently collecting, analyzing, visualizing, and leveraging insights from Big Data can set themselves apart from competitors, surpassing them in terms of operational efficiency.Organizations in any industry with Big Data can benefit from its careful analysis to gain insight and depth to solve real problems.The data analytics lifecycle method can be used to analyze Big Data.This method is designed specifically for Big Data and has six phases: discovery, data preparation, model planning, model building, communication results, and operationalized.By applying this method, Binbusayyis and Vaiyapuri (2019) were able to identify the main features of a cyber intrusion detection system.Based on the above exposition, this research develops a GSTARIMA(1,1,1) model order for climate data forecasting.The chosen model order aligns with model identification results using the Time Autocorrelation Function (STACF) and Space-Time Partial Autocorrelation Function (STPACF) since climate data falls into intrinsic complexities that require a more complex model.Additionally, parameter estimation for the GSTARIMA model is performed with model orders more significant than one using Maximum Likelihood Estimation (MLE).The climate data for this research is obtained from the National Aeronautics and Space Administration Prediction of Worldwide Energy Resources (NASA POWER), which falls under Big Data, and the data analytics lifecycle method is employed.Mean Absolute Percentage Error (MAPE) is used as the evaluation metric for the forecasting method.

Materials
This section discusses the theories that support research following the Box Jenkins procedure, starting from model identification, parameter estimation, and diagnostic checking-as well as the data analytics lifecycle method for analyzing Big Data.

Data Analytics Lifecycle
According to Dietrich et al. (2015), the data analytics lifecycle is specifically designed for data science projects and Big Data problems.The lifecycle consists of six phases, namely discovery, data preparation, model planning, model building, communicating results, and operationalization.These phases can be observed in Fig. 1.In most of the phases, the process can move forward to the next phase or return to the previous one.

Box-Jenkins Procedure
The Box-Jenkins method is a time series analyse technique that was introduced by George E. P. Box and Gwilym M. Jenkins.It is a three-stage procedure that involves model identification, parameter estimation, and diagnostic checking (Box et al. 2015).The model identification stage involves selecting a suitable model for the forecasting process and determining the model order.For univariate time series, the data must meet the stationarity requirements before proceeding with the Autocorrelation Function (ACF) and Partial ACF (PACF) plots.For multivariate time series, the ACF Matrix (MACF) and MPACF can be used, while for space-time series, Space-Time ACF (STACF) and STPACF are appropriate.The model parameter estimation stage is focused on estimating the model parameters, and the diagnostic checking stage aims to test the suitability and feasibility of the forecasting model.A feasible model should have significant parameters and the resulting errors should not have a particular pattern.Additionally, the process requires that white noise or errors are independent and have a normal distribution.

Generalized Space Time Autoregressive Integrated Moving Average Model
According to Wei (2019) the Generalized Space-Time Autoregressive Integrated Moving Average (GSTARIMA) model combines the GSTARI and GSTIMA models.The GSTARIMA model assumes heterogeneous location characteristics, non-stationary data, and white noise or zero average errors with constant uncorrelated variance, independent and normally distributed.

Distance Inverse Weight Matrix
Based on Wei (2019) a weight matrix is a square matrix that contains the weights of the corresponding locations as its elements.In the GSTARIMA model, the weight matrix is calculated based on the distance between locations to determine its elements.On the other hand, the inverse distance weight matrix is a weight matrix that is based on the actual distances between the locations.The inverse distance weight can be calculated using Eq. ( 7).
where  is an element of the distance inverse weight matrix at locations  and ,  is the distance from location  to location .Standardization is performed in the form  to obtain the value ∑  ( ) = 1.If it is assumed that there are three locations then, the inverse distance weight matrix is in the Eq. ( 8).

Maximum Likelihood Estimation
Based on Terzi (1995), Maximum Likelihood Estimation (MLE) is a method for estimating regression parameters by maximizing the likelihood function.The probability density function used is expressed in Eq. ( 9).
The likelihood function is as follows: The natural logarithm of the likelihood function is used, so that it becomes an Eq. ( 10).
The Eq. ( 11) multiplied by 2 ′, becomes, Next multiply both sides by ( )  , so it is obtained as follows: Transpose both sides, to obtain the Eq. ( 12).
For estimate parameters using the MLE method, Eq. ( 15) was transformed following the simple linear model  =  + .
The Eq. ( 16), already has the same structure as a simple linear model, where .

Diagnostic Checking
The diagnostic checking process is conducted to identify errors and determine whether the assumptions have been met or not.
The error assumptions that need to be fulfilled are the multivariate nature of white noise and its normal multivariate distribution.The Portmanteau test is executed to verify whether the multivariate properties of white noise are satisfied.Similarly, to avoid the assumption of a normal multivariate distribution, Chi-Square QQ plots are used.The Portmanteau test was first developed by Box and Pierce (1970) then Ljung and Box (1978) set it using standardized autocorrelation values.The Portmanteau test is conducted using Eq. ( 17).Referring to Tsai and Yang (2005), Chi-Square QQ plots are a graphical technique that aims to check the validity of assumptions in data by calculating the expected values based on the distribution.A normal distribution can approximate data if the plot resembles a straight line.
where  are many samples, and  autocorrelation at the -th lag.

Mean Absolute Percentage Error
Mean Absolute Percentage Error (MAPE) is an evaluation of a forecasting model that considers the influence of actual values.MAPE can be calculated using the Eq. ( 18) (Lawrence et al., 2009).
where,  () : actual value of the -th period at the -th location,

𝑒̂ (𝑡)
: residual of the -th period at the -th location,

𝑇
: the number of time series observations,  : many observation locations.
Referring to Lewis (1982) in Lawrence et al. (2009), there is a scale for assessing model accuracy based on the MAPE value which can be seen in Table 1.The smaller the MAPE value, the more accurate the model forecasting results.

Methodology
The research method used is the data analytics lifecycle, which has six phases: discovery, data preparation, model planning, model building, communication results, and operationalization.This research aims to develop the GSTARIMA(1,1,1) model using the Maximum Likelihood Estimation (MLE) method for forecasting climate data in West Java Province.The data used comes from NASA POWER, which can be accessed at https://power.larc.nasa.gov/data-access-viewer/.The data collected on NASA POWER is 51,818,587 TB, which includes big data.In NASA POWER, the data consists of three communities: Agroclimatology, Renewable Energy, and Sustainable Buildings.Climate data is available in the Agroclimatology community.The climate variable used in this research is rainfall.
In general, this research flow uses the Data Analytics Lifecycle which can be seen in Fig. 2 which is also used to help analyze Big Data.Starting from determining the research gap, data sources, hypotheses, data selection, choosing the suitable model, model development, and data processing, to interpreting research results, which can be an appeal to relevant agencies regarding rainfall.

Result and Discussion
This section contains the results and a discussion of this research, which follows the phases of the data analytics lifecycle.

Discovery
During the discovery phase, the first step is to identify the problem by studying and investigating it and developing an understanding process.In addition, the data sources needed and available for the research to be conducted are examined, and initial hypotheses, which will later be tested with the data, are formulated.

1) Framing Problem
The ongoing issue of climate change has become a global concern due to its devastating consequences, which pose a significant risk to the survival of all living creatures and the future of upcoming generations.According to the (European Union, 2023) climate change has affected various sectors worldwide, including agriculture, health, fisheries, and tourism.Precipitation is a critical climate variable that influences several aspects of our lives.The scarcity of rainfall can cause droughts, which have a detrimental impact on the ecosystem.Conversely, excessive rainfall can trigger floods and landslides, leading to severe consequences.
Climate data is a crucial component of space-time data, which can be modelled using space-time models.To identify research gaps, a literature review was conducted on space-time models utilizing various sources such as scientific journals, e-books, and previous research.Moreover, bibliometric analysis was employed to investigate the research gap.

Fig. 3. Bibliometric analysis with keywords "Generalized Space Time Autoregressive" OR "GSTAR"
According to the bibliometric analysis results in Fig. 3, the GSTARIMA model still needs to be developed further.This can be observed from the size of the circle in the GSTARIMA model, which is smaller than that of the GSTAR model.Therefore, this research focused on developing the GSTARIMA model order.Additionally, the GSTARIMA model can capture fluctuations in time series data by considering the error between observed and predicted values.The research aims to forecast climate data using the GSTARIMA model with the appropriate order.Rainfall variables obtained from NASA POWER and included in Big Data.The data analytics lifecycle method was employed to handle Big Data.

2) Identify Data Sources
The research utilizes climate data from West Java Province, specifically focusing on rainfall variables.The data is obtained from NASA POWER, a system and dataset developed by the United States Space Agency to provide information about global energy resources.NASA POWER offers information on various weather parameters such as solar radiation, air temperature, humidity, rainfall, and more, sourced from different sources, including satellites and global climate models.The data volume used in this research is 512,101,087 TB, which is widely distributed.

3) Hypothesis
The development of the GSTARIMA(1,1,1) model order will provide more accurate climate data forecasting using MAPE value criteria.

Data Preparation
The data preparation phase includes data exploration, pre-processing, or conditioning the data before modeling and analysis.

Fig. 4. Map of West Java Province
Daily rainfall data for 27 regencies/cities in West Java Province, collected from the NASA POWER database, amounts to 340,173 or 12,599 data for each regency/city from January 1989 until February 2023.Data from NASA POWER is satellite data with a radius of 57 , so some several regencies/cities have the same data.Locations were selected to represent the same data, and 11 regencies/cities were selected, Bandung City, Bekasi City, Bogor City, Tasikmalaya City, Cianjur Regency, Cirebon City, Sumedang Regency, Indramayu Regency, Subang Regency, Kuningan Regency, and Regency Pangandaran which can be seen in Fig. 4 marked with a green dot on the map.

2) Data Cleaning
Data from 11 regencies/cities that underwent the data selection stage were cleaned to handle missing values using Python software.No missing values were found in 11 regencies/cities data in West Java Province.

3) Data Transformation
Daily data that has undergone the data cleaning stage is aggregated into monthly data.Obtained 414 monthly data from 12,599 daily data for each regency/city.In this research, the data taken was for December, January, and February (DJF), so from 414 data, 104 monthly data were obtained.

4) Data Integration
In the integration stage, data is combined from various sources.In this research, the data combined is rainfall data in the month of DJF with latitude and longitude coordinate data.

Model Planning
After the data selection phase, the resulting data is split into two groups: 80% in-sample data, which consists of 83 data (from January 1989-February 2016), and 20% out-sample data, which consists of 21 data (from December 2016-February 2023).

Fig. 5. The correlation value between locations
The in-sample data is used for training purposes, while the out-sample data is used for testing the accuracy of rainfall forecasting using the GSTARIMA model.The correlation value was calculated to determine the attachment between locations; in this case Python software was used; the results can be seen in Fig. 5. Based on Fig. 5, the strongest attachment occurs in the Bekasi City area with Bogor City, with a value of 0.96.The process of data centering involved subtracting the data recorded at time  (()) from the average ( ̅ ).Afterward, a stationarity test was performed using the Augmented Dickey-Fuller (ADF) test with the aid of Python software.The adfuller function was utilized for this purpose.The results of the ADF test, which can be observed in Table 2, were obtained through this process.Based on Table 2, ten of the eleven locations have  −   where  = 0.05 so accept  or the data is not stationary.However, Pangandaran Regency's has a  −   so reject  or the data is stationary.To achieve data stationarity, a differencing process was conducted by subtracting the t-th time series data from the previous one.Python software was used to assist with this process by utilizing the diff function.The differencing process was applied only once during the research, and the results are shown in Table 2.After the differencing process, the  −  for all locations is smaller than  or reject  , which states that the data is stationary.Next, identification model with calculate the Akaike's Information Criterion (AIC) value to determine the best model order at each location.The AIC values obtained for each location can be seen in Table 3. Wei (2019) states that the GSTARIMA model is a special case of the Vector Autoregressive Integrated Moving Average (VARIMA) model, and the VARIMA model is a combination of ARIMA models of the same order and correlation attachment between locations.In this research, locations that have the same order were selected, that the observation locations that have the same order include Bekasi City, Bogor City and Indramayu Regency with the best model order ARIMA(3,1,1); Cianjur Regency and Cirebon City with the best ARIMA model order (4,1,1); and Tasikmalaya City, Sumedang Regency, Kuningan Regency, and Pangandaran Regency with the best model order ARIMA(0,0,1).(24) Do the same thing for the Eq. ( 21) and Eq. ( 22).The GSTARIMA(3,1,1) model was used to forecast climate data on the out-sample data.The results have been plotted in Fig. 8, which shows that the forecasting pattern matches the actual data.This indicates that using the GSTARIMA(3,1,1) model for forecasting is effective.However, forecasting using the GSTARIMA(3,1,1) model on climate data is only for shortterm forecasting, which can be seen in Fig. 8; only the first month to the third month is close to the actual data.Forecasting was conducted using the GSTARIMA(1,1,1) model, following the same process as the GSTARIMA(3,1,1) model.In this case, the principle of parsimony was applied by selecting order one in autoregression.The MAPE values were then calculated for both models to evaluate their forecasting performance.The GSTARIMA(3,1,1) model achieved a MAPE value of 11% for in-sample data and 9% for out-sample data.Meanwhile, the GSTARIMA(1,1,1) model gained a MAPE value of 12% for in-sample data and 11% for out-sample data.As the MAPE value of the GSTARIMA(3,1,1) model is smaller than that of the GSTARIMA(1,1,1) model, the former is better at forecasting climate data, especially rainfall in West Java Province.

Communicate Result
Based on the model building results, the GSTARIMA(3,1,1) model was used to forecast climate data, and the MAPE results for the out-sample data were 9%, which is considered very accurate.This answers the hypothesis in discovery phase, which states that the development of the GSTARIMA(1,1,1) model order provides more accurate forecasting results.
A visualization was conducted on rainfall forecasting data in West Java Province, specifically in Bekasi City, Bogor City, and Indramayu Regency.The results of the visualization of the forecasting data for December 2022 until February 2023 can be seen in Fig. 9. Based on Fig. 9, the location with the highest rainfall in December 2022 was Indramayu Regency, with a recorded amount of 10.83496761.On the other hand, Bekasi City and Bogor City had the lowest rainfall in December 2022, ranging from 7.627009869 to 7.983449618.In January 2023, Bekasi City had the highest rainfall, and Bogor City had the lowest rainfall, while in February, Bekasi City had the highest rainfall and Indramayu Regency had the lowest.The rainfall values across the locations were quite similar, as indicated in the legend.These findings suggest that there was an almost even distribution of rainfall intensity in West Java Province, so there was no extreme rainfall in the three regions in December, January, and February (DJF).

Operationalized
The development of the GSTARIMA(1,1,1) model order results in more accurate forecasting.Selecting the correct order is crucial for the model's effectiveness.This advancement is expected to have a significant impact on the field of modeling.Additionally, using the GSTARIMA(3,1,1) model for forecasting holds significant potential in offering advantages as suggestions to relevant organizations such as the Meteorology, Climatology and Geophysics Agency (BMKG).It also functions as an early warning system for forecasting climate, with a primary focus on rainfall in the province of West Java.

Conclusion
The research on climate data forecasting uses the data analytics lifecycle approach to develop the GSTARIMA(1,1,1) model.This approach makes research more structured by beginning with the discovery phase, which helps identify problems and research gaps for model development.The data preparation phase is also helpful for analyzing big data, ranging from 12,599 data to 104 data for each region/city.
In the model planning phase, the best model order for forecasting is determined using STACF and STPACF-estimation using the MLE method for the GSTARIMA(3,1,1) model in the model building phase.The result communication phase helps to analyze forecasting results, and the operationalized stage is a form of implementing solutions or actions based on insights and analysis results.
The research shows that the GSTARIMA (3,1,1) model provides better forecasting results than the GSTARIMA(1,1,1) model, as seen from the MAPE value.In the GSTARIMA(3,1,1) model, the MAPE value for out-sample data is 9%, and for in-sample data is 11%.In contrast, in the GSTARIMA(1,1,1) model, the MAPE value for out-sample data is 11%, and for in-sample data, it is 12%.This confirms that the hypothesis given at the discovery stage is accepted.Based on this research, it can be concluded that selecting the correct order is crucial for the model's feasibility and accuracy in spatiotemporal forecasting.Therefore, it can be concluded that choosing the correct order will produce a more accurate model.
Research flow chart

Fig. 6 .
Fig. 6.STAF and STPACF rainfall data in West Java Province

Table 1
MAPE value scale

Table 2
Stationarity test of climate data in West Java Province

Table 3
AIC value to determine the best model order