To achieve the aforementioned goal, the model was built using monthly and annual rainfall data for 1901 to 2020, retrieved from the web-based spatial data portal of India's Water Resources Information System (https://indiawris.gov.in/wris/#/rainfall). For rainfall forecasting, the monthly rainfall data have been aggregated to produce total rainfall on a periodic and yearly basis. In this paper, several statistical approaches have been used for rainfall forecasting, including the Autoregressive Integrated Moving Average (ARIMA) method in Python (version 3.10), and linear (ordinary least squares) and polynomial regression in MS Excel. Finally, to find the best-fit model, the Root Mean Square Error (RMSE) has been calculated using observed and forecast rainfall data from 2001 to 2020. The methodological framework of the study is as follows:
Rainfall Forecasting
The process of forming assumptions about the future values of the variables under investigation is known as forecasting (Box & Jenkins, 1976). Before the rainfall series was examined in detail, the serial correlation effect was checked, since it plays a vital role in assessing, and ultimately reducing, the uncertainty of rainfall forecasting.
Augmented Dickey Fuller Test (ADF Test)
The Augmented Dickey-Fuller test is one of the most widely used statistical tests for analyzing the stationarity of a series. Before predicting rainfall with ARIMA, it is necessary to check whether the time series data are stationary. In this study, the ADF test has been applied in Python to check for stationarity; for this test, the adfuller function has been imported from the statsmodels.tsa.stattools module.
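A minimal sketch of this check, assuming the rainfall series has been loaded into a pandas Series (the file name, date column and rainfall column are placeholders, not the study's actual data layout):

```python
# Hypothetical sketch of the ADF stationarity check.
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# 'rainfall.csv' with 'date' and 'rainfall' columns is an assumed layout.
series = pd.read_csv("rainfall.csv", parse_dates=["date"], index_col="date")["rainfall"]

# adfuller returns the test statistic, p-value, lags used, number of
# observations, the critical values and the maximized information criterion.
adf_stat, p_value, used_lags, n_obs, crit_values, icbest = adfuller(series)
print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.3f}")

# H0: the series has a unit root (i.e., it is non-stationary).
if p_value < 0.05:
    print("Reject H0: the series can be treated as stationary.")
else:
    print("Fail to reject H0: difference the series before fitting ARIMA.")
```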
Serial Correlation Effect
Serial correlation, or autocorrelation, is used to find a link between a variable's current value and whatever prior values can be accessed. It is a reliable way to find hidden trends and patterns in time series data that would otherwise go unnoticed. When using ARIMA to forecast rainfall, it is assumed that the observed time-series data are serially independent. However, significant serial correlation coefficients may arise in time series rainfall data, necessitating a test of the serial correlation effect when reviewing a series of historical data.
The autocorrelation and partial autocorrelation functions in Python have been used to evaluate the significant autocorrelation coefficients at varying lags at the 0.05 significance level. Lag-1 autocorrelation is often used to examine the influence of serial correlation in time series data (Anderson, 1942). The lag-1 autocorrelation coefficient is the simple correlation coefficient between the first N-1 observations, \({X}_{t}\), t = 1, 2, 3, ..., N-1, and the subsequent observations, \({X}_{t+1}\), t = 1, 2, 3, ..., N-1 (Sharma & Singh, 2017). The formula for calculating the relationship between \({X}_{t}\) and \({X}_{t+1}\) is given below:
$${r}_{1}=\frac{\sum _{t=1}^{N-1}\left({X}_{t}-\bar{X}\right)\left({X}_{t+1}-\bar{X}\right)}{\sum _{t=1}^{N}{\left({X}_{t}-\bar{X}\right)}^{2}} \left(I\right)$$
where \(\bar{X}=\frac{1}{N}\sum _{t=1}^{N}{X}_{t}\) is the overall mean.
The coefficient r1 is then investigated to test its significance. The 95% probability limits of the correlogram of an independent series (two-tailed) are given below (Sharma & Singh, 2017):
$${r}_{1}\left(95\text{\%}\right)=\frac{-1\pm 1.96\sqrt{N-k-1}}{N-k} \left(II\right)$$
where, N denotes the sample size and k denotes the lag.
The data are presumed to be serially dependent if r1 is outside the provided confidence interval and serially independent if r1 falls within the interval.
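Equations (I) and (II) translate directly into code; a minimal sketch, where `series` is the rainfall series loaded earlier and the function names are illustrative:

```python
import numpy as np

def lag1_autocorrelation(x):
    # Lag-1 autocorrelation coefficient r1, Eq. (I).
    x = np.asarray(x, dtype=float)
    x_mean = x.mean()
    num = np.sum((x[:-1] - x_mean) * (x[1:] - x_mean))
    den = np.sum((x - x_mean) ** 2)
    return num / den

def anderson_limits(n, k=1):
    # 95% probability limits of an independent series, Eq. (II).
    half_width = 1.96 * np.sqrt(n - k - 1)
    return (-1 - half_width) / (n - k), (-1 + half_width) / (n - k)

r1 = lag1_autocorrelation(series)
lower, upper = anderson_limits(len(series))
print(f"r1 = {r1:.3f}, 95% limits = ({lower:.3f}, {upper:.3f})")
# r1 outside the limits -> serially dependent; inside -> serially independent.
```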
The Auto-Regressive Integrated Moving Average (ARIMA)
The Autoregressive Integrated Moving Average (ARIMA), or Box-Jenkins, method has been extensively used as a forecasting technique. The autocorrelation function (ACF) and partial autocorrelation function (PACF) of the sample data were proposed as the main tools for determining the ARIMA model's order (Box & Jenkins, 1976). The model is expressed as ARIMA (p, d, q), where p represents the order of the autoregressive process, d represents the order of differencing needed to make the data stationary and q represents the order of the moving average process (Zhang, 2003). The ARIMA model is implemented in the following steps (Box, Jenkins, & Reinsel, 1994):
(i) Identification of Model: The ARIMA model is applicable when the time series data are stationary. The first step is therefore to determine whether the time series data are stationary before proceeding any further. If a time series has a trend or seasonality component, it must be made stationary (e.g., by differencing) before forecasting with ARIMA.
(ii) Identification of PACF and ACF parameters: In order to use the ARIMA model, it is necessary to first identify the order of differencing d, the number of residual lag values (q) and the dependent lag value (p). The key tools for detecting q and p are the ACF (autocorrelation function) and PACF (partial autocorrelation function), whose plots display the ACF and PACF values against the lag. The partial autocorrelation coefficient measures the similarity between \({X}_{t}\) and \({X}_{t-k}\) while holding the effects of the intervening lags 1, 2, 3, ..., k-1 constant.
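A sketch of these plots with statsmodels, where the 24-lag horizon is an illustrative choice rather than the study's setting:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# alpha=0.05 draws the 95% confidence band used to judge significance.
fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series, lags=24, alpha=0.05, ax=axes[0])   # significant lags suggest q
plot_pacf(series, lags=24, alpha=0.05, ax=axes[1])  # significant lags suggest p
plt.tight_layout()
plt.show()
```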
(iii) Build the optimal ARIMA Model:
There can be different ARIMA models depending on the outcome of the stationarity check and the determination of the ACF and PACF; the autoregressive parameters are determined accordingly. To identify the best order of the model in this analysis, the auto_arima function has been used, which automatically gives the optimum order for the model. In Python, the auto_arima function finds the best order for the model's parameters by employing a fast maximum likelihood estimation approach and a stepwise search based on the minimum AIC (Akaike Information Criterion). The AIC is a fine-tuned technique for estimating the likelihood of a model to predict future values based on in-sample fit (Akaike, 1974). The best order of the model is the one that produces the lowest AIC among all candidate orders.
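A sketch of this order search, assuming the auto_arima implementation from the pmdarima package (the text does not name the package, and the seasonal settings here are illustrative):

```python
import pmdarima as pm

# Stepwise search over candidate (p, d, q) orders, minimizing the AIC.
model = pm.auto_arima(
    series,
    seasonal=False,               # set seasonal=True with m=12 for monthly data
    stepwise=True,                # the stepwise search described above
    information_criterion="aic",  # rank candidate orders by AIC
    trace=True,                   # print each candidate order and its AIC
)
print("Selected order (p, d, q):", model.order)
```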
(iv) Forecasting: After obtaining the optimal model, forecasting for the subsequent period is possible. Forecasting using this method is frequently more efficient than forecasting using other time series methods.
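With the fitted model above, the hold-out horizon can be forecast in one call; the 20-step horizon below is illustrative, mirroring the 2001-2020 validation window:

```python
# Forecast the next 20 periods with 95% prediction intervals.
forecast, conf_int = model.predict(n_periods=20, return_conf_int=True, alpha=0.05)
print(forecast)
```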
(v) Residual Diagnostics to check the Model: In a time-series analysis, every observation may be predicted using all prior observations, which are referred to as fitted values, and the residuals are what is left over after a model has been fitted. The linear regression hypothesis is tested using residual analysis, which determines whether the error follows a normal distribution. The standardized residual graph, normal Q-Q plot and histogram plus estimated density have been plotted in this study to check whether the residuals are white noise.
(a) Standardized residual graph: The standardized residuals are calculated by dividing the raw residuals by their overall standard deviation, which produces a consistent measure of prediction error. It is a metric that indicates how large the gap between observed and predicted values is. This plot clarifies whether the residuals are dispersed in a random fashion; the residuals can be taken to be independent if the sequence does not display patterns such as trend or periodicity.
(b) Normal Q-Q Plot: Normal Q-Q (quantile-quantile) plots are extremely useful for graphically analyzing and comparing two probability distributions by plotting their quantiles against one another. The plot is useful for verifying the assumption that the dependent variable is normally distributed. If it is not, it is necessary to explain how the assumption is broken and which data points are involved. If the points on the graph lie close to the 45-degree straight line, the usual distributional assumptions are met. The normal probability graph has been utilized in this investigation to see whether the residuals satisfy the normal distribution assumption.
(c) Histogram plus estimated density (kernel density estimation): The histogram, along with the estimated density, has been used to assess whether the residuals are normal, depending on the interval values employed to categorize the data. A density plot is a continuous, smoothed form of a histogram derived from the data. Kernel density estimation (KDE), the most common estimation technique, has been used in this work.
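These diagnostic plots (plus a residual correlogram) can be produced in a single call with statsmodels; a sketch, refitting the order selected by auto_arima above:

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA

# Refit the selected (p, d, q) order, then draw the standardized residuals,
# histogram plus estimated density (KDE), normal Q-Q plot and correlogram.
res = ARIMA(series, order=model.order).fit()
res.plot_diagnostics(figsize=(10, 8))
plt.show()
```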
Linear and Polynomial Regression
Regression analysis is a type of predictive modeling commonly used for forecasting, time series modeling and determining causal relationships between variables. In this work, a least squares method has been utilized to find the best-fit line for estimating rainfall with a linear regression model, which is expressed as follows:
$$y=a+bx \left(III\right)$$
where y represents estimated rainfall, a represents the intercept and b represents the slope, i.e., the coefficient of x.
In this paper, polynomial regression has also been applied to forecast rainfall. Rainfall is treated as the dependent variable y, which is modelled as an nth-degree polynomial in x, while time is treated as the independent variable x. The simple linear regression procedure works when the pattern of the rainfall trend is linear. However, if the data are non-linear, linear regression cannot produce a best-fit line and fails in such cases; hence polynomial regression has been applied in this study.
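Both fits can be sketched with NumPy, treating the time index as x; the polynomial degree of 2 is an illustrative choice, not the degree reported in this study:

```python
import numpy as np

t = np.arange(len(series), dtype=float)  # time as independent variable x
y = np.asarray(series, dtype=float)      # rainfall as dependent variable y

# Linear OLS fit y = a + b*x (Eq. III); polyfit returns [slope, intercept].
b, a = np.polyfit(t, y, deg=1)
linear_pred = a + b * t

# Polynomial least-squares fit (degree 2 here, for illustration).
poly = np.polynomial.Polynomial.fit(t, y, deg=2)
poly_pred = poly(t)
```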
Model Validation
Lastly, to validate the above models, the predicted values were compared with the actual values and the root mean squared error was calculated to find the best-fit model for forecasting rainfall.
RMSE (Root Mean Squared Error)
The RMSE is the square root of the Mean Squared Error (MSE) and measures the absolute fit of the prediction model to the data: the model's predicted values are compared with the observed data points to determine how accurate the model is. The RMSE thus represents the average magnitude of the prediction error, which is expressed as follows:
$$\text{RMSE}=\sqrt{\frac{\sum _{i=1}^{N}{\left({x}_{i}-{\widehat{x}}_{i}\right)}^{2}}{N}} \left(IV\right)$$
where RMSE is the Root Mean Square Error, N is the number of data points, i indexes the ith observation, \({x}_{i}\) is the actual rainfall and \({\widehat{x}}_{i}\) is the forecast rainfall.
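Eq. (IV) translates directly into code; a short sketch, where the validation slice shown in the comment is a placeholder for the study's 2001-2020 window:

```python
import numpy as np

def rmse(observed, forecast):
    # Root Mean Square Error, Eq. (IV).
    observed = np.asarray(observed, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.sqrt(np.mean((observed - forecast) ** 2))

# e.g. rmse(series["2001":"2020"], forecast) for the validation window
```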