A Comparative Analysis of Arima and other Statistical Techniques in Rainfall Forecasting: A Case Study in Kolkata (KMC), West Bengal

Rainfall forecasting in urban areas is a significant consideration for city planners due to its connection with urban water management. In this study, the ARIMA (auto-regressive integrated moving average) model, as well as several regression approaches such as simple linear and second to sixth-degree polynomial regression equations, have been used to forecast the annual rainfall based on 120 years of monthly and annual rainfall from 1901 to 2020 in Kolkata Municipal Corporation (KMC), West Bengal. This study compares the performance of ARIMA and other regression techniques in forecasting rainfall using the metrics of R-squared and root mean square error (RMSE). The ARIMA model has been implemented using machine learning techniques in the Python programming language, while additional regression equations have been computed and analyzed using Microsoft Excel 2019. In order to employ the ARIMA model, all assumptions were assessed, and the optimal model order was established using the import auto-Arima package from the pmdarima. arima library. The stepwise model. aic function yielded 0,1,1 as the most suitable order for the model. The findings indicate that, out of all the regression methods employed for rainfall prediction, the fifth-degree polynomial equation exhibits the lowest root mean square error (RMSE), establishing it as the most effective model for rainfall forecasting in this study.


Introduction
Rainfall is the primary supply of water for those whose entire existence is reliant on it.Rainfall prediction has become a significant concern in recent years, attracting the attention of government agencies, industries, risk assessment agencies, and the research community.Rainfall prediction models can aid in preserving people's lives and property while indirectly supporting the country's economy.2][3] Long-term rainfall data prediction in meteorology can aid decisionmaking processes carried out by organizations responsible for catastrophe square avoidance. 4ainfall prediction is a challenging endeavor due to the non-linear character of climate processes.In recent years, data-driven (empirical) approaches have surpassed knowledge-driven (physical) approaches in terms of popularity. 5The effective use of several data-driven models in hydrology has opened new dimensions for the usability of deep neural networks for time series analysis.Box and Jenkins proposed the autoregressive integrated moving average model (ARIMA), generally known as the Box-Jenkins model, for time-series forecasting. 68] ARIMA is an enlarged form of the ARMA model, and it is one of the most effective models with a long history of use. 9- 11Sequential modeling was employed to forecast monthly precipitation in India and demonstrate that a deep learning network can be used successfully for time series analysis in the realm of hydrology and related fields. 12A comparative study using machine learning techniques has been used to build models for rainfall prediction. 13A comparative study of various statistical and deep learning techniques were carried out to forecast long-term pollution trends in Kolkata and it was discovered that statistical methods such as auto-regressive (AR), seasonal auto-regressive integrated moving average (SARIMA), and Holt-Winters outperformed the deep learning methods. 14The rainfall intensity of Coonoor in the Nilgiri district of Tamilnadu was predicted using regression techniques and other statistical models.The regression techniques employed for prediction were support vector regression (SVR), Random Forest (RF), and Decision Tree (DT), demonstrating that Random Forest is the best regression strategy for rainfall prediction (RF). 15A comparison of ANFIS, ARIMA, and the suggested Fuzzy based Curve fitting for weather forecasting was conducted using SSE, R2, RMSE, and MAE, and it was discovered that the curve fitting based on fuzzy logic outperforms ANFIS and ARIMA. 16Statistical downscaling local polynomial regression was used to derive future rainfall estimates in the catchment of the Idukky reservoir in Kerala, India. 17The rainfall in the city of Bengaluru, India, was forecasted using seasonal Naive, triple exponential smoothing and seasonal ARIMA time series models where many scale dependent error predictions methods and inferential analysis were used to assess the accuracy of forecasts from these time series models and the results suggest that the seasonal autoregressive moving average model delivers more accurate results. 18A comparison of 4 different machine learning algorithms (K-Nearest Neighbor, Logistic Regression, Random Forest Classifier, and Support Vector Machine) in solar flare forecasting shows that Logistic Regression and Support Vector Machine algorithms perform exceptionally well in forecasting active region flaring potential. 191][22] Forecasting rainfall directly impacts the security and economic stability in any region, which is typically a complex phenomenon.The Indian cities like Kolkata, one of the country's largest cities, have more than 80 percent built up area. 23Due to huge coverage of the built-up area of Kolkata Municipal Corporation (KMC) the city faces water crisis during non-monsoon periods while it faces urban flood during the monsoon period frequently.Hence, the management of rainwater in a sustainable way should be a major focus of research in this area.Rainwater harvesting may serve as an alternative method to reduce urban floods and, at the same time, maintain the water supply to create favorable conditions to meet the water demand of the city.Therefore, in order to invest for rainwater harvesting or any other sustainable planning to meet the water demand of KMC, rainfall forecasting is of utmost importance.Though many works have been done on rainfall forecasting but there is a lack of studies in comparative analysis of various degree of polynomial curve and ARIMA to find the best fit model in the study area to better forecast the rainfall.Therefore, this study aims to identify the optimal model by evaluating the root mean square error (RMSE) and examining the relationship between the coefficient of determination (r 2 ) and RMSE in identifying the best fit model.The focus of this study is to determine the optimum model for forecasting rainfall using different techniques like the auto-regressive integrated moving average (ARIMA) and other statistical regression techniques.In this study, a comparative analysis of ARIMA and other statistical techniques like a different degree of a polynomials have been used in Rainfall Prediction in Kolkata (KMC), West Bengal.For this purpose, 120 years of monthly and annual rainfall data of IMD (Indian Meteorological Department) from 1901 to 2020 of Kolkata has been retrieved from the Indian water portal "Water Resource Information System" (WRIS) and tries to forecasting rainfall using ARIMA, and different degree of Polynomial regression for identification of the best model.

Materials and Methods Study Area
As per the Official website of the Kolkata Municipal Corporation, the current study area is extended between 22°27'28" North to 22°38'20" North, and 88°15'50" East to 88°28'45 " East (Fig. 1).The whole area under study is 187 km 2 , which has been attained using ArcMap 10.2 by vectorization the Kolkata Municipal Corporation (KMC) map in UTM projection and WGS84 datum.The area under study is bordered on the north and north-east by the district of North 24 Parganas, on the south by the district of South 24 Parganas, and on the west by the Hooghly River.The entire region is located in the Ganges Delta, which is a natural recurrence.According to the Indian Meteorological Department (IMD), the research region exhibits a tropical wet and dry climate.The main seasons are summer, wet autumn, brief winter and summer which is characterized by excessive humidity.The southwest monsoon continues to be the dominant source of precipitation in the study region and the wettest months of the year are June, July, August and September.The Kolkata Municipal Corporation's land use and land cover map have been generated using ERDAS IMAGINE 2014 using the Sentinel 2A (2019) image.
The overall built-up area (Fig. 2) in KMC has been found to account for 82.23 percent of the total area in 2019.The huge built-up area in KMC combined with a diminishing trend in ground water level raises concerns about the area's long-term water availability.As a result, this study becomes critical for the city's long-term viability.

Database and Methodology
To achieve the aforementioned goal, the model was built using the monthly and annual rainfall data, which has been retrieved from a web-based spatial data portal from India's water resource information system (https://indiawris.gov.in/wris/#/rainfall)from 1901 to 2020.

Rainfall Forecasting
For rainfall forecasting monthly rainfall data have been integrated to produce total rainfall on yearly basis.In this paper, several statistical approaches have been used to study rainfall forecasting, including Autoregressive Integrated Moving Average Method (ARIMA) using python linear (Ordinary Least Square Method) and polynomial regression in MS Excel.
Finally, to find the best fit model the Root Mean Square Error (RMSE) has been calculated using observed and forecasted rainfall data from 2001 to 2020.The methodological framework of the study has been shown in Fig. 3 The process of forming assumptions about the future values of investigated variables is known as forecasting. 6The serial correlation effect has been checked since it plays a vital role in assessing and ultimately reducing the uncertainty of rainfall forecasting before it was examined in detail.

Augmented Dickey Fuller Test (ADF Test)
Augmented Dickey Fuller Test is one of the most widely used statistical test when it comes to analyzing the stationary of a series.For the prediction of rainfall with ARIMA it is necessary to check the time series data whether the data is stationary or not.In this study to check the data whether it is stationary or not ADF test have been applied using python.For this test, adfuller function have been imported from 'statsmodels.tsa.stattools'library in python.

Serial Correlation Effect
The Serial correlation or auto correlation is used to find a link between the variable's current value and any prior values that must be accessed.It's an authentic way to find hidden trends and patterns in time series data that would otherwise go unnoticed.While using ARIMA to forecast rainfall, it has been assumed that the observed time-series data is serially independent.However, significant serial correlation coefficients in time series rainfall data may arise, necessitating the testing of serial correlation effect while reviewing a series of historical data.
The Autocorrelation and partial autocorrelation functions in Python have been used to evaluate the significant autocorrelation coefficients with varying lagged values at a confidence level of 0.05.Lag-1 autocorrelation is often used to examine the influence of serial correlation in time series data. 24he simple correlation coefficient between the first observation N-1, Xt, t =1,2,3,...,N-1 and the subsequent observations, Xt+1, t = 2,3,..., N-1 is known as the lag-1 autocorrelation coefficient. 25he formula for calculating the relationship between Xt and Xt+1 is given below- where, X ̅ =∑ N t=1 is the total mean.
The coefficient r 1 is investigated to test its significance.
The probability limitations of a two-tailed independent series' correlogram are illustrated below. 25. (2)   where, N denotes the sample size and k denotes the lag.
The data are presumed to be serially dependent if r 1 is outside the provided confidence interval and serially independent if r 1 falls within the interval.of the sample data were proposed as the main tools for determining the ARIMA model's order. 6The model is expressed as ARIMA (p, d, q), where p represents the order of the auto regressive process, d represents the order of the stationary data and q represents the order of the moving average process. 26The ARIMA model is implemented in the following steps. 27

Identification of Model
When time series data are stationary, the ARIMA model is useful.The first step is to determine whether the time series data is stationary or not before proceeding any further.Before forecasting with ARIMA, it is necessary to make a time series stationary if it has a trend or seasonality component.

Identification of PACF and ACF parameters
In order to use the ARIMA model, it is necessary first to identify the value of d (stationary data), the number of residual lag values (q) and the dependent lag value (p).The key tools for detecting q and p, as well as correlation are ACF (autocorrelation function) and PACF (partial auto correlation function) which displays the plot of ACF and PACF values for lag.
The partial autocorrelation coefficient measures the similarity between Xt and Xt-k whereas the lab effect times 1, 2, 3,..., k-1 are taken as constant.

Build the Optimal ARIMA Model
There can be different ARIMA models based on the outcomes of the stationary detection and the determination of ACF and PACF.Hence, the autoregressive parameters are determined.To identify the best order of the model in this analysis the auto_ ARIMA function has been used which automatically gives the optimum order for this model.In Python, the auto_ARIMA function finds the best order for the model's parameters by employing a quick maximum likelihood estimation approach and a stepwise search based on minimum AIC (Akaike Information Criteria).The AIC is a fine-tuned technique for estimating the likelihood of a model to predict future values based on in-sample fit. 28The best order of the model is that which produces the lowest AIC among all the other orders.

Forecasting
After obtaining the optimal model, forecasting for the subsequent period is possible.Forecasting using this method is frequently more efficient than forecasting using other time series methods.

Residual Diagnostics to Check the Model
In a time-series analysis every observation may be predicted using all prior observations, which are referred to as fitted values and the residuals are that which is deviated over after a model has been fitted.
The linear regression hypothesis is tested using residual analysis, which determines if the error follows a normal distribution.The standardized residual graph, normal Q-Q plot and Histogram plus estimate density have been plotted in this study to check the white noise of the residuals.

Standardized Residual Graph
The standardized residuals are calculated by dividing the raw residuals by the overall standard deviation of the raw residuals which produces a consistent measure of prediction error.It is a metric that indicates how strong the gap between observed and predicted values is.This plot clarifies that the residuals are dispersed in a random fashion and the residuals can be demonstrated to be independent if the sequence does not display patterns like trend or periodicity.

Histogram Plus Estimated Density (Kernel Density Estimators)
The histogram along with estimated density have been used to assess whether the residuals are normal or not, depending on the interval values employed to categorise the data.A density plot is a continuous, smoothed form of a histogram derived from data.Kernel density estimation (KDE), the most frequent technique of estimate, have been used in this work.

Linear and Polynomial Regression
Regression analysis is a type of predictive modeling that is commonly used for forecasting, time series modeling and determining causal relationships between variables.In this work, a "least squares" method has been utilized to find the best-fit line to estimate rainfall using a linear regression model, which is expressed as follows where, y represents estimated rainfall, 'a' represents intercept and 'b' represent slope which is coefficient of x.
In this paper polynomial regression have been applied to forecast the rainfall.In this study, the rainfall is treated as a dependent variable y, which is modelled as an nth degree polynomial in x, while time is treated as an independent variable x.When the pattern of rainfall trend is linear, the simple linear regression procedure works.However, if the data is non-linear, linear regression will not be able to create a best-fit line and will fail in such cases, hence polynomial regression has been applied in this study.

Model Validation
Lastly, to validate the above models compare the predicted values to the actuals and calculate the root mean squared error to find out the best fit model for forecasting the rainfall.The validation process involves partitioning the 120-years of annual rainfall data into training and testing datasets.
The training dataset consists of rainfall data from 1901 to 2000, while the testing dataset consists of the remaining 20 years of rainfall data from 2001 to 2020.Initially, using a dataset of rainfall spanning 100 years from 1901 to 2000, a forecast has been made for the next 20-years of annual rainfall from 2001 to 2020.Subsequently, to validate the model, the observed rainfall data from 2001 to 2020 has been compared to the forecasted rainfall data for the same time period.Then, the root mean square error has been calculated to determine the most accurate model for rainfall prediction.

RMSE (Root Mean Squared Error)
The square root of Mean Squared Error (MSE) is RMSE, which measures the absolute fit of the prediction model to the data.As a result, the model's projected values are compared to the observed data points to determine how accurate the model is.So, the RMSE is the average prediction error, which is expressed as the follows ...( 4) where, RMSE is Root-Mean Square Error, N is the number of data points, i is the i th variable, x i is the actual rainfall and x̂i is the forecast rainfall.

Results and Discussion
In this paper, ARIMA and other statistical techniques have been used to forecast and determine the optimal model in rainfall prediction over Kolkata on the basis of 120 years of annual rainfall.Before applying the ARIMA model for prediction, the basic assumptions were tested and then the model has been compared to various regression techniques and discussed.

Augmented Dickey Fuller Test (ADF Test)
The The table displays the code and tools utilized to obtain the p-value.Here, the calculated 'p' value is 0.00091 (Table 1), which is less than the significance level 0.05.So, it is inferred that the data is stationary as it rejects the null hypothesis and accept the alternative hypothesis.

Serial Correlation Analysis
Serial correlation also known as autocorrelation is the correlation between two observations at different points in a time series data.When these auto correlations exist in data series it means that previous values have an impact on the current value.In the domain of time series analysis, it is prerequisite to study the autocorrelation and partial autocorrelation before modelling the series in order to get the better understanding about the datasets.These are mostly carried out to ensure that the presented data is a function of time and to provide evidence of that fact.
Here, the annual rainfall over 120 years has a stationary pattern which means that there is no seasonal pattern and hence there is no need of differencing the data.The ACF and PACF which have been plotted to comprehend the auto correlation effect better as demonstrated below (Figure 4): The above ACF and PACF have been calculated with import statsmodels and using function acf() and pacf() correspondingly and then plotted using plot_acf() and plot_pacf() function.The autocorrelation plot for 120 years of yearly rainfall in Kolkata Municipal Corporation is shown above (Fig. 4).The horizontal axis shows the lag numbers between the elements of the datasets and the vertical axis shows the value of the autocorrelated function which can range from -1 to 1.A vertical line corresponding to each lag is called a spike on the graph, which shows the value of autocorrelated function for that particular lag.The autocorrelation with lag zero is always one because it reflects the correlation between each term and itself.Statistical significance is assigned to each spike that rises above or falls below the significance zone.This indicates that the spike has a value that differs significantly from zero.When a spike is sufficiently away from zero it indicates presence of autocorrelation and when it is close to zero, it indicates absence of autocorrelation.So, from the above plot of autocorrelation of annual rainfall data over 120 years it is clears that all the spikes are closed to zero up to lag 21 except lag 1 which reflects that the data is not autocorrelated significantly.On the other hand, the PACF, which helps to determine the model terms shows the correlation between a sequence and itself over time.The partial autocorrelations for lags 1, 2, 3, 4 and 5 are statistically significant, and the other lags are very less significant as seen on the graph (Fig. 4).As a result, this PACF recommends fitting a fifth-order or sixth-order autoregressive model.

Rainfall Forecasting with ARIMA Model
In this study, the ARIMA model has been implemented in Python language and used to forecast rainfall.
To achieve the best results, the basic parameters of this model p, d, and q has been determined, which are dependent on the nature of the time series data.To determine the best order for this model, import the auto_arima package from the pmdarima.arimalibrary and use the stepwise_model.aicfunction, which finds the optimum order for the model automatically as shown in the following (table 2)  The above results (Table 2) shows that the optimum order for the best model is 0, 1, 1 for the parameters of p, d and q respectively, which has the lowest AIC that is 1757.65 among the other orders.It has been then used as the order of the model with the function of stats ARIMA (df, order = (0,1,1) to get best results of the model.After determining the best model to forecast the rainfall it is required to validate the result.
Therefore, the data has been divided into training data from 1901-2000 and test data from 2001-2020.After that, it produces the predicted annual rainfall for the above mentioned 20 years periods which is showed in the following table (Table 3) and then plotted the actual and forecasted annual rainfall which is shown in the following (Fig. 5): The ARIMA model may be used for modeling and forecasting rainfall but to enhance the accuracy of new model and forecasts it must constantly update the previous data with new data and need to validate the results with observed data.This information about predicted rainfall can be used for urban planning purposes, such as flood management and rainwater conservation, in the research region.

Residual Diagnostics to check the Model
In this study, the standardized residuals, typical Q-Q plot and histogram plus estimated density has been plotted to assess the model residual diagnostics, which are shown in the following (Fig. 6) The above normalized residual plot reflects that the residuals are independent since they are scattered in a random form, which is a prerequisite assumption   4)

Identification of Best Model
In this study, the coefficient of determination (R 2 ) and Root Mean Square Error (RMSE) have been computed and considered to determine the optimum model for forecasting rainfall.The minimum RMSE is found with the fifth-degree polynomial equation, as shown in Table 5, among all the regression models.It is important to note that though the R 2 (coefficient of determination) has increased with increasing degree of curvature from the straight line to sixth degree polynomial equation with higher r 2 being 0.2019 at sixth degree polynomial equation but the lower RMSE is found at the fifth-degree polynomial equation.
The value of r2 for the sixth-degree polynomial is maximum (Table 5) to express the degree of explained variance but the value of RMSE is not the least for the sixth-degree polynomial, which signifies that the sixth-degree polynomial regression is over fitting the 120 years of dataset.The RMSE for fifth degree polynomial is the least (Table 5) among all the polynomial equations, which entails that the fifthdegree curve is actually the best fit curve for forecasting the data, as it is neither an under fit nor an over-fit curve concerning its datasets.R-squared indicates how well a regression model explains the observed data or how well data fits the regression model.However, a big value of r-square does not always indicate a good regression model, which is proved in this study, as the quality of the statistical measure is dependent on a number of factors, including the nature of the variables used in the model, the units of measure used for the variables, and the data transformation used.The lowest RMSE has been found with fifth degree polynomial regression techniques that is 364.83 with compare to the other techniques or models.Although the ARIMA model have been used in this study and met all the requirements for forecasting rainfall but did not produce the best results.The r-squared value and RMSE has been shown in the following (Table 5).An analysis of the long-term rainfall trend and its variability is a prerequisite to determining the feasibility of rainwater harvesting.In this regard, identifying the best model to forecast rainfall is necessary.The result obtained through the analysis of this study may help the planners for sustainable management of the water resource of the study area.
The forecasting must provide significant confidence to the administrators to build policies in KMC regarding rainwater harvesting to reduce urban floods and manage the city's water resources sustainably.
Even the same study can be applied to other cities where the best forecasting model can be used for water resource management.

Conclusions
The primary objective of the present study was to assess and contrast the efficacy of the ARIMA model in relation to other regression methodologies, encompassing linear regression and polynomial regressions ranging from second to sixth degree, for rainfall forecasting in Kolkata.The research has been performed utilizing a dataset consisting of 120 years of long-term annual and monthly rainfall data, covering the time period from 1901 to 2020.
In order to assess the efficacy of the models, the root mean square error (RMSE) was computed utilizing both the training and test datasets, encompassing the timeframe spanning from 2001 to 2020.The results suggest that out of all the regression methods examined for forecasting rainfall, the fifth-degree polynomial equation had the lowest root mean square error (RMSE), indicating its superior forecasting performance.In overall, this work offers significant insights into the relative effectiveness of various forecasting methods and highlights the potential of utilizing fifth-degree polynomial regression to enhance the precision of rainfall forecasting in Kolkata.It is crucial to note that while the ARIMA model met all of the assumptions in this study to forecast rainfall, the fifth-degree polynomial regression produced the best results when compared to ARIMA.On the other hand, while the r-squared value in the sixth-degree polynomial equation is higher than in the fifth-degree polynomial equation, the results showed that the RMSE in the fifth-degree polynomial regression is lower, indicating that this is the best model or technique for forecasting rainfall in this study.As a result, a consistently high r-squared value does not imply that this is the optimal prediction model.On the other hand, even if the ARIMA model fits all of its assumptions to predict rainfall, it does not always mean that it will always be the best model for forecasting rainfall.To recapitulate, rainfall forecasting is critical in Kolkata and since it is a city of urban flood during periods of heavy rainfall therefore it is essential to identify the appropriate model to forecast rainfall for effective water management.

Fig. 1 :
Fig. 1: Location map of the study area; (a): India, from which it shows the West Bengal; (b): West Bengal from which it shows Kolkata Municipal Corporation and (c): Kolkata Municipal Corporation, prepared by authors.

Fig. 2 :
Fig. 2: Land use Land cover map (2019) of the study area.Prepared by authors based on Sentinel 2A image, which were downloaded from USGS Earth Explorer; Date of acquisition: 31-03-2019.

Fig. 3 :
Fig. 3: Methodological flow chart of the study

Fig. 4 :
Fig. 4: Autocorrelation (ACF) and Partial Autocorrelation Function over 120 years of annual rainfall in Kolkata Municipal Corporation (KMC) plotted using python.

Fig. 5 :
Fig. 5: Actual and forecasted annual rainfall plotted using python based on 120 years of annual rainfall (mm) data for the ARIMA model to forecast the values.Here, the x-axis represents the forecasted value made by the model and the y-axis represents the accuracy of the prediction with standardized residuals.The larger the distance from the zero line reflects the worse prediction while the closer to zero reflects the better prediction.Except for two values for 1962 and 2012, the plot in this model indicates reasonable prediction because all of the values are between zero and plus or minus two.The normal Q-Q plot indicates that all of the values with the exception of the first one or two and the last one is just around the 45-degree line, confirming the model's assumption that the data is normally distributed across the years 1901 to 2020.The histogram plot along with estimated density reflects that the residuals are normally distributed which confirms the assumption of the model to predict the rainfall in this study.The normal density plot is shown on the green line, and the Kernel density estimation (KDE) plot is shown on the orange line, where the probability density function is represented on the y axis.

Fig. 7 .
Fig. 7. Linear and second to sixth degree of Polynomial Regression line plotted in excel quantile) plots are extremely useful for graphically analyzing and comparing two probability distributions by plotting their quantiles against one another.It is useful to verify the assumption that the dependent variable is normally distributed or not.If it is not normally distributed then it is required to be explain how the assumption is broken and what data points are involved.If the points on the graph are nearer to 45-degree straight line, the usual assumptions about allocation are met.The normal odds graph has been utilized in this investigation to see if the residuals satisfies the normal distribution assumption or not.