To achieve the aforementioned goal, the model was built using monthly and annual rainfall data for 1901 to 2020, retrieved from the web-based spatial data portal of India's Water Resources Information System (https://indiawris.gov.in/wris/#/rainfall). For rainfall forecasting, the monthly rainfall data have been aggregated to produce total rainfall on a periodic and yearly basis. In this paper, several statistical approaches have been used for rainfall forecasting, including the Autoregressive Integrated Moving Average (ARIMA) method in Python (version 3.10), and linear (ordinary least squares) and polynomial regression in MS Excel. Finally, to find the best-fit model, the Root Mean Square Error (RMSE) has been calculated using observed and forecast rainfall data from 2001 to 2020. The methodological framework of the study is as follows:
Rainfall Forecasting
The process of forming assumptions about the future values of the variables under investigation is known as forecasting (Box & Jenkins, 1976). Before the rainfall series was examined in detail, the serial correlation effect was checked, since it plays a vital role in assessing, and ultimately reducing, the uncertainty of rainfall forecasting.
Augmented Dickey Fuller Test (ADF Test)
The Augmented Dickey-Fuller test is one of the most widely used statistical tests for analyzing the stationarity of a series. Before predicting rainfall with ARIMA, it is necessary to check whether the time series data are stationary. In this study, the ADF test has been applied in Python to check for stationarity; for this test, the adfuller function has been imported from the statsmodels.tsa.stattools module.
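A minimal sketch of this check, assuming the rainfall series has been loaded into a pandas Series (the file name, date column and rainfall column are placeholders, not the study's actual data layout):

```python
# Hypothetical sketch of the ADF stationarity check.
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# 'rainfall.csv' with 'date' and 'rainfall' columns is an assumed layout.
series = pd.read_csv("rainfall.csv", parse_dates=["date"], index_col="date")["rainfall"]

# adfuller returns the test statistic, p-value, lags used, number of
# observations, the critical values and the maximized information criterion.
adf_stat, p_value, used_lags, n_obs, crit_values, icbest = adfuller(series)
print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.3f}")

# H0: the series has a unit root (i.e., it is non-stationary).
if p_value < 0.05:
    print("Reject H0: the series can be treated as stationary.")
else:
    print("Fail to reject H0: difference the series before fitting ARIMA.")
```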
Serial Correlation Effect
Serial correlation, or autocorrelation, is used to find a link between a variable's current value and whatever prior values can be accessed. It is a reliable way to find hidden trends and patterns in time series data that would otherwise go unnoticed. When using ARIMA to forecast rainfall, it is assumed that the observed time-series data are serially independent. However, significant serial correlation coefficients may arise in time series rainfall data, necessitating a test of the serial correlation effect when reviewing a series of historical data.
The autocorrelation and partial autocorrelation functions in Python have been used to evaluate the significant autocorrelation coefficients at varying lags at the 0.05 significance level. Lag-1 autocorrelation is often used to examine the influence of serial correlation in time series data (Anderson, 1942). The lag-1 autocorrelation coefficient is the simple correlation coefficient between the first N-1 observations, \({X}_{t}\), t = 1, 2, 3, ..., N-1, and the subsequent observations, \({X}_{t+1}\), t = 1, 2, 3, ..., N-1 (Sharma & Singh, 2017). The formula for calculating the relationship between \({X}_{t}\) and \({X}_{t+1}\) is given below:
$${r}_{1}=\frac{\sum _{t=1}^{N-1}\left({X}_{t}-\bar{X}\right)\left({X}_{t+1}-\bar{X}\right)}{\sum _{t=1}^{N}{\left({X}_{t}-\bar{X}\right)}^{2}} \left(I\right)$$
where \(\bar{X}=\frac{1}{N}\sum _{t=1}^{N}{X}_{t}\) is the overall mean.
The coefficient r1 is then investigated to test its significance. The 95% probability limits of the correlogram of an independent series (two-tailed) are given below (Sharma & Singh, 2017):
$${r}_{1}\left(95\text{\%}\right)=\frac{-1\pm 1.96\sqrt{N-k-1}}{N-k} \left(II\right)$$
where, N denotes the sample size and k denotes the lag.
The data are presumed to be serially dependent if r1 is outside the provided confidence interval and serially independent if r1 falls within the interval.
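Equations (I) and (II) translate directly into code; a minimal sketch, where `series` is the rainfall series loaded earlier and the function names are illustrative:

```python
import numpy as np

def lag1_autocorrelation(x):
    # Lag-1 autocorrelation coefficient r1, Eq. (I).
    x = np.asarray(x, dtype=float)
    x_mean = x.mean()
    num = np.sum((x[:-1] - x_mean) * (x[1:] - x_mean))
    den = np.sum((x - x_mean) ** 2)
    return num / den

def anderson_limits(n, k=1):
    # 95% probability limits of an independent series, Eq. (II).
    half_width = 1.96 * np.sqrt(n - k - 1)
    return (-1 - half_width) / (n - k), (-1 + half_width) / (n - k)

r1 = lag1_autocorrelation(series)
lower, upper = anderson_limits(len(series))
print(f"r1 = {r1:.3f}, 95% limits = ({lower:.3f}, {upper:.3f})")
# r1 outside the limits -> serially dependent; inside -> serially independent.
```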
The Auto-Regressive Integrated Moving Average (ARIMA)
The Autoregressive Integrated Moving Average (ARIMA), or Box-Jenkins, method has been extensively used as a forecasting technique. The autocorrelation function (ACF) and partial autocorrelation function (PACF) of the sample data were proposed as the main tools for determining the ARIMA model's order (Box & Jenkins, 1976). The model is expressed as ARIMA (p, d, q), where p represents the order of the autoregressive process, d represents the order of differencing needed to make the data stationary and q represents the order of the moving average process (Zhang, 2003). The ARIMA model is implemented in the following steps (Box, Jenkins, & Reinsel, 1994):
(i) Identification of Model: The ARIMA model is applicable when the time series data are stationary. The first step is therefore to determine whether the time series data are stationary before proceeding any further. If a time series has a trend or seasonality component, it must be made stationary (e.g., by differencing) before forecasting with ARIMA.
(ii) Identification of PACF and ACF parameters: In order to use the ARIMA model, it is necessary to first identify the order of differencing d, the number of residual lag values (q) and the dependent lag value (p). The key tools for detecting q and p are the ACF (autocorrelation function) and PACF (partial autocorrelation function), whose plots display the ACF and PACF values against the lag. The partial autocorrelation coefficient measures the similarity between \({X}_{t}\) and \({X}_{t-k}\) while holding the effects of the intervening lags 1, 2, 3, ..., k-1 constant.
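A sketch of these plots with statsmodels, where the 24-lag horizon is an illustrative choice rather than the study's setting:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# alpha=0.05 draws the 95% confidence band used to judge significance.
fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series, lags=24, alpha=0.05, ax=axes[0])   # significant lags suggest q
plot_pacf(series, lags=24, alpha=0.05, ax=axes[1])  # significant lags suggest p
plt.tight_layout()
plt.show()
```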
(iii) Build the optimal ARIMA Model:
There can be different ARIMA models depending on the outcome of the stationarity check and the determination of the ACF and PACF; the autoregressive parameters are determined accordingly. To identify the best order of the model in this analysis, the auto_arima function has been used, which automatically gives the optimum order for the model. In Python, the auto_arima function finds the best order for the model's parameters by employing a fast maximum likelihood estimation approach and a stepwise search based on the minimum AIC (Akaike Information Criterion). The AIC is a fine-tuned technique for estimating the likelihood of a model to predict future values based on in-sample fit (Akaike, 1974). The best order of the model is the one that produces the lowest AIC among all candidate orders.
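A sketch of this order search, assuming the auto_arima implementation from the pmdarima package (the text does not name the package, and the seasonal settings here are illustrative):

```python
import pmdarima as pm

# Stepwise search over candidate (p, d, q) orders, minimizing the AIC.
model = pm.auto_arima(
    series,
    seasonal=False,               # set seasonal=True with m=12 for monthly data
    stepwise=True,                # the stepwise search described above
    information_criterion="aic",  # rank candidate orders by AIC
    trace=True,                   # print each candidate order and its AIC
)
print("Selected order (p, d, q):", model.order)
```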
(iv) Forecasting: After obtaining the optimal model, forecasting for the subsequent period is possible. Forecasting using this method is frequently more efficient than forecasting using other time series methods.
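With the fitted model above, the hold-out horizon can be forecast in one call; the 20-step horizon below is illustrative, mirroring the 2001-2020 validation window:

```python
# Forecast the next 20 periods with 95% prediction intervals.
forecast, conf_int = model.predict(n_periods=20, return_conf_int=True, alpha=0.05)
print(forecast)
```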
(v) Residual Diagnostics to check the Model: In a time-series analysis, every observation may be predicted using all prior observations, which are referred to as fitted values, and the residuals are what is left over after a model has been fitted. The linear regression hypothesis is tested using residual analysis, which determines whether the error follows a normal distribution. The standardized residual graph, normal Q-Q plot and histogram plus estimated density have been plotted in this study to check whether the residuals are white noise.
(a) Standardized residual graph: The standardized residuals are calculated by dividing the raw residuals by their overall standard deviation, which produces a consistent measure of prediction error. It is a metric that indicates how large the gap between observed and predicted values is. This plot clarifies whether the residuals are dispersed in a random fashion; the residuals can be taken to be independent if the sequence does not display patterns such as trend or periodicity.
(b) Normal Q-Q Plot: Normal Q-Q (quantile-quantile) plots are extremely useful for graphically analyzing and comparing two probability distributions by plotting their quantiles against one another. The plot is useful for verifying the assumption that the dependent variable is normally distributed. If it is not, it is necessary to explain how the assumption is broken and which data points are involved. If the points on the graph lie close to the 45-degree straight line, the usual distributional assumptions are met. The normal probability graph has been utilized in this investigation to see whether the residuals satisfy the normal distribution assumption.
(c) Histogram plus estimated density (kernel density estimation): The histogram, along with the estimated density, has been used to assess whether the residuals are normal, depending on the interval values employed to categorize the data. A density plot is a continuous, smoothed form of a histogram derived from the data. Kernel density estimation (KDE), the most common estimation technique, has been used in this work.
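These diagnostic plots (plus a residual correlogram) can be produced in a single call with statsmodels; a sketch, refitting the order selected by auto_arima above:

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA

# Refit the selected (p, d, q) order, then draw the standardized residuals,
# histogram plus estimated density (KDE), normal Q-Q plot and correlogram.
res = ARIMA(series, order=model.order).fit()
res.plot_diagnostics(figsize=(10, 8))
plt.show()
```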
Linear and Polynomial Regression
Regression analysis is a type of predictive modeling commonly used for forecasting, time series modeling and determining causal relationships between variables. In this work, a least squares method has been utilized to find the best-fit line for estimating rainfall with a linear regression model, which is expressed as follows:
$$y=a+bx \left(III\right)$$
where y represents estimated rainfall, a represents the intercept and b represents the slope, i.e., the coefficient of x.
In this paper, polynomial regression has also been applied to forecast rainfall. Rainfall is treated as the dependent variable y, which is modelled as an nth-degree polynomial in x, while time is treated as the independent variable x. The simple linear regression procedure works when the pattern of the rainfall trend is linear. However, if the data are non-linear, linear regression cannot produce a best-fit line and fails in such cases; hence polynomial regression has been applied in this study.
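Both fits can be sketched with NumPy, treating the time index as x; the polynomial degree of 2 is an illustrative choice, not the degree reported in this study:

```python
import numpy as np

t = np.arange(len(series), dtype=float)  # time as independent variable x
y = np.asarray(series, dtype=float)      # rainfall as dependent variable y

# Linear OLS fit y = a + b*x (Eq. III); polyfit returns [slope, intercept].
b, a = np.polyfit(t, y, deg=1)
linear_pred = a + b * t

# Polynomial least-squares fit (degree 2 here, for illustration).
poly = np.polynomial.Polynomial.fit(t, y, deg=2)
poly_pred = poly(t)
```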
Model Validation
Lastly, to validate the above models, the predicted values were compared with the actual values and the root mean squared error was calculated to find the best-fit model for forecasting rainfall.
RMSE (Root Mean Squared Error)
The RMSE is the square root of the Mean Squared Error (MSE) and measures the absolute fit of the prediction model to the data: the model's predicted values are compared with the observed data points to determine how accurate the model is. The RMSE thus represents the average magnitude of the prediction error, which is expressed as follows:
$$\text{RMSE}=\sqrt{\frac{\sum _{i=1}^{N}{\left({x}_{i}-{\widehat{x}}_{i}\right)}^{2}}{N}} \left(IV\right)$$
where RMSE is the Root Mean Square Error, N is the number of data points, i indexes the ith observation, \({x}_{i}\) is the actual rainfall and \({\widehat{x}}_{i}\) is the forecast rainfall.
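Eq. (IV) translates directly into code; a short sketch, where the validation slice shown in the comment is a placeholder for the study's 2001-2020 window:

```python
import numpy as np

def rmse(observed, forecast):
    # Root Mean Square Error, Eq. (IV).
    observed = np.asarray(observed, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.sqrt(np.mean((observed - forecast) ** 2))

# e.g. rmse(series["2001":"2020"], forecast) for the validation window
```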