Outbreak Prediction of COVID-19 for Dense and Populated Countries Using Machine Learning

Khakharia, Aman; Shah, Vruddhi; Jain, Sankalp; Shah, Jash; Tiwari, Amanshu; Daphal, Prathamesh; Warang, Mahesh; Mehendale, Ninad

doi:10.1007/s40745-020-00314-9

Outbreak Prediction of COVID-19 for Dense and Populated Countries Using Machine Learning

Published: 16 October 2020

Volume 8, pages 1–19, (2021)
Cite this article

Download PDF

Annals of Data Science Aims and scope Submit manuscript

Outbreak Prediction of COVID-19 for Dense and Populated Countries Using Machine Learning

Download PDF

Aman Khakharia¹,
Vruddhi Shah¹,
Sankalp Jain¹,
Jash Shah¹,
Amanshu Tiwari¹,
Prathamesh Daphal¹,
Mahesh Warang¹ &
…
Ninad Mehendale¹

8886 Accesses
41 Citations
Explore all metrics

Abstract

The Coronavirus Disease-2019 (COVID-19) pandemic persists to have a mortifying impact on the health and well-being of the global population. A continued rise in the number of patients testing positive for COVID-19 has created a lot of stress on governing bodies across the globe and they are finding it difficult to tackle the situation. We have developed an outbreak prediction system for COVID-19 for the top 10 highly and densely populated countries. The proposed prediction models forecast the count of new cases likely to arise for successive 5 days using 9 different machine learning algorithms. A set of models for predicting the rise in new cases, having an average accuracy of 87.9% ± 3.9% was developed for 10 high population and high density countries. The highest accuracy of 99.93% was achieved for Ethiopia using Auto-Regressive Moving Average (ARMA) averaged over the next 5 days. The proposed prediction models used by us can help stakeholders to be prepared in advance for any sudden rise in outbreak to ensure optimal management of available resources.

A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of COVID-19

Article 06 July 2020

Youssoufa Mohamadou, Aminou Halidou & Pascalin Tiam Kapen

Coronavirus disease (COVID-19) cases analysis using machine-learning applications

Article 21 May 2021

Ameer Sardar Kwekha-Rashid, Heamn N. Abduljabbar & Bilal Alhayani

Supervised Machine Learning Models for Prediction of COVID-19 Infection using Epidemiology Dataset

Article 27 November 2020

L. J. Muhammad, Ebrahem A. Algehyne, … I. A. Mohammed

1 Introduction

The SARS-CoV-2 coronavirus disease (COVID-19) originated in Wuhan, China sometime during December 2019. Within a month, more than ten thousand people were infected and hundreds died [1]. The initial outbreak caused several deaths, as the medical systems were not capable of handling many seriously ill patients. Till July 23, 2020 there were 631,680 deaths [2] reported across the world due to this pandemic. In a rapidly evolving pandemic, improper analysis and predictions of the number of patients results in an inefficient distribution of medical resources. Limited medical facilities and mismanagement of resource allocation can lead to additional severe cases and a decline in recovery rates. To cope with this situation, predicting the new cases which will arise in the future is very important. This can ensure optimal allocation of medical resources in the affected regions.

Data science in the predictive domain is an emerging field. In this study we have incorporated the principles of data science [3] for the prediction of COVID-19 progression. The outbreak of COVID-19 is a significant challenge for any government, with regard to the capacity and management of public health systems to face the catastrophic emergency [4]. The prediction model can help hospitals and healthcare management to properly allocate resources, thereby reducing the pressure and allowing the situation to be handled with relative ease.

We developed and tested 9 different predictive algorithms for 10 countries. It was noticed that the pattern of growth in the number of cases varied from country to country. The basic approach for the predictions was to train the models based on the dataset provided, but these models were not sufficiently accurate, as they were trained on only one class of dataset. As a result, the models were unable to accurately predict the number of new cases and, consequently, the existing techniques failed to utilize the resources in an optimized way [5]. Insufficient training data is also one of the reasons for the models to have low accuracy.

We tried 9 different standard machine learning (ML) algorithms for predicting the number of patients for the next 5 days. After getting a decent accuracy of 85%, we implemented these algorithms on datasets of different countries. We selected 10 countries with the highest population and the highest density for our work. By using the data of these countries, we trained standard prediction models using multiple ML algorithms and obtained different accuracy for each of the models for different countries.The different models gave high accuracy for different countries. However, there were variations in accuracy because of the different trends of change in COVID-19 patients for different countries.

The system flow diagram is shown in Fig. 1. The 9 different machine learning (ML) algorithms used were Auto-Regressive Moving Average (ARMA), Auto-Regressive Integrated Moving Average (ARIMA), Support Vector Regressor (SVR), Linear Regressor polynomial (LRP), Bayesian Ridge Regression (BRR), Linear Regression (LR), Random Forest Regressor (RFR), Holt-Winter Exponential Smoothing (HW), and Extreme Gradient Boost Regressor (XGB).

2 Literature Review

Multiple research works have been carried out to predict the outbreak of COVID-19. Vomlel et al. worked on the dataset of patients from STEMI and different classifiers used for predictions were, namely, Logistic Regression, LogitBoost, Decision Tree, NBC, Neural Networks, and the two versions of Bayesian Network Classifiers [6]. Kumar et al. [7] used the ARIMA model for predicting the outbreak in the top 15 European countries. Tuli et al. [8] proposed an ML model that can run continuously on Cloud Data Centers (CDCs) for precise prediction of spread and proactive development of strategic response by the government and citizens. Robust Weibull models fitted well on their dataset rather than baseline Gaussian models. Petropoulos et al. [9] introduced an objective approach to predict continuation of COVID-19 by live forecasting. They produce ten-days-ahead point forecasts and prediction intervals. A susceptible-exposed-infectious-recovered (SEIR) metapopulation model was used to predict the spread across all major cities in China, with 95% credible intervals [10]. Yang et al. [11] used the modified SEIR model to derive the epidemic curve. They used an artificial intelligence (AI) approach, trained on the 2003 SARS data, to predict the epidemic. Bhatnagar et al. [12] created a mathematical model for predicting the spread of COVID-19 in countries using various types of parameters and tested their model on real data of countries.

A segmented Poisson model was incorporated by the power law and the exponential law as proposed by Zhang et al. [13] to study the COVID-19 outbreaks in six major western countries. Maier et al. [14] have introduced a parsimonious model that captures the infected individuals and also population-wide isolation practices in response to containment policies. Li et al. [15] studied the transmission process of COVID-19. It used forward prediction and backward inference of the epidemic situation, and the relevant analysis helped relevant countries to make more appropriate decisions. Tomar et al. [16] have used data-driven estimation methods like, long short-term memory (LSTM) and curve fitting for prediction for the monthly number of COVID-19 cases in India and also the effect of preventive measures like, social isolation and lockdown on the spread of COVID-19. Kumar et al. [17] have applied cluster analysis, to classify real groups of infectious disease of COVID-19 on a data set of different states and union territories in India, based on their high similarity to each other.

Wu et al. [10] forecasted the prediction for only the major cities of China, whereas Zhang et al. [13] predicted for six major western countries. On the other hand, the proposed methods forecast the count for 10 highly and densely populated countries. SEIR, Poisson, ARIMA, and exponential smoothing model were reported for COVID-19 count prediction. However, we have incorporated 9 different ML algorithms for the prediction and also trained our models with the data of over 100 days, which was 3 times more than reported in the literature.

3 Methodology

We considered the top 10 countries with high population and high density for our outbreak prediction system. Since COVID-19 spreads majorly through human contact, it was imperative to consider only those countries with high density, as well as, high population. The dataset of the countries, namely, Bangladesh, India, China, Pakistan, Germany, Nigeria, Ethiopia, Democratic Republic of Congo, the Philippines, and Indonesia have been used. Initially, we identified a list of 20 most populated countries (Supplementary material S.10). Further, we obtained a list of countries with the highest population density (Supplementary material S.11). From these two lists, we identified the top 10 countries having the highest density as well as high population count (Table 1). We used 9 different machine learning algorithms for predicting the number of patients for the above-specified countries.

Table 1 Top countries on the basis of population and density

Full size table

The train data to test data partition was 94% and 6%, respectively. The algorithms predicted the rise in the number of cases in the next 5 days (Figs. 2, 3) for the countries specified in Table 1. The testing run-time for these algorithms varied between 2 and 5 s. The algorithms were tuned by an iterative approach between the normalized value of zero and one. For the tuning of individual parameters, partial and full auto-correlation was used.

3.1 Auto-regressive Moving Average (ARMA)

The ARMA model is the merger between Auto regressive (AR) and Moving average (MA) models, namely: the AR model, which tries to explain the momentum and mean reversing effects often observed in trading markets and the MA model, which tries to capture the shock effects observed in thermal noise. These shock effects could be thought of as unexpected events affecting the observation. So first we loaded the dataset, and then divided it into a test set and a train set. We trained the model based on the train set and test set comprised of values for which we had to make predictions. Then we made an ARMA model that was trained on the training data. In ARMA, the values of p and q were put inside the order of the model. These values changed depending on what the model fitted the best. Values of p and q are normally taken up to 6. The values of p and q varied for different training datasets depending on the best fit.

For a given series $a_{0},a_{1},\ldots a_{t}$, to implement the ARMA model we have to find the difference between data at different timestamps and make a new series altogether. This difference that we take form the d parameter of the model. Let us represent the new time series as $z_{0}, z_{1},\ldots , z_{t}$. The newly formed time series is stationary and represented as $z_{t} = a_{(t+1)} - a_{t}$. Usually the value of d is taken as 0 or 1.The last value of the z series will be given by:

$$\begin{aligned} z_{t}=\sum _{i=0}^{p}(\phi _{i}z_{t-i})+\sum _{j=0}^{q}(\theta _{j}\varepsilon _{t-j})+\varepsilon _{t} \end{aligned}$$

(1)

Now, if we want to predict the value at k th position, in future ie $k>t$, we have to get the answer in the original series means, we need a value of $a_{k}$ so we have to thus convert the 'z' series into 'a' series and it will be done as:

$$\begin{aligned} a_{k}=z_{k-1}+a_{k-1}=z_{k-2}+a_{k-2}=\sum _{i=1}^{k-l}z_{k-i}+a_{l} \end{aligned}$$

(2)

3.2 Auto Regressive Integrated Moving Average (ARIMA)

ARIMA is a predictive model that predicts future time series based on its past values. An ARIMA model is characterized by 3 terms: p, d, and q, where p is the order of the auto regressive term, q is the order of the moving average term and d is the number of differences required. So first we loaded a dataset, then divided into a test set and a train set. We trained the model based on the train set and test set comprised of the values for which we had to make predictions. Then we made an ARIMA model that was trained on the training data. The values of p, q, and d were put inside the order of the model. These values changed depending on what the model fitted the best. Values of p and q are normally taken up to 6 and d varied between 0 and 1. The values of p, d, and q varied for different training datasets depending on the best fit.

If there is given a time series $l_{0},l_{1},\ldots ,l_{t}$ and we want to predict the last term that is $l_{t}$ , then let the predicted last term be represented as ${\hat{l}}_{t}$. The actual last term will be given by:

$$\begin{aligned} l_{t}=\beta _{0}+\sum _{i=0}^{p}(\beta _{i}l_{t-i})+\sum _{j=0}^{q}(\phi _{j}\varepsilon _{t-j})+\varepsilon _{t} \end{aligned}$$

(3)

where $\sum _{i=0}^{p}(\beta _{i}l_{t-i})$ is Auto Regressive term, $\sum _{j=0}^{q}(\phi _{j}\varepsilon _{t-j})$ is Moving Average term and $\varepsilon _{t}$ is Error lag. Now for predicting ${\hat{l}}_{t}$,

$$\begin{aligned} {\hat{l}}_{t}=\beta _{0}+\sum _{i=0}^{p}(\beta _{i}l_{t-i})+\sum _{j=0}^{q}(\phi _{j}\varepsilon _{t-j}) \end{aligned}$$

(4)

Only the Error lag term is not present. The values of p and q are determined by ACF and PACF, where ACF stands for Auto Correlation Function and PACF stands for Partial Auto-Correlation function.

3.3 Linear Regression (LR)

It is a statistical approach for modeling the relationship between a dependent variable and a given set of independent variables. All the values in the dataset were plotted. After plotting the points, we created the best-fit line. A best-fit line is the one that minimizes the error, i.e., it should have a minimum difference between the actual and predicted values. We found the slope of the line and also its y-intercept. After getting the equation of the line, we were able to predict the new values, which is the number of patients in an individual country. The expression for representing a line is given as y = mx + c, where ‘m’ is the slope. The formula for calculating the slope is

$$\begin{aligned} m=\frac{\sum (x - {\bar{x}})(y - {\bar{y}})}{\sum (x - {\bar{x}})^{2}} \end{aligned}$$

(5)

where ${\bar{x}}$ and ${\bar{y}}$ are the mean values.

3.4 Linear Regressor Polynomial (LRP)

We used the polynomial feature function provided by sci-kit learn library of machine learning, where we can increase the power of the input variable and then fit and transform it on any desired model. Firstly, we imported the necessary libraries. We then imported the polynomial and linear regression functions from sci-kit learn. We instantiated a polynomial feature function with degree = 5 as a parameter. Then we fitted and transformed the input variable as well as the list of days for which we wanted to make the predictions. After that, we instantiated the linear regression model with parameters normalize = True, and fitintercept = False and then fitted the model using a new list made by applying polynomial features. Now we can use our model for predictions of COVID-19 cases on any particular day by using the list made by applying polynomial features to the list of days for which a prediction of COVID-19 cases is desired. Polynomial regression is a model based on a mixture of dependent and independent variables represented by m and y, respectively, and $F_p$ is the polynomial function that tries to add variables of any power we need, which gives us the best results with the dataset taken.

$$\begin{aligned}&y=\alpha _{0}+\alpha _{1}x^{2}+\alpha _{3}x^3+\cdots \alpha _{n}x^n \end{aligned}$$

(6)

$$\begin{aligned}&y=\alpha _{0}+ \sum _{i=1}^m \alpha _{i}x_{i}+F_{p} \end{aligned}$$

(7)

where m = number of independent variables, y = dependent variable, $F_{p}$ = polynomial function (it tries to add variables of any power we need).

3.5 Bayesian Ridge Polynomial Regressor (BRR)

Bayesian linear regression is an type of linear regression. We have used a polynomial version of Bayesian Ridge Regressor to utilize the important relationship between the input variables and target variables, which can further be used for prediction of COVID-19 cases on any random day. We used a randomized search that needs dictionaries of parameters having a different range of values as list of values and names of parameters as keys that we need to experiment with in our model and see which set of parameters gives the best results.

The model was defined with the parameters $\omega$, $\alpha$, and $\lambda$ during the fitting. The regularization parameters and $\lambda$ being estimated by maximizing the log marginal likelihood. The initial value of the maximization procedure can be set with the hyperparameters $\alpha$_init and $\lambda$_init. There are four more hyperparameters $\alpha 1$, $\alpha 2$, $\lambda 1$, $\lambda 2$, of the gamma prior distributions over $\alpha$ and $\lambda$. These are usually chosen to be non-informative.

Bayesian Ridge estimates a probabilistic model of the regression problem as described above. The prior for the coefficient $\omega$ is given by a spherical Gaussian:

$$\begin{aligned} p(\omega |\lambda )=N(\omega |0,\lambda ^{-1}I_{p}) \end{aligned}$$

(8)

The priors over $\alpha$ and $\lambda$ are chosen to be gamma distributions, the conjugate prior for the precision of the Gaussian. The resulting model is called Bayesian Ridge Regression.

3.6 Support Vector Regressor (SVR)

SVR is a powerful algorithm that allows us to choose how tolerant we are of errors, both through an acceptable error margin($\epsilon$) and through tuning our tolerance of falling outside that acceptable error rate. Our original training dataset for every country was stated in a finite-dimensional state and so the sets to discriminate were not linearly separable in that space. To resolve this problem, our original finite-dimensional state was mapped into a higher-dimensional space. By doing this we could find the prediction of different countries in a non-linear approach. The model is defined as a comprehensive evaluation of the gram matrix along with the predictors x(i) and x(j). The gram matrix is a n x n dimensional matrix that contains the elements g(i, j). The process comprises obtaining a non-linear SVM regression model by replacing the dot product of the predictors with a nonlinear kernel function comprising G(x1, x2) as $\phi (x1)$ and $\phi (x2)$ , where $\phi (x1)$ comes out to be greater than G(x1, x2) and $\phi (x2)$ comes out to be less than the function modeled.

Some regression problems cannot be described using a linear model, we need nonlinear models. To obtain a nonlinear SVM regression model by replacing the dot product $x_1^{\prime }x_2$ with a nonlinear kernel function $G(x_{1},x_{2})=<\varphi (x_{1}),\varphi (x_{2})>$, where $\varphi (x)$ is a transformation that maps x to a high-dimensional space. Statistics and Machine Learning Toolbox provides the following built-in semi-definite kernel functions.

The Kernel function of linear dot product is as shown:

$$\begin{aligned} G(x_{j},x_{k})=x_{j}^{\prime }x_{k} \end{aligned}$$

(9)

The Kernel function of Gaussian is:

$$\begin{aligned} G(x_{j},x_{k})=exp(-\vert (x_{i}-x_{k})^2\vert ) \end{aligned}$$

(10)

The Kernel function of Polynomial is:

$$\begin{aligned} G(x_{j},x_{k})=(1+x_{j}^{\prime }x_{k})^q, \end{aligned}$$

(11)

where q is in {2,3...}.

3.7 Random Forest Regressor (RFR)

We used random forest regressor (RFR), as it fits several classifying decision trees. The sub-sample size was controlled with the maxsamples parameter. We loaded the specific model into our training environment and initiated all the parameters to random values. We got the same result every time we ran the model on the given dataset. Then we fitted this model on the dataset so that we could easily predict the number of COVID-19 cases on any day using our trained model.

The Random forest regressor model comprises parameters such as the number of trees, the number of features represented by B and M, respectively. Here the values of B and M are less than or equal to the dimensional value d. T(i) represents the tree at index i. The tree(i) is constructed in such a way that at each node a random value from a subset of features is chosen considering splits on those features only.

$$\begin{aligned} D=((x_{1},y_{1} ),\ldots \ldots \ldots (x_{n},y_{n} )) \end{aligned}$$

(12)

where D = observed data point.

The parameters are, B = Number of trees, M = Number of features, $x_{i}$ is d-dimensional vector, B,M $\le$ d, $T_{i} = tree T_{i}$.

3.8 XGBoost Regressor (XGB)

XGBoost stands for “Extreme Gradient Boosting” and it is an implementation of gradient boosting trees algorithm. Firstly, we imported the necessary libraries and instantiated XGBoost with nestimators=1000 and fit the model. With this, we predicted the number of COVID-19 cases on any day we wanted. In these ways, we used this model to get the number of COVID-19 cases on any particular day using a dataset of actual COVID-19 cases used in training the model.

The model is defined as a comprehensive mix of training losses and regularization measures along with squared loss function summed up in an interval varying from 1 to n. The purpose of optimizing training loss is because of its assistance in predictive models, while regularization enhances the generalization of simpler models. Additive boosting is with y and $\lambda$ as hyperparameters. Approximation techniques such as Taylor approximation have been used in generating the model.

$$\begin{aligned} Obj=\sum _{i=1}^n\ell (y_{i},\widehat{y_{i}})+\sum _{k=1}^K\Omega (f_{k}) \end{aligned}$$

(13)

where

$$\begin{aligned}&\sum _{i=1}^n\ell (y_{i},\widehat{y_{i}}) = Training \,Loss \end{aligned}$$

(14)

$$\begin{aligned}&\sum _{k=1}^K\Omega (f_{k}) = Regularization \end{aligned}$$

(15)

$$\begin{aligned}&L(\theta )=\sum {(\widehat{y_{i}}-y_{i})}^{2} \end{aligned}$$

(16)

where L($\theta$) is loss function and $\sum {(\widehat{y_{i}}-y_{i})}^{2}$ is the squared loss.

3.9 Holt-Winters Exponential Smoothing (HW)

In Holt-Winters Exponential Smoothing model we considered the seasonality to be additive. The forecasted value for each data element is the sum of the baseline, trend, and seasonality components. We use c to denote the frequency of the seasonality. The value of periods (1/frquecy) also depends on best fit and training. So we loaded a dataset, and then divided it into a test set and a train set. We trained the model based on the train set and test set comprised of values for which we had to make predictions. Then we did exponential smoothing on the training dataset with seasonality as additive. This model consists of periods over which we want exponential smoothing to take place. The value of periods varies depending on the best fit and by analyzing the graph of training.

The Holt-Winters seasonal method comprises the forecast equation and three smoothing equations, one for the level $l_{t}$, one for the trend $b_{t}$, and one for the seasonal component $s_{t}$, with corresponding smoothing parameters $\alpha ,\beta$ and $\gamma$. Within each period, the seasonal component will add up to approximately zero, i.e., $\sum S_{t}=0$ for a particular period. Mathematically, Holt Winters Additive Model is represented as:

Forecast = Estimated level + Trend + Seasonality at most recent time point

Series equation is represented as:

$$\begin{aligned} {\hat{Y}}_{t+h|t}=l_{t}+hb_{t}+S_{t+h-c(k + 1)} \end{aligned}$$

(17)

The series has Level $(l_{t})$,Trend $(b_{t})+Seasonality(s_{t})$ with c seasons.

Level equation is represented as:

$$\begin{aligned} l_{t}= \alpha (y_{t} - S_{t-c}) + (1-\alpha )(l_{t-1} + b_{t-1}) \end{aligned}$$

(18)

The level equation shows a weighted average between the seasonally adjusted observation $(y_{t} - S_{t-c})$ and the non-seasonal forecast $(l_{t-1} + b_{t-1})$ for time t.

Trend equation is represented as:

$$\begin{aligned} b_{t}=\beta (l_{t}-l_{t-1})+(1-\beta )b_{t-1} \end{aligned}$$

(19)

The trend equation is identical to Holt’s linear method.

Seasonality equation is represented as:

$$\begin{aligned} s_{t}=\gamma (y_{t}-l_{t-1}-b_{t-1})+(1-\gamma )s_{t-c} \end{aligned}$$

(20)

The seasonal equation shows a weighted average between the current seasonal index, $(y_{t}-l_{t-1}-b_{t-1})$, and the seasonal index of the same season c time periods ago. The values of $\alpha$, $\beta$ and $\gamma$ usually range between 0 and 1.

3.10 Hardware, Dataset and Software Used

The models were trained on Windows 10 operating system with an 8th generation Intel i5 processor and 8 GB of RAM. The dataset was obtained from ourworldindata.org [18]. All the models were trained on Google Colaboratory, as well as Spyder using Python version 3.6.7 along with the assistance of libraries such as Numpy version 1.15, Matplotlib version 3.3.1, Pandas version 1.1.0, Scikit-learn version 0.23.1, XGBoost version 1.1.1, and Statsmodels version 0.10.2.

4 Results

As shown in Table 2, we used 9 different machine learning algorithms to predict the number of patients in 10 highly dense and populated countries. Among all the models for the various countries (Figs. 2, 3), we achieved the highest accuracy of 99.93% for Ethiopia (Fig. 3b and 4c) by using the ARMA model. ARIMA gave an accuracy of more than 85% most of the time for almost all countries. Almost all the models gave an accuracy of more than 80% at least for one of the 10 countries, except in the case of the Philippines (Fig. 3c).

Table 2 Accuracy values for all 10 countries using 9 ML algorithms

Full size table

We found different countries to have a different trend of increase or decrease in COVID-19 patients. Not every ML algorithm could give a very high accuracy for predicting the rise or fall in the cases for each country. Our results showed that for Bangladesh (Figs. 2d and 4b), the LRP model showed the highest accuracy of 86.45%. For India (Figs. 2e and 4a), we got an accuracy of 99.26% using the ARMA model. China (Fig. 3d and 5d ) had a prediction value of 82% using the XGB model. For Pakistan (Figs. 2c and 5b), the accuracy was 87.91% using the BRR model. For Nigeria (Figs. 2b and 4e), the accuracy was 98.06% using the ARMA model. Democratic Republic of Congo (Figs. 3e and 4d) showed the highest accuracy of 91.96% by using the LRP model. Indonesia (Figs. 2a and 5a) demonstrated the highest accuracy of 97.72% using the ARIMA model. For Germany (Figs. 3a and 4f), ARIMA gave an accuracy of 85.39%. Using the SVR model, we got a prediction accuracy of 50.54% for the Philippines (Figs. 3c and 5c).

Figure 4 shows bar graphs for different error percentages and their corresponding errors for the next 5-day predictions. In the case of India (Fig. 4a), Ethiopia (Fig. 4c), and Nigeria (Fig. 4e), the ARMA model gave the highest accuracy for the prediction as compared to the other models. For Bangladesh (Fig. 4b) and Democratic Republic of Congo (Fig. 4d), the LRP model proved to be effective, although the accuracy in the case of Bangladesh was low. In the case of Germany (Fig. 4f), the ARIMA model gave the least percentage error, while models like BRR and LR gave errors of more than 50%. For Indonesia (Fig. 5a), the ARIMA model gave the least percentage error, while the ARMA model was highly inaccurate, with a very high error percentage. In the case of Pakistan (Fig. 5b), the BRR model yielded better accuracy, whereas for the Philippines (Fig. 5c), none of the models made accurate predictions. The percentage error of all models was more than 40%. The XGB model proved best for prediction in the case of China (Fig. 5d). A range finder code was written that helps to improve accuracy. This code works on a range of predicted numbers from all 9 algorithms, rather than actual predictions by the individual algorithms. This combined approach helped us to improve accuracy by up to 8%.

5 Discussions

It was not possible to get the results using all the 9 algorithms for each country as there were no specific trends observed. For the Philippines, we got a very low accuracy because a sudden drop of around 1400 cases to 0 was seen and in the following day, the count increased by 4500. Also because of the change in the government rules, the COVID-19 count of the country changed drastically. Due to this change, the proposed models were not able to make predictions with high accuracy. According to our dataset for China, a particular day had approximately 2000 patients and after that day the rise observed in the number of cases was approximately 13,000. The number of total patients was observed to be around 15,000. All of a sudden, the number dropped by 11,000 the next day. The declining phase started after the drop for about the next 100 days. Due to the peak value, our training dataset had to be changed. We considered only those values after the peak where a declining trend could be seen. To date, China shows a decline in the curve, and hence we considered only the decrement values. The slope for China is decreasing, and hence the values after the peak were considered. Also, after considering this, 2 out of the 9 models failed to show good accuracy for China. The raw data received from countries like China and Philippines were not correct, because the government policies changed on February 17, 2020 and July 6, 2020, respectively.

5.1 Comparison with Other Methods

As shown in Table 3, we have compared our methodology with the other methodologies reported in the literature. Most of the literature has used the ARIMA model for the outbreak prediction of COVID-19. We have used 9 different ML models for the prediction of COVID-19 on the top 10 densely and highly populated countries. We achieved the highest accuracy of 99.93%, which was high as compared to the other methodologies reported. The highest accuracy was achieved by the ARMA model for Ethiopia. Poonia et al. [20] achieved the highest accuracy of 95% for India using the ARIMA model forecasting. The ARIMA model for India achieved an accuracy of 90.55%, which was high as compared to an accuracy of 70% obtained by Gupta et al. [21] using the ARIMA model and Exponential smoothing. In comparison, the SEIR model implemented by Wu et al. [10] gave an accuracy of 95% for prediction in Wuhan.

Table 3 Comparison of methodologies reported in existing literature

Full size table

5.2 Future Scope

Although the overall accuracy achieved was very good, we are still trying to implement prediction models using different algorithms that could give us higher accuracy. We are also planning to get a single standard model that can be used for any country, which may be a combination of different algorithms. We are planning to develop such ML algorithms that could give us an approximate duration of COVID-19 as a pandemic.

6 Conclusions

The study presented here outlines several technique of predicting new cases that would arise in a few days in the near future in any region during an expanding pandemic, so that there is the proper allocation of resources in those regions for higher recovery rates. The ARMA model gave the highest accuracy for the prediction of COVID-19 cases for Ethiopia. From the results obtained on all the models for all countries, it was found that ARMA proved to be the best model for India and Nigeria. ARIMA was best for Indonesia and Germany. LRP for Bangladesh and Democratic Republic of Congo. And BRR, XGB, SVR proved best for Pakistan, China and the Philippines, respectively. We got an accuracy of more than 80% for all the countries except the Philippines by any one of the 9 ML algorithms. The overall best model for the prediction was ARIMA. Generating high-accuracy prediction that could help in an optimized use of available resources along with pacing up the recovery graphs has been the main aim behind this exercise. These regions could potentially benefit from knowing the number of resources that they would need based on the predictions of the model. This model could help in lowering the cost of dealing with the pandemic and improve the recovery process in regions where it is deployed.

Data availability statement

All the data and codes used in this study, as well as, the supplementary material can be made available from the corresponding author, upon reasonable request.

Code availability

All the codes used in this study, as well as, the supplementary material can be made available from the corresponding author, upon reasonable request.

References

Liu Y, Gu Z, Xia S, Shi B, Zhou XN, Shi Y, Liu J (2020) What are the underlying transmission patterns of COVID-19 outbreak?—an age-specific social contact characterization. EClinicalMedicine 22:100354
Article Google Scholar
Temesgen A, Gurmesa A, Getchew Y (2018) Joint modeling of longitudinal cd4 count and time-to-death of hiv/tb co-infected patients: a case of Jimma university specialized hospital. Ann Data Sci 5(4):659
Article Google Scholar
Olson DL, Shi Y, Shi Y (2007) Introduction to business data mining, vol 10. McGraw-Hill/Irwin, Englewood Cliffs
Google Scholar
Li J, Guo K, Viedma EH, Lee H, Liu J, Zhong N, Gomes LFAM, Filip FG, Fang SC, Özdemir MS et al (2020) Culture vs policy: more global collaboration to effectively combat COVID-19. Innovation 1(2):100023
Article Google Scholar
Shi Y, Tian Y, Kou G, Peng Y, Li J (2011) Optimization based data mining: theory and applications. Springer, Berlin
Book Google Scholar
Vomlel J, Kruzık H, Tuma P, Precek J, Hutyra M (2012) Machine learning methods for mortality prediction in patients with st elevation myocardial infarction. Proc WUPES 17(1):204
Google Scholar
Kumar P, Kalita H, Patairiya S, Sharma YD, Nanda C, Rani M, Rahmani J, Bhagavathula AS (2020) Forecasting the dynamics of COVID-19 pandemic in top 15 countries in April 2020: Arima model with machine learning approach. medRxiv
Tuli S, Tuli S, Tuli R, Gill SS (2020) Predicting the growth and trend of COVID-19 pandemic using machine learning and cloud computing. Internet Things 11:100222
Article Google Scholar
Petropoulos F, Makridakis S (2020) Forecasting the novel coronavirus COVID-19. PLoS ONE 15(3):e0231236
Article Google Scholar
Wu JT, Leung K, Leung GM (2020) Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. Lancet 395(10225):689
Article Google Scholar
Yang Z, Zeng Z, Wang K, Wong SS, Liang W, Zanin M, Liu P, Cao X, Gao Z, Mai Z et al (2020) Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions. J Thorac Dis 12(3):165
Article Google Scholar
Bhatnagar MR (2020) Covid-19: mathematical modeling and predictions, submitted to ARXIV. http://web.iitd.ac.in/~manav/COVID.pdf
Zhang X, Ma R, Wang L (2020) Predicting turning point, duration and attack rate of COVID-19 outbreaks in major western countries. Chaos Solitons Fractals 135:109829
Article Google Scholar
Maier BF, Brockmann D (2020) Effective containment explains subexponential growth in recent confirmed COVID-19 cases in China. Science 368(6492):742
Article Google Scholar
Li L, Yang Z, Dang Z, Meng C, Huang J, Meng H, Wang D, Chen G, Zhang J, Peng H et al (2020) Propagation analysis and prediction of the COVID-19. Infect Dis Model 5:282
Google Scholar
Tomar A, Gupta N (2020) Prediction for the spread of COVID-19 in India and effectiveness of preventive measures. Sci Total Environ 728:138762
Article Google Scholar
Kumar S (2020) Monitoring novel corona virus (COVID-19) infections in India by cluster analysis. Ann Data Sci 7:1
Article Google Scholar
Max Roser EOO, Ritchie Hannah, Hasell J (2020) Coronavirus pandemic (COVID-19), our world in data. https://ourworldindata.org/coronavirus
Chintalapudi N, Battineni G, Amenta F (2020) COVID-19 disease outbreak forecasting of registered and recovered cases after sixty day lockdown in Italy: a data driven model approach. J Microbiol Immunol Infect 52(3):396–403. https://doi.org/10.1016/j.jmii.2020.04.004
Article Google Scholar
Poonia N, Azad S (2020) Short-term forecasts of COVID-19 spread across Indian states until 1 may 2020. arXiv preprint arXiv:2004.13538
Gupta R, Pal SK (2020) Trend analysis and forecasting of COVID-19 outbreak in India. medRxiv 2020.03.26.20044511. https://doi.org/10.1101/2020.03.26.20044511

Download references

Funding

No funding was involved in the present work.

Author information

Authors and Affiliations

K. J. Somaiya College of Engineering, Vidyavihar, Mumbai, 400077, India
Aman Khakharia, Vruddhi Shah, Sankalp Jain, Jash Shah, Amanshu Tiwari, Prathamesh Daphal, Mahesh Warang & Ninad Mehendale

Authors

Aman Khakharia
View author publications
You can also search for this author in PubMed Google Scholar
Vruddhi Shah
View author publications
You can also search for this author in PubMed Google Scholar
Sankalp Jain
View author publications
You can also search for this author in PubMed Google Scholar
Jash Shah
View author publications
You can also search for this author in PubMed Google Scholar
Amanshu Tiwari
View author publications
You can also search for this author in PubMed Google Scholar
Prathamesh Daphal
View author publications
You can also search for this author in PubMed Google Scholar
Mahesh Warang
View author publications
You can also search for this author in PubMed Google Scholar
Ninad Mehendale
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ninad Mehendale.

Ethics declarations

Involvement of human participant and animals

This article does not contain any studies with animals or humans performed by any of the authors. All the necessary permissions were obtained from the Institute’s Ethical Committee and concerned authorities.

Information about informed consent

No informed consent was required as the studies does not involve any human participant.

Author contributions

Conceptualization was done by Ninad Mehendale (NM) and Aman Khakharia (AK). All the experiments/code executions were performed by, AK, Jash Shah (JS), and Sankalp Jain (SJ). The formal analysis was performed by AK, Vruddhi Shah (VS), Mahesh Warang (MW), and NM. Manuscript writing- original draft preparation, AK, VS, JS, SJ, and Amanshu Tiwari (AT). Review & editing, AK, VS, Prathamesh Daphal (PD), and NM. Visualization work carried out by PD, VS, NM

Ethic statements

All authors consciously assure that for the manuscript fulfills the following statements: 1) This material is the authors’ own original work, which has not been previously published elsewhere. 2) The paper is not currently being considered for publication elsewhere. 3) The paper reflects the authors’ own research and analysis in a truthful and complete manner. 4) The paper properly credits the meaningful contributions of co-authors and co-researchers. 5) The results are appropriately placed in the context of prior and existing research.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Khakharia, A., Shah, V., Jain, S. et al. Outbreak Prediction of COVID-19 for Dense and Populated Countries Using Machine Learning. Ann. Data. Sci. 8, 1–19 (2021). https://doi.org/10.1007/s40745-020-00314-9

Download citation

Received: 23 July 2020
Revised: 19 September 2020
Accepted: 06 October 2020
Published: 16 October 2020
Issue Date: March 2021
DOI: https://doi.org/10.1007/s40745-020-00314-9

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Outbreak Prediction of COVID-19 for Dense and Populated Countries Using Machine Learning

Abstract

Similar content being viewed by others

A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of COVID-19

Coronavirus disease (COVID-19) cases analysis using machine-learning applications

Supervised Machine Learning Models for Prediction of COVID-19 Infection using Epidemiology Dataset

1 Introduction

2 Literature Review

3 Methodology

3.1 Auto-regressive Moving Average (ARMA)

3.2 Auto Regressive Integrated Moving Average (ARIMA)

3.3 Linear Regression (LR)

3.4 Linear Regressor Polynomial (LRP)

3.5 Bayesian Ridge Polynomial Regressor (BRR)

3.6 Support Vector Regressor (SVR)

3.7 Random Forest Regressor (RFR)

3.8 XGBoost Regressor (XGB)

3.9 Holt-Winters Exponential Smoothing (HW)

3.10 Hardware, Dataset and Software Used

4 Results

5 Discussions

5.1 Comparison with Other Methods

5.2 Future Scope

6 Conclusions

Data availability statement

Code availability

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Involvement of human participant and animals

Information about informed consent

Author contributions

Ethic statements

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation