Machine learned hybrid Gaussian analysis of COVID-19 pandemic in India

This article discusses short-term forecasting of novel coronavirus (COVID-19) data for infected, recovered and active cases in India using a machine-learned hybrid Gaussian and ARIMA method. The COVID-19 data are obtained from Worldometer and the Ministry of Health (MOH), India, and are analyzed for the period from January 30, 2020 (the first reported case) to October 15, 2020. Using ARIMA (2, 1, 0), we obtain a short-term forecast up to October 31, 2020. Several goodness-of-fit statistics were tested to evaluate the forecasting methods, and the results show that ARIMA (2, 1, 0) gives the better forecast for these data. It is observed that the COVID-19 data follow quadratic behavior and that, in the long run, the spread shows a high peak roughly estimated at September 18, 2020. Also, nonlinear regression shows that the long-run trend follows a Gaussian mixture model. It is concluded that COVID-19 will follow a secondary shock wave in November 2020. India is approaching herd immunity. It is also observed that the impact of the pandemic will last about 441 to 465 days and that the pandemic will end between April and May 2021. It is concluded that the primary peak was observed in September 2020 and that the secondary shock wave, with a sharp peak, is expected around November 2020. Thus, people should follow precautionary measures and maintain social distancing with all safety measures, as the pandemic is not under control owing to the non-availability of medicines.


Introduction
When the world was celebrating New Year's Eve in December 2019, China declared an outbreak of a new virus in Wuhan, a city in China's Hubei province that is home to approximately 11 million people. The virus was found to belong to the family of single-stranded RNA viruses known as Coronaviridae, which affects species such as reptiles, mammals and birds [1]. It was named Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2); the disease that develops when a living body is infected by this virus is called COVID-19. Its transmission to humans has caused panic because of its virulent and pernicious effects on the body. A person carrying the SARS-CoV-2 virus may experience symptoms such as cough, high body temperature, shivering and fever, breathlessness, diarrhea, headache and body ache.
Further, it exhibits various qualities that help it sustain itself, such as travelling within the body undetected (unless the host is tested properly), contaminating everything its host touches, and reproducing liberally.
However, its origin has not yet been confirmed [2]. Initially, it was suspected to have originated in Wuhan's South China Seafood Wholesale Market; later, other possible origins were proposed. For example, some scientists claimed that the cross-species transmission may have been from snakes to humans, but this claim was disputed [3]. Coronaviruses (CoVs) are RNA viruses that are respiratory pathogens. Coronavirus transmission is described as zoonotic, i.e. between animals and people. They can cause illnesses ranging from benign seasonal ones such as the common cold to severe public health emergencies such as Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS). A novel strain of coronavirus that had not previously been found in humans was identified in 2019 [4]. The death rate from COVID-19 is considered very low for many age groups, but the virus has turned out to be deadly for people above the age of 60, and owing to its spread to over 200 countries, the World Health Organization (WHO) declared it a pandemic on 11 March 2020 [5].
The COVID-19 fatality rate by age (as of February 11, 2020) is shown in Fig. 1 [6]. With an estimated reproduction number greater than 1 (range 2.6-4.7), early reports predicted a potential coronavirus outbreak [7]. The disease caused by this novel coronavirus was named COVID-19 by WHO on February 11, 2020. In India, the first case of COVID-19 was discovered in Kerala. The patient was a female Indian national and a student at Wuhan University; she was isolated and later declared stable by doctors. After this incident, a rapid response team was called to an emergency meeting, which discussed ways of preventing community transmission of the virus [8]. India reported almost no new cases of the virus until one month after this incident. India experienced the true outbreak of COVID-19 in March 2020, and the number of new cases has continued to rise exponentially since then. The world has gone through many pandemics in its history. The Global Health Security Index 2019, which ranks 195 countries on health security, is shown in Fig. 2 (as of March 20, 2020) [9].
Coronaviruses (CoVs) are mainly identified through infections of the respiratory and gastrointestinal tracts and can be genetically classified into four key genera: Alphacoronavirus, Betacoronavirus, Gammacoronavirus and Deltacoronavirus. The first two genera predominantly infect mammals, and the last two primarily infect birds. Various kinds of human-infecting CoVs have been identified previously. These comprise HCoV-NL63 and HCoV-229E, which belong to the Alphacoronavirus genus, and HCoV-OC43, HCoV-HKU1, severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV), which belong to the Betacoronavirus genus. Coronaviruses were not widely known until the 2003 SARS pandemic, followed by MERS in 2012 and, most recently, the 2019-nCoV outbreak. SARS-CoV and MERS-CoV are considered highly virulent. It is also very probable that SARS-CoV and MERS-CoV were transmitted from bats to palm civets and dromedary camels, respectively, and ultimately to humans. The novel coronavirus, named "2019 novel coronavirus" or "2019-nCoV" by the World Health Organization (WHO), is responsible for the recent pneumonia outbreak that started in early December 2019. Studies targeting this coronavirus have examined the molecules that may enter the host cell and cause acute respiratory syndrome, and have forecast impending COVID-19 cases for China and some other regions using mathematical and traditional time-series prediction models [40]. Model-based mathematical prediction was achieved at an early stage of the outbreak of this virus in China [41]. The pneumonia outbreak has also been explored extensively through the coronavirus genome, which is of probable bat origin [42].

Effects of the virus
The coronavirus can have harmful effects on our bodies and livelihoods. Some of them are:
1. Formation of blood clots in the human body. Doctors and scientists across the globe are witnessing a surfeit of clotting-related disorders, ranging from an innocuous skin bruise on the foot, sometimes called "Covid toe", to life-threatening strokes and vein blockages. The problem is evident in the clots (thrombi) that form in patients' arterial catheters and in the filters used to support failing kidneys. The more severe the blood clots, the more they impede blood circulation in the lungs and cause breathlessness.
2. Silent hypoxia. Researchers and doctors have described a medical condition known as "happy" or "silent" hypoxia, in which patients have extremely low blood oxygen levels and yet show no symptoms of breathlessness. They now advocate its early detection as a means of avoiding a deadly disorder called "Covid pneumonia", a malign condition found in patients who are severely affected by COVID-19. It is preceded by silent hypoxia, a form of oxygen deprivation that is harder to detect than regular hypoxia. In numerous cases, COVID-19 patients with silent hypoxia did not exhibit signs such as coughing or shortness of breath until their oxygen levels fell very low, at which point there is a risk of acute respiratory distress syndrome (ARDS) and organ failure.

Preventive measures
As the number of COVID-19 cases in India reaches approximately 3 lakh, it becomes more important than ever to protect ourselves from the virus. Also, many states are loosening the restrictions that were previously imposed to prevent community spread, so personal hygiene is essential [31]. Some of the measures that can ensure personal hygiene and safety are:
3. Avoid crowded places. Going to crowded places greatly increases the risk of being infected with COVID-19, because social distancing is not followed in these areas.
4. Always stay updated about the virus from trusted sources. Staying updated from trusted sources helps in maintaining your welfare and safety.

Methodology
The different techniques and methodologies used for forecasting are shown in the flow chart in Fig. 3 [38].
The case fatality rate (CFR) is defined as

$$\mathrm{CFR} = \frac{\text{total deaths}}{\text{total confirmed cases}} \times 100 \qquad (1)$$

To control the epidemic, the value of the CFR should be as low as possible [22].
The cumulative active cases are given by

$$\text{Cumulative active cases} = \text{Aggregate confirmed cases} - \text{Aggregate deaths} - \text{Aggregate recovered cases}$$

To control the pandemic, the cumulative active cases should be at a minimum, with cumulative deaths at zero and cumulative recovered cases at a maximum [18].
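The two definitions above can be sketched in a few lines of Python. The function names and the counts used here are hypothetical and serve only as an illustration; they are not values from the data set analyzed in this article.

```python
def case_fatality_rate(total_deaths, total_confirmed):
    """CFR in percent, following Eq. (1): (total deaths / total confirmed cases) * 100."""
    if total_confirmed <= 0:
        raise ValueError("total_confirmed must be positive")
    return total_deaths / total_confirmed * 100


def cumulative_active(total_confirmed, total_deaths, total_recovered):
    """Cumulative active cases = confirmed - deaths - recovered."""
    return total_confirmed - total_deaths - total_recovered


# Illustrative (made-up) counts, not values from the paper.
confirmed, deaths, recovered = 1000, 15, 800
print("CFR (%):", case_fatality_rate(deaths, confirmed))
print("Active cases:", cumulative_active(confirmed, deaths, recovered))
```

Applied day by day to the cumulative series, these two quantities give the CFR and total active columns of the kind summarized in the descriptive statistics.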

Nonlinear regression
Nonlinear regression is a form of regression analysis in which data are fitted to a model that is expressed as a mathematical function. Simple linear regression relates two variables (X and Y) with a straight line (y = mx + b), whereas nonlinear regression relates the two variables through a nonlinear (curved) relationship. The objective is to make the sum of squares as small as possible. The sum of squares is a measure of how far the observed Y values deviate from the nonlinear (curved) function that is used to predict Y.
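To illustrate what minimizing the sum of squares means for a curved model, the following is a minimal sketch, not the fitting procedure used in this study. It assumes a hypothetical two-parameter model y = a * exp(b * x), synthetic data, and a Gauss-Newton iteration with step halving; all names are chosen for illustration.

```python
import math


def model(x, a, b):
    """Hypothetical curved model y = a * exp(b * x), used for illustration only."""
    return a * math.exp(b * x)


def sse(xs, ys, a, b):
    """Sum of squared errors between the observations and the model curve."""
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))


def gauss_newton(xs, ys, a, b, iterations=50):
    """Reduce the SSE with Gauss-Newton steps, halving any step that would increase it."""
    for _ in range(iterations):
        residuals = [y - model(x, a, b) for x, y in zip(xs, ys)]
        # Columns of the Jacobian: d(model)/da and d(model)/db.
        ja = [math.exp(b * x) for x in xs]
        jb = [a * x * math.exp(b * x) for x in xs]
        # Normal equations (J^T J) * step = J^T r, solved directly for the 2x2 case.
        saa = sum(u * u for u in ja)
        sab = sum(u * v for u, v in zip(ja, jb))
        sbb = sum(v * v for v in jb)
        ga = sum(u * r for u, r in zip(ja, residuals))
        gb = sum(v * r for v, r in zip(jb, residuals))
        det = saa * sbb - sab * sab
        if det == 0:
            break
        da = (sbb * ga - sab * gb) / det
        db = (saa * gb - sab * ga) / det
        # Step halving: accept the step only if it does not increase the SSE.
        current = sse(xs, ys, a, b)
        step = 1.0
        while step > 1e-8 and sse(xs, ys, a + step * da, b + step * db) > current:
            step /= 2
        if step <= 1e-8:
            break
        a, b = a + step * da, b + step * db
    return a, b


# Synthetic data generated from a = 2.0, b = 0.5 (not data from the paper).
xs = [0, 1, 2, 3, 4, 5]
ys = [model(x, 2.0, 0.5) for x in xs]
a_hat, b_hat = gauss_newton(xs, ys, a=1.5, b=0.4)
print(a_hat, b_hat, sse(xs, ys, a_hat, b_hat))
```

Because the synthetic data contain no noise, the fitted parameters should return to the generating values; with real case counts the minimized sum of squares stays positive and is reported as the SSE.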

Box-Cox time transformation measure
The Box-Cox transformation is a particularly useful family of transformations [34,35].
Theorem: Suppose we have a sample of n response values $t_1, t_2, \dots, t_n$. Let δ be a value such that $t_i + \delta > 0$ for every i. Compute the transformed values $f(t_i)$ from the $t_i$ as

$$f(t_i) = \begin{cases} \dfrac{(t_i + \delta)^{\omega} - 1}{\omega}, & \omega \neq 0 \\ \ln(t_i + \delta), & \omega = 0 \end{cases}$$

that is, the natural log is applied in the case ω = 0 instead of the power formula. The transformation is meant to bring a non-normal dependent variable into an approximately normal shape. The normality of the resulting transformed data is measured by the correlation coefficient of the normal probability plot, which serves as a scale of the linearity of that plot. In the Box-Cox normality plot, the vertical axis shows this correlation coefficient and the horizontal axis shows the values of ω. This test is applied to both positive and negative values [36,37].
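As a rough sketch of how the transformation and its normality measure fit together, the Python below applies the shifted power transform and computes the probability plot correlation coefficient for a few values of ω. The function names, the synthetic data and the choice of Blom plotting positions are assumptions made for illustration; statistical packages may use other conventions.

```python
import math
from statistics import NormalDist


def box_cox(values, omega, delta=0.0):
    """Shifted power transform: ((t + delta)**omega - 1) / omega, or ln(t + delta) when omega == 0."""
    shifted = [t + delta for t in values]
    if min(shifted) <= 0:
        raise ValueError("t + delta must be positive for every value")
    if omega == 0:
        return [math.log(s) for s in shifted]
    return [(s ** omega - 1) / omega for s in shifted]


def probability_plot_correlation(values):
    """Correlation between the sorted data and normal quantiles (linearity of the normal probability plot)."""
    n = len(values)
    ordered = sorted(values)
    # Blom plotting positions (i - 0.375) / (n + 0.25): an assumed convention.
    quantiles = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mean_x = sum(ordered) / n
    mean_q = sum(quantiles) / n
    cov = sum((x - mean_x) * (q - mean_q) for x, q in zip(ordered, quantiles))
    var_x = sum((x - mean_x) ** 2 for x in ordered)
    var_q = sum((q - mean_q) ** 2 for q in quantiles)
    return cov / math.sqrt(var_x * var_q)


# Synthetic, strongly right-skewed values (not data from the paper).
data = [1, 2, 4, 8, 16, 32, 64]
for omega in (-1, -0.5, 0, 0.5, 1):
    print(omega, probability_plot_correlation(box_cox(data, omega)))
```

Plotting the printed correlation against ω gives the Box-Cox normality plot described above; the ω with the highest correlation is the candidate transformation.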

Autocorrelation & partial correlation function (ACF & PACF)
The data under consideration are said to exhibit autocorrelation when the response values $X_i$ at times $t_i$ are correlated with the values $X_{i+d}$ at times $t_{i+d}$, where d is the time increment (lag) to later observations [10,11,14,15]. In a long-memory process, the autocorrelation decays over time according to a power law, written as

$$p(k) = C k^{-\alpha}$$

where C is a constant, p(k) is the autocorrelation function at lag k, and α > 0 is the decay exponent.
More generally, consider a set of responses $X_1, X_2, \dots, X_n$ at times $t_1, t_2, \dots, t_n$. The sample lag-k autocorrelation function, which estimates p(k), is

$$r_k = \frac{\sum_{i=1}^{n-k} (X_i - \bar{X})(X_{i+k} - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

where $\bar{X} = (X_1 + X_2 + \dots + X_n)/n$. The observations should be uniformly spaced in time. Unlike cross-correlation, the ACF yields a correlation coefficient that indicates the degree of resemblance between values of the same response variable at times $t_i$ and $t_{i+k}$. The ACF is used to detect non-randomness in data and to identify an appropriate time-series model when the data are not random. When the ACF is used to identify an appropriate time-series regression, the lag-k ACF is plotted against k.
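The lag-k autocorrelation can be computed directly from its definition as the lagged cross-products of deviations from the mean, divided by the total sum of squared deviations. The following is a minimal Python sketch; the function name and the short series are hypothetical.

```python
def autocorrelation(x, k):
    """Sample lag-k autocorrelation: lagged cross-products over the total sum of squares."""
    n = len(x)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < len(x)")
    mean = sum(x) / n
    denominator = sum((v - mean) ** 2 for v in x)
    numerator = sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k))
    return numerator / denominator


# Illustrative, uniformly spaced series (not data from the paper).
series = [1, 2, 3, 4, 5]
for lag in range(4):
    print(lag, autocorrelation(series, lag))
```

Computing this value for lags 0, 1, 2, ... and plotting it against the lag gives the ACF chart used for model identification.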

Augmented Dickey-Fuller stationarity test (ADF)
The ADF test is a unit root test used to check stationarity, since unit roots can cause unpredictable results in the autoregressive models of time series analysis. Time series differ from ordinary predictive modeling problems, in which the summary statistics of the observations are assumed to be consistent. In the context of time series, this assumption is referred to as the series being stationary [19,20,33]. A time series is taken to be stationary when it contains no trend or seasonal effects, so that summary statistics computed on the series are consistent over time [10,12,13]. In particular, the test determines how strongly a time series is defined by a trend.
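The idea behind a unit root test can be sketched with the simplest Dickey-Fuller regression, in which the change in the series is regressed on its previous level. The code below is a simplified, non-augmented version (no lagged difference terms) written for illustration; it is not the full ADF test, and the function name and the simulated series are assumptions. A strongly negative t-statistic (roughly below -2.86 at the 5% level for the regression with a constant) speaks against a unit root.

```python
import math
import random


def dickey_fuller_stat(series):
    """t-statistic of gamma in dy_t = c + gamma * y_{t-1} + e_t (no lagged differences)."""
    y_lag = series[:-1]
    dy = [b - a for a, b in zip(series[:-1], series[1:])]
    n = len(dy)
    mean_x = sum(y_lag) / n
    mean_y = sum(dy) / n
    sxx = sum((x - mean_x) ** 2 for x in y_lag)
    sxy = sum((x - mean_x) * (d - mean_y) for x, d in zip(y_lag, dy))
    gamma = sxy / sxx
    c = mean_y - gamma * mean_x
    sse = sum((d - c - gamma * x) ** 2 for x, d in zip(y_lag, dy))
    standard_error = math.sqrt(sse / (n - 2) / sxx)
    return gamma / standard_error


# Simulated series (not data from the paper).
random.seed(0)
noise = [random.gauss(0, 1) for _ in range(200)]  # stationary by construction
walk = []
level = 0.0
for shock in noise:
    level += shock
    walk.append(level)  # random walk: has a unit root

print("white noise:", dickey_fuller_stat(noise))
print("random walk:", dickey_fuller_stat(walk))
```

In practice the statistic has to be compared with Dickey-Fuller critical values rather than the usual t-distribution, and the augmented form adds lagged differences to absorb serial correlation.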

Goodness of Fit, histogram and density function
By grading the goodness of fit (GOF) of various distributions, one can judge which distribution is satisfactory and which is not. The histogram and the probability density function (PDF) are derived from the cumulative distribution function (CDF) [24,25,26].
Theorem: The deviance is a measure of the discrepancy between observed and fitted values. For Poisson responses, the deviance takes the form

$$D = 2 \sum_{i=1}^{n} \left[ y_i \ln\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i) \right]$$

where $y_i$ are the observed values and $\hat{\mu}_i$ the fitted values. The first term is identical to the binomial deviance, representing "twice a sum of observed times log of observed over fitted". The second term, a sum of differences between observed and fitted values, is usually zero [16,17,21,27].
Lemma: For large samples, the distribution of the deviance is approximately chi-squared with n - p degrees of freedom, where n is the number of observations and p is the number of parameters. Thus, the deviance can be used directly to test the goodness of fit of the model.
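The Poisson deviance described above (twice the sum of observed times log of observed over fitted, minus the sum of observed-minus-fitted differences) can be evaluated with a short Python function. The name and the example counts are hypothetical; a term with an observed count of zero is taken to contribute zero to the log part, which is the usual convention.

```python
import math


def poisson_deviance(observed, fitted):
    """Twice the sum of y * ln(y / mu) - (y - mu) over all observations."""
    total = 0.0
    for y, mu in zip(observed, fitted):
        if mu <= 0:
            raise ValueError("fitted values must be positive")
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2 * total


# Illustrative counts (not data from the paper).
print(poisson_deviance([3, 5], [3, 5]))  # a perfect fit gives zero deviance
print(poisson_deviance([2, 4], [3, 3]))
```

The resulting value would then be compared with a chi-squared distribution with n - p degrees of freedom, as stated in the lemma.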

Coefficient of determination (R²)
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. The definition of R-squared is fairly simple: it is the percentage of the variation in the response variable that is explained by the regression. R-squared lies between 0 and 1, where 0 indicates that the model explains none of the variability of the response data around its mean and 1 indicates that the model explains all of the variability of the response data around its mean [28,29,30,32].
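A minimal Python sketch of this measure is given below, together with the adjusted version that penalizes additional predictors. The function names and example values are hypothetical, and the adjusted formula assumes that p counts the predictors excluding the intercept.

```python
def r_squared(observed, fitted):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Illustrative values (not data from the paper).
print(r_squared([1, 2, 3], [1, 2, 3]))  # model reproduces the data
print(r_squared([1, 2, 3], [2, 2, 2]))  # model predicts only the mean
print(adjusted_r_squared(0.9, n=30, p=2))
```

Unlike R-squared itself, the adjusted value can fall below zero when a model explains less variation than its number of parameters would be expected to explain by chance.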

Results and discussions for model selection
For the long-term behavior, the data sets for India from January 30, 2020 to October 15, 2020 are analyzed. The spread of COVID-19 in the different states of India as of August 9, 2020 is shown in Fig. 4 [39]. Descriptive statistics for new cases, total cases, new deaths, total deaths, new recovered, total recovered, new active, total active and CFR are given in Table 1, with the details of correlation and coefficient of determination in Table 2. The parallel coordinates plot for all cases is shown in Fig. 5. The cross-correlation of total recovered cases with respect to date is shown in Fig. 6. The details of the normality and white noise tests for date, total recovered and date/recovered cases are shown in Table 3 for different statistics: Box-Pierce, Ljung-Box and McLeod-Li, each with 6 and 12 degrees of freedom.
A two-tailed Mann-Kendall trend test for total recovered cases is carried out at the 95% confidence level, so the significance level alpha is 0.05. The hypotheses are H0: there is no trend in the series; Ha: there is a trend in the series. The value of Kendall's tau is 0.994, with S = 33,008, variance of S = 1939506.667 and a p-value of less than 0.0001. As the computed p-value is lower than the significance level alpha = 0.05, one should reject the null hypothesis H0 and accept the alternative hypothesis Ha. The risk of rejecting the null hypothesis H0 while it is true is lower than 0.01%. Sen's slope is 12,275, with confidence interval (10767.131, 13831.833). The Mann-Kendall trend for total recovered cases is shown in Fig. 7 and the total error is depicted in Fig. 8. For total recovered cases, using the Box-Cox transformation with lambda equal to zero and differencing of order zero, a polynomial regression is fitted with goodness-of-fit statistics R² = 0.691 and adjusted R² = 0.69. For the seasonal fit, the goodness-of-fit statistics are R² = 0.002 and adjusted R² = −0.046. All trends are shown in Fig. 9 and Fig. 10. Fig. 11 shows the forecast and trend analysis using the ARIMA (2,1,0) model for total confirmed, total death, total recovered and total active cases, with detailed values in Table 4.
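For readers who wish to reproduce statistics of this kind, the Mann-Kendall quantities (S, its variance, Kendall's tau, the two-sided p-value) and Sen's slope can be sketched in plain Python as follows. This is a simplified version that assumes no tied values, so it applies no tie correction to the variance and uses the tau-a form; the function names and the short series are hypothetical and are not the data analyzed here.

```python
import math
from statistics import NormalDist, median


def mann_kendall(series):
    """Return S, Var(S), Kendall's tau, Z and the two-sided p-value (no tie correction)."""
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # sign of each pairwise difference
    var_s = n * (n - 1) * (2 * n + 5) / 18
    tau = s / (n * (n - 1) / 2)
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)  # continuity correction
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return s, var_s, tau, z, p_value


def sens_slope(series):
    """Median of all pairwise slopes (x_j - x_i) / (j - i) for j > i."""
    n = len(series)
    slopes = [(series[j] - series[i]) / (j - i)
              for i in range(n - 1) for j in range(i + 1, n)]
    return median(slopes)


# Illustrative, strictly increasing series (not data from the paper).
example = [1, 2, 3, 4, 5]
print(mann_kendall(example))
print(sens_slope(example))
```

For a strictly increasing series such as cumulative recovered cases, every pairwise difference is positive, so tau is close to 1, which is consistent with the value of 0.994 reported above.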
For the nonlinear regression of total recovered cases, the degrees of freedom are 232 and the coefficient of determination is 0.992, with SSE = 11642055841827.9, MSE = 51513521424.02 and RMSE = 226965.904 after 200 iterations. The nonlinear regression with residuals and trends is shown in Fig. 12. According to the BIC criterion, the best mixture model is the one with 3 components. The EM algorithm did not converge, so the maximum number of iterations should be increased beyond 50. The Gaussian mixture model is used for the analysis of total cases. As the NEC is greater than 1, there is no clustering in the data. The MAP classification and the fitted distribution using Gaussian mixture modelling are shown in Fig. 13. From the figures and tables, it is observed that total confirmed, total death, total active and total recovered cases are highly correlated. For ARIMA (2,1,0), it is observed that the value of the constant is zero for all cases. Total confirmed, total death, total recovered and total active cases are fitted well by the ARIMA (2,1,0) model, but daily deaths initially show a perturbed or random pattern that is not fitted perfectly by the ARIMA (2,1,0) model; later, they show a pattern similar to the ARIMA (2,1,0) forecast. It is observed that the actual and forecast values from the ARIMA (2,1,0) model for August 3, 2020 to August 11, 2020 give the better results. It is concluded that the ARIMA (2,1,0) model gives the best fit for both long-term and short-term behavior. Nonlinear regression and the Gaussian mixture model also show the same trend for total cases as forecast using ARIMA (2,1,0). As the number of cases is increasing, it is advised that proper precautionary and health guidelines be strictly adhered to in order to fight the COVID-19 pandemic and remain healthy and safe.
Table 4 Forecasting with the ARIMA (2,1,0) model, where A = actual, P = predicted, F = forecast.
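The structure of an ARIMA (2,1,0) forecast with zero constant can be illustrated as follows: the series is differenced once, an AR(2) model is fitted to the differences, and the forecast differences are added back to the last observed level. The sketch below estimates the two AR coefficients by ordinary least squares, which is a simplification of the maximum-likelihood estimation that statistical packages typically use; the function names and the synthetic series are assumptions for illustration only.

```python
def fit_ar2_on_differences(series):
    """Least-squares AR(2) coefficients (no constant) for the first differences of the series."""
    d = [b - a for a, b in zip(series[:-1], series[1:])]
    y = d[2:]      # d_t
    x1 = d[1:-1]   # d_{t-1}
    x2 = d[:-2]    # d_{t-2}
    s11 = sum(u * u for u in x1)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s22 = sum(v * v for v in x2)
    g1 = sum(u * t for u, t in zip(x1, y))
    g2 = sum(v * t for v, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    phi1 = (s22 * g1 - s12 * g2) / det
    phi2 = (s11 * g2 - s12 * g1) / det
    return phi1, phi2


def forecast(series, phi1, phi2, steps):
    """Forecast future levels by predicting differences and adding them to the last level."""
    d_prev2 = series[-2] - series[-3]
    d_prev1 = series[-1] - series[-2]
    level = series[-1]
    out = []
    for _ in range(steps):
        d_next = phi1 * d_prev1 + phi2 * d_prev2
        level += d_next
        out.append(level)
        d_prev2, d_prev1 = d_prev1, d_next
    return out


# Synthetic cumulative series whose differences follow d_t = 0.5*d_{t-1} + 0.2*d_{t-2}
# (not data from the paper).
diffs = [10.0, 2.0]
for _ in range(10):
    diffs.append(0.5 * diffs[-1] + 0.2 * diffs[-2])
series = [100.0]
for step_size in diffs:
    series.append(series[-1] + step_size)

phi1, phi2 = fit_ar2_on_differences(series)
print(phi1, phi2)
print(forecast(series, phi1, phi2, steps=3))
```

On this noise-free series the least-squares fit should recover the generating coefficients; on real cumulative case counts the fitted coefficients and the forecast intervals would come from the chosen statistical package.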

Conclusion and future work
Forecasting the prevalence of COVID-19 as a pandemic in India plays an important role in helping policy makers and the health department to focus on strengthening the surveillance system and reallocating resources. It is observed that the COVID-19 data follow quadratic behavior and that, in the long run, the spread shows a high peak roughly estimated in July 2020. Also, nonlinear regression shows that the long-run trend follows a Gaussian mixture model. It is concluded that COVID-19 will follow a secondary shock wave from the end of October 2020 to mid-November 2020. India is approaching herd immunity. It is also observed that the impact of the pandemic will last about 441 to 465 days. Thus, people should follow precautionary measures and maintain social distancing with all safety measures, as the pandemic is not under control owing to the non-availability of medicines.
Time series models play an important role in predicting and controlling the disease. The results of the study can help policymakers to reallocate resources such as hospitals, staff and the facilities required for critically infected people. Cases are increasing every day in the country, and more attention needs to be paid to the utilization of the available resources. The analysis helps in understanding the complex nature of the spread of the disease. In future research, this method can be compared with other models such as neural networks and machine learning.