1 Introduction

In December 2019, the first case of COVID-19 was identified in Wuhan City, China, and the disease grew into a major pandemic during the first quarter of 2020. The novel coronavirus is a major problem of this decade because of its high impact on public health (Togacar et al. 2020). Across the globe, the number of reported cases is 106.61 million (10.66 crore) and the number of deaths is 23.16 lakh, with the US, India, Brazil, the UK and Russia being the most affected countries. In India, the number of cases reported as of the first week of February 2021 is 1.08 crore and the number of deaths is 1.55 lakh (Velásquez and Lara 2020). The virus has its greatest impact on elderly people, particularly those with multiple health issues, and the number of infections multiplies rapidly. The virus poses a major challenge to government officials, health workers and researchers in controlling its spread and effect. Measures such as social distancing and lockdown have been implemented across the globe for several months to control and prevent the spread of the virus. It is a great challenge for researchers to understand the behaviour of the virus and the features that have the greatest influence on spreading or controlling it. Hence, several mathematical models have been developed to estimate and predict the number of infected cases and to identify the evolution pattern of the virus (Benvenuto et al. 2020; Ceylan 2020). The literature shows that the Susceptible, Exposed, Infected and Removed (SEIR) model and the Susceptible, Infected and Removed (SIR) model are effective approaches for forecasting the spread of the virus, and that the SIR model performs better than the SEIR model according to the Akaike Information Criterion (Jia et al. 2019; Peng et al. 2020; Roosa et al. 2020; Liu et al. 2020; Zhihua et al. 2020). Models such as SIR with Euclidean Network, the Generalized Logistic Growth Model, the Richards Growth Model and the Gompertz Model have also been proposed for predicting the spread of the virus (Biswas et al. 2020; Wu et al. 2020).

To facilitate medical assistance for COVID-19 patients, it is essential to predict the number of possible cases so that health systems are well prepared and loss of life is prevented. Time-series prediction of cases is one technique that can be implemented using machine learning and deep learning algorithms. Supervised machine learning algorithms such as LASSO regression, Support Vector Machine (SVM) and Exponential Smoothing (ES) have been applied to predict the spread of the virus, and ES proved to be the best of the three approaches (Rustam et al. 2020). Among deep learning approaches, LSTM proves to be the best model as it is capable of handling time-based data sets.

The literature shows that deep learning algorithms yield better results than traditional machine learning algorithms. The survey reveals that prediction has been carried out for developed countries and that the data sets consider the number of cases reported, infected, cured and deceased on a day-to-day basis. However, other parameters such as population, health background of the region, climatic conditions, financial viability, education and medical facilities are not considered. Several studies reveal that the spread of a virus is closely associated with temperature conditions when tested using epidemiological analysis and mathematical modelling (Lowen et al. 2007; Barreca and Shimshack 2012; Zuk et al. 2009). In the proposed model, weather conditions and population features are included in predicting COVID-19 cases, along with the daily numbers of infected, cured and deceased, using the deep learning algorithms CNN, RNN, BRNN, LSTM and BLSTM.

The filters of a Convolutional Neural Network (CNN) are capable of retrieving the relevant features from the sample input data. A CNN implements parameter sharing: the same filter is applied to various parts of the input to extract the feature map. To address time-dependent learning, the Recurrent Neural Network (RNN) was developed. In an RNN, the input for each subsequent round depends on the historical output, and hidden states are maintained across rounds. For handling time-series data, the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models are available. In a Bidirectional RNN (BRNN), the results of two independent RNNs are integrated and provided as input to the next round. One RNN receives the sequence in forward time order and the other receives it in reverse time order. The results of the two networks are concatenated at every iteration and the results are summed up. To process sequential data, the Bidirectional Long Short-Term Memory (BLSTM) has been introduced; it stores long-range context through a combination of non-linear and linear feedback loops.
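The forward and reverse passes described above can be illustrated with a minimal sketch. It uses a scalar Elman-style recurrent cell with hypothetical weights (`w_x`, `w_h`, `b`) and a toy input series; none of these values come from the study. It shows only the per-step pairing of forward and backward hidden states, not the subsequent summation or any training.

```python
import math

def rnn_pass(inputs, w_x, w_h, b):
    """Run a scalar recurrent cell and return the hidden state at every step."""
    h = 0.0
    states = []
    for x in inputs:
        # The new hidden state depends on the current input and the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

def bidirectional_states(inputs, fwd_params, bwd_params):
    """Pair each time step's forward state with its backward state."""
    forward = rnn_pass(inputs, *fwd_params)
    # The backward network reads the sequence in reverse; its states are then
    # flipped so that index i again refers to time step i.
    backward = rnn_pass(inputs[::-1], *bwd_params)[::-1]
    return list(zip(forward, backward))

# Hypothetical weights and a toy, already-scaled daily series.
series = [0.1, 0.4, 0.3, 0.8]
states = bidirectional_states(series, (0.5, 0.8, 0.0), (0.5, 0.8, 0.0))
```

Each entry of `states` holds the context accumulated from the past (first element) and from the future (second element) of that time step.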

Section 2 discusses the weather, population and COVID-19 data sets, Sect. 3 describes the implementation of the models using deep learning algorithms, and Sect. 4 discusses the results and performance of the model.

2 Data set

The environmental factors pertaining to a specific region have a considerable influence on the spread of the disease. In developing countries like India, factors such as population, sanitation, hygiene awareness, water, food and climate play a vital role in the spread of the virus. The data sets related to weather reports (Weather Data Set https://www.wunderground.com/; Kaggle 2020; github 2020) and COVID-19 (COVID-19 Data Set: https://github.com/CSSEGISandData) are collected from various sources on a day-wise basis from January 2020 onwards. The proposed work aims at identifying the relationship between the spread of the virus and weather features such as temperature, humidity, dew, precipitation, wind and pressure across several major regions in India. Similarly, the numbers of infected cases, deaths and cases under treatment due to COVID-19 (Dash et al. 2021) are retrieved across all regions of India on a day-to-day basis. Table 1 presents sample weather data and COVID-19 cases for the city of Chennai, Tamil Nadu, India, for the first ten days of November 2020; these are the features considered in building the model to predict the spread of the virus. Station-wise daily data is collected by web scraping. Figure 1 depicts the number of cases reported, deceased and recovered in India from January 2020 to December 2020. The graph shows that cases are high during the monsoon season across India, specifically in September, October and November 2020. Another major factor in the rise in cases is the gradual release of lockdown by the respective State Governments during this period. In India, until the end of August 2020 it was mandatory for citizens to register and get approval to move from one district to another. However, from September onwards the rule was relaxed, which is also one of the major reasons for the rise in the number of infections.

Table 1 Weather and COVID-19 sample data set

Fig. 1 COVID-19 infections in India between January and December 2020

Figure 2 shows the temperature, wind speed and humidity on 22nd July 2020, a day on which the number of deaths was high.

Fig. 2 Impact of climatic conditions and virus outbreak

The observation reveals that the virus spreads extensively when temperature and humidity are high, as seen in the states of Tamil Nadu and Maharashtra.

Apart from the natural factors, the spread of the virus depends on the population of the region of interest. In the initial days, it is observed that the virus spreads extensively where the population is sufficiently large and the density is high. Figure 3a shows the projected population of India as on 30th December 2020 (Suresh et al. xxxx). The data is sourced from the Unique Identification Authority of India (UIDAI), a Government of India organization. Figure 3b shows the population density in India.

Fig. 3 a Projected population of India (as on 30.12.2020). b Population density in India

To predict the impact of climatic conditions, the models are built on this data set. The size of the data set plays a vital role in the prediction process. The training, testing and validation data sets are chosen at random. The validation data set is isolated from the model building process (Trappenberg 2019). The construction of the model is discussed in the next section.

3 Prediction of COVID-19 cases using weather and population

Figure 4 represents the generic flow of model building using deep learning approaches, considering the COVID-19 data set of India, the weather data (temperature, wind speed and humidity) and the population data of the Indian subcontinent.

Fig. 4 Workflow of prediction of COVID-19 by applying deep learning algorithms

The objective is to identify the correlation between temperature, wind speed and humidity and the spread of the virus. Population is another major attribute in identifying the rate at which the virus spreads. Data pre-processing is the next major task performed on the collected data set. The cleaned data is divided into three sets: training, testing and validation, in the ratio 80:10:10 (Trappenberg 2019). The model is built on the training data set by applying the deep learning algorithms CNN, RNN and BRNN. Based on the level of accuracy, the model is tuned and the number of epochs is increased accordingly. Finally, the model is tested and validated with the appropriate data sets. The data set reserved for validation is not exposed during the training or testing phase (Lee 2019; Aslam et al. 2021; Bhuyan et al. 2021).
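The 80:10:10 partition can be sketched as follows. This is a minimal illustration that assumes a simple random shuffle, in line with the statement that the sets are randomly chosen; the function name `split_dataset`, the seed and the one-record-per-day toy data are hypothetical and not taken from the study.

```python
import random

def split_dataset(records, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle records and split them into training, testing and validation sets."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_test = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    # The remainder is held out and never shown to the model during building.
    validation = shuffled[n_train + n_test:]
    return train, test, validation

days = list(range(366))  # one hypothetical record per day of 2020
train, test, validation = split_dataset(days)
```

Because the first two sizes are rounded down, any leftover records fall into the validation set, so the three sets always cover the full data set without overlap.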

Feature selection is one of the major tasks in data pre-processing. In the proposed work, the features considered are temperature, wind speed, humidity, dew and population to identify the impact of the virus. The Random Forest algorithm (Paul et al. 2018; Suresh et al. 2021; Homenda and Lesinski 2011) is applied to identify the relative importance of the features. Figure 5a, b presents the ranking of features relating to infections and deaths due to COVID-19. Temperature plays a vital role in the spread of the virus, which is clearly observed in both the number of infected cases and the number of deaths.

Fig. 5 a Feature importance on COVID 19 infected cases. b Feature importance on COVID 19 deceased cases

In the proposed work, the correlation between the weather attributes and the numbers of deceased and infected COVID-19 cases is analysed for the Indian subcontinent. The dependent variable is the number of confirmed COVID-19 cases, normalized by applying a log transformation. The relationship between temperature, dew, wind speed, rainfall, humidity, population and population density and the COVID-19 cases is estimated by applying the LASSO regression model. LASSO regression can reduce the impact of variables that do not contribute substantially to the prediction (Roth 2004). As seen earlier, temperature and humidity have a greater impact than the other features, so the LASSO model is well suited to capturing the correlation. Based on the feature selection ranking, LASSO regression is applied to the attributes temperature and humidity. It is observed that fewer cases are registered when the recorded average temperature on a given day is below 80 °F and the humidity is 70%. Therefore, the threshold for temperature is set to 80 degrees Fahrenheit and the threshold for humidity is set to 70 percent. The hypothesis is that when temperature and humidity increase, the rate of spread of the virus and the number of deaths decrease. The experimental result reveals an inverse relationship between temperature and humidity on the one hand and the number of infected and deceased cases on the other. The procedure for predicting the number of infected and deceased cases is divided into Model A and Model B.
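The way LASSO suppresses weak predictors can be illustrated with a small coordinate-descent sketch. It is not the implementation used in the study: the objective (half the mean squared residual plus an L1 penalty), the `penalty` value, the toy records and the synthetic target are all assumptions chosen so that wind speed has no effect by construction.

```python
import itertools

def soft_threshold(value, penalty):
    """Shrink value toward zero; anything within [-penalty, penalty] becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_fit(rows, targets, penalty, sweeps=200):
    """Coordinate-descent LASSO minimising (1/2n)*RSS + penalty*sum(|w_j|)."""
    n, p = len(rows), len(rows[0])
    col_means = [sum(r[j] for r in rows) / n for j in range(p)]
    y_mean = sum(targets) / n
    # Centre features and target so the intercept can be recovered afterwards.
    x = [[r[j] - col_means[j] for j in range(p)] for r in rows]
    y = [t - y_mean for t in targets]
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0
            for i in range(n):
                # Partial residual: prediction from every feature except j.
                others = sum(w[k] * x[i][k] for k in range(p) if k != j)
                rho += x[i][j] * (y[i] - others)
            rho /= n
            scale = sum(x[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, penalty) / scale if scale > 0 else 0.0
    intercept = y_mean - sum(w[j] * col_means[j] for j in range(p))
    return w, intercept

# Hypothetical toy records: temperature (deg F), humidity (%), wind speed.
records = list(itertools.product([75.0, 85.0], [60.0, 80.0], [4.0, 6.0]))
rows = [[t - 80.0, h - 70.0, wind] for t, h, wind in records]
# Synthetic stand-in for log(cases / population); wind is irrelevant by design.
targets = [-9.0 - 0.05 * (t - 80.0) - 0.02 * (h - 70.0) for t, h, _ in records]
weights, intercept = lasso_fit(rows, targets, penalty=0.1)
```

On this toy data the temperature and humidity coefficients come out negative and slightly shrunk toward zero, while the wind coefficient is set exactly to zero, which is the selection behaviour the paragraph above relies on.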

Model A (Infected) predicts the number of infected cases against temperature and humidity, and Model B (Death) predicts the number of deceased cases against temperature and humidity. Equations 1 and 2 represent the models for predicting the number of possible infections and deaths. Temperature is the primary independent variable, with humidity and dew as additional predictors. Model A is evaluated from the total population in the given region and the number of infected cases, and Model B is computed against the number of deceased cases. The variable α represents the rate of change of temperature in the region of interest and is computed by considering the mean of the temperatures recorded. Similarly, β is the rate of change of humidity and γ is the rate of change of the dew factor recorded in the region. Based on the error rate, the model is adjusted.

$$ A_{i}: \log\left(\frac{I_{c}}{T_{p}}\right) = \alpha (t - 80) + \beta (h - 70) + \gamma (d - 75) + \varepsilon \tag{1} $$
$$ B_{d}: \log\left(\frac{D_{c}}{T_{p}}\right) = \alpha (t - 80) + \beta (h - 70) + \gamma (d - 75) + \varepsilon \tag{2} $$

where $A_{i}$ is Model A (number of infected cases), $B_{d}$ is Model B (number of deceased cases), $I_{c}$ is the number of infected cases as on 22nd July 2020, $D_{c}$ is the number of deceased cases as on 22nd July 2020, $T_{p}$ is the total population in the Indian subcontinent, $t$, $h$ and $d$ are the recorded temperature, humidity and dew values, $\alpha$ is the rate of change of temperature, $\beta$ is the rate of change of humidity, $\gamma$ is the rate of change of dew factor, and $\varepsilon$ is the training epoch of the neural network.
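As a worked illustration of Eqs. 1 and 2, the sketch below evaluates the right-hand side for a given day and converts the result back to a count. The coefficient values are hypothetical, the logarithm is assumed to be natural, and the additive term ε is passed in as a plain number named `offset`; all three are assumptions of this sketch rather than values reported in the study.

```python
import math

def predicted_log_rate(t, h, d, alpha, beta, gamma, offset=0.0):
    """Right-hand side of Eq. 1 or 2, centred on the 80/70/75 thresholds."""
    return alpha * (t - 80.0) + beta * (h - 70.0) + gamma * (d - 75.0) + offset

def predicted_cases(log_rate, total_population):
    """Invert log(count / T_p) to recover a count (natural log assumed)."""
    return math.exp(log_rate) * total_population

# Hypothetical coefficients; negative signs mirror the reported inverse relationship.
ALPHA, BETA, GAMMA, OFFSET = -0.05, -0.02, -0.01, -7.0

at_threshold = predicted_log_rate(80.0, 70.0, 75.0, ALPHA, BETA, GAMMA, OFFSET)
hotter_day = predicted_log_rate(85.0, 70.0, 75.0, ALPHA, BETA, GAMMA, OFFSET)
```

At the threshold values every centred term vanishes, so the prediction reduces to the additive term alone; with a negative α, a day 5 °F warmer yields a lower predicted log rate.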

The model is trained, tested and evaluated by applying the deep learning approaches Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Bidirectional RNN (BRNN), Long Short-Term Memory (LSTM) and Bidirectional LSTM (BLSTM) while varying the number of epochs. For a high level of accuracy, the parameters of the deep learning algorithms are configured as follows: the learning rate is 0.0005, the number of hidden layers is 8, the number of epochs is 500 and the timestep is 5. Figure 6a, b shows the model evaluation for deceased and infected cases respectively for the four quarters from January 2020 to December 2020. It is evident that the numbers of infected and deceased cases predicted by the proposed model are close to the actual values.
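A timestep of 5 is commonly realised by slicing the daily series into overlapping windows of five consecutive days, each paired with the following day as the target. The text does not state how the windows were formed, so the sketch below is one plausible reading; the function name `make_windows` and the case counts are hypothetical.

```python
def make_windows(series, timestep=5):
    """Turn a daily series into (input window, next-day target) pairs."""
    samples = []
    for start in range(len(series) - timestep):
        window = series[start:start + timestep]
        target = series[start + timestep]
        samples.append((window, target))
    return samples

daily_cases = [10, 12, 15, 19, 24, 30, 37, 45]  # hypothetical counts
samples = make_windows(daily_cases, timestep=5)
```

A series of eight days therefore produces three training samples, each of which a recurrent model would read one day at a time.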

Fig. 6 Prediction of deceased and infected cases using deep learning algorithms

The level of accuracy is 93.23% for deceased cases across all the quarters and 92.32% for infections. The results reveal that temperature, humidity and dew factor play a vital role in the spread of the virus.

4 Results and discussion

The proposed prediction model is evaluated by computing the following indexes: Mean Absolute Error (MAE), Mean Square Error (MSE), Root Mean Square Error (RMSE) and R-Squared (R2). Table 1 presents the performance of both models for varying temperature, humidity and dew factor, together with the resulting MSE, RMSE, R-Squared and MAE. MAE is the average absolute difference between the actual and predicted values in the data set; MSE is the average of the squared differences; and RMSE, the square root of the MSE, gives the standard deviation of the prediction errors. R-Squared (R2) represents the proportion of variance in the dependent variable that is explained by the model, and its value does not exceed one (Dash and Dash 2017). Figure 7 presents the evaluation of the deep learning algorithms against the indexes MAE, MSE, RMSE and R2.
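The four indexes can be computed directly from paired actual and predicted values. The sketch below uses their standard definitions; the function name `regression_metrics` and the sample values are hypothetical and do not reproduce any result from the study.

```python
import math

def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R2 for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    if ss_total == 0:
        raise ValueError("R2 is undefined when all actual values are equal")
    ss_residual = sum(e * e for e in errors)
    r2 = 1.0 - ss_residual / ss_total
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical daily counts and model outputs.
actual = [100.0, 120.0, 140.0, 160.0]
predicted = [110.0, 115.0, 150.0, 155.0]
metrics = regression_metrics(actual, predicted)
```

For these sample values the errors are -10, 5, -10 and 5, giving an MAE of 7.5, an MSE of 62.5 and an R2 of 0.875. R2 equals one only for a perfect fit and can become negative when the model is worse than predicting the mean.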

Fig. 7 Evaluation of RNN, BRNN, LSTM and BLSTM for predicted COVID 19 infections and deaths

The study provides a comparison of the deep learning algorithms RNN, BRNN, LSTM and BLSTM for forecasting COVID-19 cases (infected and deceased) in India. When the climatic conditions and population of India are considered, the BRNN algorithm provides better results than the other models. Other features such as lockdown, the health conditions of infected patients and other climatic conditions may also have a significant impact on the spread of the disease. The impact of the disease after the rollout of vaccination also remains to be studied.

5 Conclusion and future scope

The spread of novel COVID-19 motivates the study of the relationship between climatic conditions and the disease. The factors temperature, humidity and population in a specific region play a vital role in the spread of the virus. The mathematical models built on these attributes are evaluated by applying the deep learning algorithms CNN, RNN, BRNN, LSTM and BLSTM. During the complete lockdown in India from April to May 2020, combined with high summer temperatures, the number of reported cases was low. Once the lockdown was lifted and temperatures across India fell considerably, the number of reported cases started increasing. The experimental results reveal that a reduction in temperature leads to an increase in the number of cases. The level of accuracy is high, at 93.23% for deceased cases and 92.32% for infections. However, the accuracy can still be increased by tuning the model with a more accurate data set. The proposed model is restricted to the climatic conditions and population of the Indian subcontinent, and hence it is necessary to build a generic model capable of predicting the spread of the virus elsewhere. The results suggest that officials should impose lockdowns, maintain social distancing, ensure medical emergency preparedness and increase the production and uptake of vaccines. At present, a mutant of the novel COVID-19 virus is spreading rapidly in European countries; as future work, it is proposed to study the role of climatic factors in identifying the variant of the virus.