Simple Correlation Between Weather and COVID-19 Pandemic Using Data Mining Algorithms

Indonesia is still struggling against the coronavirus, and the number of cases is increasing every day. The virus, which began to spread around December 2019, has become a pandemic and is causing serious problems because there is still no vaccine to prevent transmission. The Government of the Republic of Indonesia, as the party responsible for handling this pandemic, has made various efforts, including a lockdown in the form of the Large-Scale Social Restriction (PSBB) policy. Another effort is a publicly accessible digital information portal that presents the current status of coronavirus patients. Building on developments in information and communication technology, this research looks for a correlation between weather and COVID-19, using a data mining algorithm to process government data sources and produce knowledge that can support decision making. The research data come from the COVID-19 surveillance data of the Ministry of Health of the Republic of Indonesia and from the climatic conditions in Surabaya, Indonesia, namely the average temperature (°C), the average humidity (%), and the average duration of sunlight (hours), taken from the Department of Meteorology of the Republic of Indonesia. The results of the study indicate that the weather affected the number of COVID-19 patients. In descending order of correlation, the variables are the average duration of sunlight (hours), the average temperature (°C), and the average humidity (%).


Introduction
Coronavirus disease 2019 (COVID-19) is caused by a new type of virus that was first identified in Wuhan, China, in December 2019. The virus, which then spread rapidly throughout China and to countries around the world, has been declared a pandemic by the World Health Organization (WHO) [1]. According to the study "The Novel Corona Virus Originating in Wuhan, China: Challenges for Global Health Governance", on December 31, 2019, China reported to the WHO a case of pneumonia in Wuhan, Hubei Province, caused by the 2019-nCoV virus [2].
In Indonesia, the first COVID-19 case was identified on March 2, 2020. By March 17, 2020, there were 134 confirmed cases from eight provinces, namely Bali, Banten, DKI Jakarta, West Java, Central Java, West Kalimantan, North Sulawesi, and Yogyakarta [3]. Figure 1 shows the distribution of COVID-19 in Indonesia.
Some research in the field of Artificial Intelligence addresses the COVID-19 epidemic with a model-based approach to identifying "drug-repurposing" treatments for infected patients [4]. Other work is reported in [5]. Research in the field of Deep Learning has produced computed tomography (CT) image processing models; with 499 training samples and 131 test samples, the proposed model reached an accuracy of 0.901, with a positive predictive value of 0.840 and a negative predictive value of 0.982 [6]. In line with this research, another Deep Learning study reports high performance, with an Area Under Curve (AUC) of 0.994, 94% sensitivity, and 98% specificity (at threshold 0.5) [7]. Another study, using the Naive Method algorithm, classified COVID-19 infection status and obtained 55.48% positive and 44.52% negative predictions out of 249 records; expressed as counts, 139 people were classified as positive and 110 as negative [8]. In addition, research using the C4.5 algorithm was evaluated with a three-class confusion matrix over the COVID-19 supervision categories and yielded an accuracy of 0.9286 (92.86%) [9].
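The evaluation figures quoted above (accuracy, positive and negative predictive value, sensitivity, specificity) all derive from the four cells of a binary confusion matrix. A minimal Python sketch of those definitions follows; the function name `binary_metrics` and the counts are hypothetical and are not taken from [6] or [7].

```python
def binary_metrics(tp, fp, tn, fn):
    """Return common diagnostic metrics from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # share of all predictions that are correct
        "sensitivity": tp / (tp + fn),      # share of actual positives that are detected
        "specificity": tn / (tn + fp),      # share of actual negatives that are detected
        "ppv": tp / (tp + fp),              # positive predictive value
        "npv": tn / (tn + fn),              # negative predictive value
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```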
The correlation between weather conditions and the spread of viruses is currently an active research topic [10][11][12][13][14][15]. In connection with the West Nile virus in the United States and Europe, [16][17] studied the relationship between climatic conditions and SARS-CoV and reported that climate variables can be a cause of biological interactions between SARS-CoV and humans. Other research also states that weather correlates very significantly with changes in death rates due to pneumonia [18]. Viral transmission is affected by several factors, such as temperature, humidity, and population density [19]. Tosepu studied the correlation between weather conditions and COVID-19 in Jakarta, Indonesia, and found significant results [20].
In this study, the ID3 algorithm was used to find a correlation between the number of confirmed COVID-19 patients per day and changes in weather conditions. The patient counts came from the COVID-19 surveillance data of the Ministry of Health of the Republic of Indonesia, and the weather variables were the average temperature (°C), the average humidity (%), and the average duration of sunlight (hours) in Surabaya, taken from the Department of Meteorology of the Republic of Indonesia.

Research Design
Data mining, also called knowledge discovery in databases (KDD), is the activity of collecting and using historical data to find sequences, patterns, or relationships between variables in large data sets. The results of data mining can be used to support future decisions [21][22].
The advantages of the decision tree algorithm include its ability to produce easy-to-understand knowledge structures, its low computational cost when classifying new cases, its ability to handle symbolic and numeric input variables, and its clear indication of which attributes are most important for prediction or classification [23].
In this study, we used the ID3 decision tree algorithm to classify the data and extract rules from the data set, as shown in Figure 2. A decision tree is a hierarchical organization of nodes and links in which each node, except the root node, has one incoming link. Each internal node is a predictive feature, and each outgoing link represents one value of that feature. Decision rules, read from the tree and checked by testing, are a way of representing the extracted knowledge. ID3 selects the best attribute from the given set at each iteration and then divides the data set on it. Although the algorithm always produces a solution, it does not guarantee an optimal one. ID3 can also overfit the training data; to counter this, a smaller tree is preferred over a larger one. Figure 3 shows the distribution map of COVID-19 patients in Surabaya.
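To illustrate how decision rules can be read from a decision tree, the following Python sketch walks every root-to-leaf path and emits one rule per path. The nested-dictionary representation, the attribute names, and the category labels ("short", "long", "cool", "warm", "high", "low") are assumptions made for this example; they are not the tree obtained in this study.

```python
# Hypothetical tree: each dict maps one attribute to its branches;
# any non-dict value is a leaf holding the predicted class.
tree = {
    "sunlight": {
        "short": "high",
        "long": {"temperature": {"cool": "high", "warm": "low"}},
    }
}


def extract_rules(node, conditions=()):
    """Return a list of (conditions, label) pairs, one per root-to-leaf path."""
    if not isinstance(node, dict):
        return [(list(conditions), node)]
    attribute, branches = next(iter(node.items()))
    rules = []
    for value, child in branches.items():
        rules.extend(extract_rules(child, conditions + ((attribute, value),)))
    return rules


def format_rule(conditions, label):
    """Render one path as an IF ... THEN ... rule."""
    clause = " AND ".join(f"{attr} = {val}" for attr, val in conditions)
    return f"IF {clause} THEN cases = {label}"


rules = [format_rule(conds, label) for conds, label in extract_rules(tree)]
for rule in rules:
    print(rule)
```

Each path from the root to a leaf yields exactly one rule, so a smaller tree also gives a shorter rule set.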

Figure 3. Distribution map of Covid-19 in Surabaya City
The number of cases confirmed each day was collected from the official website of the city of Surabaya at https://lawancovid-19.surabaya.go.id/visualisasi. The weather data set was obtained from the Department of Meteorology of the Republic of Indonesia. It consists of the average temperature (°C), the average humidity (%), and the average duration of sunlight (ss, in hours). The data cover 1 June 2020 to 1 July 2020.
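One way to assemble such a data set is to align the daily case counts with the daily weather readings by date. The Python sketch below assumes both sources have already been loaded into dictionaries keyed by date; the function names, field names, and all values are placeholders, not the study's data or the authors' procedure.

```python
from datetime import date, timedelta

START, END = date(2020, 6, 1), date(2020, 7, 1)


def study_days(start=START, end=END):
    """Return every calendar day in the study period, both endpoints included."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def join_by_date(cases, weather):
    """Combine case counts and weather readings for days present in both sources."""
    records = []
    for day in study_days():
        if day in cases and day in weather:
            records.append({"date": day, "cases": cases[day], **weather[day]})
    return records


# Placeholder values, for illustration only.
cases = {date(2020, 6, 1): 10, date(2020, 6, 2): 12}
weather = {
    date(2020, 6, 1): {"temperature": 28.0, "humidity": 75.0, "sunlight": 6.0},
    date(2020, 6, 2): {"temperature": 29.0, "humidity": 72.0, "sunlight": 7.5},
    date(2020, 6, 3): {"temperature": 28.5, "humidity": 74.0, "sunlight": 5.0},
}
records = join_by_date(cases, weather)
print(len(study_days()), "days in the period;", len(records), "joined records")
```

Days that appear in only one source are skipped, so a gap in either series shows up as a shorter joined data set.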

Study Area
This study was conducted in Surabaya, one of the major cities in Indonesia. Surabaya is the capital of East Java Province and is located on the province's north coast, between 7°9′-7°21′ South Latitude and 112°36′-112°54′ East Longitude.

ID3 Algorithm
The ID3 (Iterative Dichotomiser 3) algorithm was developed by J. Ross Quinlan at the University of Sydney and was presented in the journal Machine Learning, volume 1, number 1 (1986) [24]. ID3 is based on the Concept Learning System (CLS) algorithm. It is a supervised learning algorithm that constructs a decision tree [25], and the resulting tree is used to classify future samples. ID3 builds the tree from the information gained from the training instances and then uses the same tree to classify the test data [25]. It generally uses nominal attributes without missing values for classification. Figure 4 shows the ID3 algorithm pseudocode [26].
Attribute selection in ID3 relies on a statistical property called information gain. Gain measures how well an attribute separates the training data into the target classes, and the attribute with the highest information gain is selected. The gain is calculated from an information-theoretic quantity called entropy, given by equation (1):
Entropy(S) = -P_a \log_2 P_a - P_b \log_2 P_b    (1)
In equation (1), P_a is the probability that a sample in S belongs to the positive class. It is calculated by dividing the number of positive samples (S+) by the total number of samples (S), so that P_a = S+/S. P_b is the probability that a sample in S belongs to the negative class. It is calculated by dividing the number of negative samples (S−) by the total number of samples (S), so that P_b = S−/S. The leaf part of a decision tree. In the ID3 algorithm, the reduction in entropy is called information gain. The information gain of attribute A on sample S is calculated using equation (2):
Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)    (2)
where Values(A) is the set of values of attribute A and S_v is the subset of S for which A has value v.
Figure 5 shows the number of patients, obtained from the official website of the city of Surabaya at https://lawancovid-19.surabaya.go.id/visualisasi. Table 1 shows the data set used in this study. The motivation for this research is the interest in studying COVID-19 in Indonesia.
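The entropy and information gain used by ID3 for attribute selection can be sketched in a few lines of Python. This is a minimal illustration that assumes nominal attributes stored as dictionaries; the function names, attribute names, and the four sample rows are hypothetical and do not come from Table 1.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder


# Hypothetical nominal records, for illustration only.
rows = [
    {"sunlight": "short", "humidity": "dry", "cases": "high"},
    {"sunlight": "short", "humidity": "humid", "cases": "high"},
    {"sunlight": "long", "humidity": "dry", "cases": "low"},
    {"sunlight": "long", "humidity": "humid", "cases": "low"},
]
gain_sunlight = information_gain(rows, "sunlight", "cases")
gain_humidity = information_gain(rows, "humidity", "cases")
print(f"gain(sunlight) = {gain_sunlight:.3f}, gain(humidity) = {gain_humidity:.3f}")
```

In this toy data the sunlight attribute separates the two classes perfectly, so its gain equals the full entropy of the target, while humidity carries no information about the class.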
For this purpose, we look for a correlation between the weather variables, namely the average duration of sunlight, the average temperature, and the average humidity, and the confirmed cases of COVID-19 using the ID3 algorithm. The algorithm is used to understand which variables most influence the COVID-19 growth curve. Following Figure 4, the ID3 algorithm finds that the weather variables affecting the COVID-19 growth curve are, in order, the average duration of sunlight (hours), the average temperature (°C), and the average humidity (%). The resulting decision tree is shown in Figure 7.
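The way ID3 ranks attributes by gain and grows a tree from that ranking can be sketched as follows. This is a simplified Python illustration for nominal attributes, not the implementation used in this study; the seven records and the category labels are invented so that the ranking happens to mirror the order reported above.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def gain(rows, attribute, target):
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def id3(rows, attributes, target):
    """Recursively split on the attribute with the highest information gain."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:          # pure node: return the class as a leaf
        return labels[0]
    if not attributes:                 # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = id3(subset, remaining, target)
    return {best: branches}


def classify(tree, row):
    """Follow the branches matching the row until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[row[attribute]]
    return tree


# Hypothetical nominal records, for illustration only.
COLUMNS = ("sunlight", "temperature", "humidity", "cases")
RAW = [
    ("short", "cool", "dry", "high"),
    ("short", "cool", "humid", "high"),
    ("short", "warm", "dry", "high"),
    ("short", "warm", "humid", "high"),
    ("long", "cool", "dry", "high"),
    ("long", "warm", "dry", "low"),
    ("long", "warm", "humid", "low"),
]
rows = [dict(zip(COLUMNS, r)) for r in RAW]
attributes = ["sunlight", "temperature", "humidity"]

ranking = sorted(attributes, key=lambda a: gain(rows, a, "cases"), reverse=True)
tree = id3(rows, attributes, "cases")
accuracy = sum(classify(tree, r) == r["cases"] for r in rows) / len(rows)
print(ranking, tree, accuracy)
```

The accuracy here is measured on the same records used to build the tree, which is optimistic; a held-out test set would give a fairer estimate.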

Conclusion
Weather is an important factor in determining the number of confirmed COVID-19 cases in Surabaya. The weather variables, namely the average duration of sunlight (hours), the average temperature (°C), and the average humidity (%), were significantly correlated with the number of patients confirmed cured, with respective gain values of 0.683, 0.450, and 0.030 and an accuracy of 96.77%.
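The text does not state how the accuracy was computed. As a purely arithmetic observation, the study period of 1 June 2020 to 1 July 2020 contains 31 daily records, and 30 correct classifications out of 31 would give the same percentage. The count of 30 in the Python sketch below is an assumption made for illustration, not a reported result.

```python
from datetime import date

# Number of daily records in the study period, both endpoints included.
n_days = (date(2020, 7, 1) - date(2020, 6, 1)).days + 1

# Assumption for illustration only: 30 of the daily records classified correctly.
assumed_correct = 30
accuracy_percent = round(100 * assumed_correct / n_days, 2)
print(n_days, accuracy_percent)
```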