Ranking and Clustering Iranian Provinces Based on COVID-19 Spread: K-Means Cluster Analysis

ORIGINAL ARTICLE Introduction: The Coronavirus has crossed geographical borders. This study was performed to rank and cluster Iranian provinces based on coronavirus disease (COVID-19) recorded cases from February 19 to March 22, 2020. Materials and Methods: This cross-sectional study was conducted in the 31 provinces of Iran using the daily number of confirmed cases. The Cumulative Frequency (CF) and Adjusted CF (ACF) of new cases were calculated for each province. Province characteristics such as population density, area, distance from the original epicenter (Qom province), altitude above sea level, and Human Development Index (HDI) were used to investigate their correlation with ACF values. The Spearman correlation coefficient and K-Means Cluster Analysis (KMCA) were used for data analysis. Statistical analyses were conducted in RStudio. The significance level was set at 0.05. Results: There were 21,638 confirmed COVID-19 cases in Iran during the study period. Significant correlations were observed between ACF values and province HDI (r = 0.46) and distance from the original epicenter (r = -0.66). KMCA, based on both CF and ACF values, classified the provinces into 10 clusters. In terms of ACF, the highest level of spread belonged to cluster 1 (Semnan and Qom provinces), and the lowest belonged to cluster 10 (Kerman, Sistan and Baluchestan, Chaharmahal and Bakhtiari, and Bushehr provinces). Conclusion: This study showed that ACF gives a realistic picture of each province's spreading status. KMCA results based on ACF identify the provinces that are in critical condition and need attention. Therefore, using this model to identify hot spots for quarantine is recommended. Article History: Received: 12 November 2020 Accepted: 20 January 2021


Introduction
These days, a strange and unwanted guest has taken people in most countries of the world by surprise. The 2019 novel coronavirus (2019-nCoV), a new member of the coronavirus family, mainly causes acute infection in the human respiratory system at early stages 1 . COVID-19 is transmitted from human to human, and its incubation period varies between 2 and 14 days 7,8 . Although transmission mostly occurs when a patient is symptomatic, studies have indicated that it may also happen during the asymptomatic incubation period 9 . Clinical manifestations of this disease include fever, cough, shortness of breath, breathing difficulties, and other complications related to respiratory tract involvement; in more severe cases, the infection can cause pneumonia, severe acute respiratory syndrome, and even death 7,8,9 . According to the warnings issued, older adults and people with serious heart disease, diabetes, and lung disease may be severely affected by COVID-19 10 . Long infectious periods, rapid increases in the number of cases, and the absence of definitive treatment have made this disease a global challenge 11,12 . As of March 22, 2020, the total number of infected cases worldwide was 337,459, of whom 14,640 had died. While China had the most confirmed cases, the highest mortality due to COVID-19 had been reported in Italy, with a total of 5,476 deaths 13 . On February 19, 2020, the first deaths due to COVID-19 were announced in Qom province (Iran); after that, the outbreak began to spread across the other provinces. According to the Iranian Ministry of Health and Medical Education (IMHME) reports, up to March 22, 2020, there were 21,638 confirmed cases in Iran with 1,685 deaths and 7,913 recoveries, and Tehran province was reported to have the highest number of infected cases among the provinces.
Although the authorities have started to confront this epidemic by closing schools and universities, refusing non-local travelers in some provinces, reducing working hours for some days, introducing remote work in many companies, and asking individuals to stay at home in self-quarantine, the number of infected cases is still increasing quickly 5 . Given the rapid outbreak of COVID-19 in Iran, this study was conducted to identify the spreading trend of this disease in Iran and all its provinces from February 19 to March 22, 2020, using the data recorded by the IMHME, and to find provinces with similar disease spread using K-Means Cluster Analysis (KMCA) 16 . Evolutionary clustering, which relies on KMCA, was also used to analyze 43 million textual tweets about the spread of COVID-19 posted on Twitter between March 22 and March 30, 2020. KMCA identified six clusters, which were sorted in decreasing order. The results indicated that "Covid-19 pandemic" was the most trending term 17 . In 2020, different areas of Italy were classified according to SARS-CoV-2 spread, and the KMCA results showed three independent clusters 18 . According to the literature review, no study has used KMCA to classify Iranian provinces based on the prevalence of COVID-19. Therefore, this study was conducted to rank and cluster Iranian provinces using KMCA based on recorded coronavirus disease (COVID-19) cases from February 19 to March 22, 2020, and to investigate the relationship between disease spread and province characteristics such as population density, area, distance from the original epicenter (Qom province), altitude above sea level, and Human Development Index (HDI).

Data collection
During the study period, from February 19 to March 22, 2020, the daily regional distribution of the epidemic was closely tracked and extracted. The information included the daily number of newly confirmed cases, recoveries, and deaths due to COVID-19 in all 31 provinces of Iran within a 33-day interval. Based on the last population census, conducted in 2016, the population of Iran is about 79.9 million people. Table 1 presents the cumulative frequency of infected cases, population size, area, population density, distance from the original epicenter (Qom province), altitude above sea level, and HDI for each province.

Statistical analysis
KMCA was applied to identify the provinces with similar spreading patterns in terms of the COVID-19 outbreak based on Cumulative Frequency (CF) and Adjusted Cumulative Frequency (ACF), the latter of which provides a more accurate picture of the epidemic. KMCA is a distance-based algorithm in which distances are calculated to assign each point to a cluster. In this method, each cluster is associated with a centroid, and the aim is to minimize the sum of distances between the points and their respective cluster centroids 18 . Each cluster groups provinces with close CF (or ACF) values, and the clusters are sorted in descending order from first to last, so that the first and last clusters represent the highest and lowest CF/ACF values, respectively. The number of iterations was set to 100, and each cluster centroid is the average of the CF/ACF values assigned to that cluster.
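The clustering procedure described above can be sketched as follows. This is a minimal illustration in Python, not the authors' RStudio code: it assumes Lloyd's algorithm on a single variable, a deterministic initialization at evenly spaced positions of the sorted values (a choice made here only for reproducibility), and made-up input values. The function name `kmeans_1d` is hypothetical.

```python
def kmeans_1d(values, k, n_iter=100):
    """Cluster scalar values with Lloyd's k-means.

    Returns (labels, centroids), where cluster 1 has the highest centroid
    and cluster k the lowest, mirroring the descending order in the text.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: start from evenly spaced points of the sorted data so the
    # result is reproducible (the paper does not state its initialization).
    centroids = [ordered[int((i + 0.5) * n / k)] for i in range(k)]

    def assign(cs):
        # Each value goes to the nearest centroid (absolute distance in 1D).
        return [min(range(k), key=lambda j: abs(v - cs[j])) for v in values]

    for _ in range(n_iter):
        labels = assign(centroids)
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # The centroid is the mean of its members; keep it if empty.
            updated.append(sum(members) / len(members) if members else centroids[j])
        if updated == centroids:  # converged before reaching n_iter
            break
        centroids = updated

    labels = assign(centroids)
    # Relabel so that cluster 1 holds the highest values.
    order = sorted(range(k), key=lambda j: centroids[j], reverse=True)
    rank = {j: r + 1 for r, j in enumerate(order)}
    return [rank[lab] for lab in labels], [centroids[j] for j in order]


# Hypothetical per-province values (not data from the study).
example_values = [900.0, 910.0, 400.0, 420.0, 50.0, 60.0, 55.0]
example_labels, example_centroids = kmeans_1d(example_values, k=3)
```

With real data, `values` would hold the 31 provincial CF or ACF values and `k` would be 10. Random initializations, as commonly used in statistical packages, can yield different partitions from run to run.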
Furthermore, the Spearman correlation coefficient was used to assess the strength of the monotonic relationship between ACF and the variables presented in Table 1. All data analyses were conducted in the RStudio environment, Version 1.2.5042, and in Microsoft Excel 2016. The significance level was set at 0.05.
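As an illustration of the correlation measure, Spearman's coefficient is the Pearson correlation computed on ranks, so it captures monotonic rather than strictly linear association. The sketch below uses only the Python standard library. The function names and the example values are hypothetical, and it does not compute p-values.

```python
def average_ranks(xs):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical distances (km) and ACF values, not data from the study.
distance = [0, 100, 200, 300, 400]
acf = [900, 500, 520, 200, 50]
rho = spearman(distance, acf)  # negative: ACF tends to fall with distance
```

A strictly increasing but non-linear pair such as `[1, 2, 3, 4]` and `[1, 8, 27, 64]` gives a coefficient of 1, which is why the measure is described in terms of monotonic association.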

Ethical issue
This study was approved by the Ethics Committee of Shahid Sadoughi University of Medical Sciences (IR.SSU.SPH.REC.1399.195).

Results
There were 21,638 infected, 7,913 recovered, and 2,299 death cases with COVID-19 in Iran during the study period. Figure 1 shows the number of newly infected, death, and recovered cases due to COVID-19 from February 19 to March 22, 2020, in Iran. Qom province, home to the seventh most populous city of Iran, was the first province affected by the Coronavirus, with 2 cases on February 19, 2020, and Bushehr province was the last, with five cases on March 5, 2020. For each province, the daily frequency of new infected cases is presented in figure 2, together with the cumulative frequencies. Among Iran's provinces, the maximum and minimum CF were observed in Tehran with 5098 cases (23.56%) and Bushehr with 55 cases (0.25%), respectively. ACF values for each province are mapped in figure 4. By March 22, the ACF of infected cases in Iran was 270.72 per one million people, with the maximum ACFs belonging to Semnan (918.33) and Qom (911.56) provinces. Figure 5 shows the CF, new cases, recoveries, and deaths per day in Iran. As demonstrated in figure 5, there is an increasing trend in CF in Iran.
Results of the KMCA based on the CF and ACF of the confirmed cases showed that all 31 provinces of Iran were classified into 10 clusters according to the Euclidean distance similarity index (Table 2). In clustering based on CF, Tehran and Isfahan each formed a separate cluster, but in clustering based on ACF, they shared a common cluster. Three pairs of provinces shared a cluster in both types of clustering: Fars and Khuzestan; Chaharmahal and Bakhtiari and Bushehr; and Gilan and Alborz.
There was a significant positive relationship between ACF and HDI (r = 0.46, p = 0.008) and a significant negative relationship between ACF and distance from Qom province (r = -0.66, p < 0.001). The relationships between ACF and area (km²) (r = -0.30, p = 0.10), height above sea level (r = -0.13, p = 0.47), and population density (r = 0.20, p = 0.27) were not significant.
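The arithmetic behind the reported percentages and the ACF can be checked with a short sketch. It assumes that ACF is the cumulative frequency divided by the population and scaled to one million inhabitants, as the "per one million" wording suggests. The function names are hypothetical, and the national population is the rounded 79.9 million quoted in the data collection section.

```python
def adjusted_cf(cf, population, per=1_000_000):
    """Cumulative frequency per `per` inhabitants (assumed ACF definition)."""
    return cf / population * per


def share_percent(cf, total):
    """A province's share of the national cumulative frequency, in percent."""
    return 100 * cf / total


total_cases = 21_638          # national CF reported in the text
iran_population = 79_900_000  # rounded 2016 census figure from the text

iran_acf = adjusted_cf(total_cases, iran_population)
tehran_share = share_percent(5_098, total_cases)
bushehr_share = share_percent(55, total_cases)

print(round(tehran_share, 2))   # 23.56
print(round(bushehr_share, 2))  # 0.25
print(round(iran_acf, 1))       # about 270.8 with the rounded population
```

The national value comes out slightly above the reported 270.72, presumably because the population is rounded here, while the provincial shares match the reported 23.56% and 0.25%.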

Discussion
The Coronavirus has crossed the geographical borders of various countries without any restriction and has forced all countries in the world to confront the challenges of the disease, the death of their citizens, economic pressure, and other issues 19 . In 2020, Mahmoudi et al. compared the distributions of COVID-19 spread in the United States of America, Spain, Italy, Germany, the United Kingdom, France, and Iran and clustered them using fuzzy clustering. It was shown that Iran, Italy, Germany, and France are statistically similar 20 .
As figure 2 depicts, the daily number of new confirmed cases showed an upward trend in most provinces. In general, the registration of new daily cases of the Coronavirus from February 19 to March 22, 2020, in Iran's provinces can be divided into three categories: downward, upward, and irregular. Tehran, Qom, Mazandaran, Golestan, Khorasan Razavi, Khuzestan, and Chaharmahal and Bakhtiari provinces showed a downward trend; Fars, Yazd, Zanjan, Hamedan, Sistan and Baluchestan, and West Azarbaijan showed an upward trend; and the other provinces showed an irregular trend.
In KMCA as a multivariate technique, each cluster is associated with a centroid, and the aim is to minimize the sum of distances between the points and their respective cluster centroid 21 .
In the present study, both CF and ACF values were treated as the points to be clustered, and each cluster included the provinces with close values. The results of KMCA based on the CF of the confirmed cases showed that, among the six provinces with the highest frequency of infected cases, Tehran, Isfahan, and Mazandaran each followed an independent trend and formed its own cluster, whereas Qom, Gilan, and Alborz followed a similar trend and shared a common cluster. The results of KMCA based on the ACF of the confirmed cases showed a similar dual pattern in the six provinces with the highest frequency.
As shown in Table 2, KMCA based on CF placed Tehran province, with a population of 13 million (17.5% of Iran's population), in the first cluster, with the first rank (maximum CF value) among the affected provinces. However, after adjusting CF for the population of each province (ACF), Tehran was no longer in the category of provinces with high infection (cluster 1). It seems that ACF gives a more realistic description of disease status in each province. According to this adjustment, contrary to the picture created by the daily frequency and CF, in which Tehran, Isfahan, and Markazi provinces were not in good condition, the ACF results showed that Semnan, Qom, Yazd, Markazi, and Qazvin provinces were in critical condition and needed special attention. Similar to our results in terms of detecting hot spot regions, in 2020, Bazargan and Amirfakhriyan provided a spatial clustering of Coronavirus cases from February 22 to March 22, 2020, in the provinces of Iran using the Exploratory Spatial Data Analysis (ESDA) approach and detected Tehran, Alborz, Qom, Mazandaran, Gilan, Qazvin, Isfahan, Semnan, Markazi, and Yazd provinces as the main spatial propagation centers of this epidemic 22 . In 2020, Ramírez-Aldana et al. studied the spatial distribution of COVID-19 in Iran from February 19 to March 18, 2020, to identify areas with a higher frequency of infected cases and the factors that affected the disease spread. According to the results of their study, the provinces of Qom, Markazi, Mazandaran, and Semnan in the northern part of Iran around Tehran, as well as the provinces of Alborz, Gilan, Qazvin, and Yazd, were high-risk regions, and the provinces of Bushehr, Hormozgan, and Sistan and Baluchestan were low-risk regions 23 . These findings are consistent with our study results.
Vol (6), Issue (1), March 2021, 1184-95, Jehsd.ssu.ac.ir
Contrary to Ramírez-Aldana et al., who reported fewer confirmed COVID-19 cases in provinces with higher literacy, our findings showed a direct relation between ACF and HDI, which is an indicator of education, life expectancy, and per capita income. The results of the present study also showed a significant negative relationship between the ACF of the confirmed cases in different provinces and distance from Qom, so that the epidemic was more intense in the provinces closer to Qom.
This study had some limitations. Due to IMHME policy, provincial data were accessible only until March 22; after that, official data were reported only for the whole country, so obtaining data for each province was not possible.
This limitation did not affect the results of this study, and the classification tool used achieved good results, which can be mentioned as a strength of this study. The tool can also be suggested as a quick means of classifying and ranking the areas affected by COVID-19.

Conclusion
In a nutshell, COVID-19, as an infectious and destructive disease, has a considerable transmission speed in Iran. Both CF and ACF can be used to make decisions about disease status in a population, but as this study showed, ACF could give a more realistic and authentic picture of spreading status in each area.
In the present study, KMCA was applied to identify and rank the provinces with similar spreading patterns in terms of the COVID-19 outbreak. As it gave good results, this method can be recommended to health policy makers to identify, rank, and separate affected areas and quarantine them. KMCA results based on ACF indicate that Qom and Semnan provinces need special attention. However, these results are limited to the study period.
It seems that the outbreak of this infectious disease is more intense in provinces nearer to the original epicenter. Unexpectedly, provinces with a high level of HDI were more infected. It seems that the joint effect of higher education and more activity may lead to a more dynamic population and therefore increase the disease transmission rate.