Combining rank-size and k-means for clustering countries over the COVID-19 new deaths per million

This paper deals with the cluster analysis of selected countries based on COVID-19 new deaths per million data. We implement a statistical procedure that combines a rank-size exploration and a k-means approach for clustering. Specifically, we first carry out a best-fit exercise on a suitable polynomial rank-size law at an individual country level; then, we cluster the considered countries by adopting a k-means clustering procedure based on the calibrated best-fit parameters. The investigated countries are selected considering those with a high value for the Healthcare Access and Quality Index to make a consistent analysis and reduce biases from the data collection phase. Interesting results emerge from the meaningful interpretation of the parameters of the best-fit curves; in particular, we show some relevant properties of the considered countries when dealing with the days with the highest number of new daily deaths per million and waves. Moreover, the exploration of the obtained clusters allows explaining some common countries' features.


Introduction
The spatio-temporal patterns of COVID-19 represent one of the most relevant themes for statistical research nowadays, given the crucial relevance of the pandemic disease in contexts of society such as economics and, of course, health.
This pandemic has heterogeneous implications on countries and regional realities. The most common example we can mention is given by the different applications of the so-called non-pharmaceutical interventions in the preliminary phases (see, e.g. [1,2]). These differences must be included in the premise of an effective exploration of COVID-19 repercussions.
Some authors deal with forecasting exercises of the future evolution of deaths and infections dynamics (see, e.g., [3][4][5][6][7]). In this respect, Ioannidis et al. [8] state that the reliability of the predictions related to COVID-19 is debatable for several reasons, including the relevant sensitivity of the estimates on the employed methodology.
This paper takes the opposite perspectivehence, overcoming the criticism raised by Ioannidis et al. [8] presenting a lookback of the spatio-temporal data related to COVID-19. It is worth mentioning some relevant contributions on the matter. Bartolucci and Farcomeni [9] propose the study of the cases of COVID-19 infections in the Italian regions by employing a model based on latent variables and estimating it through a Markov chain Monte Carlo (MCMC) algorithm. Still, in the context of MCMC, Lee et al. [10] discuss the propagation of COVID-19 in Scotland by adopting a Bayesian-type framework. In Schneble et al. [11], the registered death counts related to COVID-19 are modelled to monitor the dynamic behaviour of the infections on a small-area level in Germany.
Differently from the studies above, we consider a selection of countries and deal with the exploration of their daily data about COVID-19 new deaths per million. We combine a rank-size best-fit exercisebeing the size the considered variableand a cluster analysis of kmeans type. So, it is shown that a rank-size law of third-degree polynomial type provides high-quality goodness-of-fit parameters. The calibrated parameters feed the k-means cluster analysis based on the Euclidean distance, with k = 3 (the reasons for this choice are presented in Section 2). In this paper, we do propose a new method for clustering COVID-19 data in terms of countries as in Zubair et al. [12]; instead, we want to consider a statistical clustering technique well known in the literature and also widely used in the context of operations research, and apply it to a novel, relevant problem, with highly informative results. To assure data reliability and to reduce possible sources of biases in the data collection, countries are selected by taking those with a high value of the Healthcare Access and Quality Index (HAQ hereafter, see [13]). Moreover, to avoid distortions in the best-fit procedure, we have removed the outliers at a country level during the data pre-treatment phase. The results interpretation is grounded on the meaningfulness of the calibrated parameters in terms of the polynomial curve shape; the analysis of the obtained clusters allows highlighting regularities and deviations of the considered countries.
Other contributions present a cluster analysis of the data related to COVID-19 and are summarised in Table 1. For instance, James and Menzies [14], and Rios et al. [15] respectively employ k-means and hierarchical to analyze public policies along the time (former) and to make forecasting of pandemic waves (latter). In these studies, the time is considered because of the purposes of the researches. Similarly, Li et al. [24] run health parameters-based classification of the patients in Wuhan covering the beginning of the pandemic spread. Hutagalung et al. [17] perform a cluster analysis via k-means taking k = 3 to group South-East Asian countries. In this case, the time is not considered to get a more global view of the country's conditions during the pandemic. Similar works is done by Abdullah et al. [20] covering a broader set of countries or in Kumar's [23] research, where Indian territories are classified in terms of similar infection propagation, to pursue optimal monitoring strategies. On a different note, certain studies include additional features for investigating the pandemic. For example, Siddiqui et al. [16] consider the relationship between temperature in Chinese areas and the spreading of the disease to cluster China's territories. A similar exercise is done by Vadyala et al. [18] where humidity is also considered but for exploring Louisiana's pandemic related data. Kiaghadi et al. [21] consider a larger number of variables. The authors included in the clustering elements like "Access to Medical Services and Sociodemographic, Behavioral, and Lifestyle Factors" to determine the most vulnerable areas to COVID-19 in Harris County, Texas. Rizvi et al. [25] cluster 79 countries using socio-economic factors, disease prevalence and health system indicators considering COVID-19 confirmed cases and COVID-19 death cases. Zubair et al. [12] propose a methodological work where a k-means variation is introduced for the case of COVID-19 data.
Our work aligns with the one from Tuli et al. [26] for the early steps the authors take, even if they do not explore clusters but focus on forecasting. Tuli et al. [26] find estimation of new cases distributions. Among others, Weibull and Gaussian distributions on COVID-19 are used. With the modelled distributions, the authors can perform forecasts. We are close to the work by Machado and Lopes [22] as well. The authors model the infected cases in more than 70 countries and visualize the results via a clustering approach. Machado and Lopes [22] use daily log changes to create empirical distributions, and then they test multiple families of distributions to model the data. Then, with estimated distributions' parameters, the clusters are found.
Moreover, some papers treat COVID-19 by adopting a rank-size analysis approach. For example, Kennedy and Yam [27] use Zipf's law to detect COVID-19 data inconsistencies, while Jiang and de Rijke [28] employ power-law relationships to explore USA populations, deaths and infections. Vasconcelos et al. [29] "analyze the rank-frequency distribution of preprints servers, ordered by the number of COVID-19 preprints they host" and Small and Sousa [30] apply rank-size distributions to model the spatio-temporal evolution of COVID-19 in USA and China, but not directly with data regarding deaths or new cases.
More in general, the usage of rank-size laws is typically driven by robust compliance of the data to theoretical modelssee, e.g., the work of Ficcadenti et al. [31] for text analysis or Ficcadenti and Cerqueti [32] for earthquakes cost evaluations based on rank-size lawnamely, when the best fit is appropriate, the goodness of fit must result excellent. It is the case in this paper, as presented in the section devoted to the results. In studying other researches attempting to model similar information related to COVID-19, one can notice, for example, Table 1 by Tuli et al. [26] where the R 2 s are lower than those usually obtained with rank-size best fits. In addition, Machado and Lopes [22] write in section "Regression models for describing the spread of COVID-19", that "a single model with a limited number of parameters is not able to fit well the time series […] for all countries". So, even if they have found some goodness of fit comparable to those expected for rank-size compliance (e.g., they report an R 2 = 0.99 for Italian and Chinese data), the issue of identifying a model that works well for all the countries remain open. It also involves some consideration around over-fitting the data with many parameters and the increasing computational complexity in fitting and then clustering. Therefore, another advantage of the approach proposed in the present study is the rank-size relationships' capacity to create a unified environment where comparisons are possible. Namely, we can fit each country's data and compare the results, ensuring that the best fit capacity does not affect the clustering activity.
The rank-size approach has the advantage (in this case) of allowing the analysis without data's temporal feature. Namely, in sorting the observations of new deaths per million and ranking them, the dates in which the causalities occurred are no longer relevant to reach conclusions regarding the countries. In this way, the issues presented by Middelburg and Rosendaal [33] and Zarikas et al. [34] do not affect our analysis. Zarikas et al. [34] "compare the time series of COVID-19 regarding active cases or similar variables" to model the evolution of the pandemic. In our work, we do Table 1 Sample of recent studies related to our work for the methods employed and for the data used.

R. Cerqueti and V. Ficcadenti
Chaos, Solitons and Fractals 158 (2022) 111975 not make clustering based on different types of time evolution (e.g., strong, medium, mild etc.), but our ranking uses the number of new deaths per million in a certain time period. Furthermore, Zarikas et al. [34] concern solely the first wave while the present paper mixes different waves to capture information on the phenomenon as a whole. Besides, the study is advancing a unique combination of rank-size and clustering analysis to evaluate past realizations of COVID-19 patterns. This is novel in the literature to the best of our knowledge. The rest of the paper is organised as follows. Section 2 describes the considered dataset and presents the methodologies employed for the analysis. Section 3 contains the empirical results, along with a discussion of them.

Data and methods
The time series of daily new deaths per million by country has been downloaded from Our World in Data, Roser et al. [35]. The data source collects a comprehensive set of variables describing many features related to COVID-19, and it has been employed in several authoritative studies (see, e.g. those from [36][37][38]). Each country has a specific reference period, depending on the registered beginning of the pandemic propagation. All the investigated periods ends on April 18th, 2021when data have been retrievedwhile the starting points are reported in the last column of Table 2.
Countries are rather heterogeneous in terms of health care standards. This might create different reporting best practices, especially at the beginning of the pandemic (see, for example, [39]). To overcome this potential bias and obtain a more reliable dataset, we have chosen countries with a high level of HAQ presented by Barber et al. [13]. Such an indicator is listed for 195 countries. Despite the index published in 2017, the most recent levels are reported for 2015 when Andorra had the highest level with 94.6, and the Central African Republic had the minimum with 28.6. So, we have chosen to keep the 39 countries appearing in the last 20th percentile of the HAQ distribution in 2015; see the first two columns of Table 2 for the details. The same table contains the main descriptive statistics of the considered data at a country level for a more detailed overview.
We have preprocessed the dataset to make the best fit more effective and avoid distortions. First, we have removed the outliers in each analyzed series by applying an interquartile method. Namely, for each series, we have calculated the 75th (Q3) and 25th (Q1) percentiles; then, we eliminated all the observations being outside the In doing so, we have faced the so-called king and vice-roy and queen and harem effect, i.e. the deviations due to outliers at low and high ranks in the rank-size analysis (see, e.g., [40][41][42]). The country data presents tails on the right side only so that the eliminated observations always sit on the right side of the upper limit. Second, we have removed Andorra, Iceland, New Zealand, and Singapore from the investigated sample since they are not relevant in a rank-size context. Indeed, such countries luckily had just a few days in which deaths were experienced, so they present zero new deaths in most of the days in the period under analysis. Therefore, we have obtained N = 35 countries after this preprocessing phase. The rank-size analysis is implemented at an individual country level. Each country's new daily deaths per million (size) represent the (daily) sizes. The ranks are associated with the daily sizes in decreasing order. Specifically, we have given rank one to the day with the highest level of new deaths per million and the highest rank to the day associated with the smallest number of new daily deathspossibly, zero.
After many trials of different functional forms such as the Zipf-Mandelbrot (see [43,44]) and the Universal Law (see [45]), we have used for each country the following third-degree polynomial relationship which, as we will see, gives satisfactory best fit outcomes: where z is the size, r represents the rank related to size z and a, b, c and d are real parameters to be calibrated. To implement the best fit procedure, we have used the Scikit-learn Python's library [46] which leads to a parameters' estimation "by adding higher-order polynomial terms of existing data features as new features in the dataset" as reported by Bisong [47].
Once the best fit procedure is performed, each country i = 1, …, N remains associated to four calibrated parameters, collected in a vector Such parameters have been used in the countries' clustering procedure by adopting the "k-means++" Scikit-learn Python algorithm (see contributions from [46,48]). Anyway, the effect of random initialization has been tested, and it did not impact the results. To implement the k-means++ procedure, the parameters have been standardized over the considered countries. We set k = 3 after inspecting different possibilities via the Silhouette, Calinski, Davies and Dunn coefficients. The results of the evaluations are reported in Table 3, as it is possible to note, they suggest k = 3 as the best option. Indeed, obtaining adjacent neighbourhoods of clusters representing different countries' structures and regimes is preferred. So, taking k = 3 means selecting the condition where the distance between clusters is the minimum. Hence, the groups are close to each other, and the variance in the clusters is maximum, ensuring more comprehensive clusters' perimeters, namely a higher probability of capturing the countries behaving similarly, falling in the same clusters' area.
The proposed clustering algorithm selects the 3 clusters' centroids that minimize the within-clusters sum-of-squares criterion: where ‖x − μ‖ is the Euclidean distance between the four-dimensional vectors x and μ. Let us denote the centroids coming from the optimization procedure in Eq. (2) by μ 0 ð Þ , μ 1 ð Þ , μ 2 ð Þ ; they are associated to clusters labelled with "0", "1" and "2", respectively. Such optimized centroids lead to a classi- To summarise the proposed procedure, we report in pseudo-code what described above in Algorithm 1.
Algorithm 1. Summary of the rank-size analysis and k-means clustering. Table 3 Evaluation of the ks for the k-means cluster analysis. The reference for each index used is listed here in the same order they appear in the table, Rousseeuw [49]; Davies and Bouldin [50]; Dunn [51]; Caliński and Harabasz [52].  Table 4 Estimated parameters and clusters per each country. The last two columns respectively represent the rank at which the best fit of Eq. (1) presents a change in concavity and the maximum rank obtained for that country, namely the length of the series.

Results
The results of the best fits for Eq. (1) are reported in the first six columns of Table 4. The R 2 and the RSME are outstanding, proving the ability of Eq. (1) to represent the rank-size relationship. The interpretation of the calibrated parameters b a, b b, b c and b d leads to relevant comments related to the considered countries. The parameter b a is the intercept of the best fit curve with the y-axis. Hence, such a parameter is positively influenced by the highest level of new daily deaths per million experienced by the countries. We observe the maximum value of b a held by Hungary and the minimum one by Australia.
Differently, b b is associated with the slope of the decay; thus, not unexpectedly, it is always negative in our case. In particular, at low ranks, a high value of the absolute value of b b stands for a steep curve, while b b close to zero means that the curve is rather flat. The difference between such cases is the distance between the sizes at the low ranks, which is large for the steep cases and small for the flat ones. We observe that countries experiencing a single pandemic wave with low daily deaths have similar values. This explains why such countries have the parameter b b much closer to zerosee the case of Australia in Fig. 1; on the contrary, the maximum value of the absolute value of b b is scored by Slovenia.
The concavity is driven by b c, that is positive or negative according to a convex or concave shape of the curve at low ranks, respectively. In our case, such a calibrated parameter is always positive, except for Czechia and Italy. Therefore, for all the other countries, decrements of the lowranks' sizes decrease as the rank grows. This means that the highest number of daily deaths per million forms a peak in the overall distribution. The flatter shape of the concave curve at low ranks for Italy and Czechia points to more homogeneous values of the daily deaths per million at low ranks. Such behaviours are amplified as the absolute value of b c increases. We notice that Slovenia scores the maximum value of such a parameter in this respect.
Concluding, an easy computation gives that rank represents the unique inflection point of the best fit curve, where a change of concavity is observed. This quantity is reported in column "Inflection point" of Table 4. The highest value of the rank associated with the inflection point is scored by Czechia, while the lowest is by Slovenia. Such values further confirm the aforementioned logic, with Czechia having experienced a more recent and prolonged critical situation than Slovenia (see Fig. 1).
The standardized parameters b a, b b, b c and b d are employed to feed the clustering algorithm. The resulting clusters are reported in the column "Clusters" in Table 4 which, jointly with Figs. 1, 2 and 3, further informs about the features captured in fitting the data with Eq. (1). The cluster identified with the colour blue and the number "0" mainly contains countries with a relatively low number of deaths per million; Australia, Japan and South Korea obtain the lowest losses. This is further confirmed by sorting the results by b a. It is relevant to point out that the same countries appear with the lowest values of the calibrated parameter b b. Such a Fig. 2. Histograms of the estimated parameters divided by the resulting clusters, five bins per colour, are searched. On the y-axes, there are the relative frequencies that sum to one per cluster.
finding makes sense because when a series of deaths gets a shock, it takes time to get back to zero, so this is reflected in a gentler decay registered in the rank-size relationship. Interestingly, the cluster indicated in orange and identified by the number "1" is characterised by low values of b b and b d, but high values of b c. For example, the countries with the lowestbs are Slovenia, Hungary, Poland, Croatia and Belgium. The highest b c are reported by Poland, Croatia, Luxembourg, Belgium and Slovenia. The lowest b ds are scored by Slovenia, Belgium, Luxembourg, Portugal and Croatia. Despite the outlier removal procedure, these relatively small countries suffered losses with high daily peaks. Finally, the green cluster identified with the number "2" has countries sitting in the middle of the distribution when values are sorted by b b. Similarly, when they are ordered by b c except for Czechia and Italy, which present the lowest values of b c even if they belong to cluster number 2. To justify such a result, we look at the "Inflection point" column in Table 4. Indeed, for Czechia and Italy, the inflection points fall at low ranks, and from Fig. 1 we can see that the quoted countries have experienced lengthened periods of high new deaths per million. Montenegro and United States also show similar patterns having long periods of high levels of new daily death per million. However, for them, such a condition is quite evident over the investigated period; therefore, the situation does not allow for a change in concavity to happen early in the rank, even if Montenegro and United States belong to the same cluster of Czechia and Italy.
Moving on to an example, Fig. 4 contains a comparison of Italy and the UK. It allows concluding that Italy's change in the concavity at middle ranks lead the country to be in Cluster 2, on the other hand, the UK's number of days with high deaths per million presented at low ranks makes the country more suitable for Cluster 1.

Conclusions
This paper aims at providing a unified framework at a country level of the number of deaths for COVID-19 by moving from the daily data and through different waves. To this aim, we implement a rank-size analysis via a four-parameter third-degree polynomial on the series of COVID-19 new deaths per million registered in 35 countries. The statistical soundness of the results allows the identification of different regimes in the data. Specifically, it is possible to make comparisons between countries by using a rank-size approach because the best fits of Eq. (1) are statistically good in all the considered cases (see Table 4). We have provided a reasonable interpretation of the four estimated parameters in Eq. (1), hence capturing insightful information regarding the COVID-19 severity in the countries. In this respect, the clustering activity is grounded on the parameters calibrated from the best-fit exercise, and we group countries according to the phenomenon's features captured by such rank-size function's parameters. The main determinants are given by days with picks of deaths, the steadiness of casualties number, endured COVID-19 waves and other elements that affect the shape of the ranked data. In Fig. 3 a visual representation of the cluster profiles is reported, and in Table 4 the clusters are summarised. The clustering exercise leads to relevant information for policymakers. Indeed, Government and supra-national health institutions might carry out common strategies for contrasting the diffusion of COVID-19 and reducing its fatality rate. That can be done by investigating the similarities and divergences among countries described by the clustering procedure's results. Furthermore, countries' policymakers monitoring their own conditions and those of interrelated countries, for example, because of import/ export relationships, may benefit from the clusterisation to detect risks and define actions points. Namely, at a given point in time, the cluster to which a country belongs and the ones of its partners/ competitors provide proxies to evaluate the exposure to the pandemic, suggesting actions like stock up on essential resources for preserving economic interests or evaluating countries' common features. This is relevant, especially in the case of pandemics, because countries ahead experiencing waves can be seen as flags for interrelated countries (e.g., countries strongly connected via single transportation systems) not yet in the same situation; or because knowing that other countries are in the same cluster provides a view on their managerial abilities and infrastructures conditions. Moreover, the rank-size best-fit curve can effectively describe the pattern of the pandemic in that it provides a clear illustration of the ratios between the consecutive ordered ranked data. Thus, a clustering procedure rank-size based is able to distinguish the countries where the pandemic maintains a generally stable number of fatalities from those with remarkable high peaks of fatalities. So, a policymaker can gain several insights into the effects of countries' policies on the pandemic evolution.
It is important to notice that the rank-size approach allows for evaluations of the overall phenomenon without referring to specific periods and time ranges. In doing so, the proposed approach is free from biases associated with the time inconsistency of the data at country levels. Hence, we do not need here to implement timebased normalising procedures, which would demand an additional transformation of the data as reported by Zarikas et al. [34], and Middelburg and Rosendaal [33].
CRediT authorship contribution statement