Using country-level variables to classify countries according to the number of confirmed COVID-19 cases: An unsupervised machine learning approach [version 2; peer review: 1 approved, 1 approved with reservations]

Background: The COVID-19 pandemic has attracted the attention of researchers and clinicians who have provided evidence about risk factors and clinical outcomes. Research on the COVID-19 pandemic benefiting from open-access data and machine learning algorithms is still scarce yet can produce relevant and pragmatic information. With country-level pre-COVID-19-pandemic variables, we aimed to cluster countries in groups with shared profiles of the COVID-19 pandemic. Methods: Unsupervised machine learning algorithms (k-means) were used to define data-driven clusters of countries; the algorithm was informed by disease prevalence estimates, metrics of air pollution, socio-economic status and health system coverage. Using the one-way ANOVA test, we compared the clusters in terms of number of confirmed COVID-19 cases, number of deaths, case fatality rate and order in which the country reported the first case. Results: The model to define the clusters was developed with 155 countries. The model with three principal component analysis parameters and five or six clusters showed the best ability to group countries in relevant sets. There was strong evidence that the model with five or six clusters could stratify countries according to the number of confirmed COVID-19 cases (p<0.001). However, the model could not stratify countries in terms of number of deaths or case fatality rate. Conclusions: A simple data-driven approach using available global information before the COVID-19 pandemic seemed able to classify countries in terms of the number of confirmed COVID-19 cases.


Introduction
The ongoing COVID-19 pandemic has attracted the attention and interest of public health officers, practitioners, researchers and the general population. They all are working together to slow down the spread of the disease, thus reducing the number of severe cases and deaths. Their efforts have already produced relevant preliminary information on COVID-19 risk factors and the epidemiological profile of the disease 1-3 , with plenty more information not published yet (e.g., academic pre-prints).
The available evidence, published and unpublished, has mostly focused on the individual level; that is, studies have examined the patients, their characteristics, disease progression and outcomes. Little has been studied about large populations and geographic areas; in other words, ecological evidence and research addressing study units other than the patients are scarce, though they can reveal relevant and pragmatic information. In this line, research with novel analytical approaches, such as machine learning algorithms, is also uncommon.
Research at the country level could reveal potentially modifiable associated factors that individual-level data are still unable to study because of the limited number of observations. Moreover, machine learning techniques informed by country-level variables can provide classification algorithms useful to understand how countries may behave during and after the COVID-19 pandemic. Therefore, classification algorithms can reveal patterns to identify countries where the pandemic may have a similar effect. Countries could use this information to prevent worst-case scenarios given the cluster to which they belong. Global and regional organizations could use country clusters to organize similar aid to countries in the same cluster, while prioritizing clusters likely to experience the worst outcomes. Consequently, we aimed to develop a simple unsupervised machine learning algorithm, informed by country-level variables before the COVID-19 pandemic, that can classify countries regarding the number of confirmed COVID-19 cases and deaths. That is, we aimed to answer: can country characteristics before the COVID-19 pandemic be useful to cluster countries according to COVID-19 outcomes (e.g., number of cases and deaths)? In so doing, we provide a preliminary framework to stratify countries with similar progression through the COVID-19 pandemic.

Data sources
We used different data sources to build a dataset with information on COVID-19, prevalence estimates of selected diseases, a socio-economic metric, an air pollution metric, and a metric of health system coverage (Table 1). The unit of analysis was a country. Variables and specific data sources are shown in Table 1. Except for the COVID-19 variables, the other variables were used in the clustering analysis; that is, we used eight input variables for the cluster analysis: four diseases, air quality, gross domestic product per capita, a universal health coverage index and the proportion of men in the country (Table 1). In other words, countries were clustered following unsupervised machine learning algorithms based on prevalence estimates of the selected diseases, socio-economic status, air pollution and health system coverage (Table 1).
These predictors were selected because they are closely related to the COVID-19 pandemic, both from a clinical and public health perspective. We chose two chronic non-communicable diseases (diabetes and chronic obstructive pulmonary disease [COPD]) and two infectious diseases (tuberculosis and HIV/AIDS). Diabetes seems to be very frequent among COVID-19 patients 10. Although hypertension had a higher frequency than respiratory diseases 10, we chose COPD because of the structural and pathophysiological pathways it can share with an acute respiratory disease such as COVID-19; the same logic would apply for tuberculosis. We chose HIV/AIDS because of the high potential of impaired immune response. We chose 2.5 particulate matter (particles of width <2.5 µm) as a metric of air pollution; 2.5 particulate matter has been related to severe acute respiratory syndrome 11. Finally, we chose a metric of socio-economic status and health system coverage, which could affect the probability that a person adopts preventive care and accesses appropriate healthcare, should it be necessary.

Table 1 footnote (a): When a country did not have data for 2017, we used the latest available; when a country did not have any data on this source, we used data as reported by a Google search (this was the case for four countries).

Amendments from Version 1
The reviewers provided very interesting comments that improved our work. They requested further details on the methodology, variables selection and cluster analysis (Table 2). In comparison to the original version, the methods section includes more details. Similarly, they suggested that we further elaborate on the discussion about the relationship between the input variables and outcomes. We followed this recommendation.
Any further responses from the reviewers can be found at the end of the article.

Data analysis - clustering
Predictors. The variables used to develop the clustering model were measured on different scales, so each of them carries a different variance. Because of this characteristic, it is relevant to standardize these variables to set reliable clusters without losing information. Consequently, before running the unsupervised clustering algorithms, the predictors were treated with an orthogonal transformation and then with principal component analysis (PCA).
PCA. The PCA is a technique within the remit of unsupervised machine learning algorithms. PCA follows an orthogonal transformation, which turns correlated variables into an uncorrelated set of variables. The PCA aims to create a set of characteristics, or components, that represents the relevant information from the original group of variables 12,13 . The PCA seeks to reduce the number of predictors while maximizing the variance.
In this work, and to avoid losing information explained by the original eight predictors, we prespecified three PCA components; the three PCA components retained a variance of 1. This method of obtaining 100% as an explained variance implies keeping 100% of the information explained by the original eight predictors. Moreover, these three components gave the most reliable clusters as reported in the results section. We used the PCA algorithm available in the Scikit-Learn library 14.
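As an illustration only, the standardization and PCA steps described above could be sketched with Scikit-Learn as follows. The matrix `X` is a random, hypothetical stand-in for the 155-country, eight-predictor dataset, and the use of `StandardScaler` is an assumption, since the text does not name the standardization routine. The share of variance retained depends on the data; on this random matrix it is below 1.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the country-by-predictor matrix
# (155 countries, 8 predictors on very different scales).
scales = np.array([1.0, 5.0, 10.0, 50.0, 100.0, 1000.0, 2.0, 0.1])
X = rng.normal(size=(155, 8)) * scales

# Assumption: z-score standardization so no predictor dominates by scale.
X_std = StandardScaler().fit_transform(X)

# Three components, as prespecified in the text.
pca = PCA(n_components=3)
components = pca.fit_transform(X_std)

# Cumulative share of the variance retained by the three components.
retained = pca.explained_variance_ratio_.sum()
print(components.shape, round(float(retained), 3))
```

Inspecting `explained_variance_ratio_` per component is a simple way to check how much information each component carries before clustering.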
K-means. This technique seeks to group heterogeneous elements into homogeneous clusters. This approach is considered a paradigm in unsupervised machine learning, because it assigns the elements into clusters which were unknown at the beginning of the analysis 15. A few authors have used this methodology in clinical and public health research 16-19.
There are different methods for unsupervised clustering depending on the data characteristics 20 . Given our data and aims, we chose a centroid-based algorithm: k-means. This approach works well when the clusters have similar size, similar densities and follow a globular shape.
Regarding the number of clusters that optimizes the function convergence to the centroids, we plotted the elbow function (Figure 1) which, paired with epidemiological knowledge from the countries, supported the choice of five and six clusters (Figure 1). That is, five and six clusters classified countries in groups with shared socio-demographic and epidemiological profiles. Although five and six clusters provided similar groups, six clusters classified central Africa with greater detail, which could be useful for these countries and regional organizations. Overall, the cost function (elbow plot, Figure 1), paired with the overall results (boxplots and maps), suggested that five or six clusters were a sensible decision.
When there is a limited number of observations, as is arguably the case in this analysis, the candidate numbers of clusters around the "elbow" (Figure 1) provide similar information. At this point, it may be advisable to select the number of clusters which relates better to expert knowledge. Therefore, we used visual inspection of maps and plots to decide on the number of clusters that provide the best results, grouping countries in consistent clusters with similar background.
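The elbow plot can be reproduced in outline by fitting k-means for a range of k and recording the within-cluster sum of squared distances, which Scikit-Learn exposes as `inertia_`. The sketch below uses random, hypothetical data in place of the three PCA components; the values of `n_init` and `random_state` are illustrative choices, not settings reported in this work.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-in for the three PCA components of 155 countries.
components = rng.normal(size=(155, 3))

# Within-cluster sum of squared distances (the k-means cost) for k = 1..10.
inertias = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10,
                   max_iter=500, random_state=0)
    model.fit(components)
    inertias[k] = model.inertia_

# The "elbow" is read off visually where the decrease in cost flattens.
for k, cost in inertias.items():
    print(k, round(cost, 1))
```

Because the cost always decreases as k grows, the plot only suggests a range of plausible values, which is why it is paired here with background knowledge.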
Post-hoc analysis suggested we made a sensible choice when selecting 5 and 6 clusters. A dendrogram with Euclidean distances showed that 5 clusters were the optimum number. Similarly, the Silhouette analysis revealed the largest average Silhouette score for 3 (0.43), 4 (0.48), 5 (0.44), and 6 (0.42) clusters; all other options from 1 to 10 clusters were below 0.40. As explained above, the visual inspection of maps suggested that 3 or 4 clusters did not provide a good classification. That is, countries with no strong similarities were clustered. Overall, our choice of 5 and 6 clusters was supported by the analysed metrics (dendrogram and Silhouette).
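A minimal sketch of these two post-hoc checks, assuming random, hypothetical data in place of the PCA components, is given below. The average Silhouette score is only defined for two or more clusters, so the loop starts at k = 2. The Ward linkage is an assumption; the text specifies Euclidean distances but not the linkage method.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical stand-in for the three PCA components of 155 countries.
components = rng.normal(size=(155, 3))

# Average Silhouette score for each candidate number of clusters.
silhouettes = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(components)
    silhouettes[k] = silhouette_score(components, labels)

# Hierarchical tree with Euclidean distances (Ward linkage assumed);
# scipy.cluster.hierarchy.dendrogram(tree) would draw the dendrogram.
tree = linkage(components, method="ward", metric="euclidean")

# Cut the tree into at most five groups for comparison with k-means.
hier_labels = fcluster(tree, t=5, criterion="maxclust")

print({k: round(float(s), 2) for k, s in silhouettes.items()})
```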
We used the k-means algorithm available in the Scikit-Learn library, with five and six clusters, 500 iterations, and a fast initiation of convergence with k-means++ 21.
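For orientation, a configuration along these lines could look as follows in Scikit-Learn. Mapping "500 iterations" to `max_iter=500` is an assumption, as are `n_init` and `random_state`; the input is a random, hypothetical stand-in for the PCA components.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-in for the three PCA components of 155 countries.
components = rng.normal(size=(155, 3))

# One model per candidate number of clusters (five and six).
labels_by_k = {}
for k in (5, 6):
    km = KMeans(n_clusters=k, init="k-means++", max_iter=500,
                n_init=10, random_state=0)
    labels_by_k[k] = km.fit_predict(components)

# Cluster labels start at 0, matching the cluster numbering used in the results.
print(sorted(set(labels_by_k[5].tolist())))
```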

Statistical analysis
The COVID-19 variables (number of confirmed cases, number of deaths, case fatality rate and order when the first case appeared) were compared across clusters with the one-way ANOVA test. Between clusters, pairwise combinations were analysed with t-tests adjusted for multiple comparisons with the Bonferroni method. The statistical analysis was conducted with COVID-19 data until March 23rd, 2020. Analysis was performed in R (v3.6.1).
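The analysis itself was run in R; the Python sketch below is only an analogous illustration of the same two steps, a one-way ANOVA followed by Bonferroni-adjusted pairwise t-tests. The outcome values and cluster labels are simulated and hypothetical, and the pairwise tests here pool the variance of the two clusters being compared, which may differ from the R routine used.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical cluster labels (0-4) and a simulated outcome whose mean
# differs by cluster; these are not the study data.
labels = np.arange(155) % 5
outcome = 10.0 * labels + rng.normal(scale=5.0, size=155)

groups = [outcome[labels == c] for c in range(5)]

# One-way ANOVA across the five clusters.
f_stat, p_anova = stats.f_oneway(*groups)

# Pairwise t-tests with Bonferroni adjustment (multiply by number of pairs, cap at 1).
pairs = list(combinations(range(5), 2))
adjusted = {}
for a, b in pairs:
    p_raw = stats.ttest_ind(groups[a], groups[b]).pvalue
    adjusted[(a, b)] = min(1.0, p_raw * len(pairs))

print(p_anova < 0.001, len(adjusted))
```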

Ethics
This work analysed open-access data and did not involve any human subjects. No approval by an IRB or ethics committee was sought.

Data points
The clustering models were built with 155 countries and territories. Based on visual inspection of maps and boxplots, and on statistical parameters, the clustering models with three PCA components and five (Figure 2A) or six (Figure 2B) clusters performed the best to stratify countries according to COVID-19 variables (Figure 3; data available with the manuscript). The median and interquartile range of the variables used in the clustering analysis are presented in Table 2.

Clusters prediction
The one-way ANOVA test comparing the confirmed number of COVID-19 cases across the five and six clusters strongly suggested there was a difference between groups (p<0.001).
Regarding the model with five clusters, the strongest differences were between clusters 0 and 1, 0 and 4, 1 and 2, 2 and 3, as well as 2 and 4 (Figure 3, Table 3). Similarly, for the model with six clusters there were ten pairwise combinations with strong differences in the number of confirmed COVID-19 cases (Figure 3, Table 3).
The proposed clustering with five groups did not stratify well according to number of total deaths (p=0.067); adding one more cluster did not improve the prediction (p=0.864). None of the pairwise combinations revealed a strong difference (Figure 3, Table 3). Overall, the same findings applied to case fatality rate for five (p=0.320) and six (p=0.373) clusters, with no differences in pairwise comparisons (Figure 3, Table 3).
There was a strong difference among clusters regarding the order in which each country had the first confirmed case, regardless of the number of clusters (p<0.001). For the model with five clusters, there were strong pairwise differences in all but four pairs (Figure 3, Table 3). In a similar line, eight of the pairwise combinations in the model with six clusters revealed a strong difference (Figure 3, Table 3).

Main results
Based on open-access variables at the country level, along with unsupervised machine learning algorithms (k-means), we developed a clustering model that can classify countries well regarding the number of confirmed COVID-19 cases. However, the model did not stratify countries well according to the number of deaths or case fatality rate.
The clustering model we proposed has potential applications. First, for each cluster we report a median and a range of number of confirmed COVID-19 cases. Although still early and deserving of further scrutiny as the outbreak progresses, the results could suggest that the number of cases in one country in one cluster will be within the proposed range for that cluster, unless one country performs below the expectation (i.e., exceeds the proposed range).
Unless there are substantial changes in the predictors used to define the clusters, these could signal countries that are particularly vulnerable or resilient for future respiratory outbreaks of this kind. Future research in a similar situation can test whether the proposed clusters also stratify countries well regarding the number of cases. Alternatively, the model could be tested with data of old respiratory pandemics to assess if it would have classified countries well.
Overall, considering the limitations of this work, the stage of the ongoing COVID-19 pandemic, and the general knowledge about this disease and its epidemiological profile, we provided a preliminary clustering model that could be useful to understand similarities and differences across countries, and how they may be affected by the ongoing pandemic.

Results in context
The input variables could potentially explain the cluster configuration. For example, cluster number four had the largest number of confirmed cases. This cluster also had the best universal health coverage index. It could be argued that such a strong health system is capable of performing tests in large populations, hence a large number of diagnosed cases. Conversely, cluster number two appeared to have the worst death rates; this cluster also had the largest tuberculosis prevalence as well as the smallest gross domestic product per capita and universal health coverage index. These epidemiological (large tuberculosis burden) and socio-demographic profiles could explain the high death rates.
The cluster configuration herein presented did not seem to group countries closer to China, where the pandemic started. In other words, countries with the first imported cases did not cluster together. This could mean that the selected input variables do not correlate well with, for example, travel frequency or population movement from China to nearby countries. Alternatively, this unexpected finding could suggest that the selected input variables are more relevant than proximity or connections between countries.
We are unaware of other studies that have aimed to classify countries based on simple open-access variables, and that can stratify the countries based on the number of COVID-19 cases.
Most of the previous research using unsupervised machine learning clustering algorithms in health research has focused on individuals and diseases 16-19. This work complements the available evidence at the individual level with preliminary information on clusters at the country level, with potential relevant applications in the current COVID-19 pandemic. Nevertheless, future research should verify the accuracy and stability of our findings, so that they can be applied for this and future similar scenarios.

Strengths and limitations
We proposed a simple algorithm to classify countries regarding the number of confirmed COVID-19 cases. In that sense, this model and others can be easily applied and developed. However, there are limitations to acknowledge. First, one could argue that there were few predictors to define the clusters. However, these were relevant variables that are freely available for research and analysis. Moreover, finding reliable, consistent and comparable information for all, or most, countries in the world may be challenging. This calls on researchers and international organizations to produce more information at the country level following similar methods that will allow global comparisons and analysis. Second, we did not find any strong evidence for the total number of deaths or case fatality rate. This could be because there are, fortunately, still very few deaths in most countries, precluding strong comparisons. Our model can be tested again in the future, when the outbreak ends and there are potentially more deaths, to assess whether the performance on this outcome improves. Third, we based our analysis on the confirmed number of cases and deaths. It is expected that this number may not reflect the actual number of people with the disease. In other words, it is likely that there are more COVID-19 cases that have not been diagnosed or confirmed. This could be a limitation if we had aimed to predict the exact number of sick people, in which case we should have somehow accounted for the under-reporting.

Conclusions
Using readily available variables we developed an unsupervised machine learning algorithm that can stratify countries based on the number of COVID-19 confirmed and reported cases. This preliminary work provides a timely algorithm that could help identify countries more vulnerable or resistant to the ongoing pandemic.

Source data
The source data for this study are described in Table 1.

Extended data
This project contains the following extended data:
• Datasets.zip (containing the pooled data used in this analysis).
• Codes.zip (containing codes used in the analysis to develop the cluster and to assess its performance).

Open Peer Review
(major point) It still remains unclear what the "visual inspection of maps" entails with regards to how k was selected. Is this, for example, purely geographical or geopolitical? Or was the similarity of countries assessed on the basis of the input variables, or perhaps the outcomes? The elbow plot and silhouette score both point towards the k=4 solution. Given that cluster analysis is generally used to uncover "hidden" patterns in data, then perhaps "dissimilar countries" were grouped due to some unmeasured factor(s). If, however the k=4 solution showed no "interesting" segmentation with regards to the outcome variables then this should be stated as it is sensible to reject it on that basis.

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Cluster analysis; phenotype discovery; airways disease; health informatics.
We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.

Author Response 10 Jun 2020
Rodrigo M Carrillo-Larco, Imperial College London, London, UK
Q1. The authors have responded in detail to our review and we welcome the changes they have made. We only have two further points to make, both relating to the section discussing cluster number selection. Addressing the first point should be straightforward and is of lesser importance. Addressing the second point is in our opinion essential, as the selection criteria for the model parameters (in this instance k) should be clearly stated.
A1. We thank the reviewer for the very relevant comments.

Q2. (minor point) Ideally, the method of selecting the number of k should be presented in the methodology section and the findings (of the elbow plot and silhouette scores) should be presented in the results section.
A2. If the editors allow, we would rather keep the manuscript as is. We agree that the findings of the cluster selection process may be shown in the results section. However, we focused this as an epidemiological work, that took advantage of a solid machine learning methodology. In that line, the results section shows the clusters, countries and their profiles, i.e., epidemiological evidence. All aspects of the machine learning analytical process were included in the methods section.

Q3. (major point) It still remains unclear what the "visual inspection of maps" entails with regards to how k was selected. Is this, for example, purely geographical or geopolitical? Or was the similarity of countries assessed on the basis of the input variables, or perhaps the outcomes? The elbow plot and silhouette score both point towards the k=4 solution. Given that cluster analysis is generally used to uncover "hidden" patterns in data, then perhaps "dissimilar countries" were grouped due to some unmeasured factor(s). If, however the k=4 solution showed no "interesting" segmentation with regards to the outcome variables then this should be stated as it is sensible to reject it on that basis.
A3. The last statement by the reviewer is a close representation of our process; however, we would not only use the word "interesting", but also "reliable" or "expected". We are sorry this did not come across in the last version. By "visual inspection" we meant that, based on general knowledge (geographical, geopolitical and epidemiological), 4 clusters grouped countries with little in common; in other words, based on prior knowledge, they did not have strong reasons to be together. It is not just that 4 clusters were uninteresting, but the configuration would not fully agree with prior belief; though 5 or 6 clusters would make more sense. As explained in our prior answer, we focused this more like an epidemiological work, thus we did not "blindly" follow the elbow or silhouette estimates, but tried to understand, based on prior knowledge, whether the clusters were sensible or expected. We have edited the methods section and included these lines: As explained above, the visual inspection of maps suggested that 3 or 4 clusters did not provide a good classification. That is, countries with no strong similarities were clustered. Visual inspection of the maps was based on geopolitical, geographical and epidemiological knowledge, in general and regarding the input variables. A segmentation in 4 clusters did not reveal interesting, reliable or expected groups; in other words, based on background knowledge, countries expected to be together were not. A segmentation in 5 and 6 clusters provided sensible results in accordance with prior knowledge. Overall, our choice of 5 and 6 clusters was sensible, based on prior knowledge and still supported by the analysed metrics (dendrogram and Silhouette).
The study design and results are, for the most part, clearly presented, and the article is well-written. However, information is lacking with regards to both methodological aspects as well as the presented findings of the study. Most importantly, it is not clear what question the study is trying to answer.

Is the study design appropriate and is the work technically sound?
1) In terms of the appropriateness of the study, besides the lack of similar studies in the literature, no further justification is given as to why this study design was selected. If the purpose of the study is to allow for prediction of COVID-19 outcomes, a predictive model might have been more appropriate. The rationale for selecting cluster analysis is not sufficiently explained. Furthermore, the selection of input variables seems to be based on their availability rather than evidence from the literature that would make them suitable candidates for inclusion. The authors mention in the discussion that these variables are "relevant", however this claim is not substantiated.

Are sufficient details of methods and analysis provided to allow replication by others?
2) The paragraph explaining the selection of principal components should be re-written as it is ambiguous whether the retention of three PCA components was pre-specified or whether keeping 100% of the explained variance was the original target. It is my understanding that four variables were used as input in the PCA and the first three components were selected, and that the three together explain 100% of the variance. It makes no sense for solely the third component to explain 100% of the variance, especially given that the output of PCA lists components in descending order of % explained variance.
3) Related to the comment above, it is mentioned that "three components gave the most reliable clusters". By which metric was reliability assessed? If this is to do with cluster stability, typically this entails re-sampling the data and verifying cluster stability with regards to the cluster characteristics using a metric such as the Jaccard coefficient 1.
4) The following sentence in the section labelled k-means needs rephrasing: "Regarding the number of clusters that optimises the function convergence to the centroids, we estimated a cost function which supported the choose of five and six clusters". At the moment it is not clear which cost-function is being referred to and what is meant by estimating a cost function. I suspect the authors are referring to the standard k-means cost function, the sum of squared distances from each point's cluster centre.
5) It is not clear how the choice of 5 or 6 clusters was made. According to the elbow plot in Figure 1, the elbow point is at 4 clusters. It is also unclear how the clustering results were used for the purpose of selecting k "based on visual inspection of maps and boxplots". The maps in Figure 2 are fairly similar between the 5- and 6-cluster solutions and the boxplots in Figure 3 also suggest that clusters 0, 3 and 4 remain the same with some countries in clusters 1, 2 of the 5-cluster solution redistributed between them and with the additional cluster 5 in the 6-cluster solution.
6) There are more reliable metrics to aid with cluster selection, including the silhouette coefficient 2 , and the GAP statistic 3 . The elbow plot is simply a heuristic. The authors should at least explain their choice of method.

If applicable, is the statistical analysis and its interpretation appropriate?
7) Although appropriate, the statistical analysis lacks further interpretation. The usefulness of the model could be illustrated by evaluating the predictive value of cluster labels to answer the question "Are the labels more predictive than individual variables?"
8) The resulting clusters are difficult to interpret without a summary table of cluster characteristics in terms of the 4 input variables used in the analysis.

Are the conclusions drawn adequately supported by the results?
9) No specific conclusions are drawn in the discussion. What are the cluster characteristics and how are they associated with confirmed COVID-19 cases? Are the results expected, surprising? There is little discussion on the characteristics, whether present or absent in the model, that would drive the countries to cluster together with regards to the number of reported cases. A few example points for discussion are listed below.
10) It appears from the map distribution that the clusters loosely correlate with GDP, although without a summary table confirming this is hard to tell for certain. I am not an epidemiologist and neither is NA, therefore it is not our area to comment, but countries with higher GDP are more likely to perform more tests, and are thus more likely to have a higher number of cases.
11) Additionally, some countries are more connected than others (e.g. because of air travel), and the spread of COVID-19 is not uniform across the world (e.g. countries that are closer to China reported cases earlier) and therefore, different countries are at different stages of the pandemic. It would make more sense to separately cluster countries with similar exposure to the virus as well as comparable reporting standards.
We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.

Author Response 29 May 2020
Rodrigo M Carrillo-Larco, Imperial College London, London, UK

Reviewer #2
Q1. The study design and results are, for the most part, clearly presented, and the article is well-written. However, information is lacking with regards to both methodological aspects as well as the presented findings of the study. Most importantly, it is not clear what question the study is trying to answer.
A1. We appreciate the comprehensive evaluation; the comments will most certainly improve our work. We have included more details about the methodology (please refer to answers 4, 5 and 6); moreover, we have further elaborated on the results and discussion (please refer to answers 7, 8, 9, 10 and 11).
More than pursuing a specific research question, we aimed to develop a classification model that, benefiting from simple and available ecological variables, could cluster countries according to COVID-related outcomes (number of cases and deaths). If anything, our research question would be: can country characteristics before the COVID-19 pandemic be useful to cluster countries according to COVID-19 number of cases and deaths? We have modified the last paragraph of the introduction to include this question.
Q2. In terms of the appropriateness of the study, besides the lack of similar studies in the literature, no further justification is given as to why this study design was selected. If the purpose of the study is to allow for prediction of COVID-19 outcomes, a predictive model might have been more appropriate. The rationale for selecting cluster analysis is not sufficiently explained. Furthermore, the selection of input variables seems to be based on their availability rather than evidence from the literature that would make them suitable candidates for inclusion. The authors mention in the discussion that these variables are "relevant", however this claim is not substantiated.
A2. We agree that lack of evidence is not a strong justification, and we acknowledge we were not clear on our motivations. These have been further elaborated in the last paragraph of the introduction; these lines read: Therefore, classification algorithms can reveal patterns to identify countries where the pandemic may have a similar effect. Countries could use this information to prevent worst-case scenarios given the cluster to which they belong. Global and regional organizations could use country clusters to organize similar aid to countries in the same cluster while prioritizing clusters likely to experience the worst outcomes.
We certainly included variables that were readily available. However, we also chose variables that were closely related to the COVID-19 pandemic. The rationale behind our variable selection was explained in the paragraph immediately before the "Data analysis-clustering" sub-heading. In these lines, we elaborated on why we chose the selected variables, what their relationship may be with COVID-19, and why we did not choose other variables that could have been available as well. References were included to support our statements.
Q3. The paragraph explaining the selection of principal components should be rewritten as it is ambiguous whether the retention of three PCA components was prespecified or whether keeping 100% of the explained variance was the original target. It is my understanding that four variables were used as input in the PCA and the first three components were selected, and that the three together explain 100% of the variance. It makes no sense for solely the third component to explain 100% of the variance, especially given that the output of PCA lists components in descending order of % explained variance. A3. We apologise for the misunderstanding, which was the consequence of a miscommunication. A priori, we decided on three PCA variables. We included eight input variables (please refer to answer 8) and applied the PCA. As you inferred correctly, these three PCA variables together retained or explained 100% of the variance. As you correctly pointed out, it made no sense for solely the third component to explain 100% of the variance. We have modified the text in the "PCA" sub-heading to better reflect this procedure: In this work, and to avoid losing information explained by the original eight predictors, we prespecified three PCA components; the three PCA components retained a variance of 1. Obtaining an explained variance of 100% implies keeping 100% of the information explained by the original eight predictors.
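For illustration only, the following minimal sketch shows how the cumulative share of variance retained by the leading components is usually obtained from the eigenvalues of a PCA. The function name and the eigenvalues are hypothetical; they are not the values from our analysis.

```python
def cumulative_explained_variance(eigenvalues):
    """Cumulative share of variance explained by components,
    ordered from the largest eigenvalue to the smallest."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    shares = []
    running = 0.0
    for value in ordered:
        running += value
        shares.append(running / total)
    return shares


# Hypothetical eigenvalues for eight standardised inputs (NOT the study's values).
hypothetical_eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.25, 0.125, 0.0625, 0.0625]
shares = cumulative_explained_variance(hypothetical_eigenvalues)

# Share of variance retained by the first three components in this toy case.
retained_by_three = shares[2]
```

In this toy case the first component alone cannot reach the total, and the cumulative share only equals 1 once every component with a non-zero eigenvalue has been included.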
Q4. The following sentence in the section labelled k-means needs rephrasing: "Regarding the number of clusters that optimises the function convergence to the centroids, we estimated a cost function which supported the choose of five and six clusters". At the moment it is not clear which cost-function is being referred to and what is meant by estimating a cost function. I suspect the authors are referring to the standard k-means cost function, the sum of squared distances from each point's cluster centre. A4. We referred to the "elbow" plot (Figure 1). We have rephrased this sentence to make it clear that we were referring to the "elbow" plot in Figure 1. Please refer to answers 5 and 6 for details about other modifications made regarding the analysis and cluster selection.
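As a hedged illustration of the quantity plotted in an "elbow" plot, the sketch below computes the standard k-means cost mentioned by the reviewer: the sum of squared Euclidean distances from each point to the centre of its assigned cluster. The function name and the toy points are assumptions made for this example and do not come from our data.

```python
def within_cluster_sum_of_squares(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to the
    centroid of the cluster it is assigned to."""
    total = 0.0
    for point, label in zip(points, labels):
        centre = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centre))
    return total


# Toy data: two well-separated pairs of points (hypothetical).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Cost with a single cluster centred on the overall mean.
cost_one_cluster = within_cluster_sum_of_squares(
    points, [0, 0, 0, 0], [(5.0, 1.0)]
)

# Cost with two clusters, each centred on the mean of one pair.
cost_two_clusters = within_cluster_sum_of_squares(
    points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)]
)
```

Plotting this cost against the number of clusters gives the elbow curve; the cost can only decrease as clusters are added, which is why the plot is read for the point where the decrease levels off rather than for a minimum.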
Q5. It is not clear how the choice of 5 or 6 clusters was made. According to the elbow plot in Figure 1, the elbow point is at 4 clusters. It is also unclear how the clustering results were used for the purpose of selecting k "based on visual inspection of maps and boxplots". The maps in Figure 2 are fairly similar between the 5- and 6-cluster solutions and the boxplots in Figure 3 also suggest that clusters 0, 3 and 4 remain the same with some countries in clusters 1, 2 of the 5-cluster solution redistributed between them and with the additional cluster 5 in the 6-cluster solution. A5. The selection of 5 and 6 clusters was informed mostly by epidemiological knowledge about the countries and how these were clustered. We did not choose 4 clusters, as the elbow plot would have suggested, because some countries were clustered with others with which they have little in common, epidemiologically speaking. This is what we meant by "visual inspection of maps and boxplots": mostly maps, though we also checked the boxplots. We have included a few lines in the methodology section ("K-means" sub-heading) to explain our rationale: …That is, five and six clusters classified countries in groups with shared socio-demographic and epidemiological profiles. Although five and six clusters provided similar groups, six clusters classified central Africa with greater detail, which could be useful for these countries and regional organizations. Overall, the cost function (elbow plot, Figure 1), paired with the overall results (boxplots and maps), suggested that five or six clusters were a sensible decision.
The maps with 5 or 6 clusters look similar. However, the map with 6 clusters classified countries in central Africa with greater detail. Although these countries are in the same sub-region, socioeconomic and epidemiological differences give them unique features that a 6-cluster model can identify. We have also included this argument in the new lines (please refer to the text in italics in the previous paragraph).
For further arguments about the choice of 5 and 6 clusters, please refer to answer 6.
Q6. There are more reliable metrics to aid with cluster selection, including the silhouette coefficient 2, and the GAP statistic 3. The elbow plot is simply a heuristic. The authors should at least explain their choice of method. A6. We did not follow any of these methods because of the limited number of observations available; that is, the number of countries (analysis units) studied. Given the reduced number of observations, the elbow function would be fairly similar for the number of clusters close to the "elbow". At this stage, it is advisable to subjectively assess which clusters give the best information or correlate better with expert knowledge [1], [2], rather than relying only on performance metrics. As requested, we have further elaborated on the rationale for the choice of method: When there is a limited number of observations, as is arguably the case in this analysis, the number of clusters around the "elbow" function (Figure 1) provides similar information. At this point, it may be advisable to select the number of clusters which relates better to expert knowledge. Therefore, we used visual inspection of maps and plots to decide on the number of clusters that provide the best results, grouping countries in consistent clusters with a similar background.
In addition, to further elaborate on our current choice of method, and for clarity, transparency and consistency, we have conducted further analysis. First, the dendrogram with Euclidean distances showed that 5 clusters was the optimum number; this agrees with our current choice. The Silhouette analysis showed the metrics summarised in the table below. These show that the largest metrics (>40%) were retrieved for 3, 4, 5 and 6 clusters (please see rows highlighted in green). After visual inspection of the maps with 3 and 4 clusters, we agreed that these did not classify or stratify countries well. In other words, there were countries in one cluster that may not have strong similarities (at least in epidemiological or socio-demographic terms). Consequently, 5 and 6 clusters appeared to be better options; again, the average silhouette score agreed with our original choice. We have included the following paragraph in the "K-means" sub-heading: Post-hoc analysis suggested we made a sensible choice when selecting 5 and 6 clusters. A dendrogram with Euclidean distances showed that 5 clusters was the optimum number. Similarly, the Silhouette analysis revealed the largest average Silhouette score for 3 (0.43), 4 (0.48), 5 (0.44), and 6 (0.42) clusters; all other options from 1 to 10 clusters were below 0.40. As explained above, the visual inspection of maps suggested that 3 or 4 clusters did not provide a good classification. That is, countries with no strong similarities were clustered. Overall, our choice of 5 and 6 clusters was supported by the analysed metrics (dendrogram and Silhouette).
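To make the average Silhouette score concrete, the sketch below computes it from scratch for a small set of points. It is a minimal illustration under stated assumptions (Euclidean distance, at least two clusters, a score of 0 for single-member clusters, no cluster made up only of identical points); the function name and toy points are hypothetical and this is not the code used in our analysis.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette score with Euclidean distances.

    Assumes at least two distinct labels. A point in a single-member
    cluster is given a score of 0, a common convention.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance of the point to itself contributes 0).
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: two well-separated pairs of points (hypothetical).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
score_natural_grouping = mean_silhouette(points, [0, 0, 1, 1])
score_mixed_grouping = mean_silhouette(points, [0, 1, 0, 1])
```

Scores close to 1 indicate compact, well-separated clusters, whereas negative scores indicate points that sit closer to another cluster than to their own; comparing the average across candidate numbers of clusters is how the values reported above are read.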
Q7. Although appropriate, the statistical analysis lacks further interpretation. The usefulness of the model could be illustrated by evaluating the predictive value of cluster labels to answer the question "Are the labels more predictive than individual variables?" A7. We have further discussed (interpreted) the relationship between the input variables, the cluster configuration, and how these relate to the outcomes. Please refer to answers 9 and 10 for further details on the new text.
Although interesting, the proposed research question is beyond the aims of this work. The research question and justification have been further elaborated (please refer to answers 1 and 2). Arguably, any cluster may predict better than individual variables. That is a strong argument in favour of risk prediction models, above and beyond risk/prognostic factors alone.
Q8. The resulting clusters are difficult to interpret without a summary table of cluster characteristics in terms of the 4 input variables used in the analysis. A8. We have included a table showing the median and interquartile range of the eight input variables across clusters (Table 1).
There were eight input variables (Table 1). Disease prevalence comprised four diseases, that is, four prevalence estimates and hence four variables, in addition to air quality, GDP, universal health coverage index and proportion of male subjects in the country (eight input variables in total). We have included the following lines under the "Data sources" sub-heading to avoid confusion: …that is, we used eight input variables for the cluster analysis: four diseases, air quality, gross domestic product per-capita, a universal health coverage index and the proportion of men in the country (Table 1).
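As a hedged sketch of how a summary such as Table 1 can be produced, the code below groups one input variable by cluster label and reports the median with the first and third quartiles. The function name, the values and the cluster labels are hypothetical; the "inclusive" quartile method is an assumption, and other quartile definitions give slightly different bounds.

```python
from statistics import median, quantiles


def summarise_by_cluster(values, labels):
    """Return {cluster: (median, first quartile, third quartile)}.

    Each cluster is assumed to contain at least two observations,
    which statistics.quantiles requires on older Python versions.
    """
    grouped = {}
    for value, label in zip(values, labels):
        grouped.setdefault(label, []).append(value)

    summary = {}
    for label, members in sorted(grouped.items()):
        q1, _, q3 = quantiles(members, n=4, method="inclusive")
        summary[label] = (median(members), q1, q3)
    return summary


# Hypothetical values of one input variable and hypothetical cluster labels.
hypothetical_values = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0]
hypothetical_labels = [0, 0, 0, 0, 0, 1, 1, 1]
summary = summarise_by_cluster(hypothetical_values, hypothetical_labels)
```

Repeating the call for each of the eight input variables would fill one row of the table per variable and one column per cluster.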
Q9. No specific conclusions are drawn in the discussion. What are the cluster characteristics and how are they associated with confirmed COVID-19 cases? Are the results expected, surprising? There is little discussion on the characteristics, whether present or absent in the model, that would drive the countries to cluster together with regards to the number of reported cases. A few example points for discussion are listed below. A9. We have further discussed the cluster characteristics (input variables) and how these may explain the cluster configuration in relation to COVID-19 outcomes. These lines in the discussion section read ("Results in context" sub-heading): The input variables could potentially explain the cluster configuration. For example, cluster number four had the largest number of confirmed cases. This cluster also had the best universal health coverage index. It could be argued that such a strong health system is capable of testing large populations, hence a large number of diagnosed cases. Conversely, cluster number two appeared to have the worst death rates; this cluster also had the largest tuberculosis prevalence as well as the smallest gross domestic product per capita and universal health coverage index. These epidemiological (large tuberculosis burden) and socio-demographic profiles could explain the high death rates.
Q10. It appears from the map distribution that the clusters loosely correlate with GDP, although without a summary table confirming this is hard to tell for certain. I am not an epidemiologist and neither is NA, therefore it is not our area to comment, but countries with higher GDP are more likely to perform more tests, and are thus more likely to have a higher number of cases. A10. We have further discussed how GDP, as an input variable in the cluster configuration, may relate to how the clusters reveal COVID-19 outcomes. Please refer to the previous answer for details about the new text.
Q11. Additionally, some countries are more connected than others (e.g. because of air travel), and the spread of COVID-19 is not uniform across the world (e.g. countries that are closer to China reported cases earlier) and therefore, different countries are at different stages of the pandemic. It would make more sense to separately cluster countries with similar exposure to the virus as well as comparable reporting standards. A11. It would be difficult to separately cluster countries with similar exposure to the virus; it would be even more difficult to identify a threshold to define "similar exposure to the virus". This approach would make the clustering more complex, which we tried to avoid by selecting variables readily available yet closely correlated to COVID-19 (please refer to answer 2). Similarly, comparable reporting standards are not a static measure. Countries have improved their reporting standards at different paces and through different means during the pandemic. Finally, both the exposure to the virus and reporting standards are characteristics of the pandemic. However, our aim was to use pre-pandemic characteristics.
We have further discussed the relevance of flights or connections. Please, refer to the discussion section for the new text ("Results in context" sub-heading): The cluster configuration herein presented did not seem to group countries closer to China, where the pandemic started. In other words, countries with the first imported cases did not cluster together. This could mean that the selected input variables do not correlate well with, for example, travel frequency or population movement from China to nearby countries. Alternatively, this unexpected finding could suggest that the selected input variables are more relevant than proximity or connections between countries.
Q12. Figure 1 needs axis labels. A12. We are providing a new figure with axis labels.