Subphenotyping of COVID-19 patients at pre-admission towards anticipated severity stratification: an analysis of 778 692 Mexican patients through an age-gender unbiased meta-clustering technique

We apply a meta-clustering technique to discover age-gender unbiased COVID-19 patient subphenotypes based on phenotypical before admission, including pre-existing comorbidities, habits and demographic features, to study the potential early severity stratification capabilities of the discovered subgroups through characterizing their severity patterns including prognostic, ICU and morbimortality outcomes. We used the Mexican Government COVID-19 open data including 778,692 SARS-CoV-2 population-based patient-level data as of September 2020. The meta-clustering technique consists of a two-stage clustering approach combining dimensionality reduction and hierarchical clustering: 56 clusters from independent age-gender clustering analyses supported 11 clinically distinguishable meta-clusters (MCs). MCs 1-3 showed high recovery rates (90.27-95.22%), including healthy patients of all ages; children with comorbidities alongside priority in medical resources; and young obese smokers. MCs 4-5 showed moderate recovery rates (81.3-82.81%): patients with hypertension or diabetes of all ages; and obese patients with pneumonia, hypertension and diabetes. MCs 6-11 showed low recovery rates (53.96-66.94%): immunosuppressed patients with high comorbidity rate; CKD patients with poor survival length and recovery; elderly smokers with COPD; severe diabetic elderly with hypertension; and oldest obese smokers with COPD and mild cardiovascular disease. Group outcomes conformed to the recent literature on dedicated age-gender groups. These results can potentially help in the clinical patient understanding and their stratification towards automated early triage, prior to further tests and laboratory results are available, or help decide priority in vaccination or resource allocation among vulnerable subgroups or locations where additional tests are not available. that MC4 includes patients of all ages (25%) without obesity and MC5 contains mostly young adults (66.7%) who suffer from obesity. This seems to suggest that obese young adults may behave as “older”, implying higher mortality 28,33 . However, we found just the opposite in young individuals without pre-existing comorbidities: MC1 and 3 have similar RRs albeit MC3 contains a significant number (59.27%) of obese patients or smokers. These findings suggest the role of habits cannot be considered alone, but always along with age, comorbidities and duration of unhealthy habits. Our results found that smoking is risk factors for severe COPD and cardiovascular, primarily in older patients – MC8, 10, 11. Therefore, it results reasonable that the longer the time as a smoker, the greater the incidence of severe disease. In young patients, however, the evidence of smoking’s negative influence is not so straightforward. Some reviews have presented current smoking as a protective factor versus former smoking, while it is clearly a risk factor versus never smoking 34 . Our results show that groups gathering young smokers have RRs which are not inferior to age-matched non-smoking groups, as proven by the RR of MC3 (95.22%, 34% smokers) versus MC2 (91.37%, 9.7% smokers). can explore the treatment and vaccination implications toward providing guidance on clinical triage and customize therapy, and also develop clinically robust subphenotype classification methodologies combined with the proposed two-stage cluster analysis. As the concern paid to efficient triage and personalized treatment increases, we facilitate further replicability of the study and generalization to other countries data by making available our experiment codes.


Introduction
In Mexico, mid-January 2020 reported the first cases of COVID-19. In early March 2020, the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was declared by the World Health Organization as a pandemic 1 . As of August 13th, 2020, a total of 20,439,814 confirmed cases of coronavirus disease 2019 (COVID -19) have been reported to the World Health Organization, and 744,385 lives have been lost 2 .
The COVID-19 pandemic has led to an unprecedented global healthcare challenge for both medical institutions and researchers. They have been making a huge effort to describe specific COVID-19 risk factors association, severity outcomes, and also personalized therapeutic options for COVID-19 patients being yet under investigation 3,4,5 . Recognizing different COVID-19 subphenotypes and their severity characterization may assist clinicians during the clinical course and research efforts. However, the availability of information to investigate such subphenotypes and consequent decision making often varies both according to the stage at which patients are in the COVID-19 clinical workflow -e.g., before admission, at admission, or during hospitalizationand according to the hospital access possibilities -e.g., hospitalized versus ambulatory patients-, especially in locations where hospitalization is difficult. In addition, the patient age and gender entail a potential correlation between subgroup characterization and their severity characterization which requires prudent usage in ML models.
Several studies have suggested potential COVID-19 subphenotypes, mainly within specific comorbidities such as pulmonary diseases or diabetes 6,7 or related to distinct genetic variants 8 . However, the Mexican population has its particularity due to a high prevalence of comorbidities, like hypertension, diabetes -a leading cause of death in 2020 9and obesity, which is leading the population to a undesirable risks for severe coronavirus outcomes than many other high-income countries 10 . Since distinct target population often present heterogeneous clinical characterization and severity outcomes, it remains crucial a transparent understanding regarding the characterization of COVID-19 subphenotypes in Mexican patients to help anticipate individuals' prognostic outcomes if one gets infected and evaluate subphenotypic severity presentations.
We describe the results of an unsupervised Machine Learning (ML) meta-clustering approach to identify potential subphenotypes of COVID-19 patients in Mexico based on previously existing comorbidities, habits and demographic features (i.e., age and gender). Stratification on gender and age groups was included for three primary reasons: (1) to reduce potential ML model's biases in representing the best the majority (e.g. the young adults) but not underrepresented groups (e.g. children and elderly) 11 ; (2) to reduce potential confounding factors from age and gender which are highly correlated with comorbidities, habits and mortality -i.e. age-gender clusters may help reveal more well-detailed patterns and phenotypical description-; and (3) to reduce interpretation biases, e.g. if one healthy cluster presents a mortality rate of 98.5% but includes patients from all ages, this specific mortality rate may vary across two patients from the same cluster whose age differ significantly (e.g. children versus adults). See section 1 of supplementary material for further details.
By using a population-based cohort of more than 700,000 patient-level cases, this is probably the largest cluster analysis about coronavirus patient-level cases to date. Other studies proposed unsupervised ML methods for aggregated population data 12 , CT image analyses 13,14 , molecular-level clustering 15 , or coronavirus-related scientific texts 16 . Several studies provided to date results from unsupervised ML on patient-level epidemiological data 17,18,19,20,21,22,23,24 . To our knowledge, however, none characterized agegender subphenotypes, nor aimed to a population-based study with solely the phenotypical information available at pre-admission towards automated risk stratification, and neither characterized the Mexican population that is generally more vulnerable due to its particularity in a high prevalence of comorbidities.
Performing an accurate triage upon admission, but especially in ambulatory settings, is often challenging, significantly depending on the patient information available to the physicians. This work, therefore, aims to characterize age-gender COVID-19 subphenotypes that may potentially establish target groups for triage systems to assist clinicians in efficiently allocating limited resources and prioritize vaccination among subgroups that are more vulnerable when they get infected during the pandemic. As these subphenotypes are based on easily available data, such as previous disease and lifestyle habits rather than COVID-19 related symptoms (e.g., fever and nausea), vital signs and biomarkers that are not often available in the first days of COVID-19 infection or difficult to obtain due to limited resources, our work therefore could support early triage prior to further tests and laboratory results and even provide guidance in areas where such tests are not available.

Results
Age-gender cluster analysis After evaluating the stratified clustering results, we selected the following number of clusters (k) for each specific age-gender group: <18-Male: k=5, <18-Female: k=4, 18-49-Male: k=7, 18-49-Female: k=7, 50-64-Male: k=9, 50-64-Female: k=8, >64-Male: k=8, >64-Female: k=8. This resulted in 56 age-gender clusters in total. Supplementary Table 5 describes the number of individuals for each age-gender cluster. By taking PCA scores, fed by the comorbidities and habits ratios of each group, we performed the second-stage metaclustering analysis and established 11 clinically distinguishable MCs. Figure 1 describes the relationships among different comorbidities and habits of the original 56 age-gender clusters through the first two principal components ( Figure 1A), also provides the correspondence to their assigned MCs ( Figure 1B) and their LOESS delineations for distinct severity outcomes ( Figure 1C to 1H).
The 56 clusters PCA analysis uncovered noticeable patterns and characterization among clusters of different ages in both genders. Young adults are prone to asthma and smoking habit; whereas the elderly was prone to many comorbidities such as hypertension, diabetes, obesity, COPD, pneumonia, and CKD. The results also show that obesity and smoking habit -both positively correlated-are strongly separated from immunosuppression and other diseases -both positively correlated-, implying these two pairs of features are negatively correlated in the studied data subgroups.
The LOESS models show that children took fewer days from presenting symptoms to hospitalization, showing higher ICU, intubation, and hospitalization rates than adults with similar conditions (Figure 1D, E, G, H). In contrast, MC3 -young obese cluster with moderate asthma and smoke rates-behaved inversely, implying that children, under similar clinical conditions, may receive priority regarding medical attention.
Inspecting the relationship between the PCA and LOESS models shows that CKD is significantly associated with a shorter survival length among deceased patients and an increase in intubation rates ( Figure 1E, D). Mortality constantly increases from children to the elderly, but the most severe zones are inclined toward pneumonia, CKD, and COPD ( Figure 1C) independently of the age groups.
All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.    25 . Figure 2 reinforces that the children have a faster time from symptoms to hospitalization and are prone to ICU admission despite presenting similar clinical condition than adults (e.g., cluster <18M3 versus 50-64F5). Regarding gender discrepancy, females showed a better Recovery Rate (RR) despite presenting similar clinical conditions than males (e.g., >64M1 versus >64F1).
The phenotypes, demographic features and outcomes of each cluster group can be fully explored at http://covid19sdetool.upv.es/?tab=mexicoGov. All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  Yes MC1 includes two healthiest clusters per each age-gender group, with a very high RR (90%). Most deceased patients in MC1 with pneumonia are older patients ( Figure 2). MC2 includes children and young individuals (mean age 18) with healthy habits and little incidence of relevant diseases (13% immunosuppression, 17% All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
MC4 includes individuals of all ages with healthy habits but, unlike MC1, most patients in MC4 have hypertension (41%) or diabetes (39%), but not both simultaneously. MC5 includes young adults with obesity (75%), diabetes (57%) and/or hypertension (69%). Despite of this dissimilarity, MC4 and 5 still have similarly high RRs, of approximately 80%. From MC4 onwards, all MCs have from 40 to 50% incidence of pneumonia as of case reporting; what does not exclude the possibility that some patients developed pneumonia days after. Noteworthy, in groups 4 to 11 more than 70% of deceased patients were diagnosed with pneumonia.
MC7 and 11 hold the lowest RRs (54% and 56%). MC7 includes older adults and elderly with common diseases -diabetes, hypertension and cardiovascular disease-plus CKD (81%). CKD stands out as the differential factor with similar MCs with low RRs, such as 6 or 9. MC11 is similar to 8 and 10; the key differences are the higher prevalence of smoking (78%, which doubles the former) and COPD (almost all patients, 91%), and a mean age eight years older (76 vs. 68 years). In addition, MCs that include older obese patients with smoking habits -MC8, 10, and 11-have significantly higher COPD and cardiovascular incidence, an association that does not occur with the young smokers -MC3.
All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Variability among states and types of clinical institution
Regarding state variability, half of the Mexican states are prone to a higher probability in healthy clusters with better RR, lower hospitalization, ICU, and Intubation rates among each age-gender group ( Figure 3A, e.g., 18F2, 18M1, 18-49F1, 18-49M1, 50-64F1) and MCs ( Figure 3C), whereas another half behave inversely. Hidalgo, Baja California and Morelos represent healthiest groups as a contrast with Oaxaca, Coahuila de Zaragoza and Durango that represent the less healthy. Surprisingly, Mexico City showed significantly higher probability in healthier clusters than State of Mexico, albeit the population of their main urban areas are close, and both have similar resources and economic development level.
Regarding TCI variability ( Figure 3B, D), SSA, DIF, Private and Red Cross are prone to healthier young patients. This pattern occurred inversely in other TCIs, especially the Mexican Petroleum Institution, whose severe cluster probabilities are generally higher. The clinical institutions of the armed forces (SEMAR, SEDENA) were mostly healthy, intuitively with a higher probability of male patients. Noteworthy, among the three primary TCIs in Mexico, the public health system (SSA) is prone to mild-comorbidity and have relatively higher probabilities in healthy clusters among each age-gender groups, mostly in MC1 (57%) and 3 (16%); whereas in the two main social security systems (IMSS, ISSSTE) the situation is just the opposite. All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Discussion
To date, several reports used cluster analysis to describe heterogeneity or characterization in COVID-19 patient-level epidemiological data 17,18,19,20,21,22,23,24 . To our knowledge, none characterized populationbased data (778,692 patients) and neither analyzed the Mexican population to find potential subphenotypes through cluster analysis on age-gender controlled patient strata. Age-gender unbiased COVID-19 characterization is crucial for a robust subphenotype description as well as a deeper understanding of the inter-patient disease patterns based on exclusively pre-existing comorbidities and lifestyle habits. Such characterization could assist clinicians in anticipating individuals' risk if one gets infected, the poor outcomes, and morbimortality as well as providing bases to improve early triage systems with limited resources.
Previous literature have reported isolated risk factors and association of several severe disease progression. However, such information has a potential limitation in clinical decision-making progress. In this work, no single clinical variable or lifestyle habit was enough to characterize the COVID-19 subphenotypes; a typical phenomenon when the data has many categorical variables. This reflects the reality of clinical practice: patients do not usually fall into subgroups of "all good" or "all bad" outcomes, neither the patient outcomes are conclusive by one single variable. However, when considering the variables together, our study uncovered 11 clinically distinguishable MCs among 56 plausible age-gender clusters; these MC-defined subphenotypes alongside age-gender stratification may represent different disease mechanisms and outcome.
Each of the 11 MCs shows clinical consistency: their group outcomes can be potentially predicted from the proposed input variables, according to the literature published up to date. From an outcome perspective, a dividing line can be clearly drawn between MCs 1-5 and 6-11. While the former's have RRs always over 80%, the latter's overall survival never exceeds 70%. Several factors can explain these findings, mainly the age, habits, comorbidities. Since all MCs contain 30-60% of women within their input age-gender clusters, the association between gender and mortality is hard to see based only on MCs. However, age-gender cluster analysis showed clearly better outcomes in females despite presenting similar conditions than males; a phenomenon that also occurs between patients that have similar conditions from different age groups. Therefore, considering both age-gender clusters and meta-clusters is essential for a better characterization that reveals more relevant detailed information in COVID-19 subphenotypes.
Hereinafter, we discuss our results in accordance with both MCs and age-gender clusters and relate them with supporting literature to discuss clusters' clinical consistency through the associated risk factors, including age, habits and comorbidities, as well as on patient provenance and types of clinical institution.

Age
Two groups with very high RRs are MC2 and MC3, which contain children and young adults. Age may play a protective role against the disease for two reasons. First, as proven by MC3 versus all single-aged groups (MCs 6-11), pneumonia incidence is lower in young healthy groups; hence, good RR could be attributable to mild SARS-CoV-2 disease. Second, as shown by good RRs in MC2 -children with severe disease, response to treatment is probably also better at younger ages.
Besides, children -MC2-showed to receive priority regarding medical attention than adults with similar clinical conditions in Mexico. After discussion with Mexican clinicians, one explanation seems to be that in early ages, the decompensation or deterioration caused by a pulmonary disease is faster than adults and with a higher risk of death. While in adults, there is often some time margin to evaluate the patient condition's evolution before intubation or ICU admission, but not for children. Furthermore, if besides the presence of pneumonia, the groups are defined by conditions such as CKD and cardiovascular issues, a child who already has those issues could be perceived as much higher risk/more vulnerable than an elderly. These results are supported by recent literature, a study with a small cohort from Madrid 26 found 10% of 41 children with SARS-CoV-2 required ICU admission. Another study 27 showed that severe COVID-19 can also happen in small children and adolescents, where risk factors for ICU admission included age younger All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 24, 2021. ; https://doi.org/10.1101/2021.02.21.21252132 doi: medRxiv preprint than one-month, male sex, presence of lower respiratory tract infection signs and presence of a pre-existing medical condition.
Regarding the association between older age and outcomes, MC6 to 11 are exclusively composed of older adults and elderly with poor outcomes; except MC6 that includes 28.57% of young adults. However, overall survival cannot be explained only by age but along with the presence of comorbidities and habits: while MC11 shows the highest mortality and mean age, MC7 shows a similar RR with a mean age approximately ten years younger, similarly to those groups with better RRs. Besides, as widely described in literature 28 , older chronological age is not necessarily linked to higher mortality but physiological age. MC1 and 4 support this fact since, despite containing the same number of groups of each age, they show similar RRs (MC1 90.27%, MC4 82.81%) to those RRs of groups composed only of young adults with little incidence of previous disease (MC2 91.37%, MC3 95.22%) and those groups made of young adults with some frequent diseases, such as diabetes and hypertension (MC5 81.30%).
These findings support that, while a young age predisposes to mild disease 28,29 , habits and comorbidities may play a key role in predicting mortality in older patients with SARS-CoV-2 infection rather than considering only the chronological age itself. Noteworthy, the clustering for the individual age-gender groups with age >64 years, revealed that centenarians -individuals of over 100 years of age-repeatedly fell in the age-gender clusters with better outcomes. This fact conforms with the well-studied good health and low frailty scores 30 of this subpopulation. Therefore, age along with presence of habits and comorbidities are key factors to explain the dividing line between "high", "moderate" as well as the "low" RRs.

Habits
The role of obesity and smoking as risk factors for severe disease are complex, since they are both associated with the development of many conditions (e.g. COPD 31 or cardiovascular 32 ). In our study, the influence of obesity seems to be clear by comparing MC4 and 5: albeit both show diabetes, hypertension and moderate RRs (81-82%), they differentiate in that MC4 includes patients of all ages (25%) without obesity and MC5 contains mostly young adults (66.7%) who suffer from obesity. This seems to suggest that obese young adults may behave as "older", implying higher mortality 28,33 . However, we found just the opposite in young individuals without pre-existing comorbidities: MC1 and 3 have similar RRs albeit MC3 contains a significant number (59.27%) of obese patients or smokers.
These findings suggest the role of habits cannot be considered alone, but always along with age, comorbidities and duration of unhealthy habits. Our results found that smoking is risk factors for severe COPD and cardiovascular, primarily in older patients -MC8, 10, 11. Therefore, it results reasonable that the longer the time as a smoker, the greater the incidence of severe disease. In young patients, however, the evidence of smoking's negative influence is not so straightforward. Some reviews have presented current smoking as a protective factor versus former smoking, while it is clearly a risk factor versus never smoking 34 . Our results show that groups gathering young smokers have RRs which are not inferior to age-matched non-smoking groups, as proven by the RR of MC3 (95.22%, 34% smokers) versus MC2 (91.37%, 9.7% smokers).
Regarding obesity, its influence is not so clear in older groups since all have a high ratio in certain comorbidity. Still, in young obese patients without comorbidity (18-49M5 and 18-49F2), obesity seems unrelated to mortality. In conclusion, when evaluating habits, considering the patient's age and the unhealthy duration may help establish more useful prognostics and correlations.

Comorbidities
Diabetes and hypertension showed the highest prevalence among the recorded comorbidities. Their prevalence seems to explain the decrease in RRs from over 90% in MC1, 3 to 81% in MCs 4-5, all of which are young adult groups. In older MCs (6)(7)(8)(9)(10)(11) it results harder to evaluate independently since both diseases are present in nearly every group, not specifically characterizing any cluster except MC9 that represents older patients with both diseases simultaneously (>95%) alongside a low RR (66.43%). These accord with current literature that reported both diabetes and hypertension are independent risk factors for severe disease 28,35,36 .
Immunosuppressed patients fall mostly on MC6 -older adults with diabetes, hypertension, immunosuppression and other diseases. Noteworthy, immunosuppressed patients were not in the clusters with the lowest RRs. This conforms with some reports that described immunosuppression has not been confirmed as a relevant factor for disease severity, except for cancer patients 37,38 . MC6 also holds few CKD patients, a factor which has been studied as a key factor for disease progression 39,40 and it may be the cause for the immunosuppression in this group (Odds Ratio: 9.65 95%CI [9.05-10.28]) according to the prevalence of immunosuppression of CKD patients versus non-CKD patients.
MC7 is characterised by the high prevalence of CKD and other diseases. Here, RR decreases roughly 10% compared with other severe subgroups. We found CKD highly associated with mortality and a shorter survival length. This accords with a report that revealed CKD was the factor that best explains mortality 41 , implying CKD patient could be vulnerable.
MC8 is similar to 10 and 11 to some extent since they all have COPD, where MC 11 gathering the most COPD patients (>90%). Most COPD patients are elderly with comorbidities with poor outcomes which conforms to several reviews that reported COPD patients with an increased risk of severe pneumonia and poor outcomes when they develop COVID-19 42,43 .
Cardiovascular disease is homogeneously distributed among groups, particularly on MC7, 10 and 11. Nowadays, cardiovascular disease may be a double-edged factor, since it is a proven risk factor for COVID-19 severity, but some of the treatments used, such as ACE inhibitors, have also proved to protect against severe infections from SARS-CoV-2 44,45 .

State and Types of Clinical Institution
Reliable subphenotype characterization that reflects the geographical and healthcare settings from which they are ascertained is crucial 46 . To date, variability between Mexican states and TCIs regarding severity are rarely reported 47,48,49 , nor assessed for variability independently from age and gender. As an example, one state (e.g., Morelos) may show higher severity if it includes more elderly and male patients, but when we compare age-gender groups the results show that no severity difference exists in terms of probability within age-gender groups of the same age range.
The inter-state and TCI variability we found may be influenced by many factors such as the number and type -urban/rural-of population, sociocultural context, healthcare policy, the quantity of medical institutions and availability of resources, virus transmission level. Some states are more industrialized and more economical resources (e.g., Mexico City, Jalisco, the State of Mexico) than others (e.g., Oaxaca, Chiapas, Guerrero). The differences found between Mexico City and State of Mexico regarding healthy clusters distribution are hard to explain due to their proximity and similarities in the type of population and availability of medical resources.
One possible explanation for the differences in severity between social security institutions (IMSS and ISSSTE) and local public hospitals (SSA) is that SSA are administrated by the local states, and the resources among states often differ. This phenomenon could influence these institutions' quality and resources to attend their populations. Another supportive explanation is that when SSA receives severe patients and have insufficient medical resources, these patients can be transferred to the IMSS COVID-19 facilities. Consequently, this may saturate IMSS and deplete the limited resources due to an increasing number of patients, making the distribution of resources harder. These results conform with previous studies showing that the risk of death for an average patient attending IMSS and ISSSTE is twice the national average and 3 times higher relative to the private clinical institution 47 .
The complex correlation between severity and state/TCI implies a crucial socioeconomic and healthcare resource level inequality. Thus, both considering state and TCI combined with MCs and age-gender clusters may help lead to a better subphenotype characterization. All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 24, 2021. ; https://doi.org/10.1101/2021.02.21.21252132 doi: medRxiv preprint Limitations As a possible limitation, we excluded patients confirmed after September 30 to avoid possible analysis disturbance on survival outcomes, which impeded us using the most recent data whose epidemiological characteristics could have changed to some degree. Furthermore, the dataset did not include additional relevant information about the patients who were discharged, readmitted, vaccinated and neither the duration of comorbidities and unhealthy habits. Further studies with population-based data regarding subphenotypes characterization among discharged patients who received post-surveillance or were readmitted or vaccinated population are highly needed.
In summary, the analysis of COVID-19 subphenotypes from the proposed two-stage cluster analysis produced clinically coherent models with discriminative characterization and explainability over age and gender. The resultant eleven MCs provide bases for a deep understanding of the epidemiological and subphenotypic characterization of COVID-19 patients based on pre-existing comorbidities, habits, demographic characteristics, and on patient provenance and types of clinical institution, as well as revealing the correlations between the above characteristics to anticipate the possible clinical outcomes of each patient with a specific profile. These unbiased subphenotypes may help establish target groups for automated stratification or triage systems to support clinicians with the early triage prior to further tests and laboratory results, especially in those areas where such tests are not available, and prioritize vaccination among the general population as well as provide bases in planning personalized therapies or treatments. For example, a CKD patient could be classified into subgroups -MC6, 7-distinguished by pervasive differences in severity and comorbid patterns, and then compared with their inner age-gender groups whose profile coincide the most with our patient, enabling a personalized evaluation of the patient's prognostic outcomes and severity presentation.
Combining age-gender stratification and meta-clustering technique successfully revealed well-detailed informative findings and are potential for designing a novel data-driven model for the stratification of COVID-19 patients with the information available at pre-admission, or even to anticipate outcomes on the general population. Besides, the results shed light on robust conclusions about association and causality between subphenotypic presentation and clinical outcomes. Future study can explore the treatment and vaccination implications toward providing guidance on clinical triage and customize therapy, and also develop clinically robust subphenotype classification methodologies combined with the proposed twostage cluster analysis. As the concern paid to efficient triage and personalized treatment increases, we facilitate further replicability of the study and generalization to other countries data by making available our experiment codes.
All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Data collection and processing
We used the publicly available COVID-19 Open Data by the Mexican Government 50 . As of 2 November 2020, the dataset comprises a total of 2,414,882 cases including demographic, comorbidities, habits, and prognosis patient-level data, for both positive and non-positive cases.  We derived five outcome variables related with the prospective patient's severity. First, the patient outcome, as deceased or not, from the date of death record. Second, the survival days since the date of symptoms. Third, the days from presenting symptoms to hospital admission. Lastly, we categorized the overall survival at 15 and 30 days after presenting symptoms.
After an assessment of potential temporal biases using temporal variability statistical methods 51 , and considering not significant temporal changes, we decided keeping the data from all the period of the study. Table 3 shows the list of studied variables. The supplementary material, in sections 3 to 6, describes additional information on the data quality and variability analysis, the original dataset description, baseline characteristics of the COVID-19 patients alongside descriptive statistics in age-gender groups of the study sample, and pregnancy association in outcomes.
All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Meta-clustering methodology
We applied a two-stage subgroup discovery approach. In both stages, we used Ward's minimum variance method with Euclidean squared distance 52 to performed hierarchical clustering fed by a dimensionality reduction algorithm -Principal Component Analysis (PCA) 53 or Multiple Correspondence Analysis (MCA) 54 that took as input eleven variables including nine comorbidities -pneumonia, diabetes, COPD, asthma, INMUSUPR, hypertension, CKD, cardiovascular, and other diseases-alongside two unhealthy habits, namely obesity and smoking. In order to select the most representative PCA and MCA components to feed hierarchical clustering, we considered values with an eigenvalue higher than the average. Dimensionality reduction is known to help in the process of clustering by compressing information into a smaller number of variables, making unsupervised learning less prone to overfitting 55 , as well as to facilitate further visual analytics to prevent the potential ML black-box issue 56 .
In the first-stage, we applied individually hierarchical clustering analyses taking as input the MCA scores fed by comorbidities and habits at stratified groups according to gender and age (<18, 18-49, 50-64, and >64 years) to reduce potential biases and confounding factors, since age and gender are highly correlated with comorbidity, habits and mortality. Afterwards, we applied PCA and locally estimated scatterplot smoothing 57 (LOESS) model on the resultant age-gender clusters' features to visually explain their correlations and severity relationships. We created the cluster heatmap to help understand the characteristics of each age-gender cluster.
In the second-stage, in a wider perspective description of the population, we performed a hierarchical clustering again fed by PCA scores obtained via the resultant age-gender clusters taking as input their comorbidities and habits ratios. Then, we quantified the features of the resultant meta-clusters (MCs) representing via a table and also summarized these quantified features into a qualitative table to help interpret the main features of the resultant MCs.
For each subgroup analysis, we implemented cluster analyses from 2 through 12 clusters. The proper number of subgroups were obtained by combining a quantitative approach using Silhouette Coefficient 58 -which measures the tightness and separation of the objects within clusters, reflecting how similar an object is to its own cluster compared to other clusters-and a qualitative cluster analysis audited by the authors of this work, including medical, health informatics and ML experts from Spain and Mexico. We first selected the group of clusters that showed relatively better Silhouette Coefficient values, then adjusted the number for the most reasonable and clinically distinguishable groups regarding clinical phenotypes. This process was supported by the pipelines and exploratory tool we developed in previous work 59 .
Finally, we performed source variability assessment 60 using heatmaps to analyze the severity tendency among different data sources based on the clusters' probability distribution between Mexican states and several types of clinical institution (TCIs) where patients received medical attention.
The data processing and analyses were performed using RStudio (version 3.6) and Python (version 3.8). Temporal and source variability -data quality analyses-were performed using the EHRtemporalVariability 51 and EHRsourceVariability 60,61,62 packages. Figure 5 summarizes the full methodology. The methods developed in this work are available in our GitHub repository https://github.com/bdslab-upv/covid19-metaclustering. All rights reserved. No reuse allowed without permission.
(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Data availability
The data of epidemiological and clinical patient-level open-source database in Mexico is publicly available at https://www.gob.mx/salud/documentos/datos-abiertos-152127 in Spanish. The English version and the studied sample of this dataset are available in our GitHub Repository https://github.com/bdslabupv/covid19-metaclustering.

Code availability
The codes that support the findings of this study are available at: https://github.com/bdslabupv/covid19-metaclustering.The results from 2 through 12 clusters for both gender and age subgroups are available at http://covid19sdetool.upv.es/?tab=mexicoGov.