Automated identification of patient subgroups: A case-study on mortality of COVID-19 patients admitted to the ICU

Background – Subgroup discovery (SGD) is the automated splitting of the data into complex subgroups. Various SGD methods have been applied to the medical domain, but none have been extensively evaluated. We assess the numerical and clinical quality of SGD methods. Method – We applied the improved Subgroup Set Discovery (SSD++), Patient Rule Induction Method (PRIM) and APRIORI – Subgroup Discovery (APRIORI-SD) algorithms to obtain patient subgroups on observational data of 14,548 COVID-19 patients admitted to 73 Dutch intensive care units. Hospital mortality was the clinical outcome. Numerical significance of the subgroups was assessed with information-theoretic measures. Clinical significance of the subgroups was assessed by comparing variable importance on population and subgroup levels and by expert evaluation. Results – The tested algorithms varied widely in the total number of discovered subgroups (5-62), the number of selected variables, and the predictive value of the subgroups. Qualitative assessment showed that the found subgroups make clinical sense. SSD++ found most subgroups (n = 62), which added predictive value and generally showed high potential for clinical use. APRIORI-SD and PRIM found fewer subgroups (n = 5 and 6), which did not add predictive value and were clinically less relevant. Conclusion – Automated SGD methods find clinical subgroups that are relevant when assessed quantitatively (yield added predictive value) and qualitatively (intensivists consider the subgroups significant). Different methods yield different subgroups with varying degrees of predictive performance and clinical quality. External validation is needed to generalize the results to other populations and future research should explore which algorithm performs best in other settings.


Introduction
In clinical research, subgroup analyses involve splitting all patients into subgroups, often as a means to make heterogeneous populations more homogeneous, or to answer specific questions about particular patient groups, types of intervention or types of study [1]. Such analyses can have drawbacks, namely (1) groups are defined manually by the researcher, resulting in potentially suboptimal groups, and (2) groups can be simple, i.e., based on a single variable (e.g., sex) and/or a single threshold (e.g., men versus women, or under versus above 67 years). These drawbacks can be resolved by subgroup discovery (SGD) methods that aim to discover patterns in the form of rules induced from labelled data [2]. In the context of clinical subgroup analysis, SGD means the automated splitting of the data into complex subgroups, i.e., based on multiple variables and/or multiple thresholds.
Various SGD methods exist, e.g., APRIORI-Subgroup Discovery (APRIORI-SD), CN2-Subgroup Discovery (CN2-SD), Diverse Subgroup Set Discovery (DSSD) and Patient Rule Induction Method (PRIM) [3], as well as the improved Subgroup Set Discovery (SSD++) [4]. These algorithms discover subgroups that are represented as combinations of constraints on the variables (e.g., age >= 25 and BMI < 19), which can also be interpreted as clinical rules. Typically, SGD methods differ from each other in two respects. The first is the type of subgroup search and selection: exhaustive methods look at all possible subgroups given the patient population, which requires large amounts of computation, whereas heuristic methods find subgroups faster and more efficiently but sacrifice optimality, accuracy, precision or completeness for speed (i.e., lower run time). The second is which quality measures are used for searching, e.g., unusualness, coverage, redundancy, and novelty, the latter also known as weighted relative accuracy (WRAcc). These same quality measures are also used to assess the numerical significance of found subgroups. The clinical significance of subgroups can also be assessed by determining whether the subgroups are clinically relevant. To do this, on the one hand, we can consider variable importance (i.e., how strongly the variables affect the overall risk prediction). On the other hand, clinicians could assess the subgroups (i.e., the rules describing the subgroups) to determine clinical relevance.
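To make the WRAcc quality measure concrete, the following minimal sketch computes it for a single rule. It assumes the commonly used textbook form, coverage multiplied by the difference between the outcome rate inside the subgroup and the outcome rate in the whole population; the exact definitions used in this study are those of Table S2 and [33] and may differ in detail. The function name and the toy cohort are hypothetical and are not study data.

```python
def wracc(covered, outcome):
    """Weighted relative accuracy of one subgroup (textbook form, assumed here).

    covered[i] is True if patient i satisfies the rule;
    outcome[i] is True if patient i had the target outcome (e.g., died).
    """
    n = len(outcome)
    if n == 0 or n != len(covered):
        raise ValueError("covered and outcome must be non-empty and equally long")
    n_covered = sum(covered)
    if n_covered == 0:
        return 0.0  # an empty subgroup carries no information
    coverage = n_covered / n                      # share of patients in the subgroup
    base_rate = sum(outcome) / n                  # outcome rate in the whole population
    confidence = sum(c and o for c, o in zip(covered, outcome)) / n_covered
    return coverage * (confidence - base_rate)


# Toy cohort of 10 hypothetical patients (not study data).
ages = [82, 45, 77, 63, 90, 58, 71, 39, 85, 66]
died = [True, False, True, False, True, False, False, False, True, False]

# Rule under evaluation: "age >= 75".
covered = [age >= 75 for age in ages]
score = wracc(covered, died)
print(round(score, 2))
```

A positive value indicates a subgroup with a higher outcome rate than the population, a negative value a lower one; larger subgroups are weighted more heavily.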
Multiple SGD methods have been applied to the medical domain [5][6][7][8][9][10][11], as well as specifically to the intensive care [12,13] but, to the best of our knowledge, not to COVID-19 patients. Furthermore, none of these studies extensively evaluated different SGD methods and the discovered subgroups. These studies either provide quality measures [5,7,11,12], report the predictive power of the discovered subgroups [6,13], or simply apply SGD methods to their problem and only qualitatively assess the results in terms of meaningful insight into their data and problem [8][9][10]. In contrast, our study assesses SGD methods and the discovered subgroups in terms of both quality measures and predictive power, and it also provides clinical validation.
This study proposes a new approach to systematically assess the numerical and clinical quality of automated patient subgroup discovery methods. Such an assessment informs clinicians who consider using SGD for complex subgroup analysis about which SGD method is most applicable. SGD can pave the way to personalized medicine, and our approach can ease the implementation of SGD in clinical decision support systems. As a second contribution, we provide a case study on the prediction of hospital mortality in a registry cohort of ICU-admitted COVID-19 patients. We identified subgroups that make clinical sense, and there is much potential in using these subgroups in an automated way, for example for flagging, or as clinical decision rules.
The paper is organized as follows: Section 2 introduces the patient population, the SGD methods used, and their evaluation; Section 3 presents the discovered SGD groups and the evaluation results; Section 4 discusses our results; Section 5 concludes the paper.

Related work
Various subgroup discovery methods, e.g., Refs. [5][6][7][8][9][10][11], have been applied to the medical domain. Recent studies use subgroup discovery to identify subgroups of cancer patients [14,15], identify predictive factors for diabetic ketoacidosis [16], or discover subgroups of patients undergoing transcatheter aortic valve implantation with high model prediction error and their distribution over the centres [17]. Gamberger et al. [11] demonstrated the applicability of SGD analysis in brain ischaemia. Abu-Hanna et al. [12] compared the established algorithms Classification and Regression Trees and PRIM in an SGD task on a large real-world high-dimensional ICU database. Nannings et al. [13] applied PRIM to identify very elderly ICU patients at high risk of mortality and compared the results with those of a conventional logistic regression model. SGD has been used to find disease markers from gene expression data [18]. Techniques other than SGD are also used for identifying patient subgroups, including clustering [19][20][21][22], latent profile analyses [23], or a combination of clustering and subgroup discovery [24]. Subgroup discovery was also used to assess personalized treatment effects in order to identify patient subgroups that react exceptionally badly or well to treatment [25]. Multi-omics Clustering Variational Autoencoders (MCluster-VAEs) were used to extract representations from multi-omics data to discover cancer subtypes [26]. Risk profiles for negative and positive COVID-19 hospitalized patients were identified through partition around medoids clustering [27]. However, none of these studies extensively considered the evaluation of different SGD methods and the discovered subgroups. We considered the subgroups defined by the discovered rules and compared these with individual variables (estimated coefficients of a linear regression) to assess their predictive performance and redundancy.

Data
This study used prospectively collected data on all patients with confirmed COVID-19 admitted to a Dutch ICU between February 21st, 2020 and May 24th, 2022, extracted from the Dutch National Intensive Care Evaluation (NICE) registry. The NICE dataset contains, amongst other items, demographic data, minimum and maximum values of laboratory and monitor data in the first 24 hours of ICU stays, diagnoses (reason for admission as well as comorbidities), information on ICU admissions, i.e., hospital length of stay before ICU admission and referring specialism, ICU as well as hospital length of stay, and ICU as well as in-hospital mortality data [28]. Data is collected in a standardized manner according to strict definitions, and stringent checks ensure high data quality [29]. The outcome variable was in-hospital mortality.
After variable selection (see Section 3.3), the used data consisted of about 60 variables. A total of 14,548 confirmed COVID-19 patients were included, of which 4000 patients (27.5%) died during their hospital stay. Survivors were significantly younger (59.5 vs 68.4 years old, p < 0.001), more often female (33.3% vs 27.3%, p < 0.001) and had a slightly higher body mass index (29.8 vs 28.9, p < 0.001) than non-survivors. Table 1 and Table S3 show the descriptive summary statistics of the patient population.

Patient inclusion
Patients were considered to have COVID-19 when the RT-PCR of their respiratory secretions was positive for SARS-CoV-2. Surgery patients were excluded as they are typically patients admitted with COVID-19 rather than patients admitted because of COVID-19.

Analyses
Preprocessing included the handling of missing data and variable selection. Missing values were imputed using multiple imputation by chained equations (MICE) [30]. Variables with only one unique value, or nearly so (a single value with >= 99% frequency), were excluded.
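The exclusion of (near-)constant variables can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: records are represented as plain dictionaries, the function name `drop_near_constant` and the variable names are hypothetical, and the imputation step (MICE) is not shown. The text does not state whether the filter was applied before or after imputation, so the sketch simply operates on whatever values it is given.

```python
from collections import Counter


def drop_near_constant(rows, threshold=0.99):
    """Return the names of variables whose most frequent value
    covers less than `threshold` of all records."""
    if not rows:
        return []
    kept = []
    for name in rows[0]:
        counts = Counter(row[name] for row in rows)
        top_share = counts.most_common(1)[0][1] / len(rows)
        if top_share < threshold:  # exclude variables at or above 99% frequency
            kept.append(name)
    return kept


# 200 hypothetical records (not study data).
rows = [
    {"age": 50 + i % 40, "covid_confirmed": 1, "rare_flag": int(i == 0)}
    for i in range(200)
]
kept = drop_near_constant(rows)
print(kept)
```

Here `covid_confirmed` is constant and `rare_flag` takes the value 0 in 99.5% of records, so only `age` is retained.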
Patient subgroups were obtained by application of selected heuristic (SSD++ [4], PRIM [31]) and exhaustive (APRIORI-SD [32]) algorithms. The algorithms were selected to form a diverse mix of algorithms based on association rules (APRIORI-SD), decision trees (PRIM) and inductive inference (SSD++). Per algorithm, we interpreted the subgroups independently of each other: a patient can belong to one or more subgroups, with membership defined by adherence to the subgroup's conditions, i.e., if all conditions fit the patient, the patient belongs to the subgroup.
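The membership definition above can be illustrated with a short sketch in which each subgroup is a conjunction of conditions and a patient may satisfy several subgroups at once. The rule encoding, the thresholds, the variable names and the helper `belongs` are hypothetical and chosen only for illustration; they are not rules discovered in this study.

```python
import operator

# Supported comparison operators for rule conditions.
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
    "!=": operator.ne,
}


def belongs(patient, rule):
    """A patient belongs to a subgroup if every condition of its rule holds."""
    return all(OPS[op](patient[var], value) for var, op, value in rule)


# Hypothetical rules: each is a list of (variable, operator, threshold).
subgroups = {
    "SG1": [("age", ">=", 75), ("comorbidities", ">=", 2)],
    "SG2": [("age", ">=", 75)],
    "SG3": [("bmi", "<", 19)],
}

patient = {"age": 81, "comorbidities": 3, "bmi": 24.0}

# One TRUE/FALSE indicator per subgroup; overlap between subgroups is allowed.
membership = {name: belongs(patient, rule) for name, rule in subgroups.items()}
print(membership)
```

The resulting indicators have the same form as the subgroup indicator variables that enter a regression model alongside the patient variables.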
Model optimization – The three subgroup algorithms use parameters to control the learning process, called hyperparameters. These parameters need to be set such that the algorithm performs optimally. We did this optimization as follows. For APRIORI-SD, we performed a grid search for the number of subgroups (5, 10, 25, 50, 75, 100) and the maximum selector depth (5-10). For PRIM, we performed a grid search for α (the degree of patience when looking for a sub-optimal solution) and β (the minimum size of the boxes found) with values 0.03, 0.04, 0.05, 0.06, 0.08, 0.1. For SSD++, we performed a grid search for the maximum selector depth (5-10) and beam, i.e., the pre-defined number of best partial solutions taken as candidates (25, 50, 100).
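The grids described above can be enumerated as in the sketch below. Several points are assumptions: the parameter names are illustrative and are not the argument names of pysubgroup, the PRIM implementation or SSDpp-numeric; α and β are assumed to share the same list of candidate values; and the scoring function is a placeholder, since the criterion used to pick the best setting is not stated here.

```python
from itertools import product

# Candidate values as listed in the text; parameter names are hypothetical.
GRIDS = {
    "APRIORI-SD": {
        "n_subgroups": [5, 10, 25, 50, 75, 100],
        "max_depth": [5, 6, 7, 8, 9, 10],
    },
    "PRIM": {
        "alpha": [0.03, 0.04, 0.05, 0.06, 0.08, 0.1],
        "beta": [0.03, 0.04, 0.05, 0.06, 0.08, 0.1],  # assumed same values as alpha
    },
    "SSD++": {
        "max_depth": [5, 6, 7, 8, 9, 10],
        "beam": [25, 50, 100],
    },
}


def expand(grid):
    """List every combination of hyperparameter values in a grid."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def grid_search(grid, score):
    """Return the setting with the highest score (score is user-supplied)."""
    return max(expand(grid), key=score)


def toy_score(params):
    """Placeholder criterion for illustration only; not the study's criterion."""
    return -abs(params["max_depth"] - 7) - abs(params["beam"] - 50)


sizes = {name: len(expand(grid)) for name, grid in GRIDS.items()}
best = grid_search(GRIDS["SSD++"], toy_score)
print(sizes, best)
```

Under these assumptions the grids contain 36, 36 and 18 settings, respectively, which keeps an exhaustive search over all combinations feasible.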
Numerical significance of the obtained subgroups was evaluated by means of (1a) information-theoretic quality in terms of coverage, support, rule length, significance, novelty (WRAcc), confidence and redundancy (see Table S2 and [33] for their definition as well as Appendix A for an example of these measures), and (1b) formal evaluation of the benefit of subgroups for prediction. For the latter, we inspected whether it pays to increase the complexity of a prediction model by including subgroup indicator variables in order to improve prediction of the outcome. To this end, a logistic regression model was created with a backward stepwise variable selection model based on the Akaike information criterion (AIC) with the patient variables plus the indicators of the discovered subgroups [34]. The subgroup indicator variables evaluate to TRUE if and only if a patient belongs to the particular subgroup. We then inspected whether subgroup indicators were selected by the selection process. Also, we statistically tested, at the p = 0.05 level, whether to reject the hypothesis that the subgroups are redundant with a log likelihood ratio (ANOVA) test. For each individual subgroup indicator, we compared a logistic regression model with only the patient variables with a model that also included the subgroup indicator. Additionally, we did an ANOVA test comparing models with patient variables without subgroups to models with patient variables and all subgroup indicators.
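The comparison of two nested models can be sketched as follows, assuming the maximized log-likelihoods of both logistic regression models are already available from a fitting routine (not shown). The function names and the numbers are hypothetical and are not study results. The closed-form p-value used here holds only for one added parameter, i.e., the per-indicator comparison; when all subgroup indicators are added at once, the degrees of freedom equal the number of added indicators and a general chi-square distribution function is needed instead.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def lr_test_one_df(loglik_base, loglik_extended):
    """Likelihood ratio test for one added parameter.

    Returns the test statistic and its p-value under a chi-square
    distribution with 1 degree of freedom, for which
    P(X > x) = erfc(sqrt(x / 2)).
    """
    stat = max(2.0 * (loglik_extended - loglik_base), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods (not study results):
# base model with 60 parameters, extended model with one subgroup indicator added.
stat, p = lr_test_one_df(-8000.0, -7996.5)
aic_base = aic(-8000.0, 60)
aic_extended = aic(-7996.5, 61)

# The indicator is not considered redundant if p < 0.05.
print(stat, round(p, 4), aic_base, aic_extended)
```

In this toy case the extended model also has the lower AIC, so a backward selection step based on AIC would retain the indicator.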
Clinical significance of the subgroups was evaluated by means of (2a) comparative analysis of the rule descriptions and a regression model, and (2b) expert opinion. For 2a, we informally compared the description of the obtained subgroups with the coefficients of a linear regression (LinR) model fit on hospital mortality (the dichotomous outcome was made continuous to provide more model flexibility). For the LinR model, we did backward stepwise variable selection, which was based on the Akaike information criterion (AIC). For 2b, we presented the found subgroups (i.e., the rules describing the subgroups) to two intensivists (DD, DdL) with over 20 years of clinical expertise and asked them to evaluate, independently of each other, the rules as fit or unfit for the specific purpose of use by intensivists for triage on ICU admission of COVID-19 patients. If a rule was considered unfit, an explanation for the evaluation was requested. The form used for evaluation is available in Appendix B.

Statistical analysis
All the analyses were performed using Python v3.6 and R version 3.5.1 x64 with publicly available software packages. Notably, our implementation of APRIORI-SD is based on pysubgroup (https://pysubgroup.readthedocs.io) [35], PRIM is based on a publicly available Python implementation of PRIM (https://github.com/martinsps/PRIM), and SSD++ is based on SSDpp-numeric (https://github.com/HMProenca/SSDpp-numeric). For the reporting of this study, we followed the TRIPOD statement (https://www.equator-network.org/reporting-guidelines/tripod-statement/).

Results
Table S4 describes the subgroups that were discovered with each of the SGD methods, and information-theoretic quality metrics are provided for each subgroup. The discovered subgroups vary largely between the three methods. Firstly, they differ in terms of the number of subgroups (APRIORI-SD: 5, PRIM: 6, SSD++: 62). Secondly, the subgroups themselves also differ. In PRIM and APRIORI-SD, subgroups mostly concern a small number of variables (age, haematological malignancy, chronic cardiovascular insufficiency, chronic respiratory insufficiency and cardiopulmonary resuscitation for APRIORI-SD; age, number of chronic comorbidities, lowest bicarbonate, referring specialism, origin of admission, gender, wave of infection, highest serum urea, lowest thrombocytes, lowest creatinine and lowest systolic blood pressure for PRIM). SSD++ discovered the most subgroups, with the highest variation in terms of variables per subgroup and total number of variables.
Table 2 shows the results of each method in terms of information-theoretic quality metrics. We observe large differences between the methods for coverage (average highest for APRIORI-SD, overall for SSD++) and significance (highest for APRIORI-SD, meaning that its groups have higher interest). The findings on the discovered subgroups (Table S4) are included in the metrics with the measured number of subgroups and their average length (i.e., number of variables).
Table 3 shows the predictive performance of the discovered subgroups in terms of (a) whether a subgroup was selected in the variable selection with stepwise regression (with and without clinical variables), and (b) log likelihood ratio tests. These results show that the majority of the subgroups survived backward selection (but only about half for PRIM and APRIORI-SD when using subgroups together with clinical variables), which is indicative of additional predictive value over the patient variables. The log likelihood ratio tests also show significant added predictive value of the subgroups discovered by PRIM and SSD++, but not of those discovered by APRIORI-SD.
Fig. 1 summarizes the clinical significance of the discovered subgroups. For each method, the agreement shows the number of subgroups (with respect to the number of discovered subgroups) on which the clinicians agreed on whether the group was clinically relevant or not. The fit outlines the number of subgroups which were considered clinically relevant with respect to the number of subgroups for which there was agreement. The average fit averages the number of subgroups judged clinically relevant by each clinician and shows it with respect to the number of discovered subgroups. For one subgroup identified by SSD++, one clinician was undecided and it was counted as not clinically relevant (unfit). Overall, the intensivists found the majority of subgroups (n = 66, 91%) fitting for triage on ICU admission of COVID-19 patients. APRIORI-SD resulted in 5 out of 5 fitting subgroups for both intensivists; SSD++ in the same 58 out of 62 for both intensivists; from the subgroups discovered with PRIM, only one subgroup was considered fit by both intensivists. For APRIORI-SD and SSD++, the agreement was high (5 and 59 same ratings, respectively); ratings on PRIM were less homogeneous, with agreement on only two subgroups.
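The three summary measures reported in Fig. 1 can be computed from two raters' fit/unfit judgements as in the sketch below. It follows the verbal definitions given above; the function name and the example ratings are hypothetical and do not reproduce the intensivists' actual ratings.

```python
def summarize_ratings(rater_a, rater_b):
    """Agreement, fit and average fit for one SGD method.

    rater_a[i] and rater_b[i] are True if the respective clinician
    judged subgroup i fit (clinically relevant), False otherwise.
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must rate the same non-empty set of subgroups")
    # Ratings of the subgroups on which both clinicians gave the same judgement.
    agreed = [a for a, b in zip(rater_a, rater_b) if a == b]
    n_agreed = len(agreed)
    return {
        # share of discovered subgroups with identical ratings
        "agreement": n_agreed / n,
        # share of agreed-upon subgroups that were judged fit
        "fit": sum(agreed) / n_agreed if n_agreed else None,
        # mean number of fit subgroups per clinician, relative to all subgroups
        "average_fit": (sum(rater_a) + sum(rater_b)) / (2 * n),
    }


# Hypothetical ratings for six subgroups (not the study's ratings).
rater_a = [True, True, False, True, False, True]
rater_b = [True, False, False, True, True, True]
summary = summarize_ratings(rater_a, rater_b)
print(summary)
```

An undecided rating would need to be mapped to False before calling the function, mirroring the choice to count it as unfit.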

Evaluation – clinical significance
When asked for an overall evaluation of the subgroups, both intensivists mentioned that it was interesting that SSD++ discovered not only subgroups with a very high probability of dying, but also subgroups with very low mortality probabilities. However, given that the SSD++ groups were relatively small, it was suggested that performance metrics of the groups be provided (we computed the performance metrics per group, but these were not shown to the clinicians during their evaluation). Rules in the form of "not equal to", which were common in PRIM, were considered unintuitive. The APRIORI-SD subgroups were not considered very distinctive since the length of stay is long, but the mortality is around 0.5. Concerning the used variables, the origin of admission, i.e., the location just before the ICU admission (home, emergency room, ward, other hospital, etc.), was considered vague and not very clinically meaningful. Also, the variable indicating the infection wave was considered not usable in practice since new patients cannot belong to past infection waves.

Discussion
In this study, we performed quantitative and qualitative analyses of patient subgroups that were discovered automatically and have the form of rules (conditions) on the patient features.
Table footnote: SG stands for subgroups. The number of subgroups for APRIORI-SD was pre-set as it is one of the model parameters. The best result for each measure is highlighted in bold.
Findings – For the quantitative analyses, we observed that the tested algorithms yield different results in terms of (i) the total number of discovered subgroups (ranging between 5 and 62), (ii) the number of selected variables (overall and per subgroup), and (iii) the predictive value of the subgroups. Concerning the qualitative assessment (by means of evaluation of the clinical relevance of the subgroups by intensivists), we make three overall observations. Firstly, the subgroups make clinical sense. Secondly, however, including the (past) infection waves does not make sense for the purpose of (future) triage. Lastly, although many (62) groups were discovered with the SSD++ algorithm, there is much potential use for these subgroups, either in an automated way, for example for flagging, or as clinical decision rules. As for the clinical utility of the subgroups, APRIORI-SD and PRIM are considered less effective because the subgroups do not have added predictive value and are deemed clinically less relevant. The APRIORI-SD subgroups in particular performed poorly: the mortality in each group (i.e., the number of non-survivors divided by the number of patients in the group) was about 0.5 (many patients, especially non-survivors, belonged to multiple subgroups), whereas SGD is supposed to find distinctive groups (with either high or low outcome probabilities), which means the algorithm proved not effective. Finally, SSD++ performed best from both the clinical and the numerical (predictive power and redundancy) perspective, although it was second best in terms of information-theoretic measures after APRIORI-SD.
Strengths – The conducted analysis was very extensive, evaluating many quantitative measures (in general on algorithmic performance) as well as qualitative aspects of found subgroups (by means of expert consultation with questionnaires). Furthermore, the predictive performance of the subgroups was assessed extensively (by evaluating the subgroups as patient features, log likelihood tests, and stepwise feature selection) and separately from the internal validation during model development. Such an extensive and systematic approach as we undertook facilitates the use of algorithms in clinical practice. The analysed use case of ICU triage of COVID-19 patients included real world data. In this study, this case was rather illustrative and an example in support of the study's aim of showing how to analyse and use SGD algorithms. However, since we used real world data, a follow-up clinical study for ICU COVID-19 triage with found subgroups can be readily undertaken (although it may depend on the virus variant, vaccination status and vaccine).
Table 3. Stepwise AIC backward regression model with subgroups (a) and ANOVA of LinR models with and without subgroups (b). The p-value is omitted when it corresponds to a non-statistically-significant result.
Limitations – Three main factors limit generalizing the results of this study. Firstly, the use of a single country dataset is limiting, mainly as to which subgroups were found for the specific prediction task. Secondly, the subgroups were evaluated by (only) two intensivists (albeit from different institutions). Establishing broader common ground on the subgroups (possibly revised after external validation) may require a larger evaluation panel. Finally, we evaluated three SGD algorithms that we considered representative as explained above, but the sheer number of algorithms could warrant a more extensive analysis including more algorithms, and possibly related algorithms like association rules and clustering/phenotyping.
Implications – We showed that SGD methods can potentially be used in clinical practice. Our in-depth evaluation, which included clinical validation of the discovered subgroups, showed that SGD allows clinicians to identify clinically relevant subgroups for COVID-19 patients. SGD methods can be implemented in clinical decision support systems and our methodology can be used to validate SGD methods, also in another setting and for other outcomes. Subgroups can be interpreted as rules, which can be implemented in a clinical decision support system to identify high-risk patients. For instance, a newly admitted patient can be mapped to a subgroup, so that the derived prognosis can be taken into account in treatment decisions and can also be discussed with the patient and the family.
There are several ways to use the found subgroups in clinical practice. Such use may range from an automated algorithm for flagging patients who may have low survival probabilities, to use of the rules describing the subgroups in triage protocols. For direct clinical application of the found subgroups, one may need to consider that the threshold levels used in subgroups are often extreme values (e.g., A-a gradient >450) and these may not occur often enough to justify inclusion in clinical practice. Concerning use for triage, the involved intensivists mentioned that thinking in subgroups or rules is the reverse of their usual way of thinking. For example, the intensivists ask which patients have a mortality of 80-100%, to which the answer is 80+ year old COVID-19 patients with >2 comorbidities, while subgroups are also defined on low risk of mortality. Notably, given the variables that were used in the rules, some subgroups do not seem to represent ICU patients that are considered typical. However, typical patients vary during a pandemic. ICU patients in the first wave might have been medium care or general ward patients in subsequent waves. Furthermore, age, creatinine and renal replacement therapy are known predictors of high mortality, but combining these variables with other variables to assess mortality remains difficult. Subgroup analyses can generate patient groups that are not considered important subgroups in clinical practice but nevertheless help in rethinking the influence of variables on the outcome and generate new hypotheses.
Our study has implications for researchers and practitioners. We demonstrated how to assess the numerical and clinical quality of SGD methods to help clinicians perform complex subgroup analysis and determine which SGD method is most applicable. SGD can pave the way to personalized medicine, as our approach can ease the implementation of SGD in clinical decision support systems. The fact that APRIORI-SD was best in our case study according to information-theoretic measures but not according to the other evaluations shows that our deeper evaluation results in a better choice of SGD method.

Conclusion
Automated patient subgroup discovery methods find clinical subgroups that are relevant both when assessed quantitatively (yield added predictive value) and qualitatively (intensivists consider the subgroups significant). Different methods yield different subgroups with varying degrees of predictive performance and clinical quality.
As future work, we propose to conduct further external validation studies to address the limitation that only one dataset was used. To establish broader common ground on the clinical relevance and validity of the subgroups by a larger evaluation panel, the qualitative analysis should be assessed in a broader Delphi study. Finally, several specific findings about the subgroups (e.g., non-typical ICU patients and particular variable interactions) need further follow-up. Future research is needed to explore which algorithm gives most benefit in other settings.

Other declarations
The investigators were independent from the funders; IV and MCS had full access to the data, have verified the data, and take responsibility for the integrity of the data and the accuracy of the data analysis; the lead author (the manuscript's guarantor) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained. DD, NdK and DdL are board members of the NICE foundation, which facilitates the data collection for this study.

Ethics approval and consent to participate
The study protocol was reviewed by the Medical Ethics Committee of the Amsterdam Medical Center, the Netherlands. This committee provided a waiver from formal approval (W20_273 # 20.308) and informed consent since this trial does not fall within the scope of the Dutch Medical Research (Human Subjects) Act.

Data and code availability
Data is available under stringent conditions as described on the NICE website https://www.stichting-nice.nl/extractieverzoek.jsp (in Dutch).
The code used for our analyses is publicly available at https://bitbucket.org/aumc-kik/subgroup-discovery/.

Declaration of competing interest
The authors declare that they have no conflicts of interest.