Evolution of Clinical Phenotypes of COVID-19 Patients During Intensive Care Treatment: An Unsupervised Machine Learning Analysis

Background Identification of clinical phenotypes in critically ill COVID-19 patients could improve understanding of the disease heterogeneity and enable prognostic and predictive enrichment. However, previous attempts did not take into account temporal dynamics with high granularity. By including the dimension of time, we aim to gain further insights into the heterogeneity of COVID-19. Methods We used granular data from 3202 adult COVID patients in the Dutch Data Warehouse that were admitted to one of 25 Dutch ICUs between February 2020 and March 2021. Parameters including demographics, clinical observations, medications, laboratory values, vital signs, and data from life support devices were selected. Twenty-one datasets were created that each covered 24 h of ICU data for each day of ICU treatment. Clinical phenotypes in each dataset were identified by performing cluster analyses. Both evolution of the clinical phenotypes over time and patient allocation to these clusters over time were tracked. Results The final patient cohort consisted of 2438 COVID-19 patients with a ICU mortality outcome. Forty-one parameters were chosen for cluster analysis. On admission, both a mild and a severe clinical phenotype were found. After day 4, the severe phenotype split into an intermediate and a severe phenotype for 11 consecutive days. Heterogeneity between phenotypes appears to be driven by inflammation and dead space ventilation. During the 21-day period, only 8.2% and 4.6% of patients in the initial mild and severe clusters remained assigned to the same phenotype respectively. The clinical phenotype half-life was between 5 and 6 days for the mild and severe phenotypes, and about 3 days for the medium severe phenotype. Conclusions Patients typically do not remain in the same cluster throughout intensive care treatment. This may have important implications for prognostic or predictive enrichment. Prominent dissimilarities between clinical phenotypes are predominantly driven by inflammation and dead space ventilation.


Background
The world continues to suffer from the ongoing coronavirus disease 2019 (COVID-19) pandemic. COVID-19 is one of the deadliest pandemics in history with more than 360 million confirmed cases as of January 2022, and 5.6 million deaths. 1 COVID-19 and measures to control it cause significant social and economic disruptions and put unprecedented pressure on healthcare systems worldwide.
Despite major scientific effort, many aspects of COVID-19 remain poorly understood. This includes the identification of distinct clinical phenotypes in critically ill COVID-19 patients. The identification of clinical phenotypes could improve the understanding of disease heterogeneity and the response to treatment. Furthermore, it may enable prognostic enrichment, facilitating better outcome estimation for patient selection in trials and for shared decision making, like limitations of treatment. Finally, it could also provide predictive enrichment, which is the selection of patients who are more likely to respond to a given therapy on the basis of a biological mechanism. Predictive enrichment could thus identify potential targets for more personalized COVID-19 treatment throughout the course of critical illness, as previously shown for sepsis. 2 Previous attempts to identify clinical COVID-19 phenotypes using unsupervised machine learning techniques such as clustering analyses have revealed important signals of heterogeneity within the disease by identifying distinct clusters. [3][4][5][6][7] However, these analyses only considered data that is available on admission and up to 48 h thereafter. This static approach ignores the dynamics of the disease and the response to therapy. Thus, it is currently unknown whether clinical phenotypes in COVID-19 patients are stable and whether or not this could impact predictive and prognostic enrichment.
The Dutch ICU Data Warehouse (DDW) contains millions of data points on thousands of critically ill COVID-19 patients for the full duration of their treatment at one of 25 intensive care units in the Netherlands. 8 These data may be leveraged to gain new insights into the heterogeneity of COVID-19 by adding the dimension of time to the search for clinical phenotypes.
We hypothesized that this approach could reveal insights into the evolution of clinical phenotypes (ie how they change over time) and patient allocation to these clinical phenotypes over time. We therefore used the DDW to study this dynamic process, aiming to identify deterioration or recovery as well as potentially modifiable factors in critically ill COVID-19 patients throughout the course of their intensive care treatment.

Methods
All data was retrieved from the Dutch ICU Data Warehouse, a large, multi-center and full admission electronic health record data dataset. The DDW contains high-granular data on 3202 critically ill COVID-19 patients admitted to one of 25 Dutch ICUs between February 2020 and March 2021. All data are pseudonymized. The institutional review board of Amsterdam UMC, location VUmc waived the need for informed consent from individual patients and approved of an opt-out procedure.

Patients and Parameters
All patients in the DDW were eligible for inclusion if ICU mortality outcome data were available. Transferred patients were included if the transfer destination data were available. Relevant parameters were selected by a team of intensivists aiming to include a balanced representation of all organ systems. Among the clinically relevant parameters, only those for which at least 50% of the patients had a stored value for this parameter within the first 24 h of ICU admission were included. Determination of this threshold was done on the basis of the trade-off between the number of available parameters for selection, and the risk of introducing a significant bias in the dataset by imputation.

Data Preprocessing
The preprocessing and clustering procedure is visualized in Figure 1. All analyses were carried out in Python 3.8 (Python software foundation). To investigate the evolution of clinical phenotypes and patient allocation to those over time, multiple cluster analyses were performed for different time points during ICU treatment. To this end, consecutive data sets were created that each covered the data of the next 24 h, ie the first data set covered 0-24 h, the second data set 24-48 h, etc The selected number of days was set to 21, ensuring that a reasonable number of patients remained available for cluster analysis in the last data set (480-504 h after admission). Having a reasonable number of patients was necessary to ensure the validity of our results, as we performed several data aggregations, and too few patients result in unstable clustering. Therefore, the trade-off between time and patients was set at three weeks. Temporal data were aggregated as the mean or median value (as appropriate) for each patient and each parameter, for every data set. Medication data were transformed to binary parameters indicating whether the medication was administered during the 24-h period. For each parameter presenting with missing values, the Multiple Imputation by Chained Equations (MICE) procedure using the miceforest package was deployed to replace the missing data with substituted values. Default settings using lightgbm were followed and the amount of iterations was set to 15, based on the average percentage of missing values, as recommended by Bodner 9 and White, Royston, and Wood. 10

Clustering
The k-medoids clustering algorithm with Partitioning Around Medoids (PAM) 11 was used to cluster the datasets. Distance matrices as input for the clustering algorithm were created separately for each data set with the Gower's Distance metric. 12 This metric was chosen because of its ability to account for both numeric and non-numeric data. To determine the amount of clusters k for each data set, Consensus Clustering (CC) was performed with 500 iterations on samples with 80% data retention for each value of k between 2 and 7, as recommended by Monti et al. 13 These numbers are based on previous research, which has shown that similar patient cohorts are best divided into either two, three, or four clusters. [3][4][5] Both colorcoded heatmaps and cumulative distribution function (CDF) plots were used to visually identify the ideal amount of clusters k for each of the 21 data sets, ie the number of days included. To further support the choice of k, Silhouette Scores were calculated separately for each value of k. When the CC procedure did not yield a conclusive value, the Silhouette Score determined the number of clusters k.

Identification of Clinical Phenotypes
Within each identified cluster, parameters were aggregated by the mean or median as appropriate to facilitate identification and interpretation of clinical phenotypes. To determine whether clinical phenotypes differed significantly between datasets, a Kruskal-Wallis test was performed for continuous parameters, and a χ2 test for binary parameters. A Fisher's exact test was used when at least one of the expected frequencies for the binary parameters was <5. 14 The significance level was set at 5%.
Cluster labels assigned by clustering algorithms may vary per clustering batch, making it difficult to follow clusters over time. For example, a specific cluster could be assigned label 0 at day 1 and label 2 at day 2, since the assignment of cluster labels to the actual clusters is arbitrary in k-medoids. Therefore, the assigned cluster labels were re-ordered based on their similarity over time. For this, clusters adopted the cluster label from the preceding cluster with the lowest dissimilarity value calculated as the sum of the absolute differences between the aggregated parameters of two clusters. After re-ordering the cluster labels, a Sankey diagram was created that shows the evolution of clinical phenotypes and patient allocation to these clinical phenotypes over time.
Furthermore, to identify the evolution of a single patient in a cluster over time, individual patient cluster tracks were stored as lists of cluster assignments throughout their ICU stay, up to day 21 after admission. These tracks can be either homogeneous if consisting of the same cluster only, or heterogeneous if consisting of different clusters. Comparing homogeneous with heterogeneous clusters may indicate how stable a patient is within a cluster. Note that the length of each patient's cluster track is equal to the number of days the patient is admitted (with a maximum length of 21).

Results
The final dataset included 2438 out of 3203 critically ill COVID-19 patients since these patients had a registered ICU outcome. The remaining 765 omitted patients were still Figure 1. Flowchart of the methods used from collection up to analysis. Notice that the number of parameters is strongly reduced, that a data set is created and analyzed for each separate day, and that the number of clusters per day can differ.
admitted to the ICU at the time of data processing and were therefore not included in the analysis. The 2438 patients in the final dataset were admitted to the ICU during the first and second COVID-19 wave in the Netherlands. Figure 4 shows the number of patients that were admitted to the ICU per day during both waves, for those patients that were included in our analysis. For each patient in our analysis, an average of around 66.000 data points was available and mapped to 1331 different parameters. Of these, a selection of 135 parameters was considered clinically relevant by a team of intensivists. Subsequently, the parameters with more than 50% missing data were omitted. This resulted in a parameter set consisting of 41 parameters with, on average, 10.43% missing values per parameter (see Table A2). The parameters were categorized as either demographics, disease severity scores, cardiovascular, nephrology, respiratory, inflammatory, coagulation, or medication data (see Table A1). Cluster progression was analyzed until day 21 after ICU admission, leaving 569 patients for cluster analysis for the last day. Thus, 21 datasets with dimensions of 41 times the number of patients remaining in the ICU were available for analysis.

Baseline Characteristics
The baseline characteristics of included patients, stratified by ICU survival status, are given in Table B1 of appendix B. Survivors were significantly younger, had lower disease severity scores, fewer indicators of organ failure, lower levels of inflammation and coagulation markers. Of note, BMI was not significantly different between survivors and non-survivors.

Clinical Phenotypes on Admission
The first dataset was used to identify clinical phenotypes on admission. This data set included the aggregated data from the first 24 h of ICU treatment of 2438 patients. Consensus Clustering identified the ideal number of clinical phenotypes as 2. These clinical phenotypes are shown in Table 1.
Characteristics of clinical phenotype 0 relative to clinical phenotype 1a include lower disease severity scores, lower vasopressor use, higher respiratory system compliance, lower dead space indices, lower C-reactive protein levels and administration of anticoagulants, fluids and particularly muscle relaxants, opioids and sedatives but similar P/F ratios and a higher rate of administration of corticosteroids. Of note, the number of times tocilizumab was administered was very low in both clinical phenotypes and would not have been able to affect early CRP values, implying that the difference in CRP between clinical phenotypes is real. Mortality in clinical phenotype 0 was 21.5% versus 34.6% in clinical phenotype 1a. Figure 2 shows the evolution of clinical phenotypes and the number of allocated patients to each cluster from day 1 to day 21 of ICU treatment. Note that clusters 0, 1, and 2, correspond to phenotypes 0, 1a, and 1b, respectively. The maximum number of clusters per day was 3. An overview of characteristic of the identified clinical phenotypes at day 5 is shown in Table 2, whereas the characteristics at other time points can be found in appendix B. In addition to Figure 2, Table 3 shows the number of patients that were lost in our analysis due to discharge and death, for each day and for every cluster. Table 3 shows that the split after day 4, separating clinical phenotype 1 in clinical phenotype 1a and clinical phenotype 1b, separates a group of medium severity. In combination with Figure 2 and 3, this indicates that outcome in the worst severity group is less certain.

Evolution of Clinical Phenotypes Over Time
After day 4, clinical phenotype 1b is separated from clinical phenotype 1a. Characteristics of clinical phenotype 1b relative to clinical phenotype 1a include lower disease severity scores, higher respiratory system compliance, higher P/F ratios, higher urea-to-creatinine ratio, lower C reactive protein levels and less frequent administration of vasopressors, fluids and particularly muscle relaxants, opioids and sedatives but higher rates of antibiotics and corticosteroids administration. Patients within clinical phenotype 1b on day 5 have an estimated mortality of 23.9%, which remains similar until it starts to increase a few days before it merges again with clinical phenotype 1a after day 14. In many aspects, clinical phenotype 1b can be considered the moderate version of cluster 1, which is confirmed by the high rate of crossovers between clinical phenotype 1a and clinical phenotype 1b. Even though the mortality rate of clinical phenotype 1b is comparable to that of clinical phenotype 0, very few crossovers occur between these clinical phenotypes.
Clinical phenotype 0 has the lowest mortality rate, gradually declining from 21.5% at day 1 to 13.9% at day 21. Containing approximately one third of all patients at day 1, this clinical phenotype slowly shrinks in size. A relatively large number of patients in cluster 0 cross over to cluster 1 and vice versa during the first four days. After that, cluster 0 appears to be relatively stable with only small numbers of patients crossing over to a different cluster or vice versa.
Clinical phenotype 1a has the highest mortality rate, starting at 34.6% at day 1. After separation of clinical phenotype 1b at day 4, mortality in clinical phenotype 1a rises progressively to 55% between day 10 and day 13. After day 13, mortality gradually declines again to around 40% at day 21. Clinical phenotype 1a remains the largest clinical phenotype for most of the 21-day period.

Discussion
This is the first study to identify the evolution of clinical phenotypes as well as patient allocation to these clinical phenotypes over time throughout the course of intensive care treatment.
Our main findings are that in our cohort of critically ill COVID-19 patients, the identified clinical phenotypes do not remain stable but rather evolve over time, and that patients typically do not remain in the same cluster but rather cross back and forward between clusters over time. This indicates an improvement or decline throughout their intensive care treatment, which is not reflected in the static disease assessment scores taken on admission. Furthermore, this indicates that patients with an initial bad severity score can still have good outcomes. It would be interesting for future research to evaluate if a prediction model can be trained that is able to sequentially assess cluster switching. This novel finding of clinical phenotype instability may be best understood by introducing the concept of clinical phenotype half-life. This is defined as the time that an identified clinical phenotype has lost half of its allocated patients. Figure 3 reveals the clinical phenotype half-life to be between 5 and 6 days for clinical phenotypes 0 and 1a respectively, and about 3 days for clinical phenotype 1b. For our cohort, for example, it may be argued that clinical phenotype-based prognostication changes for about half of the patients within one week. Similarly, if a predictive enrichment strategy would lead to the enrollment of patients of a specific clinical phenotype identified on admission based on presumed benefit in that cluster, this presumed benefit would not last in over about half of these patients after one week as they have crossed over to a different cluster. Short to very short clinical phenotype half-life may have important implications for the use of clinical phenotyping on admission for prognostic or predictive enrichment.
In our study, up to three distinct clinical phenotypes were identified by cluster analyses, each with different mortality rates and characteristics. Only two clinical phenotypes could be identified at the time of ICU admission, with one that splits into two separate clinical phenotypes over time. At first sight, the separation in two clinical phenotypes on ICU admission appears to be largely driven by the need for invasive mechanical ventilation and hence explain the large differences in opioid administration and muscle relaxants. While it is reassuring that the algorithm identifies this heterogeneity, separation based on type of respiratory support would obviously be of limited clinical utility as it is easily observed at the bedside. Clinical phenotype 1a may be characterized as    hyperinflammatory and has higher levels of dead space ventilation, possibly indicative of undiagnosed pulmonary microclots or pulmonary embolism. These findings coincide with less frequent administration of corticosteroids and anticoagulants. The further separation of clinical phenotype 1b from clinical phenotype 1a reveals similar insights. Again, the heterogeneity appears to be strongly driven by inflammation and dead space ventilation, also with coinciding findings of less frequent administration of corticosteroids and anticoagulants. Taken together, these findings may confirm the previously underestimated role of coagulation and specifically undiagnosed pulmonary microclots or pulmonary embolism as well as the beneficial effects of steroids, and suggests that suppression of inflammation may also be relevant at later stages of the disease. These evaluations of different clusters may help in future research to identify the reasons why patients switch between clusters and set up studies which seek out how to obtain the optimal outcome in patients. It is difficult to compare our results to other clustering efforts given the different patient cohorts that were considered, ie hospitalized versus admitted to intensive care. 3,4 However, Rodríguez et al 5 only included intensive care patients for clustering and also found three distinct COVID-19 clinical phenotypes: a mild one, a moderate one, and a severe one. Their mild clinical phenotype has a crude mortality of 20.3%, which seems to correspond reasonably well to our clinical phenotype 0 in terms of mortality rate and some clinical characteristics, but the moderate and severe ones do not. However, these results are again difficult to compare with our findings due to different parameter sets to describe the clinical phenotypes, for example the inclusion of comorbidities and coexisting conditions and the use of different laboratory findings and treatments.
Bos et al performed cluster analysis on a similar patient cohort to identify respiratory subphenotypes of COVID-19 ARDS. 15 Their analysis included longitudinal clusters that incorporated data points during the first four days of admission which they found to discriminate and prognosticate better than only considering data on admission. This confirms the importance of including a temporal dimension in clinical phenotyping. However, our approach is different as we did not exclusively focus on respiratory data and determined clinical phenotypes for each single day of admission instead of grouping data over multiple days. By phenotyping beyond day four, our approach indeed revealed the emergence of a clinical phenotype beyond the first four days of admission.
Our study comes with limitations. First, this study used 41 parameters to describe the clinical phenotypes, which is relatively high and may impair the understanding of the relationship of the variables to the outcome. 16 Second, several parameters had a relatively high proportion of missing values (>30%) that were consequently imputed. Imputing missing data is undoubtedly important, but also difficult and complex. There is no gold standard for comparison which hampers the quality evaluation of the imputations. For these reasons, our clinical phenotypes should be interpreted with caution. Third, all of the 21 data sets that were created covered a 24-h period. The length of this period decides the granularity of the clinical   For example, after seven days, around 40% of the patients that started in cluster 0 have a homogeneous track up to day seven, ie they had never been in another cluster during these seven days. Please note that deceased or discharged patients are removed from the calculation, sometimes leading to an increase in the percentage within a cluster.
analysis we included patients that were admitted to the ICU between February 2020 and March 2021. Due to the length of this period, our results may have been influenced by different COVID-19 variants and implementation of new treatment strategies. We were not able to verify this since COVID-19 subtypes were not determined in the clinical setting and were thus not available for analysis. In addition, symptom duration and onset were not recorded and thus not taken into account, while this may provide valuable insight. Furthermore, vaccines were made available on the sixth of January 2021, prioritizing high risk patients and healthcare workers. We expect vaccines to have had minimal impact on our analyses, given the low admission rates for January, February and March 2021. However, we could not validate this assumption since vaccination status was not systematically recorded and therefore not available in our datasets. Lastly, this study is limited by the lack of a validation cohort. However, this is unavoidable due to the uniqueness of the data set used for this study.

Conclusions
Clinical phenotypes in critically ill COVID-19 patients do not remain stable and patients typically do not remain in the same cluster throughout intensive care treatment. Short to very short clinical phenotype half-life may have important implications for the use of clinical phenotyping on admission for prognostic or predictive enrichment. The heterogeneity between the identified clinical phenotypes seems to be strongly driven by markers of inflammation and the level of dead space ventilation.

Authors' Contributions
SS drafted the manuscript. SS, FB, PE, TD and LF were involved in data processing, analytics and interpretation. All authors contributed to data collection and critically reviewed the manuscript. All authors have full access to the data. Final approval of the article was given by all authors.

Ethics Approval and Consent to Participate
The Medical Ethics Committee at Amsterdam UMC, location VUmc waived the need for patient informed consent and approved of an opt-out procedure for the collection of COVID-19 patient data during the COVID-19 crisis.

Consent for Publication
Not applicable.

Availability of Data and Materials
The Dutch Data Warehouse is available for global collaboration, within the restrictions imposed by privacy laws and ethics. Information on how to access the data warehouse may be found at www.amsterdammedicaldatascience.nl.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.