Cluster Analyses From the Real-World NOVELTY Study: Six Clusters Across the Asthma-COPD Spectrum



INTRODUCTION
Asthma and chronic obstructive pulmonary disease (COPD) are the 2 most prevalent obstructive lung diseases, contributing a substantial burden to patients and to health care systems globally. 14][5] Between 4% and 66% of patients with obstructive lung disease have features associated with both asthma and COPD or have been given both diagnoses. 6,7][10][11] Most research investigating underlying mechanisms is also restricted to patients with a diagnosis of asthma or COPD. However, to provide a precision-medicine, biology-led approach to treatment, 12 it is important to identify distinct disease subtypes (phenotypes) or precise underlying molecular mechanisms that are associated with distinct outcomes or treatment responses (endotypes).
Clustering analyses are hypothesis-generating methods that typically apply complex mathematical modeling to group patients by a range of variables. Clustering methodology varies, particularly with regard to the process of selecting the clinical variables for inclusion in the analysis. Variable selection may be "facilitated" by expert input or "nonfacilitated" and data-driven. 13,14 The overall aim of clustering is to minimize variance between patients included in each group while maximizing differences between groups. 15
The NOVEL observational longiTudinal studY (NOVELTY) is a large, global, prospective, observational study of patients from primary or specialist care with a physician-assigned diagnosis or a suspected diagnosis of asthma and/or COPD. 16,17 The primary objectives of NOVELTY are to describe patient characteristics, treatment patterns, and burden of illness and to identify the clinical phenotypes and molecular endotypes (on the basis of biomarkers and/or clinical parameters) that are associated with differential outcomes for symptom burden, clinical evolution, and health care utilization over time. Here, as an initial step, we conducted prespecified and exploratory cross-sectional cluster analyses, using nonfacilitated and facilitated approaches to variable selection, in patients with physician-assigned diagnoses of asthma and/or COPD in NOVELTY. The aim was to investigate whether these clusters would identify distinct patient groups that might indicate specific underlying pathways or suggest future treatment targets, independent of conventional diagnostic labels.

Study design
The NOVELTY study design has been reported previously. 16 NOVELTY (NCT02760329) is a global, prospective, observational 3-year study that enrolled more than 11,000 patients 12 years or older with a physician-assigned diagnosis or a suspected diagnosis of asthma, COPD, or both (asthma+COPD) from primary and specialist clinical practices in 19 countries. Patients were excluded only if their primary respiratory diagnosis was not asthma or COPD, if they had participated in a respiratory interventional trial during the previous 12 months, or if, in the physician's opinion, they were unlikely to complete 3 years of follow-up. To avoid the high selectivity of regulatory clinical trials [8][9][10][11] and allow generalizability to patients in clinical practice, the study included patients treated for asthma and/or COPD in primary or specialist care, without specifying any conventional diagnostic or severity criteria. To ensure sufficient patients with severe disease, enrollment was stratified by physician-assigned diagnosis and physician-assessed severity (mild, moderate, or severe). The study protocol was approved in each country by the relevant institutional review board, and all patients provided written informed consent.

Baseline NOVELTY data set
The data set for these cluster analyses comprised the baseline clinical, functional, and initial biomarker data for patients enrolled as of March 5, 2018, from 18 countries (data from China were excluded because of a change in data transfer regulations in May 2019). The data set included more than 170 demographic, clinical, and biomarker variables, but physician-assigned diagnosis, physician-assessed severity, and medications were excluded from the cluster analysis to avoid bias from preexisting clinical concepts. Details of the methods for selection of variables and clustering are summarized herein, with full details in this article's Online Repository at www.jaci-inpractice.org.

Input variable selection
Two approaches were taken to variable selection (see Table E1 in this article's Online Repository at www.jaci-inpractice.org). The first (approach A) was a nonfacilitated, hypothesis-free approach 13,14; the framework is shown in Figure E1, A, in this article's Online Repository at www.jaci-inpractice.org. All clinical variables were plotted using histograms and a list of mean-shift outliers was derived using studentized residuals at a Bonferroni-adjusted P value of less than .001. After excluding extreme outliers, the following criteria were applied to the variables in the baseline data set: less than 10% missing data; data either continuously distributed or, if binary, with minor frequency greater than 5%. All clinical variables excluded because of missing data were confirmed to be represented by a significantly correlated variable that was more complete. A total of 79 clinical variables satisfied these criteria. Simulated annealing and genetic algorithm methods were then applied to arrive at a low-correlation variable set to act as surrogate for the whole. 18 The final set of 13 input variables for cluster analysis included the following: age at onset of respiratory symptoms (years), basophil count, body mass index (BMI), dyspnea (modified Medical Research Council dyspnea grade), fractional exhaled nitric oxide (FENO), hemoglobin, hospital admission for exacerbation in the last 12 months, number of exacerbations in the last 12 months, prebronchodilator FEV1, smoking status, use of antibiotics for exacerbation in the last 12 months, white blood cell count, and years of education. Figure E2 (in this article's Online Repository at www.jaci-inpractice.org) shows Kendall's τ correlations within the data set tested after selection of these variables. Eigenvectors were derived to ensure orthogonality, and the Pearson dissimilarity matrix was input into the cluster analysis.
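The screening criteria stated for approach A (less than 10% missing data; continuously distributed or, if binary, a minor-category frequency greater than 5%) can be illustrated with a minimal sketch. The function name `passes_screening`, the use of `None` to mark missing values, the choice to compute the minor frequency among observed values, and the toy columns are all assumptions made for illustration; the study's own analyses were run in R and are not reproduced here.

```python
from collections import Counter


def passes_screening(values, max_missing=0.10, min_minor_freq=0.05):
    """Return True if a variable meets the stated inclusion criteria.

    values: list of observations, with None marking missing data (assumption).
    """
    n = len(values)
    observed = [v for v in values if v is not None]
    # Criterion 1: less than 10% missing data.
    if n == 0 or (n - len(observed)) / n >= max_missing:
        return False
    counts = Counter(observed)
    # Criterion 2: a binary variable needs a minor-category frequency above 5%
    # (computed here among observed values, which is an assumption).
    if len(counts) == 2:
        return min(counts.values()) / len(observed) > min_minor_freq
    # Simplification: anything with more than 2 distinct values is treated
    # as continuous; a constant column is rejected.
    return len(counts) > 2


# Hypothetical toy columns, not NOVELTY data.
toy = {
    "fev1": [float(i) for i in range(100)],                    # continuous, complete
    "rare_flag": [1] * 3 + [0] * 97,                           # binary, minor frequency 3%
    "sparse": [None] * 20 + [float(i) for i in range(80)],     # 20% missing
    "smoker": [1] * 30 + [0] * 70,                             # binary, minor frequency 30%
}
kept = [name for name, column in toy.items() if passes_screening(column)]
```

In this toy example only `fev1` and `smoker` are retained. The subsequent reduction to a low-correlation set by simulated annealing and genetic algorithms is not sketched here.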
In the second approach (approach B), selection of variables for clustering was guided by clinical relevance and incorporated a nonparametric approach to produce a dissimilarity matrix (unsupervised Random Forest). The framework is shown in Figure E1, B. Variables were prefiltered to a nonredundant set of candidate variables considered relevant to the analysis aims. The resulting 65 variables then underwent variable clustering analysis for mixed data, on the basis of principal-component analysis for continuous variables and factor analysis for nominal variables, using R package ClustOfVar 19; Figure E3 (in this article's Online Repository at www.jaci-inpractice.org) shows the cluster variable dendrogram. On the basis of clinical expert input (authors R.H. and H.M.), 1 variable was selected from each of the 27 variable clusters. The final set of 27 input variables for cluster analysis included the following: age at baseline (years), biomass fuels ever used for cooking/heating (including open fire burning wood, charcoal, coal, or other biomass fuel), blood eosinophil count (×10⁹/L), blood neutrophil count (×10⁹/L), BMI, chronic bronchitis, coronary heart disease or heart failure, current smoker, depression or anxiety or antidepressant use, exacerbations in the last 12 months (n), family history (asthma/COPD or allergies), FENO (parts per billion), hospital admissions for exacerbations in the last 12 months (n), hospital visits (emergency department or admissions) for other reasons in the last 12 months (n), lymphocyte count (×10¹²/L), nonallergic rhinitis/sinusitis, number of daily maintenance medications for asthma/COPD in the last 12 months, parental smoking (father only, mother only, both, neither/unknown), prebronchodilator FEV1 (% predicted), prebronchodilator forced vital capacity (% predicted), respiratory allergies (including allergic rhinitis, seasonal/perennial rhinitis/sinusitis, eye, or mold allergies), sex, St George's Respiratory Questionnaire activity score, St George's Respiratory Questionnaire symptoms score, smoking pack-years, time (years) since respiratory symptom onset or diagnosis (whichever was earlier), and visits to specialist in the last 12 months (n). Figure E4 (this article's Online Repository at www.jaci-inpractice.org) shows Kendall's τ correlations within the data set tested after selection of these variables. These variables were included in a Random Forest dissimilarity matrix, which was input into the cluster analysis.

Clustering
Cluster analysis was conducted using data for each of the aforementioned sets of input variables. In each case, participants were included if they had complete data for all included variables, because missing data for a trait has a different meaning from not having that trait. Each cluster analysis used partitioning around medoids, a more robust generalization of the k-means method. 20 The objective of using partitioning around medoids was to find a set of clusters in which the members were as similar as possible to one another, but as dissimilar as possible to members of other clusters. The method started by searching for k representative individuals (medoids), and the k clusters were constructed by assigning each individual to the nearest medoid. The Jaccard coefficient with 100 bootstrap samples was used to evaluate stability, as described by Hennig, 21 to assess the frequency with which pairs of participants were assigned to the same cluster across bootstraps. The clusters were numbered in an arbitrary manner. The characteristics of clusters obtained were described via summary statistics.
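As an illustration of the clustering step described above, the following is a minimal sketch of partitioning around medoids on a precomputed dissimilarity matrix, together with a simplified bootstrap stability check based on the Jaccard coefficient. It is not the study's implementation (the analyses were run in R): the function names `pam`, `jaccard`, and `bootstrap_stability` are hypothetical, the greedy build-and-swap search is a basic textbook variant, and the stability measure (best-match Jaccard similarity per original cluster, averaged over resamples) is a simplified reading of the bootstrap approach cited to Hennig.

```python
import random


def total_cost(dist, medoids):
    """Sum of each point's dissimilarity to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k):
    """Partition around medoids; returns (medoids, labels)."""
    n = len(dist)
    # BUILD: greedily add the medoid that lowers the total cost the most.
    medoids = []
    for _ in range(k):
        candidates = [c for c in range(n) if c not in medoids]
        medoids.append(min(candidates, key=lambda c: total_cost(dist, medoids + [c])))
    # SWAP: exchange a medoid for a non-medoid while the cost keeps falling.
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                if total_cost(dist, trial) < total_cost(dist, medoids):
                    medoids, improved = trial, True
    # Assign every individual to its nearest medoid.
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels


def jaccard(a, b):
    """Jaccard similarity between two clusters given as sets of member ids."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def bootstrap_stability(dist, k, n_boot=100, seed=0):
    """Mean best-match Jaccard similarity per original cluster across bootstraps."""
    rng = random.Random(seed)
    n = len(dist)
    _, labels = pam(dist, k)
    original = [{i for i in range(n) if labels[i] == c} for c in range(k)]
    sums, counts = [0.0] * k, [0] * k
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        sub = [[dist[a][b] for b in sample] for a in sample]
        _, boot_labels = pam(sub, k)
        boot = [{sample[p] for p in range(n) if boot_labels[p] == c} for c in range(k)]
        drawn = set(sample)
        for c, members in enumerate(original):
            present = members & drawn  # original members that were resampled
            if present:
                sums[c] += max(jaccard(present, b) for b in boot)
                counts[c] += 1
    return [s / m if m else float("nan") for s, m in zip(sums, counts)]
```

A stability value near 1 for a cluster means that the cluster is recovered almost unchanged in most resamples. The exhaustive swap search shown here is only practical for small toy inputs; established implementations use faster update rules.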
Prediction of cluster membership. For clusters identified with approach A, modeling was undertaken to identify predictors of cluster membership. Study patients were randomly allocated, stratifying by age, sex, and physician-assigned diagnostic label, to a training data set (70% of individuals) and a testing data set (the remaining 30% of individuals). Prediction analysis was undertaken using the maximally informative variables from input variable set A, excluding those with high loading in the cluster analysis. Gradient boosting was applied to determine the best predictors of the clusters, 22,23 and the 6 strongest variables were combined into a prediction model using the training data. Test performance was calculated for each cluster using the hold-out set to arrive at a series of binary schemes that could be subjected to receiver-operating characteristic analysis. All analyses were conducted using R 3.5.1. 24
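The evaluation scheme described above (a stratified 70%/30% split, then one binary "this cluster versus all others" receiver-operating characteristic analysis per cluster) can be sketched as follows. This is an illustrative outline only: the helper names `stratified_split` and `auc` are hypothetical, the strata keys are assumed to be precomputed tuples such as (age band, sex, diagnostic label), and the scores passed to `auc` are assumed to come from whatever classifier was fitted on the training set. The gradient boosting model itself is not reproduced.

```python
import random
from collections import defaultdict


def stratified_split(strata, train_frac=0.7, seed=0):
    """Split record indices into train/test sets within each stratum."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, key in enumerate(strata):
        groups[key].append(idx)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        cut = round(train_frac * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)


def auc(scores, is_member):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    scores: predicted score for membership of one cluster (higher = more likely).
    is_member: truthy where the patient actually belongs to that cluster.
    """
    pos = [s for s, y in zip(scores, is_member) if y]
    neg = [s for s, y in zip(scores, is_member) if not y]
    # Count member/non-member pairs ranked correctly; ties count one half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Applying `auc` once per cluster on the hold-out set yields one value per cluster, which is the form in which the area-under-the-curve results are reported in the article.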

Population characteristics
Data were available for 10,885 NOVELTY patients, with almost half (46.5%) enrolled from primary care, and of whom 5163 (47.4%) had biomarker data and 7608 (70.0%) had patient-reported outcome data. Baseline characteristics are reported in detail elsewhere. 17 Overall, 52.7% of patients had a physician-assigned diagnosis of asthma alone, 12.3% asthma+COPD, and 34.9% COPD alone. Overall, 31.2%, 34.9%, and 33.7% of patients had physician-assessed mild, moderate, and severe/very severe disease, respectively.
A total of 3796 patients (mean age, 59.5 years; 54% female) had complete data for all variables selected by approach A, and 2934 patients (mean age, 60.7 years; 53% female) had complete data for all variables selected by approach B; lack of consent for blood collection was the most common reason for exclusion. There were no clinically important differences between patients included in the cluster analyses and the whole NOVELTY population, except that there were fewer non-White patients, slightly more with type 2 (T2)-high biomarkers, and a higher proportion with 1 or more exacerbations in both cluster analyses, and there were fewer current smokers with approach B (see Table E2 in this article's Online Repository at www.jaci-inpractice.org).

Cluster distribution by physician-assigned diagnosis
Using variables from approach A, 6 clusters were identified (Figure 1, A). Overall, 67% and 75% of patients with a physician-assigned diagnosis of asthma alone appeared in clusters A1 to A3 and B1 to B3, respectively, and 89% and 87% of patients with a diagnosis of COPD alone appeared in clusters A4 to A6 and B4 to B6, respectively, but each of the physician-assigned diagnoses (asthma, asthma+COPD, and COPD) appeared in all the clusters (Figure 1; Tables E3 and E5). With approach A, 42% of patients with physician-assigned diagnoses of asthma+COPD were in the "asthma-like" clusters A1 and A2, whereas with approach B, only 18% of patients with physician-assigned diagnoses of asthma+COPD were in the "asthma-like" clusters of B1 and B2 (Figure 1; Table III).

Characteristics of clusters
Approach A clusters. As presented in Table I and Table E3, patients in clusters A1 and A2 were younger, with respiratory allergies, childhood onset of respiratory symptoms, well-preserved lung function, and little clinically important breathlessness (modified Medical Research Council dyspnea grade ≥2). These clusters had the highest proportion with bronchodilator responsiveness (19% and 20%, respectively). However, there were marked differences by sex (female: cluster A1, 74%; cluster A2, 31%). Cluster A3, which was predominantly female, also featured respiratory allergies, but patients were older at onset of respiratory symptoms and had more breathlessness than patients in clusters A1 and A2 despite less airflow limitation (postbronchodilator FEV1/forced vital capacity < lower limit of normal); almost one-quarter had anxiety or depression. Almost one-third of the participants in cluster A3 were non-White.
In clusters A4 to A6, patients were older, mostly current/ex-smokers, with less allergy history, greater airflow limitation, and more breathlessness. However, there were again marked differences by sex, with the proportion of female patients being 70%, 45%, and 24% in clusters A4, A5, and A6, respectively. Emphysema was most common in clusters A5 and A6 (32% and 30%, respectively). Cluster A6, which was predominantly male smokers/ex-smokers, had the lowest lung function and the highest prevalence of frequent productive cough, but a similar proportion had high T2 biomarkers (blood eosinophil count ≥0.3 × 10⁹/L or FENO ≥30 parts per billion) and bronchodilator responsiveness as in the asthma-predominant clusters A1 and A2. Across the 6 clusters, there was little variation in BMI, exacerbation rate, indoor exposure to biomass fuels, blood eosinophil count, or FENO.
In the prediction model, the strongest predictors of cluster membership were age, weight, childhood onset of respiratory symptoms, prebronchodilator FEV1, duration of dust/fume exposure, and number of daily medications. The area under the curve for the receiver-operating characteristic curves ranged from 0.87 (cluster A1) down to 0.71 (cluster A4; see Figure E7 in this article's Online Repository at www.jaci-inpractice.org).

Approach B clusters. As presented in Table II and Table E5, clusters B1 to B3 predominantly comprised female patients and clusters B4 to B6 predominantly comprised male patients. There was some variation by ethnicity, with North-East Asian participants mostly found in clusters B1 and B4. Three clusters (B3, B5, and B6) comprised patients with high levels of breathlessness (2 of which, clusters B3 and B6, also had higher exacerbation rates) and 3 clusters (B1, B2, and B4) comprised patients with fewer symptoms and exacerbations.
The clusters were further characterized as follows: cluster B1, respiratory allergies or high-T2 markers common, with normal lung function; cluster B2, high respiratory allergies or T2 markers, more airflow limitation, with frequent productive cough in a third of patients; cluster B3, breathlessness and frequent productive cough common, airflow limitation more severe, and both blood neutrophils and T2 markers high; cluster B4, less severe airflow limitation, but high proportion of current/ex-smokers; cluster B5, more severe airflow limitation, high proportion of current/ex-smokers, and symptoms of breathlessness common; cluster B6, more severe airflow limitation, highest proportion of current/ex-smokers, very high proportions with breathlessness and frequent productive cough, and high blood neutrophils.

DISCUSSION
In the large, global, real-life NOVELTY study of patients with physician-assigned diagnoses of asthma and/or COPD recruited from primary and specialist care, cluster analyses using 2 different approaches to variable selection each found 6 identifiable but overlapping clusters, reflecting overlap of clinical and biological characteristics among patients in the clusters. Each cluster also included all 3 diagnostic labels. With both approaches, about 60% of patients with physician-diagnosed asthma appeared in 2 clusters characterized by respiratory allergies and younger age, and about 90% of patients with physician-diagnosed COPD appeared in 3 clusters characterized by current/ex-smoking and lower lung function. However, differences were seen between clusters, particularly with the nonfacilitated approach A, in features such as sex, ethnicity, breathlessness, frequent productive cough, and blood cell counts, suggesting the existence of phenotypes that differ from conventional diagnostic characteristics for asthma and COPD.

Previous studies
Several previous studies have applied cluster analysis to patients with either asthma or COPD, with substantial variation in their populations, biosampling, and variable selection. For clustering among cohorts with mild, moderate, and/or severe asthma, the Severe Asthma Research Program used clinical variables, 26 with later inclusion of FENO, blood, and bronchoscopic variables for a subpopulation, 27 and Haldar et al 28 used clinical and sputum variables. Some studies selected variables on the basis of clinical feasibility, including a Swedish population study that used questionnaire data alone 29 and the Airways Disease Endotyping for Personalized Therapeutics (ADEPT) study (which selected symptom control, airway hyperresponsiveness, and blood eosinophils), 30 or on the basis of a specific hypothesis; for example, a severe asthma registry study investigated clustering of only blood eosinophils, FENO, and IgE. 31 In patients with COPD, the ECLIPSE study, 32 the COPDGene study, 33 and other studies 34 have explored clustering approaches using clinical and physiological variables, with some studies including comorbidities, inflammatory profiles, and/or imaging. All these studies identified clusters within asthma and COPD, but because they were limited to asthma or COPD, they had limited ability to identify patterns that may be shared across these diagnostic labels. In addition, their populations and variables were siloed, with the asthma studies lacking data on emphysema or chronic bronchitis and the COPD studies excluding young adults and lacking data on allergies. Very few studies have used clustering approaches in patients with asthma and/or COPD; they include a small study in patients with severe asthma or moderate to severe COPD, 35 and a study in patients aged 40 years or older from random population samples in New Zealand and China. 36 NOVELTY is unique in this regard because it includes a large and broad spectrum of patients with a physician-assigned asthma, COPD, or asthma+COPD diagnosis recruited globally in a real-life setting from both primary and specialist care, with the same diagnosis-agnostic variables collected for all patients.

Methodologic aspects
In this analysis, we used 2 methods to select variables from among more than 170 baseline clinical features and readily available biomarkers in the NOVELTY data set: approach A was a hypothesis-free approach starting with reduction of variables to a low-correlation set, whereas approach B was a facilitated approach with variable selection based on adaptive grouping of clinically related traits. As might be expected, the latter showed a clearer divide between physician-assigned labels of asthma and COPD across the clusters (Figure 1; Table III), but the fact that 2 different approaches identified similar and complementary clusters strengthens the analysis. With the clinically directed approach B, patients with asthma+COPD clustered more with COPD-only diagnoses, whereas with the nonfacilitated approach A, they clustered more with asthma-only diagnoses (Table III). This may have clinical importance, because multiple studies have shown that patients with diagnoses of both asthma and COPD are more likely to die or be hospitalized if they are treated as if for COPD with long-acting bronchodilators alone (without inhaled corticosteroids), whereas this is not found in patients with COPD alone. 37,38

Interpretation of results
Overall, each approach resulted in 3 clusters with features including younger age and respiratory allergies (with many patients having an asthma diagnosis) and 3 clusters with features including older age at onset of respiratory symptoms, current/ex-smoking, clinically important breathlessness, more emphysema, and often greater airflow limitation (with many patients having a diagnosis of COPD). However, the differences seen between clusters and between approaches in the distribution of several variables, including sex, age, age at onset, ethnicity, breathlessness, frequent productive cough, and T2 biomarkers, suggest phenotypes that may differ from conventional diagnostic characteristics. The marked differences by sex between clusters in approach A are of particular interest because sex was not included following hypothesis-free testing to select variables that captured most variation in the data. However, the variability caused by differences between sexes was captured by other selected variables: prebronchodilator FEV1, smoking status, and weight. Possible drivers of differences between sexes in obstructive lung disease range from risk factors, genetics, pathophysiology, and presentation of disease to treatment and response. 39
These cross-sectional cluster analyses were the first step toward the primary objective of NOVELTY, which is to identify clinical phenotypes and molecular endotypes on the basis of biomarkers and/or clinical parameters that are associated with differential outcomes for symptom burden, clinical evolution, and health care utilization over time. The fact that these initial cross-sectional clusters based on conventional clinical features and readily available biomarkers were overlapping (eg, as visualized in Figures E5 and E6) indicates overlap of clinical and biological characteristics between members in adjacent clusters. Clearly, therefore, these clusters do not reflect discrete underlying mechanisms, but ongoing analysis toward this aim will include identification of individual patient trajectories in these variables over 3 to 5 years of follow-up, and deep "omics" analysis of the stored biobank. The present findings suggest opportunities for precision medicine: for example, there are many potential underlying mechanisms for frequent productive cough (one of the most common traits found in NOVELTY 25), and high blood eosinophils or FENO may be either rapidly responsive to inhaled corticosteroids 40 or, in severe asthma, may be persistently elevated despite systemic corticosteroids. 41 The ultimate goal is precision medicine, based on identifying specific molecular mechanisms; this has been extremely successful in oncology but remains challenging for complex chronic diseases such as obstructive lung disease. In the meantime, the present findings emphasize the heterogeneity of asthma, asthma+COPD, and COPD, and the importance of avoiding a siloed, restrictive approach to research. For clinical practice, there is increasing interest in identifying specific treatable traits or modifiable risk factors and comorbidities that may direct specific treatment choices, independent of the diagnostic label, with a recent analysis describing the distribution and patterns of 30 treatable traits in the large NOVELTY population. 42
*The terms "Caucasian" and "African American" were used in the electronic case report form for recording patient ethnicity. †Allergic rhinitis/sinusitis (seasonal or perennial) or animal/mold allergy. ‡Biosamples for analysis of biomarkers were not collected from patients in Brazil (n = 202). §Blood EOS count ≥0.30 × 10⁹/L or FENO ≥30 ppb. ‖Identified from SGRQ by positive responses to both questions regarding cough and phlegm production on most or several days a week over the past 3 mo. 25

Strengths and potential limitations
Strengths of this analysis include NOVELTY's wide geographic coverage and large population of patients with asthma and/or COPD from both primary and specialist care.][10][11] Clinical, physiological, and biomarker variables were measured in a standardized manner, and the value of this approach was emphasized by the finding of respiratory allergies, emphysema, and frequent productive cough across the conventional diagnostic labels. Potential limitations are that there were few young patients in NOVELTY (18% were <45 years) and that patients with missing data for any of the included variables were excluded from the analyses. Reasons for missing data include lack of patient/regulator consent for blood sampling for biomarker analysis and failure of some patients to complete patient-reported outcome questionnaires; the resulting reduction in the number of patients included in the analyses may affect the representativeness of the findings. In addition, most patients in NOVELTY were White (74.7%), which limits the ability to generalize the findings to other ethnicities. However, country and ethnicity were included in the initial set of 79 variables considered for input into the cluster analysis for approach A, but neither was ranked among the strongest drivers of variation in the data. In addition (as seen in Table E2), there were no major systematic differences between patients included versus the whole NOVELTY population, except that there were fewer non-White patients, slightly more with T2-high biomarkers, and a higher proportion with 1 or more exacerbations in both cluster analyses, and there were fewer current smokers with approach B.

CONCLUSIONS
These analyses in the NOVELTY population, based on facilitated and nonfacilitated selection of variables, each identified 6 overlapping clusters that crossed the conventional diagnostic labels of asthma, COPD, or asthma+COPD and revealed several discriminatory features that differed from conventional diagnostic characteristics. The overlap between clusters, reflecting the overlapping clinical and biological characteristics of patients included in these clusters, suggests that they do not reflect discrete underlying mechanisms.

The approach A clusters (Figure 1, A; Table I; see also Table E3 in this article's Online Repository at www.jaci-inpractice.org) had high mathematical stability (Jaccard similarities, 0.89-0.94; see Table E4 in this article's Online Repository at www.jaci-inpractice.org), but adjacent clusters overlapped in the cluster visualization (see Figure E5 in this article's Online Repository at www.jaci-inpractice.org), reflecting overlapping clinical and biological characteristics among the patients in these clusters. Analysis based on variables from approach B also identified 6 clusters (Figure 1, B; Table II; see also Table E5 in this article's Online Repository at www.jaci-inpractice.org) with good mathematical stability (Jaccard similarities, 0.70-0.85; see Table E6 in this article's Online Repository at www.jaci-inpractice.org); again, clusters were overlapping (see Figure E6 in this article's Online Repository at www.jaci-inpractice.org), reflecting overlap of clinical and biological characteristics among the patients.

TABLE I. Distribution of key characteristics across approach A patient clusters (for full results, see Table E3)
BD, Bronchodilator; EOS, eosinophil; FVC, forced vital capacity; HRU, health care resource utilization; LLN, lower limit of normal; mMRC, modified Medical Research Council; ppb, parts per billion; PRO, patient-reported outcome; SGRQ, St George's Respiratory Questionnaire.

TABLE II. Distribution of selected variables across approach B patient clusters (for full results, see Table E5)
BD, Bronchodilator; EOS, eosinophil; FVC, forced vital capacity; HRU, health care resource utilization; LLN, lower limit of normal; mMRC, modified Medical Research Council; ppb, parts per billion; PRO, patient-reported outcome; SGRQ, St George's Respiratory Questionnaire.
*The terms "Caucasian" and "African American" were used in the electronic case report form for recording patient ethnicity. †Allergic rhinitis/sinusitis (seasonal or perennial) or animal/mold allergy. ‡Biosamples for analysis of biomarkers were not collected from patients in Brazil (n = 202). §Blood EOS count ≥0.30 × 10⁹/L or FENO ≥30 ppb.

TABLE III. Proportions of each diagnosis and category of severity, by cluster (row percentages)

Medical writing support, under the direction of the authors, was provided by Richard Knight (PhD), CMC Connect, a division of IPG Health Medical Communications, funded by AstraZeneca in accordance with Good Publication Practice 2022 guidelines.