Classifying chronic pain using multidimensional pain-agnostic symptom assessments and clustering analysis



INTRODUCTION
Chronic pain is a global epidemic reflecting a health care crisis for the person suffering from it, their family, and society as a whole (1)(2)(3). More than 100 million individuals are affected by various chronic pain conditions in the United States alone, with medical expenses and lost productivity costing more than $635 billion annually and projected to become much worse (4)(5)(6). Primary chronic pain conditions present in various shapes and forms, commonly classified by anatomical location of experienced pain, from low-back pain and headaches to pelvic or bladder pain, including widespread nonspecific or overlapping pain (7,8). However, shared by all conditions is a global functional impairment that is manifested in the experience of multiple physical, mental, and social health symptoms, reflective of the biopsychosocial model of shared etiological factors across chronic pain conditions (7)(9)(10)(11)(12). While various studies aimed to uncover and classify subgroups of chronic pain (13)(14)(15)(16)(17)(18)(19)(20)(21)(22), little is known about whether a combination of domain-general symptoms agnostic to pain can be used to classify one's chronic pain condition and subsequently serve as potential markers for clinical diagnosis and prognosis (23,24). A symptom-based approach may also reveal potentially modifiable factors as targets for therapeutic interventions. Therefore, we suggest a reversal of the common practice; instead of assessing patient-reported symptoms as features of the a priori determined pain condition, we examined whether such symptoms may serve to classify the current pain condition and predict the future one. If confirmed, our approach could be used to support personalized and efficient treatment of individuals with chronic pain.
We implemented unsupervised machine learning, specifically agglomerative hierarchical clustering analysis (AHCA) (25)(26)(27), on multidimensional patient-reported symptoms that assess physical, mental, and social health status factors, to identify idiosyncratic groups, or clusters of patients with chronic pain. Patients reflected a real-world clinical population with a heterogeneous mix of pain conditions seeking treatment at a tertiary academic pain clinic. As part of their routine initial evaluation, they completed multidimensional patient-reported assessments using Stanford's Collaborative Health Outcomes Information Registry (CHOIR) registry-based learning health system (Fig. 1A) (28,29). We used nine symptoms for clustering based on the National Institutes of Health's (NIH) Patient-Reported Outcomes Measurement Information System (PROMIS), which was designed and validated for precise and efficient measurement of health-related symptoms in patients with a wide variety of chronic health conditions (30). These symptoms were agnostic to nine pain-specific measures that we subsequently used to validate the diagnostic-like nature of the data-driven clusters independently.
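As a rough illustration of the clustering step described above, the following minimal sketch applies agglomerative hierarchical clustering to synthetic nine-dimensional scores. Everything in it is an assumption made for illustration: the data are simulated T-score-like values, the function name `agglomerative` is hypothetical, and average linkage with Euclidean distance is used only because this section does not state which linkage or distance the analysis relied on.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, n_clusters):
    """Naive average-linkage agglomerative clustering (assumed linkage).

    Starts with every point as its own cluster and repeatedly merges the
    two clusters with the smallest average pairwise distance until
    n_clusters remain. Returns one integer label per point.
    """
    clusters = [[i] for i in range(len(points))]

    def avg_dist(c1, c2):
        total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = avg_dist(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Synthetic data only: three groups of 10 "patients", nine scores each,
# centered on hypothetical low, medium, and high severity levels.
random.seed(0)
centers = [40.0, 55.0, 70.0]
points = [[random.gauss(c, 1.5) for _ in range(9)]
          for c in centers for _ in range(10)]
labels = agglomerative(points, 3)
print(labels)
```

In practice a library routine (for example, hierarchical clustering in SciPy or MATLAB) would replace the naive loop; the sketch only shows the bottom-up merging logic.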
Mechanistically, we aimed to uncover whether the multivariate pattern of symptoms and pain-specific measures characterizing each identified cluster reflects a general graded scale of severity or a differential pattern. Furthermore, given the centrality and comorbidity of mental health-related factors with chronic pain, predominantly negative affect-related symptoms such as anxiety, depression, and anger (10,31,32), we expected these symptoms to be key drivers for the determination of cluster assignment, thus highlighting them as targets for treatment. We based cluster discovery on a training dataset of 11,448 patients and subsequently validated it in two additional datasets of 3817 and 1273 patients. The latter dataset included follow-up assessments, allowing us to examine whether cluster assignment at baseline would be predictive of pain-related measures at follow-up, thus providing potential prognostic-like validation of the identified clusters. Last, we examined the dynamics across assigned clusters between time points.

Demographic characteristics
Demographic characteristics of study participants are described in Table 1 (see also table S1).
Cluster2 was 0.36 SD and 1.05 SD, and Cluster3 was 1.20 SD and 1.54 SD, all worse than the norm in the clustering symptoms and pain-specific measures.
Although the pattern of severity also manifested in the number of self-reported body segments in pain, with Cluster3 indicative of potential widespread and/or overlapping chronic pain conditions, we found no significant associations between specific body regions (table S2 and fig. S1) and any of the clusters (χ² = 3.23, P = 0.99; Fig. 2S). Cluster1 was only descriptively associated with more pain in the front of the head (14.07%) compared to Cluster2 (9.61%) and Cluster3 (8.22%; χ² = 1.75, P = 0.41). Similarly, none of the demographic characteristics were significantly associated with any specific cluster (Table 1). We replicated the same pattern of results across clustering symptoms and pain-specific measures in the validation (table S3 and fig. S2) and the longitudinal datasets (table S4 and fig. S3), except for pain duration, for which we found no differences between the three clusters (P > 0.12). This supported the reliability and validity of the identified clusters. However, two critical questions arose that we address in the following two sections.

(C) The plot shows the gap statistic values for different k number of clusters, and a red dashed line indicates the optimal solution of k = 3 since it is the smallest value of k that is within one standard deviation of the value of k that maximizes the gap statistic. The error bars represent one standard error of the estimated gap statistic. (D) The plot shows the percent contribution to the overall separability between clusters of each of the nine clustering symptoms, in the order of most contributing (depression = 15.20%) to least (emotional support = 3.19%).
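The selection rule stated in the caption for panel (C) can be sketched in a few lines. This is an illustrative reading rather than the authors' code: the gap values and error terms below are made up, the function name `choose_k` is hypothetical, and the tolerance is taken to be the standard error at the maximizing k (the caption mentions both a standard deviation and a standard error).

```python
def choose_k(gap, se):
    """Return the smallest k whose gap statistic lies within one error
    unit of the maximal gap statistic (assumed reading of the rule).

    gap: dict mapping k -> estimated gap statistic
    se:  dict mapping k -> error of that estimate
    """
    k_max = max(gap, key=gap.get)          # k that maximizes the gap
    threshold = gap[k_max] - se[k_max]     # one error unit below the max
    return min(k for k in gap if gap[k] >= threshold)


# Hypothetical values, not taken from the study.
gap = {1: 0.50, 2: 0.62, 3: 0.80, 4: 0.81, 5: 0.79}
se = {k: 0.02 for k in gap}
best_k = choose_k(gap, se)
print(best_k)
```

With these made-up numbers the maximum sits at k = 4, but k = 3 is already within the tolerance, so the more parsimonious solution is preferred.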

Can we generate idiosyncratic groups of patients using only pain intensity?
The clustering solution identified three clusters portraying a graded scale of severity across all clustering symptoms and all pain-specific measures. Pain intensity is a common measure for assessing one aspect of the severity of pain and thus can be used to examine whether one such variable can similarly obtain a solution that generates idiosyncratic groups of patients. Although AHCA is commonly applied on multiple measurements, it technically requires one variable at minimum, and thus, we applied AHCA on the training dataset using only pain intensity. The dendrogram reflecting the results of the AHCA is shown in Fig. 3A. The gap statistic indicated an optimal number of one cluster (Fig. 3B) [...] nine-dimensional clustering symptoms (see fig. S4A for scree plot and fig. S4B for clustering symptoms' contribution to the first three principal components). To evaluate the separability between the clusters of the two solutions, we plotted the entire training dataset using the first three principal components and colored the data points based on the three clusters of the clustering symptoms solution (Fig. 3C) and of the pain intensity solution (Fig. 3D). The separability between clusters is clearly seen in the clustering symptoms solution, while a substantial overlap is seen in the pain intensity solution, indicating that using pain intensity alone cannot capture a similar solution.

Fig. 2. Cluster characterization and diagnostic-like validation.
A graded scale of severity is manifested across all clustering symptoms (A to I) and all pain-specific measures (J to R), such that Cluster1 reflects a low severity, Cluster2 reflects a medium severity, and Cluster3 reflects the worst severity. Raincloud plots combining jittered raw data, data distribution, and boxplots were generated using open source code (93). Complementary descriptive and inferential statistical information is provided in Table 2. (S) The plot shows the % endorsement of 11 body regions as distributed in each of the clusters. There was no significant association in the distribution of % endorsed body regions between the clusters (P = 0.99). NRS, Numerical Rating Scale; PCS, Pain Catastrophizing Scale.

To what degree does the clustering solution reflect a latent mental health-related construct?
As we initially expected, the negative affect- or mental health-related clustering symptoms were the most important factors driving the clustering process and contributing most to the first principal component (42.2% of the explained variance; fig. S4B). To estimate the degree to which the underlying data structure of the clustering symptoms reflects a mental health-related construct, we calculated the Pearson correlation coefficient between the first three principal components derived by the above-mentioned PCA and the PROMIS Global Health Mental subscale. Of note, this measure was available for n = 10,835 patients of the training dataset. The first principal component explained 55.42% of the variance in the data structure (fig. S4A) and had a correlation of r = −0.78 with the PROMIS Global Health Mental subscale (Fig. 3E). The correlations with the second and third principal components, which explained 12.28 and 8.83% of the variance in the data structure (fig. S4), were r = 0.11 and r = −0.02, respectively. As expected, this reconfirms mental health as a key construct in the data's underlying structure, but not the only one.
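The analysis in this paragraph (principal components of the nine symptoms, then a Pearson correlation with an external mental health score) can be illustrated with a minimal sketch. All data below are simulated, the variable names are hypothetical, and NumPy's SVD is used as one common way to obtain principal components; the section does not specify the software or preprocessing that was actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data: a hypothetical latent severity factor drives nine
# symptom scores and (inversely) a global mental health score.
latent = rng.normal(size=n)
X = 50 + 10 * (0.8 * latent[:, None] + 0.6 * rng.normal(size=(n, 9)))
mental = 50 - 8 * latent + 4 * rng.normal(size=n)  # higher = better health

# PCA via SVD of the mean-centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)   # proportion of variance per component
scores = Xc @ Vt.T                # component scores per simulated patient

# The sign of a principal component is arbitrary; orient PC1 so that
# higher scores mean more severe symptoms.
if Vt[0].sum() < 0:
    scores[:, 0] *= -1

# Pearson correlation between PC1 and the mental health score.
r = np.corrcoef(scores[:, 0], mental)[0, 1]
print(f"PC1 explains {explained[0]:.1%} of variance; r = {r:.2f}")
```

Because the sign of each component is arbitrary, the direction of a reported correlation depends on how the component is oriented; only its magnitude is invariant.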
To further illustrate this point, we split the range of possible PROMIS Global Health Mental scores into tertiles and labeled them in order of severity (1, 2, and 3) to match the clustering symptoms' cluster solution labeling. We then quantified the level of congruence between these two sets of labels by counting how many patients were assigned by each of the solutions to the same cluster label and how many were mismatched between the clusters (Fig. 3F). The level of congruence was 76.73, 50.44, and 71.77% for Cluster1, Cluster2, and Cluster3, respectively, and with an overall 62.26% congruence. Together, it is clear that mental health is a primary component in the data's underlying structure. Still, the proposed clustering solution reflects more than a latent mental health-related construct, particularly at the intermediate Cluster2.
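The congruence computation described above can be sketched as follows. This is a simplified illustration with made-up scores: the helper names, the score bounds of 20 and 70, and the reading of "tertiles" as three equal-width bands of the possible score range are all assumptions, as is the direction of the mapping (higher mental health score taken to mean lower severity).

```python
def tertile_label(score, lo, hi):
    """Map a score to severity 1..3 using equal thirds of [lo, hi].

    Assumes a higher score means better mental health, so the top
    third maps to severity 1 and the bottom third to severity 3.
    """
    third = (hi - lo) / 3
    if score >= hi - third:
        return 1
    if score >= lo + third:
        return 2
    return 3


def congruence(cluster_labels, tertile_labels):
    """Return (overall, per_cluster) proportions of matching labels."""
    pairs = list(zip(cluster_labels, tertile_labels))
    overall = sum(c == t for c, t in pairs) / len(pairs)
    per_cluster = {}
    for c in sorted(set(cluster_labels)):
        members = [t for cl, t in pairs if cl == c]
        per_cluster[c] = sum(t == c for t in members) / len(members)
    return overall, per_cluster


# Made-up example: eight patients with cluster labels and scores.
clusters = [1, 1, 1, 2, 2, 2, 3, 3]
scores = [65, 60, 45, 50, 40, 30, 25, 45]
tertiles = [tertile_label(s, lo=20, hi=70) for s in scores]
overall, per_cluster = congruence(clusters, tertiles)
print(overall, per_cluster)
```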

Predictive validation and cluster dynamics over time
After controlling for the time between the two assessments (3 to 12 months), we were able to demonstrate substantial differences between clusters as identified at baseline in all clustering symptoms (Table 3 and Fig. 4, A to I) and pain-specific measures at follow-up (Table 3 and Fig. 4, J to Q). Cluster1 continued to reflect the least severe condition, Cluster3 the worst, and Cluster2 in between, again with substantial effect sizes across all comparisons (Table 3). These results validate the prognostic-like nature of the clusters and suggest that the graded scale of severity remains consistent at follow-up at the group level. Nevertheless, cluster identification at follow-up demonstrates that while most patients (n = 879, 69.05%) remained within their same cluster between the two time points, there were movements across clusters (Fig. 5A): 180 patients (14.14%) had an improvement in their condition and moved from Cluster3 to Cluster2 (n = 69, 5.42%) or Cluster1 (n = 6, 0.47%) and from Cluster2 to Cluster1 (n = 105, 8.25%); and 214 patients (16.81%) had a worsening in their condition and moved from Cluster1 to Cluster2 (n = 115, 9.03%) or Cluster3 (n = 4, 0.31%) and from Cluster2 to Cluster3 (n = 95, 7.46%). We compared the total movement of patients across clusters between time points (n = 394, 30.95%) to a bootstrapped distribution of patients moving across clusters within potential measurement error [mean (M) = 5.81% ± 0.54 SD; Fig. 5B], which indicated a significant amount of movement (t (df = 999) = 1477.18, P < 0.0001; Fig. 5C). This suggests that the changes across clusters are meaningful, potentially indicating an interaction between treatment effects and regression to the mean (33). Thus, cluster assignment is not a static condition; rather, various factors might affect the long-term dynamics across clusters, offering a window of opportunity for personalized interventions.
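The movement counts reported above are internally consistent, which can be checked with a short calculation. The sketch uses only the numbers given in this paragraph; the variable names are ours.

```python
n_total = 1273  # patients in the longitudinal dataset

# (baseline cluster, follow-up cluster) -> number of patients who moved
moves = {
    (3, 2): 69, (3, 1): 6, (2, 1): 105,   # improvement
    (1, 2): 115, (1, 3): 4, (2, 3): 95,   # worsening
}

improved = sum(n for (start, end), n in moves.items() if end < start)
worsened = sum(n for (start, end), n in moves.items() if end > start)
moved = improved + worsened
stayed = n_total - moved


def pct(n):
    """Percentage of the longitudinal dataset, rounded to two decimals."""
    return round(100 * n / n_total, 2)


print(improved, pct(improved))
print(worsened, pct(worsened))
print(moved, pct(moved))
print(stayed, pct(stayed))
```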
To provide an estimate of what entails a change in cluster assignment, we calculated the average of absolute change across the nine clustering symptoms' scores between the two time points, as well as the average number of symptoms that had an absolute change beyond the estimated measurement error. We compared these values between the group of n = 394 patients that moved across clusters and the n = 879 patients that remained in the same cluster. The patients moving across clusters between time points had a larger absolute change in symptoms' scores (7.59 ± 3.65) and a larger number of symptoms that changed beyond the measurement error (6.00 ± 1.79) compared to those patients remaining in the same cluster [5.18 ± 2.15, t(1271) = 14.72, P < 0.0001; and 4.96 ± 1.74, t(1271) = 9.78, P < 0.0001, respectively].
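The reported degrees of freedom (1271 = 394 + 879 − 2) are consistent with an independent-samples t test using a pooled variance. Assuming that is the test that was used (the text does not name it), the statistics can be approximately recovered from the rounded summary values above; small discrepancies from the reported t values are expected because the means and SDs are rounded. The function name `pooled_t` is ours.

```python
import math


def pooled_t(m1, s1, n1, m2, s2, n2):
    """Student's t statistic and degrees of freedom from summary
    statistics, assuming equal variances (pooled estimate)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


# Movers (n = 394) versus stayers (n = 879), values as reported above.
t_change, df_change = pooled_t(7.59, 3.65, 394, 5.18, 2.15, 879)
t_count, df_count = pooled_t(6.00, 1.79, 394, 4.96, 1.74, 879)

print(f"absolute change: t({df_change}) = {t_change:.2f}")
print(f"symptoms changed: t({df_count}) = {t_count:.2f}")
```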

DISCUSSION
In our study, we offer a novel biopsychosocial-inspired approach to classify patients with chronic pain, resulting in the identification of three robust idiosyncratic groups of patients and generating putative markers that can classify current and predict future severity of chronic pain in a graded manner, regardless of their formal diagnosis or their underlying etiology. We applied a data-driven clustering algorithm on multidimensional self-reported symptom assessments that are agnostic to pain. These assessments were collected through CHOIR, Stanford's registry-based learning health care system (34), and belong to more than 16,000 real-world patients seeking treatment at a tertiary academic pain clinic. The assessments can be completed using an electronic device from almost any place, in about 15 min, and with hardly any need for assistance from staff. These findings can be instrumental in supporting treatment selection and pain management in a personalized health care platform, especially in the current forward-triage approach to health care in which a clinician might not be able to physically examine a patient (35). Moreover, the findings inspire further research into the biological and behavioral mechanisms that characterize the identified clusters.
The three identified groups reflect a graded scale of severity. They are therefore labeled Cluster1, Cluster2, and Cluster3, with higher numbers indicative of a more severe condition, as shown in all assessments, including those used for clustering and those specific for pain, except for pain duration since onset of chronic pain. The strongest drivers separating the clusters are the negative affect-related factors. No apparent demographic factors significantly differ between clusters. The overall group characteristics initially found on a subset of more than 11,000 patients reliably reproduced in two additional subsets consisting of about 5000 more patients. Moreover, one of these subsets comprising 1273 patients included follow-up assessments, the severity of which was predicted on the basis of the baseline cluster assignment. Examining the dynamics across clusters between baseline and follow-up assessments indicated that cluster assignment is not a static condition, suggesting that various factors might affect improvement or worsening of the pain condition. Thus, beyond the diagnostic- and prognostic-like nature of these symptom-based putative markers, future clinical and research efforts should examine whether and to what extent they will indicate response to various treatments (23,24). This will be of importance to further determine the extent to which changes in cluster assignment reflect "real" change rather than potential measurement errors or other statistical phenomena.
A primary concern of chronic pain health care is identifying safe and effective treatments tailored to the patient's particular needs. Evidence-based approaches have been called to address this challenge by generating classification systems that focus on the fine-grained multidimensional and mechanistic substrates of chronic pain conditions (36,37). However, translating these systems into clinically interpretable and applicable tools is challenging, especially if these systems require costly and burdensome medical tests (38). Consequently, there have been growing efforts to generate empirical classifications of patients with chronic pain based on relatively simple assessments that may advance our understanding of the underlying substrates of chronic pain and potentially inform and support clinical decision-making (13)(14)(15)(16)(17)(18)(19)(20)(21)(22). These efforts differ in sample sizes (from approximately a hundred to thousands), type of chronic pain groups (heterogeneous, specific diagnoses, or even pain-free), type of measures (subjective and/or objective), type of analytic approach (e.g., supervised versus unsupervised algorithms), and the number of resulting clusters (mostly in the range of two to four groups), among others. While a detailed review of the various classification approaches goes beyond the scope of the current work, two such solutions are noteworthy.
One of the first efforts to empirically classify patients with chronic pain (22) identified three clusters based on the multidimensional pain inventory (MPI) (39). The MPI assesses psychosocial and behavioral factors related to the experience of chronic pain, such as pain severity and interference, affective distress, social support, and behavioral activities. The cluster labeled as dysfunctional had relatively high levels of pain and emotional distress, low levels of perceived life control and behavioral activation, and intermediate levels on various social-related factors. The interpersonally distressed cluster had intermediate levels on most factors but was low on social support-related factors. The minimizer/adaptive copers cluster had low levels of pain and emotional distress, high levels of perceived life control and behavioral activation, and high social support. The subgroups were initially developed using patients with heterogeneous chronic pain and later reproduced in other chronic pain diagnostic groups, such as low back pain, headache, and patients with temporomandibular disorder (TMD) (40). Mixed effects were found for the potential association between MPI-based classification and treatment outcomes (41)(42)(43)(44).

Table 3. The graded scale of severity is manifested also here, such that those labeled as Cluster1 at baseline continue to have at the group level the lowest level of severity across all measures, and the same for Cluster2 and Cluster3 being the medium and worst severity, respectively.

A more recent effort (14) used numerous clinical characteristics, psychosocial questionnaires, and measures of autonomic function and multimodal pain sensory testing in patients with TMD and TMD-free controls from several locations in the United States to identify three groups based on best possible characterization of chronic TMD. An adaptive cluster consisting mostly of the controls had better autonomic function, the lowest sensitivity to pain, and the lowest levels of various psychosocial characteristics and symptoms. A global symptoms cluster, half of which were TMD cases, had high levels of sensitivity to pain and of psychosocial characteristics and symptoms. Follow-up analysis indicated that TMD-free controls from this group had greater risk of developing first-onset TMD. An additional pain-sensitive cluster, a quarter of which were TMD cases, had intermediate levels of psychosocial characteristics and symptoms, coupled with heightened sensitivity to experimental pain. An algorithm based on a much smaller subset of the initial measurements, including muscle pain sensitivity, somatization, anxiety, and depression, reproduced and generalized the clusters in additional cohorts from different locations, including patients with chronic overlapping pain conditions and clinical patients most commonly diagnosed with TMD, fibromyalgia, trigeminal neuralgia, and headache (16).

In most clinical settings, as in research, chronic pain is predominantly diagnosed by the relevant anatomical location of pain (7,8). The symptom-based classification system proposed here, similarly to the clustering efforts just described, was inspired by the biopsychosocial approach (9)(10)(11)(12). We aimed to expand the common practice by integrating evidence-based and patient-centric information that goes beyond the potential underlying objective location and manifestation of pathological disease and incorporates the subjective and personal experience and expression of the illness. However, unlike previous clustering efforts, we do so by focusing on domain-general patient-reported symptoms that are agnostic to pain and thus commonly considered secondary in classifying patients with chronic pain rather than a potential starting point. Unlike the dominant diagnostic system, we find no associations between specific locations of experienced pain and the identified clusters. Nevertheless, the number of body regions in pain increased with severity, indicating that patients with widespread and/or overlapping chronic pain conditions suffer more than those with a more localized pain condition. In line with previous clustering approaches and other research (14,45), the findings thus support a diminished reliance on specific anatomical locations of experienced pain when assessing and classifying the severity of impairment in primary pain conditions and potentially when considering treatment avenues. Together, although there are similarities and differences between the current and previous clustering approaches, it is clear that additional research is required to further corroborate the symptom-based classification system proposed here. Moreover, to maximize utility, future efforts should aim to integrate data-driven clustering approaches such as our own and those described above to capture the idiosyncratic complexity of pain within the currently used diagnostic systems.
As expected, negative affect-related symptoms emerged as key factors driving the clustering process. Researchers have previously demonstrated similar negative affect-related metrics to be central in clustering patients with chronic pain (13,14). As we further confirmed, a global measure of mental health was a key construct in the data's underlying structure. This reverberates with the crucial role of mental health in chronic pain (10,31) and as a potential underlying pathological mechanism differing between clusters. Moreover, there is currently an ongoing paradigm shift in psychology and psychiatry, calling for the classification of psychopathology as a hierarchy of continuous dimensions rather than describing it through discrete diagnostic categories (46). On top of the hierarchy is a global factor termed the "p factor," generally ranging from low to high psychopathological severity and cutting through all psychopathological disorders to account for their nonspecific and overlapping manifestation of symptoms (47,48). The resemblance to chronic pain is astounding. The empirical findings presented here suggest that pain as a field of study and treatment should consider establishing a similar hierarchy of continuous global transdiagnostic dimensions to improve the ability to address the challenges of chronic pain. Moreover, this echoes our contemporary perspective on the need for more synergistic interactions between the research and clinical fields of pain and mental health, specifically regarding the centrality of affective components to these fields (31).
Our findings confirm existing notions of a general graded scale of severity of chronic pain illness (45,49,50). However, our approach extends previous efforts in terms of the combination of scale, scope, computational approach, and especially in that we use multidimensional domain-general symptoms that are agnostic to pain. This is advantageous for two main reasons. First, it may highlight potentially modifiable targets for intervention. As we anticipated, negative affect-related factors, namely depression, anxiety, and anger, were key drivers in cluster assignment at the group level. Fortunately, there is a flourishing of treatment strategies aimed to reduce negative affect-related symptomatology (51)(52)(53)(54)(55)(56). Moreover, findings indicate that the distribution of symptoms severity and of patients across clusters is, to a certain extent, blended (e.g., Figs. 2, A to R, and 3C). This suggests that a health care clinician may consider the particular pattern of symptoms at the individual patient level regarding the assigned cluster and use this information to guide and support clinical decision-making contextually. For example, we may envision a patient assigned to the lower severity Cluster1 but has relatively high levels of sleep dysfunction that a clinician could address with specific sleep-related treatments (57).
A second advantage of the domain-general symptoms approach is that it may be implemented and potentially generalizable to other chronic illnesses requiring symptom management beyond the specific pathophysiology of their disease, like cancer, immune disorders, and cardiovascular diseases among many others. Notably, the graded classification of severity resonates with other illnesses that are characterized by a staged progression of disease, such as cancer (58), heart (59) and kidney diseases (60), diabetes (61), and more. Here, however, we did not use objective and etiological-based metrics, and future integration of genetic, metabolic, inflammatory, and/or anatomical and functional neuroimaging metrics can substantially improve our understanding of the biological mechanisms underlying the identified symptom-based clusters and potentially lead to improved (bio)marker properties (23,24).
The U.S. National Pain Strategy has drawn attention to a subgroup of patients with chronic pain: those with persistent high-impact chronic pain. These patients suffer from the most severe and debilitating illness, substantially restricting and interfering with daily life activities, and requiring increased health care expenditure (3,62,63). The prevalence of high-impact chronic pain is estimated to be between 5 and 15% of the adult U.S. population (10 to 30 million people) (3,5,63). Compared to lower but still clinically significant chronic pain, high-impact chronic pain was associated with unfavorable health outcomes, limitations in daily activity, negative coping strategies, elevated distress, increased health care costs, and higher usage and dosage of opioid medication (50,64). With the potential collateral personal, societal, and financial impact of long-term opioid medication (65), it is particularly crucial to better identify and understand people suffering from and at increased risk of high-impact chronic pain. Within our proposed classification system, Cluster3 may be reflective of such a group of patients: It is characterized by the most severe overall condition, manifested at the group level by the highest levels of pain interference, widespread and/or overlapping chronic pain conditions, low levels of physical function, high levels of fatigue and depression, and the worst scores in essentially every measure that we assessed. Early identification of these patients is essential for the provision of more comprehensive and costly pain assessments (e.g., psychological, medical, etc.) that better inform treatment approaches (e.g., physical or psychosocial therapy, medical interventions, etc.). Future research efforts may contrast current conceptualizations of high-impact chronic pain with the characteristics of Cluster3 and consider to what extent it supports this early identification.

There are notable limitations to our study. While our cohort is large, supporting generalizability of the sample and stability of the discovered clusters, it is still restricted to the San Francisco Bay Area and the outlying Northern California region and potentially also to patients who can afford specialized medical treatment in a tertiary academic clinic. Future efforts will need to generalize our findings to other locations with different demographic, sociocultural, and economic characteristics. In this regard, there are known demographic disparities related to pain health care (66, 67) that were not captured by the identified clusters. This may be attributed to the particular characteristics of the cohort (table S3), for example, being primarily White (53.87%) and with above college level of education (61.92%). However, there are some descriptive trends worth noting. Across datasets (table S3), Cluster3 was generally characterized by younger age, lower education level, more females, more patients identifying as Hispanic or Latino, fewer patients identifying as White, and fewer reporting being married. Of note, the recent classification solution described above (14,16) similarly found more females in the non-adaptive clusters but, in contrast, found these clusters to be associated with older age. The range of ages differed between studies, with about 50% of the current sample older than 50 while the cohorts used by the previous clustering effort were mostly below 44. Although it is crucial to better understand the nature of such associations, it is essential to highlight that we can only address most of these factors through a necessary systematic change in health care.
In addition, in terms of cohort characteristics, we had no reliable data on formal diagnostic codes within our cohort. Although findings indicate no association between the identified clusters and anatomical pain location, which is commonly used to support formal definitions of chronic pain conditions, future efforts should examine the relationship between formal diagnoses of disease state and cluster membership. Notably, previous CHOIR studies were able to obtain and indicate a multitude of formal diagnoses (28, 29), including neuropathic, thoracolumbar, orofacial, visceral, and various musculoskeletal pain conditions, as well as fibromyalgia and complex regional pain syndrome, among many others, and these can be assumed to be part of the current cohort. Thus, our findings seem to be generalizable at least to varying types of chronic pain conditions.
The symptoms used for clustering are based on NIH's PROMIS system, which has potential limitations, since although the symptoms were validated for their psychometric properties (30)(68)(69)(70)(71)(72)(73), they are still based on subjective self-reports and thus prone to potential biases and demand characteristics. Other studies using various clustering approaches have incorporated more objective measurements, with a better characterization of their underlying mechanistic substrates. Most have used various multimodal pain sensory tests that map onto various nociceptive pathways (14,15,17,18,20,21). While incorporating objective measures with better understanding of their underlying pathophysiology is a clear next step for this research, using the PROMIS system offers substantial advantages. PROMIS-based T scores are normed to the general U.S. population and thus easily comparable across cohorts. PROMIS is also an inexpensive and easily administered system, using short forms or computerized adaptive testing (CAT) to reduce time and patient burdens, and is already in wide usage in many settings, even beyond chronic pain, thus allowing others to take a similar approach to ours or to engage with our freely available cluster classifier (https://git.io/Jn8m1) for additional utilization in clinical and research settings. Moreover, previous findings show associations between various PROMIS measures and potential biomarkers in both pain and nonpain clinical contexts (74)(75)(76). Last, the clusters differ in pain-specific measures that are non-PROMIS based, such as pain intensity and pain catastrophizing, and this solidifies the validity and generalizability beyond PROMIS-based measures.
In conclusion, our symptom-based approach and findings offer significant diagnostic- and prognostic-like utility for a cost-effective, graded severity classification system of patients with chronic pain, potentially generalizable to other chronic illnesses. Our study's exploratory nature requires further research to reconfirm and generalize the identified clusters in different chronic pain cohorts, as well as experimental and mechanistic studies to uncover their etiological basis. Nevertheless, this system promises to support clinical decision-making, affecting the day-to-day functioning of patients with chronic pain, and encourages investigations into new treatment opportunities oriented toward a precision- and evidence-based approach to relieve the burdens of people suffering from chronic illness and improve their quality of life. It thus reflects a synergy between theory-driven scientific research, clinical care, and technological advancement that aims to facilitate personalized health care by closing the bedside-to-bench-to-bedside loop.

General data acquisition procedures and dataset definition
Data were collected using Stanford University's CHOIR (http://choir.stanford.edu), a registry-based, learning health care system that administers an electronic survey assessing self-reported demographic information, pain characteristics, and multiple domains of health status in real-world clinical settings (Fig. 1A) (34). Patients presenting for consultation at Stanford Pain Management Center locations throughout the San Francisco Bay Area and broader Northern California region, USA, with the main site located in Redwood City, complete the survey as part of their routine clinical care. While intended for completion at home using personal computers or hand-held devices, patients may complete the survey before their appointment at clinic check-in using a tablet computer. Survey completion is encouraged, yet optional, and is based on patients' willingness and ability to collaborate. Patients may therefore choose not to respond to certain items or assessments. These procedures were approved by the Stanford University School of Medicine Institutional Review Board (IRB). Informed consent was waived by the IRB, as CHOIR data were collected for clinical care and quality improvement purposes.
Data analyzed were from a retrospective review of all surveys collected from CHOIR's inception in October 2013 through August 2019. Our initial data extraction included 24,389 records, from which we removed records based on the following criteria: noncompleted or test records (6002), missing data in any of the nine measures used for clustering (as detailed below; 1651), duplicated records (136), and age below 18 years (62). From the resulting 16,538 surveys belonging to 16,538 different patients, we extracted a longitudinal dataset of 1273 patients who had a follow-up survey between 3 and 12 months later, again with the minimal requirement of complete data for the nine assessments used for clustering at both time points. We chose this time frame since 3 months is considered the minimal threshold for diagnosing primary chronic pain (7). The upper threshold of 12 months allowed us to keep a substantially large proportion of the dataset for cluster discovery and validation. The remaining 15,265 patients were randomly split on the basis of a 75%:25% allocation into a training dataset of 11,448 patients used for cluster discovery and an additional validation dataset of 3817 patients.
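The record counts above can be checked with a few lines of arithmetic. The following is only a bookkeeping sketch using the numbers reported in the text; the variable and key names are illustrative and do not correspond to any CHOIR field names.

```python
# Counts reported in the text; dictionary keys are illustrative labels only.
initial_records = 24_389
excluded = {
    "noncompleted_or_test": 6_002,
    "missing_clustering_measure": 1_651,
    "duplicated": 136,
    "age_below_18": 62,
}

# Surveys (one per patient) remaining after all exclusions.
eligible = initial_records - sum(excluded.values())

# Patients set aside for the longitudinal dataset.
longitudinal = 1_273
cross_sectional = eligible - longitudinal

# Reported sizes of the random 75%:25% split.
training = 11_448
validation = 3_817
training_share = training / cross_sectional

print(eligible, cross_sectional, round(training_share, 2))
```

The three dataset sizes (training, validation, and longitudinal) sum to the 16,538 eligible patients, so no patient appears in more than one dataset.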

Demographic characteristics
Demographic characteristics included age, sex, ethnicity, race, marital status, and years of education.

Clustering symptoms
The nine symptoms assessing health-related functionality and used as the basis for the clustering procedures were from the NIH's PROMIS (30, 68-70, 72, 73). We divided these nine symptoms into three domains: (i) the physical domain (fatigue, sleep disturbance, and sleep impairment), (ii) the mental or negative affect domain (depression, anxiety, and anger), and (iii) the social domain (social isolation, emotional support, and satisfaction with social roles and activities). Response items are contextualized to the frequency of the experienced symptom in the past 7 days (e.g., "in the past seven days how often did you feel tired?" and "in the past seven days I felt worthless"), and responses were marked on a 1 to 5 scale (1 = never, 5 = always). Each symptom was measured using CAT, based on item response theory-derived metrics. CAT reduces the time needed to complete each measured symptom because patients respond only to a subset of items from the relevant PROMIS item bank. The CAT algorithm selects the items that are most informative for precisely characterizing the symptom for the patient, with a minimum of 4 (for adults) and a maximum of 12 items, although typically not more than 8 items, taking a minute or two per measured symptom (30,71). The total number of items answered by a patient across the nine clustering symptoms therefore ranged between 36 and 108. Typically, completing all nine symptom measurements should take about 15 min (77,78).
Ultimately, a standardized T score for each PROMIS symptom is generated for each patient. A score of 50 reflects the mean of the U.S. general population, with an SD of 10. Higher T scores reflect more of the measured symptom. We further extracted data from a PROMIS-based global health measure, specifically the Global Health Mental subscale, which consists of four items assessing general mental health, quality of life, satisfaction with social activities, and emotional problems (79). While for most measures, such as fatigue or depression, higher T scores indicate a worse condition, for emotional support, satisfaction with social roles, global health mental, and physical function (see below), higher T scores reflect a better condition. Further details regarding measure development and validation are available at www.healthmeasures.net.
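To make the T score metric concrete, the sketch below converts a T score into SD units relative to the U.S. general population mean, using only the mean of 50 and SD of 10 stated above. The percentile conversion additionally assumes a normal distribution in the reference population, which is an illustrative assumption and not a claim made in the text; the function names are hypothetical.

```python
from statistics import NormalDist

US_MEAN = 50.0  # T score of the U.S. general population mean (from the text)
US_SD = 10.0    # T score points per population SD (from the text)


def t_to_z(t_score):
    """Express a T score in SD units relative to the reference mean."""
    return (t_score - US_MEAN) / US_SD


def t_to_percentile(t_score):
    """Approximate percentile rank.

    Assumption: scores are normally distributed in the reference
    population; this is for illustration only.
    """
    return 100 * NormalDist().cdf(t_to_z(t_score))


# A fatigue T score of 60 lies one SD above the population mean.
print(t_to_z(60.0))  # -> 1.0
```

Because the direction of each measure differs (for example, higher is worse for fatigue but better for emotional support), the sign of the resulting value has to be interpreted per measure.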

Pain-specific measures
Pain-specific measures were used independently of the clustering process to validate the diagnostic-like nature of the data-driven clusters in terms of pain-related constructs. A composite score of pain intensity was calculated by averaging three self-reported pain intensity measures. These measures used a common and validated (80) 11-point numeric rating scale of 0 to 10 (0 = no pain, 10 = pain as bad as you can imagine) for worst and average pain in the last 7 days and for current pain. The number of body segments in which chronic pain is experienced was self-reported by patients, who were asked to mark locations of pain on a reliable and valid CHOIR body map scheme that included 36 anterior and 38 posterior symmetrical body segments for a maximum total of 74 segments (Fig. 1A) (81). This measure was used to reflect the extent of pain throughout the body. A group of physicians recoded these 74 body segments into 11 body regions (table S1 and fig. S1), which were subsequently used to examine specific locations in which patients experienced pain. Six body segments are labeled with different codes in the male and female versions of the CHOIR body map. Patients who do not report their gender are given the female version of the body map by default. These differences were recoded so that segments matched the correct body region across all participants. Pain duration was calculated as the number of months from the onset of chronic pain, as self-reported by patients.
Additional measures assessed using PROMIS instrumentation included pain interference with daily life activities, pain behavior, and physical function (30,70). In September 2016 and moving forward, physical function was assessed using two separate measures in CHOIR, reflecting physical function of the upper extremity and lower mobility (82). Across the entire dataset, most patients had the two separate measures (61.32%). We analyzed each of the three physical function measures separately to be able to differentiate between them.
The last pain-specific measure was pain catastrophizing, reflecting maladaptive cognitions such as rumination, magnification, and helplessness, in response to actual or anticipated pain. Pain catastrophizing has been associated with poor outcomes, maintenance, and worsening of chronic pain illness (83)(84)(85). We used the Pain Catastrophizing Scale, which has previously demonstrated sound psychometric properties (86)(87)(88), to measure the frequency with which a patient engages in catastrophic thought patterns. The scale consists of 13 self-reported items on a 0 to 4 scale (0 = not at all, 4 = all the time).

Cluster discovery
Hierarchical clustering is a well-established unsupervised machine learning technique that aims to discover groups or clusters of observations within a dataset without needing to a priori determine the specific characteristics of each cluster (25)(26)(27). Observations within the same cluster are expected to have similar characteristics, while different clusters are expected to have dissimilar characteristics. A cluster-tree diagram or dendrogram is a mathematical and pictorial representation of a cluster solution.
We implemented an AHCA on the training dataset, using the nine clustering symptoms, to assign each patient to a cluster. AHCA implements an iterative process in which the two most similar observations (i.e., patients or groups of patients) are fused to form a superordinate cluster until all observations belong to one single cluster. Two parameters important for this process are a distance metric that determines how similar observations are to each other and a linkage method to fuse similar observations. The agglomerative coefficient can then assess how tightly packed each cluster is within a cluster solution. We used the Euclidean distance metric combined with the Ward linkage method, as it optimized the agglomerative coefficient compared to four other linkage methods (table S5).
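The agglomerative procedure with Ward linkage can be illustrated with a deliberately naive, standard-library sketch. This is not the implementation used in the study; it is a toy version under the assumption that the Ward criterion is expressed as the increase in within-cluster sum of squares caused by a merge. The function names and the two-symptom toy scores are hypothetical, and the agglomerative coefficient is not computed here.

```python
from itertools import combinations


def centroid(points):
    """Mean of each column (symptom) over the given rows (patients)."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clustering(points, k):
    """Fuse the cheapest pair of clusters until k clusters remain.

    Returns a list of clusters, each a list of row indices into points.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        # combinations yields i < j, so deleting j keeps index i valid.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(
                [points[r] for r in clusters[ij[0]]],
                [points[r] for r in clusters[ij[1]]],
            ),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical T scores for two symptoms in six patients.
toy_scores = [[45, 44], [47, 46], [46, 45], [65, 66], [67, 64], [66, 65]]
groups = ward_clustering(toy_scores, k=2)
print(sorted(sorted(g) for g in groups))
```

The pairwise search makes this sketch impractical for a cohort of more than 11,000 patients; an optimized library routine, such as `scipy.cluster.hierarchy.linkage` with `method="ward"`, would be the usual choice in practice.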
We subsequently used the gap statistic to determine the optimal number of clusters, k (91). The gap statistic compares the within-cluster sum of squares of a certain k-clusters solution to the expected within-cluster sum of squares under a null distribution with no clusters. An ideal solution will have a small within-cluster sum of squares and therefore a large gap statistic. We calculated the gap statistic for k between 1 and 10. The smallest value of k that is within 1 SD of the value of k that maximizes the gap statistic should be chosen as the optimal number of clusters.
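The selection rule described above can be sketched as follows. The gap is computed on the log scale, as in the standard definition of the gap statistic; the rule for choosing k follows the wording of the text (smallest k whose gap lies within 1 SD of the maximal gap). The function names are hypothetical, and the gap values and SDs in the example are made up purely for illustration; they are not results from the study.

```python
from math import log
from statistics import mean, stdev


def gap_statistic(w_observed, w_null):
    """Gap for one k.

    w_observed: within-cluster sum of squares of the observed k-cluster solution.
    w_null: within-cluster sums of squares from reference datasets drawn
            under a null distribution with no clusters.
    Returns (gap, sd of the log null values).
    """
    log_null = [log(w) for w in w_null]
    return mean(log_null) - log(w_observed), stdev(log_null)


def choose_k(gap, sd):
    """Smallest k whose gap is within 1 SD of the maximal gap.

    gap and sd map each candidate k to its gap statistic and SD.
    """
    k_best = max(gap, key=gap.get)
    threshold = gap[k_best] - sd[k_best]
    return min(k for k in gap if gap[k] >= threshold)


# Made-up values for illustration only.
example_gap = {1: 0.10, 2: 0.35, 3: 0.52, 4: 0.55, 5: 0.54}
example_sd = {1: 0.02, 2: 0.03, 3: 0.04, 4: 0.05, 5: 0.05}
print(choose_k(example_gap, example_sd))  # -> 3
```

In this example the gap is maximal at k = 4, but k = 3 already falls within 1 SD of that maximum, so the more parsimonious solution is selected.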
We next aimed to determine the relative importance of each clustering symptom to the clustering process, i.e., to the separability between clusters. We computed each cluster's centroid (25,27), which is the average value of each clustering symptom across all of the observations in that cluster, and then calculated the total Euclidean distance between all cluster centroids. The average amount each clustering symptom contributed to the distance between cluster centroids, divided by the sum of all clustering symptoms' contributions to the total Euclidean distance between all cluster centroids, provides a percent contribution to the overall separability between clusters.
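One plausible reading of this calculation is sketched below. It assumes that a symptom's contribution to the distance between two centroids is its squared difference (the terms that sum to the squared Euclidean distance); the text does not specify this detail, so the choice is an assumption, as are the function name and the centroid values.

```python
from itertools import combinations


def symptom_contributions(centroids):
    """Percent contribution of each symptom to between-centroid separation.

    centroids: dict mapping cluster label -> dict of symptom -> mean T score.
    Assumption: contributions are squared centroid differences summed over
    all pairs of clusters, then normalized to 100%.
    """
    symptoms = list(next(iter(centroids.values())))
    totals = dict.fromkeys(symptoms, 0.0)
    for a, b in combinations(centroids.values(), 2):
        for s in symptoms:
            totals[s] += (a[s] - b[s]) ** 2
    grand_total = sum(totals.values())
    return {s: 100 * v / grand_total for s, v in totals.items()}


# Hypothetical centroids for three clusters and two symptoms.
toy_centroids = {
    "A": {"fatigue": 50.0, "depression": 48.0},
    "B": {"fatigue": 60.0, "depression": 52.0},
    "C": {"fatigue": 70.0, "depression": 56.0},
}
contributions = symptom_contributions(toy_centroids)
print(contributions)
```

In this toy example the centroids differ more on fatigue than on depression, so fatigue accounts for the larger share of the separability.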

Cluster characterization, reliability, and validity
Univariate analysis of variance (ANOVA) and subsequent t tests were used to examine differences in clustering symptoms and in pain-specific measures between the identified clusters, with Bonferroni correction applied to account for multiple comparisons. Chi-square tests were used to examine the differential distribution of demographic factors and of body regions between the clusters and to determine whether specific body regions were associated with any of the identified clusters. This sequence of tests was conducted initially on the training dataset to assess the clusters' diagnostic-like potential and subsequently on the validation dataset and the baseline of the longitudinal dataset to assess the reliability of cluster assignment and validate the clusters' characteristics in other sets of patients. A nearest centroid classifier (25,27) was generated to assign or label a cluster to a "new" patient, based on the shortest Euclidean distance between the values of the clustering symptoms of that patient and each cluster's centroid.
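The nearest centroid classifier amounts to a single distance comparison per cluster. The sketch below is a minimal illustration, not the released classifier; the centroid values, the use of three rather than nine symptoms, and the names are hypothetical.

```python
from math import dist

# Hypothetical centroids: mean T scores per cluster for three symptoms,
# listed in a fixed order (e.g., fatigue, depression, social isolation).
CENTROIDS = {
    "cluster_1": [48.0, 47.0, 46.0],
    "cluster_2": [58.0, 57.0, 55.0],
    "cluster_3": [67.0, 66.0, 64.0],
}


def assign_cluster(scores, centroids=CENTROIDS):
    """Return the label of the centroid closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(scores, centroids[label]))


# A "new" patient's T scores, in the same symptom order as the centroids.
new_patient = [60.0, 55.0, 57.0]
print(assign_cluster(new_patient))  # -> cluster_2
```

The symptom order of a patient's scores must match the order used for the centroids; otherwise, distances are computed across mismatched symptoms.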

Predictive validation and cluster dynamics over time
Univariate analyses as described above were used to examine differences in clustering symptoms and in pain-specific measures between clusters as assigned at baseline, using data from the follow-up. To control for time-related effects, we added the number of days between the two assessments as a covariate. This provided prognostic-like validation of the clusters.
Next, the nearest centroid classifier was implemented on the follow-up dataset to assess patient movement across clusters between the baseline and follow-up time points. In addition, we used a bootstrap procedure (26,27) to assess whether patients' movement across clusters over time was due to potential error in the measurement of the clustering symptoms or to the clusters' ability to portray a real improvement or worsening of their condition. Since the PROMIS CAT engine uses a stopping criterion such that the standard error of the T score drops below a specified level of 3.0 T score metric points (92), we randomly jittered the original T score for each patient and for each of the clustering symptoms at baseline within that error threshold, i.e., ±3.0 T score points. The nearest centroid classifier was implemented on each patient's simulated data to assign a cluster. We then assessed movement across clusters between the simulated dataset and the follow-up dataset and calculated the number and subsequently the percent of patients moving across clusters. This procedure was repeated 1000 times to generate a bootstrapped distribution of the percent of patients moving across clusters within measurement error. This distribution allowed us to calculate the probability that the actual percent of patients moving across clusters between the baseline and follow-up time points could be attributed to measurement error.
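The jittering procedure can be sketched as follows. Several details are assumptions made only for illustration: the jitter is drawn uniformly within ±3.0 points (the text specifies the range but not the distribution), the final probability is taken as the share of simulated percentages at least as large as the observed one, and all names and toy values are hypothetical.

```python
import random
from math import dist


def assign_cluster(scores, centroids):
    """Nearest centroid classifier based on Euclidean distance."""
    return min(centroids, key=lambda label: dist(scores, centroids[label]))


def movement_distribution(baseline, followup_labels, centroids,
                          n_iter=1000, jitter=3.0, seed=0):
    """Percent of patients changing cluster in each simulated dataset.

    baseline: list of baseline T score vectors, one per patient.
    followup_labels: cluster assigned to each patient at follow-up.
    Assumption: jitter is uniform within +/- jitter T score points.
    """
    rng = random.Random(seed)
    percents = []
    for _ in range(n_iter):
        moved = 0
        for scores, follow_label in zip(baseline, followup_labels):
            simulated = [s + rng.uniform(-jitter, jitter) for s in scores]
            if assign_cluster(simulated, centroids) != follow_label:
                moved += 1
        percents.append(100 * moved / len(baseline))
    return percents


def exceedance_probability(observed_percent, simulated_percents):
    """Share of simulated percentages at least as large as the observed one.

    Assumption: one simple way to turn the distribution into a probability.
    """
    hits = sum(p >= observed_percent for p in simulated_percents)
    return hits / len(simulated_percents)


# Toy example: two well-separated hypothetical centroids, two patients.
toy_centroids = {"low": [45.0, 45.0], "high": [65.0, 65.0]}
toy_baseline = [[44.0, 46.0], [66.0, 64.0]]
simulated = movement_distribution(toy_baseline, ["low", "high"], toy_centroids)
print(exceedance_probability(10.0, simulated))
```

In the toy example the centroids are far apart relative to the jitter, so no simulated dataset produces movement; with real data, patients lying near the boundary between two centroids are the ones whose assignment can change within measurement error.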

SUPPLEMENTARY MATERIALS
Supplementary material for this article is available at https://science.org/doi/10.1126/sciadv.abj0320
View/request a protocol for this paper from Bio-protocol.