Using natural language processing to construct a metastatic breast cancer cohort from linked cancer registry and electronic medical records data

Abstract Objectives Most population-based cancer databases lack information on metastatic recurrence. Electronic medical records (EMR) and cancer registries contain complementary information on cancer diagnosis, treatment and outcome, yet are rarely used synergistically. To construct a cohort of metastatic breast cancer (MBC) patients, we applied natural language processing techniques within a semisupervised machine learning framework to linked EMR-California Cancer Registry (CCR) data. Materials and Methods We studied all female patients treated at Stanford Health Care with an incident breast cancer diagnosis from 2000 to 2014. Our database consisted of structured fields and unstructured free-text clinical notes from EMR, linked to CCR, a component of the Surveillance, Epidemiology and End Results Program (SEER). We identified de novo MBC patients from CCR and extracted information on distant recurrences from patient notes in EMR. Furthermore, we trained a regularized logistic regression model for recurrent MBC classification and evaluated its performance on a gold standard set of 146 patients. Results There were 11 459 breast cancer patients in total and the median follow-up time was 96.3 months. We identified 1886 MBC patients, 512 (27.1%) of whom were de novo MBC patients and 1374 (72.9%) were recurrent MBC patients. Our final MBC classifier achieved an area under the receiver operating characteristic curve (AUC) of 0.917, with sensitivity 0.861, specificity 0.878, and accuracy 0.870. Discussion and Conclusion To enable population-based research on MBC, we developed a framework for retrospective case detection combining EMR and CCR data. Our classifier achieved good AUC, sensitivity, and specificity without expert-labeled examples.

Patients diagnosed with earlier stage breast cancer (stages 0-III) may develop MBC as a distant recurrence of the primary tumor after incident breast cancer diagnosis. Patients diagnosed with stage IV breast cancer have de novo MBC, as their breast cancers have spread to other parts of the body at the time of initial diagnosis. Despite substantial improvements in the treatment and prognosis of early stage breast cancer, less is known about changes in survival and other outcomes of MBC patients. 2,3 Although many clinical trials focus on the treatment of MBC, it is not clear to what extent trial outcomes correspond to real-world outcomes. 4 In addition, only a small fraction of MBC patients are trial-eligible and it is impossible to study all permutations of sequential drugs in clinical trials. Thus, there is a need to use population-level observational data to study MBC outcomes in this critical patient population.
However, there is a lack of suitable population-based data resources for the study of distant recurrence among breast cancer patients in the United States, as recurrences are not reported in most cancer registries. Although registries from the national Surveillance, Epidemiology and End Results Program (SEER) of the National Cancer Institute record the initial cancer stage at diagnosis, the first course of treatment, and up to 1 year of follow-up after initial diagnosis, they do not report longer term follow-up, during which time metastatic recurrences would be likely to occur. As a result, population-based cancer registries such as the California Cancer Registry (CCR) of SEER can only be used to identify de novo MBC patients, who have been estimated to represent only around one-quarter of all MBC patients 5,6 and whose disease may behave very differently from recurrent MBC. 7 Electronic medical records (EMR) contain large amounts of data collected during routine medical care delivery and have the potential to generate practice-based evidence. However, it has been challenging to make use of this abundance of data, in part because of difficulties in identifying which breast cancer patients have had metastatic recurrences. 8 Thus, a profound gap remains in our knowledge about real-world treatment of MBC and how patients die of this disease.
Since identifying MBC cohorts via manual case review is prohibitively laborious, there have been many informatics approaches proposed to retrospectively identify MBC cases from healthcare databases such as claims and EMR. Rule-based approaches that use structured data such as qualifying diagnoses, procedures, and drug codes have been developed. [9][10][11][12][13] While such approaches are simple to replicate in a new dataset, their reliability is challenged by coding bias and differential coding practices. In addition, these approaches can suffer from low sensitivity (40%-60%), despite reasonable specificity (70%-90%). 10,11,13 A promising alternative is to analyze unstructured clinical text in EMR, which has shown higher sensitivity and specificity. 14,15 However, the limitations of this approach include a high cost of initial development, difficulty in adapting to new systems, and most significantly, the requirement for a prohibitively large amount of manually annotated training data.

OBJECTIVES
We sought to improve upon current informatics approaches to automate MBC case detection with the potential to support population-level surveillance research across California and nationally. To do so, we leveraged the complementary patient data contained in EMR and CCR, and developed a semisupervised machine learning framework, within which we applied natural language processing (NLP) techniques to extract information from unstructured clinical notes. Semisupervised machine learning comprises a class of techniques that make use of unlabeled data to train machine learning models. It falls between unsupervised learning (no labeled training data) and supervised learning (completely labeled training data), and typically pairs a small amount of labeled data with a large amount of unlabeled data. Specifically, our methodological innovation extends the distant supervision paradigm described by Mintz et al, which has been applied for over a decade in the development of general domain NLP and information extraction tools. [16][17][18] In distant supervision, a distinct data source is used to label training examples automatically in the absence of human-labeled training data, for the purpose of subsequent supervised learning. [16][17][18] In this study, we applied distant supervision to the problem of retrospective MBC case detection.

Data source
The Oncoshare breast cancer research database comprises a three-way data linkage at the patient level. It is an integration of EMRs of Stanford Health Care (SHC), an academic health institution, and multiple sites of the Palo Alto Medical Foundation (PAMF), a community-based medical center in Northern California, both linked to data from CCR, a state-wide SEER registry. 19,20 Only SHC patients were included in this study since clinical notes of PAMF patients were not accessible at the time of study. Human Subjects approval for all research reported here was obtained from the Institutional Review Boards of Stanford University and the State of California.
The structured EMR fields in Oncoshare's clinical database include each patient's diagnoses, procedures, and drug orders. The unstructured EMR fields include free-text clinician notes such as medical and social histories, impressions, and visit summaries. The CCR contains detailed sociodemographic information such as patient age, race/ethnicity, zip code and neighborhood characteristics, insurance and marital status, tumor characteristics at initial breast cancer diagnosis, and continually updated survival data which SEER obtains through linkage to the Social Security Death Master file and other national databases. 19,20 For this study, we focused on 11 459 breast cancer patients treated at SHC from 2000 to 2014. Descriptive information on the length of follow-up of the study population appears in Supplementary Tables S1 and S2. Survival status was collected by the CCR as of December 31, 2014 or any later follow-up of specific patients. The last follow-up date was the later of the last follow-up date from the CCR (December 31, 2014) and the last encounter date in SHC's EMR database. A flow chart that shows how patients were analyzed by our framework appears in Figure 1. Metastatic disease that was de novo stage IV was directly retrieved from the CCR. Patients who were not identified as de novo MBC by CCR and who did not have any clinical notes were classified as non-MBC patients. Our informatics method focused on detecting cases of metastatic recurrence, and thus included only patients initially diagnosed in stages 0-III as recorded by the CCR.
Creating an expert-reviewed "gold standard" patient set for evaluation
Two board-certified medical oncologists (AWK and JLC) manually reviewed deidentified EMRs from 146 female breast cancer patients to create an evaluation set; these patients' records were not used in the development of the statistical classifiers. The size of this evaluation set was chosen in reference to the validation process of EMR-based phenotypes from the Electronic Medical Records and Genomics (eMERGE) studies, where 50-200 subjects were reviewed to evaluate the performance of a given algorithm. 21 The set of 146 patients was a combination of convenience sampling and selective sampling.
Approximately 50 MBC cases were selected upon reflection of the participating oncologists' clinical experience. To extend the size of our evaluation set, we randomly sampled more patients from two ends of a ranked list. Patients with any MBC mentions (positive or negative) were ranked according to the number of positive MBC mentions in their clinical notes. Details on the detection of MBC mentions from clinical notes can be found in the following sections. Considering availability and time constraints of our oncologists, and without prior knowledge of the underlying prevalence of MBC in our study population, we selected patients to achieve a final balanced evaluation set with similar numbers of MBC cases and controls.
The oncologists determined the presence or absence of a metastatic recurrence in each patient's medical record using all her clinical notes, radiology reports, and pathology reports in the EMR. Unlike note-level or sentence-level review, the participating oncologists synthesized information contained in a patient's entire medical record available over time before labeling her as having MBC or not. The most common source of information on recurrence was the most recent medical oncology or radiation oncology visit note. If there was no such note, or if this note was written more than 6 months before the time of chart review or the patient's death and did not indicate MBC, then more recent notes from other clinical specialties, pathology reports and imaging reports were examined. If no evidence of MBC was found after review of all these sources, then the patient was labeled as not having recurrent MBC.

Distant supervision of MBC classification
Our distant supervision framework exploits the Oncoshare EMR-CCR linkage. In the absence of a large number of manually annotated cases, we used one data source from the linked data, EMR, to infer a class label for metastatic recurrence. These class labels were then used to supervise the learning of a classification model using input variables from both EMR and CCR.
Step 1: Processing EMR clinical notes and assigning distant labels
In step 1, we used NLP-derived features to label patients that were likely to have experienced a metastatic recurrence, based on free-text patient notes in the EMR. Specifically, we adapted an open-source clinical text analysis tool, CLEVER (CL-inical EV-ent R-ecognizer), which has been validated for EMR-based information extraction tasks in prior work, to extract metastatic disease information. 22 This decision was based on the efficiency of CLEVER's tagger, which facilitates the review of intermediate system output by subject matter experts and their inclusion in the development of custom clinical NLP extractors. CLEVER's source code, base terminology, and all customized components that were developed as part of this work are distributed publicly with an MIT software license on GitHub. 1
Although mature clinical NLP systems exist, they can be difficult to install and must be adapted to new sources of data. Simple taggers leveraging resources such as the National Library of Medicine's Unified Medical Language System (UMLS) and SPECIALIST Lexicon tools have been shown to rival their performance and are easier to install. 23 As illustrated in Supplementary Figure S4, CLEVER makes one modification to general UMLS-based taggers such as Noble Coder or MetaMap 24,25 in that we pretrained word- and phrase-embedding models on clinical text to expand terminologies that are "seeded" by UMLS terms. Using language-embedding models to identify new terms that were statistically similar to the high quality UMLS seed terms, we developed an enhanced terminology using an iterative and incremental process that included two informaticists and a subject matter expert to assist in the review of candidate terms.
After our terminology for MBC information extraction was complete, we used CLEVER to annotate the corpus and extract mentions of different metastatic disease concepts that could be used to infer the presence or absence of a metastatic recurrence. We also examined their immediate contexts to determine if the target term was negated, hypothetical or an attribute of a family member and not the patient (Supplementary Figure S5). Specifically, CLEVER's base classes include negation and familial terms from ConText and NegEx that have been expanded through a word embedding method to detect additional similar terms directly from patient notes. 22 The custom classes that we developed for metastatic recurrence detection are shown in Table 1. The CLEVER rule that we developed to assign a case label to each patient was based on the positive present mention of at least one term from any of the four custom word classes: "METSBONE," "METSBRAIN," "METSLIVER," and "METSLUNG". These four word classes were constructed in a data-driven way from the most common sites of metastasis among our patients. In contrast to less specific word classes such as "DRECUR," which contains terms that indicate a nonspecific distant recurrence, these four word classes include terms that indicate both metastatic disease and a location distant to the breast.
1 https://github.com/stamang/CLEVER
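The labeling rule described above can be sketched in a few lines of Python. This is only an illustration of the rule's logic, not CLEVER's actual code or output format: the `Mention` record layout and the function names are assumptions introduced here, and only the four word class names come from the text.

```python
from dataclasses import dataclass

# The four site-specific word classes named in the text.
LABEL_CLASSES = {"METSBONE", "METSBRAIN", "METSLIVER", "METSLUNG"}


@dataclass(frozen=True)
class Mention:
    """One tagged term occurrence (hypothetical record layout)."""
    patient_id: str
    word_class: str     # e.g. "METSBONE" or "DRECUR"
    negated: bool       # term appears in a negated context
    hypothetical: bool  # term appears in a hypothetical context
    familial: bool      # term refers to a family member, not the patient


def is_positive_present(m: Mention) -> bool:
    """A mention counts only if it is not negated, hypothetical, or familial."""
    return not (m.negated or m.hypothetical or m.familial)


def assign_distant_labels(patient_ids, mentions):
    """Return {patient_id: 1 or 0}; 1 marks a likely metastatic recurrence."""
    labels = {pid: 0 for pid in patient_ids}
    for m in mentions:
        if (m.patient_id in labels
                and m.word_class in LABEL_CLASSES
                and is_positive_present(m)):
            labels[m.patient_id] = 1
    return labels
```

Under this sketch, a patient whose only mentions are negated, or belong to a nonspecific class such as DRECUR, keeps the label 0, which mirrors the "at least one positive present mention" wording of the rule.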
Step 2: Recurrent MBC classification
In step 2, we used the distant labels from step 1 to train metastatic recurrence classification models. Two sets of features were included in the classifiers: NLP-derived features from clinical notes in EMR and CCR features. The 427 patient-level NLP-derived features included the total number of terms mentioned in each of the customized word classes (except METSBONE, METSBRAIN, METSLIVER, and METSLUNG) and their frequency as positive or negative concepts, in each specific note type and across all note types in the EMR. The features from the four custom word classes mentioned above were excluded from this step because they were used to infer the training labels, and including them would result in "learning back" our labeling process. The CCR features included structured fields such as age, race, ethnicity, marital status, socioeconomic status, insurance type, comorbidity, year of initial breast cancer diagnosis, cancer stage, tumor grade, tumor histology, and tumor receptor status (eg, expression of estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 [HER2/neu]). Missing data in any of the structured features above were coded as a separate category. We trained three classifiers: A (CCR features only), B (NLP features only), and C (NLP + CCR features). Other than having different sets of input features, all aspects of the three classifiers were kept the same for fair comparisons.
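As a rough illustration of how such patient-level count features might be assembled, the following Python sketch tallies mentions per word class, polarity, and note type, while skipping the four classes used for labeling. The feature naming scheme, the tuple layout, and the note type names are assumptions made for this example; the actual 427 features are not enumerated in the text.

```python
from collections import Counter

# Classes used to derive the distant labels; excluded to avoid "learning back".
EXCLUDED_CLASSES = {"METSBONE", "METSBRAIN", "METSLIVER", "METSLUNG"}


def build_nlp_features(mentions):
    """Build count features for one patient.

    mentions: iterable of (word_class, polarity, note_type) tuples, where
    polarity is "pos" or "neg" (hypothetical layout).
    Returns a Counter mapping feature name -> count.
    """
    feats = Counter()
    for word_class, polarity, note_type in mentions:
        if word_class in EXCLUDED_CLASSES:
            continue
        feats[f"{word_class}_total"] += 1                   # all mentions of the class
        feats[f"{word_class}_{polarity}_{note_type}"] += 1  # per note type
        feats[f"{word_class}_{polarity}_ALL"] += 1          # across note types
    return feats


example = build_nlp_features([
    ("DRECUR", "pos", "radiology"),
    ("DRECUR", "neg", "progress"),
    ("METSBONE", "pos", "radiology"),  # ignored: label-defining class
])
print(sorted(example.items()))
```

Keeping the exclusion inside the feature builder makes it harder to leak the label-defining classes into the classifier by accident.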
We trained logistic regression models with L2 regularization using the glmnet package in R. 26 Compared to regular logistic regression, L2 regularization smoothly shrinks regression coefficients according to a regularization parameter, lambda, while retaining all input features in the model. 27 Such regularization can help reduce prediction error in our case because many of our input features are likely to be correlated. In practice, we chose the largest value of lambda such that the error is within 1 standard error of the minimum mean cross-validation error (lambda.1se) by 10-fold cross-validation using the cv.glmnet function. The probability cutoff of the classifier was chosen to optimize the F1 score. Finally, we tested our classifiers on a physician-labeled set of 146 patients (72 cases and 74 controls) and measured model performance using sensitivity, specificity, positive predictive value (PPV or precision), negative predictive value (NPV), and overall accuracy. Bootstrap confidence intervals were calculated for each of the performance measurements by resampling the gold standard set of patients with replacement 1000 times and taking the 2.5th and 97.5th percentiles of the measurements calculated using the bootstrap samples. To benchmark the performance of our classifiers, we also implemented a simple rule-based algorithm that classifies as MBC those patients who have at least one instance of an ICD-9 code in the range 196.XX-199.XX among their structured EMR diagnoses. 12,28,29
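The cutoff selection and bootstrap steps can be illustrated independently of the model fitting. The sketch below is written in plain Python rather than R and does not reproduce the glmnet fit; the cutoff grid, the nearest-rank percentile rule, and all function names are assumptions chosen for brevity.

```python
import math
import random


def f1_score(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_cutoff(y_true, probs):
    """Scan cutoffs 0.01..0.99; return the first one with the highest F1."""
    grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda c: f1_score(y_true, [int(p >= c) for p in probs]))


def sensitivity(y_true, y_pred):
    """TP / (TP + FN); NaN when the sample contains no cases."""
    preds_for_cases = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_for_cases:
        return math.nan
    return sum(preds_for_cases) / len(preds_for_cases)


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, seed=0):
    """Percentile bootstrap: resample patients with replacement n_boot times
    and return the (2.5th, 97.5th) percentiles by a simple nearest-rank rule."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if not math.isnan(value):  # skip degenerate resamples
            stats.append(value)
    stats.sort()
    lo = stats[int(0.025 * (len(stats) - 1))]
    hi = stats[int(0.975 * (len(stats) - 1))]
    return lo, hi
```

Passing the metric as a function lets the same resampling loop produce intervals for specificity, PPV, NPV, or accuracy by swapping in a different metric.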

RESULTS
Among the 11 459 patients, follow-up time ranged from 6.3 to 202.8 months with a median of 96.3 months. The mean follow-up time was 97.8 months with a standard deviation of 46.7 months. A total of 1886 (16.5%) were classified as MBC patients and 9573 (83.5%) as not having any evidence of distant metastases in the data that were available to us at the time of this study. Of the 1886 MBC patients, 512 (27.1%) were de novo stage IV MBC patients, while 1374 (72.9%) were classified as recurrent MBC patients (1302 from the text processing step and 72 reviewed by physicians). This result is consistent with a recent report from the SEER registry using unrelated methods. 6 Table 2 summarizes socio-demographic, clinical and genetic features of patients grouped into MBC (stage 0-III at diagnosis), stage IV at diagnosis, and non-MBC. Using the test set of 146 manually annotated patients, our text processing step generated 15 false-positive and 8 false-negative labels, with an overall accuracy of 0.842 as shown in Table 3.
Furthermore, we trained three distantly supervised classification models for metastatic recurrences using these distantly labeled patients (1302 as MBC and 7590 as lacking sufficient evidence of MBC) and combinations of CCR and NLP-derived features. A summary of all CCR features used in our classifiers is listed in Table 4. Compared to classifier A (CCR features only), we observed a boost in all performance measurements by including NLP-derived features in classifiers B and C (Table 3). Classifiers B and C achieved very similar performance regardless of the presence of any CCR features (Table 3). For both classifiers B and C, the regularization parameter lambda was chosen to be 0.041. Using 10-fold cross validation within the training data, we obtained the highest F-1 score of 0.89 with a probability cutoff of 0.45. This cutoff was then applied when evaluating the classifiers on the 146 manually annotated records in the gold standard set. Classifiers B and C achieved areas under the receiver operating characteristic curve (AUC) of 0.917 and 0.925 (DeLong 95% confidence interval 0.868-0.966 and 0.880-0.969), respectively (Figure 2). 35 For both classifiers, there were 9 false-positives and 10 false-negatives, corresponding to sensitivity = 0.861, specificity = 0.878, and overall accuracy = 0.870 (Table 3). The NLP-derived features with the highest beta coefficients from classifier B are shown in Supplementary Table S3. As a benchmark, the ICD-9 code rule-based classifier achieved high sensitivity (0.93) but low specificity (0.47) and PPV (0.63).
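The reported sensitivity, specificity, and accuracy follow directly from the error counts and the composition of the gold standard set (72 cases and 74 controls, as stated in the methods). A short Python check, with a function name chosen here for illustration:

```python
def confusion_metrics(n_cases, n_controls, false_pos, false_neg):
    """Derive standard performance measures from error counts."""
    tp = n_cases - false_neg      # cases correctly flagged
    tn = n_controls - false_pos   # controls correctly cleared
    return {
        "sensitivity": tp / n_cases,
        "specificity": tn / n_controls,
        "ppv": tp / (tp + false_pos),
        "npv": tn / (tn + false_neg),
        "accuracy": (tp + tn) / (n_cases + n_controls),
    }


# Classifiers B and C: 9 false-positives and 10 false-negatives.
m = confusion_metrics(n_cases=72, n_controls=74, false_pos=9, false_neg=10)
print(round(m["sensitivity"], 3))  # 0.861
print(round(m["specificity"], 3))  # 0.878
print(round(m["accuracy"], 3))     # 0.87
```

The same function applied to the text processing step alone (15 false-positives, 8 false-negatives) gives an accuracy of 123/146, which rounds to the reported 0.842.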

DISCUSSION
The lack of high-quality longitudinal databases that can be used to study metastatic recurrence is the biggest obstacle to practice-based evidence on how patients die from breast cancer. To address this problem, we developed a novel scalable framework that enables retrospective MBC case detection with good performance (sensitivity = 0.861 and specificity = 0.878). To our knowledge, there has been no one threshold above which the algorithm performance is sufficient for all types of research use. Our framework is flexible in that future researchers could adapt the probability threshold of our algorithm, depending on their needs for sensitivity, specificity, or other performance measures. [11][12][13][14] The contribution of this work is 3-fold. First, we retrieved information from the unstructured text of clinical notes by developing a custom NLP extraction tool for metastatic recurrences and demonstrated the benefit of using unstructured EMR data. 14 Second, we applied a semisupervised machine learning technique, distant supervision, to the problem of metastatic recurrence classification. In doing so, we avoided the salient bottleneck presented by human annotation, which is time and cost-prohibitive for many institutions and researchers. Last, we leveraged complementary data sources, EMR and CCR, to develop a framework for the detection of MBC that enables population-based studies of patients with metastatic cancer. Given that the two classifiers that included NLP-derived features performed very similarly regardless of the presence of CCR features, we concluded that unstructured data alone was sufficiently informative for the purpose of identifying the recurrent MBC cohort in this study. Note that our primary goal was to solve a binary classification problem (whether a breast cancer patient had experienced metastasis or not) with the best possible performance in order to build a MBC cohort with a diverse set of relevant patient-level variables to be used in subsequent clinical studies. This is in contrast to a classic epidemiologic study, in which researchers quantify the association of common risk factors or discover new ones. Thus, in the classifiers in which CCR features were included, we included all variables from CCR in our penalized logistic regression models that could possibly lead to a better classification model, including known risk factors. Nevertheless, we acknowledge that including known risk factors as predictors in our classifier might cause bias in our cohort construction.
Table notes: Triple negative: estrogen receptor, progesterone receptor and HER2 all negative. HER2 positive: HER2 positive, regardless of estrogen receptor or progesterone receptor status. Note that positive predictive value (PPV), negative predictive value (NPV), F-1 score, and overall accuracy are highly dependent on the prevalence of the condition, which in our case is 72/146 = 0.49. The actual prevalence of recurrent metastatic breast cancer in our study population is likely to be much lower. However, sensitivity, specificity, and area under the curve (AUC) are intrinsic properties of classifier and are insensitive to prevalence of cases. 32,33 30,31
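The table note's point about prevalence can be made concrete with Bayes' rule: holding sensitivity and specificity fixed, PPV falls as prevalence falls. The Python sketch below uses the gold standard set's counts for sensitivity and specificity; the lower prevalence of 0.10 is a purely hypothetical value chosen for illustration, not an estimate from the study.

```python
def ppv(sens, spec, prev):
    """Positive predictive value from sensitivity, specificity, prevalence."""
    true_pos = sens * prev
    false_pos = (1 - spec) * (1 - prev)
    return true_pos / (true_pos + false_pos)


def npv(sens, spec, prev):
    """Negative predictive value from sensitivity, specificity, prevalence."""
    true_neg = spec * (1 - prev)
    false_neg = (1 - sens) * prev
    return true_neg / (true_neg + false_neg)


SENS, SPEC = 62 / 72, 65 / 74  # from 10 false-negatives, 9 false-positives

balanced = ppv(SENS, SPEC, prev=72 / 146)  # prevalence in the evaluation set
rarer = ppv(SENS, SPEC, prev=0.10)         # hypothetical lower prevalence
print(round(balanced, 3), round(rarer, 3))
```

At the evaluation set's prevalence, the formula reduces to 62/71, the PPV implied by the confusion counts; at the hypothetical lower prevalence, the same classifier yields a noticeably lower PPV.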
Although additional CCR features only offered marginal gains in the classification performance, we emphasize the importance of taking advantage of both EMR and CCR data resources for the purpose of downstream epidemiological studies. While EMR contains more comprehensive information about patients' treatment and progression of disease over a longer period of time compared to cancer registry data, the EMR does not reliably collect tumor characteristics. 36,37 To be more specific, the main coding system in EMR, ICD, does not specify stage at diagnosis, and metastatic codes do not specify the occurrence of distant recurrence after initial breast cancer diagnosis. Cancer registries such as the CCR are the best data sources to obtain accurate tumor characteristics that are absent in EMR, including breast cancer stage at diagnosis. This information is essential in characterizing any cancer patient cohort and in describing patient outcomes. In addition, the highly accurate and complete sociodemographic information collected by the cancer registry, which is known to be associated with disparities in cancer risk and survivorship, will facilitate downstream outcomes research. 38 We understand that it can be challenging to link the two datasets due to data privacy issues, but it is possible and Oncoshare is an example that can be replicated at other institutions. We believe such linkage will boost outcomes research in this patient cohort as well as in a broader cancer patient population. 20,39,40 Our work suggests that an important next step is to develop tools for temporal information extraction. Due to the relatively short time between metastatic recurrence and death, NLP approaches must perform at high accuracy to support meaningful survival analysis.
Although we initially planned to estimate onset time for metastatic recurrences, we found that simple methods (eg, for a given patient, using the earliest timestamp of all notes with any positive-affirmative MBC mentions) were not sufficient. Analyses of notes from 10 patients found that the most common errors of this naïve approach were attributable to phrases such as "patient was diagnosed with metastatic breast cancer [number] months ago at [another medical institute]." Possible future directions for automating recurrent MBC case detection could be to acquire linguistic annotations of English clinical text or other data for training a temporal metastatic recurrence classification model.
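The naïve onset estimate mentioned above amounts to taking a minimum over note dates. A minimal Python sketch, assuming a hypothetical per-patient list of (note date, flag) pairs:

```python
from datetime import date


def naive_onset_date(notes):
    """Earliest date among notes with a positive-affirmative MBC mention.

    notes: iterable of (note_date, has_positive_mbc_mention) pairs
    (hypothetical layout). Returns None if no note qualifies.
    """
    dates = [d for d, positive in notes if positive]
    return min(dates) if dates else None


notes = [
    (date(2011, 1, 5), False),
    (date(2012, 3, 1), True),  # may describe a diagnosis made months earlier
    (date(2013, 1, 1), True),
]
print(naive_onset_date(notes))  # 2012-03-01
```

The weakness described in the text is visible here: the function returns the date a note was written, so a note reporting a diagnosis made months earlier at another institution yields an onset estimate that is too late.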
There are several limitations of our study. First and foremost, although we value the external validity of our method, it was prohibitively difficult to conduct any validation studies at the time of this study, due to the lack of similarly linked databases from other institutions as well as restrictions in data sharing. Second, our patients are limited to those treated at SHC, an academic medical center, and thus they do not fully represent the broader community of breast cancer patients from all geographical and socioeconomic backgrounds. Third, even our linked dataset contains incomplete data due to patients receiving care outside of SHC. This is mitigated to some extent by state-wide capture of treatment summaries by CCR, but does not capture events outside of the state. Fourth, our work has primarily focused on NLP-derived features from unstructured free-text data in the EMR and structured data from the CCR. The integration of structured data from the EMR, such as diagnoses, drugs, and procedures that patients received as part of their treatment and continued survivorship care, may also improve classification, especially when there is ambiguity in describing metastasis in the notes or for patients without any clinical notes. 14 Fifth, we used a relatively simple machine learning classifier: a penalized logistic regression. Use of decision tree analysis and more nuanced machine learning methods may improve classification performance. Sixth, due to practical constraints in selecting our gold standard set, there could be potential bias in evaluating the performance of various algorithms. Last, before we are able to ascertain the date of metastatic recurrence, the performance of our classifier is subject to change depending on the length of follow-up.

CONCLUSION
In conclusion, we developed an open-source MBC case detection framework using linked EMR-CCR data, within which we used NLP techniques to accurately label breast cancer patients as recurrent metastatic or not. As more linked datasets are developed (eg, the American Society of Clinical Oncology's CancerLinQ initiative 2 ), tools such as ours can readily be adapted for them. This approach has tremendous potential to identify cohorts of recurrent metastatic cancer patients and offer insights into the characteristics, care received, and outcomes of this important and understudied patient population.