Introduction

Early recurrence, herein defined as the return of a primary tumor within three years of diagnosis, is an important endpoint in clinical management of breast cancer1. Recurrences can often be successfully managed, but they are stressful, costly, and increase risk of mortality if not detected early2,3,4. Clinical risk stratification is currently based on several clinical characteristics, including hormone receptors, HER2 status, grade, stage, and age, and RNA-based methods are available to identify tumors with high risk of recurrence5,6,7,8. Clinical gene expression assays are not uniformly performed on all patients, and are often limited to specific subgroups of patients with low-stage and ER-positive disease. Genomic assays are also expensive9, so histopathology-based stratification is appealing. Currently, only combined histologic grade—a metric that classifies breast tumors according to tubule formation, nuclear pleomorphism, and mitotic frequency–is routinely collected from H&E images in the clinic. Grade evaluation is performed manually and is subject to interobserver variability. An objective, image-based method could be valuable for prescreening patients at higher risk of recurrence.

Recent work in computer vision has extensively explored using deep convolutional neural networks (CNNs) to extract global contextual information from a variety of image types. Early applications of CNNs primarily were focused on natural images (e.g., cars or birds)10, but more recently, methods have been extended to medical images, including radiographic and histopathologic images11,12,13. Machine learning methods in image classification have been shown to predict or diagnose invasive breast cancer incidence, using both histopathological and radiographic images14,15,16,17,18. However, few studies have evaluated breast cancer outcomes based on images, and most that have have been limited in sample size, range of tumor phenotypes, or patient diversity19,20,21. For example, there are several data sets (e.g., the Camelyon challenge in the Netherlands and IBM-curated BRIGHT) that have encouraged researchers to investigate benign and neoplastic breast tissue using machine learning methods; however, these are largely focused on diagnostic capacity rather than prognostic or predictive modeling. Campanella and colleagues used WSI from multiple cancers to develop a predictive model of invasive disease, but again, this work was focused on diagnostic rather than prognostic applications16. Many other previous studies have emphasized a priori hypotheses, such as associations with spatial arrangement of immune cells22 or emphasized overall or breast cancer-specific survival rather than recurrence. Furthermore, many data sets—even for diagnostics—do not include diverse populations of women with breast cancer. In the US, Black women have significantly higher recurrence rates and breast cancer mortality, but often have lower representation in clinical and observational research. We used data from a source that represented both Black and non-Black women in similar proportions, allowing us to investigate breast cancer recurrence in a diverse setting.

We sought to investigate whether we could use image information extracted with a CNN (VGG1623, a CNN pre-trained on ImageNet) together with support vector machines (SVM)24 to create image-based classes that were predictive of recurrence among breast cancer patients. We assessed reproducibility and inter- and intra-individual variance by comparing validation accuracy across and within patient specimens and compared results to existing, established biomarkers. The Carolina Breast Cancer Study (CBCS3) is a well-annotated image dataset for a diverse group of women (50% Black, 50% under age 50) who were followed for medical record-confirmed recurrence.

Results

Study population

Table 1 shows our study population and indicates that the matched dataset (n = 101) had a similar distribution relative to the full population represented on the tissue microarrays (TMAs, n = 1543). However, the subset of CBCS cases included on the TMA tended to include more large size tumors and higher-grade tumors relative to the entirety of CBCS. Relative to participants who experienced an early recurrence, age-matched participants without early recurrence were significantly more likely to be early stage (stage 1 52.5% vs 10.9% in early recurrences), grade 1 or 2 (58.4% vs 26.7% in early recurrences), and ER-positive (76.2% vs 43.6% in early recurrences).

Table 1 Demographic and clinical tumor characteristics for the full study population and the matched training sample, stratified by 3-year recurrence status.

Model prediction accuracy

First, we assessed the accuracy for detecting recurrence within three years of diagnosis in our balanced data set, displayed in Table 2. In cross-patient 10-fold cross validation we observed 62.4% accuracy and 63.4% sensitivity. However, using within-patient validation, accuracy was 70.3% (67.7% sensitivity). In both approaches, the sensitivity and specificity were well-balanced, with within-patients (72.9%, 95% CI: 64.2, 81.6) specificity slightly higher than sensitivity (67.7%, 95% CI: 58.6, 76.8) and cross-patients sensitivity (63.4%, 95% CI: 54.0, 72.8) slightly higher than specificity (61.4%, 95% CI: 51.9, 70.9). To contextualize these accuracy estimates, we also evaluated the accuracy of standard clinical markers. Using grade and ER status as predictors of recurrence resulted in accuracies of 65.8% and 66.3%, respectively, but grade had higher sensitivity (73.3%, 95% CI: 64.7, 81.9) while ER status had higher specificity (76.2%, 95% CI: 67.9, 84.5). In pre-screening tumors for genomic testing, sensitivity to detect aggressive tumors is higher priority.

Table 2 Recurrence prediction accuracy comparison between image-based classes and other tumor characteristics (ER, grade).

To investigate recurrence prediction accuracy among clinically low or high-risk tumors, we further stratified our accuracy assessment by grade (low/intermediate vs high) (Table 2). Accuracy was higher in the low/intermediate grade group compared to high grade for both the within-patients approach (77.1% vs 65.2% in low vs. high grade) and the cross-patients validation approach (61.6% vs. 53.4% in high-grade). The sensitivity was lower among low/intermediate-grade tumors, while specificity was lower for high-grade tumors. However, sensitivity of both image-based approaches in the low/intermediate group exceeded that for ER status (70.4% for within-patients, 48.1% for cross-patients vs. 22.2% for the ER status). A total of 19 low/intermediate group patients (70% of patients with low/intermediate grade tumors who recurred within 3 years) were detected by image analysis that would have been missed via grade alone.

Time-to-event analysis

To also consider time-to-event (and not just binarized early recurrence vs. not), we evaluated both the within- and cross-patients predictors in time-to-recurrence based on Kaplan–Meier analysis (Fig. 1). Image-based classes from the within-patients approach (HR 2.70; 95% CI: 1.78, 4.11) had a slightly stronger hazard of recurrence than the cross-validation-derived classes (HR 1.73; 95% CI: 1.16, 2.57), but both were significantly associated with time to recurrence.

Fig. 1: Kaplan-Meier plots for the cumulative incidence of recurrence.
figure 1

These plots were generated using (a) the cross-patients validation method and (b) the within-patients validation method. Cox proportional HR (95% CI) for cross-patients method: 1.73 (1.16, 2.57); HR (95% CI) for within-patients method: 2.70 (1.78, 4.11).

Comparison with genomic assays

When comparing image-based classes to existing molecular metrics that represent risk of recurrence, specifically research-based versions of PAM50-derived ROR-PT and OncotypeDX scores, we observed that the image-based “high-risk” class was substantially enriched for individuals whose tumors had also been classified as high-risk by either ROR-PT or OncotypeDX (Table 3). The cross-patients approach resulted in an image-based high-risk class with the highest proportion of molecularly high-risk individuals [OncotypeDX RFD (95% CI): 15.0% (8.6, 21.3), ROR-PT RFD (95% CI): 21.5% (14.2, 28.5)], though the high-risk class resulting from the within-patients approach was still substantially associated with high-risk molecular features [OncotypeDX RFD (95% CI): 11.7% (5.3, 18.1), ROR-PT RFD (95% CI): 17.4% (10.0, 24.5)].

Table 3 Relative frequency differences (RFDs) for RNA-based risk of recurrence classifiers.

Discussion

We applied convolutional neural networks to detect early recurrence in the diverse, clinically well-annotated Carolina Breast Cancer Study. We found that image-based features predict survivorship with accuracy, sensitivity, and specificity that are comparable to those for standard clinical markers such as estrogen receptor status and grade. The performance characteristics of image-based classifiers differed within strata defined by grade, suggesting that future optimization should consider training separate within strata of grade. However, these image-based classifiers predicted recurrence with significant hazard ratios, and showed associations with risk-based genomic signatures. Since genomic signatures were not used in training, the association with genomic data suggests promise for rapid, low-cost pre-screening of tumors that may need further genomic testing. The importance of image-based pre-screening may also be of increasing importance as the proportion of neoadjuvant-treated breast cancer cases increases25, because in neoadjuvant cases tissue for diagnostic purposes is limited to biopsy materials. It may also be advantageous that the image-based methods use the same data collected for diagnosis and do not require any additional laboratory steps.

Our analysis is unique in that we trained on recurrence rather than other genomic or clinical data. Previous machine learning studies have predicted breast cancer recurrence and survival, most commonly using clinical and demographic data as inputs to ML algorithms26,27,28,29. Lou et al. compared an array of computational methods on a registry data consisting of 1140 patients, using collected medical records as inputs to the machine learned classifier26. Our approach used a much smaller image dataset for training but does not assume that the clinical data are mediators of the recurrence outcomes, allowing us to discover features that may not be captured in other clinical data. The promising results obtained with a small sample size suggest that future, larger studies with more images may improve accuracy further.

Our recurrence-trained classifier also recapitulated genomic risk subtypes, with high image-based risk groups being more likely to have high genomic ROR-PT and OncotypeDX scores. Other researchers have predicted genomic scores from images. Whitney et al. used 178 breast tumor H&E images to predict RNA-based OncotypeDX scores13. Accuracy in that study was 74% for low-intermediate vs high OncotypeDX score. We did not compare to Oncotype DX, which limits our ability to directly compare our accuracy for a PAM50-based risk of recurrence score, but our results suggest that training on recurrence rather than recurrence score is a viable strategy. Also, within the CBCS, our group demonstrated that ML methods could be utilized to predict tumor features such as ER status, grade, and subtype using images and data from CBCS312. Accuracy in that analysis for prediction of high vs low-medium ROR-PT score was 75%, which is slightly higher than our results for recurrence. However, the training set for that analysis was much larger than the recurrence vs. non-recurrence dataset used here and the outcome was more common. Thus, future analyses with larger datasets should evaluate the optimal method for identifying high risk specimens.

Application of breast tumor tissue core images to predict the binary recurrence outcomes is a difficult problem because of a few key challenges. First, the input data from CNN (512 times the number of samples) is much more high-dimensional than researcher-selected features like grade, stage, and other clinical characteristics. Second, early recurrence rates in breast cancer are fortunately relatively low, but this results in recurrence data from breast cancer cohorts being highly imbalanced. Using a matching scheme to match each recurrent case with a non-recurrent participant allowed us to overcome some of the challenges of using machine learning techniques while working with such a strongly imbalanced data set.

Higher sensitivity when training within-patients suggests some similarity of the images in training was being leveraged in testing and raises the intriguing hypothesis that repeated samples of images from patients have some individuality or ‘identifiability’. Whether this identifiability is clinically meaningful merits further investigation. On the molecular level, Perou et al. suggested that tumors are individuals and that this individuality may be targetable for precision medicine30; if tumors are similarly individual on a histological level, perhaps machine-learning techniques to evaluate histologic distinctions across a tumor could be used to identify subgroups or to study tumor evolution between biopsies and excisions/mastectomies. In any case, establishing the reproducibility of classification across samples from a given tumor is an important future direction if histologic biomarkers are to be used for risk prediction.

In summary, our proposed image-based approaches achieve competitive prediction accuracies on the order of established biological and clinical markers (i.e., grade and ER status), with balanced sensitivity and specificity. Among patients in the low/intermediate grade subgroup, both approaches were more sensitive than ER status. This analysis underscores the promise of training histopathologic predictors directly on recurrence rather than clinical surrogates, and emphasizes the need for larger, collaborative analysis of breast cancer outcome where sufficient event sizes and inter-study comparisons can be made. The benefits of a histologic approach for risk stratification could be significant, particularly for low-grade patients where the current markers (grade and ER) are not sensitively capturing risk of recurrence.

Methods

Study population

CBCS3 is a prospective, population-based cohort of 2998 women with incident invasive breast cancer recruited from 44 counties in North Carolina between 2008 and 2013. First, primary breast cancer cases were identified using rapid case ascertainment in collaboration with the North Carolina Central Cancer Registry. Eligible women were between 20 and 74 years old. Black and young (<50 years old) women were oversampled to each represent 50% of the population. The study was approved by the University of North Carolina Institutional Review Board in accordance with U.S. Common Rule. All study participants provided written informed consent prior to study entry. This study complied with all relevant ethical regulations, including the Declaration of Helsinki.

Breast cancer recurrence was ascertained by patient self-report at annual telephone follow-ups and then confirmed by medical record. Formalin-fixed, paraffin-embedded (FFPE) tumor blocks were obtained from participating medical centers for participants with available tissue. Tumor blocks were obtained for 1743 of the women enrolled in the study and were reviewed by the study pathologist (JG). From tumor-enriched regions selected by the pathologist, between one and four 1-mm tumor cores were sampled and embedded in tissue microarrays (TMAs) at the Translational Pathology Laboratory at UNC-Chapel Hill. TMA slides were sectioned (5-μm thickness) and top and bottom sections were stained with hematoxylin and eosin (H&E) and scanned at 20x magnification. In the balanced dataset, the 202 participants corresponded with 704 H&E core images of approximately 3000 × 3000 pixels and each participant had between two and four 1-mm tumor cores.

Tumor grade was determined centrally by the study pathologist, except where whole slide images were unavailable for secondary review and the originally reported (“clinical”) tumor grade was used (n = 7) or where both slides and clinical grade were missing (n = 39). From 1743 women with TMA images, participants were excluded if missing grade or if tissue was insufficient for VGG16 CNN (i.e., core damaged or section folded, tumor was depleted, n = 30). Because this study was aimed at evaluating recurrence, women with metastatic disease at diagnosis were excluded (n = 40), resulting in a final eligible population of 1644 women, corresponding to a total of 5969 core images. Approximately 7% of the study population experienced an early recurrence (n = 101), defined as recurrence within three years of diagnosis. To construct a balanced dataset of recurrent and non-recurrent cases, we matched recurrent cases to non-recurrent participants 1:1 on age (defined in 5-year bins).

Gene expression assays

Where available, additional FFPE specimens from CBCS3 participants were obtained for RNA extraction. RNA was isolated using RNeasy FFPE Kits (Qiagen) and Nanostring gene expression assays were performed at UNC Chapel Hill in the Translational Genomics Lab. Gene expression data were cleaned and normalized as described previously31. Of the 1543 women included in the study, 1145 women had data on genes required for the PAM50 predictor, a research-only version of the Prosigna clinical assay. These genes were used to calculate a PAM50 risk-of-recurrence (ROR) score. For this study, the ROR-PT score was used, which additionally incorporates information on the PAM50 subtype, proliferation score (P), and tumor size (T)7. These scores were then categorized into low-intermediate and high risk. Data on the 21 genes included in the OncotypeDX score assay were available for 937 of the 1543 women on study. These genes were used to approximate OncotypeDX scores for these patients, which were then categorized into low-intermediate (<26) and high (26+) risk.

Model specification and validation

Image preprocessing and feature extraction

The appearance of core images was standardized in each color channel to have mean equal to zero and standard deviation equal to one. We used a CNN32 to extract feature representations of core images (n = 704 cores from 202 participants). The CNN first applies convolutional filters followed by pooling operations. Lower-level layers learn generic image features such as edges and shapes, intermediate level layers capture increasingly complex properties like shape and texture, and higher-level layers learn global concepts that describe the semantic meaning of the images32,33. The parameters of such a network are the weights of the convolutional filters and are learned from the data in an adaptive manner, creating a hierarchically set of features of increasing abstraction.

We used the VGG1623 network that was pre-trained on the ImageNet dataset, without alteration. We explored using other networks (e.g., Resnet), but VGG16 resulted in better cross-validation and test set accuracy. ImageNet34 contains 1.2 million images, all of which belong to one of 1000 ImageNet object categories35, and although the extant categories are different from histology images, the pre-trained weights transfer well to feature extraction from tissue sections. To use the pre-trained VGG16 network to extract features of the core images, we feed the core images as input into the VGG16 network. Each layer transforms the features obtained from the previous layer based on its parameters and learns concepts like shape and texture that are complex enough for generalization but not too specific to images in ImageNet as to be inapplicable to histology images. In principle, one can use the features at any layer as the feature representation of the input image. Similar to the approach used by Couture12, we evaluated performance of the features extracted from different layers of the pre-trained VGG16 network. We ran a grid search over the feature extraction layers on 90% of the data (training set) and evaluated the performance on the remaining 10% (validation set). The 7th layer had highest validation accuracy and was therefore selected for model development. A total of 704 cores across 202 participants were used to extract a 512 × 64 × 64-dimensional matrix of image features. Spatial mean pooling was used to calculate to produce a feature vector of length 512 for each tumor core for use in prediction analysis.

We first attempted to use fully connected layers, but this resulted in overfitting. Therefore, we used a support vector machine (SVM) to classify the patient-level features that predict binary recurrence. An SVM24 is a classification algorithm that finds a linear decision boundary to separate the two classes (here, recurrent vs non-recurrent). The foundational idea of SVM is that if one interprets the margin between a data point and the decision boundary as the difficulty of classifying that data point (i.e., the smaller the margin is, the more difficult it is to classify that data point), we seek an optimal decision boundary that maximizes the margin, thereby minimizing the classification difficulty. After locating the decision boundary, an SVM predicts the class assignments of the data points based on which side of the decision boundary those points are on. We used those predicted class assignments for further analysis of the recurrence prediction.

We also explored training a neural classifier for predicting early recurrence in an end-to-end fashion; however, due to the small sample size of our balanced data set, the trained deep network overfit the training data and poorly distinguished recurrences in the test set.

Validation datasets

Ideally, data is split into training, validation, and test sets to evaluate the accuracy of ML algorithms; however, the balanced data set (n = 202) was small, and therefore, we assessed performance via two methods. First, our goal was to have a single feature vector to represent each patient. To this end, we performed a cross-patient validation (Fig. 2), in which we averaged the feature vectors of multiple core images per patient to create a single patient-specific feature vector. We then performed ten-fold cross-validation to train and evaluate the model for predicting recurrence, with one-tenth held back for testing in each iteration. We confirmed that as we added additional folds for k-fold validation (i.e., as we increased k), we increased our accuracy from 56% at 2-fold to 62% at 10-fold. Second, we performed a within-patient validation (Fig. 2). In this method, we took advantage of multiple tumor cores for each patient and trained the model using the feature vectors on half of each patients’ cores, testing the model on the second half. This latter method assumes that the core images belonging to each patient are independent, which is unlikely given that the correlation between image features within a patient (average cosine similarity within individual patients’ image features = 0.91) was higher than correlations between patients (average cosine similarity between patients’ image features = 0.84). However, we were interested in estimating how ‘individuality’ of a given tumor contributes to predictive accuracy, and thus, we considered within-patient methods as an optimistic estimation of model performance and compared the results to those obtained by cross-validation.

Fig. 2: Workflow for cross-patients validation (left) and within-patients validation (right) approaches.
figure 2

For the cross-patients validation, image features from all of each patient’s cores were extracted using VGG-16 and averaged as a feature representation for the patient. An SVM was trained on the features of the cohort and tested using 10-fold cross-validation. For the within-patients validation, image features from half of each patient’s cores were extracted using VGG-16 and averaged as the feature representation for each patient. An SVM was trained on the features of this first half of the images. Image features from the second half of the patient’s cores were then averaged and used to test this classification.

Statistical analysis

The predictive value of image-based classes was assessed using sensitivity, specificity, and overall accuracy of prediction. 95% confidence intervals were produced for these measures using the normal approximation of a binomial proportion. Following development of a binary classification scheme, we performed time-to-event analyses to assess relationships between the SVM-derived image classes and recurrence within the full cohort. Cox proportional hazards models were used to estimate hazard ratios and 95% confidence intervals. Relationships between image-based classes and recurrence were visualized with Kaplan–Meier curves. Generalized linear models with identity link and binomial family were used to estimate relative frequency differences (RFDs) and 95% confidence intervals to describe associations between image classes and molecular risk scores (OncotypeDX and ROR-PT).