Effect of MR Imaging Contrast Thresholds on Prediction of Neoadjuvant Chemotherapy Response in Breast Cancer Subtypes: A Subgroup Analysis of the ACRIN 6657/I-SPY 1 TRIAL

Functional tumor volume (FTV) measurements by dynamic contrast-enhanced magnetic resonance imaging can predict treatment outcomes for women receiving neoadjuvant chemotherapy for breast cancer. Here, we explore whether the contrast thresholds used to define FTV could be adjusted by breast cancer subtype to improve predictive performance. Absolute FTV and percent change in FTV (ΔFTV) at sequential time-points during treatment were calculated and investigated as predictors of pathologic complete response at surgery. Early percent enhancement threshold (PEt) and signal enhancement ratio threshold (SERt) were varied. The predictive performance of resulting FTV predictors was evaluated using the area under the receiver operating characteristic curve (AUC). A total of 116 patients were studied both as a full cohort and in the following groups defined by hormone receptor (HR) and HER2 receptor subtype: 45 HR+/HER2−, 39 HER2+, and 30 triple-negative. High AUCs were found at different ranges of PEt and SERt levels in different subtypes. Findings from this study suggest that the performance of MRI in predicting treatment response varies with contrast thresholds, and that pathologic complete response prediction may be improved through subtype-specific contrast enhancement thresholds. A validation study is underway with a larger patient population.


INTRODUCTION
Breast cancer, the most common type of cancer among women, is a heterogeneous disease comprising subtypes with different biology, prognosis, and treatment outcome. Breast cancer can be classified into subtypes based on the hormone receptor (HR) status, including both estrogen and progesterone receptors, and human epidermal growth factor receptor 2 (HER2) expression to inform treatment decisions (1,2). These breast cancer subtype classifications also have implications for disease-free survival and relapse (3). Further understanding of subtype-specific response and effective monitoring by imaging may provide means for early therapeutic intervention, leading to better outcomes (4).
Magnetic resonance imaging (MRI) is one of the most accurate imaging tools used to monitor and predict treatment response for patients undergoing chemotherapy (5-14). However, the predictive performance varies between different quantitative measurements derived from MRI, and with the parameters that define those measurements. Previous studies have found that the tumor volume measured using MRI for patients undergoing preoperative chemotherapy has a strong association with recurrence-free survival (13,15,16), and that the association is influenced by the threshold settings of 2 contrast enhancement parameters (17). Another recent study has demonstrated that the influence varied in HR/HER2-defined breast cancer subtypes (18).
A standardized MRI-derived volume calculation procedure was used in the I-SPY 1 TRIAL (Investigation of Serial Studies to Predict Your Therapeutic Response with Imaging And moLecular Analysis) imaging sub-study: American College of Radiology Imaging Network (ACRIN) 6657. This procedure used empirically determined, site-specific analysis parameters, specifically an early time-point percent enhancement threshold (PEt) and a signal enhancement ratio threshold (SERt) for calculation of a functional tumor volume (FTV) for patients undergoing neoadjuvant (preoperative) chemotherapy (NACT) for breast cancer. FTV was shown to be predictive of both treatment response, as measured by pathological complete response (pCR) (15), and of recurrence-free survival (16) in the study population.
In the current study, we explored how the pCR prediction performance of FTV varies over a wide range of PEt and SERt, for different serial time-point MRI scans during the NACT course, and for different patient cohorts determined by HR and HER2 status. We show that the performance of MRI in predicting treatment response varies with contrast thresholds, and that the pCR prediction may be improved through subtype-specific contrast enhancement thresholds.

METHODOLOGY

Patient Population
In total, 237 women with breast tumors sized ≥3 cm evaluated by either clinical examination or imaging were enrolled between 2002 and 2006 at 9 institutions in the USA. All patients provided written consent. As shown in Figure 1, 4 MRI examinations were conducted for each patient at the following time-points: before starting anthracycline-cyclophosphamide (AC) chemotherapy (MRI1); at least 2 weeks after the first cycle and before the second AC cycle (MRI2); between regimens if taxane was given (MRI3); and following the completion of chemotherapy but before surgery (MRI4). A subset of 116 patients who had image data from all 4 MRIs, pathological outcomes, and HR/HER2 status was analyzed for this retrospective study. The detailed design and findings of I-SPY 1 TRIAL/ACRIN 6657 have been published previously (15,16,19,20).

Determination of Breast Cancer Subtype
HR status and HER2 receptor expression were determined by pretreatment core biopsy, using immunohistochemistry (IHC) and Allred score at study sites. The HER2 status was determined by IHC and/or fluorescence in situ hybridization assays. Unlike HRs, HER2 testing (IHC and fluorescence in situ hybridization assays) was performed locally at study sites and centrally at the University of North Carolina (19). Estrogen or progesterone receptor status was considered positive if Allred score was ≥3, that is, ≥3% cells stained positive. HER2 status was considered positive if it tested positive at either a local or a central laboratory. The following 3 subtype groups were defined: HR+/HER2−; HER2+ (HR either positive or negative); and triple-negative breast cancer (TNBC, ie, HR−/HER2−) tumors.

Evaluation of Pathological Response
The pCR was considered as the surrogate end point of NACT and was defined as the absence of residual invasive disease in the breast and axillary lymph nodes at surgery (19). By this definition, patients were classified into 2 groups at the end of NACT as follows: pCR and non-pCR (residual invasive cancer). In I-SPY 1/ACRIN 6657, pCR was evaluated locally by each institution's pathologist immediately after surgery. In the event of a patient declining surgery, there was no pCR status for that patient.

Image Acquisition
Each patient had 4 MRIs (Figure 1) at their participating site using a 1.5 T scanner and dedicated 4- or 8-channel breast radiofrequency coil. Imaging was performed with the patient in the prone position with an intravenous catheter inserted in the antecubital vein or hand. The image acquisition protocol was prespecified, and it included a localization scan and T2-weighted sequences, followed by a contrast-enhanced T1-weighted series. For the contrast-enhanced T1-weighted series, high spatial resolution (in-plane spatial resolution, ≤1 mm), 3-dimensional fat-suppressed T1-weighted imaging of the symptomatic breast was performed using a gradient-echo sequence with the following parameters: repetition time = 4.5 milliseconds, flip angle ≤45°, field of view = 16-18 cm, minimum matrix = 256 × 192, sections = 64, and section thickness ≤2.5 mm.
All imaging tests were performed unilaterally over the symptomatic breast and in the sagittal orientation. Imaging time for the T1-weighted sequence was between 4.5 and 5 minutes, with one data set acquired before injection of a gadolinium-based contrast agent and repeated 2-4 times immediately after injection. Interimaging delays were added as needed to result in postcontrast administration temporal sampling between 2 minutes 15 seconds and 2 minutes 30 seconds for early-phase images and between 7 minutes 15 seconds and 7 minutes 45 seconds for delayed-phase images.

Functional Tumor Volume Measurement
Following each MRI examination, image data were transferred to the ACRIN Core Lab for central archival and subsequently to the University of California at San Francisco for image analysis. All images were analyzed using in-house software developed in the IDL programming environment (ITT Visual Information Solutions, Boulder, Colorado) (21). For each dynamic contrast-enhanced (DCE-) MRI acquisition, a region of interest (ROI) encompassing the primary tumor as determined by signal enhancement was manually defined by a trained research associate by placing rectangular boxes on orthogonal maximum intensity projection images created from the early postcontrast scan (Figure 2A-C). Background air regions and suppressed fat regions were masked out using an automatically determined intensity threshold applied to the precontrast image.
The FTV was then measured using the signal enhancement ratio method within the ROI (22). The volumes of image voxels within the ROI that met PEt and SERt were summed to compute FTV, constrained by a minimum number of connected voxels to eliminate isolated voxels. PE and SER were calculated at each voxel as follows: PE = 100% × (S1 − S0)/S0 and SER = (S1 − S0)/(S2 − S0), where S0, S1, and S2 were signal intensities at precontrast, early postcontrast, and late postcontrast, respectively, collected during the DCE-MRI scan (23). A cutoff PEt was first applied followed by a connectivity test to create an enhanced tissue mask. SER was then calculated for all voxels in the mask (Figure 2D), and SERt was applied to determine which voxels to include in the FTV. In ACRIN 6657, PEt was nominally set at 70% and adjusted empirically for each site to qualitatively reflect the extent of tumor and to account for unexpected variability in MRI systems and imaging parameters. SERt was set to zero across all participant sites in the primary aim analysis of the trial. All magnetic resonance images from a given site were processed using the same site-specific PEt. To study the effect of PEt/SERt setting, we recalculated FTV by varying these 2 thresholds. PEt was changed from 30% to 200% in steps of 10% and SERt from 0 to 2 in steps of 0.2. FTV was recalculated at each MR examination as follows: baseline (FTV1), early treatment (FTV2), inter-regimen (FTV3), and before surgery (FTV4). Percent change of FTV was defined as the change in FTV relative to the baseline FTV1 value (ΔFTVn = 100% × (FTVn − FTV1)/FTV1, n = 2, 3, 4).
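The thresholding steps described above can be illustrated with a minimal sketch. It is not the in-house IDL software used in the trial. The dictionary-based ROI, the function names, the use of "greater than or equal to" for both thresholds, the 6-neighbor connectivity rule, and the min_connected value are all assumptions made for illustration, since the source does not specify them.

```python
from collections import deque

def percent_enhancement(s0, s1):
    # PE = 100% x (S1 - S0) / S0
    return 100.0 * (s1 - s0) / s0

def signal_enhancement_ratio(s0, s1, s2):
    # SER = (S1 - S0) / (S2 - S0)
    return (s1 - s0) / (s2 - s0)

def connected_components(voxels):
    """Group (x, y, z) coordinates into 6-connected components (assumed rule)."""
    remaining = set(voxels)
    components = []
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    while remaining:
        seed = remaining.pop()
        component, queue = [seed], deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in steps:
                neighbor = (x + dx, y + dy, z + dz)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.append(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

def functional_tumor_volume(roi, voxel_volume, pe_t=70.0, ser_t=0.0, min_connected=2):
    """roi maps (x, y, z) -> (S0, S1, S2); returns volume in units of voxel_volume."""
    # Step 1: apply the PE threshold.
    enhanced = [c for c, (s0, s1, _) in roi.items()
                if s0 > 0 and percent_enhancement(s0, s1) >= pe_t]
    # Step 2: connectivity test removes isolated voxels (min_connected is hypothetical).
    mask = [c for comp in connected_components(enhanced)
            if len(comp) >= min_connected for c in comp]
    # Step 3: apply the SER threshold inside the enhanced tissue mask.
    count = 0
    for c in mask:
        s0, s1, s2 = roi[c]
        if s2 != s0 and signal_enhancement_ratio(s0, s1, s2) >= ser_t:
            count += 1
    # Step 4: sum the voxel volumes.
    return count * voxel_volume

def percent_change_ftv(ftv_n, ftv_1):
    # dFTVn = 100% x (FTVn - FTV1) / FTV1
    return 100.0 * (ftv_n - ftv_1) / ftv_1

# Toy ROI with made-up signal intensities (not trial data).
toy_roi = {
    (0, 0, 0): (100, 250, 200),  # PE 150%, SER 1.5
    (1, 0, 0): (100, 200, 250),  # PE 100%, SER about 0.67
    (2, 0, 0): (100, 120, 130),  # PE 20%, below a 70% threshold
    (5, 5, 5): (100, 300, 200),  # PE 200%, but isolated
}
ftv_default = functional_tumor_volume(toy_roi, voxel_volume=0.5)
ftv_high_ser = functional_tumor_volume(toy_roi, voxel_volume=0.5, ser_t=1.0)
```

In this toy case, raising ser_t from 0 to 1.0 drops the voxel with the lower washout ratio, which is the kind of threshold sensitivity the study examines across the PEt/SERt grid.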

Statistical Analysis
FTV measurements were calculated for each pair of PEt/SERt values, and associations with pCR were evaluated using receiver operating characteristic (ROC) curve analysis. The area under the ROC curve (AUC) was estimated to provide a measure of predictor quality. In the statistical model, patients with pCR were considered as controls (negative outcome) and those with non-pCR were considered as cases (positive outcome). For each PEt/SERt pair, the AUC was estimated in the full cohort and separately in each specific breast cancer subtype. The AUCs were then mapped as a surface plot on the axes of PEt (range, 30%-200%) and SERt (range, 0-2) for each FTV measurement. Higher AUC indicates "stronger association" between the measurement and pCR status. The optimized PEt/SERt was selected as having the maximum AUC over the map of PEt/SERt combinations. The processes of calculating FTV for each specific PEt/SERt pair, estimating AUCs, and selecting optimized PEt/SERt based on AUC values were performed automatically after the ROI was defined.
Because of the small sample size, it was not feasible to perform cross validation; hence, AUCs and predictive accuracy estimates are subject to overfitting. An optimal cutoff point was chosen as the point on the ROC curve closest to sensitivity = 100% and specificity = 100% (24). Data processing and optimization were performed in Matlab (R2012b 64bit for Mac, MathWorks Inc., Natick, Massachusetts), and all statistical analyses were conducted using the R statistical analysis software package and the pROC library (25,26). Data are expressed as median with interquartile range. All tests were performed at the P < .05 level, and all results are provided with estimates, 95% confidence intervals (CIs), and P values if appropriate.
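The grid search described in this paragraph can be sketched as follows, under stated assumptions. The AUC is computed here with the rank-based (Mann-Whitney) formulation rather than with the pROC library used in the study, and it assumes that larger predictor values point toward non-pCR. The predictor callback and the patient identifiers are hypothetical stand-ins for a recalculated FTV or ΔFTV value.

```python
# Threshold grid from the text: PEt 30%-200% in steps of 10%, SERt 0-2 in steps of 0.2.
PE_GRID = list(range(30, 201, 10))
SER_GRID = [round(0.2 * k, 1) for k in range(11)]

def auc(case_values, control_values):
    """Probability that a random case exceeds a random control (ties count 1/2).

    Cases are non-pCR patients and controls are pCR patients, as in the text.
    """
    wins = 0.0
    for a in case_values:
        for b in control_values:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))

def best_thresholds(predictor, non_pcr_ids, pcr_ids):
    """Return (max AUC, PEt, SERt) over the grid.

    predictor(patient_id, pe_t, ser_t) is a hypothetical callback returning the
    FTV-based predictor recalculated at the given thresholds.
    """
    best = None
    for pe_t in PE_GRID:
        for ser_t in SER_GRID:
            cases = [predictor(p, pe_t, ser_t) for p in non_pcr_ids]
            controls = [predictor(p, pe_t, ser_t) for p in pcr_ids]
            value = auc(cases, controls)
            # Strict comparison keeps the first grid point on ties.
            if best is None or value > best[0]:
                best = (value, pe_t, ser_t)
    return best
```

Because the maximum is taken over 198 threshold pairs on the same patients, the selected AUC is optimistically biased without validation on independent data.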
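The cutoff rule described above (the ROC point closest to sensitivity = 100% and specificity = 100%) can be sketched as below. This is an illustrative reimplementation, not the study's R code. It assumes that a patient is predicted non-pCR when the predictor is at or above the cutoff, and the function name and the toy values are hypothetical.

```python
import math

def closest_to_corner_cutoff(case_values, control_values):
    """Return (cutoff, sensitivity, specificity) nearest the ideal ROC corner.

    Cases are non-pCR patients (positive outcome); controls are pCR patients.
    A value >= cutoff is called positive (assumed direction).
    """
    best = None
    for cutoff in sorted(set(case_values) | set(control_values)):
        sens = sum(v >= cutoff for v in case_values) / len(case_values)
        spec = sum(v < cutoff for v in control_values) / len(control_values)
        # Euclidean distance to the point sensitivity = 1, specificity = 1.
        dist = math.hypot(1.0 - sens, 1.0 - spec)
        if best is None or dist < best[0]:
            best = (dist, cutoff, sens, spec)
    return best[1:]

# Made-up percent-change values for illustration only (not trial data).
toy_non_pcr = [-40, -30, -20, -85]
toy_pcr = [-90, -80, -95, -35]
toy_cutoff = closest_to_corner_cutoff(toy_non_pcr, toy_pcr)
```

Only observed values are tried as candidate cutoffs, which is sufficient because sensitivity and specificity change only at those points.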

RESULTS
A cohort of 116 patients was analyzed. The status of HR and HER2 was available for primary tumors in 115 patients (99%). Characteristics of patients with and without pCR are described in Table 1.

Effect of Varying PEt/SERt on Predicting pCR
Analyses of surgical samples revealed pCR in 34 patients (29%). The remaining 82 patients (71%) did not achieve pCR (non-pCR). Among 45 patients with HR+/HER2− breast cancer, only 6 (estimated percentage, 13%, with 95% CI of 5% to 27%) achieved pCR. Sixteen HER2+ patients out of 39 (estimated percentage: 41%, with 95% CI of 26% to 58%) achieved pCR and 11 out of 30 patients (estimated percentage: 37%, with 95% CI of 20% to 56%) achieved pCR in the TNBC subgroup. Figure 3 shows the highest AUCs observed for FTV measurements at different treatment time-points for the full cohort and by breast cancer subtype. In general, AUCs evaluated in subtypes were estimated to be higher than those in the full cohort, of which triple negatives had the highest estimated AUCs. In addition, absolute FTVs and ΔFTVs at MRI2 and MRI3 showed higher AUCs than those measured at MRI1 and MRI4. The estimated AUC at ΔFTV3 in the HR+/HER2− subgroup was among the highest with a narrow confidence interval. Although ΔFTV3 showed no significant difference relative to other FTV predictors in the full cohort and other subtypes, we focused our contrast threshold comparison between subgroups using ΔFTV3 as a predictor.
In the full cohort among all PEt/SERt combinations, ΔFTV3 exhibited higher estimated AUCs (≥0.75) at 70% ≤ PEt ≤ 140% and lower range of SERt (0.0-1.0) (Figure 4A). Within specific subtypes, a differential effect of varying PEt/SERt on the prediction of pCR using ΔFTV3 was observed. In the HR+/HER2− subgroup (Figure 5A), higher estimated AUCs occurred at higher PEt ranging from 120% to 200% across the entire range of SERt (0.0-2.0). In the HER2+ subgroup (Figure 6A), high AUCs occurred at PEt from 70% to 140% and at SERt from 1.0 to 2.0. In the TNBC subtype (Figure 7A), higher estimated AUCs also occurred at a PEt range of 60% to 150% and across the entire range of SERt (0.0-2.0).
To demonstrate the improved discrimination of pCR versus non-pCR using optimized PE/SER thresholds, we examined ΔFTV3 in the full cohort and in breast cancer subtypes. Table 2 shows diagnostic performance for cutoff points selected from ROC curves (Figures 4-7B). In the full cohort, inconsistent effects on sensitivity and specificity were observed, whereas a consistent improvement was shown in subtypes. Table 3 shows ΔFTV3 values and differences between patients with pCR and those without pCR (non-pCR) (Figures 4-7C). P values in Table 3 were estimated by likelihood ratio test. Lower P values at optimized PEt/SERt in subtypes may indicate that ΔFTV3 calculated by optimized PEt/SERt has stronger predictive value for pCR than the default. Odds ratios were also estimated to be larger using optimized than default thresholds. Figure 8 shows an example of the effect of PEt/SERt on tumor voxels and subsequent FTV calculations in DCE-MRI. In this example, a 38-year-old female patient with a tumor sized 4 cm was enrolled in the I-SPY 1 TRIAL. The tumor was identified to be HR+/HER2− before treatment. The patient received AC- and taxane-based chemotherapy, and she did not achieve pCR at the completion of the treatment.
The effect of varied PEt/SERt on estimated AUCs for FTV2, ΔFTV2, and FTV3 is shown in the supplement of this paper. When comparing absolute measures FTV2 and FTV3 with percent change ΔFTV2 and ΔFTV3 (Figures 4-7A), the absolute measurements are more reliable in predicting pCR over a wider range of PEt/SERt. In HR+/HER2− subtype, higher estimated AUCs were observed at high PEt in all FTV measurements. Estimated AUCs for HER2+ are generally lower than HR+/HER2− and TNBC, which can also be observed in Figure 3. A mixed effect of PEt/SERt in TNBC was observed when high AUCs were found at a higher range of PEt for FTV2, FTV3, and ΔFTV3 but at lower range of PEt for ΔFTV2.
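The subtype pCR rates and their 95% CIs quoted above can be approximately reproduced with an exact binomial interval. The source does not name the interval method, so the Clopper-Pearson construction below is an assumption, chosen because it appears consistent with the reported ranges. The function names are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _bisect(func, target, increasing):
    """Solve func(p) = target on [0, 1] for a monotone func by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_binomial_ci(x, n, alpha=0.05):
    """Clopper-Pearson interval for x responders out of n patients."""
    # Lower limit: P(X >= x | p) = alpha/2, which increases with p.
    lower = 0.0 if x == 0 else _bisect(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), alpha / 2, True)
    # Upper limit: P(X <= x | p) = alpha/2, which decreases with p.
    upper = 1.0 if x == n else _bisect(
        lambda p: binom_cdf(x, n, p), alpha / 2, False)
    return lower, upper

# pCR counts reported in the text: (responders, subgroup size).
reported = {"HR+/HER2-": (6, 45), "HER2+": (16, 39), "TNBC": (11, 30)}
intervals = {name: exact_binomial_ci(x, n) for name, (x, n) in reported.items()}
```

The width of these intervals, particularly for the 6 responders in the HR+/HER2− subgroup, illustrates why subgroup estimates in this cohort carry substantial uncertainty.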

DISCUSSION
In this study, the impact of PE and SER thresholds on FTV prediction of neoadjuvant treatment response was retrospectively investigated using data from the I-SPY 1 TRIAL/ACRIN 6657. In that study, default PEt and SERt levels were used in the FTV calculations that were empirically set by visual evaluation of DCE images. In this paper, we present a semiautomated method to customize the PEt and SERt parameters, particularly for breast cancer subtypes, to account for the heterogeneity of tumor biology as reflected in imaging biomarkers. Through the optimization framework of this study, we seek to better understand the enhancement patterns of individual breast cancer subtypes and the association between enhancement measurements and pathologic outcomes of NACT.
Various forms of FTV have been investigated and compared previously to test predictive performance measured at different time-points during the treatment. Previous work on the ACRIN 6657 study reported AUCs of FTV ratios at MRI2, MRI3, and MRI4 relative to MRI1 in predicting pCR using the default PEt/SERt (15). In a study using earlier data from a pilot cohort of 64 patients imaged at a single center (18), the effect of varying PEt/SERt on FTV and ΔFTV was investigated. The percent change in FTV over the entire course of treatment from baseline to before surgery (ΔFTVf) was the predictor with the highest hazard ratio in the full cohort and the HR+/HER2− and HER2+ subgroups, whereas the absolute presurgical FTV (FTVf) was the highest for the TNBC subtype. In this study, FTV was calculated at MRI1 through MRI4 and percent change of FTV at MRI2, MRI3, and MRI4. Although the inter-regimen metrics FTV3/ΔFTV3 generally showed higher estimated AUCs, AUCs of the presurgery values FTV4/ΔFTV4 varied across patient cohorts (Figure 3). Meanwhile, FTV2/ΔFTV2 had similarly high AUCs as FTV3/ΔFTV3 across all patient cohorts except HR+/HER2−. Given the small sample size, these observations are limited to this study only. Cross validation is needed to confirm them in a general population.
PE and SER measure the signal enhancement characteristics of pre- and postcontrast injection during DCE-MRI (22). These 2 basic measurements and their thresholds may have a profound effect on the subsequent FTV calculation and, hence, its predictive performance of response in breast cancer subtypes during the treatment course. The current study showed that higher AUCs were observed at higher PEt when absolute FTV was used to predict pCR in HR+/HER2− subtype. A similar finding was observed in the HR+ subgroup in a previous study (18), indicating that higher PEt may better discriminate regions of malignant tumor from the high background parenchymal enhancement often found in HR+ patients (27-30). High SER value is indicative of tissue with a strong contrast washout characteristic and is generally associated with malignancy (31). Many studies have reported that TNBC shows a malignant enhancement pattern on DCE-MRI (32-36). Li et al. reported that postchemotherapy tumor volume with high SER had a statistically significant association with disease recurrence (37). Among breast cancer subtypes in this study, HER2+ was most affected by SERt at FTV3 and ΔFTV3. Higher AUCs were observed at higher SERt, suggesting distinct biology and microenvironment within the HER2+ tumor that differ from other subtypes.
Compared with HR+/HER2− and TNBC, HER2+ had lower AUCs. This may be because of the heterogeneity within this subgroup, which included both HR+ and HR− tumors. Because of the small sample size, we could not further subset this group into HR+/HER2+ and HR−/HER2+. The heterogeneity within this subtype may limit the effectiveness of changing PEt/SERt to improve AUC. Furthermore, although trastuzumab is the current standard treatment for HER2+ patients, it was not used routinely in the timeframe of this study. Only 13 of 39 HER2+ patients received trastuzumab therapy. This adds complexity to this subtype and may have also created bias in our results. Because of the small sample size, we did not exclude these patients.
The presented retrospective study has a few limitations. First, the image quality may not be consistent in our patient cohort. Imaging data in this study were collected from a multicenter clinical trial and were acquired from 7 participating sites in the USA. The default PEt/SERt setting varied across sites, and we only studied the subsequent calculated FTVs by applying subtype-specific thresholds. Second, the sample size is too small to perform any kind of validation (or cross validation) of the optimization model. The highest AUCs found in the full cohort and in subtypes may therefore overestimate the true optimal values. Further study on an independent cohort should therefore be performed to evaluate the extent to which our estimated AUCs represent generalizable improvement in predictive values. Again because of the relatively small sample sizes, AUCs estimated in subtypes have wider CIs compared with those estimated in the full cohort. In this study of 116 patients, we were unable to evaluate other factors such as age, tumor size, and axillary lymph node status. Third, the treatment was not the same for all subtypes. The data set was acquired between May 2002 and March 2006. All patients in our cohort had AC and taxane therapy before surgery, and one-third of HER2+ patients received additional trastuzumab. These different treatments can affect the predictive performance of ΔFTV with or without optimization. Finally, HER2+ subtype comprised both HR+/HER2+ and HR−/HER2+, posing potential heterogeneity in the analysis. In our planned future study with a larger cohort, the HR+/HER2+ and HR−/HER2+ subsets will be separately analyzed.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Acknowledgments
This research was supported by the National Institutes of Health (R01 CA069587, R01 CA116182, and U01 CA151235). The authors sincerely thank the patients who participated in I-SPY 1 TRIAL/ACRIN 6657.

Figure 3. Maximum area under the receiver operating characteristic curve (AUC) observed for FTV prediction of pCR. Plots were generated for patients in the full cohort and in HR+/HER2−, HER2+, and TNBC cohorts, separately. Each AUC was plotted with 95% confidence interval (CI).

Figure 5. Effects of PEt and SERt on the subsequent ΔFTV3 association with pCR and non-pCR in the HR+/HER2− subtype of 45 patients. The 3D surface map, with the star indicating where maximum AUC is observed: PEt = 130% and SERt = 0 (A). The ROC curves of using 130%/0 (in red) versus the default (in blue) (B). Estimated AUC for the blue curve is 0.77 (95% CI, 0.48-1.00) and that for the red curve is 0.90 (95% CI, 0.84-0.97) (B). Box plots of ΔFTV3 values in patients with pCR and those without pCR (non-pCR) calculated with default (on the left) versus 130%/0 (on the right) (C).

Figure 6. Effects of PEt and SERt on the subsequent ΔFTV3 association with pCR and non-pCR in the HER2+ subtype of 39 patients. The 3D surface map, with the star indicating where maximum AUC is observed: PEt = 130% and SERt = 2.0 (A