The 3′UTR signature defines a highly metastatic subgroup of triple-negative breast cancer

Triple-negative breast cancer (TNBC) is a highly heterogeneous disease with an aggressive clinical course. Prognostic models are needed to chart potential patient outcomes. To address this, we used alternative 3′UTR patterns to improve postoperative risk stratification. We collected 327 publicly available microarrays and generated the 3′UTR landscape based on expression ratios of alternative 3′UTR. After initial feature filtering, we built a 17-3′UTR-based classifier using an elastic net model. Time-dependent ROC comparisons and Kaplan–Meier analyses confirmed an outstanding discriminating power of our prognostic model for TNBC patients. In the training cohort, 5-year event-free survival (EFS) was 78.6% (95% CI 71.2–86.0) for the low-risk group, and 16.3% (95% CI 2.3–30.4) for the high-risk group (log-rank p<0.0001; hazard ratio [HR] 8.29, 95% CI 4.78–14.4), In the validation set, 5-year EFS was 75.6% (95% CI 68.0–83.2) for the low-risk group, and 33.2% (95% CI 17.1–49.3) for the high-risk group (log-rank p<0.0001; HR 3.17, 95% CI 1.66–5.42). In conclusion, the 17-3′UTR-based classifier provides a superior prognostic performance for estimating disease recurrence and metastasis in TNBC patients and it may permit personalized management strategies.


INTRODUCTION
Triple-negative breast cancer (TNBC) which lacks estrogen receptor (ER) and progesterone receptor (PR) expression and without human epidermal growth factor receptor 2 (HER2) amplification, accounts for 15-20% of breast cancers [1]. It is a highly heterogeneous disease with an aggressive clinical course. Survival dramatically decreases during the first 3 to 5 years after diagnosis and distant relapse after this period is rare, indicating a potential curability of TNBC.
Treatment selection should be tailored to the patient based on the risk of recurrence and it should include considerations of tumor size, lymph node and receptor status and patient age at diagnosis [2]. In the era of personalized medicine, traditional clinicopathological risk factors are being challenged [3] and novel biomarkers and multigene expression assays [4][5][6] are emerging as better predictors of patient outcomes. However, few risk prediction models are widely used in clinical practice so the need remains to distinguish TNBC patients with poor prognosis from those with potentially favorable outcomes, and to test alternative treatment modalities with higher toxicity for these high-risk patients.
Alternative polyadenylation (APA) is a fundamental molecular mechanism influencing the kinetics of gene regulation under diverse physiological and pathological states through the dynamic usage of messenger RNA (mRNA) 3′ untranslated regions (3′UTRs) [7,8]. Virtually 70% of human genes have multiple polyA sites that Research Paper produce distinct transcripts with variable 3′UTR length and this contributes to transcriptome diversity [9,10]. Recent studies [11][12][13] suggest a biological importance of APA dynamics in human cancer, but the precise role in carcinogenesis is pooly defined. We previously demonstrated the prognostic relevance of 3′UTR APA dynamics in prostate cancer. Consensus clustering with 3′UTR APA data stratified patients with prostate cancer into four subsets with different risk of biochemical relapse [14], indicating that a 3′UTR signature can be used to stratify risk in cancer. For breast cancer, only a few studies describe the 3′UTR landscape in clinical specimens and most focus exclusively on the prognostic significance of 3′UTR shortening, regardless of the heterogeneity between different molecular subtypes [15,16]. Data are still scarce for APA dynamics of TNBC and its survival relevance, so we estimated the underlying prognostic value of a 3′UTR signature in TNBC including both 3′UTR shortening and lengthening. Then we developed a multi-3′UTR-based model with an elastic net model to identify a subgroup of operable TNBC with high metastatic potential. We evaluated the prognostic efficacy of this classifier in training and validation sets and compared it to traditional clinical risk factor assessment and single 3′UTRs. To our knowledge, this is the first study that illustrates the prognostic significance of a 3′UTR signature in TNBC.

Patient characteristics
Patients (n = 327) with TNBC were retrospectively recruited to this study. Patients were on average 52.5 years-of-age (SD 12.4 years; median 53 years, IQR 43-61, range 29-84). After a median follow-up time of 88 months, 114 of the 327 patients had tumor relapse or distant metastasis.
For robust prognostic modeling, we divided TNBC data into training (n = 164) and validation sets (n = 163) using stratified random sampling based on microarray batches. Baseline clinicopathological characteristics are summarized in Table 1. There was no statistically significant difference in the event-free survival time between TNBC patient sets (log-rank p = 0.89) or age (mean 52.4 years-of-age; IQR 43-61 for training set; mean 52.6 years-of-age; IQR 43-61 for validation set p = 0.465). Lymph node status and tumor size distributions did not vary between patients in either set (p = 0.977 and 0.974, respectively).
We generated 3′UTR ERI profiles of 1,933 genes from 12 microarray batches and removed batch effects in ERI data using the ComBat method, which incorporates systematic batch biases common across variables for adjustments. Supplementary Figure S1 shows thatsamples mixed well in the principle component analysis (PCA), indicating corrected batch effects.

A 17-3′UTR-based classifier redefines patients with different risks of relapse or metastasis
After applying initial feature filtering using univariate Cox analysis, we developed a prognostic model with an elastic net to the training TNBC samples set. A value λ = 0.109 with log(λ) = −2.22 was chosen by leaveone-out cross validation via 1-SE criteria. The optimal penalty parameter selected 17 3′UTRs in the elastic net model and risk scores were calculated for all patients as a weighted sum of selected features based on this model. The ultimate model is a linear combination of 3′UTRs (features) selected by the elastic net. The weights are an estimation of the information contributed by each marker. Specifically: The exponential of the coefficient gives the hazard ratio (HR) of event (recurrent or metastasis) associated with each marker. Intuitively, we computed a standardized risk score and parameters of linear transformation were determined by the training set. Specifically: Supplementary Figure S2 shows that standardized risk score distribution was unimodal with similar peaks between training and validation cohorts. To select a threshold to stratify TNBC into high-and low-risk groups, we use X-tile approach, which split training patients into two risk groups with maximal χ 2 log-rank value (Supplementary Figure S3). We included patients with risk scores > 1.146 (standardized risk score 0.903) as high risk of recurrence or distant metastasis (high-risk group), and those with risk score < 1.146 as low risk of recurrence or metastasis (low-risk group).
We then investigated the relationship between the 17-3′UTR-based classifier and available clinical risk factors (age, lymph node status and tumor size). Table 1 confirms no significant correlation between the risk group and clinical factors in both sets (p > 0.05 for all). Stratified by age, lymph node status and tumor size, the 17-3′UTRbased classifier could still differentiate patients with highand low-risk TNBC among all subgroups ( Figure 2).

17-3′UTR-based classifier adds significant prognostic information to established clinicopathologic risk factors
We evaluated the additional prognostic power of the 17-3′UTR-based classifier compared to clinical variables with using a univariate and multivariate Cox proportional hazards model. We included three clinical risk factors (age at diagnosis, lymph node status and tumor size) that were available for all patients in our analysis. The 17-3′UTR-based classifier and lymph node status were consistently significant across the different data sets (p<0.05; Table 2). After multivariate adjustment by clinicopathological variables, the 17-3′UTR-based classifier retained significant prognostic power in all 327 patients. (HR 4.72, 95% CI 3.22-6.92; p<0.0001; Supplementary Table 1). The multivariate Cox analysis identified two independent prognostic factors for EFS: the 17-3′UTR-based classifier was the strongest predictor, followed by lymph node status.
Next, we compared the prognostic accuracy of the classifier and clinical variables. The 17-3′UTRbased classifier had significantly more prognostic power compared to clinicopathological risk factors (p<0.0001 for all; Figure 3A). Combining the classifier and lymph node status increased the AUC at 5 years from 0.684 (95% CI 0.633-0.734) to 0.729 (95% CI 0.666-0.792; p<0.0001; Figure 3A), indicating that joint use both indices improved the TNBC patient risk prediction.

Association between 3′UTR APA dynamics and event-free survival in TNBC
Patients in the training set were divided into two groups based on 3′UTR ERIs. The optimal threshold for dichotomizing ERI to plot survival curves was derived from the training cohort using the X-tile program and was applied without modification to the validation set. 3′UTRs with ERIs exceeding the threshold were "shortened" and those at the threshold or less were "lengthened". These parameters were also applied to the validation set. In 10 of 17 selected 3′UTRs (N4BP2L2, WDHD1, ZER1, ADGRL2, PRSS12, NPL, SYNGR1, SCL2A3, UBE2G2 and SIK3), 3′UTR shortening was correlated with poorer prognosis (Supplementary Figure S4; Table 2). In the remaining seven 3′UTRs (PPIC, ZCCHC14, RTN1, PRCK8, CLIC2, CXCL8 and SMAD6), 3′UTR lengthening offered an unfavorable prognosis (Supplementary Figure S5; Table 2).
We also assessed the prognostic accuracy of single 3′UTR and the 17 3′UTR signature to distinguish highly metastatic TNBC using a time-dependent ROC test (Supplementary Figure S6). When considered individually, the 17 3′UTRs had moderate bootstrap-corrected AUCs at 5 years (from 0.528 to 0.605). The 17-3′UTR-based classifier had greater AUC than that of any single 3′UTR (p<0.0001 for all).

DISCUSSION
Increasing recognition of the active role of 3′UTR APA dynamics in tumorigenesis has led to the identification of novel APA markers for prognosis. We developed and validated a novel prognostic model based on 17 3′UTRs to improve the prediction of disease recurrence or distant metastasis for patients with operable TNBC. Data show that the proposed classifier can successfully identify a patient subgroup with poorer 5-year EFS. In addition, this tool can reliably predict risk in patients with TNBC significantly better than the classical clinicopathological risk factors (age at diagnosis, lymph node status, and tumor size). When patients were stratified by risk factors, the 17-3′UTR-based classifier retained prognostic efficacy and offered additional power that complemented clinicopathological factor analysis. To our knowledge, this is the first study to report using the 3′UTR signature in TNBC for patient prognosis.
In the present study, we confirmed that shortening of ten 3′UTRs (N4BP2L2, WDHD1, ZER1, ADGRL2, PRSS12, NPL, SYNGR1, SCL2A3, UBE2G2 and SIK3) is associated with unfavorable prognosis in TNBC. This is consistent with the emerging role of 3′UTR shortening which enables key genes to escape microRNA repression and cause accelerated tumor progression [11,12]. Most genes with shortened 3′UTRs in TNBC participate in cancer development. WDHD1 acts as a cell cycle regulator and a downstream molecule in the PI3K/ Akt pathway in lung and esophageal carcinogenesis [17]. NPL glycoprotein serves as a serum biomarker to distinguish healthy and precancerous esophageal lesions [18]. SYNGR1 has differential expression between bladder cancer with different risk of recurrence [19]. UBE2G2 is a mutant gene in human leukemia [20] and SIK3 has been recently identified as a tumor antigen associated with ovarian cancer tumorigenesis. The 3′UTR markers for which lengthening indicates poorer prognosis include PPIC, ZCCHC14, RTN1, PRCK8, CLIC2, CXCL8 and SMAD6. Among them, SMAD6 and CLIC2 are both related to the TGF-β pathway, and SMAD6 is associated with favorable survival in lung cancer [21,22]. CXCL8 participates in the autocrine NF-κB/IL-8 (CXCL8) pathway driving cell migration [23]. Previous studies have mainly focused on the prognostic significance of 3′UTR shortening, but our findings first propose that 3′UTR lengthening can reduce survival. We speculate that more microRNA response elements (MREs) are harbored in the lengthened 3′UTR, leading homologous gene repression and competing endogenous RNA (ceRNA), which causes a series of aberrant pathways and cancer progression. Using computational network analysis, our recent study [14] revealed alternative 3′UTR participates in ceRNA network dynamics in cancer but detailed mechanisms remain to be characterized. Previous studies identified multiple 3′UTRs that with different length preferences in TNBC compared with non-TNBC or normal breast tissue [13,24]. Two studies independently confirmed that 3′UTR shortening predicts unfavorable outcomes in breast cancer based on public high-throughput data obtained from GEO and The Cancer Genome Atlas (TCGA), respectively [15,16]. However, research in cell lines indicated that TNBC prefers longer forms of 3′UTR compared with luminal subtype tissue and normal breast tissue [13]. Previous studies are limited by small numbers of 3′UTR analyzed, small sample sizes and a lack of validation. Although this preliminary work laid the groundwork for exploring the importance of 3′UTR shortening as a promising marker in cancer, our findings are more broad and inclusive of 3′UTR dynamics in human malignancy. With a high-dimensional elastic net modeling technique, we established a 3′UTR signature (including 3′UTR shortening and lengthening) to be a powerful tool for TNBC patient risk stratification by correlating alternative 3′UTR patterns with survival outcomes.
The proposed classifier is of clinical importance because current gene expression prognostic models such as the 70-gene signature [4], the Genomic Grading Index [25], and the Recurrence Score [26] allocate unfavorable prognostic risk category to all TNBC patients despite the diversity of clinical outcomes. Because TNBC is a highly heterogeneous disease, treatment strategies should be personalized according to risk recurrence and our data allow this type of risk evaluation and better management of TNBC patients.
Of note, the statistical method used for analyzing microarray data must be carefully selected. Cox regression was used previously [15] when constructing a prognostic model for breast cancer but this is inappropriate due to an insufficient sample size relative to the high-dimension features (i.e. sample size n: feature size p less than 10:1), especially for high-dimensional microarray data. An elastic net overcomes this disadvantage (p >> n) and simultaneously conducts automatic variable and group selections of correlated features [27,28].
The primary limitation of this study is that it is a retrospective study. Thus, data must be validated using a prospective study in a multicenter clinical trial. In addition, 3′-based chips (HG-U133A and HG-U133 plus 2.0) severely limited the number of genes analyzed (1,933 genes in this study, or ~9.7% of human protein-coding genes). We chose this platform because these two chips are widely used in gene expression studies and the number of available microarray data with survival and clinical information was sufficient for prognostic modeling. We show that transcripts with significant 3′UTR length change are highly biologically and clinically relevant to cancer. However, we cannot ensure our results are unbiased estimations of whole genome profiles among TNBC, so generalizations to patterns of alternative 3′UTR should be performed with caution. Currently we are conducting a study to profile 3′UTR APA dynamics of TNBC using transcriptome arrays, which provide a less biased 3′UTR landscape (paper in preparation).
In summary, we performed the largest 3′UTR APA prognostic analysis of pooled gene expression data and yielded essential clinical information about TNBC. We described a new 3′UTR signature for TNBC that identifies about one-fifth of TNBC patients as being relatively high risk for tumor recurrence. Finally, we studied a limited number of genes, our novel 3′UTR-based classifier should lay a solid foundation for future biological and pathologic analysis of alternative 3′UTR events in TNBC.

TNBC microarray data collection
In this study, we used data from a single Affymetrix platform (including HG-133A and HG-U133 plus 2.0) which is commonly used in microarray studies of breast cancer. Because most microarrays lacked immunohistochemistry information for ER, PR and HER2, triple-negative status was defined based on the bimodal filter of mRNA expression of ER, PR and HER2 as previously described [29,30]. For prognostic efficacy comparisons and stratification analysis, we only included samples with available follow-up data and three established clinicopathological risk factors: age at diagnosis, lymph node status and tumor size. Then, 327 publicly available microarrays were analyzed. Data for 3′UTR analysis comprised 12 pooled public data sets obtained from the Gene Expression Omnibus (GEO),

3′UTR profiling and data normalization
Microarrays were analyzed using an R package 'ERI-expr' [15], which contains custom chip description files (CDF) for HG-U133A, HG-U133B and HG-U133 plus 2.0. Briefly, the processed probesets matching alternative transcripts with different 3′UTR lengths were extracted in terms of APA sites defined by the polyA_DB database [31]. After robust multi-array average (RMA) correction, intensities for the 5′ (S 5′ ) and 3′ (S 3′ ) probe sets included in the custom CDF were extracted. The expression ratio index (ERI) was defined as the signal ratio of 5′ and 3′ probesets of the APA site: In equation 3, e S (e L ) represents expression of the short (long) form of the 3′UTR, whereas α S,L (β S,L ) denotes the affinities of the short (S) or long (L) forms for the 5′ (α) and 3′ (β) probe sets. The ERI is a linear function of the expression ration between shortening and lengthening forms of 3′UTR. Thus, we used the ERI value to characterize the relative prevalence of the APA dynamics. With this algorithm, 6,045 APA sites in 3,542 unique genes in HG-U133 plus 2.0 and 3,210 APA sites in 1,933 genes in HG-U133A were identified. Using the intersection of the recognized APA sites between the two chips, we reported the ERI value of the APA site closest to 5′ end of the gene with multiple APA sites, since 90% of the significant changes in APA isoform expression occurred at the first APA site [32].
To minimize batch effects when pooling ERI data, we used an empirical Bayes based ComBat method to adjust batch effects in microarray data [33]. To confirm that batch effects were successfully removed, principle component analysis (PCA) of combined ERI data sets was subsequently performed to visualize the spatial distribution of data sets from different batches.

Development and validation of risk prediction model for TNBC
3′UTR profiles were obtained from 12 publicly available data sets that contained 327 primary TNBCs. For robust analysis, we used stratified random sampling to assign each sample to a training or validation set according to chip batch. Patient characteristics of training and validation cohorts appear in Table 1. To reduce feature dimensionality, we first filtered out noisy features using univariate Cox analysis applied to 3′UTR ERI data. We established a threshold of 0.15 for the p-values, and kept only 3′UTRs with a Wald p-value smaller than 0.15 for model development.
Next, we used the elastic net [27] to identify 3′UTRs associated with event-free survival and to train the final model for prognosis with the selected features in the training set. The elastic net has been extended and broadly applied to the Cox proportional hazard regression model for survival analysis with high-dimensional data. This approach conducts automatic feature selection and group selection of the correlated variables simultaneously [28]. To find a parsimonious model with a modest discriminating accuracy, we used ten-fold cross validations to select the penalty parameter λ (the tuning parameter), and chose λ via 1-SE (standard error) criteria. The elastic net mixing parameter α was set to 0.5. We used R version 3.2.3 and its implemented 'glmnet' package [34] to perform the elastic net analysis.
To evaluate the effectiveness of the 3′UTR signature for survival prediction, we assigned each patient a risk score according to a linear combination of ERI of the 3′UTRs selected by the elastic net. The risk score function for sample i using the information from the significant 3′UTRs was calculated as follows: In the above equation, s ij is ERI for 3′UTRj on sample i, and Wj is the weight of the risk score of 3′UTRj. Weights were obtained by the corresponding coefficients derived from the elastic net modeling. The weights are a rough estimate of the information content contributed by each 3′UTR to event-free survival. The standard risk score was also reported for each sample. Then, we allocated the patients to high-risk and low-risk groups using X-tile plots based on correlations with the patients' event-free survival.
X-tile plots adopt an intuitive and concise approach to estimate associations between variables and survival. For continuous variables, X-tile software can analyze each value and choose an optimal threshold with the maximal Chi square value according to the log-rank test (or minimum p-value) [35]. We performed X-tile analysis using X-tile version 3.6.1 (Yale University School of Medicine, New Haven, CT).
To confirm the robustness of the elastic net model, we calculated risk scores based on Equation 4 for the validation samples, and divided them into high-or low-risk groups, with the threshold determined in the training set as described above. Time-dependent receiver operating characteristic (ROC) curves, a Cox regression model and Kaplan-Meier survival analysis were used to assess the prognostic accuracy of the model in training and validation sets.

Statistical analysis
Patient characteristics were summarized for all participants using standard descriptive statistics. A Pearson's χ 2 test was used to compare categorical variables whereas the Student's t-test was used for continuous variables. Disease-free survival (DFS) was preferentially used as a clinical end point for event-free survival (EFS). In data sets without DFS records, distant-metastasis-free survival (DMFS) was used as proxy for DFS. Patients with a study end date or who were lost to follow-up were considered censored. The median follow-up time was calculated using a reverse Kaplan-Meier method and the Kaplan-Meier method was used to construct EFS curves. Log-rank tests were used to assess survival differences. Unadjusted and adjusted hazard ratios (HR) with 95% confidence intervals (CI) were calculated using Cox proportional hazards models.
Time-dependent ROC curves were used to assess the risk prediction model and clinical factors to discriminate among patients with respect to the risk of EFS events [36]. Prediction accuracy was calculated based on the area under the time-dependent ROC curve (AUC). We performed 1,000 bootstrap re-samplings to compute CI for the AUC. In this procedure, rows of data were sampled with replacements, and risk scores and corresponding AUC values were estimated for each iteration.
All reported p-values were two sided and p<0.05 was considered to be statistically significant. All statistical analyses were completed using SPSS Statistics 20 (SPSS Inc., Chicago, IL) and R 3.2.3 (R Development Core Team, Vienna).