Gene expression signature of Gleason score is associated with prostate cancer outcomes in a radical prostatectomy cohort

Prostate cancer (PCa) is a leading cause of cancer-related mortality worldwide. Gleason score (GS) is one of the best predictors of PCa aggressiveness, but additional tumor biomarkers may improve its prognostic accuracy. We developed a gene expression signature of GS to enhance the prediction of PCa outcomes. Elastic net was used to construct a gene expression signature by contrasting GS 8-10 vs. ≤6 tumors in The Cancer Genome Atlas (TCGA) dataset. The constructed signature was then evaluated for its ability to predict recurrence and metastatic-lethal (ML) progression in a Fred Hutchinson (FH) patient cohort (N=408; NRecurrence=109; NMLprogression=27). The expression signature included transcripts representing 49 genes. In the FH cohort, a 25% increase in the signature was associated with a hazard ratio (HR) of 1.51 (P=2.7×10−5) for recurrence. The signature's area under the curve (AUC) for predicting recurrence and ML progression was 0.68 and 0.76, respectively. Compared to a model with age at diagnosis, pathological stage and GS, the gene expression signature improved the AUC for recurrence (3%) and ML progression (6%). Higher levels of the signature were associated with increased expression of genes in cell cycle-related pathways and decreased expression of genes in androgen response, estrogen response, oxidative phosphorylation, and apoptosis. This gene expression signature based on GS may improve the prediction of recurrence as well as ML progression in PCa patients after radical prostatectomy.


INTRODUCTION
Prostate cancer (PCa) is the most common noncutaneous cancer and a leading cause of cancer-related deaths among men in the U.S. Following primary curative treatment, PCa recurrence rates vary depending on stage, Gleason score (GS), and prostate-specific antigen (PSA) level [1]. Although 20 to 30% of patients with clinically localized disease will relapse within 5 years after initial therapy [2], predicting an individual patient's risk of recurrence or metastatic progression remains challenging.
Several previous studies have constructed prognostic gene expression signatures based on cell cycle proliferation genes [3], prostate tumorigenesis related  [4], and PCa recurrence [5][6][7]. In addition, GS has been utilized to construct gene expression signatures [8][9][10], which have been associated with relapse [8] and prostate cancer mortality [9,10]. The approach of identifying differentially expressed transcripts in tumors stratified by GS is expected to capture relevant data on tumor aggressiveness potential since GS is one of the best predictors of PCa prognosis [11]. While patients with GS ≤6 tumors typically have a favorable prognosis, patients with GS 8-10 tumors often have a poor prognosis [12]. A substantial proportion of PCa patients, however, have intermediate GS 7 tumors (3+4 or 4+3), and these patients have a more variable prognosis [13]. Better prognostic markers are needed to risk stratify the clinically heterogeneous subset of patients with intermediate GS 7 tumors. In this study, we developed a GS based gene expression signature by utilizing publically available transcriptome-wide data from The Cancer Genome Atlas (TCGA; N=333) and Elastic net. The constructed signature was then evaluated for its ability to predict recurrence and metastatic-lethal (ML) progression after radical prostatectomy in a PCa patient cohort (N=408, mean follow-up for biochemical recurrence=8 years).

Recurrence and metastatic-lethal progression in the FH cohort
The gene expression signature was then calculated in the FH cohort (N=408). The signature was significantly associated with GS (Pearson's product-moment correlation = 0.26, P-value <2.9×10 -9 ) (Figure 1b). The AUC of the gene expression signature to predict GS high (8-10) vs. low (≤6) was 0.76.
We next evaluated risk of recurrence based on the level of differentially expressed transcripts measured by the gene signature. A 25% increase in the gene expression signature was associated with a hazard ratio (HR) for recurrence of 1.65 (95% CI: 1.36, 1.98; P = 2.4×10 -7 ), which remained significant after adjusting for age at diagnosis and pathological stage (HR = 1.51; 95% CI: 1.24, 1.82; P = 2.7×10 -5 ) ( Table 3; Figure 2). When the analysis was restricted to patients with GS 7 tumors, a 25% increase in the gene expression signature was associated with a HR of 1.44 (95% CI: 1.14, 1.82; P = 0.002), which also remained significant after adjusting for age at diagnosis and pathological stage (HR = 1.38; 95% CI: 1.09, 1.76; P = 0.008).
Finally, we estimated the AUC to assess the predictive ability of the gene expression signature. The signature alone had an AUC of 0.68 ( Figure 2b) and 0.76 ( Figure 3) for predicting recurrence and ML progression, respectively. Adding the signature to a logistic regression model that included clinicopathological factors (i.e., age at diagnosis, pathological stage and GS) significantly improved the goodness of fit of the model for recurrence (likelihood ratio test (LRT) P = 0.0003) and ML progression (P = 0.0004). The AUC increased by 3% and 6% for recurrence ( Figure 2b) and ML progression (Figure 3), respectively. In patients with GS 7 tumors, the signature alone had an AUC of 0.66 for recurrence ( Figure 2d). When the signature was added to a model with age at diagnosis, pathological stage and GS (3+4 vs. 4+3), the AUC increased by 5% for recurrence and the goodness of fit of the model significantly improved (P = 0.01).

Gene set enrichment analysis
A GSEA analysis was performed to investigate whether the signature correlates with biological pathways known to be involved in prostate biology and tumor progression. Because the gene signature is based on differentially expressed transcripts in high vs. low GS tumors, higher levels of the signature are expected to reflect biological pathways active in high-grade tumors. We first investigated correlations between the gene expression signature and transcriptome-wide mRNA expression levels. After applying multiple testing correction, gene expression levels of 2,492 genes were significantly correlated with the signature (Bonferroni corrected P-value <0.01). Of the 2,492 genes, 1,372 had transcript levels that were positively correlated with the gene expression signature while 1,120 had transcript levels that were negatively correlated with the signature. GSEA for the 2,492 correlated genes identified gene sets with a FDR q-value <0.05 (Table 4). Higher levels of the signature were associated with increased expression of genes in cell cycle-related pathways including G2M checkpoint, and genes encoding cell-cycle targets of E2F transcription factors, and decreased expression of genes in several pathways including androgen and estrogen response, oxidative phosphorylation, and apoptosis. Several previous studies found evidence that these pathways are important in prostate tumor progression, and the results therefore suggest that our signature captures these relevant biological differences.

DISCUSSION
In the present study, we developed a mRNA expression signature of GS using TCGA data, which predicted risk of recurrence and ML progression when PSA: prostate-specific antigen, SD: standard deviation a Local stage is pT2, N0/NX, M0. Regional stage is pT3-T4 and/or N1, M0 tested in an independent patient cohort under long-term follow-up. This transcript signature may improve disease prognostication after radical prostatectomy. Several previous studies have constructed prognostic gene expression signatures based on cell cycle proliferation genes [3], prostate tumorigenesis related genes [4], and PCa outcomes [5][6][7]. The cell cycle progression score [3], Prolaris®, was demonstrated to have predictive value for outcomes in RP specimens [14] and needle biopsy [15,16]. It improved prognosis for biochemical recurrence [3], metastatic progression [17], and PCa death [15,16]. The gene expression signature  Box plots of the gene expression signature for patients in different Gleason score categories in the FH cohort.  based on prostate tumorigenesis related genes [4], Oncotype DX®, improved the prediction of aggressive disease [18] and recurrence [19]. The gene expression signature Decipher™ [5] based on PCa metastasis was also predictive of biochemical recurrence [20], metastasis [21][22][23], and mortality [24] in independent validation samples. In addition, several previous studies have identified gene expression signatures based on GS [8][9][10]25]. The previous study by Bibikova et al. [8] utilized 512 prioritized genes that were selected based on biological relevance and publicly reported lists of genes differentially expressed in prostate tumors. They constructed a 16-gene signature using genes that were significantly correlated with GS, however, none of them overlaps with the 49 genes in our signature. This may be due to differences in the input gene sets (prioritized genes vs. transcriptome-wide), methods used to construct the signatures (correlation vs. elastic net), and differences in the size of the training datasets (71 samples vs. 333 TCGA samples). Another study by Penney et al., considered 6,100 genes to develop a 157-gene signature of GS (8-10 vs. ≤6) using the adapted nearest neighbor classification method [9]. Out of the 157 genes, five genes overlap with our signature: asporin (ASPN), centromere protein F (CENPF), endothelial cell specific molecule 1 (ESM1), geminin, DNA replication inhibitor (GMNN), and PAGE family member 4 (PAGE4). Recently, the same group of investigators utilized transcriptome-wide data to generate an updated expression signature of GS that included transcripts from 30 genes [10]. Of those 30 genes, four overlap with our GS signature: anoctamin 7 (ANO7), asporin (ASPN), galactosidase beta 1 like 3 (GLB1L3), and non-SMC condensin II complex subunit D3 (NCAPD3). Interestingly, prior evidence also suggests that three of these genes, ANO7, ASPN, and NCAPD3, play a role in PCa progression [26][27][28]. The ANO7 gene encodes a prostate-specific cytoplasmic and polytopic membrane protein [29], and its activity is regulated by androgen signaling. Reduced expression of the gene has been associated with high-grade prostate cancers [26], which is consistent with the gene transcript's negative coefficient in our signature of GS. ASPN codes for a cartilage extracellular protein, and others have shown that it is over-expressed in both primary tumors and metastatic PCa [27]; this is consistent with the positive coefficient of this gene in our signature. NCAPD3 plays critical roles in mitotic chromosome assembly and segregation, and its expression has been associated with a decreased risk of recurrence after radical prostatectomy [28]; this is also consistent with the negative coefficient of the gene in our signature. Thus, our gene signature of GS provides confirmation for four of the 30 genes included in the Penney et al. signature for GS.
We did not prioritize the genes based on PCa-related biological pathways or patient outcomes, however, our signature also includes three genes (CENPF, DLGAP5, and PRC1) represented in the cell cycle progression score [3], (Prolaris®), one gene (ANO7) in the metastasis-based Decipher™ panel [5], two genes (ANO7 and ESM1) in the PCa recurrence-based signature [6], and two genes (CENPE and CST2) in the signature for ML progression previously developed by our group [7]. Our GS signature provides confirmation for several of the genes involved in PCa related pathways and some previously highlighted in relation to adverse patient outcomes. In particular, evidence from multiple studies now corroborates the differential expression of ANO7, ASPN, CENPE, and CENPF in more aggressive PCa.
In addition to alterations in gene expression, other types of molecular biomarkers in prostate tumor tissue may reveal further perturbations in genomic pathways that mediate PCa progression. Thus, the ability to predict aggressive tumor behavior in PCa patients diagnosed  [30]. Of note, our gene expression signature of GS is highly correlated with that DNA methylation signature of GS (Pearson correlation = 0.57). However, in spite of the high correlation, only the ANO7 gene was highlighted in both signatures. One limitation of our study is that the gene expression data of the training set (TCGA) and the testing set (FH) were generated using different platforms. RNAseq data were available in TCGA and array-based data in the FH cohort. In general, RNA-seq read densities are known to be highly correlated with gene expression intensities from microarray measurements [31], and therefore, we used the averaged values without making any transformation. Out of the 49 genes in our signature, 17 (35%) genes had more than one transcript measured in the FH tumor samples. Therefore, we might have missed or oversimplified the differences in gene expression levels from alternative splicing. Another limitation of our study is the small number of cases with metastatic progression or who died of PCa. Since PCa that is clinically localized at diagnosis often takes decades to relapse as metastatic disease or cause disease-specific mortality, a long follow-up period is required to ascertain these outcome events. With a maximum follow-up of 17 years for recurrence (mean of 8 years) in our study, we observed 109 recurrence events including 27 ML events. Therefore, further evaluation of our signature in other PCa cohorts with larger numbers of ML events is warranted.
Gleason score is one of the best predictors of PCa prognosis, however, its utility for assessing prognosis at the time of diagnosis is limited by inter-pathologists variability in grading tumor tissue [32], and the substantial heterogeneity of outcomes in patients with intermediate GS 7 tumors [13]. To better predict subsequent outcomes in PCa patients diagnosed with clinically localized tumors, we developed a gene expression signature based on contrasting high (8-10) vs. low (≤6) GS tumors in TCGA. By focusing on patients in more extreme GS groups, the signature may be less likely to be influenced by inter-pathologists variability in reading GS and may be more predictive of adverse outcomes in patients with intermediate grade tumors. The signature was then tested in a prospectively followed FH-based patient cohort for its ability to predict long-term PCa outcomes. The GSbased gene expression signature significantly improved the prediction of recurrence and ML progression compared to standard clinicopathological parameters. Therefore, the signature may be a useful tool to help improve the prognostication of PCa patients.

Study population The Fred Hutchinson (FH) Cancer Research Center cohort
The FH cohort is composed of male residents of King County, Washington, who were diagnosed with PCa in 1993-1996 [33] or in 2002-2005 [34], ascertained through the Seattle-Puget Sound Surveillance Epidemiology and End Results (SEER) registry, and participated in population-based studies. For this analysis, only the subset of patients who underwent radical prostatectomy as primary treatment and who had gene expression and outcomes data available are included. The Fred Hutchinson Institutional Review Board approved the studies and all participants signed informed consent statements. Clinicopathological information including age at diagnosis, Gleason score, pathological tumor stage (local: pT2, N0/NX, M0; regional: pT3-T4 and/ or N1, M0), and prostate-specific antigen (PSA) level at diagnosis was collected from the SEER cancer registry. The participants were followed for recurrence status using detailed patient

RNA extraction and profiling
Formalin-fixed paraffin-embedded (FFPE) PCa tumor tissue blocks were obtained from radical prostatectomy samples and used to make H&E stained slides, which were reviewed by a PCa pathologist to confirm the presence and location of prostate adenocarcinoma. For each patient, two 1-mm tumor tissue cores were taken from areas enriched with ≥75% tumor cells from the dominant lesion. The RNeasy® FFPE Kit (Qiagen Inc., Valencia, CA) was used to isolate RNA from tissue cores. RNA samples were quantified with RiboGreen, aliquoted onto 96-well plates and shipped to Illumina for gene expression profiling. Tumor RNA samples from patients with various outcomes were randomly distributed across the plates and laboratory personnel were blinded to this information.
The WG_DASL® HT Assay (Illumina, Inc., San Diego, CA) was used for gene expression profiling. RNA was converted to cDNA using biotinylated oligo (dT) and random nonamer primers, and immobilized to a streptavidin-coated solid support. Prequalification of cDNA was assessed using quantitative RT-PCR and analysis of the housekeeping gene RPL13a. Biotinylated cDNAs were annealed to assay-specific oligonucleotides to create PCR templates that were amplified using labeled and biotinylated universal primers. Labeled PCR products were captured on streptavidin paramagnetic beads, washed, and denatured to yield single-stranded fluorescent molecules which were hybridized to the HumanHT-12 v4 Expression BeadChip. Samples were scanned using a BeadArray® Reader that reads the fluorescence intensities, and intensity data file images were extracted for 29,377 transcripts that map to 20,818 genes. Gene expression data were quantile normalized and log2 transformed using R statistical computing software. Low quality probes were filtered out with IlluminaHumanWGDASLv4.db in Bioconductor, leaving 26,051 transcripts for further analysis. Batch effects were removed using ComBat [35]. FH blind duplicate samples from 11 patients that were randomly distributed across the plates had correlations ranging from 0.98 to 0.99, and replicate samples from two patients that were included on every plate had mean correlations of 0.99.

The Cancer Genome Atlas
The TCGA dataset included 333 PCa patients [36]. Level three RNA-seq data were downloaded from the Cancer Browser (https://genome-cancer.ucsc.edu/), and expression data for 20,530 genes were available.

Statistical analysis
In the FH cohort, there were 26,051 gene expression probes in 17,923 genes. In case a gene had more than one probe, the average transcript level for that gene was calculated. There were 16,174 genes for which expression data were available in both the FH cohort and in TCGA. All 16K genes were used as input for feature selection using elastic net. Elastic net logistic regression was used to build a gene expression signature of patients with GS 8-10 vs. GS ≤6 tumors in TCGA. The analysis was done using the glmnet R package with the binomial family option and an elastic net mixing parameter α equal to 0.5 (α = 0 for ridge regression; α = 1 for lasso regression). Five-fold cross-validation was used to identify the value for lambda that resulted in the highest cross-validation area under the curve (AUC=0.96). The gene expression signature of patient j was calculated as follows: In the FH cohort, the correlation of the signature with GS was calculated. Kaplan-Meier curves and Cox proportional hazard models were used to investigate associations between quartiles of the signature and PCa recurrence. Models were adjusted for age at diagnosis and pathological stage. The signature's AUCs for recurrence and ML progression were estimated using the predicted probabilities from logistic regression (pROC in R) adjusting for age at diagnosis, pathological stage, and GS ≤6, 7 (3+4), 7 (4+3), and 8-10 groups. A likelihood ratio test was used to evaluate the ability of the signature to improve the prediction of PCa outcomes compared to a model with the clinicopathological parameters age at diagnosis, GS and pathological stage only (lmtest in R).
After that, we identified the genes for which the expression levels correlated with the signature (Bonferroni corrected P-value < 0.01). These genes were ranked based on the correlation coefficient and used as input for Gene Set Enrichment Analysis (GSEA) [37]. Gene sets used in the analysis were from the Molecular Signatures Database (MSigDB) hallmark gene set collection, which have been shown to reduce variation and redundancy thereby providing more refined and concise inputs for GSEA [38]. An enrichment score (ES) for each hallmark gene set was calculated, and these scores were normalized to account for the size of the gene set to yield a normalized enrichment score (NES) [37]. A FDR q-value threshold of 0.05 was used.

Author contributions
MAJ performed the data analysis, interpretation of the results, and drafted the manuscript. MSG developed the analysis approach, participated in interpretation of the results, and critically revised the manuscript. JLW helped with review and classification of patient outcomes. SK helped with collection of FH data. CA, MB, and JBF generated the gene expression data. EAO helped with processing of primary tumor tissue and RNA purification. JBF, ZF, and JLS designed the original study that provided the FH data and participated in interpretation of the results. All co-authors helped with critical revision of the paper and approved the final version.