A prognostic gene expression signature for oropharyngeal squamous cell carcinoma

Background Robust prognostic stratification of patients with oropharyngeal squamous cell carcinoma (OPSCC) is important for developing individualized treatment plans. This study was conducted to develop and validate a clinically feasible prognostic classifier based on transcriptome-wide gene expression profiles. Methods Tumor tissues were collected from 208 OPSCC patients treated at Washington University in St. Louis and 130 OPSCC patients treated at Vanderbilt University, used for model training and validation, respectively. OPSCC patients (n = 70) from the TCGA cohort were also included for independent validation. Based on RNA-seq profiling data, Cox proportional hazards regression analysis was performed to identify genes associated with disease outcomes. Then, Lasso-penalized multivariate survival models were constructed to identify biomarker genes for developing a prognostic gene signature. Findings A 60-gene signature was identified by RNA-seq profiling analysis. Computed risk score of the gene signature was significantly predictive of 5-year overall survival of the training cohort (Hazard ratio (HR) 28·32, P = 4·3E-41). Subgroup analysis stratified by HPV status revealed that the signature was prognostic in HPV-positive OPSCC patients (HR 30·55, P = 7·0E-37) and was independent of clinical features. Importantly, the gene signature was validated in two independent patient cohorts, including the TCGA cohort (HR 3·94, P = 0·0018) and the Vanderbilt cohort (HR 8·50, P = 5·7E-09) for overall survival. Interpretation The prognostic gene signature is a robust tool for risk stratification of OPSCC patients. The signature remains prognostic among HPV-positive OPSCC patients. Funding National Institutes of Health.


Introduction
Epidemiologic evidence has shown a rapid increase in the incidence rate of oropharyngeal squamous cell carcinoma (OPSCC) in the past several decades [1]. Accurate prognostication of patients with OPSCC is critical for providing effective individualized treatment plans. Currently, OPSCC prognosis is mainly based on human papillomavirus (HPV) status, lifetime tobacco cigarette smoking and tumor and nodal stage [2]. Although HPV-positive (HPV+) patients show better overall survival, disease outcomes still vary greatly, and rates of distant metastasis are similar to the HPV-negative (HPV-) population [3].
Multiple prognostic models for OPSCC have been developed in recent years based on specific domain knowledge of small subsets of genes, such as those focused on specific somatic mutations [4], HPV-promoter methylome [5], miRNA [6] and immune-related genes [7]. However, knowledge of the molecular mechanisms underlying observed clinical differences remains rather limited at present, and existing prognostic models have not been validated in large independent OPSCC patient population. Thus, there is an urgent need to identify robust prognostic biomarkers which further stratify risk in OPSCC patients.
In general, an integrated model consisting of multiple genes have greater predictive ability than single gene models [8]. To this end, traditional univariate or multivariate Cox regression models are commonly used to select biomarker genes for further building multi-gene signatures. However, conventional Cox regression models, without adequate consideration of variable selection, often suffer from increased risk of model overfitting when a large number of genes are included. To address this challenge, the least absolute shrinkage and selection operator (LASSO) [9] penalized Cox model is a useful method to implement variable selection while fitting the Cox model and has been widely applied for robust modeling of high-dimensional predictors. In this study, we sought to develop a robust prognostic gene signature for OPSCC based on transcriptional expression profiles of genes, as selected by LASSO Cox regression models. A patient cohort from Washington University in St. Louis was used as the training set to build the prognostic signature. An independent OPSCC cohort from Vanderbilt University as well as an OPSCC cohort from The Cancer Genome Atlas (TCGA) [10] were used to evaluate the general applicability of the signature. In this way, we have developed and validated an RNA expression based 60-gene signature that robustly predicted OPSCC survival across multiple institutions.

Patients cohorts
This study was approved by the Human Research Protection Office of the Washington University in St. Louis and the Human Research Protection Program of Vanderbilt University Medical Center. Informed patient consent was not required for this retrospective study. Two hundred and eight OPSCC cases treated at Washington University were included in this study as the training cohort. In addition, we obtained 130 cases treated at Vanderbilt University as a validation cohort. Clinical data were collected prospectively from these patients and then updated retrospectively after follow-up review.
For all the patients, formalin-fixed, paraffin-embedded (FFPE) tumor tissues were collected for pathologic analysis from biopsy or primary surgical resection. Tumor sections from these cases were stained with hematoxylin and eosin (H&E) and reviewed independently by pathologists at Washington University and Vanderbilt University to confirm the diagnoses. Tumor regions from each section were identified by H&E staining, and total RNA was then extracted from macrodissected tumor regions with the miRNeasy FFPE kit (Qiagen). In this way, we focused on expression profiling of the tumor tissues with minimal contamination from adjacent normal tissues.
From the TCGA GDC Data Portal (https://gdc-portal.nci.nih.gov/), normalized RNA-seq and clinical data for head and neck cancer (HNSCC) patients were retrieved. From this cohort, a total of 70 OPSCC patients were identified according to anatomic neoplasm subdivision annotations in the clinical data.

RNA-seq and sequencing data analysis
Details of the RNA-seq experimental protocol have been described previously [11]. In brief, ribosomal RNA (rRNA) was removed with the RiboMinus kit (Life Technologies) and custom DNA oligonucleotide probes. Then, the RNA was used to construct RNA-seq libraries with the NEBNext Ultra mRNA Library Prep kit (New England BioLabs). The resulting cDNA libraries were PCR amplified and subject to sequencing with Illumina HiSeq 3000 at Washington University Genome Technology Access Center. Raw sequencing reads were mapped to the NCBI reference sequence database [12] to identify human coding genes as well as aligned to annotated HPV genomes [13] to identify HPV transcripts with Bowtie [14]. Twenty-seven samples were excluded from a total of 338 samples due to low sequencing data quality (i.e. total mRNA read counts < 400,000). The read counts were normalized to reads per kilobase per million reads (RPKM). Genes with average RPKM <1 were excluded from further analysis. The RPKM expression values were further log2 transformed.

Construction of gene expression prediction model
Univariate Cox regression analysis was performed to evaluate the association between individual gene expression and disease outcome in the training data. A simple mean imputation approach was used to replace missing values in the clinical variables. The average percentage of missing values is 2.4% per clinical variable in the training cohort and 2.2% in the Vanderbilt validation cohort. Multivariate Cox regression analysis was performed to evaluate the independent prognostic value of individual genes after controlling for common clinicopathologic variables. LASSO was used to reduce model overfitting during the selection of high-performance biomarker genes with the glmnet R Package [15]. Specifically, an optimal value for the penalty parameter (λ1) was calculated using a ten-fold internal cross-validation for selecting variables in the final model. The modeling process was repeated 1000 times. From the 1000 iterations, different gene models were constructed for optimal prognostic performance. We then counted the occurrence of individual genes across all presented

Research in context
Evidence before this study Accurate prognostication of patients with oropharyngeal squamous cell carcinoma (OPSCC) is critical for providing effective individualized treatment plans. Although HPV-positive (HPV+) patients show better overall survival, there are approximately 20% of HPV+ OPSCC having dismal prognosis. We systematically searched PubMed up to September 15, 2019, with the search terms "mRNA", "OPSCC", "prognostic", and "biomarker", with no publication date or language restrictions. Of the OPSCC mRNA signature studies, several signatures were developed by analyzing small numbers of genes using limited numbers of tumor specimens. For example, high expression level of HER3 was found to be significantly correlated with poor overall survival (OS); patients with highly expressed nuclear PRMT5 had 1¢7 times higher hazard of death; an immune-related gene signature was reported to be prognostic of OS; a hypoxia-related gene signature was reported to be associated with poor progression-free survival; an HPV-correlated signature including 38 genes was prognostic of HPV+ OPSCC patients. Our search did not identify any previous transcriptome level studies that investigated the potential predictive role of mRNA expression signature in OPSCC.

Added value of this study
We did a cross-institutional study to develop an mRNA expression signature for predicting the risk of OPSCC progression. By performing transcriptome-wide RNA-seq profiling analysis for 408 OPSCC cases, a 60-gene signature was developed. Computed risk score of the gene signature was significantly predictive of 5-year overall, 5-year recurrence-free, and 5-year metastasis-free survival of the training cohort. Subgroup analysis stratified by HPV status revealed that the signature was prognostic in HPV+ OPSCC patients. We validated the performance of the 60-gene signature in two independent cohorts and the signature maintained its independent prognostic value after adjustment for the clinical variables. Our finding suggests that the 60-gene signature is a clinically feasible signature for OPSCC prognosis including the HPV+ OPSCC subgroup.

Implications of all the available evidence
We developed and validated a clinically feasible gene signature for the prognosis of OPSCC patients including the subgroup of HPV+ patients. The robust performance of the signature on multi-institutional OPSCC patients indicates its general applicability in clinical practice. models. Various cutoffs were considered to retain genes with occurrences >100, 200 or 300 to develop three respective gene signatures. The risk score of the signatures was calculated as follows: where p is the total number of genes retained in the final signature, i indicates the ith gene in the signature, z i and e i represent the z-score (i.e., the Cox regression coefficient of a gene divided by its estimated standard error) and the expression level of the ith gene, respectively. C-index from the Cox analysis was applied to evaluate the performance of the signatures. Patient stratification according to risk score was done using the optimal cut-off value as established in the training cohort by receiver operating characteristic (ROC) curve analysis for overall survival.

Statistical analysis
In our study, overall survival (OS) was defined as the time interval between the date of diagnosis and the date of death; recurrence-free survival (RFS) was defined as the time interval between the date of diagnosis and the date of first recurrence or death; metastasis-free survival (MFS) was defined as the time interval between the date of diagnosis and the date of first distant metastasis or death. The Kaplan-Meier survival curve and log-rank test were used to evaluate the performance of the gene signatures. Statistical data analysis was performed using the statistical programming language R (http://www.r-project.org/).

RNA-seq profiling to globally identify genes associated with patient survival
RNA-seq profiling was performed for 208 OPSCC patients treated at Washington University in St. Louis. Total RNA isolated from FFPE tumor specimens is usually highly degraded, and thus not suitable for conventional RNA-seq profiling analysis. To address this challenge, we developed a sequencing library construction method using custom designed probes for rRNA removal (see Methods for details). In this way, most rRNA reads were successfully removed from the constructed libraries. On average, we obtained 12 million total reads per sample. After removing 11 samples with low sequencing data quality, 197 cases were included in the training cohort for discovering prognostic biomarker genes (Table 1). Among them, 32.5% (n = 64) of the patients received surgery with adjuvant chemoradiation therapy, 35.0% (n = 69) received surgery with adjuvant radiation therapy, 16.2% (n = 32) received chemoradiation therapy and 11.2% (n = 22) received surgery only. The median follow-up time was 60 months, with 49 of the 197 patients (24¢9%) deceased during the study period. We performed univariate Cox regression analysis to identify genes whose RNA expression was correlated with 5-year (5yr) overall survival in the training cohort. The results revealed that 1928 protein-coding genes were associated with 5-yr OS (P < 0¢05, Wald test). The independent prognostic values of these genes were evaluated further by controlling for treatment protocols (chemotherapy and radiotherapy status) with multivariate Cox regression analysis. In this way, 1436 genes were selected as biomarker candidates for further model development (P < 0¢05, Wald test).

A gene expression signature to predict patient survival
To further identify the most promising gene biomarkers, we performed LASSO-penalized multivariate Cox modeling with ten-fold cross-validation. Across 1000 iterations, 16 different gene expression models were constructed for optimal survival prediction. The frequency counts of these different models ranged from 1 to 681. A total of 77 genes were represented in all the models combined, with a range of 6 to 59 genes in individual models. The frequency counts of the 77 genes were in the range of 5 to 700 across 1000 iterations (Supplementary Table S1). We established three gene signatures based on various cutoffs of the gene occurrence (>100, 200 or 300). We found that the signature consisting of 60 genes (n>200) had the best prognostic performance (C-index = 0¢89). Thus, this signature was selected for further analysis. Among the 60 genes in this signature, 26 had positive Cox coefficients, indicating that higher expression levels were associated with poor survival outcomes. On the other hand, 34 genes had negative Cox coefficients, indicating higher expression levels were associated with favorable outcomes (Supplementary Table S2). A risk scoring system was established by summarizing weighted expression values of the 60 genes (see Methods). The z-scores in the univariate Cox models were used as weighting factors to compute the risk score. In this way, we calculated the risk score for each patient in the training cohort. ROC analysis was performed to define the optimal cut-off for patient risk stratification, revealing a sensitivity of 80% and a specificity of 96% at the cut-off value of À158¢3 (Supplementary Figure S1). Patients were divided into high-risk and lowrisk groups based on the optimal cut-off value. In this way, 53 patients were predicted to be of high-risk (with score >À158¢3) and 144 of low-risk (with score < À158¢3). Expression heatmap analysis indicates that the 34 favorable and 26 unfavorable genes had enriched expression in patients with low-and high-risk scores, respectively (Supplementary Figure S2). In total, 86¢7% of the patients in the high-risk group vs. 6¢6% in the low-risk group died during the study period. Kaplan-Meier survival analysis indicated that the death risk for high-risk patients was significantly higher compared with that for low-risk patients (HR 28¢32, 95% CI 13¢90À57¢66, P = 4¢3E-41, Log-rank test; Fig. 1a). Moreover, the risk score was also demonstrated to be prognostic of locoregional and distant failures (HR 17¢28, 95% CI 9¢49À31¢46, P = 7¢9E-35 for 5-y RFS and HR 22¢89, 95% CI 12¢03À43¢55, P = 2¢5E-40 for 5-y MFS, Log-rank test; Fig. 1b and c).

The 60-gene signature was independent of clinicopathologic features and HPV status
We assessed whether the 60-gene signature had independent prognostic value in the context of clinicopathologic parameters including age, sex, race, smoking history, TNM stage, radiotherapy and chemotherapy status. With univariate Cox analysis, the 60-gene signature were significantly associated with 5-yr OS in the training cohort (P = 3¢1E-20, Wald test, Table 2). Multivariate Cox analysis further exhibited that the signature retained its independent prognostic  Table 2). It is well-known that HPV+ OPSCC is associated with better outcome than HPV-OPSCC. Given the strong prognostic impact of the HPV status on survival, we further assessed whether the 60-gene signature could significantly prognosticate both HPV+ and HPV-patients in the training cohort. The HPV status was determined by RNA-seq analysis as described in Methods. Patients were stratified into two groups according to their HPV status (167 HPV+ cases and 30 HPVcases). The 60-gene signature was assessed for its performance by applying to these two patient subgroups separately. Survival analysis indicated that the signature risk score was prognostic of 5-yr OS (HR 30¢55, 95% CI 13¢86À67¢34, P = 7¢0E-37, Log-rank test), 5-yr RFS (HR 17¢57, 95% CI 8¢96À34¢49, P = 6¢6E-29, Log-rank test) and 5-yr MFS (HR 24¢11, 95% CI 11¢75À49¢48, P = 5¢1E-35, Log-rank test; Fig. 2a-c) for HPV+ patients. Similarly, the risk score was also a prognostic predictor for HPV-patients, although with less significant p-values (p-values for survival outcomes ranging from 8¢8E-05 to 2¢0E-04, Log-rank test) as compared with HPV+ patients, partly due to the smaller sample size (Fig. 2d-f). Importantly, the 60-gene signature also maintained its independent prognostic value in the HPV+ patient subgroup as evaluated by both univariate analysis (HR 30¢55, 95% CI 13¢86À67¢34, P = 2¢3E-17, Wald test) and multivariable analysis adjusting for clinical features (HR 30¢82, 95% CI 13¢01À72¢98, P = 6¢5E-15, Wald test; Table 2). Expression heatmap analysis indicates that the favorable and unfavorable genes in the signature had enriched expression in HPV+ OPSCCs with low-and high-risk scores, respectively (Supplementary Figure S3). Thus, the prognostic value of the 60-gene signature was independent of both the clinicopathologic features and HPV status.

Discussion
In this study, we developed a comprehensive gene expression signature for OPSCC prognosis by RNA-seq transcriptome profiling. Our analysis indicated that the signature was independent from the clinical parameters as well as HPV status. Previous studies have identified multiple prognostic signatures for the general HNSCC population. However, signatures developed in this way may not perform well for OPSCC patients as OPSCC is distinctively different from most other subtypes of HNSCC. Attempting to address this challenge, several prognostic gene signatures were developed specifically for OPSCCs [16À18]. For example, high expression level of HER3 was found to be significantly correlated with poor OS [16]; patients with highly expressed nuclear PRMT5 had 1¢7 times higher hazard of death [17]; an immune-related gene signature was reported to be prognostic of OS [7]; a hypoxia-related gene signature was reported to be associated with poor progression-free survival [18]; an HPV-correlated signature including 38 genes was prognostic of OPSCC patients [19]. However, these existing signatures were developed by analyzing small numbers of genes using limited numbers of tumor specimens. In contrast, the present study generated transcriptome-wide RNAseq profiling data for over 300 OPSCC patients treated at two institutions. To date, this is the most comprehensive study characterizing RNA expression in OPSCCs. In this way, we have developed and independently validated a 60-gene signature for risk stratification of OPSCC patients.
HPV status has been shown to be a robust prognostic marker in OPSCC. The favorable prognosis of HPV+ patients leads to various deintensification strategies aiming to maintain high survival rates while reducing treatment toxicity [20]. Currently, this is an active area of clinical research, and multiple studies have demonstrated promising results from deintensification clinical trials for HPV+ OPSCC patients [21]. Despite favorable prognosis in general, there are approximately 20% of HPV+ OPSCC patients who died of disease [2,22]. It is hypothesized that there may be distinct genetic subtypes of HPV+ OPSCC, resulting in variable responses to therapy [23]. Thus, deintensification may not be appropriate for all HPV+ OPSCC patients. In our study, the 60-gene signature was robust at distinguishing high-and low-risk patients in the HPV+ patient group. Thus, our gene signature could contribute to ongoing trials by improving the selection of low-risk HPV+ OPSCC patients for deintensification treatment.
Multiple genes in the 60-gene signature were previously implicated in cancer progression. For example, LYN and MEI1 were included as favorable genes in the signature. LYN has the capacity to stabilize focal adhesion complexes and its overexpression were associated with better OS in acute myeloid leukemia [24,25]. Similarly, patients with higher expression levels of MEI1 have better OS than those with lower expression levels in cervical cancer [26]. Another gene, TRIB3, was an unfavorable gene in the signature. It has recently been identified as a stress sensor in response to various tumor microenvironments [27]. In addition, TRIB3 also shows a significant correlation with worse OS in colon cancer [28]. To further evaluate the functional roles of the genes in the signature, DAVID was used to identify gene oncology functional enrichment [29]. The most significantly enriched terms include regulation of glucose transport and inflammatory response. Previous studies indicate that glucose metabolism is a potential therapeutic target in cervical cancer, and high tumor inflammatory response predicts better survival in OPSCC [30,31]. Thus, the roles as well as the underlying mechanisms of these biomarker genes in OPSCC warrant further investigation.
In summary, we developed and validated a clinically feasible gene signature for OPSCC prognosis. The robust performance of the signature on multi-institutional OPSCC patients indicates its general applicability in clinical practice.

Author contributions
XL led the bioinformatic and biostatistical data analysis. PL performed the RNA-seq experiments and updated the clinical data. XL and PL contributed equally to this study. R.D.C, K.A.LK, J.S.L, H.A.G and W.L.T provided the tumor specimens and patient clinical data. JL provided statistical guidance throughout the study. XW contributed to the study design and project supervision. All authors contributed to writing and reviewing the manuscript and approved the submitted version.

Declaration of Competing Interest
The authors have no conflicts of interest to declare.