Transcriptomic analysis reveals a WNT signaling pathway-based gene signature prognostic for non-small cell carcinoma

The value of combining multiple candidate genes into a panel to improve biomarker performance is increasingly emphasized. Genes associated with WNT signaling are widely-reported to provide prognostic signatures in non-small cell carcinoma (NSCLC). Screening of genes involved in this signaling pathway facilitated selection of an optimal candidate biomarker gene combination and development of an NSCLC prognostic model based on expression of these genes. Risk scores derived from the model performed well in predicting survival; in the training dataset, samples achieving a high risk score exhibit a shorter survival interval (median survival time 34.8 months, 95% CI 31.1-41.0) than did samples achieving a low risk score (median survival time 72.0 months, 95% CI 59.3-87.5, p=2e-11), and exhibited higher oncogene and lower tumor suppressor gene expression. Receiver-operator characteristic curves based on three-year survival demonstrate that the model outperformed clinical prognostic indicators. In addition, the model was validated in four independent cohorts, demonstrating robust NSCLC prognostic value. Correlation analyses reveal that the model offers efficacy independent of other clinical indicators. Gene Set Enrichment Analysis (GSEA) reveals that the model reflects variable tissue functional states relevant to NSCLC biology. In summary, the signature model shows potential as a valuable and robust NSCLC prognostic indicator.


INTRODUCTION
Lung cancer is the most common type of cancer worldwide, exhibiting a five-year survival rate of less than 15% [2]. In China, an estimated 733,300 incident cases and 610,200 deaths occurred in China in 2015 [1]. Non-small cell lung cancer (NSCLC) -a heterogeneous condition [5] consisting mainly of lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) subtypes -accounts for the majority (85%) of lung cancer cases [3]. Despite recognition of tobacco smoke as the major causative agent of lung cancer [4], exact pathogenic mechanisms remain unclear. Thus, identification of effective NSCLC prognostic biomarkers is urgently required.
Because genes belonging to a common signaling pathway have complementary functions, integrating information regarding expression of multiple genes effectively decreases the impact of NSCLC genetic heterogeneity on prognostication. The Wnt/β-catenin signaling pathway is implicated in regulating NSCLC progression, thus influencing prognosis [11].
Additionally, it has been demonstrated that genes involved in regulating WNT signaling also have NSCLC prognostic value. For example, up-regulated expression of HIF-2α-dependent lncRNA NEAT1 activates WNT signaling, thereby promoting NSCLC progression [12]. Similarly, Forkhead Box Protein P3 (encoded by FOXP3) also activates WNT signaling, facilitating proliferation and metastasis of NSCLC [13]. A similar function has been reported for Telomere-Associated Protein RIF1 (encoded by RIF1) [14].
However, clinical utility of the above biomarker candidates remains limited, owing to NSCLC molecular heterogeneity and relatively small study sample sizes, resulting in relatively low biomarker robustness. It is likely that a multi-gene biomarker panel focused on WNT signaling will outperform any single biomarker in accurately predicting NSCLC prognosis. The present study accessed data from five multi-center clinical cohorts to identify NSCLC prognostic biomarker candidates, and evaluated performance of the resulting multi-gene model.

Candidate gene screening and model development
Transcriptomic and clinical data from five independent primary NSCLC cohorts were retrieved from public databases, including The Cancer Genome Atlas (TCGA) and the Gene Expression Omnibus (GEO). Different cohorts employed different gene expression assay platforms. The largest of these cohorts (the training cohort)deriving from TCGA -was used to identify candidate prognostic biomarker genes and develop a prognostic model based on their expression level.
Briefly, signaling pathway gene member expression was evaluated for significant correlation (p < 0.01) with overall survival (OS). Panels of 2-7 combined candidate genes were used to generate prognostic models, producing risk scores predictively stratifying training cohort samples into high-and low-risk groups (risk score and survival data are detailed in Supplementary  Table 1). Model prognostic performance was evaluated by calculating significance (p < 0.05) of survival difference between risk score-predicted high-and lowrisk groups, retaining the best-performing prognostic model for survival prediction.

Performance of the WNT pathway gene prognostic model in the training cohort
As expected, OS interval was significantly lower in the high-risk group (median survival 34.8 months, 95% CI 31.1-41.0) than in the low-risk group (median survival 72.0 months, 95% CI 59.3-87.5) (p = 2e-11) ( Figure 1A). In addition, consistent with these results, DSS and progression-free survival (PFS) interval differences between high-and low-risk groups indicated that risk score correlated significantly with both (p = 1e-12 and p = 3e-11, respectively) ( Figure 1B, 1C). Metadata demonstrated that high-risk samples are characterized by earlier incidence, higher oncogene expression, and lower tumor suppressor gene expression ( Figure 1D).
Risk score performance was further evaluated -relative to performance of demographic and clinical indicators using receiver-operator characteristic (ROC) curves based on three-year survival. The area under the curve (AUC) for risk score, age, gender, tumor size, and stage was 0.6698, 0.5173, 0.5078, 0.5058, and 0.6757, respectively, indicating that stage and risk score outperformed the other indicators in discriminating prognosis. Since p-values for genes PPARD, CAMK2A, and CSNK1E exceed 0.05 (Table 1) when applying Cox multivariate regression, a model incorporating only

Figure 2.
Survival interval between risk score-predicted high-and low-risk groups (upper panel) was compared in independent cohorts AGING the remaining four (significantly associated) genes was also constructed. However, since this resulted in a significant survival difference between high-and lowrisk groups in only one out of five cohorts (Supplementary Figure 2) the other three genes do appear to contribute to NSCLC prognosis and the full seven-gene model was retained.
Within the training cohort, 28 patients exhibited distant metastasis, a clinically important indicator of the extent of NSCLC progression. Although risk scores of the metastasis-positive group trended higher than those of the metastasis-negative group, differences were not statistically significant (p = 0.1, data not shown).
Collectively, these results indicate that the WNT signaling gene expression-based model is effective in predicting NSCLC survival.

Validation of risk score performance using independent cohorts
In all independent cohorts, survival intervals are significantly lower in the high-risk group (p = 0.004,  Collectively, these results indicate that the WNT signaling gene expression-based model is robust, performing reproducibly across independent NSCLC cohorts and gene expression platforms.

Risk score characterization
Firstly, risk score was analyzed for correlation with existing clinical indicators (variables), and risk score is not significantly associated with age, tumor size, or pathological stage ( Figure 3A). Furthermore, Cox multivariate regression was implemented using risk score as well as clinical indicators ( Figure 3B). Results suggest that risk score has value independent from clinical indicators in predicting NSCLC clinical outcome.

Model bias estimation within subgroups
Radiotherapy is one the most important adjuvant therapies offered to NSCLC patients. Thus, samples were divided into radiotherapy-treated and -untreated subgroups for within-subgroup evaluation of risk score AGING performance. Consistent with abovementioned results, both high-risk groups exhibited longer survival intervals in the training cohort, relative to the low-risk groups ( Figure 4A). Similarly, analyzing samples subdivided according to stage ( Figure 4B) or LUAD versus LUSC subtype ( Figure 4C) demonstrated no model bias in predicting prognosis.
Results indicate that risk score performs favorably and is independent of existing clinical indicators, at least one treatment modality, and clinical as well as pathological disease subtypes.

Risk score reflects biological significance of gene expression patterns
Gene Set Enrichment Analysis (GSEA) was used to investigate high-and low-risk group transcriptomic data biological meaning, and results were visualized using the Cytoscape plugin EnrichmentMap. Enriched biological cassettes include up-regulation of cancer-promoting pathways, and down-regulation of cancer-suppressing pathways ( Figure 5). Results indicate that the gene expression-based risk score reflects the biological status of pathways underlying NSCLC.

DISCUSSION
Relevance of WNT signaling pathway functions has been widely emphasized for various cancers [11,15,16], including NSCLC. However, due to genetic and environmental heterogeneity [17][18][19] at all biological levels, even within the same tumor, it is anticipated that integrating information from multiple genes belonging to the WNT signaling pathway may significantly improve prognostication. The present study identifies an optimized panel of candidate biomarkers consisting of genes belonging to the WNT signaling pathway, and develops and evaluates a prognostic model based on expression of these genes.
The model incorporates seven genes: SFRP1, CSNK1E, CAMK2A, CCND2, FOSL1, PPP2R1A, and PPARD. Many of these likely play roles in cancer progression. For example, SFRP1 is known to inhibit Epithelial to Mesenchymal Transition (EMT) [20] in NSCLC cell lines, and has previously been suggested as a potential biomarker for NSCLC [21]. The prognostic role of CCND2 has been frequently reported in many cancer types [22], though effects are context specific. For example, it acts as a tumor AGING suppressor gene in NSCLC, while acting as an oncogene in gastric, bladder, and breast cancers. Functions of the remaining genes -including CSNK1E [23], FOSL1 [24], PPP2R1A [25], and PPARD [26] are widely reported as being relevant to cancer (if not specifically NSCLC) biology.

Sample enrollment and data processing
Primary NSCLC transcriptomic and clinical metadata from five independent cohorts -deriving from multiple centers employing different gene expression assay platforms -were retrieved from the GEO (Gene Expression Ominous) and TCGA (The Cancer Genome Atlas) public databases, requiring no additional informed consent or ethical approval.
Raw transcript counts downloaded from GEO were log2 transformed and normalized as described below. Detailed assay platform, sample number, tumor type, country, and normalization method data are shown in Supplementary  Table 6. Raw transcript counts for the TCGA cohort were downloaded using the University of California Santa Cruz (UCSC) Xena tool (https://xena.ucsc.edu/) and log2 transformed after 1% minimum value imputation. Only samples associated with complete survival information were included, with such clinical metadata consisting of three types of carefully curated survival endpoints: OS (overall survival), DSS (disease-specific survival, PFS (progression-free survival).

Candidate gene screening and prognostic model development and evaluation
Due to its large sample size, the TCGA cohort was used as the training cohort to identify candidate prognostic biomarker genes.
Over 170 human signaling pathways were downloaded from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (https://www.kegg.jp/) in order to generate lists of each pathway's member genes. Correlation between OS and gene expression was evaluated using Cox univariate regression analyses (as described below). Genes with a p-value of less than 0.01 were deemed significantly associated with OS and were retained as candidate prognostic biomarkers for downstream analysis.
For each pathway, a matrix of all possible remaining member gene combinations was generated. Although some pathways retained over 40 member genes significantly associated with OS, subsequent combinatorial evaluation of panel performance was restricted to 2-7 genes per combination, in order to remain within available computational resource constraints, as well as to facilitate eventual clinical utility. Multivariate Cox regression analysis was implemented for each panel, and risk score models were developed based on linear formula coefficients in combination with gene expression values. Risk score provided by a model was defined as bi is the coefficient of gene x and xi is the expression value of that gene. In each case, training cohort samples were divided into high-and low-risk groups using median risk score as the threshold value. The performance of each model was then evaluated by determining significance of survival difference (incorporating all three survival metrics) between risk score-predicted high-and low-risk groups. The bestperforming model was retained.
Risk score performance was further evaluated relative to performance of demographic and clinical indicators (age, gender, primary tumor size represented by tumor diameter, and pathological stage) using receiving operating characteristic ROC curves based on three-year survival. Finally, risk score performance was evaluated with respect to training cohort subgroups. The training cohort was divided into metastasis-negative andpositive subgroups, and between-group risk score predictive performance was compared.
Since the predictive model was derived from the training cohort, favorable risk score performance is expected in this cohort. Thus, model robustness was validated using four independent GEO cohorts: GSE30219, GSE4127, GSE42127 and GSE50081. As described for the training cohort, gene expression values and coefficients were adapted to generate a risk score for each sample (Supplementary Tables 2-5), median risk score was used as the threshold value discriminating high-and low-risk groups, and model performance was evaluated by significance of survival difference between groups. Additionally, risk score was analyzed for correlation with existing clinical indicators (variables), first binarizing each variable using subjective cut-off values. Age was divided into younger (< 60 years) and older (≥ 60 years) groups, primary tumor size was divided into small (< 1 cm diameter) and large (≥ 1 cm diameter) groups, and pathological stage was divided into early (stage 1-2) and late (stage 3-4) groups. Finally, risk score performance was evaluated with respect to clinically-relevant independent cohort subgroups, as described for training cohort metastasispositive versus -negative groups. Here, subgroups included samples deriving from radiotherapy-treated and -untreated patients, samples stratified by stage, and samples stratified by LUAD versus LUSC subtype.

AGING
Lastly, Gene Set Enrichment Analysis (GSEA) was used to investigate high-and low-risk group transcriptomic data biological meaning, and results were visualized using the Cytoscape plugin EnrichmentMap.

Analytical software and statistics
All analyses were performed using the R Project for Statistical Computing (https://www.r-project.org). R package "affy" was used for raw data processing and normalization. R function combn() was used to generate matrices of all possible combinations of genes within a pathway. R packages "survival" and "pROC" [27] were used for survival and ROC analyses, respectively. Heatmap was generated using R package"pheatmap". Survival differences between risk score-predicted highand low-risk groups were assessed using Student's ttest, with a significance threshold of p < 0.05. Gene Set Enrichment Analyses was performed with GSEA java software [28], and visualized with Cytoscape [29] plugin EnrichmentMap [30].

AUTHOR CONTRIBUTIONS
GH, GL and WX contributed the conception and design; GH administrated and supported. GL, MJ, LL, LL collected and analyzed the data, GL, GH, WX and PL interpreted the data. All authors wrote and approved the final manuscript.