Differential expression patterns of housekeeping genes increase diagnostic and prognostic value in lung cancer

Background Using DNA microarrays, we previously identified 451 genes expressed in 19 different human tissues. Although ubiquitously expressed, the variable expression patterns of these “housekeeping genes” (HKGs) could separate one normal human tissue type from another. Current focus on identifying “specific disease markers” is problematic as single gene expression in a given sample represents the specific cellular states of the sample at the time of collection. In this study, we examine the diagnostic and prognostic potential of the variable expressions of HKGs in lung cancers. Methods Microarray and RNA-seq data for normal lungs, lung adenocarcinomas (AD), squamous cell carcinomas of the lung (SQCLC), and small cell carcinomas of the lung (SCLC) were collected from online databases. Using 374 of 451 HKGs, differentially expressed genes between pairs of sample types were determined via two-sided, homoscedastic t-test. Principal component analysis and hierarchical clustering classified normal lung and lung cancers subtypes according to relative gene expression variations. We used uni- and multi-variate cox-regressions to identify significant predictors of overall survival in AD patients. Classifying genes were selected using a set of training samples and then validated using an independent test set. Gene Ontology was examined by PANTHER. Results This study showed that the differential expression patterns of 242, 245, and 99 HKGs were able to distinguish normal lung from AD, SCLC, and SQCLC, respectively. From these, 70 HKGs were common across the three lung cancer subtypes. These HKGs have low expression variation compared to current lung cancer markers (e.g., EGFR, KRAS) and were involved in the most common biological processes (e.g., metabolism, stress response). In addition, the expression pattern of 106 HKGs alone was a significant classifier of AD versus SQCLC. We further highlighted that a panel of 13 HKGs was an independent predictor of overall survival and cumulative risk in AD patients. Discussion Here we report HKG expression patterns may be an effective tool for evaluation of lung cancer states. For example, the differential expression pattern of 70 HKGs alone can separate normal lung tissue from various lung cancers while a panel of 106 HKGs was a capable class predictor of subtypes of non-small cell carcinomas. We also reported that HKGs have significantly lower variance compared to traditional cancer markers across samples, highlighting the robustness of a panel of genes over any one specific biomarker. Using RNA-seq data, we showed that the expression pattern of 13 HKGs is a significant, independent predictor of overall survival for AD patients. This reinforces the predictive power of a HKG panel across different gene expression measurement platforms. Thus, we propose the expression patterns of HKGs alone may be sufficient for the diagnosis and prognosis of individuals with lung cancer.

cancer markers across samples, highlighting the robustness of a panel of genes over any one specific biomarker. Using RNA-seq data, we showed that the expression pattern of 13 HKGs is a significant, independent predictor of overall survival for AD patients. This reinforces the predictive power of a HKG panel across different gene expression measurement platforms. Thus, we propose the expression patterns of HKGs alone may be sufficient for the diagnosis and prognosis of individuals with lung cancer.

INTRODUCTION
In 1965, Watson characterized housekeeping genes (HKGs) as essential genes, those ''expressed in all tissues'' (Watson & Levinthal, 1965). Since then, it has been further refined as genes required for the maintenance of functions essential for a cell's existence, ubiquitously expressed across tissue type and developmental or cell cycle stage (Eisenberg & Levanon, 2013). In addition, other studies have suggested several unique genomic features of HKGs. For example, HKGs were shown to have shorter introns and exons (Vinogradov, 2004;Eisenberg & Levanon, 2013), lower conservation of promoter sequences (Lawson & Zhang, 2008), and protein products enriched in some domain families (Lehner & Fraser, 2004). However, one important notion that needs to be re-examined is the assumption that HKGs maintain constant expression levels across all cells and conditions (Yang et al., 2002;Rubie et al., 2005).
The advent of high-throughput screening technologies such as microarrays and RNA-Seq provides the ability to formulate a more concrete description of HKGs on the genomic scale. One such early, large-scale study examined the expression levels of 7,000 genes in 11 different human adult and fetal tissues, from which Warrington et al. (2000) identified 535 HKGs that were expressed in fetal development and throughout adulthood in all tissues. Subsequently, Hsiao et al. (2001) analysed the expression pattern of 7,070 genes across 19 human tissue types to identify 451 HKGs, 358 of which were common to Warrington's list. These two studies were particularly important as they highlighted that HKGs, while constitutively expressed across tissues, did not maintain constant expression levels (Warrington et al., 2000;Hsiao et al., 2001). Rather, Hsiao et al. (2001) demonstrated that their expression patterns were sufficient to differentiate between human tissue types.
Since HKGs can distinguish one normal human tissue type from another, it is natural to ask whether these same genes may be used in discriminating between diseased tissues. Cancer is a multifactorial disease whose characteristics shift with time and space. In a study focused on breast neoplasms, it was revealed that the most frequently used traditional HKGs (e.g., GAPDH, ACTB and TUBA1A) appeared significantly altered in expression levels from one sample to the other (Janssens et al., 2004). Byun, Logothetis & Gorlov (2009) further noted that HKGs are more likely to be differentially expressed in prostate tumorigenesis, perhaps indicating their driving role in cancer development.
For this study, we focus on a subset of 374 of the 451 HKGs (originally identified by Hsiao et al., 2001) that are common to the three standard Affymetrix microarrays containing human genes: HuGene-FL (∼7000 unique sequences), HG-U95A (∼12,000 unique sequences), and HG-U133A (∼22,000 unique sequences) (Table S1). Using this set of 374 HKGs, we can extend our analysis to disease studies using different expression platforms ranging from multiple array types (HuGene-FL and HG-U95Av2) to RNA-Seq. Our goal is to test whether the expression patterns of a cluster of HKGs can serve as markers in fingerprinting human lung cancer.

MATERIAL AND METHODS
Affymetrix oligonucleotide microarray data was collected from online databases, processed, and then stored in a relational database. The database for lung cancers, containing 13 normal lung samples, 89 adenocarcinomas of the lung (AD), seven small cell lung carcinoma (SCLC), and 24 squamous cell lung carcinoma (SQCLC), using the Affymetrix U95A arrays were downloaded from the Broad Institute (http: //portals.broadinstitute.org/cgi-bin/cancer/publications/view/62). Additional microarray datasets using the Affymetrix U133 arrays, GDS4794 (23 SCLC, 42 Normal), GDS3627 (40 AD, 18 SQCLC), and GDS3257 (58 AD, 49 Normal) were downloaded from NCBI's database (https://www.ncbi.nlm.nih.gov/gds/) and used for classification of lung cancers. The Affymetrix microarray expression values were computed in each study using standard analysis tools such as the Affymetrix Microarray Suite (MAS) software, RMA, or gcRMA. Then for global normalization of each microarray dataset, the average expression signal in an array was made equal to 100.
The classification of samples for survival was performed with HKG expression data from Agilent Whole Human Genome Microarrays GSE13213 (117 AD) as the training set and the RNA-Seq data of 519 AD samples as the testing set analysed using normalized RSEM values from the TCGA-LUAD study, downloaded through UCSC Xena Browser (http://xena.ucsc.edu/).
Differentially expressed genes between pairs of sample types were determined using two-sided, homoscedastic t -test followed by FDR (Benjamini-Hochberg) correction with Microsoft Excel. All differences in the mean log (expression levels) between samples of two groups (e.g., lung cancer versus normal lung) in the training set were determined to be statistically significant if p < 0.05 following FDR (Benjamini-Hochberg) correction. In every application the use of different microarray datasets as ''training'' and ''testing'' sets as well as across different gene expression platforms (i.e., microarrays and RNA-seq) serve to minimize the effects of overfitting that are typically seen when elucidating differentially expressed genes (Tinker, Boussioutas & Bowtell, 2006).
Hierarchical clustering provided unsupervised classification of normal tissue and tumours according to relative variations in gene expression patterns of the 374 HKG's. Hierarchical clustering computations were executed with GENE-E (https: //software.broadinstitute.org/GENE-E/). Although the choice of clustering algorithm is somewhat subjective (i.e., there is no ''correct'' cluster algorithm for all applications) (Quackenbush, 2006), the analysis parameters we utilized in this study have been shown to accurately and reproducibly distinguish between various normal tissues based on expression levels of HKGs reliably expressed in all specimens (Hsiao et al., 2001). Samples analysed using both HuGeneFL and U95Av2 microarrays were used in cluster analysis.
Principal component analysis (PCA) and corresponding 95% confidence ellipses were performed with the R package, FactoMineR (http://factominer.free.fr). Equal numbers of samples for PCA were randomly selected from each group in a given testing set.
Survival analysis was performed over 451 HKGs with the TCGA-LUAD study. Samples missing ''Overall Survival'', age, gender, and status of mutational indicators of lung cancer (KRAS, EGFR, and ALK) were excluded. To identify significant predictors of overall survival we used uni-and multi-variate cox-regressions and to examine the strength of gene expression-guided risk groups we utilized Kaplan-Meier survival analyses, both of which were performed with SPSS. Risk scores were derived from the Cox proportional hazards regression model, with the median score (−0.177) separating ''high-risk '' and ''lowrisk'' groups (Parker et al., 2008). Gene Ontology examinations were through PANTHER (http://pantherdb.org), which utilizes binomial distribution test and Bonferroni correction in overrepresentation analyses.

Housekeeping genes alone are sufficient to distinguish normal tissue from lung cancers
Our previous work demonstrated that the 451 HKGs are sufficient to distinguish one normal human tissue from another (Hsiao et al., 2001). In this study, we explored the possibility that 374 common HKGs can also be used to distinguish between normal and diseased tissues across different gene expression platforms and laboratories. Samples for lung cancers and normal lung tissues were used for this analysis. Expression data of 49 normal lung tissues and 58 AD from GDS3257 using the Affymetrix U133A arrays at the NIH were used as a ''training set''. Two-tailed t -test highlighted 242 differentially expressed HKGs between the AD and normal lung tissues. These HKGs were then used to classify the microarray data from the Broad Institute using the Affymetrix U95A arrays (''testing set''). The resulting hierarchical cluster demonstrates that HKGs can separate the normal lung from AD (Fig. 1A).
The same process was repeated with the goal of identifying gene expression patterns between normal tissue, SCLC and SQCLC. Once again, training sets were utilized to select for differentially expressed genes. For SCLC analysis, 42 normal and 23 SCLC samples for the training set were from GDS4794 using the Affymetrix U133 Plus 2 arrays in Japan. In the SQCLC study, 12 SQCLC and six normal randomly selected samples from the Broad Institute were used for the training set. Distinct clustering was accomplished with 245 HKGs in 13 normal versus seven SCLC test samples from the Broad (Fig. 1B) and 99 HKGs in the remaining seven normal versus nine SQCLC test samples also from the Broad dataset ( Fig. 1C; Table S2).
Comparison of the three gene lists used in the Fig. 1 allowed us to identify 70 differentially expressed HKGs that are common among the lung cancer samples such as 26S protease Figure 1 HKGs alone are sufficient to distinguish normal lung tissue from various lung cancers. Using 374 HKGs identified by previous study (7), hierarchical clustering analysis shows that (A) 242 genes identified in a training set have significant differential expression that are sufficient to separate normal lung from most lung adenocarcinoma in an independent test set; while (B) 245 genes identified in a training set can distinguish normal lung from small cell carcinoma in a test set and (C) 99 genes from a training set can distinguish normal lung from squamous cell carcinoma in a test set.  Table S3). We report here that the expression patterns of these 70 HKGs alone could separate six normal lung tissues from 18 (six AD, six SCLC, six SQCLC) cancerous tissues in the Broad test set using principal component analysis (Fig. 2B).
The results were further confirmed via unsupervised hierarchical clustering and k-means clustering analysis (Figs. S1, S2). Ontology analysis revealed that these 70 genes were largely involved in the most common biological processes such as those involved in metabolism, cell cycle regulation, and stress and immune response (Table 1). Many studies have shown that the four most represented pathways of the 70 HKGs (integrin signalling, ubiquitin proteasome, EGF receptor signalling, and FGF signalling) have been reported to regulate cancer growth and metastasis in multiple cancer types (Czubayko et al., 1997;Tamura et al., 1999;Frezza, Schmitt & Ping Dou, 2011).

The 70 diagnostic housekeeping genes exhibit low variation across and within multiple lung cancer subtypes compared to current disease markers
We used six known lung cancer markers, epidermal growth factor receptor (EGFR), anaplastic lymphoma receptor tyrosine kinase (ALK), GTPase KRas (KRAS), hepatocyte growth factor receptor (MET), phosphatidylinositol 4,5-bisphosphate 3-kinase catalytic subunit alpha isoform (PIK3CA), and ret proto-oncogene (RET) to compare the coefficient of variation (CV) of our 70 HKGs set with cancer datasets. Using the Broad dataset, we showed that KRAS has the lowest average CV, 0.54, among the six selected markers across 120 lung cancer samples. In contrast, 62 out of our 70 genes had an average CV across the same samples less than 0.54 (Table S3). Examining the 120 lung cancer samples, only UBE2C and dual specificity protein phosphatase 1 (DUSP1) had average CVs greater than 1 (1.06 and 1.04, respectively). The low variance of these 70 genes across three major lung cancer types suggests a common set of genetic drivers may be involved in lung cancer development. In addition, cancer subtype analysis showed that the known cancer markers have relatively high variation within the subtypes, for example Table 2 shows that ALK had a CV of 5.19 in AD of the lung, 1.69 in SCLC and 10.33 in SQCLC. Similarly, RET had a CV of 0.44 in SCLC, 4.90 in AD and 5.10 in SQCLC. In contrast, among our 70 genes, Table S3 shows that UBE2C has the greatest CV, 1.91 in AD, markedly lower than the CVs seen in commonly known cancer marker, ALK and RET. Furthermore, Table 2 shows that the 10 HKGs with the lowest CVs are smaller than 0.34. The relatively low variance of the 70 HKGs within the normal samples and the cancer subtypes may improve their clinical reliability in serving as cancer biomarkers.

Housekeeping genes alone can also differentiate lung adenocarcinoma from squamous cell carcinoma
We then assessed the ability of HKGs to distinguish two subtypes of non-small cell carcinoma (NSCLC). Utilizing GDS3627 data, with 40 AD and 18 SQCLC selected as the training set, we identified 106 differentially expressed genes between AD and SQCLC (Table S4). Additional assessment with PCA showed that the expression patterns of these  (Fig. 3). While neurofilament light polypeptide (NEFL) and dishevelled segment polarity protein 3 (DVL3) show fold changes of 3.26 and 2.09, respectively, the remaining 104 HKGs have fold changes less than 2. A total of 34 different pathways were associated with the 106 HKGs. Examination of 15 pathways containing the greatest number of the 106 HKGs reveals those critical in cancer development such as apoptosis signalling, angiogenesis, p53, and glycolysis (Table 3). Furthermore, many genes are involved in multiple different pathways. For example, A-Raf proto-oncogene, serine/threonine kinase (ARAF) appears in the angiogenesis pathway as well as integrin, VEGF, FGF, and EGF signalling. In addition, ARAF comprises known functions involving cancer development such as activation of MAPK/ERK (associated with cell growth) and binding to pyruvate kinase isozymes M1/M2 (PKM2), critical for the Warburg effect. Furthermore, the 106 HKGs were also reported to link to diseases such as Alzheimer, Parkinson, and Huntington. Interestingly, the differentially expressed HKGs associated with the ubiquitin proteasome pathway all had lower expression in AD compared to SQCLC. The wide range of diseases and processes associated with the 106 HKGs highlights their roles in basic cellular processes and maintenances. Table 2 Top 10 HKGs with the lowest average coefficient of variation. Among the 70 shared HKGs, these 10 genes have a CV less than 0.34 and a maximum difference in CV of 0.167 (in HYAL2) across cancer subtypes.

Housekeeping gene expression is an independent predictor of overall survival in lung Adenocarcinoma
Much of the focus on individual gene expression has been on its value as a prognostic marker in disease states (Beer et al., 2002;Shedden et al., 2008). Here we demonstrate that a panel of HKGs may be a valuable tool in predicting overall survival for AD. The list of HKGs was first ranked by log-rank test P-values gained through Cox-regression analysis of the training set ( GSE13213) data for 116 patients using Agilent arrays. Survival times for the 116 patients are listed in Table S5. Utilizing the resulting list of 13 significant HKGs (p < 0.05, unadjusted) from the training set (Table 4), samples from the testing set (TCGA-LUAD, RNA-Seq data for 576 patients) were plotted on a continuous risk curve; the median score (−0.177) separated ''low'' and ''high'' risk groups. These risk groups, classified by expression of 13 HKGs, showed average difference of 2,078 days in overall survival (mean low risk-mean high risk) (Table 5). To test whether this response was a viable and independent predictor, we performed multivariate analyses to include age, gender, and the presence of known mutational drivers (KRAS, EGFR, and ALK) ( Table 6). Our risk group classification was the only significant estimator of overall survival. Interestingly, most of the 13 individual HKGs were not individually significant predictors of overall survival in the testing set. Indeed, the greatest differences were an average 2.32-fold decrease in pleckstrin and Sec7 domain containing (PSD) and 2.07-fold increase in 60S ribosomal protein L27 (RPL27) in the high-risk group (Table 4). In the testing set, all but    three of these HKGs were not individually significant (p > 0.05); however, their removal from the panel led to a marked decrease in predictive value. As PSD was determined as the most significant individual gene in our testing set and had one of the highest fold changes, we utilized PSD alone to determine risk groups. Our results show a marked reduction in differences of cumulative hazard between the two groups from classification by 13 HKGs (Fig. 4). This finding reinforces the value of a panel of genes over any one specific biomarker in disease prognosis. While 5 out of the 13 HKGs are ribosomal protein-related genes, not all the ribosomal protein encoding genes responded in the same way. RPL14, RPL31, and RPL27 all displayed increases in high-risk individuals whereas RPL9 and RPL7 were decreased. Despite studies having examined the roles of RPLs in the regulation of the MDM2/MDMX-p53 cascade, much remains unknown on how individual RPLs may function in cancer progression (Zhang et al., 2003;Zhou et al., 2015).

DISCUSSION
Much of the efforts studying disease states using microarray and RNA-Seq technology have focused on the identification of ''specific disease markers'' (Bubendorf et al., 1999;Golub et al., 1999;Glynne et al., 2000). Although the value of these efforts is undeniable, it remains a challenge to find suitable analytical tools that will provide reliable conclusions. As we have discussed previously, the gene expression patterns in any given sample represents the specific cellular states of the sample at the time of collection (Hsiao et al., 2001). In other words, cancers at different stages may express different markers. For example, the use of prostate-specific antigen in testing for prostate cancer has yielded much criticism due to its high false positive rates (Potts, 2000). Similarly, while BRCA1 does significantly increase risk of breast cancer, germline mutations of this gene only account for 3% of all breast cancer cases (Whittemore, Gong & Itnyre, 1997). Therefore, there have been efforts towards examining the relationship between sets of genes or utilizing a panel of biomarkers. For example, our previous study highlights that using 3 sets of ratios, instead of one set, between two genes increased differential diagnosis accuracy from 90% to 99% in lung cancer and mesothelioma (Gordon et al., 2002).
HKGs offer a potential solution to the problems of specific markers. Their constitutive, and relatively high, expressions in all cells provide assurance of being identifiable in any sample (Caracausi et al., 2017). Furthermore, studies have shown that HKGs cannot be assumed to maintain constant expression levels in all cells and conditions (Hsiao et al., 2001;Barber et al., 2005). Others have also examined the states of HKGs in cancers; Blanquicett et al. notes that 15 traditional HKGs (e.g., PGK, GAPDH, and β-Actin) demonstrate significant expression variations between carcinomatous and normal liver samples (Blanquicett et al., 2002). Similarly, Rubie et al. found that HKGs that code for the metabolic enzymes, PGK and GAPDH, show high up-regulation in cancerous versus normal tissues from the pancreas, stomach, and colon (Rubie et al., 2005). In our study, we show that the expression patterns of 70 HKGs alone can significantly differentiate between normal and lung cancer samples.
Our study also highlights that these 70 HKGs maintain low levels of variance across tumour samples when compared to accepted markers of lung cancer (e.g., KRAS, ALK, RET). As HKGs are largely responsible for basic cellular maintenance, this suggests that our identified list of HKGs may be the common drivers of lung cancer development. While some of the 70 HKGs (e.g., PSMD2) have been identified by other genome-wide association studies as relevant in single subtypes of lung cancer development (AD), many have not been fully examined, adding to the list of possible targets for therapeutic development (Matsuyama et al., 2011). In addition, other HKGs have been seen to drive pathogenesis across tissue types. For example, overexpression of ubiquitin-conjugating enzyme E2C (UBE2C) has recently been thought to play a role in not only NSCLC, but also gastric, colorectal, and breast cancers (Wang et al., 2015;Pellino et al., 2016;Yang et al., 2016). Despite this low variance of HKGs across lung cancer samples, our study shows that a panel of only 106 differentially expressed HKGs was able to act as a class predictor between two subtypes of NSCLC (SQCLC and AD).
Furthermore, the expression patterns of 13 HKGs alone provided significant prognostic value, estimating 2,078 days in overall survival difference between the high and low risk groups. These results are independent of traditional clinical markers. Importantly, we utilized separate datasets for training and testing to circumvent issues of overfitting. Moreover, our training set being microarray data and our testing set being from RNA-Seq demonstrates the robustness of a panel of 13 HKGs across different gene expression measurement technologies.
The results of this study are consistent with the idea that while HKGs may maintain relatively stable gene expressions among similar tissue-type samples, their expression profiles remain tissue and disease specific (Warrington et al., 2000;Hsiao et al., 2001). The disease-specificity of HKG expressions demonstrates the predictive capabilities for lung cancers, while its ubiquitous and low intra-tissue variance allows for its reproducibility across different microarray platforms and RNA-seq technology. However, there remains translational challenges. For example, it has been shown that different microarray platforms displayed varying degrees of repeatability, reproducibility, and consistency (Consortium, 2006;Shi et al., 2008). Due to lack of public datasets, this study was largely constrained to two microarray types (HuGeneFL and U95v2).

CONCLUSIONS
Overall, our findings reinforce both the diagnostic and prognostic power of HKG expression patterns. Given their ubiquitous nature and role in maintenance of basic cellular functions, it is possible that HKGs act as drivers in lung cancer. Thus, while it has been reported that HKGs evolve at a slower rate than other genes (Zhang & Li, 2004), it will be of interest to examine the mutational patterns of HKGs and further validate their relevance in lung cancer development. Furthermore, as we have demonstrated, HKGs is a strong predictor of ''low'' and ''high'' risk patients whom differ significantly in overall survival. It is therefore important that future studies examine the correlation of HKG patterns to the morphology and histology of lung cancer tissues; because while histology may provide initial stratification of patients into rough classes, HKGs may offer a more reliable method of classification. Consequently, the ability to identify high-risk individuals in early lung cancer stages may allow for adjustments of therapeutic interventions and increased survival.