Genome Wide Gene Expression Profiling and Molecular Classification of Renal Cell Cancer Subtypes

Renal Cell Carcinoma (RCC) is one of the most common malignancy and the sixth cause of leading death from the cancer. In this study, our aim is to identify biomarker candidates expressed at different levels between subtypes of RCC for potential new therapy for RCC. A total of 1,020 raw RNA-Seq counts of renal cancer dataset including 606 KIRP, 323 KIRC and 91 KICH were obtained from the Cancer Genome Atlas (TCGA) Project. Five statistical learning methods including support vector machines (SVM), random forests (RF), penalized discriminant analysis (PDA), naïve bayes (NB) and knearest neighbors (KNN) were applied on the purpose of classification. Gene ontology (GO) and Kyoto encyclopaedia of genes and genomes (KEGG) pathway analysis were also conducted for a deeper understanding of differentially expressed genes in three renal cancer subtypes. We identified 181 up-regulated and 69 down-regulated genes that showed a significant differential expression between cancer samples. Classification analysis resulted that integrating these genes in a statistical learning framework perform between 90.7-94.6% accuracy, 81.4-91.2% sensitivity and 92.9-97.0% specificity. Our findings showed a few DEGs including NPK, HRNPLT, ARF1 and TTR might contain a critical role in RCC. Furthermore, study demonstrated using the DEGs as the high rate hit distinguish and classification for RCC subtypes.


I. INTRODUCTION
Renal Cell Carcinoma (RCC) is one of the most common malignant cancer types representing 2-3% of all malignancies with an expectation of more than 200,000 new cases annually.In 2014, approximately 64,000 new cases of renal cancers are expected in the USA, which makes the RCC the sixth largest cause of cancer deaths.These new diagnosed cases correspond approximately 5% of all malignancies within males and 3% within females [1][2][3][4].Kidney renal clear cell (KIRC), kidney renal papillary cell (KIRP) and kidney chromophobe (KICH) carcinomas are the most common subtypes of the RCC, accounting for approximately 95% of total malignant kidney tumors in adults.KIRC is the most common (70-80%) and deadly subtype of this disease with 5-year survival rate of 44-69% and metastasis rate of 94%.For KIRP and KICH subtypes, incidence rates are about 14-17% and 4-8%, 5-year survival rates are about 82-92% and 78-87%, metastasis rates are about 4% and 2%, respectively [5,6].
Moreover, distinguishing renal cell cancer subtypes are crucial due to following reasons: (i) histological subtypes are associated with the prognosis and survival differences, (ii) it is possible to detect individual chances for optimal treatment responses and (iii) it may help understanding the etiology of this disease, since potential biomarkers may have roles in different pathways for different disease subtypes [7].With the recent advances in molecular biology, it is possible to understand the molecular mechanisms that underlie the tumorigenesis and progression of cancer diseases.In this way, reliable candidate biomarkers can be detected and novel potential targets can be identified for targeted cancer therapies.Several studies are reported regarding to gene-expression based differential diagnosis of the RCC [1,[8][9][10].Lately, various genomics, transcriptomics and epigenomics related studies have been developed in genome-wide characterization of RCC [11][12][13][14][15][16].However, there is still an urgent need to determine new genome wide biomarkers and diagnostic schemes for the RCC.
In this study, we aimed to detect the differentially expressed genes among the RCC subtypes to identify candidate biomarkers and novel potential targets for targeted RCC therapies.We also aimed to classify corresponding subtypes, using the significant genes, which may play a significant role in the diagnosis of the RCC.

A. Experimental Data
We obtained a real RNA-Seq dataset composed of level 3, RNASeqV2 data from the Cancer Genome Atlas (TCGA) Project [17].The data contains a total of 1,020 renal cancer samples containing 606 KIRP, 323 KIRC and 91 KICH, respectively.The data for each sample is the raw read counts that resulted from 20 billion reads and mapped to human reference genome.Each sample contains the counts of 20,531 known human RNA transcripts as generated by one of the Genome Sequencing Center of TCGA, University of North Carolina at Chapel Hill.We selected the top 5,000 genes with the highest variances to filter non-informative genes before starting analysis.

B. Data Normalization and Transformation
To adjust the sample specific differences, we applied trimmed mean of M values (TMM) normalization to the raw count data.Next, we applied a voom transformation to estimate the mean-variance relationship of counts and to convert the data in continuous form because of several advantages such as: (i) count based mean-variance estimation using discrete distributions does not totally match to its expected value, (ii) voom is more precise method in mean-variance modeling, (iii) voom based differential expression analysis are able to control type I error and has the highest statistical power [18,19].

C. Differential Expression Analysis
Linear models for microarray and RNA-Seq data (limma) method was applied to the transformed data to identify the most significantly differentially expressed genes (DEGs) among the RCC subtypes.Log-cpm (log counts per million) values and the associated weights for each observation and each gene were taken as input data for the limma method.p values were adjusted for multiple testing problem using Benjamin-Hochberg procedure.Top 250 genes with lowest false discovery rate values were stored in all pairwise comparisons.A Venn diagram is constructed to identify the commonly differentially expressed genes in all pairwise comparisons [20,21].

D. Classification Analysis
To identify the predictive performances of multiple significant genes and distinguish different subtypes of the RCC, we performed classification analysis.Five statistical learning methods including support vector machines (SVM), random forests (RF), penalized discriminant analysis (PDA), naï ve bayes (NB) and knearest neighbors (KNN) were applied [22].A ten-fold cross-validation technique is used for model validation.Diagnostic accuracies, sensitivity, specificity, positive and negative predictive values are recorded to assess the model performances.

E. Gene Ontology & Pathway Analysis
For a deeper understanding of differentially expressed genes in three renal cancer subtypes, Gene ontology (GO) and Kyoto encyclopedia of genes and genomes (KEGG) pathway analysis [23] were performed using GeneCodis 3.0 program [24,25].Additionally, a hypergeometric test was used to map all differentially expressed genes to terms in the GO database for identifying significantly enriched GO terms in our gene list (p-values were calculated by using this hypergeometric test and corrected by FDR method).Subsequently, all identified GO terms were categorized according to three main GO categories; biological process (BP), molecular function (MF) and cellular component (CC).

A. Differentially Expressed Genes Among Renal Cell
Cancer Subtypes A venn diagram is constructed for all three pairwise limma differential expression results (KIRC vs. KICH, KIRP vs. KICH, KIRC vs. KIRP) and shown in Fig. 1.We detected the top 250 genes that were overlapped in all pairwise analysis.A total of 181 genes in this gene list were found as significantly up-regulated (p<0.001), while the remaining 69 were significantly down-regulated (p<0.001).Top 20 up-regulated and down-regulated genes with their corresponding fold change values are given in Table I and Table II.
When the three cancer subtypes are classified together, we obtained classification accuracies between 90.7-94.6%.Sensitivity measures vary between 81.4-91.2% and specificity measures vary between 92.9-97.0%.In addition to overall classification, each cancer subtype is discriminated from other cancer subtypes.Classification accuracies are obtained between 92.8-95.2%for KIRC versus KICH, 91.4-96.8%for KIRC versus KIRP and 97.1-99.3%for KIRP versus KICH.The highest discrimination is observed between KIRC and KICH subtypes.Among the selected classification methods, SVM performed the best for most of the situations (Table III).

IV. DISCUSSION
A systematic comparison and evaluation of transcriptome dynamics between different tissues at genome-wide scale enables us to determine the candidate genes whose expressions are highly induced during carcinogenesis.From possible candidate genes, the identification of most DEGs by a robust statistical method can be a good starting point for researchers, interested in early cancer diagnosis.In this study, a procedure for determining most DEGs in three subtypes of renal cell carcinoma (RCC) was applied using the large volume of RNA-Seq data.Gene expression profiles of 1,020 renal cell cancer samples were downloaded from the TCGA database and analyzed.
As explained above (Table I), two heterogeneous nuclear ribonucleoproteins (HNRNPK, HNRNPL) involved in multiple processes of gene expression, transcription and mRNA splicing were found to be most significantly up-regulated expressed genes and associated with von Hippel-Lindau (VHL) disease [26,27].As an autosomal-dominant inherited tumor syndrome, von Hippel-Lindau (VHL) disease is caused by mutations of the von Hippel-Lindautumor suppressor gene (VHL) and the mutations are linked to renal cell carcinoma [28].It was previously well reported that the mRNA expression of splicing factors such as; HNRNPK were increased in tumor samples and modulate the expression of many genes that control cellular proliferation and tumorigenesis.In addition, a research showed that one of RNA-binding protein (HNRNP A) is the potential direct targets of VHL [26].Therefore, we concluded that the overexpression of above-mentioned ribonucleoproteins in cellular splicing machinery disrupt somehow suppressor activity of VHL causing renal tumor progression.
Of note, ADP-ribosylation factor 1 (ARF1) was identified among the top DEGs.ARF1 belongs to the Ras superfamily of small GTPases and plays a crucial role in mediating vesicular transports and enhancing cell proliferation [29].There was a significant correlation between cell growth inhibition and Golgi disruption.It was demonstrated that AMF-26, a novel anticancer drug, inhibits the Golgi system by targeting ARF1 activation, inducing complete regression of human breast cancer [30].An evidence from a prostate cancer study showing that the overexpression of ARF1 activate the expression of ERK1/2, whereas its silencing with shRNA leads to inhibition of ERK1/2.Therefore, in the progression of prostate cancer, ARF1 may play an important role by modulating the expression of MAPK ERK1/2 [31].Another study showed that ARF1 was found to be significantly upregulated in gastric cancer tissue and its high level expression was evaluated as a biomarker for gastric cancer detection [32].This accumulating evidence points to an important role of ARF1 in tumorigenesis and ARF1 thus hold a potential as a biomarker in determining renal carcinomas subtypes because of its high expression.Among the down-regulated genes with significant fold-differences, a particular attention was given to transthyretin (TTR) gene, a transport protein that is required for retinol-binding protein and thyroxin and also works as a rapid turnover protein.Recently, a proteomic study indicated that the expression level of TTR in Pca tissue and its serum level correlate with Pca (advanced prostate cancer) malignancy.This report also showed the ACC: Accuracy rate, SEN: Sensitivity, SPE: Specificity, PPV: Positive predictive value, NPV: Negative predictive value, SVM: Support vector machine, RF: Random forests, PDA: Penalized discriminant analysis, NB: Naive bayes, KNN: k-nearest neighbors.
High expression level of TTR in Pca tissue may allow Pca cells to escape from immunological surveillance, whereas its down regulation can activate the immune system, causing cancer cell death.Therefore, researchers pointed out that the expression levels of TTR serve as biomarkers for the prognosis of Pca [33].Another study, using LDI-TOF-MS peptide mass fingerprinting, found that TTR expression was found to be decreased in sera of lung cancer and benign lung disease compared with normal sera and suggested that TTR are potential biomarkers for lung cancer [34].Additionally, the TTR was weakly expressed in some human cancers such as; bronchopulmonary carcinoid and lung squamous carcinoma [35].In contrary, another proteomic study using 2-DE and MALDI-TOF/MS demonstrated that the protein level of transthyretin (TTR) was up-regulated more than 2-fold in pancreatic cancer juice compared with chronic pancreatitis and choledocholithiasis.The western blot experiments in the same study also verified that TTR was found to be abundant in the pancreatic cancer juice [19].As correlated with the previous study, we found the expression level of TTR gene was significantly down regulated when comparing the renal carcinoma subtypes to each other.As reported in elsewhere, the kidneys are important organ in the metabolism and catabolism of TTR and its concentration is increased in patients with reduced kidney function [34].Thus, we deduced that the reduced expression of TTR could be a candidate biomarker signature that distinguishes renal carcinoma subtypes.
Moreover, classification analysis was carried out to identify the predictive performances of these DEGs in classifying the RCC subtypes.Based on the results of five statistical learning algorithms, we obtained very high predictive performances, mostly for SVM method.Using SVM, these three RCC subtypes can be differentiated with 94.6% classification accuracy, 91.2% sensitivity and 97.0% specificity.In addition to overall classification, each cancer subtype is discriminated successively.The highest discrimination was obtained between KIRP and KICH groups.SVM performed 95.2% accuracy, 97.1% sensitivity and 86.7% specificity in classifying KIRC versus KICH; 96.8% accuracy, 97.1% sensitivity and 96.6% specificity in classifying KIRC versus KIRP; 99.3% accuracy, 100.0%sensitivity and 91.7% specificity in classifying KIRP versus KICH.These genes combined with SVM method may be useful in molecular diagnosis of the RCC.In conclusion, we detected 250 genes as differentially expressed, where 181 were up regulated and 69 were down regulated, among RCC subtypes.Our analysis identified several DEGs may have crucial role on RCC including HRNPK, HRNPL, ARF1 and TTR.Further, we demonstrated that DEGs have very high classification accuracies and could be useful in distinguishing RCC subtypes.We believe that this study will provide groundwork to the most widespread renal tumor cancer disease for further experiments.It will also provide important insights in identification of candidate biomarkers and novel potential targets for targeted RCC therapies.

Figure 1 .
Figure 1.Venn diagram displaying the overlapping and non-overlapping gene numbers differentially expressed among RCC subtypes Classification Performances of Differentially Expressed Genes in Renal Cell Cancer Subtypes DEGs were included in the classification analysis to identify the multiple effects of these genes in separating the RCC subtypes.The classification methods used in this study, including SVM, RF, PDA, NB and KNN, are performed on both pairwise and overall RCC subtypes and the classification performances are given in TableIII.When the three cancer subtypes are classified together, we obtained classification accuracies between 90.7-94.6%.Sensitivity measures vary between 81.4-91.2% and specificity measures vary between 92.9-97.0%.In addition to overall classification, each cancer subtype is discriminated from other cancer subtypes.Classification

TABLE I .
LOG-FOLD CHANGES OF TOP 20 UP-REGULATED GENES DIFFERENTIALLY EXPRESSED AMONG RENAL CELL CANCER GROUPS.Kidney renal clear cell carcinoma, KIRP: Kidney renal papillary cell carcinoma, KICH: Kidney chromophobe TABLE II.LOG-FOLD CHANGES OF TOP 20 DOWN-REGULATED GENES DIFFERENTIALLY EXPRESSED AMONG RENAL CELL CANCER GROUPS.

TABLE III .
CLASSIFICATION PERFORMANCES OF STATISTICAL LEARNING ALGORITHMS IN DISTINGUISHING RENAL CELL CANCER SUBTYPES.
Figure 2. Number of hits for each GO term for differentially expressed genes.