Cancer survival classification using integrated data sets and intermediate information

https://doi.org/10.1016/j.artmed.2014.06.003

Abstract

Objective

Although numerous studies related to cancer survival have been published, increasing the prediction accuracy of survival classes still remains a challenge. Integration of different data sets, such as microRNA (miRNA) and mRNA, might increase the accuracy of survival class prediction. Therefore, we suggested a machine learning (ML) approach to integrate different data sets, and developed a novel method based on feature selection with Cox proportional hazard regression model (FSCOX) to improve the prediction of cancer survival time.

Methods

FSCOX provides us with intermediate survival information, which is usually discarded when separating survival into 2 groups (short- and long-term), and allows us to perform survival analysis. We used an ML-based protocol for feature selection, integrating information from miRNA and mRNA expression profiles at the feature level. To predict survival phenotypes, we used the following classifiers: first, the existing ML methods support vector machine (SVM) and random forest (RF); second, a new median-based classifier using FSCOX (FSCOX_median); and third, an SVM classifier using FSCOX (FSCOX_SVM). We compared these methods using 3 types of cancer tissue data sets: (i) miRNA expression, (ii) mRNA expression, and (iii) combined miRNA and mRNA expression. The latter data set included features selected either from the combined miRNA/mRNA profile or independently from the miRNA and mRNA profiles (independent feature selection, IFS).
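The difference between selecting features from the pooled miRNA/mRNA profile and selecting them independently from each profile (IFS) can be illustrated with a minimal sketch. The scoring function (absolute difference of class means), the feature names, and the toy values are all assumptions made for illustration; they are not the selection criterion or the data used in the study.

```python
def score(values, labels):
    """Toy relevance score (an assumption): |mean(short-term) - mean(long-term)|."""
    short = [v for v, y in zip(values, labels) if y == 0]
    long_ = [v for v, y in zip(values, labels) if y == 1]
    return abs(sum(short) / len(short) - sum(long_) / len(long_))


def top_k(features, labels, k):
    """Names of the k highest-scoring features; features maps name -> values."""
    ranked = sorted(features, key=lambda name: score(features[name], labels),
                    reverse=True)
    return ranked[:k]


def select_combined(mirna, mrna, labels, k):
    """Pool both profiles first, then pick the k best features overall."""
    return top_k({**mirna, **mrna}, labels, k)


def select_independent(mirna, mrna, labels, k_mirna, k_mrna):
    """IFS-style: pick the best features from each profile separately."""
    return top_k(mirna, labels, k_mirna) + top_k(mrna, labels, k_mrna)


# Hypothetical toy data: 0 = short-term survivor, 1 = long-term survivor.
labels = [0, 0, 1, 1]
mirna = {"miR-a": [1.0, 1.2, 3.0, 3.1], "miR-b": [2.0, 2.1, 2.0, 2.2]}
mrna = {"gene-x": [5.0, 5.5, 1.0, 0.5], "gene-y": [0.5, 0.7, 3.0, 3.2]}

combined = select_combined(mirna, mrna, labels, 2)
independent = select_independent(mirna, mrna, labels, 1, 1)
print(combined)     # both picks come from the mRNA profile
print(independent)  # one miRNA and one mRNA feature
```

In this toy case, pooled selection keeps only mRNA features because they score highest, whereas independent selection guarantees that each data type contributes features to the classifier.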

Results

In the ovarian data set, with a balanced set of 22 short-term and 22 long-term survivors, the accuracy of survival classification using the combined miRNA/mRNA profiles with IFS was 75% using RF, 86.36% using SVM, 84.09% using FSCOX_median, and 88.64% using FSCOX_SVM. These accuracies are higher than those using miRNA alone (70.45%, RF; 75%, SVM; 75%, FSCOX_median; and 75%, FSCOX_SVM) or mRNA alone (65.91%, RF; 63.64%, SVM; 72.73%, FSCOX_median; and 70.45%, FSCOX_SVM). Similarly, in the glioblastoma multiforme data, the accuracy of miRNA/mRNA using IFS was 75.51% (RF), 87.76% (SVM), 85.71% (FSCOX_median), and 85.71% (FSCOX_SVM). These results are higher than those obtained using miRNA expression or mRNA expression alone. In addition, we predicted 16 hsa-miR-23b and hsa-miR-27b target genes in the ovarian cancer data sets, obtained by SVM-based feature selection through integration of sequence information and gene expression profiles.
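As a quick arithmetic check on the ovarian figures, the reported percentages can be converted back into counts of correctly classified patients. This sketch assumes that each of the 44 patients is scored exactly once (as would be the case under leave-one-out validation); the function name is hypothetical.

```python
def correct_count(accuracy_percent, n_samples):
    """Number of correctly classified samples implied by a reported accuracy."""
    return round(accuracy_percent / 100 * n_samples)


n = 22 + 22  # balanced short-term and long-term survivors
reported = {"RF": 75.0, "SVM": 86.36, "FSCOX_median": 84.09, "FSCOX_SVM": 88.64}

counts = {name: correct_count(acc, n) for name, acc in reported.items()}
for name, c in counts.items():
    # Round-trip: the count reproduces the reported percentage to 2 decimals.
    print(f"{name}: {c}/{n} = {100 * c / n:.2f}%")
```

Under this assumption the four methods classify 33, 38, 37, and 39 of the 44 patients correctly, so the gap between neighbouring methods corresponds to a single patient.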

Conclusion

Among the approaches used, the integrated miRNA and mRNA data set yielded better results than the individual data sets. The best performance was achieved using the FSCOX_SVM method with independent feature selection, which uses intermediate survival information between short-term and long-term survival time and the combination of the 2 different data sets. The results obtained using the combined data set suggest that there are some strong interactions between miRNA and mRNA features that are not detectable in the individual analyses.

Introduction

Ovarian cancer is the fifth leading cause of death from gynecological malignancy in the United States and Western Europe [1]. The high mortality rate of ovarian cancer is largely due to the typically advanced stage at initial diagnosis [1]. According to statistics, 75% of patients with ovarian cancer are diagnosed at an advanced stage, for which the 5-year survival rate is only 5–30%, with an average survival time of 21 months. In contrast, the 5-year survival rate among patients diagnosed early exceeds 90% [1]. Glioblastoma multiforme (GBM) is the most common and most aggressive brain tumor and has the poorest survival, with a median of 9–15 months. Nevertheless, a small percentage of patients survive for longer than 3 years [2], [3]. The molecular mechanisms that mediate longer survival in patients with advanced stages of ovarian cancer [4] and GBM [3] are a focus of current research.

Several studies have shown that miRNAs, which are small, 19–22 nucleotide non-coding RNAs [5], play an important role in cancer progression by regulating target mRNAs and mediating translational repression [6], [7], [8]. Molecules such as miRNAs and other noncoding RNAs have received recent attention [9], [10] because of their complicated communication systems in the cell. miRNA expression profiles have been reported to be associated with the characteristics of many cancers, including lung, breast [11], and ovarian [12], [13]. Moreover, mRNA and miRNA studies have been performed for many cancer subtypes [13]. Guo [14] studied the survival time associated with miRNAs chosen for their distinctive expression in esophageal squamous cell carcinoma, and Yu [15] proposed a method that predicted survival and relapse in non-small-cell lung cancer using hazard ratios based on miRNA expression from univariate Cox regression. The study of mRNA in cancer subtypes and cancer prognosis is a much more established area that has had a number of important successes [13], [16], [17], [18], [19]. Nevertheless, developing a method for separating cancer phenotypes remains a challenge, and predicting cancer survival time in ovarian cancer patients based on these expression profiles has not been very successful.

Many studies have suggested the use of better-performing molecular data sets, such as miRNA or mRNA, for cancer-related classification. However, several questions about their relationships remain unresolved. For example, Lu [20] distinguished multiple tumors (generally poorly differentiated) of 11 cancer types using gene expression profiles from a bead-based flow cytometric miRNA expression profiling technique. The accuracy of this classification, based only on miRNA expression, was 70.58% in a test set of 11 types of cancer. However, the accuracy was only 5.9% when classification was based on mRNA expression profiles. Conversely, using the same data set along with a second similar data set of 11 (of 14) cancers, Peng [21] used a support vector machine (SVM) with a modification of recursive feature elimination and compared classification performance using miRNA and mRNA profiles. The results suggested that classification using mRNA is superior to that using miRNA.

In this study, we used 3 cancer data sets to predict survival time: (1) only mRNA expression, (2) only miRNA expression, and (3) both mRNA and miRNA gene expression. In all 3 cases, we assessed the quality of these features as predictors of survival time. For these 3 data sets, we implemented 2 different methods to integrate information to predict cancer survival time. The first uses well-known classification algorithms, SVM and random forest (RF), based on the discretization of survival times into 2 classes, while the second is feature selection with a Cox proportional hazards regression-based algorithm (FSCOX) based on continuous survival information. Our approaches for predicting survival time use machine learning (ML) protocols, which allow for transparent combinations of information types (miRNA and mRNA expression). In principle, integration of molecular information types in ML can be done using the standard ML method of kernel addition, without limiting the number of data types. This means taking kernel matrices representing multiple data sources (e.g., mRNA and miRNA expression) and adding them to represent the combined information. For biomarker-based prediction, kernel addition is a simple modular method that integrates different information sources to predict cancer phenotypes.
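Kernel addition as described above can be sketched in a few lines. The example assumes a linear kernel and tiny made-up expression matrices (rows are patients); it is an illustration of the general technique, not the configuration used in the study.

```python
def linear_kernel(X):
    """Gram matrix K[i][j] = <X[i], X[j]> for the rows (patients) of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def add_kernels(*kernels):
    """Element-wise sum of kernel matrices built over the same patients."""
    n = len(kernels[0])
    return [[sum(K[i][j] for K in kernels) for j in range(n)] for i in range(n)]


# Hypothetical toy data: 3 patients, 2 mRNA features and 1 miRNA feature.
mrna = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
mirna = [[0.5], [2.0], [1.0]]

K_sum = add_kernels(linear_kernel(mrna), linear_kernel(mirna))

# For linear kernels, adding the kernels is equivalent to concatenating
# the feature vectors of the two data types before computing the kernel.
K_concat = linear_kernel([a + b for a, b in zip(mrna, mirna)])
print(K_sum == K_concat)
```

The summed matrix could then be passed to any kernel classifier that accepts a precomputed kernel (for instance, scikit-learn's `SVC(kernel="precomputed")`). Because each data source contributes its own matrix, further data types can be added without changing the classifier.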

Based on the significant miRNAs determined in the ML study, we also separately predicted their target genes by integrating 2 sequence-based prediction methods, TargetScan [22] and miRanda [23], together with information on the (anti-)correlations of miRNA and mRNA expression profiles. For a given miRNA sequence, TargetScan [22] searches for complementary target mRNAs among 14,300 orthologous mRNA triplets (obtained via conservation from the human, rat, and mouse genomes), in untranslated region (UTR) mRNA. The miRanda algorithm [23] optimizes sequence complementarity between the miRNA and the putative miRNA target using position-specific criteria, and also requires interspecies conservation of mRNA-miRNA pairings, with all pairs scored using weighted sums of match and mismatch scores, including matching base pairs (positive scores) and gap penalties (negative scores). Currently, identification of the mRNA targets of miRNAs in silico is considered difficult due to a lack of reliable methodologies. We relied on corroborating targeting information from the 2 independent sources (i.e., sequence-based and anticorrelation-based sources). Such a systemic method currently has a good deal of noise and a lack of reliability in matching miRNAs with mRNAs; however, it is possible to identify potential miRNA targets for further study.
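One way to combine sequence-based predictions with expression anticorrelation is sketched below. It assumes that a candidate must be predicted by both tools (intersection) and must show a Pearson correlation with the miRNA below a cutoff; the intersection rule, the cutoff of -0.5, the gene names, and the expression values are all hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def candidate_targets(mirna_expr, mrna_expr, targetscan_hits, miranda_hits,
                      max_r=-0.5):
    """Genes predicted by both tools whose expression is anticorrelated
    with the miRNA (correlation <= max_r, a hypothetical cutoff)."""
    shared = set(targetscan_hits) & set(miranda_hits)
    result = {}
    for gene in sorted(shared):
        r = pearson(mirna_expr, mrna_expr[gene])
        if r <= max_r:
            result[gene] = r
    return result


# Hypothetical expression of one miRNA and three genes across 4 samples.
mirna_expr = [1.0, 2.0, 3.0, 4.0]
mrna_expr = {
    "GENE_A": [4.0, 3.0, 2.0, 1.0],   # falls as the miRNA rises
    "GENE_B": [1.0, 2.0, 3.0, 4.5],   # rises with the miRNA
    "GENE_C": [4.0, 3.5, 2.0, 1.0],   # anticorrelated, but one tool only
}
targets = candidate_targets(
    mirna_expr, mrna_expr,
    targetscan_hits=["GENE_A", "GENE_B"],
    miranda_hits=["GENE_A", "GENE_B", "GENE_C"],
)
print(targets)
```

Here only GENE_A survives: GENE_B is predicted by both tools but is positively correlated with the miRNA, and GENE_C is anticorrelated but lacks support from one of the sequence-based predictions.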

Section snippets

Materials

All data were obtained from The Cancer Genome Atlas (TCGA) available at http://cancergenome.nih.gov/ (accessed 01.10.13), a source of standardized and comprehensive cancer data sets. We downloaded 304 mRNA gene expression samples (AgilentG4502A) and 292 miRNA expression samples (Agilent miRNA_8 × 15 K) for ovarian cancer (updated in 2010) that were provided by the University of North Carolina. We obtained data from 147 deceased ovarian cancer patients that matched the patient IDs from the 292

Comparing 3 data types using SVM with feature selection (ovarian cancer data)

We ran the algorithm using feature selection sizes from 1 to 100 with LOOCV (results are shown in Fig. 2). With feature selection, the best performances for the different data sets, each containing 22 short-term and 22 long-term survivors, were 75% accuracy using 60 features from only the miRNA expression profiles, 63.64% using 5 features from only the mRNA expression profiles, and 84.09% using 8 features from the combined data sets with CFS. To select the most informative features from the

Discussion

One new aspect of this work is the means of combining information when integrating miRNA and mRNA data. In previous studies, Lu [18] suggested that miRNA profiles performed better than mRNA for studying 11 cancer types, while Peng [19] alternatively suggested that mRNA performed better than miRNA in predicting subtypes/phenotypes. In this study our results show that in survival phenotype prediction, miRNA alone performed better than mRNA alone in ovarian cancer, while in some cases using

Conclusions

In this study we have studied two methods for feature integration between miRNA and mRNA data. The first method, a new approach presented as FSCOX, uses all survival data (not only short and long) to predict cancer survival time. The second uses existing machine learning methods (SVM, RF) to discriminate directly between long and short survival times. Both methods were studied for their performance in integrating the two different data types (miRNA and mRNA). These methods are tested on two TCGA

Acknowledgements

This work was partially supported by the NIH of the US (grants 1R21CA13582-01, 1R01GM080625-01A1) and the National Research Foundation of Korea (grants 355-2008-C00004, MEST 2012R1A3A2026438, 2012-0000644).

References (68)

  • Q. Zhang et al.

    Control of cyclin D1 and breast tumorigenesis by the EglN2 prolyl hydroxylase

    Cancer Cell

    (2009)
  • O. Saitoh et al.

    Differential glycosylation and cell surface expression of lysosomal membrane glycoproteins in sublines of a human colon cancer exhibiting distinct metastatic potentials

    J Biol Chem

    (1992)
  • E.J. Parkinson-Lawrence et al.

    Immunochemical analysis of CD107a (LAMP-1)

    Cell Immunol

    (2005)
  • I. Babel et al.

    Identification of tumor-associated autoantigens for the diagnosis of colorectal cancer in serum using high density protein microarrays

    Mol Cell Proteomics

    (2009)
  • F.Y. Tsai et al.

    Transcription factor GATA-2 is required for proliferation/survival of early hematopoietic cells and mast cell formation, but not for erythroid and myeloid terminal differentiation

    Blood

    (1997)
  • S.Y. Ko et al.

    The Mullerian HOXA10 gene promotes growth of ovarian surface epithelial cells by stimulating epithelial–stromal interactions

    Mol Cell Endocrinol

    (2010)
  • L.V. Hansen et al.

    Tumour cell expression of C4.4A, a structural homologue of the urokinase receptor, correlates with poor prognosis in non-small cell lung cancer

    Lung Cancer

    (2007)
  • J. Yasuda et al.

    Nemo-like kinase induces apoptosis in DLD-1 human colon cancer cells

    Biochem Biophys Res Commun

    (2003)
  • D.G. Quintana et al.

    ORC5L, a new member of the human origin recognition complex, is deleted in uterine leiomyomas and malignant myeloid diseases

    J Biol Chem

    (1998)
  • N. Sato et al.

    Differential and epigenetic gene expression profiling identifies frequent disruption of the RELN pathway in pancreatic cancers

    Gastroenterology

    (2006)
  • M.E. Hudson et al.

    Identification of differentially expressed proteins in ovarian cancer using high-density protein microarrays

    Proc Natl Acad Sci USA

    (2007)
  • R. Stupp et al.

    Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma

    N Engl J Med

    (2005)
  • Y. Liu et al.

    Polymorphisms of LIG4, BTBD2, HMGA2, and RTEL1 genes involved in the double-strand break repair pathway predict glioblastoma survival

    J Clin Oncol

    (2010)
  • R.W. Tothill et al.

    Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome

    Clin Cancer Res

    (2008)
  • B.M. Ryan et al.

    Genetic variation in microRNA networks: the implications for cancer research

    Nat Rev Cancer

    (2010)
  • A. Cimmino et al.

    miR-15 and miR-16 induce apoptosis by targeting BCL2

    Proc Natl Acad Sci USA

    (2005)
  • T.J. Chiou et al.

    Regulation of phosphate homeostasis by microRNA in Arabidopsis

    Plant Cell

    (2006)
  • E. Iorio et al.

    Alterations of choline phospholipid metabolism in ovarian tumor progression

    Cancer Res

    (2005)
  • M.V. Iorio et al.

    MicroRNA signatures in human ovarian cancer

    Cancer Res

    (2007)
  • Cancer Genome Atlas Research Network

    Integrated genomic analyses of ovarian carcinoma

    Nature

    (2011)
  • Y. Guo et al.

    Distinctive microRNA profiles relating to patient survival in esophageal squamous cell carcinoma

    Cancer Res

    (2008)
  • A.P. Crijns et al.

    Survival-related profile, pathways, and transcription factors in ovarian cancer

    PLoS Med

    (2009)
  • L.J. van't Veer et al.

    Gene expression profiling predicts clinical outcome of breast cancer

    Nature

    (2002)
  • H.K. Dressman et al.

    An integrated genomic-based approach to individualized treatment of patients with advanced-stage ovarian cancer

    J Clin Oncol

    (2007)