Cancer survival classification using integrated data sets and intermediate information
Introduction
Ovarian cancer is the fifth leading cause of death from gynecological malignancy in the United States and Western Europe [1]. The high mortality rate of ovarian cancer is largely due to the typically advanced stage at initial diagnosis [1]. According to statistics, 75% of patients with ovarian cancer are diagnosed at an advanced stage, for which the 5-year survival rate is only 5–30%, with an average survival time of 21 months. In contrast, the 5-year survival rate among patients diagnosed early exceeds 90% [1]. Glioblastoma multiforme (GBM) is the most common and most aggressive brain tumor and has the poorest survival, with a median of 9–15 months. Nevertheless, a small percentage of patients survive for longer than 3 years [2], [3]. The molecular mechanisms that mediate longer survival in patients with advanced stages of ovarian cancer [4] and GBM [3] are a focus of current research.
Several studies have shown that miRNAs, which are small, 19–22 nucleotide non-coding RNAs [5], play an important role in cancer progression by regulating target mRNAs and mediating translational repression [6], [7], [8]. Molecules such as miRNAs and other noncoding RNAs have received recent attention [9], [10] because of their complicated communication systems in the cell. miRNA expression profiles have been reported to be associated with the characteristics of many cancers, including lung, breast [11], and ovarian [12], [13]. Moreover, mRNA and miRNA studies have been performed for many cancer subtypes [13]. Guo [14] studied the survival time associated with miRNAs chosen for their distinctive expression in esophageal squamous cell carcinoma, and Yu [15] proposed a method that predicted survival and relapse in non-small-cell lung cancer using hazard ratios based on miRNA expression from univariate Cox regression. The study of mRNA in cancer subtypes and cancer prognosis is a much more established area that has had a number of important successes [13], [16], [17], [18], [19]. Nevertheless, developing a method for separating cancer phenotypes remains a challenge, and predicting cancer survival time in ovarian cancer patients based on these expression profiles has not been very successful.
Many studies have suggested the use of better-performing molecular data sets, such as miRNA or mRNA, for cancer-related classification. However, several questions about their relationships remain unresolved. For example, Lu [20] distinguished multiple tumors (generally poorly differentiated) of 11 cancer types using gene expression profiles from a bead-based flow cytometric miRNA expression profiling technique. The accuracy of this classification, based only on miRNA expression, was 70.58% in a test set of 11 types of cancer. However, the accuracy was only 5.9% when classification was based on mRNA expression profiles. Conversely, using the same data set along with a second similar data set of 11 (of 14) cancers, Peng [21] used a support vector machine (SVM) with a modification of recursive feature elimination and compared classification performance using miRNA and mRNA profiles. The results suggested that classification using mRNA is superior to that using miRNA.
In this study, we used 3 cancer data sets to predict survival time: (1) only mRNA expression, (2) only miRNA expression, and (3) both mRNA and miRNA expression. In all 3 cases, we assessed the quality of these features as predictors of survival time. For these 3 data sets, we implemented 2 different methods to integrate information to predict cancer survival time. The first uses well-known classification algorithms, SVM and random forest (RF), and is based on the discretization of survival times into 2 classes, while the second is feature selection with a Cox proportional hazards regression-based algorithm (FSCOX), which is based on continuous survival information. Our approaches for predicting survival time use machine learning (ML) protocols, which allow for transparent combinations of information types (miRNA and mRNA expression). In principle, integration of molecular information types in ML can be done using the standard ML method of kernel addition, without limiting the number of data types. This means taking the kernel matrices that represent multiple data sources (e.g., mRNA and miRNA expression) and adding them to represent the combined information. For biomarker-based prediction, kernel addition is a simple modular method that integrates different information sources to predict cancer phenotypes.
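The kernel addition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: it assumes linear kernels, uses made-up toy expression values, and the helper names `linear_kernel` and `add_kernels` are hypothetical.

```python
def linear_kernel(X):
    """Gram matrix of a sample-by-feature matrix X, given as a list of rows."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def add_kernels(*kernels):
    """Element-wise sum of kernel matrices computed over the same samples."""
    n = len(kernels[0])
    if any(len(K) != n or any(len(row) != n for row in K) for K in kernels):
        raise ValueError("all kernels must be n x n over the same samples")
    return [[sum(K[i][j] for K in kernels) for j in range(n)] for i in range(n)]


# Toy data (illustrative only): 3 patients, 2 mRNA features, 1 miRNA feature.
mrna = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
mirna = [[0.2], [1.5], [-0.7]]

# One kernel per data source, then a single combined kernel.
K_combined = add_kernels(linear_kernel(mrna), linear_kernel(mirna))

# For linear kernels, the sum equals the kernel of the concatenated features.
concatenated = [m + r for m, r in zip(mrna, mirna)]
K_concat = linear_kernel(concatenated)
```

A combined matrix of this kind could then be handed to any classifier that accepts a precomputed kernel (for example, scikit-learn's `SVC(kernel="precomputed")`). Adding a further data type only requires one more kernel over the same patients.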
Based on the significant miRNAs identified in the ML study, we also separately predicted their target genes by integrating 2 sequence-based prediction methods, TargetScan [22] and miRanda [23], together with information on the (anti-)correlations of miRNA and mRNA expression profiles. For a given miRNA sequence, TargetScan [22] searches for complementary target mRNAs among 14,300 orthologous mRNA triplets (obtained via conservation from the human, rat, and mouse genomes), within mRNA untranslated regions (UTRs). The miRanda algorithm [23] optimizes sequence complementarity between the miRNA and the putative miRNA target using position-specific criteria, and also requires interspecies conservation of mRNA-miRNA pairings, with all pairs scored using weighted sums of match and mismatch scores, including matching base pairs (positive scores) and gap penalties (negative scores). Currently, identification of the mRNA targets of miRNAs in silico is considered difficult due to a lack of reliable methodologies. We therefore relied on corroborating targeting information from the 2 independent sources (i.e., sequence-based and anticorrelation-based sources). Such a systemic method currently suffers from considerable noise and limited reliability in matching miRNAs with mRNAs; however, it makes it possible to identify potential miRNA targets for further study.
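The idea of corroborating sequence-based predictions with expression anticorrelation can be sketched as below. This is only an illustration under assumptions not stated in the text: the two sequence-based prediction sets are combined by intersection, anticorrelation is measured with the Pearson coefficient, the cutoff of -0.5 is arbitrary, and the gene names, profiles, and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def corroborated_targets(mirna_profile, mrna_profiles, seq_a, seq_b, threshold=-0.5):
    """Genes in both sequence-based sets whose expression is anticorrelated
    with the miRNA profile (correlation at or below the assumed threshold)."""
    candidates = seq_a & seq_b
    return sorted(
        gene
        for gene in candidates
        if gene in mrna_profiles
        and pearson(mirna_profile, mrna_profiles[gene]) <= threshold
    )


# Toy data (illustrative only): one miRNA and four candidate genes across 5 samples.
mirna_profile = [1.0, 2.0, 3.0, 4.0, 5.0]
mrna_profiles = {
    "GENE_A": [5.0, 4.0, 3.0, 2.0, 1.0],  # strongly anticorrelated
    "GENE_B": [1.0, 2.0, 3.0, 4.0, 5.0],  # positively correlated
    "GENE_C": [2.0, 1.0, 2.0, 1.0, 2.0],  # uncorrelated
    "GENE_D": [4.0, 5.0, 2.0, 3.0, 1.0],  # anticorrelated, but in only one set
}
predicted_by_method_1 = {"GENE_A", "GENE_B", "GENE_D"}
predicted_by_method_2 = {"GENE_A", "GENE_B", "GENE_C"}

targets = corroborated_targets(
    mirna_profile, mrna_profiles, predicted_by_method_1, predicted_by_method_2
)
```

Requiring agreement between both kinds of evidence trades recall for precision, which suits the goal of shortlisting candidate targets for follow-up rather than producing a complete target list.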
Materials
All data were obtained from The Cancer Genome Atlas (TCGA) available at http://cancergenome.nih.gov/ (accessed 01.10.13), a source of standardized and comprehensive cancer data sets. We downloaded 304 mRNA gene expression samples (AgilentG4502A) and 292 miRNA expression samples (Agilent miRNA_8 × 15 K) for ovarian cancer (updated in 2010) that were provided by the University of North Carolina. We obtained data from 147 deceased ovarian cancer patients that matched the patient IDs from the 292
Comparing 3 data types using SVM with feature selection (ovarian cancer data)
We ran the algorithm with feature selection sizes from 1 to 100 under leave-one-out cross-validation (LOOCV); results are shown in Fig. 2. The best performances obtained with feature selection were 75% accuracy using 60 features from the miRNA expression profiles alone, 63.64% using 5 features from the mRNA expression profiles alone, and 84.09% using 8 features from the combined data sets with CFS, in a data set of 22 short-term and 22 long-term survivors. To select the most informative features from the
Discussion
One novel aspect of this work is its means of combining information when integrating miRNA and mRNA data. In previous studies, Lu [18] suggested that miRNA profiles performed better than mRNA for studying 11 cancer types, while Peng [19] instead suggested that mRNA performed better than miRNA in predicting subtypes/phenotypes. In this study, our results show that in survival phenotype prediction, miRNA alone performed better than mRNA alone in ovarian cancer, while in some cases using
Conclusions
In this study, we examined two methods for feature integration between miRNA and mRNA data. The first method, a new approach presented as FSCOX, uses all survival data (not only short and long) to predict cancer survival time. The second uses existing machine learning methods (SVM, RF) to discriminate directly between long and short survival times. Both methods were studied for their performance in integrating the two different data types (miRNA and mRNA). These methods are tested on two TCGA
Acknowledgements
This work was partially supported by the NIH of the US (grants 1R21CA13582-01, 1R01GM080625-01A1) and the National Research Foundation of Korea (grants 355-2008-C00004, MEST 2012R1A3A2026438, 2012-0000644).
References (68)
- et al. MicroRNAs in tumorigenesis: a primer. Am J Pathol (2007)
- et al. A ceRNA hypothesis: the Rosetta stone of a hidden RNA language. Cell (2011)
- et al. Coding-independent regulation of the tumor suppressor PTEN by competing endogenous mRNAs. Cell (2011)
- et al. MicroRNA signature predicts survival and relapse in lung cancer. Cancer Cell (2008)
- et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet (2005)
- et al. Multi-class cancer classification through gene expression profiles: microRNA versus mRNA. J Genet Genomics (2009)
- et al. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell (2005)
- et al. D4S234E, a novel p53-responsive gene, induces apoptosis in response to DNA damage. Exp Cell Res (2010)
- et al. ST14: (suppression of tumorigenicity 14) gene is a target for miR-27b, and the inhibitory effect of ST14 on cell growth is independent of miR-27b regulation. J Biol Chem (2009)
- et al. ANKHD1, ankyrin repeat and KH domain containing 1, is overexpressed in acute leukemias and is associated with SHP2 in K562 cells. Biochim Biophys Acta (2006)
- Control of cyclin D1 and breast tumorigenesis by the EglN2 prolyl hydroxylase. Cancer Cell
- Differential glycosylation and cell surface expression of lysosomal membrane glycoproteins in sublines of a human colon cancer exhibiting distinct metastatic potentials. J Biol Chem
- Immunochemical analysis of CD107a (LAMP-1). Cell Immunol
- Identification of tumor-associated autoantigens for the diagnosis of colorectal cancer in serum using high density protein microarrays. Mol Cell Proteomics
- Transcription factor GATA-2 is required for proliferation/survival of early hematopoietic cells and mast cell formation, but not for erythroid and myeloid terminal differentiation. Blood
- The Mullerian HOXA10 gene promotes growth of ovarian surface epithelial cells by stimulating epithelial–stromal interactions. Mol Cell Endocrinol
- Tumour cell expression of C4.4A, a structural homologue of the urokinase receptor, correlates with poor prognosis in non-small cell lung cancer. Lung Cancer
- Nemo-like kinase induces apoptosis in DLD-1 human colon cancer cells. Biochem Biophys Res Commun
- ORC5L, a new member of the human origin recognition complex, is deleted in uterine leiomyomas and malignant myeloid diseases. J Biol Chem
- Differential and epigenetic gene expression profiling identifies frequent disruption of the RELN pathway in pancreatic cancers. Gastroenterology
- Identification of differentially expressed proteins in ovarian cancer using high-density protein microarrays. Proc Natl Acad Sci USA
- Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma. N Engl J Med
- Polymorphisms of LIG4, BTBD2, HMGA2, and RTEL1 genes involved in the double-strand break repair pathway predict glioblastoma survival. J Clin Oncol
- Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clin Cancer Res
- Genetic variation in microRNA networks: the implications for cancer research. Nat Rev Cancer
- miR-15 and miR-16 induce apoptosis by targeting BCL2. Proc Natl Acad Sci USA
- Regulation of phosphate homeostasis by microRNA in Arabidopsis. Plant Cell
- Alterations of choline phospholipid metabolism in ovarian tumor progression. Cancer Res
- MicroRNA signatures in human ovarian cancer. Cancer Res
- Integrated genomic analyses of ovarian carcinoma. Nature
- Distinctive microRNA profiles relating to patient survival in esophageal squamous cell carcinoma. Cancer Res
- Survival-related profile, pathways, and transcription factors in ovarian cancer. PLoS Med
- Gene expression profiling predicts clinical outcome of breast cancer. Nature
- An integrated genomic-based approach to individualized treatment of patients with advanced-stage ovarian cancer. J Clin Oncol