Research Article
Network regularised Cox regression and multiplex network models to predict disease comorbidities and survival of cancer
Introduction
Comorbidity refers to the presence of one or more diseases that coexist with another disease, either simultaneously or interdependently (Capobianco and Liò, 2013). A comorbidity is sometimes considered a secondary diagnosis, detected at the same time as, or some time after, treatment for the principal diagnosis. It is associated with an elevated burden of symptoms, more complex clinical management, decreased length and quality of life, and increased health-care costs (Valderas et al., 2009). Comorbidity has significant predictive value for overall survival (Lagro et al., 2010). For instance, comorbidity has a main effect on survival in cancer, particularly for cancer of the cervix (Ferrandina et al., 2012). There is also inverse cancer comorbidity, in which individuals diagnosed with certain other health conditions have a lower-than-expected probability of developing cancer (Tabarés-Seisdedos and Rubenstein, 2013). Most of the diseases associated with inverse cancer comorbidity are Central Nervous System (CNS) or neuropsychiatric disorders (Tabarés-Seisdedos et al., 2011). There is some epidemiological evidence that patients with CNS disorders such as Alzheimer's disease, Parkinson's disease and schizophrenia have a lower-than-expected probability of developing some types of cancer (Ibáñez et al., 2014).
Data on the expression of various genes have been widely used to predict the risk of disease. Disease–gene association studies provide valuable information for developing disease-specific diagnostic, prognostic and therapeutic approaches (Tiffin et al., 2009). Understanding the relationships between diseases and genes at the molecular level could give us a better understanding of pathogenesis and lead to better prevention, treatment and diagnosis (Du et al., 2009). Diseases are more likely to be comorbid if they share associated gene expression profiles (Park et al., 2009). These associations can arise from direct or indirect causal relationships and from risk factors shared among diseases (Moni and Liò, 2014, Liò et al., 2012). For instance, people with HIV-1 appear to have a markedly higher rate of end-stage renal disease (ESRD) than healthy people (Kumar et al., 2005). This is because some of the risk factors associated with HIV-1 acquisition are the same as those that lead to kidney disease. Patients with chronic kidney disease have an increased risk of cardiovascular mortality (de Jager et al., 2014). Thus, HIV-1 infection is associated with cardiovascular mortality.
Cancer is a group of complex diseases caused by abnormalities in biomarker genes. In cancer genomics, gene expression levels provide important molecular signatures for all types of cancer, and these signatures can be very useful in predicting the comorbidity and survival of cancer patients. Evidence of increased comorbidity between CNS (Central Nervous System) disorders and certain cancers has existed for many years. For instance, Down's syndrome (DS) is strongly associated with an increased co-occurrence of cancer, specifically acute leukaemia, testicular cancer and some gastrointestinal cancers (Catalá-López et al., 2014). Mining high-throughput gene expression data to identify biomarkers associated with patient survival, and so achieve more accurate prognosis, is an ongoing challenge in prognostic studies of complex diseases. One of the main challenges of microarray gene expression data analysis is high dimensionality: the number of measured gene expression levels overwhelms the small number of cancer samples (Xu et al., 2010). To tackle this problem, variable selection has been applied to select significant subsets of genes in a microarray gene expression dataset. All of these approaches, however, have shown some limitations.
Many methods have been proposed for survival analysis on high-dimensional gene expression data with highly correlated variables (Van Wieringen et al., 2009, Witten and Tibshirani, 2010). Cox's proportional hazards model (Cox et al., 1972) is commonly used to explore the relationship between gene expression data and cancer survival when both censored and uncensored samples are present. It is the most popular survival model for describing the relationship between a patient's survival time and predictor variables (Cox, 1992). However, with high-dimensional data (e.g. in a microarray study), where the number of predictors (genes) far exceeds the number of subjects (patients), the Cox model cannot be fitted directly unless the high dimensionality is properly handled. A variety of regularisations with different penalties have therefore been proposed for gene selection and parameter estimation in high-dimensional microarray data, including the L1 penalty in lasso (least absolute shrinkage and selection operator) regression (Tibshirani et al., 1997, Gui and Li, 2005) and the adaptive lasso (Zhang and Lu, 2007, Zou, 2008). These methods select important variables by shrinking some regression coefficients to exactly zero. The penalties can be imposed on individual variables to remove unimportant ones automatically. The amount of shrinkage in the lasso is determined by a tuning parameter, which is often chosen by cross-validation. The model chosen by cross-validation contains many false positives, that is, selected variables whose true coefficients are actually zero. Hastie and Tibshirani (2004) and Hoshida et al. (2008) introduced an alternative method using the L2 (Euclidean norm) penalty in ridge regression. Although the L1 and L2 penalties were designed as statistical techniques to solve the high-dimensional data problem, they have some drawbacks.
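To make the penalised objective concrete, the following is a minimal sketch of the negative Cox log partial likelihood with an optional L1 (lasso) or L2 (ridge) penalty. The function names, the argument layout and the simple treatment of tied event times are assumptions made for illustration; this is not the implementation used in the study.

```python
import math

def neg_log_partial_likelihood(beta, X, time, event):
    """Negative Cox log partial likelihood (no special handling of tied times)."""
    # Linear predictor (log relative hazard) for every patient.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    total = 0.0
    for i, observed in enumerate(event):
        if not observed:
            continue  # censored patients contribute only through the risk sets
        # Risk set: everyone still under observation at patient i's event time.
        risk = [math.exp(eta[j]) for j in range(len(time)) if time[j] >= time[i]]
        total += eta[i] - math.log(sum(risk))
    return -total

def penalised_objective(beta, X, time, event, lam, penalty="l1"):
    """Objective minimised by lasso-type ("l1") or ridge-type ("l2") Cox regression."""
    if penalty == "l1":
        pen = sum(abs(b) for b in beta)   # L1 norm: can set coefficients to zero
    elif penalty == "l2":
        pen = sum(b * b for b in beta)    # squared L2 norm: shrinks, never zeroes
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return neg_log_partial_likelihood(beta, X, time, event) + lam * pen
```

Here `lam` plays the role of the tuning parameter discussed above; in practice it would be chosen by cross-validation and the objective minimised with a dedicated optimiser.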
The primary drawback of these penalties is that they ignore important prior information about gene structure, namely the modular relations among gene expressions. A better approach is to identify significant genes that are functionally related. Because this approach takes biological information into account, it leads to greater reliability. Groups of genes are co-expressed under different conditions through biological pathways or protein–protein interactions, and this provides prior information that helps to reduce high-dimensional data by removing confounding factors and statistical randomness from regression models (Chuang et al., 2007, Li and Li, 2008, Tian et al., 2009). Therefore, incorporating prior biological knowledge by exploiting the network structure in a statistical method is expected to improve its performance. The major advantage of network-based models is better generalisation across independent studies, because the network information is consistent with the conserved patterns in the gene expression data.
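One common way to encode such network information, in the style of the network-constrained regularisation of Li and Li (2008), is a graph-Laplacian penalty that encourages genes linked in the network to have similar (degree-scaled) coefficients. The sketch below assumes a symmetric adjacency matrix with non-negative weights and no self-loops; whether the present study uses exactly this form of penalty is not stated in this section, so it should be read as an illustration only.

```python
import math

def network_penalty(beta, adjacency):
    """Quadratic form beta' L beta for the normalised graph Laplacian L.

    Equivalent to summing, over every edge (u, v) with weight w,
    w * (beta_u / sqrt(d_u) - beta_v / sqrt(d_v)) ** 2, where d is the degree.
    """
    n = len(beta)
    degree = [sum(adjacency[u]) for u in range(n)]
    penalty = 0.0
    for u in range(n):
        for v in range(u + 1, n):        # visit each undirected edge once
            w = adjacency[u][v]
            if w == 0:
                continue                  # unlinked genes are not tied together
            diff = beta[u] / math.sqrt(degree[u]) - beta[v] / math.sqrt(degree[v])
            penalty += w * diff * diff
    return penalty
```

The penalty is zero when linked genes have equal degree-scaled coefficients and grows as neighbouring coefficients diverge, which is how the network structure smooths the estimated effects.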
In recent years, researchers have focussed on a particular data type, for example mRNA expression, to find profiles that are associated with particular diseases, prognosis, disease comorbidities and drug response. More recently, as the cost of collecting data using high-throughput technologies has decreased, studies have begun to integrate multiple data types collected from the same patient samples (Chalise et al., 2014). By analysing different types of data in isolation, we may miss important information that results from the coordinated activity of biological components at various levels. Several methods proposed in the last few years aim to integrate multiple data types into a single analysis (Chalise et al., 2014). As data collection at the genomic, transcriptomic, epigenomic and proteomic levels becomes easier, it will become increasingly important to integrate these data in order to predict disease comorbidities. High-throughput omics data such as messenger RNA (mRNA) expression, DNA copy number alterations, pathway dysregulation and DNA methylation, together with clinical information, can each provide a different view of the patient's molecular status at various levels. Combining clinical and molecular data types may improve the prediction accuracy of disease comorbidity. However, there is currently a shortage of effective and efficient statistical and bioinformatics methods for true integrative data analysis. So far, few methods have been proposed to integrate clinical and molecular data to obtain an accurate cancer prognosis. Here, we present methods for integrating different types of data by modelling associations between diseases in a multiplex network (in which a multilink between two nodes indicates the set of all links connecting these nodes in the different layers (Bianconi, 2013)).
The multiplex network allows us to model disease comorbidities by representing each data type as a layer of the network (Estrada and Gómez-Gardeñes, 2014). Importantly, this allows us to capture the interactions between the various types of data, such as the interdependence of pathway regulation with mRNA expression, or of mRNA expression with miRNA expression. Moreover, comorbidity prediction for complex diseases is critical in the field of medicine. As data collection at the genomic, transcriptomic, epigenomic and proteomic levels becomes easier, it will become increasingly important to integrate these molecular data with clinical information in order to predict disease comorbidities.
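As an illustration of the multiplex representation, the sketch below stores one edge set per layer and computes the multilink between two diseases as a tuple of 0/1 flags, one per layer. The layer names, disease labels and helper functions are hypothetical and chosen only for this example.

```python
def multilink(layers, u, v):
    """Tuple of 0/1 flags, one per layer, marking whether u and v are linked there."""
    pair = frozenset((u, v))              # undirected: order of u and v is irrelevant
    return tuple(int(pair in edges) for edges in layers.values())

def edge_overlap(layers, u, v):
    """Number of layers in which u and v are linked."""
    return sum(multilink(layers, u, v))

# Hypothetical three-layer disease network; each layer is a set of undirected edges.
layers = {
    "mRNA_expression": {frozenset(("disease_A", "disease_B")),
                        frozenset(("disease_B", "disease_C"))},
    "pathway": {frozenset(("disease_A", "disease_B"))},
    "clinical": {frozenset(("disease_A", "disease_C"))},
}

print(multilink(layers, "disease_A", "disease_B"))     # (1, 1, 0)
print(edge_overlap(layers, "disease_A", "disease_B"))  # 2
```

A pair of diseases linked in several layers (a high overlap) is supported by several independent data types, which is the kind of cross-layer evidence the multiplex model is meant to capture.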
Materials and methods
In this article, we present network-regularised Cox proportional hazards and multiplex network models to measure disease comorbidities from a diverse set of data and to predict survival in cancer. First, each biological dataset was pre-processed. In the second stage, we applied PCA to reduce the dimension of the initial sample data. After the PCA transform, a gene interaction network was built according to gene co-expression information, and the network regularised Cox regression model is
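The network-building step can be sketched as follows, assuming that co-expression is measured by the absolute Pearson correlation between gene profiles and that an edge is kept when it reaches a threshold. The threshold of 0.8, the function names and the input layout are assumptions; the PCA step is omitted, and the excerpt above does not specify how it feeds into the network.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equally long expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0                        # constant profile: treat as uncorrelated
    return cov / math.sqrt(var_a * var_b)

def coexpression_network(expression, threshold=0.8):
    """Map each gene pair with |r| >= threshold to its correlation r.

    `expression` maps a gene name to its expression values across samples.
    """
    genes = sorted(expression)
    edges = {}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:           # each unordered pair once
            r = pearson(expression[g], expression[h])
            if abs(r) >= threshold:
                edges[(g, h)] = r
    return edges
```

The resulting edge dictionary can be converted to an adjacency matrix and passed to a network penalty in the regularised Cox model.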
Results
We applied our network regularised Cox regression model and normalised mutual information approach to analyse the seven cancer microarray gene expression datasets. To this end, we pre-processed the raw microarray gene expression data. To allow comparison among the datasets, we normalised all the gene expression microarray datasets. In order to construct the network for survival analysis, we applied the PCA technique to reduce the dimensions of the datasets. Survival analysis mostly with the Cox
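For reference, normalised mutual information between two discrete variables can be computed as below. The excerpt does not say which variables are compared or which normalisation is used, so this sketch assumes two discretised profiles of equal length and the common normalisation I(X;Y) / sqrt(H(X) H(Y)).

```python
import math
from collections import Counter

def entropy(counts, n):
    """Shannon entropy (natural log) of a distribution given as counts out of n."""
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def normalised_mutual_information(x, y):
    """I(X;Y) / sqrt(H(X) * H(Y)) for two equally long sequences of labels."""
    n = len(x)
    h_x = entropy(Counter(x).values(), n)
    h_y = entropy(Counter(y).values(), n)
    h_xy = entropy(Counter(zip(x, y)).values(), n)   # joint entropy
    if h_x == 0 or h_y == 0:
        return 0.0                                    # a constant variable carries no information
    return (h_x + h_y - h_xy) / math.sqrt(h_x * h_y)
```

The score is 1 when the two labelings determine each other exactly and 0 when they are statistically independent in the sample.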
Discussion and conclusion
In this study, our model integrates gene network information into the Cox proportional hazard method to explore the co-expression or functional association among high-dimensional gene expression features in the gene network. The goal of this study is to gain deeper insights into the benefits and drawbacks of the regression techniques in order to identify cancer biomarkers that are useful for prognosis, diagnosis and treatment. The regression coefficient in this method is used to find the most
References (104)
- et al.
Activation of ras-ral pathway attenuates p53-independent DNA damage g2 checkpoint
J. Biol. Chem.
(2004)
- et al.
Identification of a common gene expression signature in dilated cardiomyopathy across independent microarray studies
J. Am. Coll. Cardiol.
(2006)
- et al.
Baseline plasma proteomic analysis to identify biomarkers that predict radiation-induced lung toxicity in patients receiving radiation for non-small cell lung cancer
J. Thorac. Oncol.
(2011)
- et al.
Comorbidity: a multidimensional approach
Trends Mol. Med.
(2013)
- Measuring comorbidity in older cancer patients
Eur. J. Cancer
(2000)
- et al.
Role of comorbidities in locally advanced cervical cancer patients administered preoperative chemoradiation: impact on outcome and treatment-related complications
Eur. J. Surg. Oncol. (EJSO)
(2012)
- et al.
Cxcl5, a promoter of cell proliferation, migration and invasion, is a novel serum prognostic marker in patients with colorectal cancer
Eur. J. Cancer
(2012)
- et al.
Differential expression of acat1 and acat2 among cells within liver, intestine, kidney, and adrenal of nonhuman primates
J. Lipid Res.
(2000)
- et al.
Panic disorder with familial bipolar disorder
Biol. Psychiatry
(1997)
- et al.
Glypican-3 expression in clear cell adenocarcinoma of the ovary
Mod. Pathol.
(2009)
Endogenous retrotransposition activates oncogenic pathways in hepatocellular carcinoma
Cell
Dbc2 significantly influences cell-cycle, apoptosis, cytoskeleton and membrane-trafficking pathways
J. Mol. Biol.
Experimentally derived metastasis gene expression profile predicts recurrence and death in patients with colon cancer
Gastroenterology
No paradox, no progress: inverse cancer comorbidity in people with other complex diseases
Lancet Oncol.
Survival prediction using gene expression data: a review and comparison
Comput. Stat. Data Anal.
Clustering of high-dimensional gene expression data with feature filtering methods and diffusion maps
Artif. Intell. Med.
Mckusick's online Mendelian inheritance in man (OMIM)
Nucleic Acids Res.
A new face and new challenges for online Mendelian inheritance in man (OMIM)
Hum. Mutat.
Atypical ductal hyperplasia at margin of breast biopsy-is re-excision indicated?
Ann. Surg. Oncol.
The genetic association database
Nat. Genet.
Statistical mechanics of multiplex networks: Entropy and overlap
Phys. Rev. E
The Structure and Dynamics of Multilayer Networks
A gene signature predicting for survival in suboptimally debulked patients with ovarian cancer
Cancer Res.
Tumor-infiltrating macrophages are associated with metastasis suppression in high-grade osteosarcoma: a rationale for treatment with macrophage activating agents
Clin. Cancer Res.
Pharmacogenomics of central nervous system (CNS) drugs
Drug Dev. Res.
Pasture-feeding of charolais steers influences skeletal muscle metabolism and gene expression
J. Physiol. Pharmacol.
Inverse and direct cancer comorbidity in people with central nervous system disorders: a meta-analysis of cancer incidence in 577,013 participants of 50 observational studies
Psychother. Psychosom.
Integrative clustering methods for high-dimensional molecular data
Transl. Cancer Res.
Identified differently expressed genes in renal cell carcinoma by using multiple microarray datasets running head: differently expressed genes in renal cell carcinoma
Eur. Rev. Med. Pharmacol. Sci.
Network-based classification of breast cancer metastasis
Mol. Syst. Biol.
Expression of the i kr components kcnh2 (rerg) and kcne2 (rmirp1) during late rat heart development
Exp. Mol. Med.
Risk of cancer following lumbar fusion surgery with recombinant human bone morphogenic protein-2 (rh-bmp-2)
Spine
Regression models and life tables
JR Stat. Soc. B
Regression models and life-tables
Breakthroughs in Statistics
Noncardiovascular mortality in ckd: an epidemiological perspective
Nat. Rev. Nephrol.
shRNA kinome screen identifies TBK1 as a therapeutic target for HER2+ breast cancer
Cancer Res.
Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the transbig multicenter independent validation series
Clin. Cancer Res.
From disease ontology to disease-ontology lite: statistical methods to adapt a general-purpose ontology for the test of gene–ontology associations
Bioinformatics
Gene expression omnibus: NCBI gene expression and hybridization array data repository
Nucleic Acids Res.
Annual report to the nation on the status of cancer, 1975–2010, featuring prevalence of comorbidity and impact on survival among persons with lung, colorectal, breast, or prostate cancer
Cancer
Functional parsing of driver mutations in the colorectal cancer genome reveals numerous suppressors of anchorage-independent growth
Cancer Res.
Communicability reveals a transition to coordinated behavior in multiplex networks
Phys. Rev. E
Tuning parameter selection in high dimensional penalized likelihood
J. R. Stat. Soc. Ser. B (Stat. Methodol.)
Smad4-mediated signaling inhibits intestinal neoplasia by inhibiting expression of β-catenin
Gastroenterology
Bioconductor: open software development for computational biology and bioinformatics
Genome Biol.
Seromic profiling of ovarian and pancreatic cancer
Proc. Natl. Acad. Sci.
The human disease network
Proc. Natl. Acad. Sci.
Penalized cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data
Bioinformatics
Online Mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders
Nucleic Acids Res.
Transcription factor egr-1 inhibits growth of hepatocellular carcinoma and esophageal carcinoma cells lines
World J. Gastroenterol.
- 1
Joint first author.