Research Article
Network regularised Cox regression and multiplex network models to predict disease comorbidities and survival of cancer

https://doi.org/10.1016/j.compbiolchem.2015.08.010

Abstract

In cancer genomics, gene expression levels provide important molecular signatures for all types of cancer, which can be very useful for predicting the survival of cancer patients. However, the main challenge of gene expression data analysis is high dimensionality: microarray data are characterised by a small number of samples and a large number of genes. To overcome this problem, a variety of penalised Cox proportional hazard models have been proposed. We introduce a novel network regularised Cox proportional hazard model and a novel multiplex network model to measure disease comorbidities and to predict the survival of cancer patients. Our methods are applied to analyse seven microarray cancer gene expression datasets: breast cancer, ovarian cancer, lung cancer, colon cancer, liver cancer, renal cancer and osteosarcoma. Firstly, we applied principal component analysis to reduce the dimensionality of the original gene expression data. Secondly, we applied a network regularised Cox regression model to the reduced gene expression datasets. Using the normalised mutual information method and the multiplex network model, we predict the comorbidities for liver cancer based on the integration of a diverse set of omics and clinical data, and we find the diseasome associations (disease–gene associations) among different cancers based on the identified common significant genes. Finally, we evaluated the approach with respect to the accuracy of survival prediction using ROC curves. We report that colon cancer, liver cancer and renal cancer share the CXCL5 gene, and breast cancer, ovarian cancer and renal cancer share the CCND2 gene. Our methods are useful for predicting patient survival and disease comorbidities more accurately, and can help improve the care of patients with comorbidity. Software in Matlab and R is available on our GitHub page: https://github.com/ssnhcom/NetworkRegularisedCox.git.

Introduction

Comorbidity refers to the presence of one or more coexisting diseases that appear simultaneously with another disease, or interdependently with each other (Capobianco and Liò, 2013). Sometimes, a comorbidity is considered to be a secondary diagnosis, having been detected at the same time as, or after, treatment for the principal diagnosis. It is associated with an elevated burden of symptoms, more complex clinical management, decreased length and quality of life and increased health-care costs (Valderas et al., 2009). Comorbidity has a significant predictive value on overall survival (Lagro et al., 2010). For instance, comorbidity has a main effect on survival in cancer, particularly for cancer of the cervix (Ferrandina et al., 2012). However, there is also inverse comorbidity, which is characterised by a lower-than-expected probability of certain diseases occurring in individuals diagnosed with other health conditions, and which has been reported in relation to cancer (Tabarés-Seisdedos and Rubenstein, 2013). Most of the diseases associated with inverse cancer comorbidity are related to the Central Nervous System (CNS) or are neuropsychiatric disorders (Tabarés-Seisdedos et al., 2011). There is some epidemiological evidence that patients with CNS disorders, such as Alzheimer's disease, Parkinson's disease and schizophrenia, have a lower-than-expected probability of developing some types of cancer (Ibáñez et al., 2014).

Data concerning the expression of various genes have been widely used for the prediction of the risk of disease. Disease–gene association studies provide valuable information for appropriate disease-specific diagnostic, prognostic, and therapeutic approaches (Tiffin et al., 2009). Understanding relationships between diseases and genes at the molecular level could help us to gain a better understanding of pathogenesis, leading to better prevention, treatment, and diagnosis (Du et al., 2009). Diseases are more likely to be comorbid if they share associated gene expression profiles (Park et al., 2009). These associations can be due to direct or indirect causal relationships and to shared risk factors among diseases (Moni and Liò, 2014, Liò et al., 2012). For instance, people with HIV-1 appear to have a markedly higher rate of end-stage renal disease (ESRD) than healthy people (Kumar et al., 2005). This is because some of the risk factors associated with HIV-1 acquisition are the same as those that lead to kidney disease. Patients with chronic kidney disease have an increased risk of cardiovascular mortality (de Jager et al., 2014). Thus, HIV-1 infection is indirectly associated with cardiovascular mortality.

Cancer is a group of complex diseases caused by abnormalities of biomarker genes. In cancer genomics, gene expression levels provide important molecular signatures for all types of cancer, which can be very useful in predicting the comorbidity and survival of cancer patients. Evidence of an increased comorbidity between CNS disorders and certain cancers has existed for many years. For instance, Down's syndrome (DS) is strongly associated with an increased co-occurrence of cancer, specifically acute leukaemia, testicular cancer and some gastrointestinal cancers (Catalá-López et al., 2014). Mining high-throughput gene expression data to identify biomarkers associated with patient survival, and so to achieve more accurate prognosis, is an ongoing challenge in prognostic studies of complex diseases. However, one of the main challenges of microarray gene expression data analysis is the high dimensionality, due to the overwhelming number of measures of gene expression levels compared to the small number of cancer samples (Xu et al., 2010). To tackle this problem, variable selection has been applied to select significant subsets of genes in a microarray gene expression dataset. However, all of these approaches have shown some limitations.

Many methods have been proposed for survival analysis on high-dimensional gene expression data with highly correlated variables (Van Wieringen et al., 2009, Witten and Tibshirani, 2010). To explore the relations between gene expression data and cancer survival with both censored and uncensored samples, Cox's proportional hazards model (Cox, 1972) is commonly used. It is the most popular survival model used to describe the relationship between the patient's survival time and predictor variables (Cox, 1992). However, when we have high-dimensional data (e.g. in a microarray study) where the number of predictors (genes) far exceeds the number of subjects (patients), the Cox model cannot be fitted directly unless the high dimensionality is properly handled. For this reason, a variety of regularisations with different penalties have been proposed for gene selection and parameter estimation in high-dimensional microarray data, including the L1 penalty in lasso (least absolute shrinkage and selection operator) regression (Tibshirani et al., 1997, Gui and Li, 2005) and the adaptive lasso (Zhang and Lu, 2007, Zou, 2008). These methods select important variables by shrinking some regression coefficients to exactly zero, and the penalties can be imposed on individual variables to automatically remove unimportant ones. The amount of shrinkage in the lasso is determined by the tuning parameter, which is often chosen by cross validation. The model chosen by this cross validation contains many false positives, that is, selected variables whose true coefficients are zero. Hastie and Tibshirani (2004) and Hoshida et al. (2008) introduced an alternative method using the L2 (Euclidean norm) penalty in ridge regression. Although the L1 and L2 penalties were designed as statistical techniques to solve the high-dimensional data problem, they have some drawbacks. The primary one is that these procedures ignore important prior information on gene structure regarding modular relations among gene expressions. A better approach is to identify the significant genes that are functionally related; because it takes biological information into account, it leads to greater reliability. Groups of genes are co-expressed in different conditions through biological pathways or protein–protein interactions, and this provides prior information for reducing high-dimensional data by removing confounding factors and statistical randomness from regression models (Chuang et al., 2007, Li and Li, 2008, Tian et al., 2009). Therefore, incorporating prior biological knowledge by exploiting the network structure in a statistical method is expected to improve its performance. The major advantage of network-based models is better generalisation across independent studies, because the network information is consistent with the conserved patterns in the gene expression data.
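As a rough illustration of how network information can enter a penalised Cox model, the sketch below evaluates a Cox negative log partial likelihood together with an L1 term and a graph-Laplacian smoothness term, in the spirit of Li and Li (2008). The exact penalty used in this article is not given in this excerpt, so the Laplacian form, the function names and the tuning parameters `lam1` and `lam2` are assumptions made for illustration only.

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, time, event):
    """Negative Cox log partial likelihood.

    Assumes no tied event times (no Breslow/Efron correction is applied).
    X: (n, p) predictors, time: (n,) follow-up times, event: (n,) 1 = event, 0 = censored.
    """
    eta = X @ beta                      # linear predictor per subject
    order = np.argsort(-time)           # sort subjects by decreasing time
    eta_sorted = eta[order]
    event_sorted = event[order]
    # log of sum of exp(eta) over the risk set {j : time_j >= time_i}
    log_risk = np.logaddexp.accumulate(eta_sorted)
    return -np.sum((eta_sorted - log_risk)[event_sorted == 1])

def graph_laplacian(A):
    """Unnormalised Laplacian L = D - A of a symmetric adjacency matrix A."""
    return np.diag(A.sum(axis=1)) - A

def network_penalised_objective(beta, X, time, event, A, lam1, lam2):
    """Assumed objective: scaled likelihood + lam1 * L1 norm + lam2 * beta' L beta."""
    n = X.shape[0]
    L = graph_laplacian(A)
    return (neg_log_partial_likelihood(beta, X, time, event) / n
            + lam1 * np.sum(np.abs(beta))
            + lam2 * (beta @ L @ beta))
```

The Laplacian term equals the sum over network edges of A_ij (beta_i - beta_j)^2, so it penalises large differences between the coefficients of connected genes, while the L1 term still drives some coefficients to zero. Minimising this objective would require an optimiser suited to the non-smooth L1 term, which is outside the scope of this sketch.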

In recent years, researchers have focussed on a particular data type, for example mRNA expression, to find profiles that are associated with particular diseases, prognosis, disease comorbidities and drug response. More recently, as the cost of collecting data using high-throughput technologies has decreased, studies have begun to integrate multiple data types collected from the same patient samples (Chalise et al., 2014). By analysing different types of data in isolation, we may miss important information that results from the coordinated activity of biological components at various levels. Several methods have been proposed in the last few years that aim to address the issue of integrating multiple data types into a single analysis (Chalise et al., 2014). As data collection at the genomic, transcriptomic, epigenomic and proteomic levels becomes easier, it will become increasingly important to integrate these molecular data with clinical information in order to predict disease comorbidities. High-throughput omics data such as messenger RNA (mRNA) expression, DNA copy number alterations, pathway dysregulation and DNA methylation, together with clinical information, can each provide a different view of the patient's molecular status at various levels. Combining clinical and molecular data types may improve the prediction accuracy of disease comorbidity. However, there is currently a shortage of effective and efficient statistical and bioinformatics methods for true integrative data analysis, and so far few methods have been proposed to integrate clinical and molecular data to obtain accurate cancer prognosis. Here, we present methods of integrating different types of data by modelling associations between diseases in a multiplex network, in which a multilink between two nodes indicates the set of all links connecting these nodes in the different layers (Bianconi, 2013). The multiplex network allows us to model disease comorbidities by representing each data type as a layer in the network (Estrada and Gómez-Gardeñes, 2014). Importantly, this allows us to capture the interactions between the various types of data, such as the interdependence of pathway regulation with mRNA expression, or of mRNA expression with miRNA expression. Moreover, comorbidity prediction for complex diseases is critical in the field of medicine.
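The multilink idea can be made concrete with a small sketch, given here only as an illustration under stated assumptions. Each layer is an adjacency matrix over the same set of disease nodes, and the multilink between two nodes is read off as one 0/1 entry per layer. The node labels, the two layer names and the toy adjacency values are hypothetical and are not results from this article; the helper names `multilink` and `layer_overlap` are likewise invented for the example.

```python
import numpy as np

def multilink(layers, i, j):
    """Multilink between nodes i and j: one 0/1 entry per layer."""
    return tuple(int(layer[i, j] != 0) for layer in layers)

def layer_overlap(layers):
    """For every node pair, the number of layers in which the pair is linked."""
    return sum((layer != 0).astype(int) for layer in layers)

# Toy multiplex with three disease nodes and two layers (hypothetical values).
nodes = ["disease A", "disease B", "disease C"]
mrna_layer = np.array([[0, 1, 0],
                       [1, 0, 1],
                       [0, 1, 0]])      # links from shared expression profiles
clinical_layer = np.array([[0, 1, 1],
                           [1, 0, 0],
                           [1, 0, 0]])  # links from clinical co-occurrence
layers = [mrna_layer, clinical_layer]

link_ab = multilink(layers, 0, 1)       # linked in both layers
link_ac = multilink(layers, 0, 2)       # linked only in the clinical layer
overlap = layer_overlap(layers)
```

In this toy network, diseases A and B are connected in both layers, whereas A and C are connected only in the clinical layer. A pair supported by several layers could be treated as a stronger comorbidity candidate, although how the layers are weighted or combined is a modelling choice that this sketch does not address.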

Section snippets

Materials and methods

In this article, we present the network-regularised Cox proportional hazard and multiplex network models to measure disease comorbidities based on a diverse set of data and to predict survival in cancer. First, each biological dataset was pre-processed. In the second stage, we applied PCA to reduce the dimension of the initial sample data. After the PCA transform, a gene interaction network was built according to gene co-expression information, and the network regularised Cox regression model is

Results

We applied our network regularised Cox regression model and normalised mutual information approach to analyse the seven cancer microarray gene expression datasets. To this end, we preprocessed the raw microarray gene expression data. To allow comparison among the datasets, we normalised all of the gene expression microarray datasets. In order to construct the network for survival analysis, we applied the PCA technique to reduce the dimensions of the datasets. Survival analysis mostly with the Cox

Discussion and conclusion

In this study, our model integrates gene network information into the Cox proportional hazard method to explore the co-expression or functional association among high-dimensional gene expression features in the gene network. The goal of this study is to gain deeper insights into the benefits and drawbacks of the regression techniques in order to identify cancer biomarkers that are useful for prognosis, diagnosis and treatment. The regression coefficient in this method is used to find the most

References (104)

  • R. Shukla et al.

    Endogenous retrotransposition activates oncogenic pathways in hepatocellular carcinoma

    Cell

    (2013)
  • V. Siripurapu et al.

    Dbc2 significantly influences cell-cycle, apoptosis, cytoskeleton and membrane-trafficking pathways

    J. Mol. Biol.

    (2005)
  • J.J. Smith et al.

    Experimentally derived metastasis gene expression profile predicts recurrence and death in patients with colon cancer

    Gastroenterology

    (2010)
  • R. Tabarés-Seisdedos et al.

    No paradox, no progress: inverse cancer comorbidity in people with other complex diseases

    Lancet Oncol.

    (2011)
  • W.N. Van Wieringen et al.

    Survival prediction using gene expression data: a review and comparison

    Comput. Stat. Data Anal.

    (2009)
  • R. Xu et al.

    Clustering of high-dimensional gene expression data with feature filtering methods and diffusion maps

    Artif. Intell. Med.

    (2010)
  • J. Amberger et al.

    McKusick's online Mendelian inheritance in man (OMIM)

    Nucleic Acids Res.

    (2009)
  • J. Amberger et al.

    A new face and new challenges for online Mendelian inheritance in man (OMIM)

    Hum. Mutat.

    (2011)
  • S. Arora et al.

    Atypical ductal hyperplasia at margin of breast biopsy-is re-excision indicated?

    Ann. Surg. Oncol.

    (2008)
  • K.G. Becker et al.

    The genetic association database

    Nat. Genet.

    (2004)
  • G. Bianconi

    Statistical mechanics of multiplex networks: Entropy and overlap

    Phys. Rev. E

    (2013)
  • S. Boccaletti et al.

    The Structure and Dynamics of Multilayer Networks

    (2014)
  • T. Bonome et al.

    A gene signature predicting for survival in suboptimally debulked patients with ovarian cancer

    Cancer Res.

    (2008)
  • E.P. Buddingh et al.

    Tumor-infiltrating macrophages are associated with metastasis suppression in high-grade osteosarcoma: a rationale for treatment with macrophage activating agents

    Clin. Cancer Res.

    (2011)
  • R. Cacabelos

    Pharmacogenomics of central nervous system (CNS) drugs

    Drug Dev. Res.

    (2012)
  • I. Cassar-Malek et al.

    Pasture-feeding of Charolais steers influences skeletal muscle metabolism and gene expression

    J. Physiol. Pharmacol.

    (2009)
  • F. Catalá-López et al.

    Inverse and direct cancer comorbidity in people with central nervous system disorders: a meta-analysis of cancer incidence in 577,013 participants of 50 observational studies

    Psychother. Psychosom.

    (2014)
  • P. Chalise et al.

    Integrative clustering methods for high-dimensional molecular data

    Transl. Cancer Res.

    (2014)
  • Y. Cheng et al.

    Identified differently expressed genes in renal cell carcinoma by using multiple microarray datasets running head: differently expressed genes in renal cell carcinoma

    Eur. Rev. Med. Pharmacol. Sci.

    (2014)
  • H.-Y. Chuang et al.

    Network-based classification of breast cancer metastasis

    Mol. Syst. Biol.

    (2007)
  • K.R. Chun et al.

    Expression of the IKr components KCNH2 (rERG) and KCNE2 (rMiRP1) during late rat heart development

    Exp. Mol. Med.

    (2004)
  • G.S. Cooper et al.

    Risk of cancer following lumbar fusion surgery with recombinant human bone morphogenic protein-2 (rh-bmp-2)

    Spine

    (2013)
  • D.R. Cox

    Regression models and life tables

    J. R. Stat. Soc. B

    (1972)
  • D.R. Cox

    Regression models and life-tables

    Breakthroughs in Statistics

    (1992)
  • D.J. de Jager et al.

    Noncardiovascular mortality in CKD: an epidemiological perspective

    Nat. Rev. Nephrol.

    (2014)
  • T. Deng et al.

    shRNA kinome screen identifies TBK1 as a therapeutic target for HER2+ breast cancer

    Cancer Res.

    (2014)
  • C. Desmedt et al.

    Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series

    Clin. Cancer Res.

    (2007)
  • P. Du et al.

    From disease ontology to disease-ontology lite: statistical methods to adapt a general-purpose ontology for the test of gene–ontology associations

    Bioinformatics

    (2009)
  • R. Edgar et al.

    Gene expression omnibus: NCBI gene expression and hybridization array data repository

    Nucleic Acids Res.

    (2002)
  • B.K. Edwards et al.

    Annual report to the nation on the status of cancer, 1975–2010, featuring prevalence of comorbidity and impact on survival among persons with lung, colorectal, breast, or prostate cancer

    Cancer

    (2014)
  • U. Eskiocak et al.

    Functional parsing of driver mutations in the colorectal cancer genome reveals numerous suppressors of anchorage-independent growth

    Cancer Res.

    (2011)
  • E. Estrada et al.

    Communicability reveals a transition to coordinated behavior in multiplex networks

    Phys. Rev. E

    (2014)
  • Y. Fan et al.

    Tuning parameter selection in high dimensional penalized likelihood

    J. R. Stat. Soc. Ser. B (Stat. Methodol.)

    (2013)
  • T.J. Freeman et al.

    Smad4-mediated signaling inhibits intestinal neoplasia by inhibiting expression of β-catenin

    Gastroenterology

    (2012)
  • R.C. Gentleman et al.

    Bioconductor: open software development for computational biology and bioinformatics

    Genome Biol.

    (2004)
  • S. Gnjatic et al.

    Seromic profiling of ovarian and pancreatic cancer

    Proc. Natl. Acad. Sci.

    (2010)
  • K.-I. Goh et al.

    The human disease network

    Proc. Natl. Acad. Sci.

    (2007)
  • J. Gui et al.

    Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data

    Bioinformatics

    (2005)
  • A. Hamosh et al.

    Online Mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders

    Nucleic Acids Res.

    (2005)
  • M.-W. Hao et al.

    Transcription factor egr-1 inhibits growth of hepatocellular carcinoma and esophageal carcinoma cells lines

    World J. Gastroenterol.

    (2002)
    1

    Joint first author.
