Elsevier

Information Fusion

Volume 50, October 2019, Pages 71-91
Full Length Article
Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities

https://doi.org/10.1016/j.inffus.2018.09.012

Highlights

  • New biomedical technologies generate measurements at scale and in multiple dimensions.

  • Large and diverse biomedical data present fundamentally new challenges for machine learning.

  • Integrative approaches combine different types of data to provide a comprehensive systems view.

  • Data integration creates a holistic picture of the cell, human body, and disease.

  • Advances in machine learning promise an exciting future for biomedical data integration.

Abstract

New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include a myriad of properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.

Introduction

Understanding complex biological systems has been an ongoing quest for many researchers. The rapidly decreasing costs of high-throughput sequencing, development of massively parallel technologies, and new sensor technologies have enabled generation of data that describe biological systems on multiple dimensions. These dimensions include DNA sequence [1], epigenomic states [2], single-cell gene expression activity [3], proteomics [4], functional and phenotypic measurements [5], and ecological and lifestyle properties [6]. These technological advances in data generation have driven the field of bioinformatics for the past decade, producing ever-increasing amounts of data of different types as researchers develop data analysis tools. Many of these data types have associated analytical methods designed to examine one data type specifically. Using these methods, we have assembled some of the puzzle of biological architecture. Usually, however, the factors necessary to understand a phenomenon such as a disease cannot be captured by a single data type (Fig. 1). Much of the complexity in biology and medicine thus remains unexplained. If the field relies strictly on single-data-type studies, it will never be explained.

Ideally, one can combine different types of data to create a holistic picture of the cell, human health, and disease. Researchers have developed multiple approaches to do this, and thereby address the challenges brought forward by large and heterogeneous biomedical data. For example, one can identify DNA sequence variation through association studies in family- and population-based data, and then integrate it with molecular pathway information to predict the risk of developing a particular disease [7]. Data integration can have numerous meanings; here, however, it is used to mean the process by which different types of biomedical data in their broadest sense are combined as predictor variables to allow for more thorough and comprehensive modeling of biomedically relevant outcomes. As reviewed previously (e.g., [8], [9], [10]), a data integration approach can achieve a more thorough and informative analysis of biomedical data than an approach that uses only a single data type. Combining multiple data types can compensate for missing or unreliable information in any single data type, and multiple sources of evidence pointing to the same outcome are less likely to lead to false positives. A complete model of a system like the human body is only likely to be discovered if information from different dimensions is considered, from the genome and transcriptome to organismal environment.
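To make the notion of combining data types "as predictor variables" concrete, the following is a minimal sketch in Python, not a method taken from this Review. It assumes two hypothetical data types measured on the same samples (the synthetic arrays `expression` and `methylation`), scales each block, concatenates them along the feature axis, and fits a simple ridge regression. The helper names `integrate_as_predictors` and `ridge_test_mse`, the simulated outcome, and all parameter values are illustrative assumptions.

```python
import numpy as np

def integrate_as_predictors(*blocks):
    """Scale each data type feature-wise, then concatenate along the feature axis."""
    scaled = []
    for block in blocks:
        block = np.asarray(block, dtype=float)
        std = block.std(axis=0)
        std[std == 0] = 1.0  # avoid division by zero for constant features
        # Scaling uses all samples for brevity; a real analysis would
        # estimate the mean and standard deviation on training data only.
        scaled.append((block - block.mean(axis=0)) / std)
    return np.hstack(scaled)

def ridge_test_mse(X, y, n_train, alpha=1.0):
    """Fit closed-form ridge regression on the first n_train samples; return MSE on the rest."""
    X_tr, X_te = X[:n_train], X[n_train:]
    y_tr, y_te = y[:n_train], y[n_train:]
    y_mean = y_tr.mean()
    gram = X_tr.T @ X_tr + alpha * np.eye(X.shape[1])
    weights = np.linalg.solve(gram, X_tr.T @ (y_tr - y_mean))
    return float(np.mean((X_te @ weights + y_mean - y_te) ** 2))

# Synthetic stand-ins (hypothetical): rows are the same 200 samples in both blocks.
rng = np.random.default_rng(0)
n_samples = 200
expression = rng.normal(size=(n_samples, 5))
methylation = rng.normal(size=(n_samples, 5))
# Assumed outcome: depends on one feature from each data type, plus noise.
outcome = expression[:, 0] + methylation[:, 0] + 0.1 * rng.normal(size=n_samples)

mse_expression = ridge_test_mse(integrate_as_predictors(expression), outcome, 150)
mse_methylation = ridge_test_mse(integrate_as_predictors(methylation), outcome, 150)
mse_combined = ridge_test_mse(integrate_as_predictors(expression, methylation), outcome, 150)

print(mse_expression, mse_methylation, mse_combined)
```

Because the simulated outcome depends on both blocks, each single-data-type model misses part of the signal, whereas the concatenated model can use both. Feature concatenation of this kind is often called early integration and is only one of several possible strategies; it is shown here because it is the simplest to state.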

In this Review, we describe the principles of data integration, and provide a taxonomy of machine learning methods presently in use to integrate biomedical data. We discuss current methods, implementations of these methods, and their successful applications in biology and medicine. Furthermore, we discuss challenges in optimally combining and interpreting data from multiple sources and the advantages of integrating multiple data types. For example, one technology may address shortcomings of another to provide a more precise insight into human disease. In addition, we provide our perspective on how integrative data analysis might develop in the future.

Section snippets

Challenges in data integration for biology and medicine

When one develops machine learning approaches to integrate biomedical data, several challenges arise. Biological and medical datasets have inherent complexity beyond their large sizes. Biomedical datasets are also high-dimensional, incomplete, biased, heterogeneous, dynamic, and noisy. We briefly describe these challenges below.

Biomedical data is often high-dimensional but sparse. This contrasts with large datasets in other domains, such as social networks, computer vision, and natural

Conceptual organization of methods for data integration

We broadly categorize data integration methods into two types of approaches. We refer to approaches that combine models and datasets across spatial and temporal scales as vertical data integration, which depends on integration of cellular, cell type, tissue, organism, and population models at several temporal scales [23], [26], [27]. In contrast, horizontal data integration focuses on combining datasets and models at one particular level [28], [29], for example, at the microbiome [30] or at the

Focus of this Review

This Review is intended for computational researchers who are interested in recent developments and applications of machine learning to biology and medicine and its potential for advancing biomedicine given the vast amounts of heterogeneous data being generated today. In the Review, we focus on statistical approaches and machine learning methods for data integration. We describe the principles of integrative approaches and provide an overview of some of the methods used to address various

Epigenomic variation and gene regulation

Individual cells within a multicellular organism usually have nearly identical DNA sequences, but still develop distinct cellular identities. These cellular identities manifest as diverse physical forms and behaviors, but ultimately represent differing programs of gene expression. The different gene expression programs also materialize in site-specific physical and chemical changes to the DNA and the thousands of biomolecules that interact with it. These include chemical modification of DNA

Noncoding variant effects

Researchers and medical professionals often want to know what effects DNA changes will have on cellular and organismal phenotype. While interpreting the effects of changes to the sequence coding for proteins is relatively easy, interpreting the noncoding sequence that makes up most of a complex organism’s genome has proven far more challenging. Many noncoding sequence variants are associated with particular phenotypic traits or genetic diseases [127]. Noncoding changes often cause phenotypic

Integrative single-cell analysis

A major question in biology is how to describe and quantify every cell in a multicellular organism [140], such as a human, which can contain a myriad of different types of cells. Cell type, such as muscle or nerve, is typically defined based on the function of the tissue in which the cells reside and the unique morphological properties of that tissue [141]. However, considerable cell-to-cell variation within a single cell type indicates the existence of distinct cell states (e.g., mitotic,

Cellular phenotype and function

Our ability to generate sequence data has been improving at a rapid rate for the past decade, and this trend is likely to continue for the next decade (Section 5). A vast majority of these sequences are of proteins of unknown function and their worth could be substantially increased by knowing the biological roles that they play. Accurate annotation of protein function is a key to understanding life at the molecular level and has great biomedical and pharmaceutical implications. To this aim,

Computational pharmacology

The goal of computational pharmacology is to use data to predict and better understand how drugs affect the human body, support decision making in the drug discovery process, improve clinical practice and avoid unwanted side effects (for an excellent review, see [20], [252]). The properties of drugs and their interactions with the human body can be described in a variety of ways and measured at the physicochemical, pharmacological, and phenotypic levels. One can measure the physicochemical

Disease subtyping and biomarker discovery

Many diseases are characterized by incredible heterogeneity among patients. This includes many common diseases of which neuropsychiatric and autoimmune disorders (e.g., Autism Spectrum Disorder (ASD), Attention Deficit Hyperactivity Disorder (ADHD), Obsessive Compulsive Disorder (OCD), arthritis, lupus, chronic fatigue syndrome (CFS)) are among the most diverse. This means that individuals present at the clinic with widely ranging symptoms. ASD patients, for example, range from those with mild

Challenges and future directions

There are great opportunities at the intersection of machine learning and biomedical data integration. However, there are equally great challenges that need to be overcome. In particular, the days of studying biomedical datasets in isolation and independently of each other are slowly coming to an end and the reductionist paradigms of looking for ‘low-hanging fruit’ (i.e., a single variable that would fully explain a trait) are becoming less prevalent. The realization that performing all

Conclusions

Machine learning is becoming integral to modern biomedical research. Importantly, approaches have emerged that can integrate data from many different biomedical datasets. These approaches aim to bridge the gap between our ability to generate vast amounts of data and our understanding of biomedical systems and thus reflect the intricate complexity of biology. Ongoing methodological developments and emerging applications of machine learning promise an exciting future for biomedical data

Acknowledgements

M.Z. and J.L. were supported in part by National Science Foundation IIS-1149837, NIH BD2K U54EB020405, DARPA SIMPLEX, Stanford Data Science Initiative and the Chan Zuckerberg Biohub. F.N. and M.M.H. were supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2015-03948 to M.M.H.).

References (355)

  • X. Li et al.

    Digital health: tracking physiomes and activity using wearable biosensors reveals useful health-related information

    PLoS Biol.

    (2017)
  • N. Chatterjee et al.

    Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies

    Nat. Genet.

    (2013)
  • M.D. Ritchie et al.

    Methods of integrating data to uncover genotype-phenotype interactions

    Nat. Rev. Genet.

    (2015)
  • K.J. Karczewski et al.

    Integrative omics for health and disease

    Nat. Rev. Genet.

    (2018)
  • A.E. Teschendorff et al.

    Statistical and integrative system-level analysis of DNA methylation data

    Nat. Rev. Genet.

    (2018)
  • Y. Hu et al.

    GWAS of 89,283 individuals identifies genetic variants associated with self-reporting of being a morning person

    Nat. Commun.

    (2016)
  • B. Linghu et al.

    Genome-wide prioritization of disease genes and identification of disease-disease associations from an integrated human functional linkage network

    Genome Biol.

    (2009)
  • M. Hofree et al.

    Network-based stratification of tumor mutations

    Nat. Methods

    (2013)
  • A. Lundby et al.

    Annotation of loci from genome-wide association studies using tissue-specific quantitative interaction proteomics

    Nat. Methods

    (2014)
  • M. Zitnik et al.

    Data imputation in epistatic MAPs by network-guided matrix completion

    J. Comput. Biol.

    (2015)
  • C.L. Hyde et al.

    Identification of 15 genetic loci associated with risk of major depression in individuals of European descent

    Nat. Genet.

    (2016)
  • J. Menche

    Uncovering disease-disease relationships through the incomplete interactome

    Science

    (2015)
  • M. Campillos

    Drug target identification using side-effect similarity

    Science

    (2008)
  • N. Zong et al.

    Deep mining heterogeneous networks of biomedical linked data to predict novel drug–target associations

    Bioinformatics

    (2017)
  • R.A. Hodos

    In silico methods for drug repurposing and pharmacology

    Wiley Interdiscip. Rev. Syst. Biol. Med.

    (2016)
  • A.-R. Carvunis et al.

    Siri of the cell: what biology could learn from the iPhone

    Cell

    (2014)
  • C.S. Greene et al.

    Understanding multicellular function and disease with human tissue-specific networks

    Nat. Genet.

    (2015)
  • M. Zitnik et al.

    Predicting multicellular function through multi-layer tissue networks

    Bioinformatics

    (2017)
  • S. Mullainathan et al.

    Does machine learning automate moral hazard and error?

    Am. Econ. Rev.

    (2017)
  • S. Pilosof et al.

    The multilayer nature of ecological networks

    Nature Ecology & Evolution

    (2017)
  • M. Zitnik et al.

    Jumping across biomedical contexts using compressive data fusion

    Bioinformatics

    (2016)
  • D. Bujold et al.

    The International Human Epigenome Consortium Data Portal

    Cell Syst.

    (2016)
  • M.W. Libbrecht et al.

    Joint annotation of chromatin state and chromatin conformation reveals relationships among domain types and identifies domains of cell-type-specific expression

    Genome Res.

    (2015)
  • S.A. Smits et al.

    Seasonal cycling in the gut microbiome of the Hadza hunter-gatherers of Tanzania

    Science

    (2017)
  • P. Pavlidis et al.

    Learning gene functional classifications from multiple data types

    J. Comput. Biol.

    (2002)
  • P. Maragos et al.

    Cross-modal integration for performance improving in multimedia: a review

    Multimodal Processing and Interaction

    (2008)
  • M. Zitnik et al.

    Data fusion by matrix factorization

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2015)
  • M. Zitnik et al.

    Nimfa: A Python library for nonnegative matrix factorization

    J. Mach. Learn. Res.

    (2012)
  • P. Vincent et al.

    Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion

    J. Mach. Learn. Res.

    (2010)
  • A. Sarajlić et al.

    Graphlet-based characterization of directed networks

    Sci. Rep.

    (2016)
  • P. Yang et al.

    A review of ensemble methods in bioinformatics

    Curr. Bioinform.

    (2010)
  • C.-C. Wu et al.

    Prediction of human functional genetic networks from heterogeneous data using RVM-based ensemble learning

    Bioinformatics

    (2010)
  • N. Iam-On et al.

    LCE: a link-based cluster ensemble method for improved gene expression data analysis

    Bioinformatics

    (2010)
  • J. Brayet et al.

    Towards a piRNA prediction using multiple kernel fusion and support vector machine

    Bioinformatics

    (2014)
  • J. Mariette et al.

    Unsupervised multiple kernel learning for heterogeneous data integration

    Bioinformatics

    (2017)
  • M. Zitnik et al.

    Survival regression by data fusion

    Systems Biomedicine

    (2014)
  • B. Wang et al.

    Similarity network fusion for aggregating data types on a genomic scale

    Nat. Methods

    (2014)
  • J. Tan et al.

    Unsupervised extraction of stable expression signatures from public compendia with an ensemble of neural networks

    Cell Syst.

    (2017)
  • M. Zitnik et al.

    Modeling polypharmacy side effects with graph convolutional networks

    Bioinformatics

    (2018)
  • S. Mostafavi et al.

    GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function

    Genome Biol.

    (2008)