Full Length Article
Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities
Introduction
Understanding complex biological systems has been an ongoing quest for many researchers. The rapidly decreasing costs of high-throughput sequencing, the development of massively parallel technologies, and new sensor technologies have enabled the generation of data that describe biological systems along multiple dimensions. These dimensions include DNA sequence [1], epigenomic states [2], single-cell gene expression activity [3], proteomics [4], functional and phenotypic measurements [5], and ecological and lifestyle properties [6]. These technological advances in data generation have driven the field of bioinformatics for the past decade, producing ever-increasing amounts of data of different types as researchers develop data analysis tools. Many of these data types have associated analytical methods designed to examine one data type specifically. Using these methods, we have assembled some of the puzzle of biological architecture. Usually, however, the factors necessary to understand a phenomenon such as a disease cannot be captured by a single data type (Fig. 1). Much of the complexity in biology and medicine thus remains unexplained. If the field relies strictly on single-data-type studies, it will never be explained.
Ideally, one can combine different types of data to create a holistic picture of the cell, human health, and disease. Researchers have developed multiple approaches to do this, and thereby address the challenges brought forward by large and heterogeneous biomedical data. For example, one can identify DNA sequence variation through association studies in family- and population-based data, and then integrate it with molecular pathway information to predict the risk of developing a particular disease [7]. Data integration can have numerous meanings. Here, it means the process by which different types of biomedical data, in their broadest sense, are combined as predictor variables to allow for more thorough and comprehensive modeling of biomedically relevant outcomes. As reviewed previously (e.g., [8], [9], [10]), a data integration approach can achieve a more thorough and informative analysis of biomedical data than an approach that uses only a single data type. Combining multiple data types can compensate for missing or unreliable information in any single data type, and multiple sources of evidence pointing to the same outcome are less likely to lead to false positives. A complete model of a system like the human body is only likely to be discovered if information from different dimensions is considered, from the genome and transcriptome to the organismal environment.
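As an illustration of what "combined as predictor variables" can mean in its simplest form, the following minimal sketch standardizes two feature matrices measured on the same samples and joins them column-wise into a single predictor matrix. It is not a method described in this Review. The function names (zscore, integrate_blocks) and the toy "expression" and "methylation" matrices are hypothetical and introduced only for this example, and it assumes every data type covers the same samples in the same row order.

```python
import numpy as np


def zscore(block):
    """Standardize each column (feature) to zero mean and unit variance."""
    block = np.asarray(block, dtype=float)
    mean = block.mean(axis=0)
    std = block.std(axis=0)
    std[std == 0] = 1.0  # constant features stay at zero instead of dividing by zero
    return (block - mean) / std


def integrate_blocks(blocks):
    """Join per-data-type feature matrices column-wise.

    Assumption: every block describes the same samples in the same row order.
    """
    n_samples = {np.asarray(b).shape[0] for b in blocks}
    if len(n_samples) != 1:
        raise ValueError("all data types must cover the same samples")
    return np.hstack([zscore(b) for b in blocks])


# Hypothetical toy data: 6 samples, 3 expression features, 2 methylation features
rng = np.random.default_rng(0)
expression = rng.normal(size=(6, 3))
methylation = rng.uniform(size=(6, 2))

X = integrate_blocks([expression, methylation])
print(X.shape)  # (6, 5)
```

Standardizing each block before joining is one possible design choice that keeps a data type with large numeric ranges from dominating the combined matrix. The resulting matrix X could then be passed to any predictive model.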
In this Review, we describe the principles of data integration, and provide a taxonomy of machine learning methods presently in use to integrate biomedical data. We discuss current methods, implementations of these methods, and their successful applications in biology and medicine. Furthermore, we discuss challenges in optimally combining and interpreting data from multiple sources and the advantages of integrating multiple data types. For example, one technology may address shortcomings of another to provide a more precise insight into human disease. In addition, we provide our perspective on how integrative data analysis might develop in the future.
Section snippets
Challenges in data integration for biology and medicine
When one develops machine learning approaches to integrate biomedical data, several challenges arise. Biological and medical datasets have inherent complexity beyond their large sizes. Biomedical datasets are also high-dimensional, incomplete, biased, heterogeneous, dynamic, and noisy. We briefly describe these challenges below.
Biomedical data is often high-dimensional but sparse. This contrasts with large datasets in other domains, such as social networks, computer vision, and natural
Conceptual organization of methods for data integration
We broadly categorize data integration methods into two types of approaches. We refer to approaches that combine models and datasets across spatial and temporal scales as vertical data integration, which depends on integration of cellular, cell type, tissue, organism, and population models at several temporal scales [23], [26], [27]. In contrast, horizontal data integration focuses on combining datasets and models at one particular level [28], [29], for example, at the microbiome [30] or at the
Focus of this Review
This Review is intended for computational researchers who are interested in recent developments and applications of machine learning to biology and medicine and its potential for advancing biomedicine given the vast amounts of heterogeneous data being generated today. In the Review, we focus on statistical approaches and machine learning methods for data integration. We describe the principles of integrative approaches and provide an overview of some of the methods used to address various
Epigenomic variation and gene regulation
Individual cells within a multicellular organism usually have nearly identical DNA sequences, but still develop distinct cellular identities. These cellular identities manifest as diverse physical forms and behaviors, but ultimately represent differing programs of gene expression. The different gene expression programs also materialize in site-specific physical and chemical changes to the DNA and the thousands of biomolecules that interact with it. These include chemical modification of DNA
Noncoding variant effects
Researchers and medical professionals often want to know what effects DNA changes will have on cellular and organismal phenotype. While interpreting the effects of changes to the sequence coding for proteins is relatively easy, interpreting the noncoding sequence that makes up most of a complex organism’s genome has proven far more challenging. Many noncoding sequence variants are associated with particular phenotypic traits or genetic diseases [127]. Noncoding changes often cause phenotypic
Integrative single-cell analysis
A major question in biology is how to describe and quantify every cell in a multicellular organism [140], such as a human, which can contain a myriad of different cell types. Cell type, such as muscle or nerve, is typically defined based on the function of the tissue in which the cells reside and the unique morphological properties of that tissue [141]. However, considerable cell-to-cell variation within a single cell type indicates the existence of distinct cell states (e.g., mitotic,
Cellular phenotype and function
Our ability to generate sequence data has been improving at a rapid rate for the past decade, and this trend is likely to continue for the next decade (Section 5). A vast majority of these sequences are of proteins of unknown function and their worth could be substantially increased by knowing the biological roles that they play. Accurate annotation of protein function is a key to understanding life at the molecular level and has great biomedical and pharmaceutical implications. To this aim,
Computational pharmacology
The goal of computational pharmacology is to use data to predict and better understand how drugs affect the human body, support decision making in the drug discovery process, improve clinical practice and avoid unwanted side effects (for an excellent review, see [20], [252]). The properties of drugs and their interactions with the human body can be described in a variety of ways and measured at the physicochemical, pharmacological, and phenotypic levels. One can measure the physicochemical
Disease subtyping and biomarker discovery
Many diseases are characterized by incredible heterogeneity among patients. This includes many common diseases, of which neuropsychiatric and autoimmune disorders (e.g., Autism Spectrum Disorder (ASD), Attention Deficit Hyperactivity Disorder (ADHD), Obsessive Compulsive Disorder (OCD), arthritis, lupus, chronic fatigue syndrome (CFS)) are among the most diverse. This means that individuals present at the clinic with widely ranging symptoms. ASD patients, for example, range from those with mild
Challenges and future directions
There are great opportunities at the intersection of machine learning and biomedical data integration. However, there are equally great challenges that need to be overcome. In particular, the days of studying biomedical datasets in isolation and independently of each other are slowly coming to an end and the reductionist paradigms of looking for ‘low-hanging fruit’ (i.e., a single variable that would fully explain a trait) are becoming less prevalent. The realization that performing all
Conclusions
Machine learning is becoming integral to modern biomedical research. Importantly, approaches have emerged that can integrate data from many different biomedical datasets. These approaches aim to bridge the gap between our ability to generate vast amounts of data and our understanding of biomedical systems and thus reflect the intricate complexity of biology. Ongoing methodological developments and emerging applications of machine learning promise an exciting future for biomedical data
Acknowledgements
M.Z. and J.L. were supported in part by National Science Foundation IIS-1149837, NIH BD2K U54EB020405, DARPA SIMPLEX, Stanford Data Science Initiative and the Chan Zuckerberg Biohub. F.N. and M.M.H. were supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2015-03948 to M.M.H.).
References (355)
Elucidation of the impact of P-glycoprotein and breast cancer resistance protein on the brain distribution of catechol-O-methyltransferase inhibitors. Drug Metab. Dispos. (2017)
Interpreting the language of histone and DNA modifications. Biochimica et Biophysica Acta (2014)
Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods (2013)
Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature (2008)
Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science (2009)
An integrated encyclopedia of DNA elements in the human genome. Nature (2012)
Integrative analysis of 111 reference human epigenomes. Nature (2015)
Single-cell transcriptomic characterization of 20 organs and tissues from individual mice creates a Tabula Muris. bioRxiv (2018)
Mass-spectrometry-based draft of the human proteome. Nature (2014)
A global genetic interaction network maps a wiring diagram of cellular function. Science (2016)
Digital health: tracking physiomes and activity using wearable biosensors reveals useful health-related information. PLoS Biol.
Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat. Genet.
Methods of integrating data to uncover genotype-phenotype interactions. Nat. Rev. Genet.
Integrative omics for health and disease. Nat. Rev. Genet.
Statistical and integrative system-level analysis of DNA methylation data. Nat. Rev. Genet.
GWAS of 89,283 individuals identifies genetic variants associated with self-reporting of being a morning person. Nat. Commun.
Genome-wide prioritization of disease genes and identification of disease-disease associations from an integrated human functional linkage network. Genome Biol.
Network-based stratification of tumor mutations. Nat. Methods
Annotation of loci from genome-wide association studies using tissue-specific quantitative interaction proteomics. Nat. Methods
Data imputation in epistatic MAPs by network-guided matrix completion. J. Comput. Biol.
Identification of 15 genetic loci associated with risk of major depression in individuals of European descent. Nat. Genet.
Uncovering disease-disease relationships through the incomplete interactome. Science
Drug target identification using side-effect similarity. Science
Deep mining heterogeneous networks of biomedical linked data to predict novel drug–target associations. Bioinformatics
In silico methods for drug repurposing and pharmacology. Wiley Interdiscip. Rev. Syst. Biol. Med.
Siri of the cell: what biology could learn from the iPhone. Cell
Understanding multicellular function and disease with human tissue-specific networks. Nat. Genet.
Predicting multicellular function through multi-layer tissue networks. Bioinformatics
Does machine learning automate moral hazard and error? Am. Econ. Rev.
The multilayer nature of ecological networks. Nature Ecology & Evolution
Jumping across biomedical contexts using compressive data fusion. Bioinformatics
The International Human Epigenome Consortium Data Portal. Cell Syst.
Joint annotation of chromatin state and chromatin conformation reveals relationships among domain types and identifies domains of cell-type-specific expression. Genome Res.
Seasonal cycling in the gut microbiome of the Hadza hunter-gatherers of Tanzania. Science
Learning gene functional classifications from multiple data types. J. Comput. Biol.
Cross-modal integration for performance improving in multimedia: a review. Multimodal Processing and Interaction
Data fusion by matrix factorization. IEEE Trans. Pattern Anal. Mach. Intell.
Nimfa: A Python library for nonnegative matrix factorization. J. Mach. Learn. Res.
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res.
Graphlet-based characterization of directed networks. Sci. Rep.
A review of ensemble methods in bioinformatics. Curr. Bioinform.
Prediction of human functional genetic networks from heterogeneous data using RVM-based ensemble learning. Bioinformatics
LCE: a link-based cluster ensemble method for improved gene expression data analysis. Bioinformatics
Towards a piRNA prediction using multiple kernel fusion and support vector machine. Bioinformatics
Unsupervised multiple kernel learning for heterogeneous data integration. Bioinformatics
Survival regression by data fusion. Systems Biomedicine
Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods
Unsupervised extraction of stable expression signatures from public compendia with an ensemble of neural networks. Cell Syst.
Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics
GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol.