Incorporating biological structure into machine learning models in biomedicine

In biomedical applications of machine learning, relevant information often has a rich structure that is not easily encoded as real-valued predictors. Examples of such data include DNA or RNA sequences, gene sets or pathways, gene interaction or coexpression networks, ontologies, and phylogenetic trees. We highlight recent examples of machine learning models that use structure to constrain model architecture or incorporate structured data into model training. For machine learning in biomedicine, where sample size is limited and model interpretability is crucial, incorporating prior knowledge in the form of structured data can be particularly useful. This area of research would benefit from performant open source implementations and independent benchmarking efforts.


Introduction
It can be challenging to distinguish signal from noise in biomedical datasets, and machine learning methods are particularly hampered when the amount of available training data is small. Incorporating biomedical knowledge into machine learning models can reveal patterns in noisy data [1] and aid model interpretation [2]. Biological knowledge can take many forms, including genomic sequences, pathway databases, gene interaction networks, and knowledge hierarchies such as the Gene Ontology [3]. However, there is often no canonical way to encode these structures as real-valued predictors.
Modelers must creatively decide how to encode biological knowledge that they expect will be relevant to the task.
Biomedical datasets often contain more input predictors than data samples [4,5]. A genetic study may genotype millions of single nucleotide polymorphisms (SNPs) in thousands of individuals, or a gene expression study may profile the expression of thousands of genes in tens of samples. Thus, it can be useful to include prior information describing relationships between predictors to inform the representation learned by the model. This contrasts with non-biological applications of machine learning, where one might fit a model on millions of images [6] or tens of thousands of documents [7], making inclusion of prior information unnecessary.
We review approaches that incorporate external information about the structure of desirable solutions to learn from biomedical data. One class of commonly used approaches learns a representation that considers the context of each base pair from raw sequence data. For models that operate on gene expression data or genetic variants, it can be useful to incorporate networks or pathways describing relationships between genes. We also consider other examples, such as neural network architectures that are constrained based on biological knowledge.
There are many complementary ways to incorporate heterogeneous sources of biomedical data into the learning process, which have been covered elsewhere [8,9]. These include feature extraction or representation learning prior to modeling and/or other data integration methods that do not necessarily involve customizing the model itself.

Sequence models
Early neural network models primarily used hand-engineered sequence features as input to a fully connected neural network [10,11] (Figure 1). As convolutional neural network (CNN) approaches matured for image processing and computer vision, researchers leveraged biological sequence proximity similarly. CNNs are a neural network variant that groups input data by spatial context to extract features for prediction.
The definition of "spatial context" is specific to the input: one might group image pixels that are nearby in 2D space, or genomic base pairs that are nearby in the linear genome. In this way, CNNs consider context without making strong assumptions about exactly how much context is needed or how it should be encoded; the data informs the encoding. A detailed description of how CNNs are applied to sequences can be found in Angermueller et al. [12].
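As a minimal sketch of how a convolutional filter uses sequence context, the example below one-hot encodes a DNA string and slides a single filter along it. The filter here is set by hand to match the hypothetical motif "GATA" purely for illustration; in a trained CNN the filter weights would be learned from data. The function names are our own and do not come from any of the cited tools.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases (e.g. N) stay all-zero."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded

def conv1d_scan(encoded, kernel):
    """Slide a (width, 4) filter along the sequence (stride 1, no padding)."""
    width = kernel.shape[0]
    n_windows = encoded.shape[0] - width + 1
    return np.array([np.sum(encoded[i:i + width] * kernel) for i in range(n_windows)])

# Hand-set filter (an assumption for illustration): an exact match to "GATA".
motif_filter = one_hot("GATA")

seq = "ATAGCGATATCGCT"
scores = conv1d_scan(one_hot(seq), motif_filter)
best = int(np.argmax(scores))  # start position of the best-matching window
```

Each score counts how many positions in a window agree with the filter, so the output is a position-wise feature that depends only on nearby bases. Stacking many learned filters and layers is what lets a CNN build up larger sequence features without hand-engineering them.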

[Figure 1 panels: sequence data (A T A G C G A T A T C G C T); tabular features defined by hand]

Applications in regulatory biology
Many early applications of deep learning to biological sequences were in regulatory biology. Early CNNs for sequence data predicted binding protein sequence specificity from DNA or RNA sequence [13], variant effects from noncoding DNA sequence [14], and chromatin accessibility from DNA sequence [15].
Recent sequence models take advantage of hardware advances and methodological innovation to incorporate more sequence context and rely on fewer modeling assumptions. BPNet, a CNN that predicts transcription factor binding profiles from DNA sequences, accurately mapped known locations of binding motifs in mouse embryonic stem cells [16]. BPNet uses a technique called dilated convolutions [17] to consider 1000 base pairs of context around each position when predicting binding probabilities, which is particularly important because motif spacing and periodicity can influence binding. cDeepbind [18] combines RNA sequences with information about secondary structure to predict RNA binding protein affinities. Its convolutional model acts on a feature vector combining sequence and structural information, using context for both to inform predictions. APARENT [19] is a CNN that predicts alternative polyadenylation (APA) from a training set of over 3 million synthetic APA reporter sequences. These diverse applications underscore the power of modern deep learning models to synthesize large sequence datasets.
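To make concrete why dilation helps a model see more sequence context, the sketch below computes the receptive field of a stack of stride-1 convolutions. The layer configuration (nine layers of width 3 with dilation doubling at each layer) is a hypothetical example chosen for illustration and is not taken from the BPNet paper.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field, in input positions, of stacked stride-1 convolutions."""
    field = 1
    for kernel, dilation in zip(kernel_sizes, dilations):
        # Each layer extends the field by (kernel - 1) gaps of size `dilation`.
        field += (kernel - 1) * dilation
    return field

# Hypothetical stack: nine width-3 layers with dilations 1, 2, 4, ..., 256.
n_layers = 9
doubling = [2 ** i for i in range(n_layers)]

dilated = receptive_field([3] * n_layers, doubling)        # 1023 positions
undilated = receptive_field([3] * n_layers, [1] * n_layers)  # 19 positions
```

With the same number of layers and parameters, the dilated stack covers roughly a kilobase of sequence while the undilated stack covers 19 bases, which is the sense in which dilation buys context cheaply.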
Models that consider sequence context have also been applied to epigenetic data. DeepSignal [20] is a CNN that uses contextual electrical signals from Oxford Nanopore single-molecule sequencing data to predict 5mC or 6mA DNA methylation status. MRCNN [21] uses sequences of length 400, centered at CpG sites, to predict 5mC methylation status. Deep learning models have also been used to predict gene expression from histone modifications [22,23]. Here, a neural network model consisting of long short-term memory (LSTM) units was used to encode the long-distance interactions of histone marks in both the 3' and 5' genomic directions. In each of these cases, proximity in the linear genome helped model the complex interactions between DNA sequence and epigenome.

Applications in variant calling and mutation detection
Identification of genetic variants also benefits from models that include sequence context. DeepVariant [24] applies a CNN to images of sequence read pileups, using read data around each candidate variant to accurately distinguish true variants from sequencing errors. CNNs have also been applied to single molecule (PacBio and Oxford Nanopore) sequencing data [25], using a different sequence encoding that results in better performance than DeepVariant on single molecule data. However, many variant calling models still use hand-engineered sequence features as input to a classifier, including current state-of-the-art approaches to insertion/deletion calling [26,27]. Detection of somatic mutations is a distinct but related challenge to detection of germline variants, and has also recently benefited from use of CNNs [28].

Network- and pathway-based models
Rather than operating on sequences, many machine learning models in biomedicine operate on inputs that lack intrinsic order. Models may make use of gene expression data matrices from RNA sequencing or microarray experiments in which rows represent samples and columns represent genes. To account for relationships between genes, one might incorporate known interactions or correlations when making predictions or generating a low-dimensional representation of the data (Figure 2). This is comparable to the manner in which sequence context pushes models to consider nearby base pairs similarly.

[Figure 2 panels: tabular data; gene sets, networks, etc.]
Figure 2: The relationships between genes provide structure that can be incorporated into machine learning models. One common approach is to use a network or collection of gene sets to embed the data in a lower-dimensional space, in which genes that are in the same gene sets or that are well-connected in the network have a similar representation in the lower-dimensional space. The embedded data can then be used for classification or clustering tasks.

Applications in transcriptomics
Models built from gene expression data can benefit from incorporating gene-level relationships. One form that this knowledge commonly takes is a database of gene sets, which may represent biological pathways or gene signatures for a biological state of interest. PLIER [29] uses gene set information from MSigDB [30] and cell type markers to extract a representation of gene expression data that corresponds to biological processes and reduces technical noise. The resulting gene set-aligned representation accurately decomposed cell type mixtures. MultiPLIER [31] applied PLIER to the recount2 gene expression compendium [32] to develop a model that shares information across multiple tissues and diseases, including rare diseases with limited sample sizes. PASNet [33] uses MSigDB to inform the structure of a neural network for predicting patient outcomes in glioblastoma multiforme (GBM) from gene expression data. This approach aids interpretation, as pathway nodes in the network with high weights can be inferred to correspond to pathways that are relevant to GBM outcome prediction.
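One simple way to let gene sets inform network structure is to mask the weights of a layer so that each pathway node only receives input from its member genes. The sketch below illustrates that idea with NumPy; the gene names, gene sets, and function names are hypothetical, and this is not the PASNet implementation.

```python
import numpy as np

# Hypothetical genes and gene sets, for illustration only.
genes = ["G1", "G2", "G3", "G4", "G5"]
gene_sets = {"pathway_A": {"G1", "G2"}, "pathway_B": {"G2", "G3", "G4"}}

def membership_mask(genes, gene_sets):
    """Binary (n_genes, n_pathways) matrix: 1 where the gene belongs to the pathway."""
    names = sorted(gene_sets)
    mask = np.zeros((len(genes), len(names)))
    for j, name in enumerate(names):
        for i, gene in enumerate(genes):
            if gene in gene_sets[name]:
                mask[i, j] = 1.0
    return mask, names

def masked_layer(x, weights, mask):
    """ReLU layer in which gene-to-pathway weights outside the mask are zeroed."""
    return np.maximum(0.0, x @ (weights * mask))

rng = np.random.default_rng(0)
mask, pathway_names = membership_mask(genes, gene_sets)
weights = rng.normal(size=mask.shape)       # stand-in for learned weights
x = rng.normal(size=(3, len(genes)))        # 3 samples of toy expression data
pathway_activations = masked_layer(x, weights, mask)
```

Because G5 belongs to neither gene set, it cannot influence any pathway node, and the layer has 5 free weights rather than 10. In a framework with automatic differentiation, applying the same mask in the forward pass also keeps the masked weights from affecting the loss during training.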
Gene-level relationships can also be represented with networks. Network nodes typically represent genes and real-valued edges may represent interactions or correlations between genes, often in a tissue or cell type context of interest. netNMF-sc [34] incorporates coexpression networks [35] as a smoothing term for dimension reduction and dropout imputation in single-cell gene expression data. The coexpression network improves performance for identifying cell types and cell cycle marker genes, as compared to using raw gene expression or other single-cell dimension reduction methods.
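A common way to express a network smoothing term is a graph Laplacian penalty, which is small when connected genes have similar low-dimensional representations. The sketch below shows this generic penalty on a toy network; it is an illustration under our own assumptions and not the exact objective used by netNMF-sc or the other methods cited here.

```python
import numpy as np

def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A for a symmetric adjacency matrix."""
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

def smoothness_penalty(embedding, laplacian):
    """tr(E^T L E), which equals the sum over edges of A_ij * ||e_i - e_j||^2.

    Rows of `embedding` are genes; columns are latent dimensions.
    """
    return float(np.trace(embedding.T @ laplacian @ embedding))

# Toy network on four genes with edges 0-1 and 2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = graph_laplacian(A)

# Connected genes share a representation: no penalty.
smooth = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
# Connected genes have different representations: penalized.
rough = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

penalty_smooth = smoothness_penalty(smooth, L)
penalty_rough = smoothness_penalty(rough, L)
```

Adding a weighted version of this penalty to a reconstruction or prediction loss pulls the representations of neighboring genes together, which is the intuition behind using a network as a smoothing term.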
Combining gene expression data with a network-derived smoothing term also improved prediction of patient drug response in acute myeloid leukemia [36] and detection of mutated cancer genes [37]. PIMKL [38] combines network and pathway data to predict disease-free survival from breast cancer cohorts. This method takes as input both RNA-seq gene expression data and copy number alteration data, but can also be applied to gene expression data alone.
Gene regulatory networks can also augment models for gene expression data. These networks describe how the expression of genes is modulated by biological regulators such as transcription factors, microRNAs, or small molecules. creNET [39] integrates a gene regulatory network, derived from STRING [40], with a sparse logistic regression model to predict phenotypic response in clinical trials for ulcerative colitis and acute kidney rejection. The gene regulatory information allows the model to identify the biological regulators associated with the response, potentially giving mechanistic insight into differential clinical trial response. GRRANN [41], which was applied to the same data as creNET, uses a gene regulatory network to inform the structure of a neural network. Several other methods [42,43] have also used gene regulatory network structure to constrain the structure of a neural network, reducing the number of parameters to be fit and facilitating interpretation.

Applications in genetics
Approaches that incorporate gene set or network structure into genetic studies have a long history [44,45]. Recent applications include expression quantitative trait loci (eQTL) mapping studies, which aim to identify associations between genetic variants and gene expression. netReg [46] implements a graph-regularized dual LASSO algorithm for eQTL mapping [47] in a publicly available R package. This model smooths regression coefficients simultaneously based on networks describing associations between genes (target variables in the eQTL regression model) and between variants (predictors in the eQTL regression model). eQTL information is also used in conjunction with genetic variant information to predict phenotypes, in an approach known as Mendelian randomization (MR). In [48], a smoothing term derived from a gene regulatory network is used in an MR model. The model with the network smoothing term, applied to a human liver dataset, more robustly identifies genes that influence enzyme activity than a network-agnostic model. As genetic datasets grow, we expect that researchers will continue to develop models that leverage gene set and network databases.
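For readers who prefer to see the idea written out, a graph-regularized sparse regression of the kind described above for eQTL mapping can be sketched as the following objective. This is a generic form written under our own notational assumptions and may differ in detail from the objective implemented in netReg: X is the n-by-p variant matrix, Y is the n-by-q gene expression matrix, B is the p-by-q coefficient matrix, L_X and L_Y are graph Laplacians of the variant and gene networks, and lambda, psi_X, and psi_Y are tuning parameters.

```latex
\min_{B}\;
  \lVert Y - XB \rVert_F^2
  + \lambda \lVert B \rVert_1
  + \psi_X \operatorname{tr}\!\left(B^{\top} L_X B\right)
  + \psi_Y \operatorname{tr}\!\left(B L_Y B^{\top}\right)
```

The first trace term encourages connected variants to have similar coefficient profiles across genes, and the second encourages connected genes to have similar coefficient profiles across variants, so the coefficients are smoothed along both networks at once while the L1 term keeps the solution sparse.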

Other models incorporating biological structure
Knowledge about biological entities is often organized in an ontology, which is a directed graph that encodes relationships between entities (see Figure 3 for a visual example). The Gene Ontology (GO) [3] describes the relationships between cellular subsystems and other attributes describing proteins or genes. DCell [49] uses GO to inform the connectivity of a neural network predicting the effects of gene deletions on yeast growth. DCell performs comparably to an unconstrained neural network for this task. Additionally, it is easier to interpret: a cellular subsystem with high neuron outputs under a particular gene deletion can be inferred to be strongly affected by the gene deletion, providing a putative genotype-phenotype association. DeepGO [50] uses a similar approach to predict protein function from amino acid sequence with a neural network constrained by the dependencies of GO. However, a follow-up paper by the same authors [51] showed that this hierarchy-aware approach can be outperformed by a hierarchy-naive CNN, which uses only amino acid sequence and similarity to labeled training set proteins. This suggests a tradeoff between interpretability and predictive accuracy for protein function prediction.

[Figure 3 panels: structure of ontology, phylogeny, etc.; tabular data]
Figure 3: Directed graph-structured data, such as an ontology or phylogenetic tree, can be incorporated into machine learning models. Here, the connections in the neural network used to predict a set of labels parallel those in the tree graph. This type of constraint can also be useful in model interpretation: for example, if the red-shaded nodes have high neuron outputs for a given sample, then the subsystem encoded in the red-shaded part of the tree graph is most likely important in making predictions for that sample.
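To illustrate how a hierarchy can determine network connectivity, the sketch below derives, for each term in a toy ontology, the inputs its node would receive: the genes annotated directly to the term plus its child terms. The ontology, gene names, and function names are hypothetical, and this is a simplified illustration rather than the DCell or DeepGO implementation.

```python
# Toy ontology (hypothetical): each term maps to its parent terms.
parents = {
    "dna_repair": ["stress_response"],
    "heat_shock": ["stress_response"],
    "stress_response": ["root"],
    "root": [],
}

# Genes annotated directly to each term (hypothetical).
direct_genes = {
    "dna_repair": ["G1", "G2"],
    "heat_shock": ["G3"],
    "stress_response": ["G4"],
    "root": [],
}

def term_inputs(parents, direct_genes):
    """Inputs to each term's node: directly annotated genes plus child terms."""
    inputs = {term: list(genes) for term, genes in direct_genes.items()}
    for child, term_parents in parents.items():
        for parent in term_parents:
            inputs[parent].append(child)
    return inputs

def genes_reaching(term, inputs):
    """All genes with a path to `term` through the hierarchy."""
    found = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for item in inputs[node]:
            if item in inputs:      # item is a child term: keep descending
                stack.append(item)
            else:                   # item is a gene
                found.add(item)
    return found

inputs = term_inputs(parents, direct_genes)
stress_genes = genes_reaching("stress_response", inputs)
heat_shock_genes = genes_reaching("heat_shock", inputs)
```

In this arrangement a node for a broad term only sees a gene through the more specific subsystems beneath it, which is what makes the activation of an intermediate node interpretable as the state of that subsystem.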
Phylogenetic trees, or hierarchies describing the evolutionary relationships between species, can be useful for a similar purpose. glmmTree [52] uses a phylogenetic tree describing the relationship between microorganisms to improve predictions of age based on gut microbiome data. The same authors combine a similar phylogeny smoothing strategy with sparse regression to model caffeine intake and smoking status based on microbiome data [53]. Phylogenetic trees can also describe the relationships between subclones of a tumor, which are fundamental to understanding cancer evolution and development. Using a tumor phylogeny inferred from copy number aberration (CNA) sequencing data as a smoothing term for deconvolving tumor subclones provided more robust predictions than a phylogeny-free model [54]. The tree structure of the phylogeny and the subclone mixture model are fit jointly to the CNA data.
Depending on the application, other forms of structure or prior knowledge can inform predictions and interpretation of the model's output. CYCLOPS [55] uses a circular node autoencoder [56] to order periodic gene expression data and estimate circadian rhythms. The authors validated the method by correctly ordering samples without temporal labels and identifying genes with known circadian expression. They then applied it to compare gene expression in normal and cancerous liver biopsies, identifying drug targets with circadian expression as candidates for chronotherapy. NetBiTE [57] uses drug-gene interaction information from GDSC [58], in addition to protein interaction data, to build a tree ensemble model with splits that are biased toward high-confidence drug-gene interactions. The model predicts sensitivity to drugs that inhibit critical signaling pathways in cancer, showing improved predictive performance compared to random forests, another commonly used tree ensemble model.
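The idea behind a circular bottleneck can be sketched in a few lines: a pair of bottleneck activations is projected onto the unit circle, so each sample is summarized by a single angle that can be read as a phase. The example below is a minimal illustration under that assumption, with made-up activation values and our own function names; it is not the CYCLOPS implementation, and it assumes no sample has an all-zero activation pair.

```python
import numpy as np

def circular_node(z):
    """Project (n_samples, 2) bottleneck activations onto the unit circle."""
    norm = np.linalg.norm(z, axis=1, keepdims=True)
    return z / norm

def phase(z):
    """Angle of each projected sample, in radians within [0, 2*pi)."""
    unit = circular_node(z)
    return np.mod(np.arctan2(unit[:, 1], unit[:, 0]), 2 * np.pi)

# Hypothetical bottleneck activations for four samples.
z = np.array([[2.0, 0.0],
              [0.0, 0.5],
              [-3.0, 0.0],
              [0.0, -1.0]])

angles = phase(z)            # 0, pi/2, pi, 3*pi/2
order = np.argsort(angles)   # samples ordered by inferred phase
```

Because only the angle survives the projection, the bottleneck can represent position along a cycle but not magnitude, which is the structural constraint that suits periodic processes such as circadian expression.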

Conclusions and future directions
As the quantity and richness of biomedical data have increased, sequence repositories and interaction databases have expanded and become more robust. This raises opportunities to integrate these resources into the structure of machine learning models. Going forward, there is an outstanding need for benchmarks comparing these approaches across diverse datasets and prediction problems, along the lines of the evaluation in [59] but updated and expanded to include recent methods and applications. Improved benchmarking should lead to a better understanding of which dataset characteristics align with which approaches.
Many methods described in this review have open-source implementations available; however, increased availability of performant and extensible implementations of these models and algorithms would facilitate further use and development. In the future, we foresee that incorporating structured biomedical data will become commonplace for improving model interpretability and boosting performance when sample size is limited.
Annotation for BPNet [16]: This paper describes BPNet, a neural network for predicting transcription factor (TF) binding profiles from raw DNA sequence. The model is able to accurately infer the spacing and periodicity of pluripotency-related TFs in mouse embryonic stem cells, leading to an improved understanding of the motif syntax of combinatorial TF binding in cell development.
[*] Annotation for cDeepbind [18]: cDeepbind is a neural network model for predicting RNA binding protein (RBP) specificity from RNA sequence and secondary structure information. The authors show that this combined approach provides an improvement over previous models that use only sequence information.
[*] Annotation for DeepDiff [23]: DeepDiff uses a long short-term memory neural network to predict differential gene expression from the spatial structure of histone modification measurements. The network has a multi-task objective, enabling gene expression predictions to be made simultaneously in multiple cell types.
[**] Annotation for DeepVariant [24]: This paper describes DeepVariant, a neural network model for distinguishing true genetic variants from errors in next-generation DNA sequencing data. The model adapts techniques from the image processing community to fit a model on images of read pileups around candidate variants, using information about the sequence around the candidate variant site to make predictions about the true genotype at the site.
[**] Annotation for PLIER [29]: This paper describes a "pathway-level information extractor" (PLIER), a method for reducing the dimension of gene expression data in a manner that aligns with known biological pathways or informative gene sets. The method can also reduce the effects of technical noise. The authors show that PLIER can be used to improve cell type inference and as a component in eQTL studies.
[**] Annotation for netNMF-sc [34]: netNMF-sc is a dimension reduction method that uses network information to "smooth" a matrix factorization of single-cell gene expression data, such that genes that are connected in the network have a similar low-dimensional representation. Inclusion of network information is particularly useful when analyzing single-cell expression data, due to its ability to mitigate "dropouts" and other sources of variability that are present at the single cell level.
[*] Annotation for Attribution Priors [36]: This paper describes "model attribution priors", or a method for constraining a machine learning model's behavior during training with prior beliefs or expectations about the data or problem structure. As an example of this concept, the authors show that incorporation of network data improves the performance of a model for drug response prediction in acute myeloid leukemia.
[*] Annotation for PIMKL [38]: In this paper, the authors present an algorithm for combining gene expression and copy number data with prior information, such as gene networks and pathways or gene set annotations, to predict survival in breast cancer. The weights learned by the model are also interpretable, providing a putative set of explanatory features for the prediction task.
[**] Annotation for creNET [39]: This work describes creNET, a regression model for gene expression data that uses information about gene regulation to differentially weight or penalize gene sets that are co-regulated. The authors show that the model can be used to predict phenotype from gene expression data in clinical trials. The model also provides interpretable weights for each gene regulator.
[**] Annotation for DCell [49]: This paper presents DCell, a neural network model for prediction of yeast growth phenotype from gene deletions. The structure of the neural network is constrained by the relationships encoded in the Gene Ontology (GO), enabling predictions for a given input to be interpreted based on the subsystems of GO that they activate. Thus, the neural network can be seen as connecting genotype to phenotype.
[*] Annotation for DeepGO [50]: Here, the authors describe a method for predicting protein function from amino acid sequence, incorporating the dependency structure of the Gene Ontology (GO) into their neural network used for prediction. Using the GO information provides a performance improvement over similar models that do not incorporate this information.
[**] Annotation for NetBiTE [57]: This paper describes a method for using prior knowledge about drug targets to inform the structure of a tree ensemble model, used for predicting IC50 drug sensitivity from gene expression data. The model also uses a protein interaction network to "smooth" the gene weights, such that genes that are related in the network will have a similar influence on predictions.
Figure 1: Contrasting approaches to extracting features from DNA or RNA sequence data. Early models defined features of interest by hand based on prior knowledge about the prediction or clustering problem of interest, such as GC content or sequence melting point. Convolutional models use sequence convolutions to derive features directly from sequence proximity, without requiring features of interest to be identified before the model is trained.