Integrative construction of regulatory region networks in 127 human reference epigenomes by matrix factorization

Abstract Despite large experimental and computational efforts aiming to dissect the mechanisms underlying disease risk, mapping cis-regulatory elements to target genes remains a challenge. Here, we introduce a matrix factorization framework to integrate physical and functional interaction data of genomic segments. The framework was used to predict a regulatory network of chromatin interaction edges linking more than 20 000 promoters and 1.8 million enhancers across 127 human reference epigenomes, including edges that are present in any of the input datasets. Our network integrates functional evidence of correlated activity patterns from epigenomic data and physical evidence of chromatin interactions. An important contribution of this work is the representation of heterogeneous data with different qualities as networks. We show that the unbiased integration of independent data sources suggestive of regulatory interactions produces meaningful associations supported by existing functional and physical evidence, correlating with expected independent biological features.


INTRODUCTION
The disruption of cis-regulatory elements is considered the key mechanism through which disease risk is conferred by noncoding mutations (1)(2)(3). However, in order to support this hypothesis and apply it in the development of rational therapeutic strategies, several difficulties have to be surpassed. First, identification of cis-regulatory elements proved a difficult task given the dimension of the noncoding genome (4). This has been overcome using the association of chromatin marks with genome activity in coding and noncoding regions as a widely accepted approximation to map the tissue-specific activity and dynamics of distal and proximal cis-regulatory elements (5)(6)(7). The highly correlated structure displayed by combinatorial patterns of marks across the genome enables computational identification of a reduced number of robust chromatin states (8,9) for display in a single annotation track. Thanks to these experimental and computational advances, reference epigenomes were recently profiled and annotated for a large number of human tissues (10), including the tissuespecific annotation of active cis-regulatory elements (e.g. enhancers).
Having defined systematic strategies for genome-wide mapping of cis-regulatory elements, efforts have more recently shifted toward tackling the more challenging problem of determining what genes are likely to be targeted by given cis-regulatory elements, mostly enhancers. Numerous solutions have been proposed on both the computational and experimental fronts.
On the computational side, several efforts have exploited the correlated structure of epigenomic features to infer associations between enhancers and target promoters. Enhancer-promoter associations have been mapped by quantifying patterns of coactivity of annotated enhancer elements and promoters across and within tissues (8,10). Supervised machine learning approaches with the goal of learning epigenomic patterns discriminative of functional interactions have also been proposed (11)(12)(13). On the experimental side, techniques to measure chromatin conformation enable the mapping of high confidence interactions at different levels of resolution and across several human celltypes and tissues (14,15), These methods can be targeted to elucidate regulatory interactions by enrichment of potential enhancer-promoter contacts in assays like the Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET) or promoter capture Hi-C, a promoter centered chromosome conformation capture technique (14,16). However, both approaches suffer from limitations. First, there is no gold standard interaction set. Second, it is currently not feasible to profile chromatin interactions in a large number of cells and tissues to provide a tissue-specific reference for an organism. In addition, the level of resolution of Hi-C experiments makes it far from trivial to precisely localize the particular enhancer and promoter pairs that might be involved in functional transcriptional regulatory interactions. Given these and associated limitations, and the availability of recently published human reference epigenomes (127 cell/tissue types) (10) and the largest sets of mapped chromatin interactions across human tissues (17 primary blood cell types and 21 cell/tissue types) (14,15), we reasoned that a hybrid and integrative computational approach is timely.
This article presents SWIPE-NMF, a computational method that implements Sliding WIndow PEnalized Nonnegative Three-factor Matrix Factorization on heterogeneous association data represented as networks. This approach was used to integrate the functional and physical evidence of regulatory interactions provided by computational coactivity inference and experimental data, respectively. This method was applied to annotate a weighted set of potential interactions for each of the 127 cell and tissue types within human reference epigenomes (10). Furthermore, SWIPE-NMF was implemented as a flexible tool that can be applied to integrate any set of enhancer annotations with prior evidence sources for regulatory interactions to infer tissue-specific weighted interactions.

Datasets and data processing
High resolution HiC interactions were obtained for a total of 21 human cell lines and primary tissues (15,17); in both cases, the significant interactions reported by the authors were used. Cell/tissue types were matched to the corresponding reference epigenome identifiers (EID) from the Roadmap epigenomics project, or matched to the closest EID according to information from the authors. Only interactions with q-value <1e-3 were considered. Tissuespecific coactivity based enhancer-promoter associations inferred as described in Ernst et al. were obtained from the Roadmap Epigenomics Consortium 2015 (10). eQTL data (V6p) were obtained from the GTEX consortium, considering only associations with a P-value < 1e-5. TAD annotations were obtained from Dixon et al. (18). DHS data were obtained from Thurman et al. (19), considering only associations with a score >0.9. Transcription factor binding motifs were obtained from Marbach et al. (20). ChIA-PET data were obtained from Li et al. (21). The tissuespecific enhancer annotations used as reference were extracted from the Roadmap epigenomics project, using the non-genic chromatin state (7 Enh) annotated with the core ChromHMM 15-state model. CTCF-binding peaks were downloaded from ENCODE website; Broad and Narrow peaks were combined (4).

Tissue-specific algorithm inputs
The inputs into the SWIPE-NMF algorithms consist of 17 types of matrices for each cell or tissue type. Each row or column of a matrix is a genomic segment such as an enhancer or promoter ( Figure 1B). A total of six different types of genomic segments, namely enhancers, promoters, Hi-C anchors, eQTL SNPs, TADs and NaseI hypersensitivity, were included in this study. Entries in the input matrices are binary, with 1 representing presence significant activity association or physical interactions (both called 'interactions' in this study) and 0 standing for absence of such interactions. Details of each type of interactions are given in Supplementary Table S1. Tissue/cell type specific genomic segments and their interactions were used when available. When not available, the union over the total available cell and tissue types was used as global reference of potential association.

Three factor penalized sliding-window matrix factorization
The SWIPE-NMF method includes matrix factorization algorithm proposed byŽitnik and Zupan (22), a sliding window setting to run the algorithm on different segments of the genome and a data processing pipeline to convert different data types into appropriate matrix format as input into the matrix factorization algorithm. Regularized factorization was conducted on each sliding windows of size 5M bp along each chromosome. The window slid at step size of 2.5M bp, making first half of each window overlapped of the previous one and second half overlapped with next window. Many of the input matrices are very big in size with more than 10 000 rows or columns, making it impossible to conduct matrix operations using computers available in most academic settings. The NMF algorithms can only run when we break the giant matrices into smaller ones using sliding window approach. We set the sizes of sliding windows to 5M bp so that they are similar to sizes of TADs, and balance performance and computational speed (Supplementary Table S2). In each 5M bp window, NMF was conducted on all the 17 input matrices to produce new interaction matrices. In the overlapped region between windows, mean values of overlapped matrices were used. Output matrices from different windows are concatenated together as the new interaction matrix of the corresponding chromosome. Interaction matrices of all chromosomes were then concatenated to form the interaction map of the tissues/cell type for later analysis and performance evaluation.
The algorithm can be formulated mathematically as follows. In each tissue/cell type ε 1 , ε 2 . . . ε r are the genomic segment types. In this study, r = 6. If there are n i segments of type ε i and n j segments of type ε j , the input data matrix that relates the two datatypes (ε i , ε j ) is represented as a sparse matrix R i j of dimension n i × n j . Matrices R i j and R ji are generally asymmetric. An example of data source R i j is the activity correlation between enhancer and promoter pairs. An input data matrix that provides interactions among genomic segment of the same type ε i is referred to as a constraint matrix Θ ii with dimension of n i × n j . An example of such constraint is Hi-C data that suggest enhancerenhancer interactions in a specific tissue or cell type. All the interaction matrices R i j are factorized simultaneously constrained by matrices Θ. One advantage of the method is that it produces factors that are specific to each data source (e.g. Hi-C) and factors that are specific to each type of genomic segments (e.g. enhancer). The three-factor penalized matrix factorization method simultaneously decomposes all relation matrices R i j into G i (with dimension n i × k j ), G j (with Nucleic Acids Research, 2019, Vol. 47, No. 14 7237 dimension n j × k j ) and S i j (with dimension k i × k j ) such that R i j ≈ G i S i j G T i j . k i and k j are chosen so that k i n i and k j n j . One intuitive way to understand this is thanking k i and k j are clusters or 'topics'. G i describes membership of each genomic segment of type ε i to each cluster in k i . G j describes membership of each genomic segment of type ε i to each cluster in k j . Therefore, the factorization algorithm is, to some extent, mathematically equivalent to a clustering algorithm. Unobserved entries in R i j are estimated from the factorization. The initial input into the factorization algorithm can be conceptually represented as a block matrix R : ⎡ The i th row and j th column of the block matrix R is the interactions between genomic segments i and, i.e. R i j . This method does not require all relation matrices be available. Therefore, any R i j can be missing. The p th segment of type ε i and q th segment of type ε j is R i j ( p, q). In this project, constraint matrices are organized as a diagonal matrix: If constraint matrix for a type of genomic segment is missing. The i th block of Θ is zero. The entries in Θ ii are positive if the two segments are not interacting and negative if interacting. The block matrix R is factorized into block matrix factors G and S.
Factor S i j defines latent association between genomic segment type ε i and ε j . G i is specific to ε i . The original block matrix R is reconstructed as: Be reminded that the objective function is: The procedures of SWIPE-NMF method is as follows: The ranks k i is chosen using a hyperparameter k such that k 1 , k 2 , . . . , k r = kn 1 , kn 2 , . . . , kn r . Different values for the hyperparameter k were used. The rank of each matrix is integer N*k, where N is the number of columns in the data type. Ten different values of k (0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5) were tried on each window. As different initializations of G matrices gave different factorization and there is no guarantee of global minimum, we used an ensemble learning strategy of running 20 rounds of the algorithm with slightly different initialization, using random Acol initialization and averaged the outputs (22). The rank for each window was determined by selecting k where a maximum kink was attained in total reconstruction error curve (22,23). The algorithm was stopped if the difference between two iterations was smaller than 0.01 or the maximum number of interaction (200) was reached.

Enhancer-promoter interaction set
In order to provide a set of high-confidence scored interactions for other researchers to use, in addition to the direct output from the inference, we determined a cut-off value for the enhancer-promoter association score produced by SWIPE-NMF, filtering out interactions with lower scores. The cutoff was set so that the average number of promoters per enhancer was consistent with previous estimates (∼3) (11). Given that a similar criterion for enhancer-enhancer and promoter-promoter interactions is not available, the filtering step was not performed for these.

Enhancer-enhancer and promoter-promoter interactions
The factor G 2 ( Figure 1C) produced by three-factor penalized factorization provides information about the learned structures of promoter networks. A weighted enhancerenhancer interaction matrix was calculated as G 2 G T 2 for each sliding window. Similarly, an enhancer-enhancer interaction matrix was calculated as G 1 G 1 T .

Baseline data integration method
A total of four baseline methods were compared with SWIPE-NMF in enhancer-promoter network prediction. The first baseline is concatenation where means of normalized scores of all input data type were used. The second method is to link each enhancer to its nearest promoter, which is called 'nearest promoter assignment'. The third baseline used is interaction prediction using only enhancerpromoter activity correlation. The fourth baseline is a widely used unsupervised network integration method in biological community called similarity network fusion (24). In similarity network fusion, nodes of all input networks were standardized to enhancers or promoters of the corresponding cell or tissue type based on overlaps of genomic segments. No local neighborhood restriction was used to avoid subjective bias. Following the notations of original authors, P i is a square matrix with its dimension equal to sum of total number of enhancers and promoters in a specific cell or tissue type i. An entry in the matrix is normalized as P where m is the number of input matrix types. After fusion, the final enhancer-promoter network is computed as P

Performance evaluation
For 5-fold cross validation experiments, 20% of the associations were randomly chosen and excluded from inputs. In addition, an equal number of non-interacting pairs were randomly selected to balance the data. When a whole dataset was left out to evaluate the performance of SWIPE-NMF, that dataset was used a ground truth for testing. Only interactions occurring at a distance larger than 5 kb were considered for the analyses of biological correlates.

A matrix dimensionality reduction framework integrating evidence of enhancer-promoter interactions
In order to computationally integrate the large set of qualitatively different data suggestive of potential enhancerpromoter interactions in a principled way, we first curated a database including five experimental data sources (Figure 1). We considered previously published (i) enhancerpromoter coactivity associations (8,10), (ii) physical chromatin interaction calls from Hi-C data in 21 tissues (15,17), (iii) cis-regulatory-gene associations defined by eQTL in 53 human tissues (25), (iv) cis-regulatory-promoter associations defined by activity correlation between DNase-I hypersensitivity sites (DHS) and promoters (19), and (5) topologically associated domain (TAD) annotations defined from Hi-C data (18). A description of each data type and the nature of evidence provided is included in the Supplementary Data (Supplementary Table S1). As an example, Figure 1A shows the density of data in a randomly selected region in K562 cells.
For consistency, we defined a reference set of potential enhancer elements to which all data were mapped. The reference selected was the non-genic enhancer ChromHMM chromatin state (7-Enh) annotated for all 127 reference human epigenomes in the roadmap epigenomics project (10). The heterogeneous nature of the data, i.e. association data at different length scales, in addition to annotations of discrete genomic regions (e.g. TAD domains), made the integration task challenging. We approached the problem by first devising an individual network representation for each data source representable in matrix form and compatible for mapping across sources ( Figure 1B and C). We then applied an extended NMF algorithm to fuse the independent network data.
Specifically, we considered six types of genomic segments: enhancer, promoter, Hi-C anchor, cis-eQTL (i.e. the SNP position having the association), DHS and TAD. Each network is composed of the total genomic segments in the data. We can define two qualitatively different types of associations ( Figure 1C): interaction and incidence associations. Interaction matrices (blue) code associations between genomic segments of different types that are supported by either physical or functional experimental data. Incidence matrices represent the incidence of one element within the other (orange), i.e. one genomic segment overlapping with the other. Finally, we define two diagonal incidence matrices (red), which operationalize the prior knowledge that regulatory interactions are expected to be supported by Hi-C physical interactions and preferentially occurred within TAD domains. We defined a consistent set of matrices for each cell/tissue type (for details, see 'Materials and Methods' section). Thus, we solved the problem of heterogeneous data representation by operationally defining networks of experimentally supported association as binary matrices R i j that code associations between genomic segments of types i and j. Importantly, integrating the data into this format enables the application of well-established matrix factorization algorithms. The matrices R and are the inputs of the matrix factorization algorithm.

Three-factor penalized matrix factorization (PMF)
The traditional three-factor penalized matrix factorization (PMF) approach has been recently used for gene functions and pharmacologic action predictions with an additional constraint imposing genomic locality of regulatory interactions (22,26,27). In this study, we extended original approach the sliding windows of size 5M bp along each chromosome to make the algorithm computational feasible given the large sizes of the input data (see more details in 'Materials and Methods' section). This method is designed to fuse the heterogeneous network datasets and infer a scored set of enhancer-promoter, enhancer-enhancer and promoter-promoter interactions ( Figure 1C).
Our method seeks to decompose the observed interaction matrix into a lower-dimensional representation that reveals biologically meaningful components. All the association matrices R i j are simultaneously factorized: each individual matrix is decomposed into G i , G j and S i j so that R i j ≈ G i S i j G T i j . In other words, an entry R i j ( p, q) is approximated by the inner product of the p-th row of matrix From a biological perspective, a matrix R i j defines the association between two different genomic segment types i and j, such as enhancers and promoters. A matrix G is specific to a type of genomic segments and records associations among genomic segments of that type.
• Each row of G i is a genomic segment of type i (e.g. an enhancer). • The columns of G i can be understood as clusters dividing genomic segments of type i based on shared patterns of regulatory or physical interactions. • The matrix G i specifies the probability of each genomic segment of type i belonging to each cluster. • The matrix S i j can be interpreted as defining association among clusters of genomic segment type i and type j .
Using a sliding window of size 5Mb, the matrix factorization algorithm estimated interactions among genomic segments within the windowed genomic region, G i S i j G j . Interactions estimated from each segment were then conjugated together to form the interaction map of the whole chromosome and genome, which are used for biological analysis and performance evaluation later.
In addition to interactions between different types of genomic segment, G i S i j G j , association among genomic segments of the same type, such as enhancer-enhancer interactions, can be estimated from G i matrices by G i G j T . In this way, R i j is dissected into, and can be reconstructed from, three matrices, G i , G j and S i j in a systematic, tractable and interpretable way.
The algorithm iteratively updates G and S by fixing one of them in an alternate way. We applied this method to all tivity correlations (EP) are shown in blue. Hi-C links are in red. Links between SNPs and promoters detected by eQTL are in green. Correlation between DNaseI hypersensitivity sites and promoters across multiple cell types are in sky blue. Topologically associated domains are not shown to avoid confusion with links among genomic elements. Locations of genes, reference enhancers and histone marks were also included. (B) All data types were organized into matrix/networks. Each row or column represents a type of genomic segments such as enhancers, promoters or Hi-C anchors. (C) SWIPE-NMF was used to integrate all five data types to produce cell/tissue type specific enhancer-promoter, enhancer-enhancer and promoter-promoter links for each of the 127 cell/tissue types. Each matrix R i j was decomposed into three matrices G i , S i j and G T j such that R i j ≈ G i S i j G T j . R i j is the relation between data type i and j. R 12 is enhancer-promoter interaction. G i is an n × m matrix where n is the number of elements in that data type (e.g. number of enhancers) and m is the number of ranks. S i j is a matrix representing the relation between columns in G i and G j . Joint factorization of matrices allows integration of information from all data types while minimizing information loss. This factorization was conducted on 5 Mb overlapped windows on each chromosome in each cell and tissue type. the integrative tissue-specific sets of matrices, including Hi-C, enhancer-promoter activity correlation, DHS-promoter correlation, eQTL and TAD domains; obtaining a tissuespecific weighted set of interaction matrices for enhancerpromoter (reconstructed R 12 ), enhancer-enhancer (G 1 G 1 T ) and promoter-promoter (G 2 G 2 T ) interactions.

Evaluation of the data integration strategy
There is currently no large gold-standard compendium of known regulatory region interactions, and lines of evidence for physical, functional and genetic interactions each capture different aspects of the underlying regulatory network. However, these complementary biological datasets enabled us to validate our predictions on a genome-wide basis using a diversity of methods and evidence. (i) For enhancerpromoter c-activity associations, we used 5-fold cross validation; (ii) for each independent empirical dataset suggestive of regulatory associations, we excluded one whole dataset from inputs; (iii) with orthogonal experimental data, we used separate ChiA-PET data and (iv) finally, we examined biological correlates and cell/tissue-type specificity of the scored sets of interactions inferred by SWIPE-NMF.
In 5-fold cross validation, SWIPE-NMF was used to reconstruct the functional coactivity data of enhancerpromoter associations (8), with 20% of the associations excluded from inputs. SWIPE-NMF showed good performance on this task (AUC = 0.82) (Figure 2A). Next, in four evaluation experiments, each evidence source (HiC, eQTL, TAD and DHS) was left out and used to test the model's consistency with the interactions inferred by integrating the rest of the datasets. When each eQTL and DHS was individually excluded from inputs and used as ground truth, the inference also performed well (AUROC > 0.7, Figure 2B), indicating that the interactions predicted by integration are supported by eQTL and DHS correlation (19). In addition, when TAD incidence annotation is excluded from inputs, the corresponding inferred interactions occurring within TAD domains have much higher scores compared with those involving cross-domain interactions (P-values <10 −10 ) ( Figure 2C). Finally, most of the interactions with high confidence scores are within TAD domains (15) (Figure 2D). When testing using orthogonal ChiA-PET data (from the K562 cell line) (21), the performance of SWIPE-NMF (AUROC ≈ 0.75) is better compared with either enhancer-promoter activity correlation (AUROC ≈0.6), averaging links across data types (AU-ROC ≈ 0.6), nearest promoter assignment (the brown point in Figure 2E) and similarity network fusion (AUCORC ≈ 0.71). Performance of SWIPE-NMF on two randomly picked 5M bp window in K516 cell using ChiA-PET as gold standard are shown in Supplementary Figure S1. The AU-ROC is different from previous publications because enhancers are defined in different way, we only focus on enhancers in non-genic regions in this project and input data were handled in a more conservative way (see 'Materials and Methods' section for details). Interestingly, we found that a considerable portion (50-80% depending on cell and tissue type) of enhancer-promoter interactions inferred by SWIPE-NMF with low scores does not occur in any of the input datasets ( Figure 2F). When comparing with ChiA-PET links, we found that these interactions uniquely predicted by SWIPE-NMF show better performance than random expectation (AUROC ≈0.6). This suggests that factorization is able to transfer information by learning association patterns in observed enhancer-promoter interactions. The results are consistent with orthogonal data of chromatin interactions mediated by RNA polymerase (21). Presumably the degree of overlap will increase when matching tissue-specific ChiA-PET data are considered, once available. Overall, the evaluation experiments demonstrate that, through integration by SWIPE-NMF, different sources of evidence provide complementary information with predictive power.

Predicted enhancer-promoter interactions are biologically meaningful
Several studies have shown a strong correlation between chromatin interactions and gene co-expression, due to the spatial colocalization of transcribed genes and their regulatory elements (28,29). We tested whether the inferred associations present a similar behavior. Using a large set of tissue-specific gene coexpression networks (30), we found that coexpressed gene pairs tend to share common interacting enhancers (P-value < 1e-30, hypergeometric test), agreeing with the expected behavior ( Figure 3A). We also found enrichment of transcription factor binding sites (TF-BSs) within enhancers and promoters, and moreover, we show that inferred interacting enhancer-promoter pairs (G 2 S 21 G 1 ) sharing TF-binding motifs are more likely to interact ( Figure 3B) than those without co-occurring motifs. These results are consistent with previous reports suggesting that transcription factors might facilitate enhancerpromoter interactions (30,31). Previous studies have also shown that CTCF, an insulator binding protein that is thought to be involved in the regulation of chromatin structure and DNA looping (32), is enriched near interacting promoters and enhancers (11,13). We found enhancers interacting with promoters and promoters interacting with enhancers are both highly enriched in CTCF Chip-seq peaks with significant P-values ( Figure 3C and D), within the inferred associations.

Tissue-specific promoters show more enhancer interactions
Enhancers are known to regulate tissue-specificity predominantly by modulating the expression of different target genes across tissues (33). We tested whether genes being targeted by enhancers show distinctive properties of gene expression relative to other genes using the inferred interactions. We found that genes interacting with enhancers show higher expression (RPKM) than those without enhancer regulation ( Figure 3E). In addition, the level of gene expression in each cell type shows a positive correlation with the number of incoming enhancer interactions to the promoters, consistent with an additive effect of the regulatory input ( Figure 3F). Furthermore, given the role of enhancers in establishing tissue-specificity, we hypothesized that genes with tissue-specific functionality are more prone to targeting by enhancers. To test this hypothesis, we used an entropy    based measure of gene expression specificity for each gene across the reference human transcriptome of the Roadmap epigenomics project (10,11), and found that gene expression specificity does correlate with the number of incoming enhancer interactions of a promoter ( Figure 3G). This result is consistent with the expectation that enhancer-promoter interactions contribute to the cell/tissue type specific expression of genes and agrees with findings in previous publications (11).
In order to further test whether genes targeted by more enhancers tend to be associated with tissue-specific (related) functions, we performed gene ontology (GO) term enrichment tests using the genes with the top 5% of incoming enhancer interactions as gene query set. In Table 1, we show examples of the enriched terms found for randomly chosen cell and tissue types. By looking at the top 3 GO terms in biological processes for each cell and tissue types, we found that highly targeted genes were generally enriched in functions related to the underlying biology of the tissue. This further supports our hypothesis that the inferred enhancerpromoter interactions contribute to the regulation of tissuespecificity.

SWIPE-NMF enhancer-promoter, enhancer-enhancer and promoter-promoter interactions
In addition to enhancer-promoter interactions, chromatin interactions involving only promoters or only enhancers have been shown to spatially organize the transcriptional machinery (21,34). Although the mapping and characterization of enhancer-promoter interactions have received much more attention, chromatin interactions occurring at similar resolution but involving only promoters or enhancers might be relevant under normal and abnormally disrupted conditions. One advantage of using SWIPE-NMF for data integration is that all three types of chromatin interactions are simultaneously learned during the matrix factorization process (see details in 'Materials and Methods' section). When using ChiA-PET as gold standard, both enhancer-enhancer and promoter-promoter networks show considerable AUROC scores (>0.6) (Figure 4A and B). Similar to enhancer-promoter interactions, promoters interacting with each other tend to preferentially show co-expression ( Figure 4C), consistent with the existence of chromatin-mediated transcription factories within the cell (35). Interacting enhancer-enhancer pairs and promoter-promoter pairs sharing more TF motifs have higher chances of interaction ( Figure 4D and E), and interacting enhancers and promoters are enriched in CTCF ChIP-Seq peaks ( Figure 4F and G). Overall, the observed consistency of biological correlates across the different types of inferred interactions indicates that, by integrating physical and coactivity evidence of association, SWIPE-NMF is able to infer general chromatin interactions, with enhancer-promoter associations as an important subset.

DISCUSSION: CONSTRUCT ENHANCER PROMOTER NETWORKS USING AN INTERMEDIATE INTEGRA-TION STRATEGY
After mapping cis-regulatory elements to their target genes, testable mechanistic hypotheses can be proposed for detrimental effects conferred by non-coding pathogenic mutations. With the goal of accelerating such mapping genomewide, and to provide a starting reference set of potential chromatin mediated regulatory interactions across reference human tissues, here we introduced and applied SWIPE-NMF.
Several features distinguish the proposed computational framework from other tools concerned with particular instances of the same problem. When dealing with data integration, most existing methods either transform all data sources into a single feature-based table and apply to it well-established feature-based machine learning algorithms ('early integration') or build an independent model for each data source ('late integration'). SWIPE-NMF, instead, is based on a more recent, 'intermediate integration' strategy that explicitly addresses the multiplicity of data types by fusing them through inference of a single joint model (36). Importantly, such an intermediate level of integration retains the structure of the data sources, incorporating them within the structure of the learned model. SWIPE-NMF was specifically designed to exploit the information provided by both computational coactivity-based inferences and experimentally grounded physical evidence of chromatin interactions, overcoming their individual limitations. An important contribution of this method is the representation of heterogeneous data with multiple resolutions as net-