InterTADs: Integration of multi-omics data on topological associated domains

Background: The integration of multi-omics data can greatly facilitate the advancement of research in Life Sciences by providing new insights on how biological systems interact. However, there is currently no widespread procedure for a robust, efficient and meaningful multi-omics data integration; the approach presented here is a first attempt towards increasing the reliability of data discovery power compared to the processing of individual biodata sets. Results: Here, we proposed a high-speed framework, called InterTADs, for integrating multi-omics data from the same physical source (e.g. patient) taking into account the chromatin configuration of the genome, i.e. the topologically associating domains (TADs). The main concept of the proposed methodology is to create a single matrix with all different events (e.g. DNA methylation, expression, mutation) combined with their genome coordinates and the respective quantitative metrics after application of the appropriate scaling. The events are divided into their related TADs according to the chromosomal location and each TAD is evaluated for statistically significant differences between the groups of interest (e.g. normal cells vs cancer cells). Finally, several visualization approaches are available, including the mapping of the events on the chromosomal location of the TAD as well as the distribution of the counts within a given TAD across the different study groups. Conclusions: InterTADs provides a general framework for integrating multi omics data and relating them with the TADs. This could lead to the extraction of new biological insight of the examined case study. InterTADs is an open-source tool implemented in R and licensed under the MIT License.


Background
The study of the molecular mechanisms that may lead to cancer was revolutionized by the advent of Next Generation Sequencing (NGS) (1,2). NGS include studies of whole genomes (whole-genome sequencing), smaller regions of the genome (exome sequencing), of the transcriptome (RNA-seq), the methylome (Bisulfite-seq) and protein-DNA binding sites (ChIPseq) (3). Using NGS to sequence the entire human genome can produce more than 100 GB of raw data (4), thus leading to a whole cadre of challenges towards their analysis. The raw NGSdata are consequently analyzed by established and widely accepted bioinformatics tools (e.g. bwa, TrimGalore, HISAT2, MACS2, R) (5). The process of analyzing omics data usually leads to a high dimensional matrix, with the different cases listed as columns and the locations on the genome in which the examined event happened (e.g. mutation, gene expression etc.) as rows.
The integration of several types of data that originate from the same physical source (e.g. patient) but focus on different mechanisms (gene expression, DNA Methylation, histone modifications etc.) is remaining a promising field since there is no widely accepted approach for it. The most common processes to compare different omics data are by (i) comparing the gene lists produced at the end of each individual analysis, with the assumption that overlapping genes were influenced by different mechanisms (6,7) and (ii) checking the correlation of two events that are associated with the same gene, using statistical methods such as spearman or pearson correlation test (8,9). However, as interactions in biological systems are generally not linear, methods such as PCA, Bayesian or non-Bayesian networkbased were applied as extended data integration approaches (10). Although these methods are promising, they show instability and tend to over-fit to a given dataset. Moreover, there are several existing tools that integrate different kinds of omics data but they perform the analysis at the gene level (e.g. CNAmet, iGC, PLRS, Oncodrive-CIS) or they focus on sample classification based on the driving clinical perspective (e.g. iClusterPlus and mixOmics) (10,11).
Moreover, many existing tools consider pathway databases for further evaluation of the biological meaning (12,13) Going to a level of organization further than the simple chromosomal position, the introduction of NGS methods, like Hi-C, provides insight into chromatin organization such as the topologically associated domains (TADs). TADs represent segments of chromatin domains characterized by frequent interactions within themselves, and are conserved in mammals (14,15). Since the human genome is organized in three dimensions and the chromosomes fold locally driving gene regulation, our tool is integrating the multi-omics data in relation to TADs.
Moreover, recent studies have been shown that integrating multi-omics data that included TAD information revealed novel insights into the mechanism of the regulation of genes causing tumor development (16,17).
We developed an R-based framework, called InterTADs, that integrates the tabular output of multiple experiments of different NGS types (e.g. tables with expression values, mutation and DNA methylation values) into a single file. The tool then combines the joined representation of the multiple experiments, with the 3D organization of the genome, the TADs. Statistical analysis is performed according to predefined groups of interest (e.g. normal cells vs cancer cells) and the events related to multi-omics data (CpG site -CpGs, transcript, mutation, histone marker, etc.) which are divided into the associated TADs based on the overlap of the chromosomal locations. Finally, visualization options are available for the statistically significant results.
Our approach was tested on different omics data sets from 9 patients with Chronic Lymphocytic Leukemia (CLL) (18). CLL is the most common adult leukemia in Western countries (19) and is characterized by clinical and biological heterogeneity at both the genetic and epigenetic levels (20,21). Due to its great heterogeneity, CLL provides a paradigmatic case to decode complex associations of the events within the same TADs.

Implementation
The proposed method was evaluated on data from CLL stereotyped subsets #6 and #8. These are two distinct subgroups of CLL, with subset #8 showing more aggressive disease than subset #6. In our previous study (18), we explored the DNA methylation values (450K) and the expression values as independent datasets. Here, the RNA-seq raw data were analyzed for variants using the GATK pipeline for Haplotype detection (22).

Multi-omics data from patients with CLL
The InterTADs was applied on previous published omics data from two distinct and aggressive subgroups of CLL as described in implementation section. The tool includes three main phases: (i) data integration of multi omics data, (ii) TADiff: association with the 3D organization of the genome through the TADs, and (iii) visualization of the statistically significant TADs ( Figure   1A). The integrated table contains 2,208,909 events i.e. CpG sites, transcripts or variants. The First, we calculated the FDR between the two groups on the DNA methylation matrix, on gene expression matrix, on variants matrix and based on our approach regarding the TADs. To demonstrate the ability of the tool in revealing the most significant result compared to the individual dataset, we checked for the distribution of the FDR in each file and the FDR on our approach ( Figure 1B). Statistically significant results were found only in our approach, thus clearly showing the value of the tools as a data discovery approach.
Regarding the differential analyses, we applied a threshold of 0.05 on FDR. Additionally, the cut off of the differences between the groups regarding the TADs was automatically selected on 16.24, which represent the 3rd quartile and considered as the most significant. We found 459 statistically significant TADs between the two categories, #6 and #8. The most significant TAD based on FDR was TAD2130 (diff = 19.8, FDR < 0.0001), which included 4,299 events (CpGs, transcripts and variants), and based on the diff was TAD854 (diff = 27.7, FDR < 0.0001), which included only 43 events (CpGs and variants).
We continued on the visualization phase focusing on TAD2130, which showed the lower FDR and includes high number of associated events (n = 4,299). The first option of the visualization phase is to provide a dotplot with connecting lines based on the mean of each event is generated as a second option combining with a violin plot of the distribution of the mean values (Figure 2A). Also, a dot plots with the cases values of the associated events in the TAD of interest for the two groups separately ( Figure 2B) is provided. The black line highlights the median in each group. The third option generates scatter plots regarding the chromosomal location of each TAD on x axis and the value of each associated event on y axis. The most significant TAD based on FDR, the TAD2130, was plotted for stereotyped subset #6 and stereotyped subset #8 separately (Figure 2C, D).

Benchmarking
The tool was evaluated on its computational time by generating artificial inputs from the original multi-omics data, and comparing it to the quick sort algorithm. For this purpose, a set of experiments were used as benchmarking. During these experiments, fourteen artificial datasets were created, by randomly sampling different number of rows from the original dataset. The datasets consisted of 10,000, 100,000, 150,000, 200,000, 300,000, 500,000, 1,000,000, 1,500,000, 2,000,000, 2,500,000, 3,000,000, 3,500,000, 4,000,000 and 4,500,000 rows (e.g. events; CpG, expression, mutation etc.). All experiments were executed on an SSD drive computer with 16 GB RAM at 1.80 GHz and a 64-bit operating system. Figure 3A shows the compute times for the Data integration phase of the algorithm, and Figure 3B shows the compute times for the TADiff phase. For smaller dataset's sizes the Data Integration needs more time to produce the required results, while on larger datasets the TADiff part takes longer to terminate. Figure 3C  Despite the increasing amount of data though, there is no single approach to efficiently integrate multi-omics data that originate from the same source (e.g. patient). Here we propose a novel tool, namely InterTADs, which provides a complete framework to analyze multi-omics data, either available in-house or through public repositories. InterTADs implementation supports very fast execution in order to (i) generate a single file from multi omics inputs, (ii) find significant differences on the TADs between predefined groups of interest, and (iii) visualize the TADs of interest. Our approach clearly supports efficient data discovery in multi-omics data by increasing the statistically significant differences across higher level organizational units (i.e. TADs), as compared to the individual data sets.
In relationship to other existing tools(10), InterTADs is rather on a genome-wide approach by separating the genome into TADs. Applying this approach, we omit the gene level analysis and the coincidence windows analysis, which generates sliding windows within the chromosome by taking into account the chromatin configuration and the high level of interactions within the TADs.
InterTADs is an open-source R package, easily applicable to any type of omics data. The tool is in line with the FAIR principles (Findable, Accessible, Interoperable, Reusable) since it is freely available on GitHub and it is also accompanied by detailed documentation and examples, highlighting the reproducibility of the source code.
Altogether, there are currently several publicly available resources for multi-omics data but also several community-supported bioinformatics tools and databases that can manipulate this kind of data. However, the downstream analysis and especially the integration of different technologies of NGS data is a very promising area since it is still an ongoing process, with no single widely accepted approach. Our method gives a new perspective towards analyzing multi-omics data, by offering short execution time, and meaningful representation of information structure, and clear visualization options.

Conclusions
InterTADs offers a novel software framework for integrating multi omics datasets considering the chromatin configuration, i.e. the interactions within the TADs. It takes advantage of the new aspect of analysis and directly benefits by adhering to the open science principles, in order to produce high quality and reproducible scientific results based on already published data.

Methods
The tool is implemented as an R script. The input multi-omics files are BED-formatted containing the coordinates of each event (mutation, CpG site, transcript etc), and the corresponding score values. In more detail, the input files have to include in the 1st column a unique identifier (e.g. cg00000029, XLOC_032721, mut_1, etc.) for each event, in the next 2nd to 4th columns the BED format information (i.e. chromosome, start, end), and in the rest of the columns the values for each patient. These files are produced by tools performing the analysis of the raw data such as HISAT2 (29), featureCounts, MACS2 (30), minfi (31), GATK (22), etc. On our study case, the format of the omics data were transformed to a BED-format adding the scores of each patient using the library IlluminaHumanMethylation450kanno.ilmn12.hg19 in R for the CpG events and using the GTF file from StringTie of the HISAT2 pipeline. In order for the algorithm to run properly, all files are placed into two folders, named freq and counts, based on the type of information they are carrying (frequency score values or count values).
Along with these files a meta-data file is created containing information about the mapping between the files' columns. -Reformatting: Next, each file is transformed into a data table based on the given meta-data file. This transformation ensures that same index columns from different tables, point to the same physical source (chromosome information, patient ID etc.).

Workflow
-Scaling: In order for any further analysis to be possible, all tables are transformed so that they correspond to the same scale. A range between 0 − 100 has been chosen for convenience purposes. Hence, numeric data, which contain frequency score values in the above range, are slightly (DNA methylation) or not at all changed (mutation data). On the other hand, a function is applied to count values so that they correspond to the desired range. The transformation process is as follows; supposing that corresponds to expression counts, then a logarithmic scale is applied: TADs. The overlap of the ranges between the events and the TADs is tested using the R package GenomicRanges (32). Next, the tool splits the samples in two subgroups, according to a predefined metadata file that includes a list of sample IDs and the corresponding group e.g.
normal/tumor. Then, the statistical analysis of the two subgroups includes the calculation of the Benjamini-Hochberg False Discovery Rate (FDR) choosing automatically the parametric or nonparametric test according to the cohort of the metadata file (i.e. >30 samples, parametric method). Also, there is an option for paired data e.g. diagnosis/progression. Additionally, the tool calculates the differences on the values between the two subgroups for each TAD, as follows: ) where = / , 1 refers to the data related to the first group and 2 refers to the data related to the second group. Events with no difference between the two subgroups are excluded from the downstream analysis. Finally, the user can select the cutoff of the FDR significance; however, the cutoff of the differences (diff) is automatically set to be equal to the 3 rd Quartile, based on the distribution of the diff value. The third option takes as input the integrated matrix and a desired chromosomal location, and produces plots showing the chromosomal location of the TAD of interest on the x axis and the associated events combined with their values on the y axis. The plots are generated based on each case separately or on each group. Also, a single plot with the differences of the events between the groups is produced. All plotting functions were generated using ggplot2, gghalves and karyoploteR (33).
Considering all the steps, the InterTADs generates 6 different output files:

Consent for publication
Not applicable