Enhort: a platform for deep analysis of genomic positions

The rise of high-throughput methods in genomic research greatly expanded our knowledge about the functionality of the genome. At the same time, the amount of available genomic position data increased massively, e.g., through genome-wide profiling of protein binding, virus integration or DNA methylation. However, there is no specialized software to investigate integration site profiles of virus integration or transcription factor binding sites by correlating the sites with the diversity of available genomic annotations. Here we present Enhort, a user-friendly software tool for relating large sets of genomic positions to a variety of annotations. It functions as a statistics-based genome browser, not focused on a single locus but analyzing many genomic positions simultaneously. Enhort provides comprehensive yet easy-to-use methods for statistical analysis, visualization, and the adjustment of background models according to experimental conditions and scientific questions. Enhort is publicly available online at enhort.mni.thm.de and published under the GNU General Public License.


Some viruses like HIV (Craigie and Bushman, 2012) and AAV (Deyle and Russell, 2009) are able to copy their genomic sequence into the genome of an infected cell. This can have severe impact on host cell stability as the integration may hit and disable a gene or a regulatory region. The investigation of characteristics and underlying driving factors for virus integration is not only relevant for virology and infectious diseases research but also for approaches in gene therapy that apply virus-derived vectors and transposons to deliver functional DNA fragments into host cells (Riviere et al., 2012; Li et al., 2015). Each gene delivery system has its own mechanisms for genomic integration and preferences for choosing integration sites, hence different systems may have different risks for causing undesired side effects. Next Generation Sequencing (NGS) facilitates the genome-wide profiling of integration sites, as they are collected e.g. in investigations of protein binding, virus/transposon integration or DNA methylation.

Integration sites are available from databases like the Retrovirus Integration Database (Shao et al., 2016) and are regularly created for novel targeted vectors. Typically, the identified sites are related to a variety of genomic features, and any integration preferences are determined by comparing the actual integration sites to a set of random control sites (Gogol-Döring et al., 2016). A proper background model should mimic all known biases of the signal data originating from experimental or laboratory conditions. If, for example, a profiling method is only capable of detecting integration events that are close to certain enzyme restriction sites, then the control sites should also be selected accordingly.

Several tools have been published that are capable of processing genomic positions and annotations, like the Genomic HyperBrowser (Sandve et al., 2013). Genome browsers like the UCSC Genome Browser (Kent et al., 2002), IGV (Robinson et al., 2011) or Artemis (Carver et al., 2011) are designed for inspecting single genomic locations. Also, custom-written scripts are commonly used for the analysis of genomic positions (Cook et al., 2014), or libraries like PyBedTools (Janovitz et al., 2014; Dale et al., 2011)

Figure 1. Overview of preparatory work and data gathering for analysis in Enhort. Reads containing viral integration sites are identified and sequenced in the WebLab and mapped to a reference genome. Identified insertion sites are converted to a BED file for use in Enhort. Together with genomic annotations from public databases, Enhort then analyzes the given integration sites.

newly developed. Additionally, comparability across laboratories is hampered by varying functionality and different implementations of background models. There is not yet a specialized tool for the analysis of genomic positions that combines instant analysis with user-defined, adaptable background models that mimic known biases.
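The comparison of integration sites with random control sites described above can be illustrated with a minimal Python sketch. It is not Enhort's implementation: the function names, the single-chromosome coordinate model and the toy data are assumptions made for illustration only. The sketch counts how many positions fall inside an annotation track and reports the fold change between sites and controls.

```python
import bisect
import random


def count_inside(positions, intervals):
    """Count positions inside any half-open [start, end) interval.

    Assumes `intervals` is sorted by start and non-overlapping.
    """
    starts = [start for start, _ in intervals]
    hits = 0
    for pos in positions:
        # Index of the last interval starting at or before pos.
        i = bisect.bisect_right(starts, pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            hits += 1
    return hits


def fold_change(sites, controls, intervals):
    """Fraction of sites inside the track divided by that of the controls."""
    site_frac = count_inside(sites, intervals) / len(sites)
    ctrl_frac = count_inside(controls, intervals) / len(controls)
    return site_frac / ctrl_frac if ctrl_frac > 0 else float("inf")


if __name__ == "__main__":
    # Hypothetical toy genome with one annotation track.
    rng = random.Random(0)
    genome_length = 1_000_000
    annotation = [(100_000, 200_000), (500_000, 550_000)]
    controls = [rng.randrange(genome_length) for _ in range(10_000)]
    # Toy "integration sites": 60 % forced into the first interval.
    sites = [rng.randrange(100_000, 200_000) for _ in range(600)]
    sites += [rng.randrange(genome_length) for _ in range(400)]
    print("fold change:", fold_change(sites, controls, annotation))
```

A fold change above 1 indicates enrichment of the sites in the annotation relative to the controls, and a value below 1 indicates depletion. A significance test would be needed on top of this; it is omitted here because this section does not specify which test Enhort applies.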

In this paper we present Enhort, a user-friendly web platform for deep analysis of large sets of genomic positions. Our aim is to accelerate and simplify the data analysis process as well as to standardize it in order to increase reproducibility. Enhort is capable of adjusting the background sites used for comparison by user-selected covariates. These include annotation tracks like restriction sites or chromatin accessibility, gene expression tracks and sequence motifs. With covariates it is possible to adjust the selection of background sites so that they match the investigated sites for a specific track. The adaptation rules out the effects of this annotation for the background. This feature can be used to adjust for experimental bias as well as for specific scientific questions. Figure 1 shows the schematic process of data gathering and the usage of Enhort in the workflow of analyzing genomic positions.
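The idea of matching background sites to the investigated sites for one track can be sketched as follows. This is a simplified illustration under stated assumptions, not Enhort's algorithm: it treats the covariate as a set of half-open intervals on a single chromosome, assumes the track does not cover the whole genome, and all names (`in_track`, `sample_matched_background`) are hypothetical.

```python
import random


def in_track(pos, intervals):
    """True if pos lies in any half-open [start, end) interval."""
    return any(start <= pos < end for start, end in intervals)


def sample_matched_background(sites, covariate, genome_length, n, rng):
    """Draw n control positions with the same inside/outside split as `sites`.

    Assumes `covariate` intervals are non-overlapping and leave part of the
    genome uncovered (otherwise the rejection loop below would not end).
    """
    inside_frac = sum(in_track(p, covariate) for p in sites) / len(sites)
    n_inside = round(n * inside_frac)

    controls = []
    # Inside the track: pick an interval proportionally to its length,
    # then a uniform position within it.
    weights = [end - start for start, end in covariate]
    for _ in range(n_inside):
        start, end = rng.choices(covariate, weights=weights)[0]
        controls.append(rng.randrange(start, end))
    # Outside the track: rejection sampling over the whole genome.
    while len(controls) < n:
        pos = rng.randrange(genome_length)
        if not in_track(pos, covariate):
            controls.append(pos)
    return controls


if __name__ == "__main__":
    rng = random.Random(42)
    track = [(100, 200), (600, 650)]   # hypothetical covariate track
    sites = [110, 120, 130, 500]       # 75 % of the sites are inside
    controls = sample_matched_background(sites, track, 1000, 200, rng)
    inside = sum(in_track(c, track) for c in controls)
    print(f"{inside} of {len(controls)} controls lie inside the track")
```

Because the controls now share the sites' frequency for the covariate track, any remaining difference in another annotation cannot be attributed to that covariate alone.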

Integration sites of viruses are gathered by sequencing infected cells and preprocessing as shown in

appropriate figures. Example results for a virus can be seen in Figure 3A. The software has been designed in a way that analysis results are almost immediately available after upload.

In many cases a background model consisting of random sites is not sufficient for an adequate analysis. Some protocols, for example, can only detect integration events that occurred in close proximity to a restriction site of a specific enzyme, like EcoRI, which cuts inside of GAATTC hexamers (Pingoud and Jeltsch, 2001). Background models should be adapted to mimic the actual integration pattern with regard to any known technical bias. In this case, the control sites should also be selected to be near restriction sites. This can be achieved in Enhort by setting the appropriate genome annotation as a covariate. When selecting the track that contains all possible genomic positions of GAATTC hexamers as covariate, Enhort will generate a set of control sites having exactly the same distribution of distances to the enzyme restriction sites as the actual virus integration sites.
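One way to obtain control sites with the same distribution of distances to GAATTC hexamers is sketched below. This is an assumption-laden illustration, not a description of how Enhort does it: distances are measured to the start of the nearest motif occurrence on a single sequence, the sequence is assumed to contain at least one occurrence, and the helper names are hypothetical. For every integration site, the sketch places one control at a randomly chosen position that has the same nearest-motif distance.

```python
import bisect
import random


def find_motif(sequence, motif="GAATTC"):
    """Start positions of all (possibly overlapping) motif occurrences."""
    hits = []
    i = sequence.find(motif)
    while i != -1:
        hits.append(i)
        i = sequence.find(motif, i + 1)
    return hits


def nearest_distance(pos, sorted_hits):
    """Distance from pos to the closest motif start (hits must be non-empty)."""
    i = bisect.bisect_left(sorted_hits, pos)
    neighbours = sorted_hits[max(i - 1, 0):i + 1]
    return min(abs(pos - h) for h in neighbours)


def distance_matched_controls(sites, sorted_hits, genome_length, rng):
    """For each site, draw one control with the same nearest-motif distance."""
    controls = []
    for site in sites:
        d = nearest_distance(site, sorted_hits)
        # All positions whose nearest motif start is exactly d away.
        # The site itself always qualifies, so the set is never empty.
        candidates = {
            pos
            for h in sorted_hits
            for pos in (h - d, h + d)
            if 0 <= pos < genome_length
            and nearest_distance(pos, sorted_hits) == d
        }
        controls.append(rng.choice(sorted(candidates)))
    return controls


if __name__ == "__main__":
    # Hypothetical toy sequence with two EcoRI recognition sites.
    seq = "AA" + "GAATTC" + "A" * 10 + "GAATTC" + "AA"
    hits = find_motif(seq)
    sites = [5, 20, 0]
    controls = distance_matched_controls(sites, hits, len(seq), random.Random(1))
    print([nearest_distance(c, hits) for c in controls])
```

Enumerating all candidates per site is quadratic in the worst case and only practical for small examples; a genome-scale implementation would need an index over the motif track.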

Covariates help to adapt the background model both to technical circumstances, such as restriction sites, and to eliminate a bias or biological preference, such as motifs or genetic features.

Covariates can also be used to identify dependent or separate weak integration preferences that are covered

(Roth et al., 2011). The results are presented in a table containing, for each annotation, the p value, effect size and a visual representation of the integration. The annotations are ranked by effect strength. B: Effect of covariate selection. The upper diagram contains integration frequencies of MLV compared to random sites for a selection of annotations. This virus is known for preferentially integrating near transcription start sites (TSS) and H3K4me3 histone marks (LaFave et al., 2014). The lower diagram shows the same data after selecting H3K4me3 as covariate. The adapted background model is generated in a way that control sites and MLV integration sites have the same frequency relative to H3K4me3. This also changed the control site frequencies for other annotations: MLV integration is no longer enriched but depleted in CpG islands when compared to the adapted background model.

The key feature of the PB integration preference is the TTAA motif in which all integrations occur. To

To further review the analytic capabilities of our software, the integration counts of PB sites are compared to published results from Wilson et al. (2007). The comparison can be seen in Table 2

In this publication we present Enhort, a fast and easy-to-use analysis platform for genomic positions.

Based on a comprehensive library of genomic annotations, Enhort provides a wide range of methods to analyze large sets of sites. In contrast to multi-purpose software such as Bioconductor, Enhort enables scientists to analyze data without programming effort or extensive manual work.

Our literature review shows that Enhort is able to perform most of the analyses commonly used in

Table 3. Comparison between fold changes of Wu et al. (2003) and Enhort over different annotations on the same integration sites. * P < 0.002, † with RefSeq genes as covariate, ‡ with RefSeq genes and TSS (± 5 kb) as covariates.
Enhort is able to reproduce analyses from literature with little effort. It was not possible to reproduce the