intansv: an R package for integrative analysis of structural variations

Identification of structural variations between individuals is very important for the understanding of phenotype variations and diseases. Despite the existence of dozens of programs for prediction of structural variations, none of them is the golden standard in this field and the results of multiple programs were usually integrated to get more reliable predictions. Annotation and visualization of structural variations are important for the understanding of their functions. However, no program provides these functions currently as far as we are concerned. We report an R package, intansv, which can integrate the predictions of multiple programs as well as annotate and visualize structural variations. The source code and the help manual of intansv is freely available at https://github.com/venyao/intansv and http://www.bioconductor.org/packages/devel/bioc/html/intansv.html.


INTRODUCTION
Next-generation sequencing (NGS) has greatly enhanced our ability to detect genomic variations between individuals which is the key to the understanding of phenotype variations and diseases. Although the third-generation sequencing technologies have emerged as new options to detect genomic variations, NGS is still the preferred approach concerning the sequencing cost and accuracy (Spealman, Burrell & Gresham, 2019). Genomic variation is comprised of single nucleotide polymorphisms (SNPs), insertions/deletions (indels) and structural variations (SVs). SNPs have been widely detected and utilized in quantitative trait locus (QTL) mapping and genome wide association study (GWAS) while SVs, which usually cause large effects, are more difficult to be detected (Alkan, Coe & Eichler, 2011;Sadowski et al., 2019).
The algorithms employed by these programs could be categorized into three groups: (1) read-pair methods which analyze the span size and mapping orientation of paired-end reads and their inconsistency from expectation (Chen et al., 2009), (2) split-read algorithms which align the unmappable end of read pairs of which only one end can be mapped to the reference genome using the split-read alignment (Ye et al., 2009) and (3) read depth analysis which utilizes the increase and decrease in sequence coverage (Abyzov et al., 2011). However, the results of different tools have been reported to be discordant with each other and none of the currently existing programs is the golden standard in this field (Lin et al., 2015). As a result, the outputs of different tools were usually integrated to get more reliable predictions (Sudmant et al., 2015).
SVMerge (Wong et al., 2010), HugeSeq (Lam et al., 2012) and iSVP (Mimori et al., 2013) are tools to detect SVs by integrating different SV callers as part of their pipelines. BreakDancer (Chen et al., 2009), Pindel (Ye et al., 2009), SECluster, RDXplorer (Yoon et al., 2009) and RetroSeq (Keane, Wong & Adams, 2012 are integrated in the pipeline of SVMerge. HugeSeq and iSVP only allow the use of supplied default callers. HugeSeq integrates four SV callers including BreakDancer, Pindel, CNVnator (Abyzov et al., 2011) andBreakSeq (Lam et al., 2010) while iSVP incorporates BreakDancer and Pindel as part of its pipeline. The installation and configuration of these tools are challenging for users with no computing experience as different SV callers are implemented with varying programming languages and are glued together by these pipelines utilizing Linux shell, Perl and Python. To use these pipelines, the users must successfully install several dependency software required by these pipelines in advance. In addition, it is important to annotate SVs and visualize the distribution of SVs in the whole genome or a specified genomic region. However, no software provides these functions simultaneously for now.
We report intansv, an R package which is easy to install and use, aiming to integrate the output of multiple popular programs, annotate the effects of SVs to genes and visualize the SVs in the whole genome or specified genomic regions (Fig. 1). The installation of intansv is independent to the installation of SV callers as it only relies on the output of SV callers. The intansv package supports the integration of the results of some or all seven prevalent SV callers.

MATERIAL AND METHODS
The NGS data of Zhenshan 97 were downloaded from NCBI SRA (accession number SRA012177) and were aligned to the rice Nipponbare reference genome (MSU version 7.0) using BWA (version 0.6.1-r104) with default parameters (Kawahara et al., 2013;Li & Durbin, 2009). The alignment result was processed by SAMtools and stored in BAM format (Li et al., 2009). The alignment result in BAM format and the reference genome sequence are essential inputs for SV callers. Some of the SV callers use only the alignment result in BAM format as the input file while some of the SV callers also need the reference genome sequence file as the input file. All the seven SV callers were run with default parameters.
For each SV caller, various information was output to evaluate the quality of each predicted SV, including read mapping quality, number of discordant read pairs supporting  Figure 1 The pipeline of intansv for integrative analysis of structural variations (SV). The raw outputs of SV callers are first processed by intansv to get user-friendly results. During this process, the low-quality predictions were filtered out. The processed outputs of different SV callers are then merged for downstream analysis, including annotation and visualization. Full-size DOI: 10.7717/peerj.8867/ fig-1 the SV, number of split reads supporting the SV, etc. The information was utilized to filter low quality predictions of these programs. For example, SVs predicted by BreakDancer with fewer than 3 discordant read pairs support were filtered by intansv by default. Each SV was assigned a score representing the quality of the SV by BreakDancer and SVs with a score smaller than 60 were filtered by intansv. In addition, SVs shorter than 100 bp or longer than 10 Mb predicted by any SV caller were also filtered by intansv by default. For the results of Pindel, SVs supported by fewer than 3 split reads were filtered. For the results of SoftSearch, SVs supported by fewer than 3 split reads or supported by fewer than 3 discordant read pairs were filtered (Hart et al., 2013). To cluster SVs predicted by different SV callers, we first combined the same type of SVs of different SV callers and detected the coordinate overlapping between SVs of the same type. The findOverlaps function of the R package GenomicRanges was used to detect the coordinate overlapping between different SVs, which could be represented as genomic ranges (Lawrence et al., 2013). Then the coordinate overlapping percentage between pairwise overlapped SVs was calculated. We further define the distance between pair-wise SVs as 0 if they have a reciprocal coordinate overlap of more than 80%. The distance between non-overlapped pair-wise SVs or pair-wise SVs with a reciprocal coordinate overlap of no more than 80% was defined as 1. In this way, a distance matrix could be constructed for combined SVs of the same type. We can then use hierarchical clustering to divide the combined SVs of the same type into different clusters. A cluster of SVs supported by two or more methods were then merged as a unified SV by taking the mean of the start coordinates of all SVs in the cluster as the start coordinate of the merged SV and taking the mean of the end coordinates of all SVs in the cluster as the end coordinate of the merged SV. Any cluster of SVs supported by only one method were discarded.

RESULTS
The intansv package takes the predictions of multiple SV callers as input and outputs an integrated list of SVs. To use intansv, the users need to run SV callers and prepare the output files as input to intansv. Here, we use the NGS data of a rice variety Zhenshan 97 to illustrate the versatility of intansv (Materials and Methods) (Xie et al., 2010). Seven different programs, BreakDancer (Chen et al., 2009), CNVnator (Abyzov et al., 2011), DELLY (Rausch et al., 2012), Pindel (Ye et al., 2009), SVseq2 (Zhang, Wang & Wu, 2012, Lumpy (Layer et al., 2014) and SoftSearch (Hart et al., 2013) were used with default parameters to predict SVs based on the alignment of the NGS data of Zhenshan 97 to the Nipponbare reference genome (Table 1). The Nipponbare reference genome is composed of 374.5 million nucleotide base pairs with a GC content of 43.5%. The NGS data of Zhenshan 97 provided over 14× coverage of the Nipponbare reference genome. The first step to use intansv is the reading of the output of SV callers. The intansv package provides reusable and efficient tools to deal with the outputs of the seven programs as they are tedious and not user-friendly for downstream analysis (Fig. 1). During the reading process, low quality predictions given by these programs were filtered out by intansv (Table 1, Materials and Methods). In addition, overlapped predictions output by a single SV caller would be resolved by intansv.
BreakDancer is able to predict deletions and inversions. The prediction of all type of SVs were provided as a single file by BreakDancer. The output of BreakDancer contains 9010 deletions and 1202 inversions in the whole genome of Zhenshan 97 (Table 1). The following R script shows the processing of the output of BreakDancer by intansv. The parameter breakdancer.file.path stores the file path of the output file of BreakDancer in the disk. A total of 6890 deletions and 153 inversions in the whole genome of Zhenshan 97 were read into R as an R list by intansv after filtering low quality predictions (1034 deletions and 24 inversions for chromosome chr05 and chr10 in this demo). The evaluated score and evidence supporting each SV were also read into R. Each element of the R list is an R data frame containing the results of a type of SV, which can be easily wrote into the disk as text files using the write.table function of R. CNVnator is able to predict deletions and duplications. The final output of CNVnator usually contains several files and each file is the output for a single chromosome. All these files should be put in the same directory and the path of this directory should be input to the function readCnvnator of intansv. Then the output of CNVnator will be read into R. The directory given to readCnvnator should only contain the final output files of CNVnator. The output of CNVnator contains 2,806 deletions and 566 duplications in the whole genome of Zhenshan 97 (Table 1). The following R script shows the processing of the output of CNVnator by intansv. The parameter cnvnator.dir.path stores the directory path of all the output files of CNVnator in the disk. A total of 2,805 deletions and 566 duplications in the whole genome of Zhenshan 97 were read into R (369 deletions and 113 duplications for chromosome chr05 and chr10 in this demo). As SVs are much more complex to detect than single nucleotide variations, predictions of different programs were usually integrated to get more reliable results. The second step to use intansv is to integrate the predictions of multiple SV callers. The intansv package provides efficient functions to cluster and merge the predictions of different programs (Materials and Methods). Predictions of varying programs were first clustered together if they have reciprocal coordinate overlap of more than a specified percentage (default value 80%) (Fig. 2). Then clusters supported by more than a predefined number (default value 2) of methods were merged as one SV respectively while the rest clusters were discarded (Fig. 2). The intansv package is also able to integrate the predictions of other SV callers with appropriate file format provided. By integrating the results of different SV callers, intansv provides the most reliable and comprehensive SV predictions (Table 1). The following R script shows the integration of the output of BreakDancer, Pindel, CNVnator, DELLY, SVseq2 by intansv. The integrated results contain 851 deletions, 13 duplications and 16 inversions for chromosome chr05 and chr10 of the Zhenshan 97 genome. A total of 6,255 deletions, 139 duplications and 92 inversions in the whole genome of Zhenshan 97 were detected by integrating the results of seven SV callers (Tables 1, 2). The average length of the 6,255 deletions, 139 duplications and 92 inversions were 3,817 bp (ranging from 103 bp to 312,652 bp), 4,0174 bp (ranging from 109 bp to 898,618 bp) and 92,364 bp (ranging from 143 bp to 904,615 bp), respectively (Figs. 3A-3C). The distribution of SVs in different chromosomes is not even (Fig. 3D) Figure 2 Merging the predictions of different SV callers. Predictions of different programs were first clustered together if they have reciprocal coordinate overlap of more than a defined threshold (default 80%). Then clusters supported by more than two methods were merged as one SV (A and Cluster 1 in B) respectively while the rest clusters supported by no more than two methods were discarded (Cluster 2 in B). Analysis of the effect of SVs to the structure of genes is essential for the understanding of their functions. The third step to use intansv is to annotate the effect of SVs to the genome structure. For each identified SV, intansv provides details on the overlapping between this SV and genes or the elements of genes (Fig. 4A). Functions of the R package GenomicRanges was utilized in the intansv package to dissect the overlapping between SV and gene elements (Lawrence et al., 2013). Based on the NGS data of Zhenshan 97, we  (Tables S1-S3). The final step to use intansv is to visualize the distribution of SVs in the whole genome or a specified genomic region. The chromosomes were split into windows of specific size and the number of SVs within each window were counted and displayed as circular bar plot (Fig. 4B). For a specified region, the SVs and genes in it were displayed as circular rectangles to show the effects of SVs to the structures of genes and elements of genes (Fig. 4C). Functions of the R package ggbio were utilized to realize this functionality in intansv (Yin, Cook & Lawrence, 2012).

DISCUSSION
We provide an R package, intansv, for integrative analysis of structural variations. It can process raw output of seven prevalent programs, merge predictions of different programs, annotate effects caused by SVs to gene and its elements, visualize SVs distribution in the whole genome or a specified genomic region, and output an integrated SV list as a single file. Visualization of SVs in the whole genome is helpful for identifying SV hotspots while visualization of SVs in specific genomic regions is beneficial to uncover the potential effects of SVs through comparing SVs with genomic features or other experimental data. The intansv package is a program with incorporate functions to deal with miscellaneous output styles of different SV callers which has great potential to be widely accepted by the users.
The parameter values of all the functions of intansv can be adjusted by the users to meet the criteria of different users. The source code and the help manual of intansv are deposited in GitHub (https://github.com/venyao/intansv). The functionalities of old versions of the intansv package and the example output file format of different SV callers integrated by intansv is provided at https://venyao.github.io/intansv/.

CONCLUSION
The discovery and genotyping of SV is very important for understanding its role in phenotype variations. Precise and comprehensive detection of SV is still difficult due to the complexity of SVs. We developed the intansv package which is able to integrate the predictions of multiple SV callers. New features were implemented in the intansv package including the annotation and visualization of SVs. With the development of NGS and long read sequencing technology, more and more programs for the prediction of SVs would be developed. We will further enhance the functionalities of intansv in the future by implementing new functions to deal with the output of more programs to liberate researchers from tedious work of data cleansing.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This research was supported by the research start-up fund of Henan Agricultural University (30500581), the Scientific and Technological Project of Henan Province (182102110278), the Project of Henan Provincial Department of Education (18A210017) and the open funds of the National Laboratory of Wheat Engineering, Key Laboratory of Wheat Biology and Genetic Breeding in Central Huang-huai Region, Ministry of Agriculture, Henan Key Laboratory of Wheat Biology. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors: