Identifying micro-inversions using high-throughput sequencing reads

The identification of inversions of DNA segments shorter than the read length (e.g., 100 bp), defined as micro-inversions (MIs), remains challenging with next-generation sequencing reads. MIs are acknowledged to be an important type of genomic variation and may play roles in causing genetic disease. However, current alignment methods are generally insensitive to MIs. Here we develop a novel tool, MID (Micro-Inversion Detector), to identify MIs in human genomes using next-generation sequencing reads. The algorithm of MID is based on a dynamic programming path-finding approach. What makes MID different from other variant detection tools is that MID can handle small MIs and multiple breakpoints within an unmapped read. Moreover, MID improves reliability in low coverage data by integrating multiple samples. Our evaluation demonstrated that MID outperforms Gustaf, which can currently detect inversions from 30 bp to 500 bp. To our knowledge, MID is the first method that can efficiently and reliably identify MIs from unmapped short next-generation sequencing reads. MID is reliable on low coverage data, which makes it suitable for large-scale projects such as the 1000 Genomes Project (1KGP). MID identified previously unknown MIs from the 1KGP that overlap with genes and regulatory elements in the human genome. We also identified MIs in cancer cell lines from the Cancer Cell Line Encyclopedia (CCLE). Therefore, our tool is expected to be useful for improving the study of MIs as a type of genetic variant in the human genome. The source code can be downloaded from: http://cqb.pku.edu.cn/ZhuLab/MID.

length. Generally speaking, existing methods use features such as read depth, read pair information, and split reads to identify SVs [15]. Yet several challenges remain in using these features to detect MIs. First, the small size of MIs makes identification difficult. The approaches used in methods based on read depth, e.g., CNVnator, would perform poorly on variants as small as MIs without locating the breakpoints. The strategies that use read pairs (e.g., BreakDancer, DELLY) and split reads (e.g., Pindel, PRISM, Gustaf) are also not appropriate for identifying MIs shorter than 100 bp, because the size of the SVs detected by these tools depends on the insert size of the paired-end reads, and the tools are insensitive to variants as small as MIs. Additionally, to increase the applicability of a method, it is important to handle datasets with low coverage. For large-scale projects such as the 1000 Genomes Project (1KGP), there is an increasing amount of low coverage data [16]. For example, the sample NA12878, provided by the 1KGP from two paired-end libraries with 15X coverage, has been frequently selected by variant detection tools for performance comparisons (642 deletions, 271 duplications and 30 insertions) [17]. However, most of the individual human genomes in the 1KGP have so far been generated with a low coverage of 2-4X. Finally, it is also challenging to identify MIs in a read with multiple variants, as other types of variants (e.g., single nucleotide variations (SNVs), small indels and other rearrangements) may also occur in regions near the MI.
In this paper, we focus on MI identification from unmapped NGS short reads. Our algorithm, named MID (Micro-Inversion Detector), specifically detects MIs shorter than the read length. Moreover, MID can be applied to very low coverage samples. What makes MID different from other variant detection tools is that it is sensitive to very small MIs and capable of detecting MIs with multiple breakpoints in one read. Moreover, MID improves reliability in low coverage data by integrating multiple samples. The simulation results showed that MID can detect MIs efficiently and reliably using unmapped NGS short reads, with few false positives. By applying MID to 1KGP data, we identified 721 MIs, of which 349 are intergenic, 342 are intronic, and 30 are exonic. We also applied MID to the Lung Squamous Cell Carcinoma whole exome sequencing (WXS) data from CCLE, where we identified 12 MIs.

Performance of MI detection on simulated data
Until recently, few attempts have been made at MI detection using NGS data. To the best of our knowledge, only one recent tool, Gustaf [10], can detect inversions from 30 bp to 500 bp and can therefore, in principle, recognize MIs shorter than the read length. Gustaf is based on a split-read approach: it records the local alignments provided by other tools and builds a split-read graph, to which standard graph algorithms are applied to evaluate the relationships among the alignments. It should also be noted that no suitable real datasets with confident annotation are available as a benchmark. We therefore used simulated datasets to evaluate the performance of MID. Two types of simulated datasets were generated in the current work: one with 1,000 MIs only (Dataset 1), and the other with 1,000 MIs surrounded by other types of variants (Dataset 2). For Dataset 1, we simulated 1,000 MIs ranging from 15 bp to 40 bp, with MI sizes following a normal distribution, placed randomly on the whole of chr10 (135 Mb) from the human genome assembly hg19 [18]. For Dataset 2, we simulated 1,000 MIs with the same size range (15-40 bp, normally distributed), placed randomly on chr10 and surrounded by 4,000 other structural variants (including SNVs, deletions, insertions, and duplications). We also simulated Illumina paired-end short reads (76 bp, the same read length as sample NA19213 from the 1KGP) using Maq [19] with an error rate of 0.02. Both simulated datasets had 10 sub-datasets with coverage varying from 2X to 60X (see Table 1). We then ran the Burrows-Wheeler alignment (BWA) tool [20] in the same way as in the 1KGP to obtain unmapped short reads.
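To make the simulated variant concrete, the following minimal Python sketch shows what planting one MI in a reference sequence amounts to: the chosen segment is replaced by its reverse complement. This is an illustration only, not the simulation pipeline used in the paper. The function names are hypothetical, and the mean and standard deviation of the size distribution are assumed values, since only the 15-40 bp range and the normal shape are stated above.

```python
import random

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def apply_micro_inversion(reference: str, start: int, length: int) -> str:
    """Replace reference[start:start+length] by its reverse complement."""
    end = start + length
    if start < 0 or length <= 0 or end > len(reference):
        raise ValueError("inversion must lie inside the reference")
    inverted = reverse_complement(reference[start:end])
    return reference[:start] + inverted + reference[end:]


def sample_mi_length(rng: random.Random, mean: float = 27.5, sd: float = 5.0,
                     lo: int = 15, hi: int = 40) -> int:
    """Draw an MI length from a normal distribution truncated to [lo, hi].

    mean and sd are assumptions made for this sketch only.
    """
    while True:
        length = round(rng.gauss(mean, sd))
        if lo <= length <= hi:
            return length
```

Applying the same inversion twice restores the original sequence, which is a convenient sanity check for any simulator built along these lines.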
To give a clear view of our evaluation results, we define several metrics. If at least one simulated read generated from the reference sequence covers the MI, we call the MI a "detectable MI". If the detected MI and the original MI overlap over at least 80 % of the length of each, we call the MI a "correctly detected MI" [10]. We then calculate the sensitivity (SN) as the ratio of correctly detected MIs over all detectable MIs, and the positive predictive value (PPV) as the ratio of correctly detected MIs over all detected MIs.
As mentioned above, unmapped short reads have not been well studied by most existing tools, including Gustaf. However, studying unmapped short reads is very important for the analysis of personalized NGS data, because it lets us exploit more SV information, especially for MIs. Therefore, preprocessing of unmapped reads is needed. After running BWA to obtain unmapped short reads, we aligned the unmapped reads to chr10 in the same way as in MID, and then recorded the anchored alignment results for Gustaf, including the re-aligned reads and the corresponding reference sequence on chr10. The target regions from both the re-aligned read and the reference were then used as the input for Gustaf single-end detection. For MID, we can directly run the whole pipeline on the unmapped short reads discarded by BWA. Table 1 shows the comparison between MID and Gustaf. Note that the sub-datasets with 2X and 4X coverage are most similar to the real data we obtained from the 1KGP. The high SN and high PPV of MID in both Dataset 1 and Dataset 2 demonstrate the high accuracy of MID in detecting MIs, even with other variants nearby. Gustaf demonstrates significantly lower SN and PPV on the simulated data. Furthermore, the lower standard deviation (SD) of MID compared with Gustaf on the simulated data shows the stable performance of MID. Overall, MID outperforms Gustaf in identifying MIs.
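These definitions can be expressed in a few lines of Python. The sketch below assumes that MIs are given as half-open intervals (start, end) on one chromosome and reads the 80 % criterion as a reciprocal overlap, i.e., the shared region must cover at least 80 % of both intervals. The function names are illustrative and are not taken from MID.

```python
def overlap_length(a, b):
    """Length of the intersection of two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def is_correct(detected, truth, min_fraction=0.8):
    """True if the overlap covers >= min_fraction of both intervals."""
    shared = overlap_length(detected, truth)
    return (shared >= min_fraction * (detected[1] - detected[0])
            and shared >= min_fraction * (truth[1] - truth[0]))


def sensitivity_and_ppv(detected, detectable):
    """SN = correctly detected MIs / detectable MIs;
    PPV = correct calls / all detected MIs."""
    found = sum(any(is_correct(d, t) for d in detected) for t in detectable)
    correct_calls = sum(any(is_correct(d, t) for t in detectable) for d in detected)
    sn = found / len(detectable) if detectable else 0.0
    ppv = correct_calls / len(detected) if detected else 0.0
    return sn, ppv
```

For example, with three detectable MIs and two calls of which one matches a true MI, the function returns an SN of 1/3 and a PPV of 1/2.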

Identifying MIs in 1000 genomes project data
The 1KGP provides NGS reads for a large number of individual human genomes [16]. At present, 1KGP data have been widely used for SV detection by existing tools [5,8,16]. However, analysis of MIs is still lacking. To detect MIs, MID was applied to population-scale sequencing data from the 1KGP based on publicly released unmapped BAM files [16]. Run against the reference human genome assembly hg19, MID reports the detailed alignment of each read containing MIs and a list of the unique MIs detected. For a typical sample with a total of 13.5 M unmapped short reads (NA19917), it takes 12 min to run MID with 16 CPU threads.
Generally speaking, low coverage raises problems for MI detection, since less support from reads may lead to unreliable results. However, many datasets from large-scale projects (e.g., the 1KGP) have low coverage. Therefore, it is very useful to develop methods for identifying MIs reliably and efficiently based on multiple low-coverage samples. It has been suggested that integrated analysis using multiple samples can help improve the reliability of SV detection [11]. We assessed the performance of MID on a total of 770 Illumina samples from the 1KGP [16], which have been categorized by population, and then integrated the results at the population level for more reliable and informative results. We eventually focused on the 638 samples in which MID reported at least one MI (full sample list in Additional file 1).
We first calculated the number of unique MIs and the number of reads supporting each MI. In the following analysis, unless otherwise specified, the number of MIs refers to the number of unique MIs, either in one particular sample or in one population. Moreover, if one MI is supported by multiple reads, we count the number of reads containing the same MI as its "occurrence". Altogether, MID reported 2,413 occurrences of 721 MIs in 638 samples (for the full list of MIs and annotations, refer to Additional file 1: Table S1). Of the 721 detected MIs, 349 are intergenic, 342 are intronic, and 30 are exonic according to GENCODE annotation [21]; the exonic MIs include five overlapping with CDS regions, seven overlapping with UTR regions, two overlapping with both CDS and UTR regions, and the rest overlapping with miRNA, Mt_rRNA, etc. (more details can be found in Additional file 1: Table S1). Using ENCODE annotations [22], we identified 13 MIs overlapping with proximal transcription factor binding sites (TFBS) and 48 MIs overlapping with distal TFBS, following ChIP peaks of transcription factors. Because many regulatory elements lie in intronic regions [22,23] and the effect of MIs extends beyond the inverted regions [24], the 342 MIs overlapping with introns, as well as 19 MIs located within the 2,000 bp upstream genomic regions of genes, can also be informative for further analysis. Table 2 presents an overview of the MIs detected in the 638 individual samples: on average, 1.48 individual samples support one MI, ~24 % of MIs are supported by at least two samples in the same population, and approximately 24 % of MIs are supported by different populations in the same ancestry category. The fact that MIs are supported by multiple reads, individual samples, and populations suggests that our method is probably informative and reliable when individual samples are viewed in an integrated way.
In addition, Additional file 1: Figure S2 shows that the length of the MIs detected in 1KGP data ranges from 15 bp to 43 bp (mean length 24 bp). As the majority of MIs concentrate within the size range from 18 bp to 31 bp, we assume that the incidence of MIs is higher in this size range, which might be helpful for understanding the mechanism of MI formation. Furthermore, Additional file 1: Figure S3 shows the distribution of the number of MIs across the human chromosomes, where the number of MIs generally correlates positively with chromosome length, except for chr11.
To focus on the exonic MIs detected, we checked the 30 MIs overlapping with exons annotated by GENCODE. For instance, in Fig. 1a, an MI overlaps with the 3' UTR of gene PREPL and gene SLC3A1, both of which are strongly associated with Hypotonia-Cystinuria Syndrome [25,26]. This MI is supported by 46 individual samples across different populations, including two samples in the East Asian group, 10 samples in the European group, 12 samples in the Americas group, and 22 samples in the African group (Fig. 1b). The chimpanzee sequence at this location is almost identical to the Neanderthal sequence (Vi33.25 Sequence Reads) and to the MI we found, which suggests that the reference human genome is inverted in this region compared with the most recent common ancestor of the human population. However, this MI was reported as a multiple nucleotide variation (GenBank:rs71416108) due to the poor understanding of MIs [27]. Thus our identification of MIs can be helpful for understanding genomic variants. Another MI changes 6 amino acids in the CDS region of gene OR51I1 (Fig. 1c). These amino acids are located on the fourth transmembrane domain, facing the extracellular side, which is less conserved than the intracellular side, and the change might have a severe effect.

Application to CCLE lung squamous cell carcinoma data
In addition to the 1KGP data, cancer genome sequencing data are widely available. A full understanding of somatic alterations in cancer genomes is important to better understand the genetic basis for cancer development [28,29]. Of the whole CCLE data, Lung Squamous Cell Carcinoma is a typical kind of lung cancer, which is a leading cause of death among all cancer patients [30].

The "sam-num" column illustrates the number of samples for each category (either population or population group); the "MI-num" column illustrates the number of different MIs detected in each population or population group; the "MI-occ" column illustrates the sum of occurrences of MIs in each population or population group; the "read-num" column illustrates the number of reads supporting MIs. For the population lines, the "mul-sup" column illustrates the number of MIs supported by at least two individual samples (named "multiple samples supported MIs") in one population, the "ooc/num" column illustrates the ratio of MI occurrence over MI number, which indicates the average number of individual samples supporting one MI in the same population, and the "mul-sup/num" column illustrates the ratio of multiple samples supported MIs over the number of all MIs. For the population group lines (which start with "Total"), the "mul-sup" column illustrates the number of MIs supported by at least two populations (named "multiple populations supported MIs") in the same population group, the "ooc/num" column illustrates the ratio of MI occurrence over MI number, which indicates the average number of populations supporting one MI in the same population group, and the "mul-sup/num" column illustrates the ratio of multiple populations supported MIs over the total number of MIs. The last "read/num" column illustrates the ratio of the number of reads containing MIs over the number of MIs, which indicates the average number of reads supporting each MI.
In this work, we selected 14 WXS datasets of Lung Squamous Cell Carcinoma from CCLE to provide a proof-of-concept demonstration of the applicability of MID to cancer sequencing data (full sample list in Additional file 1). We found 12 MIs overlapping with 15 genes (Additional file 1: Table S2). More specifically, according to GENCODE annotation, seven MIs overlap with CDS regions, one MI overlaps with the CDS region of one gene and the UTR region of another gene, and the rest are located in regions near CDS. Moreover, three MIs overlap with proximal transcription factor binding sites (TFBS) annotated by ENCODE following ChIP peaks of transcription factors (more details can be found in Additional file 1: Table S2). We also found one MI within the 2,000 bp upstream region of gene VPS54. In Fig. 2a, we show an MI that breaks the edge of a CDS region and changes three amino acids of gene PSRC1, which encodes a proline-rich protein that is a target for regulation by the tumor suppressor protein p53. In addition, Fig. 2b presents an MI that changes five amino acids of gene JMJD4 and overlaps with the 5' UTR of gene SNAP47. JMJD4 is a member of the JmjC domain-only family, whose members contain only the JmjC domain; it plays an important role in demethylation and is involved in cancer diagnosis [31]. The variants and expression bias of genes in this family are related to cancer regulation, so the change made by the MI in the JMJD4 gene may help enrich the study of JMJD4 and related cancer diagnosis.

Discussion and conclusions
As presented above, on the simulated data MID shows a steady PPV for both low coverage and high coverage data, varying from 2X to 60X. Significantly, MID demonstrates high PPV on the simulated datasets with low coverage of 2-4X, which is the same coverage range as the low coverage samples from the 1KGP. The reason our method shows stable performance on low coverage data is that the identification process is based on the target regions of the read and the reference sequence, regardless of read depth. Thus coverage bias has little influence on our method. By contrast, most existing tools require sufficient coverage to detect breakpoints, and only perform well on high coverage data [4-10]. However, during our anchoring step, in which the Bowtie program is called to make anchored alignments as preprocessing, higher coverage data provide more detectable reads before the identification process. We then have more reads containing possible MIs to examine during identification. Therefore, although MID performs well on low coverage data, it can also benefit from data with higher coverage. The pipeline of MID contains a preprocessing stage and a sequence mapping stage. During the preprocessing of MID, we performed anchored alignment by calling Bowtie. The reason we chose Bowtie [32] instead of the more recent Bowtie2 [33] is that the paired-end reads (anchors) are shorter than 50 bp for ~100 bp NGS short reads, and Bowtie outperforms Bowtie2 in this size range, which is supported by our tests and also stated clearly in the tool documentation [33]. For sequence mapping, MID can handle small MIs and MIs with multiple breakpoints, owing to the flexible segment mapping and the scoring system used during path finding.
With existing tools such as BWA, although a number of discontinuous gaps and SNVs might be detected, it is extremely difficult to identify more complex SVs, including inversions and duplications. In addition, more complicated scenarios, i.e., MIs with multiple breakpoints within the read, would hardly be taken into consideration by these tools. MID uses flexible segment mapping on both strands based on k-mers, which is more suitable for MI detection, especially for dealing with small MIs and MIs with multiple breakpoints nearby. The scoring process also helps confirm the final path and distinguish MIs from incorrect matches, including palindromic sequences.

In summary, we have developed a novel computational tool, MID, to identify MIs by mapping initially unmapped short reads back onto the human genome sequence. What makes MID different from other SV detection tools is that our approach is sensitive to very small MIs and capable of detecting MIs with multiple breakpoints in one read, owing to the flexible segment mapping process and to the scoring system, which distinguishes MIs from palindromic sequences and other incorrect matches. The pipeline of MID can start from unmapped BAM files and finds the optimum solution automatically, without requiring parameter tuning. To our knowledge, MID is the first method that can efficiently and reliably identify MIs from unmapped short NGS reads. Moreover, MID is reliable on low coverage data, which makes it suitable for large-scale projects such as the 1KGP. We realize that the mechanism and the function of MIs, as a kind of SV (occurring as both germline and somatic alterations), are still poorly understood. Nevertheless, we expect that our tool will be useful for better understanding MIs and their roles in genetic diversity and disease. In conclusion, MID is suitable for large-scale short reads produced by present high-throughput sequencing technologies (e.g., Illumina), especially for low coverage data, and MID might have a positive impact on identifying key genetic variants in human diseases as sequencing technologies develop further.

Methods
With an input of a BAM file containing unmapped short reads, MID takes three steps to report the output of a list of MIs, as well as detailed alignment information for each MI. The three main steps for identifying MIs are: (1) Create anchored alignment, which determines the corresponding region on the reference genome for the reads that might harbor MI. (2) Perform detailed alignment between the read and the genomic region identified in Step (1). (3) Report a list of MIs with additional information on how the MI-containing reads can be aligned. The workflow of MID is shown in Fig. 3.

Anchoring and k-mer mapping
Anchoring
We provide an option to cut a fixed number of bases (0 bp as default) from both the head and tail regions of each unmapped short read, since these regions might be more error-prone, and then select the new head and tail regions after cutting as "anchors". For short reads of ~100 bp, we do not cut the two ends and select the head and tail regions of each unmapped short read as "anchors" (18 bp as default). We then map the two anchors onto the reference genome as the two ends of one paired-end read with a length of 18 bp using Bowtie [32] to find potential MI regions in the read (Fig. 4a). For longer reads (e.g., ~200 bp), we recommend choosing the cutting size according to sequencing quality before selecting anchors. One mismatch on the anchors generated from a short read is allowed during mapping. Take the short read data of individual sample NA19213 as an example. The length of short reads in NA19213 is 76 bp, so the original distance between the two anchors in the same read is 40 bp (76 bp minus two 18 bp anchors). Owing to the possible combination of variants, including unbalanced variants such as small indels, insertions, deletions and duplications, we allow the distance between the two anchors to range from 15 bp to 65 bp, which is specified as the insert size for the paired-end read in Bowtie. If the distance is shorter than 15 bp because of a possible deletion, there is no need to consider the read, since our algorithm aims to detect MIs longer than 15 bp. If the distance is longer than 65 bp because of a possible insertion, we consider the match redundant, since we focus on MI detection and the inversion length in this read can be 40 bp at most. The length of the anchors can be adjusted by users. However, we suggest a minimum of 10 bp; otherwise the pair of anchors might not have a unique alignment result in this step. This process significantly reduces the potential search space for MID.
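The anchor selection and the distance arithmetic above can be sketched as follows. This Python fragment only splits a read and checks the allowed anchor distance; it does not call Bowtie, and the function names are hypothetical rather than part of MID.

```python
def split_read(read: str, anchor_len: int = 18, cut: int = 0):
    """Trim `cut` bases from both ends, then return (head, middle, tail).

    head and tail are the anchors; middle is the region between them.
    """
    trimmed = read[cut:len(read) - cut]
    if len(trimmed) <= 2 * anchor_len:
        raise ValueError("read too short for two anchors plus a middle region")
    head = trimmed[:anchor_len]
    tail = trimmed[-anchor_len:]
    middle = trimmed[anchor_len:-anchor_len]
    return head, middle, tail


def anchor_distance_ok(distance: int, lo: int = 15, hi: int = 65) -> bool:
    """Accept an anchored alignment only if the anchors lie lo..hi bp apart."""
    return lo <= distance <= hi
```

For a 76 bp read with the default 18 bp anchors and no cutting, the middle region is 76 - 2 x 18 = 40 bp, which matches the NA19213 example.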

Seeding
A k-mer extracted from the read is called a seed (14 bp as default). The middle part of the read, between the head and tail anchors, is called the "target region" (Fig. 4a). We select the target region of the read and extract consecutive seeds from it (step size 1 bp as default). We also extract consecutive seeds (step size 1 bp as default) from the reverse complement of the target region. These two groups of seeds are stored and processed separately in the following steps.
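As a minimal illustration of the seeding step, the Python sketch below extracts overlapping k-mers from a target region and from its reverse complement and keeps the two groups separate. The function names are illustrative assumptions, not MID's actual interface.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_seeds(region: str, k: int = 14, step: int = 1):
    """Return (offset, k-mer) pairs for consecutive seeds in `region`."""
    return [(i, region[i:i + k]) for i in range(0, len(region) - k + 1, step)]


def seeds_for_both_strands(target: str, k: int = 14, step: int = 1):
    """Seeds from the target region and from its reverse complement,
    returned as two separate lists."""
    forward = extract_seeds(target, k, step)
    reverse = extract_seeds(reverse_complement(target), k, step)
    return forward, reverse
```

With the defaults, a 40 bp target region yields 40 - 14 + 1 = 27 seeds per strand.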

Seed matching
The matching process is performed on the target regions of both the read and the reference sequence. After the anchoring process, we know the exact location of the unmapped short read on the reference sequence. We then select the target region of the initial read and its corresponding target region on the reference sequence. This seed-matching process can tolerate MIs with indels around the inversion breakpoints. We allow a maximum of i (2 as default) mismatches in a seed. In other words, if more than i mismatches occur within an interval of k, this k-mer (seed) will not be aligned. During seed matching, we compare each seed in the read target region to the reference target region and store all possible locations for each seed on the reference. To reduce the effect of a particular type of palindromic sequence, namely tandem repeats (such as "AT"s or "GC"s), when matching seeds from the reverse complement to detect inversions, we discard seeds that contain n or more "AT"s or "GC"s (n = L/4, where L is the length of the substring sequence).

Fig. 3 The workflow of the MID program
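The matching rule can be sketched with a brute-force scan, shown below in Python. This is not how MID is implemented; it only illustrates the mismatch tolerance and one possible reading of the repeat filter, in which "AT"s and "GC"s are counted as dinucleotide occurrences within the seed. The function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_seed_hits(seed: str, ref_region: str, max_mismatches: int = 2):
    """All start positions in ref_region where the seed aligns
    with at most max_mismatches mismatches."""
    k = len(seed)
    return [p for p in range(len(ref_region) - k + 1)
            if hamming(seed, ref_region[p:p + k]) <= max_mismatches]


def is_repeat_like(seed: str) -> bool:
    """True if the seed holds at least L/4 'AT' or 'GC' dinucleotides
    (an assumed interpretation of the filter described above)."""
    threshold = len(seed) / 4
    return seed.count("AT") >= threshold or seed.count("GC") >= threshold
```

A seed such as ATATATATGGCCAA (14 bp, four "AT"s, threshold 3.5) would be discarded before matching against the reverse strand.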

Path finding and MI identification
This step aims to obtain the best path of k-mer alignment. An MI within a read corresponds to a reversed sub-path.
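As a rough preview of this step, the following Python sketch chains matched segments by dynamic programming and picks the highest-scoring chain. It is a simplified illustration under stated assumptions, not MID's scoring scheme: segments are tuples (read_start, read_end, strand, score), the gap penalty is linear in the unmatched read length between two segments, and a fixed penalty is charged for each reverse-strand segment. All names and penalty values are hypothetical.

```python
def max_score_path(segments, gap_penalty=1.0, reverse_penalty=5.0):
    """Return (best_score, path) where path lists segment indices in read order.

    best[i] is the best score of any chain ending with segment i, computed
    from all earlier, non-overlapping segments j (quadratic in len(segments)).
    """
    if not segments:
        return 0.0, []
    order = sorted(range(len(segments)), key=lambda i: segments[i][0])
    best, prev = {}, {}
    for rank, i in enumerate(order):
        start, _end, strand, score = segments[i]
        own = score - (reverse_penalty if strand == "-" else 0.0)
        best[i], prev[i] = own, None
        for j in order[:rank]:
            if segments[j][1] > start:      # overlapping on the read: skip
                continue
            gap = start - segments[j][1]
            candidate = best[j] - gap_penalty * gap + own
            if candidate > best[i]:
                best[i], prev[i] = candidate, j
    end = max(best, key=best.get)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = prev[node]
    return best[end], path[::-1]
```

When a forward and a reverse segment compete for the same stretch of the read with equal match scores, the reverse penalty makes the forward chain win; a reversed sub-path is reported only when the reverse-strand segment matches clearly better than any forward alternative.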

Merging consecutive seeds
One seed can be matched to multiple locations on the reference sequence due to its short length. If the number of collinear consecutive seeds on the reference is larger than a threshold (5 as default), we merge this set of seeds into a "matching segment" (MS); the MS on the read together with its matched location on the reference forms a "matching segment pair" (MSP). We might obtain different constructs of MSPs containing the same seeds, or one MS in the read that can be matched to different locations on the reference sequence, resulting in multiple MSPs for the same MS. The information for every possible MSP is recorded and is further evaluated in the following steps.
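One simple way to realize this merging for forward-strand seeds is to group hits by diagonal (reference position minus read position) and keep runs of consecutive read offsets. The Python sketch below follows that idea; it is an assumption-laden illustration, not MID's code. Hits are (read_pos, ref_pos) pairs, and reverse-strand seeds would need the analogous treatment on their own coordinates.

```python
from collections import defaultdict


def merge_seed_hits(hits, k=14, min_seeds=5):
    """Merge collinear consecutive seed hits into segments.

    Returns sorted (read_start, ref_start, length) tuples for every run
    containing more than min_seeds seeds.
    """
    by_diagonal = defaultdict(list)
    for read_pos, ref_pos in hits:
        by_diagonal[ref_pos - read_pos].append(read_pos)

    segments = []
    for diagonal, positions in by_diagonal.items():
        positions = sorted(set(positions))
        run_start = previous = positions[0]
        # A trailing None flushes the last run.
        for pos in positions[1:] + [None]:
            if pos is not None and pos == previous + 1:
                previous = pos
                continue
            n_seeds = previous - run_start + 1
            if n_seeds > min_seeds:
                segments.append((run_start, run_start + diagonal, n_seeds + k - 1))
            if pos is not None:
                run_start = previous = pos
    return sorted(segments)
```

Eight consecutive 14 bp seeds on one diagonal merge into a single 21 bp segment, while isolated hits elsewhere are dropped.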

Merging neighboring MSPs
While we set the threshold of i (2 as default) mismatches in one seed, if SNVs occur at the edges of MSPs and a k-mer covering the SNVs reaches the threshold of i mismatches, then this k-mer fails to be aligned. To

Fig. 4 (caption fragment) A and B. c shows the max-score path approach. The path starting with MSP[1] and ending with MSP[4] denotes a short read to be detected, and the path in red is the max-score path. MSP[1], MSP[2], MSP[3], MSP[4] are on the forward strand, and MSP[−3] is on the reverse strand. MID starts with MSP[1] and extends the path to MSP[2], then if MSP[3]

Identifying non-overlapping MSPs
MSPs may overlap with each other. Since we want to find paths constructed from non-overlapping MSPs, we take a partition-and-recombine strategy to make sure all the information carried by the MSPs is kept (Fig. 4b). Before this step, MSPs in different orientations are stored and processed separately. We now put the two separate sets of MSPs together and sort them by their locations on the reference sequence. Next, we find each pair of overlapping MSPs and partition out the overlapping subsequence to obtain a new set of non-overlapping MSPs. We then recombine the overlapping subsequence with either one of the two MSPs or with neither of them. As shown in Fig. 4b, this partition-and-recombine strategy checks overlapping MSs on both the read and the reference sequence to obtain the non-overlapping MSP set. The procedure guarantees that the transformation of MSPs is lossless, because we keep all the possible combinations in the non-overlapping MSP set. The resulting non-overlapping MSPs are used in the subsequent path finding step.
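The three recombination choices can be illustrated in one dimension. The Python sketch below takes two partially overlapping half-open intervals on the read and returns the three non-overlapping alternatives: the shared part goes to the first segment, to the second, or to neither. It is a simplification made for illustration (MID also tracks reference coordinates and strand), and the function name is hypothetical.

```python
def resolve_overlap(a, b):
    """Return the non-overlapping alternatives for two intervals (start, end).

    For a partial overlap, three candidate pairs are produced; intervals
    that do not overlap are returned unchanged as a single pair.
    """
    (s1, e1), (s2, e2) = sorted([a, b])
    if s2 >= e1:
        return [((s1, e1), (s2, e2))]
    if s1 == s2 or e2 <= e1:
        raise ValueError("containment is not handled in this sketch")
    return [
        ((s1, e1), (e1, e2)),   # overlap kept by the first segment
        ((s1, s2), (s2, e2)),   # overlap kept by the second segment
        ((s1, s2), (e1, e2)),   # overlap assigned to neither
    ]
```

Keeping all three alternatives as candidates is what makes the transformation lossless: the later scoring step, not the partitioning, decides which assignment is best.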

Maximum-score path finding
We generate all combinations of non-overlapping MSPs as path candidates (Fig. 4c) and calculate the alignment score of each candidate to find the path with the maximum score by dynamic programming (see definitions of variables below). In dynamic programming, we define two kinds of score. One is for the MSP itself, which includes the score for matches and mismatches. The other is for the gap penalty between two neighboring MSPs in the path. We record the maximum score T[i] of every path ending with MSP[i] based on the maximum score of any sum of T[j] and GS[j,i] (1 < j < i). The complexity of our algorithm is O(n^2), where n is the number of MSPs. In practice, n is smaller than 10 in ~85 % of cases. We consider match, mismatch, and gap in the similarity score [34], and finally we select the maximum alignment score. In Fig. 4c, the path starting with MSP[1] and ending with MSP[4] denotes a short read to be detected, and the path in red is the max-score path. MSP[1], MSP[2], MSP[3], MSP[4] are on the forward strand, and MSP[−3] is on the reverse strand. MID starts with MSP[1] and extends the path to MSP[2]; then, if MSP[3] and MSP[−3] can both be matched, MID records both path candidates and ends with MSP[4]. Therefore we have two path candidates as follows: {1, 2, 3, 4} and {1, 2, −3, 4}. After scoring, path candidate {1, 2, 3, 4} will be chosen due to the reverse penalty for MSP[−3] on the reverse strand, which