Detection of False-Positive Deletions from the Database of Genomic Variants

Next generation sequencing is an emerging technology that has been widely used in the detection of genomic variants. However, since its depth of coverage, a main signature used for variant calling, is affected greatly by biases such as GC content and mappability, some callings are false positives. In this study, we utilized paired-end read mapping, another signature that is not affected by the aforementioned biases, to detect false-positive deletions in the database of genomic variants. We first identified 1923 suspicious variants that may be false positives and then conducted validation studies on each suspicious variant, which detected 583 false-positive deletions. Finally we analysed the distribution of these false positives by chromosome, sample, and size. Hopefully, incorrect documentation and annotations in downstream studies can be avoided by correcting these false positives in public repositories.


Introduction
A genomic variant is an alteration of the DNA sequence of an organism. Since an organism's DNA sequence encodes the genetic instructions used in its development, any alteration of this sequence may cause genetic abnormalities or even fatality. According to their sizes, genomic variants are classified into small-scale variants, such as single nucleotide polymorphisms (SNP) and indels, and large-scale variants, namely, structural variations (SV), including copy number variations (CNV), insertions, deletions, inversions, segmental duplications, and translocations [1]. Various complex diseases have been reported to be associated with genomic variants in human genomes [2].
Prior to next generation sequencing (NGS), cytogenetic techniques, such as fluorescence in situ hybridization (FISH) and array comparative genomic hybridization (aCGH), were employed to detect SV. However, due to their relatively low genomic resolution (e.g., microscopic scale (Mbp) for FISH, and submicroscopic scale (kbp) for aCGH) [3], most medical and biological research teams have migrated their platforms to NGS, which can provide base-pair level resolution.
Several approaches have been proposed to detect SV from NGS data. Generally, these approaches can be classified into two categories: paired-end read mapping (PEM)-based or depth of coverage (DOC)-based approaches [4]. For PEMbased approaches, if the span of a pair of mapped reads is longer/shorter than a specified cutoff related to the insert size of the sequencing library, a deletion/insertion can be identified [5], whereas for DOC-based approaches, if the local depth of reads is significantly larger/smaller than the global DOC, a duplication/deletion can be identified [3,6,7]. PEMbased approaches have advantage of detecting balanced SVs (inversion) and unbalanced SVs (deletion and insertion) of relatively small sizes, whereas DOC-based approaches are good at detecting unbalanced SVs (CNV) of relatively large sizes. Besides PEM and DOC, several other supplementary signatures, such as split read mapping, have been combined into PEM and DOC to improve detection performance, leading to integrative models [8][9][10].
Despite PEM's advantage in balanced SV detection, the majority of detection approaches use DOC as the primary signature to identify CNVs [11]. However, the DOC signature is biased due to two main factors: GC content [12] and mappability [13]. BioMed Research International (i) Since G and C form a triple hydrogen bond (whereas A and T form a double bound), theoretically the melting temperature of GC-rich segments is approximately 2 ∘ C higher than that of AT-rich segments [14]. As a result, when the sequencing protocol involves polymerase chain reaction (PCR), GC-rich segments and AT-rich ones are unevenly amplified [15], yielding the correlation between DOC and GC content [16,17].
(ii) Due to the complexity of the human genome, there are regions in which sequenced short reads cannot be uniquely mapped, e.g., repeated regions, such as retrotransposons (LINE and SINE) [13]. Mappability was introduced to measure the uniqueness of such regions using a score that ranges from 0 to 1, corresponding to highly repeated and unique regions, respectively. From the definition, it is clear that the DOC is correlated with mappability.
In human genomes, both GC content and mappability are distributed unevenly along chromosomes, and, therefore, they introduce biases into DOC. Several methods have been developed to correct these two biases [13,18,19].
However, due to the overlook of biases introduced by GC content and mappability, some DOC-based SV studies contain false detection. From the mechanism of PEM and DOC signatures, it is clear that PEM is less affected by GC content or mappability than DOC, and, therefore, PEM can be used to detect false positives. In this paper, we used this idea to verify the entries in the database of genomic variants (DGV). We hope that incorrect documentation and annotations can be avoided in downstream studies by correcting the false positives in this public repository and other related ones such as EMBL-EBI's Genomic Variants archive (DGVa) and NCBI's dbVar.

Samples and Data.
The database of genomic variants (DGV, http://dgv.tcag.ca/dgv/app/home) [20] provides a comprehensive summary of structural variation (SV) in the human genome. In DGV, SVs are defined as genomic alterations that involve segments of DNA with length larger than 50 bp. In DGV, the 6.4 millions of variants represent collections from 55 thousand healthy control samples in 72 studies. DGV provides a curated catalogue of genomic variations in the human genome, which was integrated into EMBL-EBI's Genomic Variants archive (DGVa) and NCBI's dbVar. Therefore, this database is of tremendous importance to investigators whose study interest is about genomic variance, which is also the focus of the current study.
The 1000 Genomes Project (http://www.1000genomes .org/ [21]) is a well-known international collaborative NGS project, which aims to sequence the genomes of approximately 2500 people from 25 populations around the world. In our study, the BGZF compressed sequence alignment/map (SAM) data files (BAM) of most samples were downloaded from the website of this project as the primary dataset.
Sequence read archive (SRA, https://www.ncbi.nlm.nih .gov/sra) is the NCBI database that stores raw sequencing data obtained from NGS technology. The aim of SRA is to make NGS data available to researchers to both improve reproducibility and enable new discoveries. This database includes data from most common sequencing platforms and most NGS studies. The BAM file of some specific samples in our study was not available, so the SRA files were downloaded as the primary dataset.

Methods.
The steps conducted in the study are shown as follows, and the pseudocode is listed in Pseudocode 1 to illustrate the logical structure of these steps.
(1) The latest spreadsheet of supporting variants was downloaded from the DGV website. The GC content and mappability profile of each chromosome of hg18 were downloaded from the readDepth website (https://github.com/chrisamiller/readdepth) [18], which is an NGS-based CNV detection software package.
(2) GC content profiles were smoothed with LOESS [22], and segments with size larger than 500 bp and an average GC content lower than ℎ 1 = 0.26 or greater than ℎ 2 = 0.59 were obtained as suspicious regions.
(3) Mappability profiles were smoothed with LOESS, and segments with size larger than 500 bp and an average mappability lower than ℎ 3 = 0.92 were added to suspicious regions.
(4) All supporting variants in DGV were resolved oneby-one to collect the fields needed in the current study, including variant accession ID, chromosome, genomic location (starting and ending loci), variant subtype, reference, method, and samples used.
(5) Variants associated with the sequencing method and loss or deletion subtype, size smaller than 10 kbp, and nonempty sample fields were filtered for further analysis.
(7) GDV variants that overlapped (F-score greater than 0.9) with suspicious regions were identified as suspicious variants. Here are some notes that should be addressed: (i) A supporting variant represents a variant called in a single sample/individual, which can also be described as sample level variant [20].
(ii) One sample is the minimal requirement to verify a specific supporting variant, so the sample field should be nonempty.
(iii) The terms deletion and loss are equivalent in the database [20].
(iv) Duplicated variants are defined in the sense of the same chromosome ID and genomic location.
(v) The corresponding BAM or SRA file of a suspicious variant was retrieved according to the sample and reference field.

Validation.
The validation of variants is based on the PEM signature. First, for each suspicious variant, we extracted all the read pairs in which both ends were mapped within the region of interest (ROI) from the corresponding BAM file. To provide an adaptive zoom, the ROI is defined as the genomic region that extends both upstream and downstream with 1 kbp plus half of the variant length. Next, the F-score [23] is employed to measure the overlapping quality between the span of a suspicious variant and that of a mapped read pair. The F-scores quantify the overlap quality between two spans, with values ranging from 0 to 1 (see Figure 1, which demonstrates several typical scores). A small value close to 0 means a bad overlap, whereas a high value close to 1 means a good overlap. The F-score is calculated as follows: for a test span, if it has no overlap with the reference span, the F-score is set to 0; otherwise, = 2( /( + )), where is the precision (percentage of the test span that overlaps with the reference span) and is the recall (percentage of the reference span that overlaps with the test span).
In our study, mapped read pairs with F-scores larger than 0.7 were selected, and the average and sum of the mapping quality of all selected pairs were calculated. A suspicious variant was identified as a true positive if the average and sum were above 30 and 90, respectively; otherwise, it was classified

Results
We identified a total of 1923 suspicious variants, which cluster in 7 samples from 8 studies (see Figure 4(b) and   Figure 2 shows two typical suspicious variants from NA18507. Figure 2(a) shows the variant with accession essv4528478, whose genomic location is chr1:210537566-210539442 (green bar in all panels). It is shown that both GC content (the magenta curve in the upper left panel) and mappability (the red curve in the middle left panel) profiles at the ROI are abnormal, and hence they yield a valley in the DOC profile (the blue curve in the lower left panel). As a result, a deletion variant was detected in the DOC profile. However, the PEM signature (the right panel) contains no mapped read pairs (the horizontal lines) that overlap with the green bar, suggesting that this suspicious variant is a false positive. In contrast, Figure 2(b) shows the variant with accession essv4968609 (genomic location chr1:154793140-154795767), whose GC content, mappability, and DOC profiles show similar behaviours to those in (a). However, the PEM signature contains well-mapped read pairs (three black lines in the centre) that overlap with the green bar, suggesting a true positive. Figure 3(a) shows the distribution of the GC content of the human genome (hg18) with a bin size of 100 bp. Based on this distribution, the threshold values ℎ 1 = 0.26 and ℎ 2 = 0.59 are used to determine the estimated extreme GC content regions, such that both the left and right tail areas cover 5% of the whole distribution. Figure 3(b) shows the distribution of the mappability of the human genome (hg18) with a bin size of 100 bp. Since exact 1 and 0 mappability values occupy a large portion of the distribution (68% and 7%, respectively), these two values were excluded from the distribution. Based on the remaining values, the threshold value ℎ 3 = 0.92 was chosen such that the right area covers 20% of the distribution.  Size of SVs / log 10 Figure 4(c) shows the distribution of suspicious variants and false positives with respect to the size. Three modalities are shown: the left one with size smaller than 1 kbp (3), the middle one with size between 1 kbp and 5 kbp (3.7), and the right one with size larger than 5 kbp, which have proportions of 36%, 46%, and 18%, respectively. These results indicate that most false positives are small-or medium-sized variants. By fitting each modality with a Gaussian curve, the means of the three modalities are 660 bp (2.8), 2.2 kbp (3.3), and 7.2 kbp (3.9) for the left, middle, and right modalities, respectively.

Conclusion and Discussion
We proposed an approach to detect GC content and mappability related to false positives from the database of genomic variants. The proposed approach utilized the PEM signature, whose presence is necessary and provides evidence for true positives. 583 false positives were detected by conducting a validation study on the database of genomic variant. The results can avoid incorrect documentation and annotations in downstream studies.
We excluded variants with sizes larger than 10 kbp in this study, and the reasons are as follows: for most NGS alignment/mapping tools, the maximal insert size is limited to thousands of base pairs; e.g., the default values of both thea parameter in BWA and -X parameter in Bowtie 2 are 500 bp, and a very large insert size degrades mapping performance. However, large variants (deletions and inversions) do require a large insert size. As a result, there is a conflict between the mapping quality and the maximal size of detectable variants, and a tradeoff has to be taken with caution. Therefore, we confined the maximal variant size to 10 kbp.
In this study, since there are several software packages/algorithms (e.g., smoothing, segmentation, mapping, etc.) and parameters (F-score, thresholds ℎ 1 , ℎ 2 , ℎ 3 , etc.) that are used in the method, the robustness is an important issue. A global optimization of all parameters is not conducted due to the larger number of parameters, but we set each software/algorithm to its recommended setting and tune each parameter separately to a reasonable value (e.g., th1, th2, and th3). We used the F-score to identify whether a segment overlaps with another segment. From Figure 1, we can see that a threshold value of 0.7 is appropriate to determine that two segments are almost overlapping with each other, so we used 0.7 in the validation step to determine an overlap status. When it is increased to 0.75, 644 false positives are detected, and when it is decreased to 0.65, 546 false positives are detected. Therefore, the results are roughly robust with respect to this parameter. In step 7, we used a large threshold value of 0.9 in order to narrow down the total number of suspicious variants to be validated. We also used three threshold values ℎ ( = 1, 2, 3) to identify suspicious regions. In Figure 3(a), we chose ℎ 1 = 0.26 and ℎ 2 = 0.59 so that both the left and right tail areas covered 5% (or 10% in total) of the whole distribution, whereas in (b) we chose the threshold value ℎ 3 = 0.92 so that the right area covered 20% of the distribution. Since the distribution of GC content is close to a Gaussian distribution, tail areas with 10% are appropriate. However, the distribution of mappability is far from a Gaussian distribution, and values other than 1 are unfavourable. When we adopted the strategy used for GC content to mappability, i.e., the left tail area covering 10%, the resultant threshold value was ℎ 3 , which was too tight to identify suspicious variants. Therefore, we chose 20% for the right area, which yielded a much looser threshold.
There are two limitations of the current study. First, since the PEM signature is more straightforward for studying deletion than other types of variants, the current study focuses on only this type of variant. In future studies, we hope to extend the spectrum of variants being studied. Second, this study focused on only the database of genomic variants as a pilot study to validate our method. In future works, we hope to conduct large-scale validations on other well-known public repositories related to structural variants, such as dbVar (https://www.ncbi.nlm.nih.gov/dbvar), which is NCBI's database of human genomic structural variation, and the Database of Genomic Variants archive (DGVa, https://www.ebi.ac.uk/dgva), which is an EMBL-EBI database that archives publicly available genomic structural variants of all species.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.