DuplicationDetector, a light weight tool for duplication detection using NGS data

Duplications are one on the main evolutionary forces in angiosperm, especially in Poaceae . A large number of genes involved in various metabolisms and pathways originate from such duplications (whole genome, segmental or single gene). However, to detect such duplication may be complicated, costly and generally requires heavy human and material investments. Here, we propose an alternative approach for detecting putative recent segmental duplications in haploid or diploid homozygous organisms based on NGS data. We rely on abusive mappings of paralogous sequences that increase apparent heterozygous points at a given locus to identify such duplicated genomic regions. We test our tool on simulated data, then on true rice genomic sequences and were able to identify about 200 candidate duplicated genes in African rice ( Oryza glaberrima ) lineage compared to Asian one ( O. sativa ). © 2017 The Authors. Published by Elsevier B.V. This is


Introduction
Duplication is an important feature of the plant genome architecture, and can involve a single gene, a chromosome segment, an entire chromosome or even the whole genome [1]. It was shown for instance that angiosperms undergone large scale duplications and multiple whole genome duplications all along their evolution [2]. Whole genome duplication, i.e. doubling the amount of the complete genetic material of an individual without crossing, appears as a source of evolution and biological complexity [1,3,4]. In the same way, segmental duplications create local variations offering new opportunities for natural selection to occur [5]. Therefore, gene and genomic duplications play an important role in the evolution of plant phenotypes [6], and duplicated genes could undergo different behaviors: (i) neofunctionalization -retention of both divergent copies but with a new function for one of them -, (ii) subfunctionalization -retention of both copies with conserved function but in another tissue/organ/timeframe for one -, or (iii) nonfunctionalization/pseudogenization -large number of mutations accumulation in one of the copies [1]. The two first case (neo-and subfunctionalization) may lead to new expression pattern, or even new regulatory pathway [7].
In cultivated Asian rice (Oryza sativa), for instance, genome duplication provided improved root resistance [8], seed germination and seedling growth to salt stress [8,9]. In addition, tandem duplications were evidenced, amplifying adaptively important resistance genes encoding membrane proteins and function related to abiotic and biotic stress [10]. Hence, segmental duplication and tandem duplication lead to HAP (Heterotrimeric Heme Activator Protein) gene duplication regulating rice heading date [11]. However, detecting genome or segmental duplications is a complex task. Different approaches and techniques are used, such as molecular ones that gather (time-consuming) techniques e.g. comparative genome hybridization (CGH) [12], FISH, and array CGH [13]. Recently, due to the availability of Next Generation Sequencing technologies and of their low cost [14][15][16], more computational sequencing-based approaches were developed (e.g. [17]). These methods rely mainly on Depth of Coverage variations (DoC) to identify duplications relative to a reference genome, as a duplicated regions are expected to be twice more sequenced than non-duplicated regions [18]. However, molecular approaches such as DoC methods require highly precise experiments, repetitions and heavy computational times, are designed to compare one target individual to a given reference one, and thus cannot be applied on a large number of individuals.
In this study, we propose a new methodology based on the use of apparent excess of heterozygous loci (AEH) on genomic intervals in autogamous diploid species. This methodology was implemented in a tool called DuplicationDetector, and was tested on a set of simulated data and shown to be fast and robust. In addition, we applied it on cultivated and wild African rices, respectively Oryza glaberrima and O. barthii, to detect duplicated genes compared to the Asian rice Oryza sativa.

Simulated sequence data for validation
A fragment of chromosome 7 (1Mb) of Oryza sativa ssp japonica cv NipponBare IRGSP1.0 was extracted from position 1,000,000 to 1,999,999 using the extractseq tool from EMBOSS [19]. Three duplications (namely duplications 1-3) were artificially created within this sequence using home-made Perl script (available on demand), respectively at positions 300-1500; 350,000-390,000 (containing another initial duplication) and 800,000 to 810,000. A total of nine virtual 'clones' were constructed. Clone1, without duplicated sequences, was considered as identical to the reference. Clone2 has the three duplicated sequences without mutation. Clones 4-6 have the duplicated sequences and mutations with 3% of supplementary divergence, while clones 7-9 have the duplicated sequences with the same mutations and 6% of divergence. In addition we add in clones 3-9 additional common mutations in the non duplicated sequence. All mutations were induced using the mutate dna tool from the SMS2 suite [20]. The sequences from each virtual clone were then used to simulate FASTQ data using aRT simulation tool [21], specifying as options HiSeq2500 machine, 100 pair-end fragment, depth of 35, insert size of 200, 10% of insert size divergence. aRT includes an empiric error model that allows a very good simulation of sequencing data [21].

Sequence data for Oryza glaberrima and O. barthii and initial quality control
Eight accessions of African cultivated rice Oryza glaberrima (TOG5307, TOG5307f, TOG5321, TOG5666, TOG5887, TOG7291, UB06, UG26), and six wild relatives Oryza barthii (B88, IG05, IRGC106302, MB323, TB41, TG57) were used in this study (see [22] for more informations about those accessions). Asian cultivated rice O. sativa IR64 (ssp indica) and Azucena (ssp japonica) were also included as control. All samples were sequenced at Genoscope (France), in the frame of the IRIGIN project (http://irigin.org), as follows: Sequencing: Libraries were prepared using the NEBNext DNA Modules Products (New England Biolabs, MA, USA) with a 'on beads' protocol developed at the Genoscope, thus reducing the costs and increasing the yields. Briefly, after gDNA fragmentation with the E210 Covaris instrument (Covaris, Inc., USA), end repair, A-tailing and ligation with adapted concentrations of Nextflex DNA barcodes (Bioo Scientific, Austin, TX,) were performed on the same AMPure XP beads that was used for the first purification after end repair. After two consecutive 1x AMPure XP clean up, the ligated product was amplified by 12 cycles PCR using Kapa Hifi Hotstart NGS library Amplification kit (Kapa Biosystems, Wilmington, MA), followed by 0.6x AMPure XP purification. Libraries traces were validated on Agilent 2100 Bioanalyzer (Agilent Technologies, USA) and quantified by qPCR using the KAPA Library Quantification Kit (KapaBiosystems) on a MxPro instrument (Agilent Technologies, USA). Libraries were sequenced on an Illumina HiSeq2000 or HiSeq4000 instrument (Illumina, USA), at 2 × 101 bp or 2 × 151 bp. respectively. About 50 billion useful paired-end reads were obtained per run.
QC and initial treatments: Low quality clusters were filtered during the sequencing run by Real Time Analysis (RTA) software. Filtering steps were performed on whole paired FASTQ files: Illumina adapters and primers were removed, nucleotides with quality value lower than 20 were trimmed from both ends and sequences between the second unknown nucleotide (N) and the end of the read were trimmed. Reads shorter than 30 nucleotides after trimming were discarded. These trimming steps were achieved using fastxtend (http://www.genoscope.cns.fr/fastxtend/), a software based on the FASTX library [23]. The filtered reads and their mates that mapped onto run quality control sequences (PhiX genome) were removed using SOAP aligner [24].

Mapping approach & initial SNP calling
For VCF creation, cleaned paired FASTQ data were mapped upon the reference initial sequence using BWA 0.7.12 (aln/sampe legacy algoritm) [26]. SAM (Sequence Alignment/Map) files were cleaned and filtered for low quality mapping and abnormal mapping, merged and realigned using combination of SAMtools [27] and PicardTools [28]. After realignment, SNP were called using the GATK HaplotypeCaller [29] under standard conditions. Calling was performed per individual chromosome to optimize calculation time. All steps were performed using the TOGGLE pipeline [30] to ensure repeatability and traceability. The standard default values were chosen respecting two criteria: I) THE IRIC/3K genomes standards for mapping/calling and II) numerous home made tests and evaluations of conditions using control samples in different analyses (such as in Monat et al. [30], GBE) that provide the best results. Detailed options are shown in suppData, as well as TOGGLE configuration file.

Heterozygous SNP recovery
VCF were filtered out for recovery of lines containing heterozygous SNPs based on standard default filters (depth for each sample, maximum number of missing data, minimal calling quality value, maximum MQ0 value, homozygous controls).

Genomics intervals recomposition
Extracted VCF lines were recompiled in genomics intervals respecting a specified maximal distance between 2 SNPs to be considered as related, a minimal block size, and a minimal heterozygous SNP density. Resulting files are 3 columns BED-like files.

Duplicated genes identification and SNP potential effect identification
Genomics interval files were crossed with GFF file containing annotation using intersectBED from the BEDtools suite [31]. Selected heterozygous SNPs were then annotated for their potential effect using snpEff software [32]. A schematic view of the whole pipeline is detailed in Suppdata.

Availability
All codes, installation instructions and manual for Dupli-cationDetector are available, under the GPLv3/CeCiLL-B double licenses, on the GitHub of the project: https://github.com/ SouthGreenPlatform/duplicationDetector.

Description of the approach
We rely on AEH genomic intervals to detect duplicated sequences (Fig. 1). Basically, we detect abusive mapping of reads coming from duplicated regions in a sequenced individual when they are mapped on a reference genome without the duplication (i.e. harboring only a single copy). Such abusive mapping will lead the SNP calling to produce an Apparent Excess of Heterozygous locus, i.e. too many heterozygous loci in a short region. If many individuals are sequenced and mapped in the same experiment, such AEH loci will appear for (almost) each individual in the same location, indicating thus that the region is duplicated in the sequenced genomes compared to the reference one. Users can manage in DuplicationDetector the level of stringency for selection of AEH loci using different criteria: • Minimal depth per individuals (default at 30) • Minimal number of individuals to be heterozygous for a point to be chosen (default at 10) • Maximal number of MQ0 reads (that can be mapped at two positions with identical score; default at 0) • Control individuals (some samples that may not be heterozygous) The genomic interval creation can also be set up using following criteria: • Maximum size between two heterozygous loci (default at 1kb) • Minimum size of the genomic interval (default at 100bp) • Minimum density of heterozygous loci in the genomic interval (default at 25 bases between each SNP).
If user provide a GFF file for gene annotation, duplicated genes will be identified through overlapping with identified genomic intervals.
In terms of speed, a complete scan from a raw VCF of 16 African rice individuals (12 chromosomes, ∼380Mb) at high sequencing depth (∼35x, see Materials and methods), with two Asian rice individuals as control, will be performed on a single recent 64-bits core in less than 2 h.

Results on simulated data
On simulated data, we were able to partially (∼40%) identify the duplication 1 directly and almost entirely (∼95%) the duplication 3 (Table 1) as fragmented blocks. We were able to limit those two duplications with a quite good resolution, i.e. capacity of correctly identify the borders (max 500bp of error in limitating; see Table 1). The difference of recovery level between the two duplications is mainly due to their size, as for duplication 1 (1.2kb), the non recovered fragment is of 193bp in 5 and 522bp in 3 on 1200, while for duplication 3 the non-recovered size is of 142/355 bp. The fragmentation effect may be due to the fact that duplication 3 is a large block but with a low variation density, and AEH loci density parameter is quite conservative.
Duplication 2 was not detected, as it contains another older (non-simulated) nested duplication, and AEH loci in this region were removed based on the maximum MQ0 parameter. Indeed, reads coming from a region already duplicated in the reference genome could be mapped on any of the two copies on the reference with the same probability. This will increase the MQ0 level, and thus those regions will be filtered out with the current version of DuplicationDetector.
No false positives regions were obtained from the simulated data.

Results on experimental data
We then test our tool on an experimental set of African rice individuals sequenced at relatively high-depth (see Materials and methods). African cultivated rice Oryza glaberrima and its wild relative O. barthii diverged from Asian cultivated rice O. sativa ancestor almost 1 million year ago [33,34], and may harbor duplicated regions compared to Asian rice. We thus tested 8 cultivated and 6 wild samples to evaluate our tool on real data. To avoid individual effect due to the reference genome (i.e. identification of AEH loci only because of a loss of duplication in Nipponbare), we add as control two other O. sativa sample, IR64 (indica subspecies) and Azucena (japonica subspecies), sequenced as described in 2.2.1. Those two individuals must be homozygous and similar to the reference for an AEH locus on the African samples to be validated as such.
The whole analysis on the raw VCF representing variations on all the 12 chromosomes of Nipponbare reference spend less than 2 h using a single core, with a very short memory footprint (max 2Mbytes per file), and provided 786 putatively duplicated regions in African rice lineage relative to Asian one (see Table 2). From those putatively duplicated regions, we can identify 200 annotated gene feature (i.e. tagged as 'gene' in the MSU7 GFF file; see SuppData). The blocks ranged from 102 to 3560 bases, with 5-170 SNP (11SNP/kb on average) with AEH in each block ( Table 2).
The duplicated regions are widely spread all along the chromosomes, without any main locations ( Fig. 2 and SuppData). The putatively duplicated genes themselves seems to be related partially to stress response, but no general trend were observe concerning Gene Ontology (data not shown). We observed a mean value of 38 AEH loci per region, showing that these duplications are quite recent, but dating from before the radiation between O. glaberrima and O. barthii. Indeed, if applying a mean value of 1.3 × 10 −8 mutation per million years (as calculated in [35]), the mean age of the putative duplicated regions is of ∼800,000 years, spanning from ∼605,000 to 972,000 years, as expected from the two group separation [33,34].
Detailed analysis of mutation induced by the AEH loci in duplicated genes showed that most of the mutations occurred outside of the gene coding regions (7-20% only in exonic sequences; Sup-pData). For exonic mutation, a mean Ka/Ks ratio of 1.53 ± 0.45 (1.01-2.75) was observed, indicating a low level of global divergent selection. The mean observed Transition vs Transversion Fig. 2. Chromosomal location of duplications in African rices compared to Asian on Chromosome 1. In red are symbolized duplication involving annotated genes, in green region without annotated genes. ratio is around 2 (1.7-2.3), as expected for genomic mutation [36,37].
When focusing on individual genes, we were able to identify as putatively duplicated the LOC Os07g09900 gene (Chromosome 7, from 5,263,409 to 5,267,310), Disease resistance protein RPM1, located under a major QTL of resistance to African strain of Xanthomonas (qABB-7, from [38]). As it has been shown in other plant species, duplication of resistance genes may increase disease resistance.

Limits of the approach
In a recent paper, Hutin et al. [39] shown a recent partial duplication of 3.2kb of the LOC Os11g31190 gene OsSWEET14 [39] (Chromosome 11, from 18,171,678 to 18,174,478), involved in an increased resistance to Xanthomonas oryzae pv oryzae. A set of 12 high level variant positions, including the 18bp deletion between the original copy and the duplicated one [39], were identified in the variant selection step. However, post-filtrations (especially the minimal SNP density, here of 177bp between two SNP instead of 25) did not allow to recover this duplicated block in our current test.

Discussion
Detection of duplicated regions between individuals is a challenging task using molecular tools, and a costing computing task with sequence data. Up to now, use of the latter approach (mainly using NGS) was based on raw mapping, DoC divergence computation and Copy Number Variation (CNV) analyses. Numerous tools experimented such approaches, with more or less efficiency [14][15][16], but all of them requires a long computation time and cannot compare massively different individuals.
In the present study, we rely on abusive mapping and subsequent AEH loci to identify the duplicated regions. Moreover, our tool does not require intense re-calculation or mapping, as it relies directly on the raw VCF data (already generated in numerous genetic analyses) to identify those AEH loci. This approach is fast and allows to work on large samples; in addition, it offers the possibility to include negative controls which allow users to identify duplications existing in only a subsample of their sequenced individuals. Our approach is however quite conservative, as shown on simulated data, and will not detect for instance new copies of an already existing repeated sequence in the reference genome. In the same way, it cannot identify too recent duplications, due to the low number of mutations between the two copies (as for OsSWEET14). However, applied to the divergence between African and Asian rices, we were able to identify more than 200 putatively duplicated genes and almost 780 total regions. The detected genes are widely spread all along the chromosomes, generally related to stress response, at least marginally, and under a quite low divergent selection.

Conclusion
DuplicationDetector is thus a very efficient tool to detect duplication in haploid and highly homozygous diploid organisms, such as rice (tested here), but also bacteria, yeasts, autogamous plants, haploid fungi, and so on. The future development of our tool will include the implementation of detection in heterozygous or polyploid organisms or both, as well as additional criteria for filtering (such as hard-clipping level).

Author's contribution
GD and FS manage the whole study and wrote the whole pipeline. SE & Genoscope performed the sequencing and initial data treatments and QC. FS performed the simulation, GD and CM performed the basic data analyses, and GD analyzed the results. GD and FS wrote the manuscript, and all authors corrected and approve the current version.