RGAAT: A Reference-based Genome Assembly and Annotation Tool for New Genomes and Upgrade of Known Genomes

The rapid development of high-throughput sequencing technologies has led to a dramatic decrease in the money and time required for de novo genome sequencing or genome resequencing projects, with new genome sequences constantly released every week. Among such projects, the plethora of updated genome assemblies induces the requirement of version-dependent annotation files and other compatible public dataset for downstream analysis. To handle these tasks in an efficient manner, we developed the reference-based genome assembly and annotation tool (RGAAT), a flexible toolkit for resequencing-based consensus building and annotation update. RGAAT can detect sequence variants with comparable precision, specificity, and sensitivity to GATK and with higher precision and specificity than Freebayes and SAMtools on four DNA-seq datasets tested in this study. RGAAT can also identify sequence variants based on cross-cultivar or cross-version genomic alignments. Unlike GATK and SAMtools/BCFtools, RGAAT builds the consensus sequence by taking into account the true allele frequency. Finally, RGAAT generates a coordinate conversion file between the reference and query genomes using sequence variants and supports annotation file transfer. Compared to the rapid annotation transfer tool (RATT), RGAAT displays better performance characteristics for annotation transfer between different genome assemblies, strains, and species. In addition, RGAAT can be used for genome modification, genome comparison, and coordinate conversion. RGAAT is available at https://sourceforge.net/projects/rgaat/ and https://github.com/wushyer/RGAAT_v2 at no cost.


Introduction
With the development of sequencing technologies, it is getting easier to obtain the genome of various species. Up to, genome sequences of 4963 eukaryotes, 125,679 prokaryotes, 12,952 viruses, 10,916 plasmids, and 10,965 organelles have been available in the NCBI genome database (https://www.ncbi. nlm.nih.gov/genome/browse#!/overview/; accessed on December 5, 2017) [1]. The sequence error rate is around 0.01% in the human genome [2]. However, the quality of genome sequences varied considerably due to a variety of factors such as different sequencing platforms used, even if improved by subsequent efforts, especially using next-generation sequencing platforms. In addition, some assemblies have obvious sequencing errors caused by the sequencing platform used, such as homopolymers from Roche/454 and base substitutions from Solexa [3]. Moreover, many more genome projects have released one reference assembly and several resequencing data for different cultivars or closely related species [4,5]. The reference sequences are also constantly updated with newly emerging methods or strategies, such as 10X genomics long reads (https://www.10xgenomics.com/), single molecular sequencing (https://www.pacb.com/), and optical scan (https:// bionanogenomics.com/). Thus, to maintain and utilize the different assemblies, genome upgrade, assembly, and annotation based on known assemblies are on common and great demands. Unfortunately, there are few easy-to-use integrated tools to achieve both genome assembly and annotation transfer based on known reference genomes. Despite some tools, such as SAMtools/BCFtools and GATK, containing the module to create consensus sequence, none of them considers the true allele frequency for each variant, which is important for reducing false positive rate [6][7][8][9]. Another tool, rapid annotation transfer tool (RATT), can be used for annotation transfer, but the accuracy is relatively low for repeat regions [10], whereas iCORN can be used for correcting sequence errors, but not for upgrading annotations [11]. The web-based platforms-UCSC (https://genome.ucsc.edu/cgi-bin/hgLiftOver) and Galaxy (http://usegalaxy.org)-can convert coordinates among different genome assembly versions using the liftOver utility, but only for 106 genomes present in their databases [12][13][14][15]. There is an increasing demand for genome comparison between sub-species and cultivars on the gene level. Therefore, it is imperative to achieve both reference-based genome assembly and annotation transfer for comparative genomic analysis.
Unfortunately, there were few integral tools to perform both functions.
In this study, we reported the development of the referencebased genome assembly and annotation tool, RGAAT, to solve the problems encountered in the process of genome assembly and annotation. Although these problems are very common, we did not find comprehensive solutions despite searching two popular forums: Biostars (https://www. biostars.org/) and SEQanswers (http://seqanswers.com/). RGAAT is implemented in Perl and freely available to users at https://sourceforge.net/projects/rgaat/ and https://github. com/wushyer/RGAAT_v2. It accepts inputs of the genome sequence (FASTA format), annotation (GTF, GFF, GFF3, and BED format), mapping-based new assembly features, such as sequence alignment (SAM/BAM format), sequence variant (VCF format or tab-delimited five-column table containing chromosome, position, ID, reference allele, and alternative allele), and the new genome sequence (FASTA format). The search output displays sequence variants (for sequence alignment and genome comparison), updated genome sequence (for sequence alignment and sequence variant), corresponding coordinates between two genomes (known genome and upgrade/new genome), new genome annotation, and result of genome comparison. This tool can also be used to identify genome variants and to build genome consensus sequences.

Method
RGAAT includes three main modules: variant identification, coordinate conversion, and genome assembly/annotation. The workflow of RGAAT is shown in Figure 1.

Variant identification based on read alignment
The principle of variant identification involves assessment of read quality, mapping quality, and sequence coverage. As several read mapping software have been developed to deal with read and mapping quality, we adopt the mapping results and handle the data at two stages: read processing and variant identification/filter ( Figure 2).
The first part is read processing, i.e., read filtering and locus parsing. Firstly, low-quality reads with average quality score <Q20 were abandoned. Then, we filtered the reads, including those with lower mapping quality, those from PCR or optical duplicate, and those with multiple mapping loci. To distinguish sequence error and variant, we evaluated the adjacent variant distance in dbSNP and built a variant distance trinomial function (y = À0.6944x 3 + 11.31x 2 À 26.71x + 17, x and y indicate the variant number and variant distance, respectively) ( Figure S1). Using this function, we filtered reads based on read mismatch, insertion, and deletion. Due to the relatively low accuracy for short mapped blocks at both ends of the read, we clipped the read ends with short match regions (length of cigar ''M" <8 bp). After obtaining high-quality mapped reads, we parsed reads at each genome locus with quality check and masked reads in repeat regions that were handled by RepeatMasker (http://www.repeatmasker.org) and Tandem Repeats Finder [16]. Next, we recorded related attributes for each locus simultaneously, including (1) raw read coverage, (2) high-quality read coverage, (3) reference and alternative alleles, (4) base quality, (5) read start point number, (6) multi-mapped read number, and (7) mapped read number on each strand.
The second part is variant discovery and filtering. We firstly identified all candidate variants and recorded related attributes. To reduce the false positive rate, we removed the variants with low average base quality (<20), low uniquely mapped allele frequency (<15%), high reference allele frequency (!80%), low read depth (<2), and low variant read number (<2) ( Figure S2).

Variant identification based on genome comparison
RGAAT can be used to generate variants between two assemblies by sequence comparison (Figure 3). We used BLAT for genome comparison because of its ability to map sequences with long gap tolerance to eliminate the influence of repeat sequences, especially for different genome assemblies of the same species. First, we obtained the genome alignment using BLAT (v35) [21] with default setting. For genome comparison between different species, we used parameter ''-minIdentity = 50" for BLAT. There were some redundant alignments and alignment errors in BLAT results due to the presence of repetitive and low-complexity regions. Based on the base number for match, mismatch, insertion, and deletion in query and target genome, we filtered the BLAT results stepby-step as follows. We first identified and kept the best alignment result for each query sequence; we then sorted query alignments based on the coordinate order in target sequences; and finally we removed bad alignments for overlapping records, that is, only the alignment with highest identity was kept whereas other alignments were removed. After that, we identified variants (SNPs and INDELs) and created genome coordinate conversion files (''TargetChrom, TargetStart, TargetEnd, QueryStart, QueryEnd, QueryChrom, QueryStrand") based on the non-redundant genome alignment. Variants were identified at three levels, i.e., SNPs in aligned regions, INDELs in gaps, as well as SNPs and INDELs

Consensus sequence building based on variants
One of the most common needs for re-sequencing projects and genome sequencing of closely-related cultivars, strains, or species is to reconstruct the new assembly based on read alignment files, such as the population-specific consensus genome sequences in humans [22] and other model species. Although GATK and SAMtools/BCFtools can build the consensus sequences based on variants, both tools have some disadvantages. First, GATK and SAMtools identify variants with suppressed read depth (500Â coverage for GATK with the down-sampling setting for and maximally 250Â coverage for SAMtools by default), which may affect allele frequency due to information loss with highly excessive coverage. Second, both tools create consensus sequences with non-reference alleles without considering the true allele frequency and read depth, which are very important parameters for genome upgrade. RGAAT improved the consensus sequence building in both aspects, that is, not setting read depth limit for variant identification and taking allele frequency into consideration. RGAAT can build consensus sequences easily by parsing the variant files in two steps, including (1) selecting the main allele among reference and alternative alleles and (2) adjusting genome location according to variants. For the first step, we selected the major allele based on the allele frequency. Software such as GATK and Freebayes can provide the allelic depths for the reference and alternative alleles by the allelic depth (AD) ID. RGAAT reports the exact allele read number and frequency of the reference and major alternative alleles during variant calling. Note that the ''AF" ID in the VCF file is the max-likelihood estimate of the alternative allele frequency, which is not the true allele frequency. For the second step, we produced the genome coordinate conversion file for further annotation transfer.

Annotation transfer based on variants or genome comparisons
In addition to new genome creation, annotation transfer is an important step for assembly upgrade and further genome comparison at the gene level. Application of next-generation sequencing (NGS) technologies has greatly reduced the sequencing cost and promoted the productivity of genome sequencing projects dramatically. However, genome annotation is both arduous and computing-intensive. Several automatic annotation tools, such as Ensembl [23], NCBI [1], PASA [24], and MAKER [25] have been developed. However, these complicated tools require expertise to use and are more suitable for ab initio genome annotation. RATT is a tool for annotation transfer between similar genomes and can be run easily and quickly [10]. However, RATT uses MUMmer [26] as aligner, resulting in the loss of global sequence consistency during alignment for closely-related genomes, especially for   Note: dbSNP for human samples and the manually-curated variants for chloroplast and mitochondria sequences were used for evaluating the performance of variant calling in RGAAT. During read filtering step, unmapped reads, multi-mapped reads, reads generated from PCR duplicate, reads with low quality, high mismatch, chromosome difference, or large distance for paired-end were removed. At the first filtering step, variants with low read average base quality, low uniquely-mapped allele frequency, high reference frequency, low read depth, or low variant read number were removed. Variants of low read depth, low allele frequency, or low average read quality were filtered to obtain the final variants. GCAT, Genome Comparison and Analytic Testing platform.
repeat regions. For rapid upgrade genome annotation between different genome assemblies, RGAAT can build genome coordinate conversion files based on variants or genome comparison. There are two options for genome annotation transfer: one is to replace the reference genome with variants (creating a new consensus sequence) and change the coordinate for corresponding annotation files; and the other is to transfer reference annotations to the target genome based on genome comparison ( Figure 4). The former is suitable for genome upgrade, while the latter performs better for closely-related genomes without re-sequencing data.
It is important to define the exact syntenic regions between the two assemblies in annotation transfer. With the genome coordinate conversion file, the coordinates of annotation features in the reference genome are transformed. For each annotation feature, the outcome of annotation transfer can be classified into three groups according to the status of start/ end locus: two loci successfully transferred, one locus successfully transferred, and no locus transferred. In the first case, two loci can be easily replaced with new genome coordinates. In the second case, the non-transformed locus can be inferred from the successfully-transferred locus, by considering the distance of two loci in the reference genome and the strand information in the query genome. To reduce the influence of syntenic loss in the low identity region, RGAAT tries to find possible start or stop codons by extending to upstream or downstream sequences in order to infer the non-transformed site. For the first exon, RGAAT tries to find the possible start codon by extending upstream, while for the last exon, RGAAT tries to find the possible stop codon by extending downstream. Due to sequence variations between the two genomes, annotation features may be interfered by SNPs and INDELs, especially for coding sequences (CDS). Meanwhile, to achieve the maximum annotation transfer in syntenic regions, we keep all candidate annotations in the output files and mark the location of problematic annotations (annotations partially transferred due to the interruption by the presence of stop codons) using an interrogation sign ''?". Users can check the annotation with ''?" markers to recover partially transferred annotations interrupted by stop codons. In addition, we prefer using the feature table file (*.tbl) for annotation transfer to be compatible with the NCBI record system. However, it should be pointed out that the success in direct transfer of genes highly relies on the similarity between the two genomes. For the annotation of problematic features, we refer to the information from the successfully-transferred locus, including distance between the two loci of the feature in the reference genome and their strandness in the query genome to ensure the completeness of ORFs. If the similarity of genome is too low, the results of annotation transfer would become unpredictable. In this case, we highly recommend a fully ab initio gene prediction for very distinct genomes.

Evaluation of annotation transfer between genomes in five datasets
We evaluate the efficiency of annotation transfer using RGAAT based on five datasets. These include two chloroplast genome assemblies generated in our lab (GenBank accession: KX028884) using different sequencing platforms, 454 and Solexa, which includes corrected 212 regions in total consisting of 119 base errors, 6 deletions, and 87 insertions as reference to assess annotation transfer between different genome assemblies. To evaluate the annotation transfer between strains, the bacterium Mycobacterium tuberculosis (strain H37Rv; GenBank accession: AL123456 and strain F11; GenBank accession: CP000717) and chromosome IV of yeast Saccharomyces cerevisiae (strain S288C; GenBank accession: NC_001136 and strain ySR127 GenBank accession: CP011550) were used. In addition, the chromosome 14 of the  parasites Plasmodium chabaudi and P. berghei (downloaded from http://ratt.sourceforge.net) and the chromosome IV of yeast S. cerevisiae (strain S288C; GenBank accession: NC_001136) and S. arboricola (strain H-6; GenBank accession: CM001566) were used to assess annotation transfer between species.

SVG figure creation for genome comparison
To reveal the variations between two genomes, an SVG figure is created for each chromosome based on the genome coordinate conversion file and genome annotation file. The figure shows SNPs, INDELs, and identical regions with different colors for the compared genomes with gene blocks. SVG file can be displayed easily in browser, using functions such as drag, zoom in, and zoom out.

Results
Here, we demonstrate the performance of RGAAT, mainly in two parts: variant identification and annotation transfer.

Variant identification
For dataset with low sequence depth The 30Â 100 bp paired-end exome dataset was downloaded from GCAT website and aligned using Bowtie2 with default parameters (Table 1). Before variant identification, we filtered 4.01% aligned reads including those with low sequence quality (3.16%), low mapping quality (0.06%), and high percentage of mismatches (0.79%). We also filtered the short aligned regions on both end of the reads. From the retained mapped reads, we identified 789,170 raw variants. After examining the base quality, variant frequency, reference frequency, multi-mapped reads, read start count, read depth, and the discrepancy between the numbers of reference and variant reads, only 189,106 (23.96%) variants passed the initial filter criteria. Then, we filtered the variants according to the percentage and quality of variant reads. In total, 123,660 final variants were identified. For comparison, GATK and SAMtools were also used to identify variants simultaneously. We uploaded the variants identified by RGAAT, GATK, and SAMtools to GCAT and compared them together with Freebayes. According to the comparison reports, the precision rate of variant identification using RGAAT is higher than those using Freebayes and SAMtools, and was similar to that using GATK. As for sensitivity, the performance of RGAAT is comparable with that of GATK, but higher than that of Freebayes and lower than that of SAMtools ( Table 2). Upon validation with dbSNPs, we observed that RGAAT identified higher number of common variants than GATK, and lower number of novel variants than SAMtools and Freebayes. These observations indicate that RGAAT achieves a good balance between true positives and false positives, which is consistent with the precision rate, specificity, and sensitivity exhibited by RGAAT. In addition, RGAAT shows a higher transition/ transversion ratio (

For dataset with high sequence depth
To assess the performance of identification at higher read depths, we applied RGAAT, GATK, and SAMtools to identify variants in one medium-depth data (200Â human exome dataset) and two high-depth data (5167Â of chloroplast and 1788Â of mitochondria datasets) ( Table 2). With 200Â human exome data, RGAAT showed the highest precision rate and specificity, but lowest sensitivity. The Ti/Tv ratios were 2.63, 2.31, and 2.28 for SAMtools, RGAAT, and GATK, respectively. For the two high-depth data, RGAAT displayed better performance, i.e., higher precision rate, sensitivity, and specificity than SAMtools and GATK (Table 2).

Variant identification by genome comparison
We performed sequence alignment to identify variants between two versions of chloroplast genome sequence generated in our lab using two different platforms by BLAT and compared them with the true variants. All 212 true variants (119 SNPs, 87 insertions, and 6 deletions) were identified by genome comparison, including 191 one-to-one, 8 two-to-one, and 1 five-toone variant matches ( Figure S3A). Note that the variants identified by BLAT is located at the end of aligned region, while the variant identified from read alignment is located in the start of aligned region (see Figure S3B for example).

Annotation transfer
Annotation transfer between different genome assemblies We obtained two genome assembly versions for the chloroplast sample. Using two annotation transfer methods in RGAAT, i.e., variant-based and genome comparison-based, all annotation features were successfully transformed, including 93 CDSs, 54 exons, 141 genes, 8 rRNAs, and 40 tRNAs. In comparison, RATT, another annotation transfer tool, lost 8 genes, 14 CDSs, 1 exon, and 8 tRNAs during transfer ( Table 3). In particular, the transferred annotation in RATT contained one partial CDS and two frameshift CDSs (Table S1).

Annotation transfer between different strains
First, we tested the annotation transfer from the bacterium Mycobacterium tuberculosis strain H37Rv to the strain F11 genome because these two closely related genomes are relative well assembled and annotated. Both RATT and RGAAT completed the transfer within several minutes. Of 8540 annotation features in strain H37Rv, 8417 (98.56%) and 8223 (96.29%) were transferred to F11 by RGAAT and RATT, respectively (Table 3 and Table S2). We inspected all CDSs of strain F11 and found that only 29 (0.73%) in RGAAT and 41 (1.05%) in RATT were not transferred correctly. Among them, inframe stop codons were found in the translation of 17 and 15 CDSs in RGAAT and RATT, respectively, indicating that these CDSs could be pseudogenes. Comparing with known annotation of F11, RGAAT shows similar precision rate (96.08%) with RATT but higher sensitivity (97.50%). Moreover, 140 (4 problematic) and 137 (8 problematic) novel CDSs were identified by RGAAT and RATT, respectively. We then used the chromosome IV of yeast strain S288C to annotate the strain ySR127 that was submitted to NCBI without annotation. All 2430 annotation features in strain S288C were successfully transferred by both RGAAT and RATT (Table 3 and Table S3). Among them, the translation of 5 CDSs in RGAAT and 22 CDSs in RATT terminates earlier, most of which were transferred incorrectly. We thus compared the annotation results between RGAAT and RATT and found that 4 mobile elements and 20 CDSs were inconsistently annotated. After comparing with repetitive elements, we found that 4 mobile elements were mis-transferred in RATT. Among the 20 discrepant CDSs, 18 was incorrectly transferred in RATT, which led to frame shift, and the remaining 2 terminated earlier due to the stop codons present in RGAAT.