A new strategy to infer circularity applied to four new complete frog mitogenomes

Abstract We applied a novel strategy to infer sequence circularity and complete assembly of four mitochondrial genomes (mitogenomes) of the frog families Bufonidae (Melanophryniscus moreirae), Dendrobatidae (Hyloxalus subpunctatus and Phyllobates terribilis), and Scaphiopodidae (Scaphiopus holbrookii). These are the first complete mitogenomes of these four genera and Scaphiopodidae. We assembled mitogenomes from short genomic sequence reads using a baiting and iterative mapping strategy followed by a new ad hoc mapping strategy developed to test for assembly circularization. To assess the quality of the inferred circularization, we used Bowtie2 alignment scores and a new per‐position sequence coverage value (which we named “connectivity”). Permutation tests with 400 iterations per specimen and 1% or 5% chance of mutation at the ends of the putative circular sequences showed that the proposed method is highly sensitive, with a single nucleotide insertion or deletion being sufficient for circularity to be rejected. False positives comprised only 2% of all observations and possessed significantly lower alignment scores. The size, gene content, and gene arrangement of each mitogenome differed among the species but matched the expectations for their clades. We argue that basic studies on circular sequences can benefit from the results and bioinformatics procedures introduced here, especially when closely related references are lacking.

from short genomic sequence reads sequenced using high-throughput sequencing technology even when the total number of reads is low, and the reference belongs to distantly related taxa (i.e., different species, family, or even order). Subsequently, several authors have incorporated both the strategy and the partial mitogenomes provided by Machado, Lyra, and Grant into their methods and body of evidence (e.g., Anmarkrud & Lifjeld, 2017;Vacher, Fouquet, Holota, & Thébaud, 2016;Yuan, Xia, Zheng, & Zeng, 2016), highlighting the need to develop and improve the methods used to assemble and analyze mitogenome sequences.
A vital step of the de novo assembly of mitogenomes is the validation of sequence completeness, which, in the case of circular molecules, is assessed by testing for circularity. While assembling four novel mitogenomes of frogs from the families Bufonidae (Melanophryniscus moreirae), Dendrobatidae (Hyloxalus subpunctatus and Phyllobates terribilis), and Scaphiopodidae (Scaphiopus holbrookii) using the strategies advanced by Machado, Lyra, and Grant (which was based on partial mitogenomes only), we encountered significant difficulties in extracting the putative circular mitogenome from the final scaffolds, which included both spurious duplications of mitochondrial DNA fragments and reads that were erroneously assembled to the ends of the scaffolds.
A quick survey of the recent specialized literature (searching https://www.scopus.com and https://scholar.google.com.br for "complete mitochondrial genome" within the last 3 years) shows that many authors infer circularity through visual inspection of either reads at the ends of the assembly (e.g., Gan, Schultz, & Austin, 2014;Grau, Nuñez, Plötner, & Poustka, 2015;Vacher et al., 2016) or other more complicated methods (Cong & Grishin, 2016). However, we have observed that MITObim assemblies often produce sequences flanked by erroneous sequences that seem to have resulted from the spurious assembly of repetitive fragments. In such cases, it can be difficult or impossible to detect circularity through visual inspection-even if the entire mitogenome was assembled correctly.
Several quantitative approaches have also been proposed. A few programs, such as circules.py (distributed with MITObim), search for putative circular sequences based on k-mer (i.e., substrings of length k contained in a sequence), overlap within an expected range of sequence length. However, these programs provide limited statistics to assess overall quality and compare alternative assemblies. More elegant solutions are available that check assembly circularity through homology searches using BLAST (Altschul, Gish, Miller, Myers, & Lipman, 1990) and comparing the size of the assembled genome to the reference (Soorni, Haak, Zaitlin, & Bombarely, 2017). Additionally, tools such as Minimus2 (Sommer, Delcher, Salzberg, & Pop, 2007) and Circlator (Hunt et al., 2015) perform circularization of genome assemblies by joining individual contigs, although they do not clip circular fragments included in larger scaffolds and focus only on identifying sequences common to both ends of a contig. All these strategies are limited to specific research questions, pipelines, sequencing technology, and availability of closely related reference genomes, creating a demand for alternatives.
Giving the abovementioned limitations and the lack of metrics to guide the choice among multiple possible options to trim the final scaffolds, we used our empirical data to develop and refine a package of computer programs validated by permutation tests to overcome the difficulty of inferring the completeness of de novo assembled mitogenomes.

| Whole genomic DNA sequencing
To increase the diversity of complete mitogenomes from undersampled clades, we selected four species of frogs from which to sequence mitogenomes. The mitogenome of Scaphiopus holbrookii (Harlan, 1835) is the first of the family Scaphiopodidae and that of the dendrobatid poison frog Phyllobates terribilis (Myers, Daly, & Malkin, 1978) is the first for its genus. The complete mitogenomes of the bufonid Melanophryniscus moreirae (Miranda-Ribeiro, 1920) and the dendrobatid Hyloxalus subpunctatus (Cope, 1899)

| Quality control
Postsequencing quality control was performed using the detailed guidelines provided by Machado et al. (2016) with some modifications. Specifically, adapter trimming for mate-pair sequences was performed using NxTrim v0.3.0-alpha (O'Connell et al., 2015), and all filtered reads were analyzed with FastUniq v1.1 (Xu et al., 2012) to remove putative PCR duplications. The overall quality of all sequence reads was evaluated before and after postsequencing quality control using FastQC (Andrews, 2010).

| Mitogenome assembly
Mitogenomes were assembled using MIRA v4.0.2 (Chevreux & Suhai, 1999) and MITObim v1.8 (Hahn et al., 2013) following the baiting and iterative strategy using reference genomes from different genera or families, as discussed by Machado et al. (2016) We used only sequences identified as paired-end reads after quality control for the assembly. The interleaved paired-end sequence read file from P. terribilis was the largest (>150 GB disk size with ~500 M reads). Assuming mtDNA reads have a random distribution of occurrence within sequenced libraries, analyzing only a fraction of the paired-end reads should provide adequate information to assemble the mitogenome. Therefore, we divided the reads of P. terribilis into three files of up to 52 GB disk size and ~170 M read pairs, ultimately reducing computational requirements and assembly run-time. We validated this strategy by comparing the three scaffolds.

| Circularity inference
We divided the problem of testing for assembly circularization into two parts: The first part of the problem is to find putative overlapping sequences and use the original sequence reads to validate the circularization. To track these putative overlapping sequences, we devised a strategy that searches for identical k-mers (i.e., continuous text strings) at a minimum distance from each other (Figure 1a). Once a putative mtDNA sequence is found, it is flipped and rewritten, so the ends are adjacent to each other in the middle of the sequence ( Figure 1b).
In the second part of the problem, we use Bowtie2 (Langmead, Trapnell, Pop, & Salzberg, 2009) to map the original paired-end reads to the flipped fragment. Finally, we calculate quality metrics for the assembly. These metrics include sequence similarity, coverage, and average alignment score. We also calculated a modified per-position sequence coverage value (which we named "connectivity") in which sequence reads that start or end at a position are excluded from the coverage calculation of that position. This allows us to quantify the number of reads that support the position of a particular nucleotide in relation to its two adjacent nucleotides (e.g., in the sequence fragment "ACT", the connectivity of "C" ignored reads starting in or ending in "C" and considers only reads that align to the entire fragment "ACT"; Figure 1c).
Bowtie2 can align short reads quickly and efficiently, and the remaining operations can be executed in linear time, making the entire process feasible using standard personal computers. Mapping reads with Bowtie2 allowed us to test the sensitivity of our strategy to multiple k-mer and mtDNA sizes.
To validate our approach, we randomly added or deleted nucleotides in 50-bp fragments at both ends of the proposed circular sequence. We performed random deletions and additions with 1% and 5% chance, iterating the probability of modification 100 times per operation (nucleotide deletion or addition) for a total of 400 iterations per species. We then flipped sequences to make their ends adjacent to each other at the middle of the sequence. Finally, we used the flipped and modified sequences to compare the results of each iteration based on the same quality metrics described above and ranked them by the distance to the flipped sequence with no modifications (modification probability of 0%) in terms of similarity, connectivity, and average alignment scores.

| Mitogenome annotation
We parsed the DNA in CAF format using a Python script (parse-Caf.py described in Machado et al., 2016) to extract DNA data and F I G U R E 1 Main steps of our strategy to infer circularity. (a) We search for k-mers of a specified length, from the end to the middle of the scaffold, with the condition that they are at a minimum distance from each other. (b) The longest putative circular sequence found for each k-mer size is flipped, so the 5′ and 3′ ends will be adjacent to each other in the middle of the fragment. (c) Original sequence reads are remapped against the flipped putative circular sequence. All the mapped reads (represented by reads 1-3) contribute to the average alignment score. For each nucleotide, only the reads that support its position in relation to the two adjacent nucleotides (represented by read 1) are counted to determine the contiguity coverage (mt-cyb) and the LTPF tRNA cluster in neobatrachians (mt-tl1, mt-tt, mt-tp, and mt-tf) (Zhang et al., 2013), was annotated using sequence similarity searching with BLAST using default parameters (Altschul et al., 1990).
We compared our complete mitogenome sequences from  Parallel assembly of each mitogenome, sequentially followed by sequence annotation, was performed in 1-2 days of computer and user time.

| Software
The AWA (the Tupi word for "round") package comprises all the

| Inference of circularity
The four mitogenome assemblies passed the circularization tests with average alignment scores of −2.89 to −0.29 (the Bowtie2 alignment score is ≤0 in end-to-end mode, and the quality of the alignment is directly proportional to the alignment score). With 5% chance of adding or deleting a nucleotide at the ends of the sequence, no permutation passed the circularization test. Likewise, with 1% chance of adding a random nucleotide, no sequence was considered circular. False positives for circularity only occurred under a 1% chance of deletion and were limited to 4% of the permutations with alignment scores 1.95-14.93 times worse than the observed scores, so we expect false positives to be easy to detect. In case different putative circular sequences are obtained with different k-mer sizes, we suggest using the contiguity coverage and alignment scores to choose the optimal circularization. For additional details on these experiments, see supplemental online material (Table S1). of each mitogenome. As in other vertebrates, the heavy strand encodes most mitochondrial genes, except for eight tRNA genes (mttp, mt-tq, mt-ta, mt-tn, mt-tc, mt-ty, mt-ts2, and mt-te) and mt-nd6

| DISCUSSION
The four new mitogenomes presented here represent the first complete mtDNA sequences for each of the four genera, Hyloxalus, Melanophryniscus, Phyllobates, and Scaphiopus. They also represent the first complete mitogenome of Scaphiopodidae and important clades inside Bufonidae and Dendrobatidae. We expect that the increased data will help consolidate current understanding of the genome rearrangements previously proposed for these clades and corroborated here. We also anticipate that the new mitogenomes will be helpful for researchers working on mitochondrial DNA data from frogs of these three families. Moreover, these data help reduce the gaps in genome sampling of nonmodel organisms, which ultimately will collaborate to the reduction in disparities in biological understanding (Richards, 2015). with what has been proposed as a modification of gene order CR,and mttf) in the Neobatrachia lineage (Sumida et al., 2001; also see discussion in Xia et al., 2014). The consistency of our findings with the specialized literature is important as it suggests that the methods applied here produce reliable results.
The circules.py program takes a sequence in FASTA format and some parameters such as k-mer size, expected sequence length, and a value of tolerance for length variation. The algorithm proceeds to find all duplicated k-mers at a certain distance from each other that fits the specified range of sequence length. The circule.py program reports all putative circular sequences found this way. Depending on the particular sequence, the program may produce ambiguous results (i.e., different putative sequences supported by some duplicated k-mers).
In that case, the user may use the expected sequence length and number of k-mer duplicates to choose among competing hypotheses. The algorithm of circules.py resembles the first part of the AWA pipeline with two relevant differences: (1) AWA takes a minimum expected sequence length instead of an anticipated sequence length with a variation threshold (i.e., AWA receives only the minimum sequence); (2) AWA reports the most extended putative circular sequence without any information on the duplicated k-mers. The AWA pipeline flips the putative circular sequence and uses Bowtie2 to remap the reads to it.
Instead of relying on the amount of k-mer duplications, and the user rely on read coverage, read contiguity, and alignment scores to infer circularization (for details, see Section 2.4 and Figure 1).
To the best of our knowledge, the AWA pipeline is the first to introduce metrics based on the original short read data to infer sequence circularity of the assembled sequences. It also introduces an original metric, the sequence contiguity. This allowed us to infer that sequences were contiguous on the basis of both overlapping k-mers on the scaffolds and high-quality reads mapped against the putative mitogenome with an average alignment score lower than −2.9. The permutation tests provide further support to our automated approach to infer sequence circularity. These tests found only 2% false positives in all iterations. As false positives had an overall alignment score 1.95-14.93 times worse than the best scores, authors can use poor alignment scores (−3 or lower) as indications that the sequence should be reviewed and curated manually.

| CONCLUSIONS
In this study, we present the first complete mitogenome of the family Scaphiopodidae and the genera Hyloxalus, Melanophryniscus, and Phyllobates. This increases in 1.68%, 5.26%, and 3.70% the number of anuran species, genera, and families, respectively, for which complete mitogenome sequences are known. Our approach for testing the completeness of circular DNA assemblies (presented here as a Python package named AWA) is time-efficient and not computationally intensive. The test for mitogenome completeness can be carried out within minutes on a standard personal computer even when the file is large (i.e., 50-100 GB), and it both enables reproducibility of the tests of completeness and minimizes human error.
This procedure can also be applied to other circular genomes (e.g., chloroplasts, plasmids, covalently closed circular DNA [cccDNA] from viruses, and circular bacterial chromosomes), although applications of our strategy to nonmitochondrial sequences will be discussed in detail elsewhere.

ACKNOWLEDGMENTS
We would like to thank Regiane Lima for comments that inspired the simplification of parts of our algorithm, ultimately leading to great gains regarding speed and efficiency. We also thank Tatiana