SMARTdenovo: a de novo assembler using long noisy reads

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. It has also been widely used to study structural variants, phase haplotypes and more. Here, we introduce the assembler SMARTdenovo, a single-molecule sequencing (SMS) assembler that follows the overlap-layout-consensus (OLC) paradigm. SMARTdenovo (RRID: SCR_017622) was designed to be a rapid assembler, which, unlike contemporaneous SMS assemblers, does not require highly accurate raw reads for error correction. It has performed well in the evaluation of congeneric assemblers and has been successfully users for various assembly projects. It is compatible with Canu for assembling high-quality genomes, and several of the assembly strategies in this program have been incorporated into subsequent popular assemblers. The assembler has been in use since 2015; here we provide information on the development of SMARTdenovo and how to implement its algorithms into current projects.


INTRODUCTION
The development of high-throughput sequencing provides the means to deliver fast, inexpensive, and accurate information for assembling whole genomes. As a result, there has been rapid growth in the number of whole-genome sequencing projects [1][2][3].
Single-molecule sequencing (SMS) technologies, such as Pacific Biosciences (PacBio) and Oxford Nanopore, which generate sequencing reads of >20 kb in length, are now widely used in whole genome projects. These long reads are advantageous because they span polymorphic regions, repeats, and transposable elements, and because they provide long-range information for assemblies that are usually too complex to be resolved by short reads alone.
The huge demand for long-range DNA sequencing and mapping technologies has catalysed a renaissance of the development of high-quality SMS assemblers, such as PBcR [4,5] [11], Shasta [12], and ABruijn [13].
In fact, highly available SMS assemblers have always been essential for improving the quality of genome assemblies. SMARTdenovo (RRID: SCR_017622) is a long-read SMS assembler that follows the overlap-layout-consensus (OLC) paradigm. It assembles genomes following four steps: overlapping, trimming, layout, and consensus. The source code for SMARTdenovo was released in GitHub in 2015 [14]. Assessment by others has shown that it performs well compared with other congeneric assemblers [15], and it has been widely used for generating highly accurate contigs in many genome assemblies [16][17][18]. For datasets from both PacBio and Oxford Nanopore, such as 20-30× 2D reads of different varieties of yeast, SMARTdenovo assembled more accurate and more highly contiguous sequences than other assemblers [15,19]. SMARTdenovo was also used successfully for datasets from the wild tomato Solanum pennellii (∼1.2 Gb) and Sorghum bicolor (∼732 Mb) using Oxford Nanopore reads [18,20], and for the long-read datasets for Taraxacum kok-saghyz (∼1.04 Gb) and the woody plant, Rhizophora apiculata (∼274 Mb) using PacBio RSII reads [16,21]. Here, we explain how we developed SMARTdenovo and provide use cases to show its ability.

IMPLEMENTATION The SMARTdenovo algorithm
SMARTdenovo uses four steps for assembly: overlapping, trimming, layout, and consensus. We used homopolymer-compressed (HPC) k-mers for seed-indexing and identifying collinear seeds. HPCs have been widely adopted in Minimap2, Wtdbg2 and Shasta [10, 12,22]. We then trimmed low-quality regions and chimeric reads based on the overlapping reads. We applied the Best-Overlap-Graph [23] to generate the layout of the reads and the PBDAG-Con algorithm [24] to generate a consensus.

Overlapper
The algorithm for alignment follows a typical seed-chain-align procedure that is used by most full-genome aligners. In this step, we constructed each read by contracting homopolymer reads to a single base, called HPC strings. An HPC k-mer, a 1000-bp substring of an HPC string, was treated as a seed ("wtzmo -k 16 by default"). For HPC-based k-mer-indexing, we scanned all the reads and counted k-mers in a hashtable with a 64-bit key to store a k-mer, and a 64-bit value to store its count. If a k-mer was present more than 500 times, it was filtered out ("wtzmo -K 500"). A seed array was also created and filled with the "read_id" and the orientation of the remaining seeds. To manage the cost of memory for k-mer-indexing, we used two different parameters: (1) we kept only the index with smaller values between the k-mer and its reverse complement; (2) we randomly selected k-mers according to the hash code (one quarter was set as the default). All queried k-mers were indexed against the hashtable and seed array to identify candidate reads. We sorted the seeds by the "read_id" and "strand" and calculated the coverage length of the overlaps. If the coverage was longer than 300 bp ("wtzmo -d 300"), the candidate was kept. The top 500 candidates were chosen for each query ("wtzmo -A 500").
To refine the collinearity relationship between query and candidate, we further built a similar but shorter HPC-based index called a "z-mer" on queries for seed chaining. Each z-mer was recorded with its offset and strand. Pairs of matched seeds between candidate and query were obtained. However, because the high rate of insertions and deletions (indels) made the distance between two nearby z-mers highly variable, wtzmo was used to identify the synteny between query and candidate using sliding windows ("wtzmo -y 800") on query instead of using the whole overlapping region. We first filtered z-mer windows using a minimum match length of 200 bp. We then developed a scoring algorithm to filter excessive matches. The adjacent matched pairs of seeds (p i , p i+1 ) were scored based on the relationship of z-mers on candidates: S i+1 = S i + L i+1 − Distance i−(i+1) , S, L, D represented the sort of seeds, length of seeds, and the distance between the adjacent seeds, respectively. If S i+1 was larger than L i+1 , pairs of p i , p i+1 were considered to be a syntenic block, and the block would be extended until S j was smaller than L j . At that point, the block ended, S j recovered to L j and the next round of syntenic block identification began. The z-mer-block containing the highest value of S was the block chosen as the best syntenic overlap between the two windows. If the coverage was longer than 100 bp, the matched windows were retained. Seed windows were also scored using the same method as above for searching the best colinear pairs. Finally, candidates with the best colinear window-block that covered more than 300 bp were retained for pairwise alignment.
To speed calculation time and reduce unnecessary computation, we developed a weighting algorithm that negatively ranked a repetitive region based on its depth in the alignment process. "Wtzmo -q 100" was set so that the weights ranged between 1 and 0 if the depth ranged from 10-100. Large numbers of false candidates containing similar repeats were eliminated with queries. The pairwise alignment procedure based on the collinearity relationship was split into four steps: (1) first, the bases of the z-mers were decompressed and gaps were added between the matched z-mers; (2) global alignment was conducted between two adjacent z-mers within a z-mer window; (3) the two adjacent z-mer windows were aligned using the banded global Smith-Waterman algorithm, with a dynamic bandwidth that increased according to the length of the gaps ("wtzmo -w 50 ∼ 3200"); (4) the two ends were extended by global alignment with the band width set at 800 bp ( Figure 1).

Trimming
Wtclp trimmed or discarded reads to a maximum total length of the valid overlaps. It took one read as a reference, and tiled all reads containing overlaps. A functional model, "call_legal_overlaps_wtclp", calculated the length of valid overlaps. First, it clipped the ends that had high error rates. Then, it detected chimeras and trimmed them according to their depths. Structures containing partially aligned reads were called "spurs". We counted the depth of reads crossing the "spurs" as "m" and counted the number of reads with partial alignments as "n". If the m of a read was less than half of the average depth, or if n was larger than the average depth, or if n was larger than "m∕2", the read was considered chimeric and discarded. Otherwise, it was considered to be a sequencing error and the maximum region of reads was retained. Errors in the structure were corrected based on the graph. If a single read connected two subgraphs, it was considered a chimera. We then used wtclp to check for any alternative path formed by valid overlaps of tiled reads.

Layout
Wtlay was used to achieve the Best-Overlap-Graph [23] to generate a layout of reads following the OLC paradigm. In general, if an overlap was not end-to-end, leaving n (no greater than 100) bp unaligned, it would be treated as true ("wtlay -w 100"). Owing to the high indel rate, wtlay identified the best overlap with an alignment score ≥ 0.95, instead of picking out the longest one ("wtlay -r 0.95"). Using this process would keep bubbles from being merged, and instead find one appropriate path. The wtlay script also filtered out each unitig sharing more than 40% identity with another unitig to avoid islands. The output included uncorrected unitigs and all the parameters needed by the consensus caller.

Consensus
We used the wtcns command to implement the PBDAG-Con algorithm described in HGAP to generate consensus [24]. Because an alignment algorithm is integrated into wtcns, it

Assembling the genome of the fruit fly and evaluating accuracy of the assembly
We benchmarked SMARTdenovo against the SMS assemblers Flye, Canu, and Ra using the dataset of the fruit fly, Drosophila melanogaster, and calculated the accuracy of the assembly by aligning it with the reference genome. A total of 29.3 Gb PacBio reads from the National Center for Biotechnology Information Sequence Read Archive database (SRX499318) [25] was assembled with the command line "smartdenovo.pl -c 1 -t 16 reads.fa > wtasm.mak && make -f wtasm.mak". The length of the genome sequence was 146 Mb, with an N50 value of 11.59 Mb. We also tested three other SMS assemblers (Canu, Flye, Ra) using this dataset. SMARTdenovo was superior to Flye and Ra in both total length and contig N50, but was inferior to Canu ( Table 1).
The genome size of D. melanogaster is usually estimated to be 180 Mb. Of this, 60 Mb of the genome comprises centric heterochromatin, making it intractable for assembly [26].
Compared with the released reference [27], SMARTdenovo and Canu were able to create longer assemblies: ∼146.29 Mb and ∼161.97 Mb, respectively. SMARTdenovo and Canu performed better than the other two SMS assemblers, not only by creating longer assembly lengths, but also having higher coverage when aligned to the reference genome.

Assembling the genome of the wild tomato
We also compared the performance of SMARTdenovo with that of three other SMS assemblers: Flye, Canu, and Ra using the dataset for the wild tomato Solanum pennellii.
A total of 27.5 Gb Oxford Nanopore reads was downloaded from the European Nucleotide Archive database (PRJEB19787) [28]. A k-mer analysis of this dataset indicated that this accession of S. pennellii (LYC1722) has a genome size between 1 and 1.2 Gb [20]. We assembled a 30-fold Oxford Nanopore dataset and achieved an assembly of 902.96 Mb, with an N50 value of 339 kb ( to generate the assembly graph layout; and wtcns to calculate the consensus. Based on the results of tests on the wild tomato dataset, we found that SMARTdenovo was more memory-intensive than the other SMS assemblers, but its performance was comparable, and it was faster. SMARTdenovo has been successfully used to assemble data from various species such as plasmids [29], protists [17,30], fungi [19,31,32], microorganisms [33], and complex plants [18,20,21].
In addition to its solid performance, SMARTdenovo includes multiple algorithms that can be-and have been-useful for improving other programs. These algorithms have had a positive impact on popular SMS assemblers. For example, in developing SMARTdenovo, we introduced the first algorithm to use HPC-based k-mers, and this has now been incorporated into many other assemblers [10, 12,22]. SMARTdenovo has had a more extensive impact on our development of the assembler Wtdbg2, as it includes several of its algorithms for handling long reads, including those for indexing, seed chaining, trimming, consensus, and some of the data formation.
There are several other algorithms within SMARTdenovo that have not yet been taken advantage of for use in other programs. An example includes its weighting algorithm for handling repeat regions, which significantly improves both its speed and the accuracy of the alignment. At this point, no other long read assemblers have this feature.
SMARTdenovo has been available on GitHub since 2015, but its performance not only remains comparable with current assemblers, it also has several advantages as described.

DATA AVAILABILITY
A Code Ocean capsule to execute SMARTdenovo is available (Figure 2) [44]. Supporting data are available in the GigaScience GigaDB repository [45].