Choice of assembly software has a critical impact on virome characterisation

Background The viral component of microbial communities plays a vital role in driving bacterial diversity, facilitating nutrient turnover and shaping community composition. Despite their importance, the vast majority of viral sequences are poorly annotated and share little or no homology to reference databases. As a result, investigation of the viral metagenome (virome) relies heavily on de novo assembly of short sequencing reads to recover compositional and functional information. Metagenomic assembly is particularly challenging for virome data, often resulting in fragmented assemblies and poor recovery of viral community members. Despite the essential role of assembly in virome analysis and difficulties posed by these data, current assembly comparisons have been limited to subsections of virome studies or bacterial datasets. Design This study presents the most comprehensive virome assembly comparison to date, featuring 16 metagenomic assembly approaches which have featured in human virome studies. Assemblers were assessed using four independent virome datasets, namely, simulated reads, two mock communities, viromes spiked with a known phage and human gut viromes. Results Assembly performance varied significantly across all test datasets, with SPAdes (meta) performing consistently well. Performance of MIRA and VICUNA varied, highlighting the importance of using a range of datasets when comparing assembly programs. It was also found that while some assemblers addressed the challenges of virome data better than others, all assemblers had limitations. Low read coverage and genomic repeats resulted in assemblies with poor genome recovery, high degrees of fragmentation and low-accuracy contigs across all assemblers. These limitations must be considered when setting thresholds for downstream analysis and when drawing conclusions from virome data. Electronic supplementary material The online version of this article (10.1186/s40168-019-0626-5) contains supplementary material, which is available to authorized users.


Background 40
The rapid evolution of metagenomics and high throughput sequencing technologies has revolutionised 41 the study of microbial communities, giving new insights into the role and identity of the uncultivated 42 microbes which account for the majority of metagenomic sequences (Solden, Lloyd et al. 2016). uncharacterised viral sequences "viral dark matter"; (Roux, Hallam et al. 2015), and the lack of a 54 universal marker gene, virome studies rely on database independent analysis methods and depend 55 heavily on de novo assembly to resolve viral genomes from metagenomic sequencing reads. 56 Metagenomic assemblers typically use de Bruijn graph (DBG) approaches to address the 57 complexity and size of metagenomic datasets in an accurate and efficient manner. Microbial 58 metagenomes pose significant challenges to DBG assembly when compared to single genome 59 assemblies often complicating the DBG and leading to fragmentation and/or misassembly (Olson,  A wide array of metagenomic assembly programs have been employed, each addressing 66 aspects of metagenomic challenges to varying degrees. However, many of these programs have been 67 designed and optimised for bacterial metagenomes, which share many assembly challenges of 68 viromes but to a lesser degree. Virome data is characterised by high proportions of repeat regions 69 within viral genomes (Minot,  Here we expand upon previous studies and present a detailed investigation of assembly 95 software for virome analysis which compares all those previously used in human virome studies to 96 date, as well as other popular or more recently published assemblers (Table 1). We compare assembly 97 efficacy and accuracy using simulated viromes, mock viral communities and human gut viromes 98 spiked with a known exogenous bacteriophage. Furthermore we confirm these findings using human 99 virome data from published datasets and assess computational parameters such as runtime and RAM 100 usage. We also investigate in detail the impact of sequencing coverage and genomic repeats on 101 assembly performance and highlight important considerations for future virome studies.

Simulated virome dataset 106
The ability to accurately recover each of the 572 members of a published simulated community 107 (Hesse, van Heusden et al. 2017) and the degree of fragmentation, was assessed by aligning the 108 resulting contigs from each assembler to the reference genomes (Fig. 1). MetaVelvet was not included 109 in this analysis as it failed to reach completion after seven days. Approximately half of the genomes in 110 the community featured an average recovered genome fraction less than 75% and exhibited higher 111 degrees of fragmentation (>10 contigs per genome on average) across all assemblers. For 87 of the 112 572 genomes there was an average recovered genome fraction of less than 20% across all assemblers 113 (the low recovered genome fraction of VICUNA was excluded as an outlier). Of these genomes, 84 114 were present at low abundance (lowest 40% of all abundances normalised to genome length). The 115 remaining three genomes were present at higher normalised abundances (50 -80 th percentile) but 116 featured the some of the highest proportions of genomic repeats (70 th -90 th percentile). greater than 90% of the target genome in one single contig was compared (Fig 2). SPAdes (default) 160 performed best, recovering 210, SPAdes (meta), SPAdes (single cell + careful), CLC, and SPAdes 161 (single cell) each recovered 179, 168, 162 and 160 genomes respectively. 162 The accuracy of assemblies was assessed by calculating the average count of indels, 163 mismatches, and misassemblies per 100kb across all genomes. These counts were normalised to the 164 number of genomes each assembler recovered with a minimum genome fraction of 50%. These were 165 ranked according to their performance in all three metrics (Supplementary Table 4  The rate of false positive (no alignment to reference genomes) and false negative (recovered 171 genome fraction of 0%) contigs assembled allowed for the determination of sensitivity. A number of 172 assemblers had a sensitivity greater than 97%, however each returned greater than 7,000 contigs, 173 inferring a high degree of fragmentation (Table 2). MIRA assembled (partial or complete) 559 of the 174 genomes with a false positive count of just four. However, this was achieved from more than 27,000 175 contigs. ABySS (both k-mer sizes), Geneious, Ray Meta and Velvet returned very few false positives 176 but failed to detect many of the genomes present. SPAdes (meta) performed best with 558 of the 572 177 genomes detected and only five false positives resulting from 7419 contigs.  Two mock viral communities were used to investigate the impact of high and low abundance ssDNA 183 viruses on assembly performance. Mock A (Table 3a) contained 12 viral genomes, 10 of which were 184 at equal abundance (9.82% of the total community) and two ssDNA genomes (NC_001330 and 185 NC_001422) at low abundance (0.92%). Analysis of this community showed that although some ABySS (both k-mer sizes) and SOAPdenovo2 had the lowest sensitivity. Despite being a relatively 198 simple community consisting of 12 members, not all assemblers were able to recover all members 199 (Supplementary Table 5   SPAdes (meta) did not feature any. The six best assemblies of the Q33 genome and the genome itself 251 are syntenic (although occasionally on the reverse strand) and the start and end point were not 252 conserved (Fig .3). 253

RAM usage via four published healthy human gut viromes (Manrique, Bolduc et al. 2016) and various 256
sequencing depths . It must be noted that all assembly tasks were allocated five threads, however 257 specifying the number of threads did not change the number of threads used by certain programs. 258 MetaVelvet was not included in this analysis as it failed to reach completion after running for seven 259 days. CLC and Geneious were performed on a desktop computer and therefore excluded from time 260 and RAM analysis. Run time is dependent upon the number of reads and this is largely linear in scale 261 with more reads leading to an increased assembly time (Fig. 4a). MIRA and Vicuna (Fig. 4a insert) 262 were the slowest with MIRA taking over 15 times longer than the other software to assemble 3.5 263 million reads. SOAPdenovo2 had the shortest completion time followed by IBDA UD and Velvet. 264

Most assemblers were consistent across samples (observed via error bars) with the exception of 265
MIRA and Ray Meta. MIRA, Vicuna and Velvet (Fig. 4b insert) had the highest max RAM usage 266 while the lowest was Ray Meta, IDBA UD and SPAdes (meta) (Fig. 4b). The majority of assemblers 267 observed a linear scale pattern similar to that of run time. 268

269
For both the N50 (Fig. 4c) and the longest contig length (Fig. 4d) were improved using Geneious, resolving greater genome fractions with fewer contigs (despite 348 Geneious recovering a lower genome fraction of the Simulated dataset overall). It is possible that 349 using these approaches could address issues facing each assembler, i.e. combine the assemblies of 350 SPAdes (meta) which performs well across all 4 datasets but struggles to recover low abundant 351 genomes, with MIRA assemblies which are less affected by low abundance but has difficulty 352 resolving genomes of higher abundance. Comparison of multi-assembler approaches and 353 combinations of various assemblers was not within the scope of this study, but should not be ruled out 354 as a potential method of improving virome assembly in cases where composition could be assessed 355 and obvious assembly challenges were known to be present. 356 Across all analysis methods in this study, SPAdes (meta) performed consistently well and 357 would be our recommendation. It performed best in the Simulated data based on false positives, true 358 positives and false negatives, best assembled the Q33 genome (recovery, fragmentation, 359 misassemblies and genome size) and performed well with both mock communities in recovering all 360 members accurately in one or two contigs. SPAdes (meta) RAM usage was low and did not increase 361 to the same degree as other assemblers with increasing sequencing depth. This recommendation is in 362 agreement with previous comparisons (Vollmers, Wiegand et al. 2017) which also suggested using 363 SPAdes (meta) due to its ability to accurately assemble members of bacterial metagenomes. SPAdes 364 (meta) is less able to accurately reconstruct micro-diversity as it generates a consensus assembly of 365 "strain-contigs" in a metagenome, which means it is better equipped to address the high mutation 366 rates observed in virome data (Nurk, Meleshko et al. 2017). This recommendation is also concurrent 367 with a previous study (Roux,  is of particular importance. The mock A and B datasets were used to assess the impact of 384 amplification bias on assembly performance. All ssDNA assemblies featured an equal minimum 385 number of mismatches across both Mock A and B. This may be caused by challenges in the genomes 386 themselves hampering accurate assembly, but is more likely to reflect strain variation between genome sequence featured in the original publication and the genome of the phage used in the 388 community itself. 389 The Q33 spiked virome consisted of pooled reads from three healthy human faecal samples, 390 each of which having been spiked with 10 7 PFU ml -1 of lactococcal phage Q33 prior to virome 391 extraction. This allowed for assembly comparison of one abundant member of a challenging viral 392 community. Despite the high relative abundance of the Q33 genome, only 6 assemblers could recover 393 over 90% of the genome in a single contig, of these SPAdes (meta) and MEGAHIT reconstructed the 394 Q33 genome accurately without the introduction foreign or chimeric DNA. It was also noted that the 395 genome synteny was conserved across these six assemblies. This may reflect circularization of the 396 linear Q33 genome during DNA extraction as the presence of cos sites has been previously predicted 397 (Mahony, Martel et al. 2013). 398 The longest contigs of each assembler were only detected at the highest sequencing depths 399 and varied across assemblers, which may indicate that high coverage is necessary to recover the 400 largest viral genomes in a community. However, it is also possible that these long contigs may reflect 401 misassemblies and duplication events caused by read errors at high sequencing depths which must be 402 2016). As a result, it is likely that true diversity of viral metagenomes is not being accurately captured 430 using current virome extraction methods. However, as these procedures move away from steps known 431 to introduce bias, greater diversity will be observed. In this sense, the level of complexity of the Q33 432 dataset, which pooled three independent human viromes, provides a useful testbed for metagenomic 433 assemblers in future virome studies as extraction methods improve. Additionally, Q33 was not present 434 in the viromes prior to spiking, assemblers were not challenged by the presence of native strain 435 variations of Q33 genome. In this study, assemblers were compared without individual optimisation 436 to the specific dataset. Feasibility dictates that, this "straight out of the box" approach to assembly is 437 used by almost all metagenomic assembly comparisons. Additionally, as the true composition of 438 metagenomes is unknown, any impact of parameter optimisation must be estimated from general 439 assembly statistics such as N50 and longest contig which have been highlighted to be of limited   Genome fraction "is the total number of aligned bases in the reference, divided by the genome size. A 532 base in the reference genome is counted as aligned if there is at least one contig with at least one 533 alignment to this base. Contigs from repeat regions may map to multiple places, and thus may be 534 counted multiple times in this quantity." 535 N50 "is the contig length such that using longer or equal length contigs produces half (50%) of the 536 bases of the assembly. Usually there is no value that produces exactly 50%, so the technical definition 537 is the maximum length x such that using contigs of length at least x accounts for at least 50% of the 538 total assembly length." 539 Number of contigs "is the total number of contigs in the assembly that have size greater than or equal 540 to 0 bp." 541 Misassemblies "is the number of positions in the assembled contigs where the left flanking sequence 542 aligns over 1 kbp away from the right flanking sequence on the reference (relocation) or they overlap on more than 1 kbp (relocation) or flanking sequences align on different strands (inversion) or 544 different chromosomes (translocation)." 545 Local misassemblies "A local misassembly has two or more distinct alignments covering the 546 breakpoint, the gap between left and right flanking sequences is less than 1 kbp and the left and right 547 flanking sequences both are on the same strand of the same chromosome of the reference genome." 548