Impact of chimera-less long reads on metagenomics of human gut viromes treated with 1 multiple displacement amplification 2

26 Background: The ecological and biological features of the indigenous phage community 27 (virome) in the human gut microbiome are poorly understood, possibly due to many 28 fragmented contigs and fewer complete genomes based on conventional short-read 29 metagenomics. Long-read sequencing technologies have attracted attention as an 30 alternative approach to reconstruct long and accurate contigs from microbial 31 communities. However, the impact of long-read metagenomics on human gut virome 32 analysis has not been well evaluated. 33 34 Results: Here we present chimera-less PacBio long-read metagenomics of multiple 35 displacement amplification (MDA)-treated human gut virome DNA. The method 36 included the development of a novel bioinformatics tool, SACRA (Split Amplified 37 Chimeric Read Algorithm), which efficiently detects and splits numerous chimeric reads 38 in PacBio reads from the MDA-treated virome samples. SACRA treatment of PacBio 39 reads from five samples markedly reduced the average chimera ratio from 72 to 1.5%, 40 generating chimera-less PacBio reads with an average read-length of 1.8 kb. De novo 41 assembly of the chimera-less long reads generated contigs with an average N50 length of 42 11.1 kb, whereas those of MiSeq short reads from the same samples were 0.7 kb, 43 dramatically improving contig extension. Alignment of both contig sets generated 378 44 high-quality merged contigs (MCs) composed of the minimum scaffolds of 434 MiSeq and 45 637 PacBio contigs, respectively, and also identified numerous MiSeq short fragmented 46 contigs ≤500 bp additionally aligned to MCs, which possibly originated from a small 47 fraction of MiSeq chimeric reads. The alignment also revealed that fragmentations of the 48 scaffolded MiSeq contigs were caused primarily by genomic complexity of the community, 49 including local repeats, hypervariable regions, and highly conserved sequences in and between the phage genomes. We identified 142 complete and near-complete phage 51 genomes including 108 novel genomes, varying from 5 to 185 kb in length, the majority 52 of which were predicted to be Microviridae phages including several variants with 53 homologous but distinct genomes, which were fragmented in MiSeq contigs. 54 55 Conclusions: Long-read metagenomics coupled with SACRA provides an improved 56 method to reconstruct accurate and extended phage genomes from MDA-treated virome 57 samples of the human gut, and potentially from other environmental virome samples. 58 59

9 contigs at the local repeat region in SAL contigs. Type 2 fragmentations were observed for 200 FMS contigs aligned to multiple SAL contigs homologous but distinct, and included contigs 201 with mosaic sequences consisting of the highly similar sequences between the homologous 202 SAL contigs. Type 3 fragmentations of FMS contigs occurred mostly near the hypervariable 203 loci containing many single nucleotide polymorphic sites [20] in SAL contigs. Type 4 204 fragmentations were observed only in SAL contigs, mostly at the relatively low read-depth 205 region in FMS contigs. Among the FMS fragmentations, type 2 fragmented contigs were most 206 frequent ( Fig. 5c and Table S11). 207 208

Identification of complete and near-complete contigs 209
From the 324 non-redundant MCs, we identified 159 MCs including 17 FMS-specific contigs 210 as complete contigs generating circular contigs (CCs), a hallmark of the full-length genome 211 [22]. For determination of the near-complete linear contigs (ncLCs), we developed and used 212 two parameters to assess the completeness of contigs (Discussion). One was the intra-chimera 213 rate of SAL contigs, which was significantly reduced from ~85% to ~50% as the contig length 214 was shortened to 80%, 50%, and 20% of the complete λ genome (100%) (Suppl. Fig. 8). We 215 set the cut-off value of the intra-chimera rate to 80% to screen the LCs/MCs (≥10 kb) and 216 obtained 95 LCs with an intra-chimera rate ≥80%. Of them, 13 were identified as ncLCs with 217 the second parameter that the total length of either shorter SAL or FMS contigs aligned to the 218 MC was ≥97% of that of the MC, which was consistently observed for all complete CCs 219 composed of both SAL and FMS contigs. Overall, we identified a total of 172 complete and 220 near-complete contigs in the five samples (Tables S12 and S13). 221 222

Characterization of complete and near-complete contigs 223
The 172 contigs were characterized by phage classification assessments such as VirSorter [32], 224 gene annotation by pVOG [33], and a similarity search with the publicly available databases. 225 The results showed that 142 contigs (129 CCs and 13 ncLCs) were classified as phage with the 226 VirSorter category 1, 2, or 3 and with at least one pVOG (Table S12). Other questionable 227 contigs negative for the VirSorter assessment or with no pVOG, including nine matched with 228 known plasmids, were excluded from further analysis ( Table S13). 229 Of the 142 phage contigs, 108 (101 CCs and seven ncLCs) were likely novel phage 230 genomes because of a lack of similarity with the known genomes, whereas 34 (28 CCs and six 231 ncLCs) had high sequence similarities of ≥88.9% and almost identical lengths with the known 232 full-length phage genomes. Of the 34 contigs, seven had similarity with phages of the family 233 Microviridae and 11 had similarity with those of the crAss-like phages, but the known phages 234 matched with other 16 contigs lacked taxonomic information ( Table S12). The pVOG-based 235 family-level taxonomy prediction of the novel and taxon-unknown phage contigs revealed that 236 77 contigs were possibly assigned to Microviridae, 27 to Siphoviridae, 12 to Myoviridae, six 237 to Podoviridae, one to Inoviridae, and one unassigned (Table S14). In total, we identified six 238 families including 84 Microviridae, 27 Siphoviridae, 12 Myoviridae, 11 crAss-like, six 239 Podoviridae, and one Inoviridae in the samples (Suppl. Fig. 9). 240 We identified a total of 4,675 open reading frames (ORFs) as putative genes in the 142 241 phage contigs including 1,077 ORFs matched with pVOGs ( Table S12). The average number 242 of ORFs on contigs was 32.9 per phage contig, 8.8 per SAL contig, and 1.5 per FMS contig, 243 respectively. The median ORF length in the 142 phage contigs was 150 amino acids (aa), which 244 was longer than 76 aa in FMS contigs, 110 aa in SAL contigs, and 132 aa in the reference 245 phage genomes (Fig. 6). The present data revealed numerous chimeric reads in PacBio reads from the MDA-treated 249 human gut virome samples (Fig. 1). The high chimera rate was also observed in PacBio reads 250 from MDA-treated DNA of non-metagenomic samples, in which Pacasus improved the 251 mapping rate by reducing the chimeric reads [29]. Nevertheless, Pacasus inefficiently reduced 252 the chimeric ratio from ~72% to only ~34% in our PacBio reads, which substantially hampered 253 reconstruction of the complete spike-in λ genome ( Fig. 1 and Suppl. Fig. 6). We found that 254 Pacasus-insensitive NPCRs were composed of two different sequences, differing from the 255 Pacasus-sensitive reads with palindromic sequences (Suppl. Fig. 3a). The mechanism for 256 NPCR formation by MDA might involve the template switching of DNA extension by 257 polymerases to annealable sequences spatially close but sequentially distant [30]. Because the 258 MDA-mediated chimera formation was highly reproducible in terms of the position and 259 frequency in the genomes (Suppl. Fig. 3b), it might not be largely influenced by other 260 coexisting genomic DNA under the present MDA conditions, which might also be supported 261 by the higher rate of intra-NPCRs than inter-NPCRs (Suppl. Fig. 8). As the estimated chimera 262 frequency was similar between the samples (Suppl. Fig. 2, Table S4), the higher chimera ratio 263 in PacBio than FMS reads is explained by its longer read-length than that of FMSs. 264 SACRA was developed to correct both Pacasus-sensitive and NPCRs with high 265 efficiency. SACRA included simple computational steps, pairwise local alignment of PacBio 266 reads to construct ARCs, PC ratio calculation, PC ratio-based selection, and splitting of 267 chimeric ARCs (Fig. 3), in which the PC ratio was most crucial. In fact, SACRA treatment 268 with the optimized mPC ratio, which maximized the sensitivity and specificity for 269 distinguishing the two ARCs (Fig. 2), dramatically reduced the average chimera ratio to ~1.5% 270 ( Fig. 1 and Table S3), recovering the complete λ genome in all the samples, whereas no 271 complete λ genome was recovered from SALs with an mPC ratio ≥0 in four samples (Suppl. 272 Fig. 6). Of note, the overall SAL assembly was improved by SACRA treatment with the 273 optimized mPC ratio in each sample (Fig. 4), suggesting that the optimized mPC ratio 274 determined from reads aligned to the spike-in λ genome is effective for reads from other phage 275 genomes in the samples as well. 276 De novo assembly of chimera-less SALs markedly increased the N50 and contigs with 277 a length >5 kb, whereas the contig number was much less than that of FMSs ( Fig. 4 and Table  278 S7), largely due to differences in read numbers between them, where were ~7-fold higher in 279 FMSs than SALs in this study (Table S6). For the alignment of FMS and SAL contigs, we 280 found numerous short FMS contigs (mostly ≤500 bp) aligned to the MCs, in addition to the 281 minimum scaffolded FMS contigs ( Table S8). Assembly of the FMSs excluding chimeric 282 reads on the λ genome effectively suppressed the generation of these short contigs (Table S9), 283 suggesting that the short contigs originated from a small fraction of FMS chimeric reads, which 284 could not be efficiently removed prior to assembly in the actual samples without the reference 285

genomes. 286
Typical FMS contig fragmentations were caused by local repeats (type 1) and 287 hypervariable regions (type 3) [20] in individual phage genomes and by the highly conserved 288 regions between multiple homologous phage genomes in the community (type 2) (Suppl.  Table S11), suggesting that FMS contig fragmentation largely depended on the 290 community's genomic complexity, and the most frequent type 2 fragmentation for FMS contigs 291 was difficult to identify without SAL contigs. However, SAL contig fragmentations (type 4) 292 occurred mostly at the low read-depth regions in FMS contigs, which might be caused by the 293 biased MDA [34]. Although type 4 fragmentation was more frequent than other types (Table 294 S11), this was probably specific and limited to this study due to much fewer SALs than FMSs, 295 as described, as well as undetected SAL contigs corresponding to the relatively low abundance 296 of FMS-specific contigs (Table S6). In other words, SAL contig fragmentations will be 297 decreased by increasing the SALs per sample, whereas FMS contig fragmentations could be 298 marginally improved even when the reads increase. 299 It is challenging to identify complete LCs without TDRs because they have no 300 characteristic structural feature of the full-length genome, unlike the linear genomes with TDRs 301 [22]. In this study, we therefore defined ncLCs without TDRs as those containing SAL contigs 302 with a ≥80% intra-chimera rate (Suppl. Fig. 8) and with a ≥97% length coverage of either 303 shorter SAL or FMS contigs in the MC. Using these two parameters, we conservatively 304 identified 13 ncLCs/MCs, of which, six were aligned to the known full-length phage genomes 305 with high similarity (Table S12). Because all of the six ncLCs aligned to the known phage 306 genomes were captured by the intra-chimera rate parameter, the intra-chimera rate of SAL 307 contigs could be used to assess the completeness of non-TDR linear genomes in MDA-treated 308 virome samples. 309 The 142 complete and near-complete phage contigs identified in this study were 310 characterized by the majority (71 novel and 13 known phages) of relatively small genomes 311 (from 4.8 to 7 kb in length) belonging to the Microviridae family with ssDNA genomes [35] 312 (Suppl. Fig. 9 and Table S12). Of the 84 complete Microviridae genomes, 18 (13 novel) were 313 recovered as type 2 fragmented contigs including mosaic sequences in the FMS assembly 314 (Table S11 and Suppl. Fig. 7), suggesting the existence of some Microviridae phages with 315 homologous but distinct genomes in the human gut, which might be hardly reconstructed from 316 the conventional short reads. However, the abundance of these small circular ssDNA phages 317 in the gut cannot be precisely evaluated from data of the MDA-treated samples because they 318 were preferentially amplified by MDA as described previously [15]. Only 13 CCs, all of which 319 were novel Microviridae phages, were shared with multiple samples in this study (Table S12), 320 suggesting the high inter-individual variability in the gut phages, except for some phages 321 prevalent in the human gut. 322 Of the 129 complete CCs, some might have linear genomes with TDRs like 323 crAssphages, because the conversion of linear to circular DNA might also occur in MDA with 324 a mechanism similar to template switching in PCR-extension between the two unconnected 325 TDRs [22]. Metagenomic sequencing of non-MDA virome DNA samples will allow to 326 determine whether the contigs have circular or linear genomes with TDRs and to quantify the 327 abundance of phage genomes including ssDNA in the human gut [15,22]. The increased 328 number and length of ORFs in SAL contigs was remarkably (Fig. 6), providing more 329 information on gene organization and gene products in the phage genome than the conventional 330 short-read metagenomics, which will facilitate the accurate prediction of phage taxonomy 331 any 5′ and 3′ low quality bases (<20 QV) bases were also trimmed. Reads with a mean QV <20 390 or those with <100 bp were removed. The quality filtering of reads was performed using First, ECLs were aligned to each other using LAST to generate ARCs. Second, the PC ratio 465 was calculated to divide ARCs with both PARs and CARs into those with greater than the 466 optimized mPC ratio and less than the optimized mPC ratio. Third, ARCs with greater than the 467 optimized mPC ratio and those with only PARs were subjected to SACRA to split them at the 468 chimeric position. Finally, SACRA-split reads, unsplit reads from ARCs with less than the 469 optimized mPC ratio and with only CARs, and singletons were combined and assembled 470 ( Figure 3). 471

De novo assembly of PacBio and MiSeq reads 473
PALs and SALs were assembled using canu (v1. Chimeric paired reads generated from Pacasus-insensitive chimeric reads (NPCRs) 496 split by SACRA were mapped to contigs using minimap2 with the map-pb option and ≥80% 497 identity and coverage. The paired reads mapped to the same contig were assigned as intra-498 NPCRs, and those mapped to different contigs were assigned as inter-NPCRs. The intra-499 chimera rate of a contig was obtained by dividing the number of intra-NPCRs by the total 500 number of NPCRs in the contig. The fragmented λ contigs were generated by randomly 501 dividing the complete λ sequence into 10 subsequences with 20%, 50%, and 80% coverage of 502 the complete length. 503 504

Mapping of MiSeq and PacBio reads to contigs 505
FMSs were used for quantification of the relative abundance of contigs by mapping them to 506 contigs using Bowtie2 with ≥95% identity in each sample, and the relative abundance was 507 calculated by dividing the number of mapped reads by the contig size. Mapping SALs to 508 contigs was individually conducted using minimap2 with the map-pb option [42] and ≥95% 509 identity and ≥85% length coverage. Of the alignments containing PacBio reads that aligned to 510 multiple contigs, the alignment with the highest identity was selected as the top hit alignment. 511 The mapping results were visualized with IGV (v2.7.2) [44].