VIGA: a one-stop tool for eukaryotic virus identification and genome assembly from next-generation-sequencing data

Abstract Identification of viruses and further assembly of viral genomes from the next-generation-sequencing data are essential steps in virome studies. This study presented a one-stop tool named VIGA (available at https://github.com/viralInformatics/VIGA) for eukaryotic virus identification and genome assembly from NGS data. It was composed of four modules, namely, identification, taxonomic annotation, assembly and novel virus discovery, which integrated several third-party tools such as BLAST, Trinity, MetaCompass and RagTag. Evaluation on multiple simulated and real virome datasets showed that VIGA assembled more complete virus genomes than its competitors on both the metatranscriptomic and metagenomic data and performed well in assembling virus genomes at the strain level. Finally, VIGA was used to investigate the virome in metatranscriptomic data from the Human Microbiome Project and revealed different composition and positive rate of viromes in diseases of prediabetes, Crohn’s disease and ulcerative colitis. Overall, VIGA would help much in identification and characterization of viromes, especially the known viruses, in future studies.


INTRODUCTION
Viruses are ubiquitous in nature and have profound impacts on the health and diversity of all living organisms [1].However, most viruses remain unknown and unculturable due to the limitations of conventional isolation methods [2].In recent years, the rapid development of next-generation-sequencing (NGS) technologies has revolutionized the field of metagenomics and transcriptomics [3], enabling the analysis of nucleic acid sequences obtained directly from environmental or clinical samples.For example, the Human Microbiome Project (HMP) [4] has facilitated the generation of hundreds of reference microbial genomes, along with thousands of metagenomic and metatranscriptomic datasets from multiple parts of the human body.This has greatly increased the number of viral genomes and provided valuable information for studying viral evolution, diversity and epidemiology [5].
Identification of viruses from NGS data is the first step in virome studies.There are currently two kinds of methods for virus identification.The first is the homology-based methods, such as using tools of BLAST [6] or HMMER [7] to search for viruses that have sequence similarity to known viruses.This kind of methods is commonly used in identifying eukaryotic viruses that have been studied much.For example, Moustafa et al. [8] identified 19 viruses by whole genome sequencing of blood from 8240 individuals using BLAST.The advantage of these methods is that they are generally accurate with a low rate of false positives [9], but they may miss identification of novel viruses that have remote or little homology with known viruses [10].This problem becomes serious in discovery of phages that have huge genetic diversity.To address the problem, the other kind of methods has been developed for virus identification based on machine-learning methods, such as Virtifier [11], Seeker [12] or VirFinder [13].They can detect viruses with higher sensitivity than the homology-based methods and are generally used in identifying phages.For example, the project of Global Ocean Viromes 2.0 (GOV 2.0) [14] have identified 195 728 viral populations from the global ocean DNA virome dataset based on this kind of methods.Unfortunately, a high false-positive rate was observed for these methods [15] and there was also inconsistency between the prediction results by different methods [16].
Assembling virus genomes is also an essential step for further studies of the virome.Unfortunately, assembling large genomic fragments from short-read sequencing data is a formidable computational challenge [17].In the case of viruses, the presence of repetitive region [18], strain heterogeneity [19] and lowabundance viral populations [20] can pose significant difficulties for accurate assembly of virus genomes.Besides, the quality of the assembly may be compromised by errors in the sequencing data.To address these challenges, researchers have developed various methods for improving the accuracy and quality of the virus genome assembly that can be mainly grouped into two categories [21]: one is the reference-based assembly methods, such as MetaCompass [22] and VirGenA [23], and the other is the de novo assembly methods, such as Trinity [24] and Haploflow [25].The reference-based methods assembling genomes by taking a known genome as a guide.The advantage of this kind of methods is that they are generally more accurate than the de novo methods [22], but they are not suitable for genome assembly of viruses without reference genomes.On the contrary, the de novo methods can be applied to all viruses including novel viruses, although they generally require a deep sequencing depth and may not assemble complete genomes [26].
In this study, a one-stop tool named VIGA was developed for eukaryotic virus identification and genome assembly from NGS data.It integrated multiple assembly tools including both the reference-based and de novo ones and combines both functions of virus identification and genome assembly.Evaluation on multiple simulated and real virome datasets showed that VIGA could be used for assembling virus genomes and separating mixtures of virus strains from metagenomic and transcriptomic data.
To evaluate the performance of computational tools in assembling viral genomes at the strain level, two datasets were used.The first is the HIV dataset, which was created in Fritz's study by mixing three HIV strains with about 95% sequence identity in proportions 10:5:2 [25].The dataset contains simulated metagenomic sequencing reads with the length of 150 bp and the depth of 20 000 (available for download at https://frl.publisso.de/data/frl:6424451/).The other is the hepatitis B virus (HBV) dataset [30], which is the metagenomic sequencing of two clinical samples that were co-infected by two HBV strains with 89% sequence identity (accessions in NCBI SRA database: ERR3253398 and ERR3253399, accessions of viral genomic sequences in NCBI GenBank database: MK720628.1 and MK720631.1).
The metatranscriptomic datasets derived from the HMP (https://portal.hmpdacc.org/)[4] were used to illustrate the usage of VIGA.Only the paired-end sequencing samples were used as VIGA can only use the paired-end data.A total of 1321 samples were finally used.The meta information including the disease state, age and gender of samples was also obtained (available at https://github.com/viralInformatics/VIGA/blob/master/HMP_datasets.xlsx).

The workflow of VIGA
VIGA included four modules: identification, taxonomic annotation, assembly and novel virus discovery, which were described as follows.The identification module aimed to detect eukaryotic virus sequences from NGS data.Firstly, fastp (version 0.21.0)[31] was used to trim the universal adapter sequences from raw reads and filter reads with average base quality less than Q20.Secondly, the remaining clean reads were assembled into contigs using the Trinity program (version 2.1.1)[24] with default parameter settings.Thirdly, the contigs were queried against a library of virus protein sequences retrieved from the NCBI RefSeq database on 10 June 2020 using Diamond BLASTX (version 0.9.25.126)[32].The contigs with an E-value of less than 1E-5 to the best hit were labeled as hypothetical viral contigs.Fourthly, the hypothetical viral contigs were queried against the NR database (downloaded on 17 November 2020) with Diamond BLASTX to remove false positives.Those with the BLASTX best hit belonging to viruses were kept and considered as viral contigs.Finally, they were filtered based on host of the BLASTX best hit, and only those with the BLASTX best hit belonging to eukaryotic viruses were kept for further analysis.
The taxonomic annotation module was designed for taxonomic annotation of viral contigs identified above.The taxonomy information of viral contigs was obtained according to the amino acid identity (AAI) and coverage between the contig and the BLASTX best hit against the NR database.If the AAI and coverage were no less than 90% and 80%, respectively, the viral contig was considered to be the same virus species with the BLASTX best hit according to previous studies [33][34][35]; if they were no less than 70% and 60%, respectively, the viral contig was considered to be the same genus with the BLASTX best hit according to previous studies [36][37][38].For viral contigs that were annotated at the species level, the assembly module would be conducted; for those that were annotated at the genus level, the novel virus discovery module would be conducted.
The Assembly module assembled and quantified the viral genome for the virus species after the taxonomy annotation.Firstly, a library of virus reference genomes was built as follows: (i) genome sequences of all eukaryotic viruses were downloaded from the NCBI Genome database on 17 August 2022.(ii) Genome sequences of the same virus species were clustered using the MMseqs2 (version 13.45111) [39] at 100% level.The representative sequence in each cluster was added to the library of virus reference genomes.Secondly, all representative sequences of the virus species were taken as reference genomes and were used to assemble the virus genome using MetaCompass (version 2.0.0, parameter: '-l 100 -t 30') [22].Thirdly, RagTag (version 2.1.0,parameter: '-u -C') [40,41] was used to correct potential assembly errors in contigs based on the reference genomes.Then, MetaQUAST (version 5.0.2) [42] was used to evaluate the quality of assembled viral genomes and output the genome fraction (GF) of the assembled virus genome, which measures the genome completeness.Then, the assembled viral genomes were quantified using the FPKM (fragments per virtual kilobase per million sequenced reads) method according to Lee's study [43], which was listed as follows: where L v is the length of the assembled viral genome (bp), M is the number of reads assigned to the virus genome and N is the total number of clean reads.The genome coverage [44] and the depth coverage [45] of the assembled genome that measure the sequencing broadness and depth of the genome, respectively, would also be outputted.Finally, the depth coverage of the assembled genome would be visualized with a line chart.The novel virus discovery module was designed for novel virus discovery.Viral contigs that were annotated as the same genus were pooled together and defined as a new virus class (NVC).The read coverage of the NVC, which measures the sequencing depth of the NVC, the AAI and coverage between viral contigs and the BLASTX best hits against the NR database, and the detailed information of the BLASTX best hits would be provided.

Comparison of VIGA and other assembly tools
Since VIGA integrated both de novo and reference-based assemblers for virus identification and genome assembly, four commonly used assembly tools, namely, two reference-based ones, i.e.MetaCompass [22] and VirGenA [23], and two de novo ones, i.e.Haplof low [25] and Trinity [24], were used for virus genome assembly and for comparison to VIGA.For the de novo assemblers, Trinity is a widely used tool for de novo contig assembly [46].Previous studies have shown that it had excellent performances in virus genome assembly [47][48][49].Thus, it is used for comparison to VIGA in assembling virus genomes.Haplof low was designed for assembly of virus genomes, especially the haplotype reconstruction [25].It has been demonstrated to outperform other tools in haplotype reconstruction [25].Thus, it is used to compare its performance to VIGA in assembling virus genomes at the strain level.For the reference-based assemblers, VirGenA is a tool designed for separating mixtures of viral strains of different intraspecies genetic groups such as genotypes, subtypes and clades and for assembling a separate consensus sequence for each group in a mixture [23].It has been reported to produce long assemblies for mixture components of extremely low frequencies (<1%) [23].Thus, it is used for comparison to VIGA in assembling virus genomes at the strain level.MetaCompass utilizes reference genomes as a guide to construct consensus sequences and employs de novo assembly to address regions dissimilar to known reference genomes, thereby enhancing the overall assembly quality [22].It is designed for metagenomic assembly of lowabundance bacterial genomes and should be particularly useful for assembling virus genomes since most viruses have low abundances in either metagenomic or metatranscriptomic sequencing data [22].Besides, it is part of VIGA and can be used to investigate the effectiveness of integrating multiple tools in VIGA.
Besides, all competitor tools used in the study have undergone extensive validation and have been widely used in the field.The GF [42] was used to measure the completeness of the assembled genome.It was calculated as the ratio of the virus genome covered by the bases to which at least one contig has alignment.Two additional metrics were used to measure the accuracy and precision in assembling virus genomes at the strain level.One is the mismatches per 100 kb (M100K) [42], which referred to the average number of mismatches per 100 000 aligned bases and was used to measure the accuracy of assembled virus genomes.The other is the strain precision (SP) [25], which was calculated as the ratio of correctly assembled contigs among all contigs of the virus and was used to assess the precision of assembled virus genomes.

Virus identification and genome assembly on the HMP
VIGA was used to identify viruses and assemble virus genomes from the HMP.The host kingdom (plant, fungi and animal) of the identified viruses was inferred based on the virus family as viruses of the same family generally infect host of the same kingdom.When analyzing the virome in diseases, only animal viruses with abundance of greater than 2 FPKM were kept.Besides, viruses of the Retroviridae family were removed due to possible contamination from endogenous retroviral sequences, and viruses of the Baculoviridae family were also removed, which are commonly used in the laboratory [50].

Statistical analysis
The Pearson correlation coefficient (PCC) was used to measure the correlation between virus abundance and virus percentage in the mock virus community.Python (version 3.8) was used for statistical analysis.

Overview of the VIGA pipeline
The VIGA included four modules: identification, taxonomic annotation, assembly and novel virus discovery (Figure 1), which were described brief ly as follows.The identification module aimed to detect eukaryotic virus sequences from NGS data.Firstly, the raw reads were pre-processed and then were assembled into contigs.Then, these contigs were queried against virus protein sequences using BLASTX, and candidate virus contigs were obtained.Subsequently, to remove false positives, the candidate virus contigs were further queried against the NR database using BLASTX.Finally, the virus contigs were filtered based on host, and only eukaryotic virus sequences were kept for further analysis.
The Taxonomic annotation module was designed for taxonomic annotation of viral contigs identified above.The taxonomy information of viral contigs was obtained according to the AAI and coverage between the contig and the BLASTX best hit against the NR database (see Materials and methods).For viral contigs that were annotated at the species level, the assembly module would be conducted; for those that were annotated at the genus level, the novel virus discovery module would be conducted.
The assembly module assembled and quantified the viral genome for the virus species after the taxonomic annotation.Firstly, the reference genomes were obtained for the virus species; then, MetaCompass was used to assemble the virus genome based on reference genomes of the virus species.Next, RagTag was used to assembly the virus genome after correcting potential assembly errors in contigs based on the reference genomes; finally, the assembled viral genomes were evaluated with MetaQUAST and quantified using FPKM.
The novel virus discovery module was designed for novel virus discovery.Viral contigs that were annotated as the same genus were pooled together and defined as a new virus class (NVC).Then, the read coverage of the NVC, the AAI and coverage between viral contigs and BLASTX best hit against the NR database, and the detailed information of the BLASTX best hits was provided.

Evaluating the performance of VIGA on a mock viral community
The performance of VIGA was firstly evaluated on a mock viral community that consisted of seven viruses with different proportions (Materials and methods).For comparison, we also evaluated four commonly used tools, namely, two reference-based (MetaCompass and VirGenA) and two de novo tools (Trinity and Haplof low).As illustrated in Figure 2A, VIGA successfully identified six viruses except the rotavirus A. For viruses of PV-1, EV-13, HAdVC and CV-B4, the GFs of viral genomes assembled by VIGA exceeded 0.94, which were much higher than those of its competitors.For example, the GFs of viral genomes assembled by MetaCompass and Trinity ranged from 0.74 to 0.94 and from 0.50 to 0.84, respectively.For viruses of HAdV-5 and HAdV-11, VIGA performed poorly, with the GFs of 0.636 and 0.064, respectively; however, it still outperformed its competitors much.
Previous studies have shown that the transcript abundance that was calculated based on the effective length (defined as the length of transcript regions that were covered by uniquely mapped reads) can ref lect the gene expression more accurately than the abundance based on the full length of the transcript.Hence, the length of the assembled virus genome was taken as the effective length and was used to calculate the virus abundance using the FPKM method (Materials and methods).The PCC was used to measure the correlation between the calculated virus abundances and the percentages of viruses in the mock virus community that ref lect the real virus abundance.As shown in Figure 2B and C, VIGA had a PCC of 0.93, which was similar to that of MetaCompass (0.92) and much higher than those of other software tools (0.62-0.68), suggesting that the virus quantification based on the virus genome assembled by VIGA could accurately capture the real virus abundance in complex virus communities.

Performance of VIGA on metatranscriptomic and metagenomic datasets
Then, VIGA was evaluated on a metatranscriptomic dataset of the sweet potato virome from which previous studies have identified 10 viruses and obtained their genomes by the RT-PCR method.VIGA and other four assemblers (MetaCompass, VirGenA, Trinity and Haplof low) were used to assemble virus genomes in the dataset.Except for SPCFV, all viruses were assembled by at least one tool.The GF of virus genomes assembled by VIGA ranged from 1.5% to 100% for these nine viruses (Figure 3A), with a median of 47.9%, which was greater than those of other software tools (median ratio ranging from 0% to 27.1%).Notably, both VIGA and MetaCompass assembled near complete genomes for five viruses, namely, sweet potato virus G (SPVG, 99.17%), sweet potato leaf curl virus (SPLCV, 100%), sweet potato feathery mottle virus (SPFMV, 98.90%), sweet potato virus E (SPVE, 99.94%) and sweet potato virus F (SPVF, 99.45%), three of which (SPLCV, SPFMV and SPVE) were also assembled well by Trinity.However, both VirGenA and Haplof low performed poorly and assembled only a small proportion of genomes for all nine viruses.
VIGA was also evaluated on a metagenomic dataset of bird droppings from which 16 viral strains of three virus species  were identified, and their genomes were obtained by the RT-PCR method.VIGA stood out as the most effective tool for assembling viral genomes, with the GF ranging from 0% to 100% and a median of 86.54% (Figure 3B), while other three assemblers (MetaCompass, Haplof low and Trinity) had the median GFs ranging from 4.5% to 80.3%.Notably, VIGA assembled high ratios of virus genomes in most samples, with only two exceptions resulting in no genome assembly.

Performance of VIGA in assembling virus genomes at the strain level
We then evaluated the performance of VIGA and its competitors in assembling virus genomes at the strain level.Two datasets were used in the evaluation.The first is the HIV dataset, which was the simulated metagenomic sequencing of three HIV strains with a 95% sequence identity.VIGA assembled genomes of all three strains successfully.Three indexes, namely, GF, SP and M100K, were used to measure the performance of assemblers (see Materials and methods).As shown in Figure 4A and Table S1, VIGA achieved near-perfect performance on indexes of GF and SP, with an average GF of 98.20% and an SP of 100%, while in terms of M100K, VIGA had 2787 mismatches per 100 kbp, which was much smaller than those of other assemblers except Haplof low that had the lowest mismatches per 100 kbp (33.7) among all assemblers.Compared to VIGA, Haplof low also achieved excellent performances on these indexes, with the SP of 100% and the average GF of 93.36%; other three assemblers had medium or poor performances on at least one index.For example, VirGenA had poor performances on the GF (76.14%).
The second is the HBV dataset, which was the metagenomic sequencing data of two samples infected by two HBV strains with 89% sequence identity.As shown in Figure 4B and Table S2, VIGA performed best on three indexes among all assemblers, with the average GF of 99.91%, SP of 100% and 1890.7 mismatches per 100 kbp; Haplof low performed secondly among all assemblers, with the average GF of 91.19%, SP of 100% and 2355 mismatches per 100 kbp; other assemblers had median or poor performances on at least one index.For example, Trinity had an average GF of 73.48% and 3306.4 mismatches per 100 kbp.

Performance of VIGA in assembling virus genomes with extreme genome size or GC content
To assess the robustness of VIGA, we evaluated its performance alongside that of its competitors when applied to viruses with very small or large genomes, as well as those with very low or high GC content.We initially selected HBV and Human Betaherpesvirus 5 (HHV5), whose genomes are approximately 3 and 229 kb in size, respectively.As indicated in Table S3, VIGA demonstrated the highest GF and the fewest mismatches and achieved 100% precision for both viruses.Furthermore, VIGA and MetaCompass produced only a small number of contigs for virus genome assembly, whereas other tools yielded a greater number of contigs.
Next, we examined the performance of VIGA and its competitors on viruses with either low or high GC content.Suid Herpesvirus 1 (SuHV-1) and Diachasmimorpha Longicaudata Entomopoxvirus (DlEPV) were chosen, with GC content of 73.59% and 30.07%, respectively.As shown in Table S4, VIGA consistently assembled genomes with the highest GF, achieved 100% precision and produced the fewest contigs for both viruses, although it had a little more mismatches than MetaCompass.Taken together, the biased conditions had a negligible impact on the performance of VIGA in virus genome assembly.plants and fungi (Figure 5A) (available at https://github.com/viralInformatics/VIGA/blob/master/Virus_recovery_from_HMP_ using_VIGA.xlsx).The GFs of the assembled virus genomes ranged from 1.4% to 100% with a median of 33.8% (Figure 5B).A total of 44 viruses were assembled with high completeness (GF >80%).We next focused on analysis of animal viruses since the samples were obtained from humans.After removing potential contaminations (see Materials and methods), a total of 28 animal viruses were kept for further analysis.The abundance and positive rate of 28 viruses in samples of three human diseases, namely, Crohn's disease, prediabetes and ulcerative colitis, and healthy people were analyzed.As shown in Figure 5C, a total of 11 viruses were observed in the healthy people, with the Mongoose picobirnavirus and Human picobirnavirus having the highest positive rate.The virome composition in disease samples was different from that in healthy people.For example, only 9 of 16 viruses in the Crohn's disease were also observed in the healthy people.For each disease group, there were one or more disease-specific viruses.For example, the Rotavirus A, which has been previously implicated in the development of intestinal diseases such as diarrhea, had high abundances in patients of the Crohn's disease with a median of 4837 FPKM.The virome composition between Crohn's disease and ulcerative colitis was similar, with nine virus species coexisting in both diseases, while the prediabetes had different virome composition to the Crohn's disease and ulcerative colitis.

DISCUSSION
Numerous methods have been developed to identify virus or assemble virus genome from the NGS data.However, there is still a lack of an effective tool for integrating both the functions of virus identification and genome assembly.This study presented a one-stop tool named VIGA for virus identification and genome assembly from the NGS data.It has been shown to assemble more complete virus genomes than its competitors in both the metatranscriptomic and metagenomic data and perform well in assembling virus genomes at the strain level.It has also been used to characterize the virome in metatranscriptomic data from the HMP.Overall, it should be an effective tool for virus identification and genome assembly in virome studies.Compared to previous assemblers, VIGA assembled more complete virus genomes in both metatranscriptomic and metagenomic data.This may be because VIGA effectively integrated two reference-based tools, i.e.MetaCompass and RagTag.The reference-based tools have been reported to assemble more accurate and complete genomes than de novo ones [51] given high-quality reference genomes.When compared to reference-based tools such as MetaCompass, VIGA automatically selected reference genomes from the library of virus reference genomes; more importantly, it used RagTag to further assemble virus genome obtained by MetaCompass and to correct potential errors in assembled genomes, which may lead to more complete and accurate virus genomes than those assembled by MetaCompass alone.Even for viruses with extreme genome size or GC content, VIGA also outperformed its competitors in both accuracy and completeness, suggesting its robustness in assembling virus genomes.
Both Haplof low and VirGenA have been designed for strainresolved assembly of viral genomes.Unexpectedly, VIGA can also be used to assemble virus genomes at the strain level with excellent performances.Compared to Haplof low, VIGA assembled viral genomes with similar precision, higher completeness yet a little lower accuracy on the HIV and HBV datasets.While compared to VirGenA, VIGA assembled more complete viral genomes with higher accuracy.The ability of VIGA on separating mixtures of virus strains may be due to the comprehensive library of virus reference genomes as genome sequences of each virus species were clustered at 100% level.Thus, the library should be updated in time to capture the newly submitted virus reference genomes.However, for novel viruses without reference genomes in the virus reference genome library, it may be difficult to separate and assemble virus genomes at the strain level.
Lots of third-party tools were integrated into VIGA.There were many challenges in integrating such tools effectively.Firstly, tools used in VIGA should be selected seriously in each module.For example, in contig assembly, there were lots of candidate tools such as Trans-ABySS [21], SPAdes-rna [52] and IVA [53].Trinity was selected due to its excellent performance, wide usage and ease of use [46]; in virus genome assembly, the reference-based method was selected because a library of virus reference genomes was built.Secondly, the tools should be effectively integrated to make full use of their advantages.For example, in virus genome assembly, two reference-based tools, MetaCompass [22] and Rag-Tag [40,41], were used in VIGA.MetaCompass was firstly used to assemble virus genomes based on the virus reference genome, which usually resulted in the generation of many virus contigs.Then, RagTag was employed to correct misassemblies, scaffold genome assemblies and fill gaps, leading to a chromosome-scale virus genome and a significant improvement in the integrity of the assembled virus genomes.Thirdly, all tools used in VIGA should be effectively connected to form a unified workf low, which makes it easy for users to use.
There were some limitations to the study.Firstly, VIGA can only assemble genomes for known viruses, making it particularly suitable for identification and assembly of known virus genomes in viral genome re-sequencing.For novel viruses without reference genomes, it is difficult to assemble accurate genomes by VIGA.Secondly, the accuracy of VIGA in assembling genomes needs further improvements, especially in regions with low abundance or no transcription such as the 3 and 5 ends.Thirdly, VIGA can only be used for eukaryotic viruses that have been studied much and for which a large number of reference genomes are available.Fourthly, VIGA can only be used on paired-end NGS data as MetaCompass only accepts such data as input.

CONCLUSION
Overall, this study developed a one-stop tool called VIGA for eukaryotic virus identification and genome assembly from the NGS data.It can be used in assembling virus genomes from both metatranscriptomic and metagenomic data and be used in separating virus strains from mixtures.It would help much in identification and characterization of viromes in future studies.

Key Points
• VIGA is a one-stop tool for eukaryotic virus identification and genome assembly from next-generation sequencing data.• VIGA outperformed its competitors in assembling more complete virus genomes, even at the strain level, when evaluated on multiple simulated and real virome datasets.• VIGA can be used to analyze the metatranscriptomic and metagenomic data.• VIGA is freely available for public use on the GitHub platform at https://github.com/viralInformatics/VIGA.

Figure 1 .
Figure 1.The workf low of VIGA.It includes four modules: identification, taxonomic annotation, assembly and novel virus discovery.

Figure 2 .
Figure 2. The performances of VIGA and its competitors on a mock virus community.(A) The ratios of viral genomes assembled by VIGA and its competitors including MetaCompass, VirGenA, Trinity and Haplof low.(B) The PCC between the calculated virus abundance based on the viral genomes assembled by different software tools and the viral percentages in the virus community that ref lect the real virus abundance.(C) The scatter plot of the calculated virus abundance based on VIGA and the real virus abundance.The blue line referred to the linear regression and the gray area referred to 95% confidence interval.PV-1, human poliovirus 1; EV-13, human echovirus 13; HAdVC, human adenovirus C; CV-B4, coxsackievirus B4; HAdV-5, human adenovirus 5; HAdV-11, human adenovirus type 11.

Figure 3 .
Figure 3. Performance comparison of different software tools on metatranscriptomic (A) and metagenomic (B) datasets.Boxplots show the statistical distribution of GF of assembled virus genomes by different tools.Each dot in the boxplots referred to the GF of the virus genome assembled in a sample.The results of VirGenA on the metagenomic data were not shown as no viral genomes were assembled successfully by the assembler.

Figure 4 .
Figure 4. Performance of five assemblers (Haplof low, MetaCompass, Trinity, VIGA and VirGenA) on the HIV (A) and HBV (B) datasets.Three indexes, namely, GF, SP and M100K were used to measure the performances of assemblers.Each index was normalized with the min-max normalization method before it was used in the radar plot.

Figure 5 .
Figure 5. Identification and characterization of viromes in metatranscriptomic data from the HMP dataset.(A) Distribution of assembled viruses based on host type.(B) Number of viruses with a given level of genome fraction were assembled.For viruses that were identified in multiple samples, only the most complete genome was used in the analysis.The GF was evenly divided into five levels.(C) Abundance and positivity rate of animal viruses in three diseases, namely, Crohn's disease, ulcerative colitis and prediabetes and in the control group (healthy people).Viruses were organized by groups in the Baltimore classification system and by the viral family.The virus abundance and the positive rate of viruses are shown in the left and central heatmaps, respectively, according to the figure legend in the top right.The bar plot on the right indicates the number of samples in which the virus was identified.
They are freely open to users and easy to install.The local version of MetaCompass (version 2.0.0, available at https://github.com/marbl/MetaCompass) was used by specifying the reference genome and setting the read length of 100 and other parameters as default.The local version of VirGenA (version 1.4, available at https://github.com/gFedonin/VirGenA/releases) was used with the insertion length of 800 and other parameters as default.The local version of Haplof low (version 1.0, available at https://github.com/hzi-bifo/Haplof low) was used with default parameters.The local version of Trinity (version 2.1.1,available at https:// github.com/trinityrnaseq/trinityrnaseq)was used with default parameters.

Table 1 :
The level of installation, setup and usage, as well as time and memory consumption of VIGA, MetaCompass, Haplof low, Trinity and VirGenAThe time and memory consumption of assemblers were tested using an RNA-Seq dataset of PRRSV infection (accession in NCBI SRA database: SRR2123928, accession of virus genome sequence in NCBI GenBank database: AF325691.1).The test was conducted on a server with the following configuration: OS, Ubuntu 16.04.7 LTS; CPU, Intel(R) Xeon(R) Silver 4114, 2.20GHz, 20 cores and 40 threads; RAM, 128GB).Memory consumption was recorded using the 'memory_profiler' module in Python.aOnly the Assembly module was run for a fair comparison.bNoresults returned when the task was completed as VirGenA cannot deal with large datasets.