SegVir: Reconstruction of Complete Segmented RNA Viral Genomes from Metatranscriptomes

Abstract Segmented RNA viruses are a complex group of RNA viruses with multisegment genomes. Reconstructing complete segmented viruses is crucial for advancing our understanding of viral diversity, evolution, and public health impact. Using metatranscriptomic data to identify known and novel segmented viruses has sped up the survey of segmented viruses in various ecosystems. However, the high genetic diversity and the difficulty in binning complete segmented genomes present significant challenges in segmented virus reconstruction. Current virus detection tools are primarily used to identify nonsegmented viral genomes. This study presents SegVir, a novel tool designed to identify segmented RNA viruses and reconstruct their complete genomes from complex metatranscriptomes. SegVir leverages both close and remote homology searches to accurately detect conserved and divergent viral segments. Additionally, we introduce a new method that can evaluate the genome completeness and conservation based on gene content. Our evaluations on simulated datasets demonstrate SegVir’s superior sensitivity and precision compared to existing tools. Moreover, in experiments using real data, we identified some virus segments missing in the NCBI database, underscoring SegVir’s potential to enhance viral metagenome analysis. The source code and supporting data of SegVir are available via https://github.com/HubertTang/SegVir.


Assign species-level taxon to contigs within the viral group
For each viral group, we determine the taxonomy of contigs based on RNA-dependent RNA polymerase (RdRp).In most cases, a viral group contains only one virus.Consequently, we extend the taxon assigned to RdRp to the other contigs within the group.However, in rare instances, a group may contain RdRps from multiple viruses.In such cases, Segvir will compute the sequencing coverage correlation between non-RdRp and RdRp contigs to assign taxon.Our hypothesis posits that contigs originating from the same virus will exhibit similar depth distributions across multiple samples.As a result, we can assign the taxon of RdRp to non-RdRp contigs by assessing the correlation of their coverage distributions.In detail, we first map reads from each sample to the contigs and calculate their average depth within each sample to obtain the coverage distribution for each contig.Subsequently, we employ the Pearson correlation coefficient to measure the correlation of coverage distributions between pairs of contigs, with the formula as follows: where x and y are the coverage distributions of a pair of contigs.x and ȳ are the mean of x and y, respectively.SegVir will calculate the coverages using CoverM [1] and output the distribution correlation between each non-RdRp contig and all RdRp contigs.Users can determine the taxon of the contigs with the alignment results.It is worth noting that the accuracy of this method is correlated with the number of samples, where larger datasets yield more accurate results.

Compare the different methods of building the function cluster dataset
To construct the function clusters (FCs) using protein clusters (PCs), we first computed the similarity between PCs and then employed graph clustering to group them into FCs.To accurately calculate the similarity between PCs, we tested three methods: sequence-based similarity, HMM profile similarity, and protein three-dimensional structure similarity.Suppose we have a pair of PCs, denoted as P C i = {s 1 , s 2 , ..., s n } and P C j = {s ′ 1 , s ′ 2 , ..., s ′ m }, where s ′ and s are proteins.Our goal is to compute the similarity between P C i and P C j , denoted as sim(P C i , P C j ).The formulas corresponding to the different methods are shown in Table S1.cov and ident denote the coverage and identity of alignment regions on the query, respectively.
profile HMM hhalign [2] prob(hmm i , hmm j ) hmm i and hmm j refer to the pHMMs corresponding to P C i and P C j , respectively.prob denotes the probability of homology between these two profiles.

Structural similarity
ESMfold [3] tm-score(ŝ, ŝ′ ) ŝ and ŝ′ represent the three-dimensional structures predicted by ESMFold for s and s ′ , respectively.The TM-score [4] is used to calculate the similarity between these two structures.TM-vec [5] cos(v,v ′ ) |P Ci|×|P Cj | cos refers to the cosine similarity between two vectors.
We initially employ a sequence-based similarity method.We align the proteins in P C i to proteins in P C j using DIAMOND and define sequence similarity as the product of alignment coverage and identity.In addition, if there are multiple distinct alignments between a pair of sequences, we retain only the maximum score as their similarity.We compute the similarities of pairs of proteins and calculate the average similarity across all pairs as the measure of similarity between the PCs.
In addition to utilizing sequence similarity, we also evaluate the similarity between a pair of PCs by estimating the likelihood of homology between their corresponding Hidden Markov Model (HMM) profiles.We employ hhalign with default parameters to align the two HMM profiles, resulting in an estimated probability of homology between them.
Finally, we measure the similarity between PCs by assessing the similarity between the threedimensional structures.Here, we use the Template Modeling Score (TM-score) as the metric, which primarily measures the overall similarity between two folds.The TM-score ranges from 0 to 1, and when the score is greater than 0.5, we consider the two proteins to have similar structures.We employ two methods to calculate the TM-score between two proteins.The first method involves using ESMFold to predict protein structures, followed by the use of tmalign to calculate the TMscore between the predicted structures.ESMFold is a high-precision protein structure prediction tool based on Evolutionary Scale Modeling (ESM).ESM is a protein foundation model trained on a large-scale protein data set using deep learning.The model demonstrates the ability to learn the evolutionary rules of proteins and the relationship between sequence-structure-function.For a single protein sequence, ESMFold can directly predict its atomic-level structure end-to-end.Moreover, tmalign is an algorithm used to compare protein structure similarity.It first identifies the optimal equivalent residues between two proteins based on structural similarity, and then outputs a TM-score value.Here, we use the average TM-score as the similarity between two PCs.
In the above method, predicting protein structures serves as an intermediate step is very timeconsuming.Therefore, we also used a more efficient approach, using TM-Vec to directly calculate the TM-score between two protein sequences.TM-Vec is a deep learning model that integrates protein sequences and 3D structures during training, enabling the direct generation of vector embeddings containing structural information.For two proteins, s and s ′ , we use TM-Vec to calculate their corresponding vector embeddings, v and v ′ .According to the literature, the TM-score between s and s ′ is equivalent to the cosine similarity between v and v ′ .After calculating the TM-score between the proteins of two PC groups, we compute the mean TM-score of all protein pairs as the similarity between the two PC groups.
We use the similarity calculated by the above methods to build the graph and cluster PCs, where node is the PC and edge is the similarity between PCs.To compare the performance, we first evaluated the accuracy of edge construction for different methods, followed by assessing the clustering results.We used 746 PCs derived from all proteins of the Orthomyxoviridae family as evaluation data.Each PCs was annotated based on the protein's description, with protein types including PA, PB1, PB2, neuraminidase, nucleocapsid, and so forth.We retained only the annotated PCs as the ground truth, totaling 739.In the graph, there should be edges connected between PCs that exhibit the same function.We employed the Precision-Recall (PR) curve to assess the performance, as shown in the Fig. S1 (a).Based on the area under the curve, the ranking of edge prediction accuracy is as follows: hhalign, BLASTp, ESMFold, and TM-Vec.At a recall of about 0.55, both hhalign and BLASTp exhibit very high precision, close to 1.However, as the recall increases, the precision of BLASTp drops sharply.Moreover, the accuracy of ESMFold is higher than that of TM-Vec.
After obtaining the similarity between PCs, we employ the Markov Clustering Algorithm (MCL) to cluster the PCs.We utilize purity, Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and sensitivity to evaluate the clustering results.The formulas are as follows: where the number of samples is N and the clustering result contains K clusters.C = {C 1 , C 2 , . . ., C K } is the clustering result, and T = {T 1 , T 2 , . ..} is the actual clusters.n ij denotes the number of samples in C i and T j , a i and b j denotes the number of samples in C i and T j .I(C, T ) is the mutual information, and H(C) and H(T ) are the entropy of the C and T , respectively.T P is the number of true positive samples, F N is the number of false negative samples.Among these metrics, purity is used to measure the accuracy of clustering, and both the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) are utilized to evaluate the consistency between the predicted clusters and the actual clusters.Sensitivity is employed to access the proportion of clustered samples.For all four metrics, a value closer to 1 indicates better performance.Finally, we utilized hhalign, BLASTp, ESMFold, and TM-Vec to obtain 17, 34, 28, and 7 FCs, respectively.The performance is shown in the Fig. S1 (b).Overall, hhalign demonstrated the best performance, followed by BLASTp, ESMFold, and TM-Vec.The purity of BLASTp was slightly higher than that of hhalign.However, the number of FCs generated by BLASTp was twice that of hhalign, resulting in lower ARI and NMI in BLASTp.While ESMFold's performance was still better than TM-Vec's, ESMFold had the lowest sensitivity among the four tools.

Identify the closest reference genomes in completeness estimation
When estimating the completeness of a query genome, we need to find the genome that is most similar to it in the reference genomes.Let s ′ be the FC set of the query genome and S ref = {s 1 , s 2 , ..., s n } be the FC sets of all n reference viruses.To achieve this, we first need to calculate the distance between s ′ and each set s i in S ref .
Here, we employ the weighted Jaccard similarity coefficient (J), as shown in the following function: Here, we calculate the Jaccard similarity coefficient between s ′ and s i , denoted as J(s ′ , s i ).We aim to identify the closest reference containing FCs that match those in the query while maximizing the J.To achieve this, we multiply J(s ′ , s i ) by the ratio of the intersection of s ′ and s i to the size of s ′ (i.e., |s ′ |).This approach ensures that the most similar s i includes as many elements from s ′ as possible, thereby avoiding overestimating completeness.For example, the FC set of the query genome G q is {A, B, C}.In the reference database, there are two complete genomes, G 1 and G 2 , corresponding to FC sets {A, B, D} and {A, B, C, D, E, F, G}, respectively.We can see that G 2 should be used to estimate the completeness of G q , as the latter is a subset of the former.If we rely solely on the Jaccard coefficient, we would incorrectly identify G 1 as the closest genome to G q , resulting in an overestimation of completeness.However, by employing Formula 6, we correctly identify G 2 as the closest genome to G q and accurately estimate its completeness.

Estimate the conservation score for viral genomes
To calculate the conservation score, we first determine the weights of different function clusters (FCs).For RNA viruses, the RdRp plays a crucial role and is commonly used in evolutionary analyses.And viruses carrying distinct RdRps usually share low homology, implying differences in FC sets between viruses containing different RdRp FCs can be very large.Consequently, we group viruses based on the RdRp FCs, ensuring that viruses with the same RdRp FC are grouped together.
Assume the reference complete genomes are G = {g1, g2, g3, ...}, and we obtained the FC sets corresponding to each genome by aligning these genomes to the reference database.Within this group of viruses, we calculated the weight w for each FC by assessing its frequency across all genomes.The formula for calculating the weight is: This implies that the more frequently F C i occurs, the more conserved it is within this group of viruses, resulting in a weight w i closer to 1. Consequently, for a query genome, after aligning it to obtain its corresponding FC sets, we first check which RdRp FCs are included.Subsequently, assuming that proteins in the genome can be aligned to n FCs, where w i represents the weight of F C i corresponding to specific RdRp, the conservation score is calculated by n i=1 w i .The more conserved FCs a genome contains, the larger the cumulative score is, indicating the higher conservation between the genome and the reference database.When a query genome contains multiple RdRp FCs, we calculate corresponding scores for each RdRp FC and select the maximum score as the genome's conservation score (ConS).After obtaining the weights associated with the FCs, we compute the reference ConS for each reference FC set.For a query genome, a smaller difference between the calculated score and the reference score indicates stronger conservation relative to the reference genomes.

Performance of leave-one-out test by family
To evaluate SegVir's ability to detect novel segmented viruses, we designed a leave-one-out test by family.We used complete genomes from 40 families containing segmented RNA viruses from the RefSeq database as our dataset.In each test, we selected one family as the test set and used the remaining 39 families to create SegVir's reference database.The experiment first evaluated the sensitivity of virus detection, where a virus was considered detected if at least one of its RdRp segments was identified.Next, we calculated the sensitivity for detecting non-RdRp segments among the identified viruses.Finally, we determined the proportion of detected viruses that could be identified by other families within the same order.This metric evaluates whether the virus detection is easier if the reference database contains families from the same order.The results are shown in Table S2.
As shown in Table S2, we can see that most viruses can be identified by other families within the same order, demonstrating SegVir's capability to detect new viruses from families not included in the sample.This is primarily due to the high conservation of RdRp, as evidenced by the significantly lower sensitivity for non-RdRp segments compared to RdRp.Although SegVir itself cannot determine whether the detected new virus belongs to a new family, users can construct a phylogenetic tree based on the identified RdRp to make further assessments.Non-RdRp: the detection sensitivity of non-RdRp segments.Order: the proportion of detected viruses that could be identified by other families within the same order, where nan means there's only one family with this order.

Supplementary Figures
Figure S2: The distribution of FC weights for all families and the selected 13 families.
The X-axis is the family, and the Y-axis is the distribution of FC weights.3 Supplementary Tables

Figure S1 :
Figure S1: The evaluation of building edges and constructing clusters using different PC similarity estimation methods.(a) The PR-curve of evaluating the accuracy of edge construction between different PCs.The value within the parentheses represents the area under the curve (AUC).(b) The bar plot of accessing the clustering using different types of edges.Different colors represent different tools for edge building.

Figure S3 :
Figure S3: The completeness estimation of identified genomes by different tools using the FC-based method.

Figure S4 :
Figure S4: The similarity between segmented RNA viruses and mosquito hosts.We aligned segmented RNA viruses to different host genomes to analyze their similarities, where the X-axis represents the alignment length and the Y-axis represents alignment identity.

Figure S5 :
Figure S5: The similarity between segmented RNA phages and bacteria.We aligned segmented RNA viruses to different bacteria genomes to analyze their similarities, where the X-axis represents the alignment length and the Y-axis represents alignment identity.

Figure S6 :
Figure S6: The genomes and coverage distribution of Enontekio phenui-like virus 2. (a) The identified genomes.The arrows indicate the proteins encoded on the segments, which encode RdRp and coat protein, respectively.(b) The coverage distribution of two segments across 10 samples.

Table S1 :
The formulas of estimating the similarity between PCs using different methods.

Table S2 :
The basic information of the dataset and the experimental results.

Table S3 :
The detailed information of the segmented RNA virus database.The database is built using the viral proteins of 40 families containing segmented RNA viruses from the NCBI virus database.The reference FC sets and conservation score are constructed using the complete

Table S4 :
The detailed information of the low-similarity dataset.

Table S6 :
The accession numbers of SRA data used to calculate the coverage distributions in experiments identifying the missing segments in the NCBI datasets.

Table S7 :
The accession numbers of SRA data used to identify the segmented RNA viruses from real metatranscriptomes.