Inference of genetic relatedness between viral quasispecies from sequencing data

Background RNA viruses such as HCV and HIV mutate at extremely high rates, and as a result, they exist in infected hosts as populations of genetically related variants. Recent advances in sequencing technologies make possible to identify such populations at great depth. In particular, these technologies provide new opportunities for inference of relatedness between viral samples, identification of transmission clusters and sources of infection, which are crucial tasks for viral outbreaks investigations. Results We present (i) an evolutionary simulation algorithm Viral Outbreak InferenCE (VOICE) inferring genetic relatedness, (ii) an algorithm MinDistB detecting possible transmission using minimal distances between intra-host viral populations and sizes of their relative borders, and (iii) a non-parametric recursive clustering algorithm Relatedness Depth (ReD) analyzing clusters’ structure to infer possible transmissions and their directions. All proposed algorithms were validated using real sequencing data from HCV outbreaks. Conclusions All algorithms are applicable to the analysis of outbreaks of highly heterogeneous RNA viruses. Our experimental validation shows that they can successfully identify genetic relatedness between viral populations, as well as infer transmission clusters and outbreak sources.


Background
Inferring transmission clusters, transmission directions, and sources of outbreaks from viral sequencing data are crucial for viral outbreaks investigation. Outbreaks of RNA viruses, such as Human Immunodeficiency Virus (HIV) and Hepatitis C virus (HCV), are particularly dangerous and pose a significant problem for public health. It is well known that genomes of RNA viruses mutate at extremely high rates [1]. As a result, RNA viruses exist in infected hosts as populations of closely related variants called quasispecies [2,3]. However, only recently with the progress of sequencing technologies, it became possible *Correspondence: glebova@cs.gsu.edu † Equal contributors 1 Computer Science Department, Georgia State University, 25 Park Place NE, 30303 Atlanta, GA, USA Full list of author information is available at the end of the article to identify and sample quasispecies at great depth [4][5][6][7][8][9]. Consequently, a contribution of sequencing technologies to molecular surveillance of viral disease epidemic spread becomes more and more substantial [10,11]. Computational methods can be used to infer transmission characteristics from sequencing data. The first question usually is whether two viral populations belong to the same outbreak. The methods typically utilize the simple observation that all samples from the same outbreak are genetically related, so they use some measure of genetic relatedness as a predictor for epidemiological relatedness [10][11][12].
The second question is which samples constitute isolated outbreaks. For this purposes, we define a transmission cluster as a connected set of genetically related viral populations. The third questions we address in this article is "Who is the source of infection?". This questions is the most difficult to answer, and there were only a few attempts to do it computationally using solely genomic data [13] without invoking additional epidemiological information [14]. To the best of our knowledge, there is still no freely available computational tool for this problem.
Computational methods for detection of viral transmissions and inference of transmission clusters are often consensus-based, i.e. they analyze only a single representative sequence per intra-host population (for example, consensus sequence). Such methods assign two hosts into one transmission cluster, if the distances between corresponding sequences do not exceed a predefined threshold [10,11]. Although consensus-based methods proved to be useful, they do not take into account intra-host viral diversity. Inclusion of whole intra-host populations into analysis is important, because minor viral variants are frequently responsible for transmission of RNA viruses [15,16].
Recently published computational approach (further referred to as MinDist) [12] uses the minimal genetic distance between sequences of two viral populations as a measure of genetic relatedness of intra-host viral populations. Since minimal genetic distances between different pairs of populations can be achieved on various pairs of sequences, this approach takes into account intra-host diversity.
However, both consensus-based and MinDist approaches have further limitations. First of all, they do not allow to detect directions of transmissions, which is crucial for detection of outbreak sources and transmission histories. Secondly, distance thresholds utilized by both approaches could be derived from analysis of limited or incomplete experimental data and highly data-and situation-specific, with different viruses or even different genomic regions of the same virus requiring specifically established thresholds.
In this paper, we address the above limitations by proposing two novel algorithms ReD and VOICE, as well as by suggesting an improvement of the MinDist algorithm. The new algorithms allow to infer important epidemiological characteristics, including genetic relatedness, directions of transmissions and transmission clusters.
• Relatedness D epth (ReD ) method uses clustering-based analysis of intra-host viral populations. It is a non-parametric algorithm, so it does not rely on any virus-specific threshold values to predict epidemiological characteristics. • V iral O utbreak I nferenCE (VOICE ) is a simulation-based method which imitates viral evolution as a Markov process in the space of observed viral haplotypes • MinDistB method is a modification of MinDist [12], which takes into account the sizes of relative borders of each pair of viral populations.
The proposed algorithms were validated on the experimental data obtained from HCV outbreaks. Comparative results suggest that our methods are efficient in epidemiological characteristics inference.

Relatedness depth (ReD) algorithm
ReD is a deterministic algorithm based on deterministic hierarchical clustering. The key concept of this method is a k-clustered intersection of viral populations (we used similar idea previously for combinatorial pooling [17]). For two sets of viral sequences P 1 and P 2 , their k-clustered intersection P 1 ∩P 2 is calculated as follows: 1) Partition the union P 1 ∪ P 2 into k clusters C 1 , ..., C k ; 2) P 1 ∩P 2 = i∈B C i , where B = {i ∈ {1, ..., k} : C i ∩ P 1 = ∅, C i ∩ P 2 = ∅}, i.e. P 1 ∩P 2 is the union of clusters, which contain sequences from both P 1 and P 2 (see Fig. 1

);
The parameter k is a scale of clustering. In particular, populations P 1 and P 2 are separable, if P 1 ∩P 2 = ∅, while the fact that P 1 ∩P 2 = ∅ indicates that they may Dashed cluster is the k-clustered intersection. Direction of transmission is from the blue population to the red population be genetically related. In the most extreme case P 1 ∩P 2 = P 1 ∪ P 2 , i.e. populations are completely inseparable under the scale k.
The degree of confidence that the samples are genetically close is represented by the relatedness depth d(P 1 , P 2 ), which is calculated by Algorithm 1. Simply speaking, Algorithm 1 tries to recursively separate populations P 1 and P 2 . At each iteration, k-clustered intersection is calculated. If two populations are separable, then the algorithm stops. Otherwise, it continues the separation of sequences from P 1 and P 2 within their k-clustered intersection. The separation depth is a depth of this recursion. It is possible that at some iterations of Algorithm 1 two populations are completely inseparable under a current clustering scale. In this case, the scale k is increased and k-clustered intersection is recalculated. The initial value of k used by Algorithm 1 is k = 2.

Algorithm 1 ReD (relatedness depth calculation)
Input Two sets of viral sequences P 1 , P 2 .
if I = P 1 ∪ P 2 then 7: I ← P 1 ∩P 2 13: end while k-clustered intersections depend on a clustering method. Our implementation uses a hierarchical clustering based on neighbor-joining tree (as implemented in Matlab (MathWorks, Natick, MA)). The algorithm utilizes a standard Jukes-Cantor distance which is based on the simplest substitution-based evolutionary model.
Clustered intersections also allow for estimating the direction of transmissions. It is reasonable to assume that if two hosts share a population, then a host with more heterogeneous population is more likely to be the transmission source [18]. Formally, if I = P 1 ∩P 2 , P 1 ⊆ I and P 2 \I = ∅, then we assume that probable transmission direction is from P 2 to P 1 (see Fig. 1). The direction is defined according to the first occurrence of such situation during execution of Algorithm 1. Note that in some cases direction may not be identified.
Given the collection of viral populations P = {P 1 , ..., P n }, ReD produces the weighted directed genetic relatedness graph G = (V , A, d) with V = P. An arc (P i , P j ) is in A whenever populations P i and P j are genetically related, i.e., have sufficiently high relatedness depth; the direction of an arc corresponds to the estimated direction of transmission and its weight to the relatedness depth. Transmission clusters are calculated as weakly connected components of the digraph G. To determine transmission clusters, the simplest depth cutoff T = 1 can be used. In addition, only components containing at least one arc a of weight d(a) ≥ 2 were considered as reliable. For each reliable component, a source s of the corresponding outbreak is identified as a vertex with highest eigenvector centrality.

Viral outbreak inference (VOICE) simulation method
VOICE is another approach to predict epidemiological characteristics. Unlike ReD, it is not deterministic. Instead, it simulates the process of evolution from one viral population (source) into another (recipient) as a Markov process on a union of both populations. VOICE starts evolution from a subset of source sequences called the border set and estimates the number of generations required to acquire a genetic heterogeneity observed in the recipient.
Formally, given two sets of viral sequences P 1 and P 2 , VOICE simulates viral evolution to estimate times t 12 and t 21 needed to cover all sequences from the recipient population under the assumptions that first and second host were sources of infection. Based on the value min{t 12 , t 21 }, the algorithm decides whether the populations are related. The direction of possible transmission between the related pair is assumed to follow the direction which requires less time.
The simulation starts from the δ-border set B 1 , which contains viral variants that are likely the closest to variants transmitted between P 1 and P 2 . It is defined as the set of vertices of P 1 minimizing pairwise Hamming distance D between vertices from P 1 and P 2 up to a constant δ: Fig. 2). The constant δ is a parameter, with the default value 1.
The simulated evolutionary process is carried out in the evolutionary space represented by the variant graph G(B 1 , P 2 ), which is constructed as follows. First, construct a union of all minimal spanning trees of the complete graph on a vertex set B 1 ∪ P 2 with the edge weights equal to Hamming distances between variants (sometimes referred to as a pathfinder network PFNet(n − 1, ∞) [19,20]). Then substitute every edge in graph with two directed edges of the same weight. Next, subdivide each The simulation starts from all border vertices B 1 and runs until all the vertices of the population P 2 are reached. At the beginning of the simulation, border vertices get count equal to 1, and the rest of the vertices get count 0. Each tact simulates variants replication by updating vertex counts according to one of the three following scenarios happening with the specified probabilities (see Fig. 4). First, if during replication there are no mutations, then the vertex v replicates itself and its count label is incremented. This happens with the probability p 1 (1). Second, the vertex can mutate into one of its neighboring vertices with probability p 2 (see Eq. (2)), in which case the count of the neighbor is incremented. Finally, with probability p 3 , vertex does not produce any viable offspring, in which case vertex counts are not changed. If the count of a vertex reaches the maximum allowed variant population size C max , then it is not increased. The probabilities of these scenarios are calculated as follows: where is the mutation rate, L is the genome length and deg − (v) is an outdegree of a vertex v. Algorithm 2 represents the flow of the method. The time t 12 is computed as the average over s simulations. The same procedure is repeated for the opposite direction of the transmission with its border set B 2 and the time t 21 is computed. The value min{t 12 , t 21 } determines which direction of transmission is more likely.

Algorithm 2 VOICE (Viral Outbreak InferenCE)
Input Two sets of viral variants P 1 , P 2 . Output Time t 1,2 to evolve from P 1 to P 2 .

Data normalization
The sizes of observed intra-host viral populations may significantly vary due to sampling and sequencing biases. Since the larger population will require more time to cover, the estimation of t 12 and t 21 could be biased. VOICE avoids such biases by normalizing the intra-host population sizes. The deterministic normalization partitions each viral population into q clusters using hierarchical clustering and each cluster is replaced with the consensus of its members. The subsampling normalization randomly chooses q sequences from each population. The procedure is repeated r times, and the final result is an average over all subsamplings.

Identification of genetic relatedness, transmission directions, clusters and sources of outbreaks
Analogously to ReD, VOICE produces a weighted directed genetic relatedness graph G = (V , A, w) with V = P. An arc P i P j is in A whenever populations P i and P j are genetically related, i.e., value min{t ij , t ji } is less than a threshold. Weakly connected components of G represent transmission clusters or outbreaks. To determine the source of each outbreak, we build a Shortest Paths Tree (SPT) for every vertex in the corresponding component. The source is estimated as the vertex with an SPT of minimal weight.

MinDistB method
The method extends the MinDist approach proposed in [12], which defines the distance between viral populations as the minimum Hamming distance between their representatives. The new approach also takes into account sizes of border sets, on which the minimum distance is achieved.
Formally, given an integer δ (by default δ = 1), the δcrossing between populations P 1 and P 2 is the set of pairs of variants (u, v) from different populations, the Hamming distance D(u, v) between which is within δ from the minimum Hamming distance: Fig. 2). Our empirical study shows that in case when the crossing is large (see Fig. 2a), then the populations are less likely to be related than in case when the borders are small (see Fig. 2b).
This effect can be intuitively explained. Two related populations likely diverge away from the common ancestor and from each other, and their borders are formed by few old survived variants closest to the common ancestor. Two unrelated populations diverging from two different ancestors may in time reduce minimum distance from each other randomly and closest variants are relatively young and abundant (see Fig. 5).
We define a δ-distance between populations P 1 and P 2 as follows: where c = 3 is an empirically chosen constant.

Identification of genetic relatedness, transmission clusters and sources of outbreaks
For MinDistB methods, genetic relatedness graph G = (V , E, w) is a weighted undirected graph with the vertex set V = P and an edge of weight w i,j connecting populations P i , P j whenever w i,j = D δ (P 1 , P 2 ) does not exceed a threshold. Transmission clusters are estimated as connected components of the graph G. For each transmission cluster its source could be inferred either as a vertex with maximum eigenvector centrality or as a vertex with the shortest paths tree of minimal weight.

Results and discussions
ReD, VOICE and MinDistB were validated using experimental outbreak sequencing data, and their predictions were compared with the previously published MinDist method [12].

Data sets
We used the benchmark data presented in [12], which is a collection of HCV intra-host populations sampled from 335 infected individuals.
• Outbreak collection contains 142 HCV samples from 33 epidemiologically curated outbreaks reported to Centers for Disease Control and Prevention in 2008-2013. Outbreaks contain from 2 to 19 samples. Epidemiological histories, including sources of infection, are known for 10 outbreaks. • Collection of 193 epidemiologically unrelated HCV samples.
All viral sequences represent a fragment of E1/E2 genomic region of length 264 bp.

Prediction of epidemiological characteristics
The proposed methods were used to infer the following epidemiological characteristics: • genetic relatedness between populations; • transmission clusters representing outbreaks and isolated samples; • sources of outbreaks; • transmission directions between pairs of samples.
Comparison results are collected in Table 1. The variants of VOICE with deterministic and subsampling normalizations are referred to as VOICE − D and VOICE − S, and for them we used the normalization constants q = 10 and q = 4, respectively. For all VOICE runs, five independent simulations were performed, and the averages over that simulations are reported. For each simulation, VOICE-S performs 50 subsamplings, and the results of the algorithm are averaged over all subsamplings. For MinDist, sources of outbreaks were identified as vertices with highest eigenvector centralities in the corresponding genetic relatedness graphs, since for MinDist this method outperform the shortest path tree-based approach.  For each method, the sensitivity (i.e. the percentage of detected related pairs) was calculated ( Table 1). The highest sensitivity is achieved by MinDistB method. Figure 6 depict ROC curve for the tested methods (ReD is not present, since for this method only few viable discrete thresholds are possible). MinDistB and VOICE − D have highest areas under a curve value followed by MinDist and VOICE − S.

Detection of transmission clusters
The similarities between true and estimated partitions into transmission clusters were measured using an editing metric [21], which is defined as the minimum number of elementary operations required to transform one partition into another. An elementary operation is either merging (joining of two clusters into a single cluster) or division (partition of a cluster into two clusters) [21]. We calculate sensitivity by normalizing an editing distance E by dividing it by the number N of elementary operations required to transform trivial partition (i.e. the partition into singleton sets) into the true partition. The number N is equal to n − k, where n is the total number of samples and k is the number of true clusters: Table 1 shows that MinDistB and MinDist demonstrate the highest sensitivity.

Source identification
The accuracy of the source identification is defined as the percentage of correctly predicted sources for outbreaks, where the correct sources are known. The Source section  Table 1 shows that the best results are achieved by ReD and VOICE − S which were able to detect sources in 90% of cases. At the same time, MinDist and MinDistB, which are not able to identify transmission directions, were significantly less accurate.

Transmission direction
Among tested algorithms, only ReD and VOICE allows for detection of transmission directions. For that algorithms, percentages of correctly predicted pairs source-recipient were calculated (Table 1). Here the highest accuracy of 87.1% was achieved by ReD and VOICE − S.

Running time
All tests were performed on PC with DDR3-1333MHz 4 GBx12 RAM and 2 Intel Xeon-X5550 2.67 GHz processors. The fastest algorithms were MinDist and MinDistB, with running times 9 ms for a pair of samples in our dataset. ReD requires ∼ 0.1s per pair of samples, While the running time of VOICE is ∼ 35 s per pair.

Conclusions
Currently, a molecular viral analysis is one of the major approaches used for investigations of outbreaks and inference of transmission networks. Although modern sequencing technologies significantly facilitated molecular analysis, providing unprecedented access to intra-host viral populations, they generated novel bioinformatics challenges.
This work proposed three novel algorithms for the investigation of viral transmissions based on analysis of the intra-host viral populations, which allow clustering genetically related samples, infer transmission directions and predict sources of outbreaks. Evaluation of the algorithms on experimental data from HCV outbreaks demonstrated their ability to accurately reconstruct various transmission characteristics. It should be noted, that although ReD was proved to be accurate in estimation of transmission clusters, directions and sources, its accuracy of relatedness detection is lower than for other evaluated methods. However, the advantage of this method over other methods is its non-parametricity (i.e. independence from virus-specific and genomic region-specific thresholds), which makes it more universally applicable and extremely useful in situations, when the lack of training data does not allow to establish reliable relatedness thresholds.
The clustering-based ReD approach may be further improved using a more scalable clustering similar to the algorithm proposed in [17]. The simulation-based approach VOICE presented here may be further improved by incorporating more complex viral evolution models taking into account cell proliferation rate and immune responses against viral variants.