bayroot: Bayesian sampling of HIV-1 integration dates by root-to-tip regression

Abstract The composition of the latent human immunodeficiency virus 1 (HIV-1) reservoir is shaped by when proviruses integrated into host genomes. These integration dates can be estimated by phylogenetic methods like root-to-tip (RTT) regression. However, RTT does not accommodate variation in the number of mutations over time, uncertainty in estimating the molecular clock, or the position of the root in the tree. To address these limitations, we implemented a Bayesian extension of RTT as an R package (bayroot), which enables the user to incorporate prior information about the time of infection and start of antiretroviral therapy. Taking an unrooted maximum likelihood tree as input, we use a Metropolis–Hastings algorithm to sample from the joint posterior distribution of three parameters (the rate of sequence evolution, i.e., molecular clock; the location of the root; and the time associated with the root). Next, we apply rejection sampling to this posterior sample of model parameters to simulate integration dates for HIV proviral sequences. To validate this method, we use the R package treeswithintrees (twt) to simulate time-scaled trees relating samples of actively and latently infected T cells from a single host. We find that bayroot yields significantly more accurate estimates of integration dates than conventional RTT under a range of model settings.


Introduction
Root-to-tip (RTT) regression is a simple method to locate the earliest point in time in a phylogenetic tree (i.e., rooting the tree; Huelsenbeck, Bollback, and Levine 2002), to measure the rate of evolution (Drummond et al. 2003), or to reconstruct the divergence times of common ancestors. This method assumes the existence of a strict molecular clock, i.e., that the rate of molecular evolution is roughly constant (Bromham and Penny 2003). Accordingly, the number of nucleotide substitutions accumulating in a sequence should increase linearly over time. Hence, this method is a linear regression of the evolutionary divergence of sequences from their common ancestor against the times when those sequences were observed. The primary input of RTT regression is an unrooted phylogenetic tree with branch lengths measured in units of evolutionary time (i.e., the expected number of substitutions per site; Tajima and Nei 1984), which is the standard output of maximum likelihood methods for reconstructing phylogenies. The tips of the tree representing observed sequences are labelled with sampling times. Thus, RTT becomes an optimization over three parameters: the location of the root in the tree, the time associated with the root (x-intercept), and the molecular clock (slope of regression).
RTT has a broad range of applications. Since many viruses have a very rapid rate of evolution, RTT can be applied to sequences collected over a number of months or years. For instance, RTT has recently been used to estimate the origin date and clock rate of severe acute respiratory syndrome coronavirus 2 within the first few months of the pandemic (Duchene et al. 2020). We are particularly interested in the use of RTT to estimate the integration dates of HIV-1 proviruses within hosts (Jones et al. 2018). HIV-1 converts its RNA genome into double-stranded DNA that becomes integrated into the host genome as part of the virus replication cycle. In some cases, this integrated provirus becomes reversibly dormant in a transcriptionally inactive host cell (Siliciano and Siliciano 2004). This long-lived reservoir of latently infected cells is the primary obstacle to an effective cure for HIV-1. Consequently, characterizing the composition and dynamics of the latent reservoir has significant implications for HIV-1 cure research (e.g., Gondim et al. 2021).
For instance, we can estimate the molecular clock (the slope of the regression) from longitudinal samples of plasma HIV-1 RNA sequences before the start of antiretroviral therapy (ART). If we reconstruct a tree relating both these RNA sequences and proviral sequences from the latent reservoir, we can then use our clock estimate to extrapolate integration dates for the latter (Jones et al. 2018). This relies on the assumption that the integrated HIV-1 genome ceases to accumulate mutations upon integrating into the host genome. Since we are reconstructing a tree relating individual HIV-1 sequences from a single host, the resulting tree is technically a 'genealogy' instead of a phylogeny, and we are counting mutations in an individual lineage instead of the accumulation of substitutions in a population. Consequently, we will avoid using the terms 'phylogeny' and 'substitution' from this point onward.
Due to its simplicity, RTT has a number of significant limitations. It implicitly assumes that the input tree is known without error. In practice, each proviral sequence is mapped to the regression line for a given tree to yield one and only one estimate of its integration date. Although one could generate interval estimates for integration dates, this is not trivial because we need to consider the joint confidence for the regression slope and intercept and to invert the model to predict dates. Furthermore, variation in the number of mutations after a given amount of time is expected, even under a strict molecular clock (Langley and Fitch 1974). In other words, a proviral sequence may by chance carry more mutations than expected given its actual date of integration. This random outcome can cause RTT to project a sequence's integration date estimate into the future, past its time of sampling or even past the start of ART, when the infection of new cells should be nearly completely suppressed.
Here we describe a Bayesian extension of the RTT method to estimate HIV-1 integration dates. Adopting a Bayesian approach provides a means of quantifying our uncertainty in estimating integration dates, as well as incorporating prior information about the time of infection and the start of ART. We detail our implementation of this method as an R package called bayroot and use a simulation model of within-host population dynamics to validate bayroot in comparison to conventional RTT methods.

Regression model
We start with an unrooted tree T relating n observed sequences. A strict molecular clock assumes that mutations accumulate at a constant rate over time, such that the number of mutations per unit time follows a Poisson distribution. Let Y i be the number of mutations in the i th observed sequence, which is determined by the location of the root in T. Since Y i is an integer-valued outcome, we must rescale the input tree T by multiplying its branch lengths by the sequence length, such that lengths are in units of the expected number of mutations per genome. Note that because these measures of evolutionary time are derived from a continuous-time Markov model of sequence evolution, multiple hits and reversions are accounted for. Let t 0 be the origin time associated with the root. Let Δ be the time that has elapsed between the i th sample and the root. The log-likelihood for a set of RNA sequences { , Δ } is: where Γ( ) is the gamma function. Equation (1) is sometimes referred to as the Langley-Fitch model (Langley and Fitch, 1974).
Following prior work (Huelsenbeck, Bollback, and Levine 2002;Didelot et al. 2018), we assume a uniform prior distribution for possible locations of the root over the entire length of the tree. We also assume a uniform prior distribution for t 0 , as standard practice in applications of Bayesian inference to HIV-1 infections (e.g., Sweeting et al. 2010;Stirrup and Dunn 2018). If a seroconversion window, i.e., the time interval between the last HIV seronegative visit and the first seropositive visit, is available for the host individual, these visit dates can be used to set lower and upper bounds for the uniform prior on t 0 . If this information is not available, then these bounds may be based on other data such as viral load and CD4 cell count measurements (Pantazis et al. 2019). Otherwise, one may set the upper bound to the first sample collection date, and the lower bound may be derived from regional estimates of the time to HIV-1 diagnosis (e.g., Van Sighem et al. 2015). Finally, we assume a lognormal prior distribution on the clock rate , which can be informed by previous measurements of HIV-1 mutation rates within hosts e.g., Alizon and Fraser (2013).
With these prior distributions and the model likelihood, we implemented a Metropolis-Hastings sampling algorithm in R. A proposal function shifts the root along a branch by some distance ∼ Unif(0, ), selecting a branch at random if it encounters an internal node (i.e., split) as it traverses the length of the tree. If, however, a terminal node is encountered before the root has been shifted by distance d, then the remaining distance is travelled by reflecting back from this terminus. This results in a symmetric proposal distribution. We also used a uniform proposal ′ ∼ Unif( − , + ) for the clock rate and a truncated normal proposal ′ 0 ∼ ( 0 , ) for the origin time. The sampling algorithm returns an S3 object storing a data frame of sampled parameter values and a character vector of sampled trees serialized into Newick strings.

Sampling integration dates
Given a posterior sample of parameters Y, , and t 0 , we need to propagate this information to the distribution of integration times associated with proviral DNA sequences sampled post-ART initiation. Using Bayes rule, the probability of integration time t j for the th DNA sequence given divergence Y j is: where we index by j instead of i to emphasize a shift from HIV-1 RNA to DNA sequences. We assume a uniform prior for integration times, ( ) = ( − 0 ) −1 for 0 ≤ ≤ and ( ) = 0 otherwise, where t 0 is the origin date and T is the time of ART initiation. Substituting Equation (1) (1) and (3) into (2), we can write: where we use a shorthand = ( − 0 ) to simplify the equation. To generate a sample of integration dates from this distribution, we use a simple rejection sampling method. For a given posterior sample of Y j , , and t 0 , we use Brent's algorithm to find the maximum of Equation (4), initialized at the midpoint = 0 + ( − 0 )/2. This maximum was used as an upper bound for rejection sampling for times drawn from the prior distribution, ∼ Unif( 0 , ).
The Bayesian regression and integration date sampling methods described above were implemented in R as a package called bayroot. All source code is publicly available under the MIT license at https://github.com/PoonLab/bayroot. Figure 1. A schematic diagram of the compartmental model used to simulate cell population dynamics. Each box represents a well-mixed population of cells sharing the same rate parameters. We assume that only actively infected cells release virus particles that go on to infect other susceptible cells.

Simulating data
To validate the above method, we used the R package twt ('treeswithintrees', https://github.com/PoonLab/twt) to simulate cell population dynamics forward in time and then to simulate trees by sampling lineages backwards in time to their common ancestors. This package uses exact stochastic simulation of discrete events (Gillespie, 1977). In brief, it calculates the total rate of all events (Λ), draws an exponentially distributed waiting time to the first event ∼ Λ exp(−Λ), and then draws a uniform random number to determine which event occurs. We implemented a compartmental model of cell population dynamics (Fig. 1) that can be represented by the following set of differential equations: This model is a simplified version of the system described by Rong and Perelson (2009). Most notably, our version does not model changes in the viral load. T represents a finite population of naive CD4+ T cells from which the populations of active (A) and resting (latent, L) cells are replenished at rates and (1 − ) , respectively, for 0 ≤ ≤ 1. The S and I subscripts denote susceptible and infected subpopulations of active and latent cells. A branching event ( xy ) requires a source cell to induce a target cell to undergo a change of state (switch compartments from x to y). For example, AA represents the infection rate of a susceptible active T cell by a virus released from an actively infected cell. We assume that virus replication is completely blocked by the initiation of ART at time * (Kearney et al. 2014;Brodin et al. 2016), such that • ( ≥ * ) = 0. A transition event occurs when a cell spontaneously migrates from compartments x to y at rate m xy . For example, m LA represents the reactivation rate of a latent cell. Finally, we assume constant cell death rates x for each compartment x.
The simulation is initialized at time zero with user-specified population sizes of susceptible cells in each compartment, and a single actively infected cell, (0) = 1. We simulated the integervalued population size trajectories { , , , , }( ) forward in time until a stopping time of t = 20 simulation time units. We generated 50 replicate sets of trajectories under two different scenarios by exact stochastic simulation. The rate parameters were set to the following values: r = 0.02, k = 0.5, ( < * ) = 0.002, ( < * ) = 10 −4 , = = 0.001, = 0.005, = 0.1, and = 0.001. ART was initiated at * = 10 time units post-infection in Scenario 1 and at * = 15 in Scenario 2. For each iteration of the simulation, we calculated the rates for every type of event, adjusted by the respective compartment size at the current time t. For example, the rate of transmissions from A I to A S was set to ( ) ( ) ( ). We drew an exponential waiting time given the total rate of all event types: and then determined which event type occurred with probability ( ) ( ) ( )/Λ( ) or ( ) ( )/Λ( ). Next, we incremented or decremented the respective population sizes for compartments affected by the event type. The time, type, and compartments of this event is recorded in a log that is later used to simulate trees. An example set of population size trajectories simulated using this algorithm under Scenario 1 is illustrated in Fig. 2.
To generate a tree relating the sampled lineages in twt, we applied another exact stochastic simulation algorithm in reverse time. For the 50 replicate sets of trajectories generated under Scenario 1, we sampled 10 HIV-1 RNA lineages at times t = 3, 6, and 9 post-infection. For trajectories generated under Scenario 2, we sampled 10 HIV-1 RNA lineages at t = 11, 13, and 15 post-infection. In both scenarios, we sampled 10 latently infected cells at t = 20 post-infection, for a total of 40 sampled lineages per replicate tree. These lineage sampling times defined the initial conditions for the reverse-time simulation of trees. Next, the algorithm samples events from the log generated in the forward-time simulation to build up a tree relating the sampled lineages. The stopping condition of the tree sampling algorithm is that the sampled lineages converge to a single common ancestor, which becomes the root.
We modified twt to output a Newick serialization of this 'transmission tree' among cells, labelling tips with sampling times. This tree included internal nodes with only one descendant branch, representing lineage state transitions, or transmissions to/from an unsampled lineage. Internal nodes were labelled with strings encoding the event type, node states (compartments), and unique identifiers for the individual cells involved. These annotations enabled us to 'colour' the branches of the tree by lineage state. The true integration dates for sampled latently infected cells were recorded to a separate file. An example of a tree generated by this process is shown in Fig. 2.
To simulate molecular evolution, we collapsed all branches corresponding to latently infected cells and used the resulting tree as input for INDELible (version 1.03; Fletcher and Yang 2009). We assigned an HIV-1 env sequence at the root (Gen-Bank accession number AY772699). This sequence is one of the HIV-1 subtype C references curated by the Los Alamos National Laboratory HIV Sequence Database (http://www.hiv.lanl. gov). We configured INDELible to use the Tamura-Nei (TrN) model of nucleotide evolution with transition rates 1 = 4 and 2 = 8 and stationary base frequencies = 0.4 and = = =  (Latentcomp_N) are collapsed prior to simulating virus evolution. This tree visualization was generated from the same S3 object using the R package ggfree (https://github.com/ ArtPoon/ggfree).

0.2.
In addition, we rescaled the tree such that the expected number of mutations per nucleotide site over its entire length was 1. Finally, we used FastTree (version 2.1.11, compiled for double precision; Price, Dehal and Arkin 2010) to reconstruct unrooted maximum likelihood trees from these simulated alignments.

Model validation
We ran our Bayesian sampling method on each of the 100 simulated trees for 2×10 4 steps, discarding a burn-in of 2,000 steps and thinning the remaining chain down to 1,000 steps. We set the lognormal prior distribution on clock rates to = −5 and = 2, and the uniform prior distribution on root dates to a minimum of one simulation time unit before the true origin and a maximum of the first HIV RNA sampling time. In addition, we set the proposal parameters to = 0.01 for the root location, = 0.33 for the time of infection, and = 0.01 for the clock rate. In preliminary runs, we found that these settings were sufficient for replicate chain samples to converge to the same posterior distribution. To sample integration dates for each DNA sequence, we further thinned the chain down to a total of 200 samples from the posterior distribution to reduce auto-correlation.
To compare our results against conventional RTT regression, we censored the sampling times associated with tips that represented DNA sequences and then rooted the tree using the rtt function in the R package ape (implementation by R. M. McCloskey; Paradis and Schliep 2019). We extracted the RTT distances from the resulting tree and fit a simple linear regression of these distances against sampling times. Finally, we used the inverse.predict function from R package chemCal (Massart et al. 1997) to extract the predicted integration dates for the 200 samples from the posterior distribution.
To quantify the discordance between estimated () and actual (t) integration dates, we calculated the root mean square error, RMSE = √∑ =1 (̂− ) 2 / , where n is the number of DNA sequences. We also calculated the mean absolute percentage error, MAPE = 100%× ∑ =1 (|̂− |/ ) / , as an alternative measure of estimation error that is less sensitive to extreme values. We used a paired Wilcoxon rank-sum test to evaluate the significance of differences between the RMSE (or MAPE) values obtained from bayroot and conventional RTT.

Results
To compare conventional RTT regression to our Bayesian approach (bayroot), we simulated the proliferation of HIV-1 among active and latent CD4+ T cells with an exact stochastic method. Our simulation workflow yielded a total of 100 trees reconstructed from HIV-1 RNA and integrated proviral DNA sequences. We assumed that HIV-1 RNA was sampled before the start of ART and that HIV-1 proviral DNA was sampled from the latent reservoir post-ART initiation (Fig. 2). Fifty of the trees were simulated such that HIV-1 RNA was sampled at three time points starting at 3 simulation time units post-infection, at intervals of 3 time units (Scenario 1). For the remaining 50 trees, HIV-1 RNA sampling was delayed to 11 time units post-infection and taken at narrower intervals of 2 time units (Scenario 2). Figure 3 compares the estimates of HIV-1 DNA integration dates produced by RTT and bayroot. Under Scenario 1, both methods tended to produce similar estimates because the sampling conditions were favourable for fitting the molecular clock (Fig. 3A). The median RMSE was 0.947 for RTT and 0.889 time units for bayroot. On a case-by-case basis, bayroot produced significantly more accurate estimates than RTT (paired Wilcoxon test, = 3.55×10 −4 , Fig. 3B). The overall difference between estimates was numerically small. For instance, the median difference in RMSE between RTT and bayroot was 0.059 (interquartile range, IQR = 0.004 − 0.201) time units. In some cases, integration dates were mapped by RTT to the time period after ART initiation, leading to higher RMSE values (Fig. 3B). Since bayroot incorporates the prior information that HIV-1 integration should not occur during effective ART, its estimates are constrained to times preceding ART initiation. We found no significant difference between methods (paired Wilcoxon test, P = 0.66; Fig. S1A) when error was measured by MAPE, which is less influenced by the largest errors than the RMSE. Furthermore, 89.8 per cent of the actual integration dates fell within the 95 per cent credible intervals generated by bayroot.
For Scenario 2, both methods became less accurate with median RMSEs of 2.79 and 2.10 time units for RTT and bayroot, respectively (Fig. 3D). Because the sampling times of the RNA sequences used to calibrate the molecular clock were closer together and more distant from the actual time of infection in this scenario (Fig. 3C), we are less certain about all three parameters of the regression, i.e., the location of the root in the tree, the time associated with the root (x-intercept), and the clock rate (slope). Under these conditions, bayroot benefits from having prior information about the time of infection. For our simulations where t = 0 is the actual time, we constrained the time of infection variable to the interval from −1 to 3 simulation time units. (In practice, one could use a uniform prior bounded by the last seronegative and first seropositive dates for that individual.) In other words, prior information about the time of infection 'anchors' the RTT regression when there are insufficient data to accurately estimate the x-intercept (Fig. 3C). As a result, bayroot was significantly more accurate than RTT under this second scenario (paired Wilcoxon test, = 3.82×10 −7 , Fig. 3D). The median difference in RMSE between RTT and bayroot was 0.405  time units-about seven times greater than scenario 1. In addition, this difference between methods remained significant when error was measured as MAPE (paired Wilcoxon test, 3.4×10 −7 ; Fig. S1B). Decomposition of the mean squared error into bias and variance components indicated that the difference in RMSE was driven more by a reduction in bias in either scenario (Fig. S2). Finally, 89.4 per cent of actual integration dates fell within the 95 per cent credible intervals from bayroot. There was no significant association in this outcome between scenarios (Fisher's exact test, odds ratio = 0.5, P = 0.34).
Running a chain sample for 2×10 4 steps in bayroot required a median of 47.3 (IQR 45.0 − 48.8) seconds in R version 4.2.0 for Linux on a single core of an AMD Ryzen ThreadRipper 1950X processor. If the user is not processing a large number of samples, as we have done here for replicate simulations, we suggest running chain samples for at least 10 6 steps with a thinning interval of 500 steps.

Discussion
The reconstruction of HIV-1 integration dates is a challenging problem. Cells carrying replication-competent provirus in the latent reservoir comprise a small fraction of resting CD4+ T cells (approximately 0.01-10 per million cells; Prodger et al. 2020;Crooks et al. 2015). Sequences of plasma HIV-1 RNA or integrated DNA often cover only a portion of the virus genome (Laskey et al. 2016), making it difficult to resolve their evolutionary relationships. In addition, the development of phylogenetic and statistical methods for analysing these sequence data (Ferreira et al. 2021) has lagged behind ongoing improvements in molecular techniques (Cho et al. 2022;Sun et al. 2022). Here we have described a Bayesian extension of a widely used regression method for estimating HIV-1 integration dates from sequence variation in the latent reservoir (Jones et al. 2018;Brodin et al. 2016;Brooks et al. 2020). Our method provides a means of incorporating additional data about the infection-e.g., the estimated date of infection, time of ART initiation, and previous measures of the rate of HIV-1 evolution within hosts-as prior information. Furthermore, adopting a Bayesian approach enables us to quantify our uncertainty about parameter estimates by sampling from the posterior distribution. We expect this will be important for studies where there is limited access to longitudinal plasma samples for retrospective sequencing, for instance.
Of course, our method also retains some significant limitations of conventional approaches to RTT regression. First, we are assuming that the unrooted tree relating HIV-1 RNA and DNA sequences is known without error. It is possible to relax this assumption by adopting a hierarchical approach and replicating our regression analysis on a posterior sample of unrooted trees that may be generated by a Bayesian phylogenetic program such as MrBayes (Ronquist and Huelsenbeck 2003) or BEAST (Drummond and Rambaut 2007). This is less efficient than sampling from the joint posterior distribution of unrooted trees, mutation model, and the RTT regression parameters. Additionally, we are assuming that the divergence of each sequence is an independent outcome. This convenient approximation is clearly untrue because of identity by descent: sequences that share a more recent common ancestor will have a similar RTT distance because they have inherited the same set of mutations. It is possible to overcome this limitation by adapting the covariance matrix of the regression model to the phylogenetic structure of the data (Neher 2018).
Not all studies use RTT regression to estimate HIV-1 integration dates. For example, one of the methods described by Abrahams et al. (2019) uses approximate maximum likelihood to reconstruct a host-specific tree relating HIV-1 RNA and DNA sequences and then locates the closest tip representing an RNA sequence for every tip representing a DNA sequence, which is assigned the sampling time of the RNA tip. Hence, the DNA sequences can only be associated with a finite number of integration dates. This approach benefits from extensive sampling of HIV-1 plasma RNA over the time period spanning the start of infection to ART initiation. If the ancestral HIV-1 RNA sequence most closely related to an HIV-1 provirus is not represented in the tree, then the latter would be mapped to another branch that may be associated with a sampling time that does not accurately estimate the integration date. In contrast, RTT methods directly use the number of mutations carried by an individual DNA sequence to estimate its integration date. The other sequences are used to calibrate the linear model mapping this divergence to the timeline.

Data availability
The R package bayroot is publicly available under the MIT license at https://github.com/PoonLab/bayroot. We have also provided the simulated data and R scripts used to perform the method validation and generate figures in this repository. The R package twt is publicly available under the GNU Affero General Public License v3.0 (AGPL-3.0) at https://github.com/PoonLab/twt.

Supplementary data
Supplementary data are available at Virus Evolution online.