Phylogenetic estimation of the viral fitness landscape of HIV-1 set-point viral load

Abstract Set-point viral load (SPVL), a common measure of human immunodeficiency virus (HIV)-1 virulence, is partially determined by viral genotype. Epidemiological evidence suggests that this viral property has been under stabilising selection, with a typical optimum for the virus between 104 and 105 copies of viral RNA per ml. Here we aimed to detect transmission fitness differences between viruses from individuals with different SPVLs directly from phylogenetic trees inferred from whole-genome sequences. We used the local branching index (LBI) as a proxy for transmission fitness. We found that LBI is more sensitive to differences in infectiousness than to differences in the duration of the infectious state. By analysing subtype-B samples from the Bridging the Evolution and Epidemiology of HIV in Europe project, we inferred a significant positive relationship between SPVL and LBI up to approximately 105 copies/ml, with some evidence for a peak around this value of SPVL. This is evidence of selection against low values of SPVL in HIV-1 subtype-B strains, likely related to lower infectiousness, and perhaps a peak in the transmission fitness in the expected range of SPVL. The less prominent signatures of selection against higher SPVL could be explained by an inherent limit of the method or the deployment of antiretroviral therapy.


Introduction
At the end of 2020, an estimated 38 million people worldwide were living with human immunodeficiency virus (HIV), with roughly 73 per cent of these individuals accessing antiretroviral therapy (ART; www.unaids.org). For many, HIV has become a manageable chronic condition, thanks to the treatment becoming increasingly accessible through healthcare policies and infrastructural developments. However, much work remains to end the acquired immune deficiency syndrome (AIDS) epidemic as a global health threat. Ongoing transmission in under-surveyed key populations still feeds into the epidemic (Nduva et al. 2020). Therefore, understanding the risk factors for transmission, including viral genetic ones, is paramount for the control and mitigation of the epidemic.
Untreated HIV infections progress in three stages: acute, chronic, and AIDS stage (Alizon et al. 2010). In the chronic stage, a relatively stable viral count per unit volume of plasma blood is established in the viral host (de Wolf et al. 1997). Viral load gradually increases during the chronic stage (O'Brien et al. 1998;Sabin et al. 2000) and accelerates during the final stage of AIDS. The relatively stable viral load over the chronic stage is termed set-point viral load (SPVL), which varies between individuals from 10 2 to 10 6 copies/ml (Bonhoeffer et al. 2003;Fraser et al. 2014). SPVL has been observed to correlate with disease progression in the absence of treatment, with higher viral loads leading to more rapid progression to AIDS (Fraser et al. 2007(Fraser et al. , 2014. While host genetics (e.g. human leukocyte antigen (HLA) system) and other environmental factors contribute to the variation of SPVL between individuals, viral genetics have been found to account for a third of the heritability of SPVL with data from both European (Blanquart et al. 2017;Bertels et al. 2018;Mitov and Stadler 2018) and sub-Saharan African cohorts (Hollingsworth et al. 2010;Lingappa et al. 2013;Yue et al. 2013).
SPVL is a proxy for the amount of circulating virus within an infected individual and therefore determines the potential infectiousness of the individual. SPVL can also be viewed as a reflection of the burden imposed on an infected individual's immune system, which affects the duration of the chronic stage of the infection in the absence of treatment. During the chronic stage, the combined effect of infectiousness and duration of infectiousness on the average number of transmissions can be summarised in a single-peaked landscape for the transmission potential as a function of SPVL (Fraser et al. 2007). Surveillance data showed an approximately lognormal distribution for SPVL, with the most common values being those expected to have the greatest transmission potential, suggesting a selective process acting to balance infectiousness and duration of chronic stage to maximise transmission potential. By comparing viral loads of infected individuals found to be in transmission clusters against non-cluster members, Wertheim et al. showed that higher viral loads at diagnosis are selected among HIV transmission networks in the USA (Wertheim et al. 2019).
In classical epidemiological terms, the transmission potential of a pathogen is related to its basic reproductive number within a given population. With genetic data and phylogenetic methods, the transmission potential can be calculated in different ways (Faria et al. 2018). Phylogenetic trees are commonly used to reconstruct relationships between sequences from samples of pathogens in calendar time or molecular time. Signals extracted from phylogenies have proven useful in inferring the fitness of the sampled organisms (Neher, Russell, and Shraiman 2014). Specifically, fitness estimated from phylogenies of sequences across time can inform evolutionary trajectories of genotypes present in an asexual pathogen population, as demonstrated in retrospective predictions of circulating influenza A/H3N2 strains (Neher, Russell, and Shraiman 2014). One method developed to achieve this is the local branching index (LBI), which calculates the integrated exponentially discounted tree length surrounding a focal node/tip with a timescale parameter (τ ) denoting the tree neighbourhood within which fitness is 'remembered' (Neher, Russell, and Shraiman 2014). In other words, the LBI will be larger at a focal node/tip when the tree branches more frequently near this node/tip, and a larger rate of branching reflects a higher transmission fitness of the viral genotype represented by the focal node/tip compared to other parts of the tree.
In this study, using LBI, we demonstrated the signal of selection on transmission fitness among HIV-1 subtype-B genotypes from individuals with different SPVLs. Powered by a large sample size from the Bridging the Evolution and Epidemiology of HIV in Europe (BEEHIVE) project and the stringent selection criteria in procuring these samples (Blanquart et al. 2017), we show that transmission fitness varies with SPVL (P-value = 0.009), with our central estimate for the variation being an increase until a peak of approximately 10 5 copies/ml. This relationship holds true for all sliding windows across the HIV-1 genome, except for the last 1,500 bp. The relationship between transmission fitness and SPVL higher than ∼10 5 copies/ml is confounded by methodological and epidemiological factors. Interpreting LBI in terms of a viral birth-death epidemic/phylodynamic model, we show that variation in LBI is more sensitive to variation in lineage birth rate (infectiousness) at shorter timescales, while accounting for variation in lineage death rates (duration of infectious state) at longer timescales.

Data
The data set we analysed consists of N = 1,927 whole-genome subtype-B consensus sequences from the BEEHIVE project, collected across European cohorts. Viral RNA was manually extracted from samples (Cornelissen et al. 2017) and then reverse transcribed and amplified using primers to define four overlapping amplicons spanning the whole genome, which were fragmented and sequenced with Illumina MiSeq or HiSeq platforms (Gall et al. 2012). Whole-genome consensus sequences were reconstructed from the resulting short-read data using Iterative Virus Assembler (Hunt et al. 2015) and shiver (Wymant et al. 2018). Sequence subtypes were determined using the COMET software (Struck et al. 2014) and validated using phylogenetic tree placement methods (Blanquart et al. 2017). SPVLs were single measurements of viral load performed on the same sample used for sequencing, with samples taken between 6 and 24 months after seroconversion. This measure of viral load during the set-point window of infection was previously shown to have a greater heritability than an average of multiple measurements taken at different times (Blanquart et al. 2017).
To avoid the uneven representation of the epidemic caused by inconsistent sampling, we restricted our regression analysis data set to contain only samples from the period 2000-12 (see Supplement Figure 1), during which the sampling distribution was relatively constant. We also excluded samples with viral load measured as less than 1,000 copies/ml, given that there are biologically induced experimental limits to accurate reporting and detection of low viral load samples. This filter not only increased our confidence in the SPVL measurements used in the regression, but also removed the potential bias on LBI estimations caused by lower genomic coverage for low viral load samples, which may potentially lead to noisier measurement of the viral genotype and therefore longer terminal branches.

Phylogenetic inference
We created a multiple-sequence alignment by merging individual pairwise alignments between each sample's whole-genome consensus sequence and the HXB2 reference genome. We also took sliding-window subsets of this whole-genome alignment, with a width of approximately 1,000 bp with ∼500-bp overlap. All alignments were used to build maximum-likelihood trees using IQ-TREE (Nguyen et al. 2015), with the substitution model general time reversible model + F + R4. Time calibration was done using treedater (Volz and Frost 2017), with the year of sampling for each sequence. When the year of sampling was unavailable, we used the year of seroconversion.

Transmission fitness computations
We computed the LBI (Neher, Russell, and Shraiman 2014) for each tip of the time-calibrated trees, using an approach adapted from Equations 17-19 of Neher, Russell, and Shraiman (2014) (available at github.com/BDI-pathogens/hiv_spvl_fitness/). We used the recommended timescale parameter (τ ) (Neher, Russell, and Shraiman 2014), namely 0.0625 multiplied by the mean cophenetic distance of tips in the phylogeny, corresponding to τ = 4.42 years for the tree in this study. An example tree coloured with LBI values is shown in Fig. 1. We also varied τ in sensitivity analyses. To compute a z-score for the LBI while accounting for the uneven availability of samples in different years, we permuted the branching patterns of time-calibrated trees 1,000 times while keeping the coalescent events and the year of samples constant (Dearlove and Frost 2015). We used the LBI values from all permuted trees to calculate the z-score of the raw LBI values for each tip in the non-permuted tree.
We also converted raw LBI estimates for the subset of N = 1,542 samples from men who have sex with men (MSM) into lineage birth and death rates. To infer lineage birth rates β, we assume a fixed death rate of δ 0 = 0.15/year inferred by maximum likelihood from the tree sliced between 2000 and 2012 and then we interpret changes in the LBI as changes in β according to the equation for birth-death models (see 'Interpretation of LBI under a birthdeath model'). For a large tree with β − δ << 1/τ , the relation can be expressed analytically: where I LB denotes the LBI value, β is the lineage birth rate, δ 0 is the lineage death rate, and τ is the LBI timescale parameter. Here however we solve numerically the exact relation (see Supplementary File 1) assuming a finite tree with root in 1995 and tips until 2012. In principle, lineage death rates δ may be inferred similarly by assuming a fixed birth rate of β 0 = 0.14/year also inferred by maximum likelihood and then interpreting changes in LBI as changes in δ, but their inference is noisier and more prone to biases, as well as less relevant for the results in this paper.

Regression of LBI and SPVL
Gaussian process regressions-a non-parametric regression/ smoothing approach based on a Gaussian process prior-were applied to normalised quantile-transformed z-scores of LBI estimations and corresponding quantile-transformed SPVL measurements. The squared exponential covariance function was used: with length parameter (l 2 ) 0.15, variance in data (σ 2 ) 0.5, and variance in noise 0.5. A permutation test was used to assess the significance of the existence of some relationship between LBI and SPVL: the variance of the regression curve calculated from the true data was compared with the right tail of the distribution of variances of the regression curves calculated from 1,000 permutations of the normalised quantile-transformed z-score of LBI values. This defines the P-value for the null hypothesis, namely that LBI is independent of SPVL.

Relationship between transmission potential and SPVL
We applied Gaussian process regression to our proxy of transmission fitness (i.e. normalised quantile-transformed z-score of LBI values) and quantile-transformed SPVL measurements. The regression trend showed a clear increase in transmission fitness with increasing SPVL until approximately 10 5 copies/ml. As SPVL increases beyond 10 5 copies/mL, the results showed some evidence for a peak and a decline in transmission fitness and then remained constant over the largest SPVL values (Fig. 2). The existence of a non-constant relation between LBI and SPVL is significant by permutation (P = 0.009) (Supplementary Figures 2-3). We further investigated if LBI differed by transmission mode and sex. The subsets of samples for heterosexual males (N = 111), heterosexual females (N = 107), and injecting drug users (IDUs; N = 105) did not produce significant trends, probably due to limited statistical power (Fig. 2). N = 1,542 samples were from MSM; these made up the majority of the full data set (N = 1,927), and the relationship between SPVL and LBI for this subset of samples was unsurprisingly similar to that of the full data set (P = 0.001).
To test if the relationship holds true throughout the genome, we divided the genome into 17 overlapping windows roughly 1,000 bases wide. For each window we inferred a phylogeny, calculated transmission fitness at the tips, and estimated the relationship of transmission fitness against SPVL. All windows except the last two (P = 0.12 and P = 0.18) showed agreement with the global trend (P < 0.048) (Fig. 3). The region that did not show significant signatures of the relationship (last 1,500 bp) includes partial sequences of tat, rev, and gp41 genes and the entire nef gene. Viral replicative fitness is a direct contributing factor to transmission fitness of HIV-1. While mutations in gag, pol, and env genes directly affect viral replicative fitness, changes in nef do so less (Claiborne et al. 2015, reviewed in: Biesinger andKimata 2008). Therefore, nef may be under less selective pressure for enhancing transmission fitness.

Interpretation of LBI under a birth-death model
The LBI at a given point in a tree (such as a tip) is calculated as the length of the whole tree, exponentially discounting contributions with increasing distance from that point. The LBI timescale τ corresponds to the size of the relevant surrounding tree neighbourhood (i.e. 1/τ is the rate at which distance is discounted). Under a birth-death model for a well-sampled, slowly varying epidemic (β − δ << 1/τ ), the expected LBI I LB is related to the lineage birth rate β and the lineage death rate δ by the following relation: This result is not valid for a fast-growing epidemic nor if the intensity of sampling changes with time; see a detailed derivation in Supplementary File 1 for these scenarios.
In comparison to the conventional transmission potential or basic reproductive number (R 0 ), we show that LBI is unequally affected by the lineage birth and death rates. Since LBI is related to tree branching processes, it is more sensitive to lineage birth than to lineage death as shown in Fig. 4. Specifically, the difference in LBI between two tips differing in lineage birth rate by ∆β and differing in lineage death rate by ∆δ is approximately: The contribution of the variation in δ is suppressed compared to the contribution of the variation in β, especially for large values of LBI. In other words, while LBI is one measure of fitness, it is more sensitive to the infectiousness and less sensitive to the duration of the infectious state 1/δ. This can be contrasted with other measures of fitness, such as the growth rate r = β − δ or the transmission potential TP = β/δ, which are more balanced in their sensitivity to infectiousness and duration of infectiousness. The larger the parameter τ is, the stronger the correlation between the difference in LBI between two lineages and the difference in their transmission potential, since ∆TP = 1 . In order to test the sensitivity of LBI to the lineage death rate, we varied the timescale parameter τ from 3 to 15 years. An increasing proportion of lineage deaths is taken into account by LBI estimations when τ is larger, thus enhancing sensitivity to the lineage death rate. In fact, we observe the effect of finite lifespan of lineages inside the phylogeny, i.e. the duration of the infectious state, on the curves (Fig. 5). Although these relationships are only significant for τ ≤ 5 years (P = 0.001-0.019) for our data set, a downward bending trend for higher viral loads started to appear as the value of τ increased. This is consistent with the expected shape of the fitness landscape for SPVL and to the  idea that different values of τ are sensitive to different factors relating to transmission fitness and SPVL. In the context of HIV-1 subtype-B phylogenies, LBI estimates are a proxy for infectiousness (i.e. lineage birth rates) at shorter timescales but capture signatures of selection against high SPVL (i.e. higher lineage death rates) at longer timescales.

Discussion
We demonstrated that phylogenetic signals can be used to infer transmission fitness of subtype-B HIV-1 viral genotypes and provided the first direct evidence of a relationship (P = 0.009, permutation test) between transmission fitness and SPVL. Transmission fitness is positively correlated with SPVL, until a peak value between 10 4 and 10 5 copies/ml. At values of SPVL greater than this peak, transmission fitness may decline, although this is only weakly supported by our results, partly due to a lack of samples with high SPVL and difficulties inherent to the method to detect differences in death rates. In Fraser et al. (2007), the transmission potential, defined as the expected number of people one infected individual could infect during the chronic stage of infection, is the product of transmission rate or infectiousness and the duration of infectious period. For HIV-1 subtype B, the transmission potential is relatively low for individuals with low and high SPVL and high for individual cases with intermediate SPVL. Our application of LBI reproduced this pattern using phylogenetic information, showing selection against low SPVL and possible stabilising selection for SPVL around 10 4 -10 5 copies of viral RNA/ml. One study concluded that the US epidemic is selecting for high viral loads, which are becoming more common over time (Wertheim et al. 2019). Yet we have seen a decline in viral load (Supplementary Figure 4) for our period of study in European samples, which was also reported in other publications (Pantazis et al. 2014). There could be key differences in what is being selected in these two regional epidemics.
Both theory and analysis demonstrate that LBI can be used to estimate transmission fitness using phylogenetic relationships within a tree. The main assumption for this approach is a panmictic population, without well-defined subpopulations with different coalescent rates. This assumption is not strictly respected for HIV in Europe, but it should not affect our analysis because the dependence of LBI on SPVL should be similar across countries. However, the suggested value for the LBI timescale parameter τ for our subtype-B tree (4.42 years) effectively constrains the signal of transmission fitness to this time span before and after the sample was taken, and accounting for exponentially less extending to the future or into the past. Although we do not know the decaying time for transmission fitness for particular HIV genotypes, the timescale parameter is much smaller than the typical duration of HIV infections and therefore constrains the sensitivity of LBI inference of selection on duration of infectious state. In fact, about half of individuals living with HIV in Europe are late presenters (Late presenters working group in COHERE in EuroCoord et al. 2015) who usually live with the infection for more than 4.42 years; therefore, LBI calculations considered a smaller proportion of lineage death events of sampled individuals from the branching patterns of the tree.
In order to estimate relative transmission fitness, LBI implicitly assumes the data captured lineage births and deaths within the specified time period (i.e. a large τ ), while in HIV infections and transmissions this is not true. Interpreting LBI in the framework of birth-death models, it is more sensitive to relative changes in birth rate or infectiousness in HIV than death rate, at least for death rates <1/τ , while transmission potential is affected by both equally. When we increase the value of timescale τ used for LBI, relative differences in LBI become a better proxy for relative differences in transmission potential, as the contribution of increased lineage death rates starts to appear by the gradual drop of transmission potential with higher viral load samples. The timescale parameter τ may also change the shape of the regression curve for other reasons. It might magnify the variance in estimations for the relatively low number of samples for high SPVL, causing a noticeable change. Also, the increase in τ means the focal point of LBI calculations shift from evolutionary history close to the tip in the tree (the sample) to taking into account a longer time span further away from the tip. This effect may depend on how long transmission fitness is maintained by the viral genome through time, which in turn depends on the viral genome's opportunity to generate new mutations and the interplay between genomic epistasis.
The trends we observed agree with previous modelling results, which found that the transmission potential of HIV-1 subtype B is lower for both low and high SPVLs and highest at intermediate viral loads. However, these estimates were based upon data from untreated individuals (Fraser et al. 2007), and with the current data set, treatment must be considered. If we approximate the effect of ART as reducing infectiousness to zero from the moment of diagnosis and assume the time of diagnosis is before AIDS and independent of SPVL, then the average duration of the infectious state will be independent of SPVL. Transmission potential, which is the product of infectiousness and the duration of the infectious state, is then governed by infectiousness alone. The BEEHIVE data set is a combined European cohorts data set, where ART coverage is above 70 per cent for the sampling years of our data points (CAS-CADE ∼100 per cent (Stirrup et al. 2018), Netherlands HIV monitoring annual reports 78.3-85 per cent (Stichting HIV Monitoring Annual Report 2003-2012, and Swiss HIV Cohort Study 70-90 per cent (SHCS 2021)). For BEEHIVE, only individuals with samples obtained soon after a known time of seroconversion are included, to avoid biases related to their stage of infection. However, time to ART initiation for individuals in the European cohorts varied from immediately to several years post seroconversion. Variation in the time from infection to ART can be decomposed into variation in the time from infection to diagnosis (Wolbers et al. 2008;van Sighem et al. 2015) and variation in the time from diagnosis to treatment (Boender et al. 2018). During our study period ART guidelines changed, including a CD4 count threshold change around 2008 (Wilkin and Gulick 2008), and only since 2019 has immediate ART been recommended for people living with HIV in Europe (EACS 2019). With a high proportion of individuals on ART and the variation in treatment start time, individuals within the population have variable duration of infectious state, and this can change the relationship between transmission potential and high SPVL (Supplementary Figure 5 and method in Supplementary File 2). Early access to ART will also confound the relationship between the transmission potential and SPVL by increasing the proportion of transmissions during the acute phase of the infection. Before a policy of immediate ART initiation upon entry to care, individuals with higher viral load have a faster progression to the treatment initiation (Mellors et al. 1997), and thus these viral strains are more likely to be removed from the transmission network than those of individuals with low viral load. Clearly, further detailed modelling is needed to characterise and infer the effects of treatment on the landscape of HIV-1 transmission potential.
Many characteristics of individual HIV cases contribute to the transmission potential of the virus, and some of these links may be obscure or highly heterogeneous; therefore, it is important we identify a proficient candidate characteristic to infer transmission potential. By knowing which genotypes have a higher transmission potential, better epidemic preparedness can be achieved. Our study took a step back and inferred general transmission fitness patterns made possible with a large data set from the European HIV cohorts and demonstrated a practical application of LBI to detect selection on a continuous phenotype. The relationship we observed between LBI and SPVL enabled a better understanding of evolutionary selection pressure on transmission fitness. The application of this phylogeny-based method should also be expanded to other viruses and other phenotypes as well, not limited to SPVL.

Data availability
The tree used for the analyses is available in newick format at github.com/BDI-pathogens/hiv_spvl_fitness.

Supplementary data
Supplementary data is available at Virus Evolution online.