Modeling Drug Resistance Emergence and Transmission in HIV-1 in the UK

A deeper understanding of HIV-1 transmission and drug resistance mechanisms can lead to improvements in current treatment policies. However, the rates at which HIV-1 drug resistance mutations (DRMs) are acquired and which transmitted DRMs persist are multi-factorial and vary considerably between different mutations. We develop a method for the estimation of drug resistance acquisition and transmission patterns. The method uses maximum likelihood ancestral character reconstruction informed by treatment roll-out dates and allows for the analysis of very large datasets. We apply our method to transmission trees reconstructed on the data obtained from the UK HIV Drug Resistance Database to make predictions for known DRMs. Our results show important differences between DRMs, in particular between polymorphic and non-polymorphic DRMs and between the B and C subtypes. Our estimates of reversion times, based on a very large number of sequences, are compatible but more accurate than those already available in the literature, with narrower confidence intervals. We consistently find that large resistance clusters are associated with polymorphic DRMs and DRMs with long loss times, which require special surveillance. As in other high-income countries (e.g., Switzerland), the prevalence of sequences with DRMs is decreasing, but among these, the fraction of transmitted resistance is clearly increasing compared to the fraction of acquired resistance mutations. All this indicates that efforts to monitor these mutations and the emergence of resistance clusters in the population must be maintained in the long term.


Introduction
Drug resistance is an increasing health problem. Drug resistance mutations (DRMs) emerge in HIV viruses through selective pressure during antiretroviral therapy (ART) and make the current ART drug combination ineffective for both sustaining the patient's wellbeing and for the prevention of virus transmission [1,2]. Drug-resistant viruses can therefore be transmitted to treatment-naive patients, who in turn can transmit them further [3,4], endangering the efficacy of treatment for the whole population. The rates at which DRMs are acquired and transmitted drug resistance (TDR) mutations persist are likely to be multifactorial and have been shown to vary considerably depending on duration and type of treatment and the presence of mutations [5]. Hence, having a deeper understanding of HIV transmission and drug resistance mechanisms is important as it can lead to improvements in current treatment policies [6].
Phylodynamics uses phylogenetic trees (i.e., genealogies of the pathogen population) inferred from the pathogen sequence data to estimate the epidemiological parameters. Several phylodynamic models of pathogen transmission have been developed [7][8][9][10].
An important trade-off in phylodynamic modelling is between the complexity of the biological questions that a model can address and its computational speed. On one side of the spectrum, there are computationally light statistical approaches, such as the study by Mourad et al. [4] of persistence times of drug-resistance in the untreated HIV-1-infected population in the UK. The analysis used a parsimony-based approach [11] to extract "phylotypes" of sequences, the most recent common ancestor of which bore a resistant mutation that is still shared by the majority of the sequences in the phylotype. Once dated and combined with the treatment-naive/experienced status, these phylotypes were used to zoom in on the most readable parts of the phylogeny and compute simple statistics which are immediately accessible from the annotated tree. The simplicity of the method makes it computationally very efficient. It was applied to a large set of ≈25,000 HIV-1 subtype B sequences from the UK, where it showed that around 70% of transmitted drug resistance had a treatment-naive source.
However, to address more refined questions such as estimation of rates of different events (transmission, drug resistance acquisition, etc.), more complex methods are needed, such as modeling of the viral dynamics with ordinary differential equations (ODEs). Kühnert et al. [9] proposed a piecewise-constant two-type (resistant and sensitive) birth-death model to estimate the fitness cost of DRMs. The fitness was measured as a ratio between transmission rates of hosts infected by drug-resistant strains and transmission rates of hosts infected by sensitive strains. They applied this model to the data from the Swiss HIV cohort study. They reconstructed a maximum likelihood tree for 5638 pol-gene sequences from the Swiss HIV cohort study and 4284 closely related sequences from the Los Alamos HIV database. On this tree, for each of the 15 major DRMs present in the Swiss cohort sequences, they identified transmission clusters of up to 250 sequences each, containing > 80% of Swiss sequences and at least one sequence with the mutation. Kühnert et al. [9] estimated the model parameters on all the clusters for each mutation separately in a Bayesian setting. To account for the fact that DRMs appear under antiretroviral (ARV) selective pressure, they put the rates of state change (from sensitive to resistant and vice versa) to zero before significant usage of the related drug(s) in Switzerland. The study showed that some of the mutations (RT:D67N, RT:K70R, RT:M184V, RT:K219Q) decreased the fitness, one (PR:L90M) seemed to increase the fitness, while the others did not have a significant effect.
The models above, and more generally, the family of multi-type birth-death models [7] with a Bayesian birth-death skyline plot (allowing the parameters to change in a piecewise constant manner) [12] they belong to, define ODEs for fine-tuned parameter estimation. However, their complexity prevents resolving them analytically. Numerical solutions of the ODEs, on the other hand, have long computational times, which prevents the application of these models to larger datasets (dozens of thousands of sequences), while larger datasets are desirable for more accurate parameter estimation.
A compromise between the model complexity and computational speed when applied to large datasets needs to be found. In this study, we propose such a compromise that improves the approach by Mourad et al. [4] by using maximum likelihood and combining it with the skyline ideas of Stadler et al. [8] to analyze DRM transmission patterns.
Our approach uses ancestral character reconstruction (ACR) on a partially sampled transmission tree. Using the ancestral scenario reconstruction tool PastML [13], we study ancestral states for presence/absence of common surveillance DRMs. In a tree annotated with PastML, we can discriminate between two types of resistant nodes: (1) those whose parent node does not have the DRM, which correspond to acquired drug resistance (ADR), and (2) those whose parent node is also resistant, such nodes form TDR clusters. We also identify the scenarios of DRM loss (when the parent node has the mutation, while the child does not). Moreover, we account for the changes in treatment policies by allowing for separate ACRs for different time intervals (e.g., before and after the first DRM-provoking ARV introduction). Once the reconstruction is performed, we visualize the results with PastML and calculate various statistics for transmission patterns.
We apply our approach to analyze the patterns of DRM emergence, transmission, and loss in HIV-1-infected individuals in the UK, using sequences and metadata from the UK HIV Drug Resistance Database [14].

Materials and Methods
The UK HIV Drug Resistance Database provides HIV protease (PR) and reverse transcriptase (RT) sequences extracted during resistance tests and their corresponding metadata (e.g., treatment status of the patient before the test: treatment-experienced, -naive, or unknown; date of the test). These sequences were obtained by Sanger sequencing, which provides the consensus sequence of the predominant virus variants in a patient at the time of sampling. These sequences are subject to sampling uncertainty and potential sequencing errors, but the most discordant sequences were detected as outliers and removed from our analyses (see Transmission tree reconstruction). Furthermore, our dataset is so large (∼80,000 sequences) that the few remaining errors, which are unavoidable, most likely have a small impact on the overall results.
In response to our request for data from the database, we received 88,009 sequences for 60,846 different patients, sampled between 1996 and 2016.

Sequence Subtyping and Alignment
We subtyped (pure subtypes and recombination positions) and aligned the sequences against the Los Alamos 2010 subtype reference pol-gene alignment [15] using jpHMM [16] (for detailed options, see Appendix A).
Altogether, we obtained a large nucleotide alignment of 88,009 sequences, from which we extracted the alignments for the B and C subtypes. We filtered them to contain only the first sequence (in terms of sampling date) when several sequences were present for the same patient. We hence obtained a 40,055-sequence alignment for the B subtype, and a 19,139-sequence alignment for the C subtype. To each of them, we added five randomly selected HIV-1 group M sequences of other pure subtypes to be used as an outgroup for tree rooting.

Transmission Tree Reconstruction
We reconstructed phylogenetic trees for B and C sequences separately using RAxML-NG (v0.9.0, evolutionary model GTR+G4+FO+IO; for detailed options, see Appendix A) [17] and rooted them with the outgroup sequences, which we then removed. For tree reconstruction, the positions of surveillance DRMs [18] were removed from the nucleotide alignment, as they are influenced by treatment-selection forces, unlike the other positions, and could bias the reconstruction by grouping together the sequences that share the same DRMs [19]. We kept the non-surveillance (e.g., accessory, polymorphic) DRM positions in order to keep a sufficient phylogenetic signal. While accessory mutations are also influenced by treatment-selection forces (to compensate for the deleterious effects of major drug resistance mutations [20]), they are not frequent enough to bias tree reconstructions. Polymorphic mutations can appear spontaneously in treatment-naive individuals and hence are an important source of phylogenetic signal and were kept. It should also be noted that our trees were reconstructed from DNA sequences, not proteins, which reduces the impact of convergent mutations at the amino acid level.
We then dated each tree with LSD2 [21] (v2.3: github.com/tothuhien/lsd2/tree/v1.4.2.2 accessed on 1 April 2023) using tip sampling dates, under strict molecular clock with outlier removal (see Appendix A for detailed settings). Sequences whose mutation rates were different from the mean rate for more than three standard deviations were considered as outliers.

Ancestral Character Reconstruction
For each DRM (surveillance, polymorphic or accessory) listed in the Stanford HIV Drug Resistance Database [22] (hivdb.stanford.edu/pages/surveillance.html accessed on 10 May 2023), we extracted its presence/absence in the (unaligned) sequences of our datasets and the ARVs that can provoke it with Sierra, the Stanford Algorithm [23] web service. We then analyzed the DRMs that were found in at least 0.5% of sequences (after filtering by patient and temporal outlier removal) of our dataset (either B or C, analyzed separately).
Each DRM name (e.g., RT:T215D) contains 2 pieces of information: the DRM position (e.g., RT:T215) and its mutated amino acid, associated with resistance (e.g., D). The DRM position contains the protein name: RT (reverse transcriptase) or PR (protease), the reference position of the amino acid (e.g., 215), and its wild-type amino acid (e.g., T).
We analyzed DRMs with prevalence > 0.5% for each dataset. Each DRM position (e.g., RT:V179 in the C dataset) was analyzed independently by reconstructing its states in the ancestral nodes, based on the tip states. The DRMs with prevalence > 0.5% found in this position (e.g., RT:V179D and RT:V179E in C) were analyzed together. For the majority of the DRM positions, only one DRM (i.e., mutated amino acid, associated with resistance) with prevalence > 0.5% was found (e.g., PR:L90M for the position PR:L90). In the B dataset Possible states for ancestral character reconstruction (ACR) corresponded to DRM presence (i.e., the resistant state) or absence (sensitive state) for DRM positions with only one DRM. For instance, for PR:L90M, the resistant state corresponds to the amino acid M, and the sensitive state to any other amino acid; in practice, the sensitive state is almost uniquely L. In the B dataset, 97.79% of sequences have L at the position PR:90; 1.93% have M; less than 0.01% have W or F, and 0.23% have an ambiguity at this position (so their initial state for ACR is unresolved between sensitive and resistant). For positions with several DRMs, the resistant state was split into all the possibilities (e.g., D, F, S, or Y for RT:T215).
For polymorphic mutations (e.g., RT:S68G), ACR was performed on the corresponding (B or C) time-scaled tree with PastML (v1.9.40, MAP (maximum a posteriori) decision rule), without taking into account the year of ARV acceptance, as the these mutations could be present independently of ARVs.
To reconstruct the ancestral character states for non-polymorphic DRMs, we used the procedure visualized in Figure 1a (which we first proposed and applied to study HIV resistance patterns in Cuba in [24]). For each ARV, we extracted the dates of their acceptance with the Wikipedia python package (https://github.com/goldsmith/Wikipedia accessed on 1 April 2023). We cut the time-scaled tree at the earliest of the dates of acceptance of ARVs that can provoke the DRM (e.g., for PR:L90M, saquinavir (SQV) was accepted in 1995). We hence obtained the pre-treatment-introduction tree and a forest of post-treatmentintroduction subtrees. For the trees in the forest, we added additional one-child root nodes (as parents of the corresponding tree roots, at distances that corresponded to the differences between the root dates and the ARV acceptance date), which we marked as sensitive in the PastML input annotation file. We performed ACR with PastML on the forest, and then combined it with the all-sensitive annotation for the pre-treatment-introduction tree nodes.
For two of the multiple-DRM positions (RT:T215 and RT:K219), all the corresponding DRMs were non-polymorphic and provoked by the same ARVs (the earliest accepted being zidovudine (AZT, accepted in 1987) for all of them). We therefore cut the tree as explained above, and reconstructed the ACR for D, F, S, Y, or sensitive (for RT:T215) and for E, N, Q, or sensitive (for RT:K219) on the after-1987 forest.
Finally, for RT:V179, the mutation RT:V179D was polymorphic, while RT:V179E was non-polymorphic (provoked by nevirapine, NVP, accepted in 1996). To reconstruct ancestral characters for RT:V179, we followed the procedure visualized in Figure 1b: First, we cut the tree at 1996, and reconstructed the ancestral characters (E, D, or sensitive) on the after-1996 forest (the input states for the forest roots were sensitive or D). We then extended this reconstruction on the before-1996 tree only for RT:V179D (i.e., possible states: D or sensitive).
Once ACR was performed for all the DRM positions, we combined the predictions into a common table mapping node names to their states. A node state was sensitive if no DRM was reconstructed for this node at any position; otherwise, the state was a combination of DRMs reconstructed for this node in separate DRM analyses (e.g., RT:K103N+RT:V106I if those DRMs were reconstructed as present for the node of interest while the others were reconstructed as absent). We visualized this combined result using the COPY method of PastML. or sensitive (gray), for a non-polymorphic DRM (e.g., PR:L90M), we cut the time-scaled tree at the date of acceptance of the first ARV that can provoke this DRM (for PR:L90M, SQV accepted in 1995), as shown in the left panel. We hence obtain the pre-treatment-introduction tree (upper part of the tree) and a forest of post-treatment-introduction subtrees (bottom part). For the trees in the forest, we then mark their roots as sensitive (middle left panel). We perform the ACR with PastML on the forest (middle right panel) and combine the results with the the all-sensitive annotation for the pre-treatment-introduction tree nodes (right panel). (b) To reconstruct the ancestral character states for the DRM position RT:V179, corresponding to a polymorphic DRM RT:V179D (violet), but also to a non-polymorphic DRM RT:V179E (orange), we cut the time-scaled tree at the date of acceptance of the first ARV that can provoke RT:V179E (NVP, accepted in 1996), as shown in the left panel. For the trees in the after-1996 forest, we then mark their roots as either sensitive (gray) or D (violet, middle left panel) and perform the ACR with PastML (middle right panel). We then extended this reconstruction to the before-1996 tree only for RT:V179D (right panel).

Transmitted versus Acquired Drug Resistance
On a tree whose nodes are annotated with their DRM status, present (resistant) or absent (sensitive), we defined three configurations: transmitted drug resistance (TDR), acquired drug resistance (ADR), and DRM loss (see Figure 2).
We defined ADR cases as parent-child node pairs, where the parent DRM status is sensitive and the child DRM status is resistant.
We defined TDR cases inferred from the tree as either: 1. An internal node whose state was estimated as resistant (i.e., containing the DRM of interest, see Figure 2c,d). As the internal nodes of the tree roughly correspond to transmissions, such a node indicates a transmission of a resistant virus. 2.
(For non-polymorphic mutations only) A hidden internal node between a node whose DRM status is resistant and its parent node whose DRM status is sensitive, if all the tips in the node's subtree are treatment-naive. According to the treatment status and the fact that the mutation is non-polymorphic, the initial resistance could not be acquired through treatment pressure, and hence must have been transmitted from a patient who was not sampled (and does not appear in the tree, see Figure 2b,d).
Connected parts of the tree corresponding to TDR cases form TDR clusters (see Figure 2). We calculated their sizes as the numbers of resistant tips connected to each cluster. Note that if a TDR cluster subtree contains only treatment-naive patients, it implies that its root ADR event corresponds to an unsampled treated patient (see Figure 2b,d).
We define DRM loss cases as parent-child node pairs, where the parent DRM status is resistant, while the child DRM status is sensitive.
Using these configurations, we calculate the source of the DRM status of each tip in the tree as follows.

1.
For treatment-naive tips, the source of their DRM status is: TDR+DRM loss if the tip is sensitive and is involved in a DRM loss configuration (see Figure 2c,d); • Transmission of a virus without the DRM if the above two cases do not apply.

2.
For treatment-experienced tips, the source of their DRM status is: • ADR (+DRM loss if the tip is sensitive) for one of the treatment-experienced tips connected to a TDR cluster (see Figure 2c). The patient corresponding to this tip is assumed to be the source of the TDR cluster. The later DRM loss is possible if the treatment was changed to drugs that do not provoke the DRM in question.
For other treated tips connected to this cluster, we assume that they received a resistant virus via TDR. Assuming their treatment was such that it could not provoke the DRM in question, they could later lose it (hence, +DRM loss if they are sensitive); • ADR for a resistant tip not connected to a TDR cluster ( Figure 2a); • Transmission of a virus without the DRM if the above cases do not apply.

3.
For the tips whose treatment status is unknown, we consider both cases (naive or resistant) with equal probabilities (0.5).

For Polymorphic DRMs
We do not consider the treatment status (as such DRMs could appear independently of treatment) and calculate the source of each tip's DRM status as follows: • ADR for a resistant tip not connected to a TDR cluster (as in Figure 2a, independently of the treatment status); • ADR (+DRM loss if the tip is sensitive) for one of the tips connected to a TDR cluster (as in Figure 2c, independently of the treatment status). The individual corresponding to this tip is assumed to be the source of the TDR cluster. For other tips connected to this cluster, we assume that they received a resistant virus via TDR. They could later lose it (hence, +DRM loss if they are sensitive); • Transmission of a virus without the DRM if the above cases do not apply.
We count the numbers of tip DRM status sources of each type (ADR: N ADR , TDR: N TDR , or loss: N loss (see Algorithm A1, Appendix B for details)) and report the results in Tables 1 and 2. We count all the identified DRM loss events, all the identified (observed and hidden) TDR events, and only those ADR events that are not at the root of naive-only TDR clusters, as the latter happened in unsampled treatment-experienced patients (see Figure 2b,d).
Note that N resistant tips = N ADR + N TDR − N loss . For example, in Figure 2c, all the events correspond to observed tips, so we count one ADR, three TDR, and one DRM loss events: N resistant tips = 3 = 1 + 3 − 1. Figure 2d represents a more complex case: we count one hidden TDR event (as it led to the resistance status of one of the observed tips) and three observed TDR events (leading to resistance statuses of other observed tips). We do not count the ADR event (as it corresponds to an unobserved patient, whose virus is not in our dataset). We also count one DRM loss event, which led to one of the tips regaining its sensitive state. Hence, N resistant tips = 3; N ADR = 0; N TDR = 4; N loss = 1; 3 = 0 + 4 − 1.

Times of DRM Loss
To estimate the DRM loss times, we used survival analysis with an exponential (constant hazard) model (Weibull model with β = 1), implemented in Python3 package SurPyval (github.com/derrynknife/SurPyval accessed on 1 April 2023, v0.10.10). This model takes as input observations about event durations and estimates the rate at which the event occurs. The input data might be left-, right-, or interval-censored. Left-censored data represent times that are longer than the event occurrences, e.g., if the DRM loss occurred in exactly 2 years, but the observation was only made after 3 years, the 3-year duration represents a left-censored data point. Right-censored data represent times that are shorter than the event occurrences, e.g., if for the same DRM loss the only observation was made after 1 year (and observed no DRM loss), the 1-year duration represents a right-censored data point. Interval-censored data represent cases when both a left-and a right-censored data point are available, e.g., for the same DRM loss an interval-censored datum might state that it occurred sometime between 1 and 3 years.
For each individual represented in our dataset, we extracted at most one data point for the loss survival analysis, as described below. We estimated the loss times for nonpolymorphic DRMs, assuming each DRM can be acquired only in the presence of selective pressure of the ARVs that could provoke it and lost in the absence of such ARVs. However, the treatment composition of individuals was not recorded in our datasets; we only had information on whether they were on treatment or were treatment-naive. Therefore, if a DRM loss did happen (i.e., a sample with a DRM followed by a subsequent sample without the DRM) and the corresponding individual's state was treatment-experienced, we assumed the treatment was such that it could not provoke this DRM and counted this as a DRM loss, just as in the case of an ARV interruption.
A right-censored data point represents the maximal observed duration during which a mutation loss did not occur. We extracted such points for the individuals who had several consecutive treatment-naive samples with the DRM of interest (and of the subtype of interest) in our metadata: we took the difference in sampling times of the last such sample and the first one.
A left-censored data point represents a duration that is longer than the mutation loss time. To estimate such a duration, we needed to know not only (1) the time by which the individual's virus lost the DRM, but also (2) the earliest time by which it could have acquired it (the difference making an upper limit on the loss duration). For (1), we used the time of the earliest sample without the DRM, provided it was preceded by samples with the DRM. For (2), we used either (2a) the time of the latest sample without the DRM preceding the aforementioned samples (where the DRM was present and then lost), if such sample existed in the metadata, or (2b) if the earliest metadata sample already had the DRM (which implies it corresponded to a resistant tip in the tree), the time of the tip's most recent ancestral node whose status was sensitive (with marginal probability > 0.95).
For individuals for whom both a left-and a right-censored data point was present, we converted them to an interval-censored one.
We reported the resulting DRM loss time estimates (i.e., inverse of the loss rates) for non-polymorphic DRMs with at least 5 left-censored and 5 right-censored data points (interval-censored data points counted as both). We estimated confidence intervals (CIs) as the 2.5 and 97.5 percentiles of the loss times estimated on bootstrapped data points of the same size (with 1000 repetitions).

HIV in the UK
Antiretroviral therapy (ART) was introduced in the UK more than 30 years ago and transformed HIV from a fatal infection into a chronic, manageable condition [25][26][27]. It is accepted that successful ART results in an "undetectable" viral load, which is protective from passing on the virus to others [28,29].
In the UK, a patient's viral load is regularly monitored by clinicians: patients attend biannual or quarterly clinical visits, depending on how well they do on treatment. Moreover, the increase in viral load comes with symptoms (generally, opportunistic infections that persist longer than they should). A suspicious increase from undetectable to detectable viral load (i.e., viral rebound) is the first sign of treatment failure.
In case of a viral rebound, the virus is sequenced to discriminate between resistance (presence of a known DRM) and poor adherence (failure without DRM, if a patient does not take the drugs regularly according to prescription). If resistance is the reason for a treatment failure, the treatment is changed.
Therefore, in the case of treatment failure, there is a window of opportunity for the virus to be transmitted: between the time the viral load increases to transmittable levels and the time when the clinician realizes it has increased and changes treatment. The probability of transmission varies across patients and depends on various factors [5].
The information collected from the HIV drug resistance tests carried out in the UK since 1996 is available in the UK HIV Drug Resistance Database. The database stores protease (PR) and reverse transcriptase (RT) sequences for about 50% of infected individuals in the UK.

UK HIV Dataset
We used the data from the UK HIV Drug Resistance Database containing samples from 1996 to 2016 to estimate transmission mechanisms for different common DRMs.
Out of 88,009 initial sequences obtained from the database, the majority were of subtypes B (58,569 sequences, 66.5%) and C (27,151 sequences, 30.1%); we also detected 8 D, 1 F, 2 G, and 3 K (<0.0001%) sequences and 2276 potentially recombinant sequences (2.6%; in particular 494 A, B, G and 446 B, K recombinants (0.5%)). We report these and other dataset statistics in Table 3. Table 3. Statistics on the B and C datasets. The "with DRM(s)" statistics count samples with at least one unambiguous resistant amino acid at any DRM position. Samples that contained either non-resistant or ambiguous amino acids at all DRM positions were considered as "without DRMs". "p DRM(s)" stands for polymorphic DRMs, while "np DRMs" stands for non-polymorphic ones. Note that the same sequence might contain both p and np DRMs (at different positions). We focused on subtypes B and C. In our phylogenetic analyses, we kept one sequence per patient (the first sampled) in order to (i) have a homogeneous transmission tree where each patient is represented by a single tip (i.e., between-host rather than between-and within-host) and (ii) reduce the computational complexity of tree reconstructions (which took several months on a computing cluster even with a single sequence per patient). The remaining sequences of the multiply sampled patients (26,526 out of 85,720 sequences) were used to estimate reversion times but were not integrated into phylogeny-based analyses to avoid biasing the results (DRM loss and ADR can be extracted from such serial data, but not TDR).
We hence obtained a 40,055-sequence dataset for B and a 19,139-sequence dataset for C. We further filtered these datasets by removing temporal outliers (<2% of sequences), as they could correspond to erroneous dates or poorly sequenced samples (e.g., due to viral load-mediated stochasticity, base-calling software errors, or issues with sample transport and receipt). The final datasets contained 39,159 sequences for B and 18,809 for C.
We detected 161 DRMs found in at least one sequence of the B dataset, and 146 DRMs for C. 31.4% of B and 27.4% of C sequences had at least one of these DRMs present, 18.5% of B and 16.9% of C sequences had only one mutation, while the others had multiple DRMs present. While the subtypes B and C are different, as are the locations where these subtypes are most prevalent (African countries for C versus the UK and other European countries for B), we did not detect major differences in DRM distribution in the B and C datasets. Hence, while more C than B sequences correspond to imported cases, the UK health policies must play an important role on their DRM patterns, independently of the subtype. In a recent study Blassel et al. [30] compared DRMs in a UK and an African datasets and reported that the median number of DRMs in resistant sequences differed between the two datasets (three in the African sequences versus one in the UK sequences). In our case, there was no difference between B and C datasets if all DRMs were considered (median number of one DRM for both B and C datasets in resistant sequences); if we considered only non-polymorphic DRMs, a slight difference appeared (one for B vs. two for C). Detailed statistics on DRM number distributions are shown in Table A1. There was, however, a significant difference in the TDR distribution: more TDR could be suspected among the B samples (12% of treatment-naive sequences had non-polymorphic DRMs present, while in the C samples, there were only 7% of such sequences).

Drug Resistance Analyses
We reconstructed time-scaled phylogenetic trees for B and C datasets and performed ancestral character reconstruction for each of the selected DRMs and positions to look at their transmission patterns. Consistently with what was previously reported in HIV-1 group M studies (of the pol gene [31] and of the full-genome [32]), we estimated a faster mutation rate (1.9 × 10 −3 (mutations per site per year)) and a more recent root date (1965) for subtype B than for subtype C (1.4 × 10 −3 ; 1944). More details on B and C datasets can be found in Table 3.
On the time-scaled trees we analyzed the transmission patterns of the DRMs found in at least 0.5% of sequences: 31 DRMs (on 26 different positions) for B and 21 (on 20 different positions) for C. The threshold of 0.5% permitted us to analyze all the major DRMs while having enough sequences representing these DRMs in the dataset. The major drug resistance patterns found in the B and C datasets are visualized in Figures 3 and 4. The statistics on these DRMs and their loss times are shown in Tables 1-4.
While some of the DRMs (e.g., RT:M184V) are comparably prevalent in B and C (4.8% of resistant cases vs. 5.4%), others are very subtype-specific. For instance, the nonpolymorphic mutation PR:L90M is present in 2.2% of B resistant cases and only in 0.6% of C. Another example is the mutations in position RT:106. In the B dataset, the polymorphic DRM RT:V106I is present in 3.9% of resistant cases and is 30 times more prevalent than the non-polymorphic DRM RT:V106M, which was not selected for our analysis due to its low prevalence; for the C dataset, we have the opposite distribution: RT:V106M is present in 2% of resistant cases and is 16 times more prevalent than RT:V106I, which was not selected for our analyses. More examples are given in Tables 1 and 2. The nodes are colored by DRM found in them (if several DRMs are present, the color of the (lexicographically) first DRM is used). The nodes with no DRM are colored gray and labeled "sensitive". The parts of the tree where no state change happens are clustered together into metanodes, and their size corresponds to the number of samples (tips) they contain (shown in labels), e.g., "RT_K103N+RT_S68G+RT_T215S 36" (violet, on the bottom) corresponds to a transmitted resistance cluster containing 36 samples in the B dataset, having three mutations. Configurations present several times are shown once and the number of occurrences is shown on the corresponding branch, e.g., a branch of size 247 leading to the metanode "RT_V179D 1-8" (salad green, on the top right) represents 247 cases of acquiring the mutation RT:V179D leading to small transmission clusters of sizes between 1 and 8. Configurations representing less than 34 samples are not shown to increase readability.
Using the metadata only, we can already see that there is a clear difference between polymorphic and non-polymorphic mutations. While the presence of most of the latter ones correlated with the treatment status (e.g., 86.5% of B sequences with the non-polymorphic mutation RT:M184V are from treatment-experienced patients), it is the opposite for the former, which are more prevalent in treatment-naive sequences (e.g., 78.1% of B sequences with RT:S68G are treatment-naive, see Table 1). Indeed, while the polymorphic DRMs can appear spontaneously, the non-polymorphic ones are selected by treatment, and carrying them often implies a fitness cost [9]. However, a few non-polymorphic DRM do not follow this pattern and are more prevalent in treatment-naive individuals: RT:T215D, RT:T215S, and RT:K219N in B and RT:V179E in C. The RT:T215D/S are reversions often developed in patients primarily infected with strains with RT:T215Y/F and hence have a higher fitness [33]. This is further confirmed by our estimation of the loss times: the loss times of RT:T215D/S are long (9.3 and 6.8 years versus 1.1 and 1.8 years for RT:T215Y/F, see Table 4), which may explain their prevalence in treatment-naive patients. Similarly, we estimated a rather long loss time for RT:K219N (3.7 years). We did not have enough data to estimate the loss time of RT:V179E. While this mutation is generally considered as non-polymorphic [34], its natural presence in treatment-naive patients has been reported for the HIV-1 common recombinant form CRF55_01B [35]. DRMs are present, the color of the (lexicographically) first DRM is used). The nodes with no DRM found are colored gray and labeled "sensitive". The parts of the tree where no state change happens are clustered together into metanodes, and their size corresponds to the number of samples (tips) they contain (shown in labels). Configurations present several times are shown once and the number of occurrences is shown on the corresponding branch, e.g., a branch of size 23 leading to the blue metanode "PR_L90M 1-3" (top left) represents 23 cases of acquiring the mutation PR:L90M, leading to small resistance clusters of sizes 1-3. Configurations representing less than 11 samples are not shown to increase readability.
Using the information from the tree, we refined the mutation statistics further, classifying resistant mutations sources into TDR vs. ADR and detecting DRM loss events (see Tables 1 and 2). The C dataset featured smaller TDR clusters (apart from the polymorphic mutation RT:E138A) than B. This could be explained by multiple introductions of subtype C into different regions of the UK and different risk groups, particularly from Africa via immigration, which is consistent with the higher diversity of the C strains observed in our data (C: 0.019 vs. B: 0.014 (mutations per site per branch), Table 3) and with the dates of origin of the two UK sub-epidemics (C: 1944 vs. B: 1965, Table 3).
A large size (e.g., 78 individuals in the B dataset for RT:K103N) of some of the TDR clusters and a rather high proportion of TDR cases among the resistant ones (see Tables 1 and 2) is clinically problematic, as it means a high level of resistant strain transmission, leading to a decrease in treatment choice at the population level.
We further analyzed each mutation position over time (see Supplementary Tables S1-S46) and found a common pattern: the proportion of resistant cases with respect to all cases decreases over time; however, the proportion of resistant cases in treatment-naive individuals and, consistently, the proportion of TDR with respect to ADR increases. This pattern is illustrated well by the mutations in position RT:215 (see Figure 5 and Table A2).
However, there are exceptions with respect to the decrease in the proportion of resistant cases over time, especially among the polymorphic DRMs, consistent with the fact that they have little or no fitness cost associated with them. For the polymorphic mutation RT:E138A, this proportions has been increasing from 2001 to mid-March 2016 (the last sampling time in our data): from 1.9% to 2.2% in the B dataset and from 8.3% to 11.6% in the C dataset (Tables S8 and S31). Similarly, the proportion of resistant cases with polymorphic RT:S68G has been increasing from 4.6% in 2001 to 8.1% in 2016 in B and from 0.3% to 0.9% in C (Tables S21 and S41). The proportion of resistant cases with polymorphic RT:V106I has been increasing in B from 2% in 2001 to 2.7% in 2016, while the proportion of non-polymorphic RT:V106M (similar to RT:V106I) in C seems to have stabilized at 2% over the last five sampling years (2011-2016, Tables S23 and S43). The proportion of resistant cases with polymorphic RT:V179D has been increasing in B from 1.3% in 2001 to 2% in 2016, as did the proportion of non-polymorphic RT:V179E (similar to RT:V179D) in C, from 0.1% in 2006 to 0.6% in 2016. Meanwhile, the proportion of RT:V179D has remained stable (∼1.5%) over the last 10 sampling years (2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016) , Tables S25 and S45). Finally, the proportion of resistant cases with polymorphic PR:Q58E has been increasing in subtype C: from 0.6% in 2006 to 0.8% in 2016 (Table S28; we did not analyze this for B due to its low prevalence). These results clearly indicate that the spread of polymorphic DRMs should become a subject of particular surveillance. . The parts of the tree where no state change happens are clustered together into metanodes, and their size corresponds to the number of samples (tips) they contain (shown in labels), e.g., "RT_T215D 99" (lilac, top left) corresponds to a transmitted RT:T215D resistance cluster containing 99 samples in the B dataset. Configurations present several times are shown once, and the number of occurrences is shown on the corresponding branch, e.g., the branch of size 804 leading to the metanode "RT_T215Y 1-5" (violet, right) represents 804 cases of acquired RT:T215Y mutation leading to small transmission clusters of sizes between 1 and 5. Configurations representing less than 2 samples are not shown to increase readability.

DRM Loss Times
We estimated the times of DRM loss for non-polymorphic DRMs in our datasets and compared them to the estimates previously reported by Castro et al. [5]. Castro et al. analyzed 313 patients from the UK Drug Resistance database who were treatment-naive and had a DRM present in their first resistance test (performed between 1997 and 2009), mixing all the subtypes and using survival analysis. We also used survival analysis, but we had a larger dataset that included information not only from the metadata, but also from the tree, and thus analyzed the subtypes separately. Our results and the comparison are shown in Table 4. Overall, out estimates are compatible with those by Castro et al.  Table 4. Loss times (with 95% CIs) for non-polymorphic DRMs found in B and C datasets with prevalence > 0.5% and at least 5 left-and 5 right-censored data points (the exact numbers of data points are shown in Table A3). The visualization using ACR for the DRMs at the position RT:T215 ( Figure 5) is consistent with the estimated loss patterns. For two of the mutations found in this position (RT:T215D and RT:T215S) the loss times are long (9.3 and 6.8 years, respectively), which allows them to form large TDR clusters (up to 99 and 45 samples, respectively, left and bottom part of Figure 5). For the other two mutations found in this position (RT:T215F and RT:T215Y), the loss times (including potential reversions to D or S) are rather short (1.8 and 1.1 years), which prevents them from forming significant TDR clusters.

Discussion
We proposed fast maximum-likelihood ACR methods for the investigation of drug resistance patterns in large sequence datasets. Their application to ∼40,000 subtype B and ∼20,000 subtype C sequences from the UK HIV Drug Resistance Database allowed us to investigate the trends in drug resistance patterns between 1996 and 2016 and to estimate the loss times for 25 common non-polymorphic DRMs.
An important advantage of our methods is their applicability to very large datasets (dozens of thousands of sequences). Previous studies had to face an uncomfortable choice between using more complex models on filtered data [9] or using less accurate (e.g., parsimony) approaches on full datasets [4]. Our approach uses a robust maximum likelihood framework and permits the extraction of global drug-resistant patterns from all the available data.
While the proportion of resistant cases in the UK seems to decrease with time, the proportion of resistant cases in treatment-naive individuals (hence, acquired via TDR) is increasing. In addition, our results show that polymorphic DRMs obey a different scheme, with an increase in both the proportion of resistant cases and TDR and large resistance clusters. The TDR cases form resistance clusters, which are clearly identifiable on phylogenetic trees. Locating these clusters within the UK's regions and cities and among risk groups would be an important step in stopping the spread of drug resistance. The global trend that we observe in the UK is visible in other high-income countries (e.g., Switzerland [36], Italy [37], and Portugal [38]) but differs from, for example, West Africa, where the prevalence of multiple resistance in the population is a major concern [39]. Furthermore, detailed analyses in high-income countries indicate that a high level of ADR is more frequently observed in certain risk groups (e.g., people of African origin, unemployed people, and people with mental illness, among others, in Switzerland [40]) that require special surveillance to prevent treatment failure and HIV-1 transmission.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/v15061244/s1. Supplementary Tables S1-S46 describing each position containing DRMs with prevalence > 0.5% in B and C data sets, and the evolution of their resistance statistics over time: TDR, ADR, and loss counts as well as the proportions of resistant cases among treatment-naive and -experienced individuals.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The HIV-1 sequences and the metadata (anonymized patient id and gender) used in this study were obtained from the UK HIV Drug Resistance Database [14] in 2017. The visualizations of the ACR results produced in this study are available at github.com/evolbioinfo/HIV1-UK (accessed on 2 May 2023).

Acknowledgments:
The authors thank Stéphane Hué for valuable discussions on HIV spread in the UK.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: