Introduction

Despite recent reductions in HIV-1 prevalence in Washington, DC from 2.5% in 20131 to 1.8% in 2018, the United States (US) capital is still experiencing a generalized HIV-1 epidemic – as defined by the World Health Organization2,3,4. There were 340 newly diagnosed cases in DC in 2018, and the DC rate is five times higher than the national rate3. Blacks, men, men who have sex with men (MSM), and heterosexuals (HRH) account for the majority of people living with HIV-1 (PLWH) in DC2,3. However, ~20% of the newly diagnosed persons had an unknown risk for transmission in both 2016 and 20172,3. Furthermore, the leading group (33.3%) of newly diagnosed cases was between the ages of 20–29 years old3. This same age group had the highest percentage (27.5%) of drug resistance mutations (DRM) at diagnosis, suggesting broader spread of HIV-1 drug resistant variants and potential concern for future therapeutic options, especially if these mutations are against first line antiretroviral (ART) drugs for newly infected individuals. With blacks and young adults being the most impacted groups of individuals for HIV-1 in DC, understanding the current HIV-1 phylodynamics can provide informative data to guide programs that prevent and reduce the incidence of HIV-1. Moreover, identifying potential transmission clusters amongst individuals in DC and their associated epidemiological features may help infer otherwise ‘unknown’ transmission modes and provide insight for more targeted prevention and intervention strategies.

In 2011, the DC Cohort, a longitudinal observational NIH-funded cohort study of PLWH who are receiving care at clinical sites in DC, began enrollment. As of 2018, the Cohort has enrolled approximately 10,000 PLWH2. By capitalizing on the longitudinal study of the DC Cohort, phylodynamics (i.e., the study of how epidemiological, immunological, and evolutionary processes act and potentially interact to shape viral phylogenies5) can provide insights into HIV-1 infection in DC. Analyzing sequence data can detect new variants within the population, identify population structuring and associations with risk factors, and, in combination with demographic information, predict areas of interest to direct public health efforts.

The great power and resolution of Next-Generation Sequencing (NGS) technologies are changing phylodynamics research. NGS is used for active infectious disease surveillance6, detection of circulating drug resistant variants7,8, and inference of HIV-1 transmission clusters. High variation among viral strains of RNA viruses such as HIV-1 are a result of high mutation rates, large population sizes, and short generation times9. NGS can detect mutations present in less abundant strains (<1%)8. Such rare mutations are particularly relevant in the context of the evolution of drug resistance, since they may facilitate viral adaptation leading to treatment failure10,11. Moreover, sequence variants (or haplotypes) can be reconstructed from NGS sequencing reads. Viral populations may contain a pool of different variants that are resistant to different antiretroviral drugs12,13 and also help the virus to evade the immune system14. Reconstructing the haplotypes present in a viral sample and assessing their phylodynamics may show additional or different transmission clusters present between individuals or identify a few HIV-1 strains that are dominating the HIV-1 viral population15,16,17. The use of powerful NGS technologies to study the HIV-1 epidemic at local levels (e.g., Washington, DC) may generate deeper insights into the ongoing HIV-1 dynamics. Near full length sequences and amplicon sequences that span entire HIV-1 genes are becoming more prevalent with this advanced technology18,19. Some studies have indicated that pol is less informative than env for phylogenetic resolution20. As env evolves at a faster rate than pol19,21, env has shown to be useful in determining the recency of HIV-1 acquisition22 and could provide more resolution to infer active or recent transmission clusters than pol.

This study applies NGS to a subset of newly and previously diagnosed participants in the DC Cohort to characterize the recent (2016 and 2017) local phylodynamics of HIV-1 in Washington, DC. Towards this general aim, we 1) estimate the diversity of HIV-1 in Washington, DC, 2) determine the circulating drug resistant mutations, 3) identify and evaluate potential transmission clusters with consensus sequences and their association with epidemiologic and clinical factors, and 4) predict HIV-1 haplotypes for each sample and assess their potential for detecting transmission clusters. The number and size of transmission clusters may vary across HIV-1 gene regions12,15,23,24,25, hence in this study we also compared phylodynamic inferences based on the polymerase and envelope HIV-1 genes.

Results

Sample and phenotypic characterization

Our sampling included PCR products from 68 participants in the DC Cohort. Most of the study participants resided in Washington, DC (Table 1). The majority were non-Hispanic black (82.4%) and male (69.1%), with 52.9% of participants being non-Hispanic black males. The majority of participants were infected through heterosexual sexual contact (39.7%) followed closely by MSM sexual contact (35.3%). A total of 76.4% of the patients were on an ART drug regimen at the time of blood sample collection. The demographics of our subsample of DC Cohort participants reflects a similar composition of PLWH in DC3,4 but includes slightly more participants infected through heterosexual contact and non-Hispanic Blacks, and a lower proportion of participants on ART than the overall Cohort sample.

Table 1 Demographic and clinical characteristics for DC Cohort participants whose samples were sequenced and passed filtering criteria.

HIV-1 diversity in DC

We performed in-depth phylodynamic profiling of HIV sequences from 171 PCR gene products that passed quality thresholds, including 62 PR/RT, 62 int, and 47 env amplicons. The subtyping analyses showed that all of the participants belonged to subtype B; therefore, subsequent analyses included data from all participants. Our participants were dispersed amongst and showed a star-like pattern with other DC HIV PR/RT sequences (see Supplementary Fig. S1 online). The env(c) data showed higher nucleotide diversity (π) and Watterson genetic diversity (θ) than pol(c) (Table 2). Males had a higher diversity, though not significant, than females (haplotype diversity: PR/RT: p = 0.2039, int: p = 0.9571, env: p = 0.3404). Participants whose risk factor was IDU (n = 6) had 50% less diversity than those with MSM and HRH risk, though again not significant (haplotype diversity: PR/RT: p = 0.7323, int: p = 0.7861, env: p = 0.6560). Non-Hispanic black participants had a higher genetic diversity for the pol(c) gene than for the env(c) gene (Table 2). The average haplotype diversity when calculated with the number of reconstructed haplotypes by PredictHaplo showed that env had more haplotype diversity compared to PRRT and int (Table 3). The average number of haplotypes per participant was the same for PR/RT and int (2 haplotypes) and slightly higher for env (4). Four of the six participants that had a higher number of haplotypes reconstructed (7–12 haplotypes in one or more gene regions) also had a higher average haplotype diversity estimate of 0.634 (range: 0.392–0.777), while the other two had a very low average haplotype diversity estimate (0.032, range: 0.027–0.038). Participants with HRH and MSM risk factors were found to have an average of 3 reconstructed haplotypes, with haplotype diversities of 0.338 and 0.335, respectively.

Table 2 Nucleotide diversity between the pol and env concatenated consensus sequences.
Table 3 Haplotype diversity estimates from PredictHaplo results.

Drug resistant mutations

The consensus concatenated gene regions (pol(c) and env(c); see Methods for definition of “(c)”, which in short stands for consensus) for each participant were used to evaluate the presence of Drug Resistant Mutations (DRMs) (Table 4). The PR gene from participants had the fewest DRMs (1), compared to RT (25) and int (11) genes. The majority of DRMs were NRTI, NNRTI, and RT surveillance DRMS (SDRMs). Together, they included 12 to 24 resistant participants and 15 to 44 total DRMs (9 to 19 unique DRMs). Finally, 34 participants (50.0%) showed at least one mutation and 24 (35.3%) showed at least two different DRMs. Of the ARV treatment naïve participants, none were found to have a DRM present in the PR and RT genes, and only a single participant was found to have an IN Accessory DRM at amino acid 157. This treatment naïve individual with a DRM was a part of a transmission cluster for all genes except the haplotype V1V2 gene region. An overall DRM prevalence of 1.4%, 35.7%, and 15.7% was estimated for PR, RT, and int, respectively. The 40–49 year-olds in our study had the highest prevalence of DRMs (22.9%), while the 20–29 year-olds in our study did not show any DRMs. Overall, DRMs caused amino acid changes in only one codon position in PR, while 22 and 9 different codons positions changed in RT and int, respectively, when analyzing the consensus sequences. However, more codons were affected by an amino acid change in PR (5 codons) and RT (27 codons) when analyzing the haplotype sequences. On average, more DRMs and more unique DRMs were identified in the haplotype sequences (Table 4). Also, one young adult (20–29 years old) contained DRMs in RT, and one treatment naïve participant contained DRMs in RT. Slightly more participants had a haplotype sequence that showed at least one DRM (37 participants; 54.0%) and at least two different DRMs (30; 44.1%).

Table 4 Drug Resistant Mutations.

FUBAR analysis, which identifies nucleotide positions under positive selection, identified inferred two, three, and four codons under positive selection in the PR gene, RT gene, and int gene, respectively, when analyzing the consensus sequences (Table 4). Positively selected sites 37 and 57 were inferred by FUBAR for PR and sites 35, 83, and 162 for RT. Amino acid positions 201, 216, 265, and 283 were found to be under positive selection for int. No codons overlapped between the FUBAR analysis and DRM analysis for any gene. None of the sites predicted by FUBAR are known resistance sites26, suggesting DRMs are fixed in the population. Additionally, FUBAR also found four codons under positive selection for the V1V2 and V3 genes (Table 4). These codon sites were 8, 15, 34, 53 for V1V2 and 82, 90, 93, 106 for V3. Every IDU participant had a mutation in at least one of the sites predicted by FUBAR in the V1V2 and V3 genes. HRH (55.6%) and MSM (72.0%) participants had a high prevalence of sites under selection as well. Additionally, a total of 51.4% and 17.1% of the male and female participants, respectively, contained a mutation at one of these predicted sites. Furthermore, FUBAR analysis inferred four codons under positive selection in the PR and RT genes and nine codons for int from the haplotype sequences. These sites were 64, 72, 77, and 93 for PR; 35, 85, 102, and 200 for RT; and 201, 206, 211, 218, 230, 256, 265, 283, and 285 for int. The only codon position predicted by FUBAR in the haplotypes was 230 for int, and it is associated with reduced susceptibility to integrase inhibitors (INSTI)27. Seven (4, 34, 41, 43, 44, 45, 64) and eight (22, 40, 84, 92, 93, 95, 105, 111) codons were inferred to be under positive selection for V1V2 and V3 haplotype sequences, respectively. Interestingly, only a handful of inferred positively selected codons overlapped between the haplotype and consensus sequences, those were at position 35 for RT; positions 201, 265, and 283 for int; 34 for V1V2; and 93 for V3.

Transmission clusters

Transmission clusters were assessed by phylogenetic methods and HIV-TRACE, a genetic-distance based clustering method. Phylogenetic methods found support (>70% bootstrap or >0.95 posterior probability) for 33.8% of the sequences associated with six transmission clusters in pol(c) and for 31.9% of the sequences associated with seven clusters in env(c) (highlighted in Fig. 1). All of the clusters were comprised of two to three sequences, except one cluster in pol(c) which had twelve sequences. Ten of the twelve sequences in the large pol(c) cluster contained a DRM within RT. Furthermore, 11 of the 12 sequences in this large cluster were included on the same sequencing run; therefore, we were unable to undoubtfully discriminate between laboratory artifacts or batch effects and HIV infection to interpret this transmission cluster. However, there were other samples from that same sequencing run that did not cluster with these twelve sequences and formed a dyad cluster. The most common DRM was T215C, which does not reduce NRTI susceptibility, and was found in eight of the twelve sequences.

Figure 1
figure 1

Cladogram of the pol and env concatenated genes of Washington, DC showing sex, race/ethnicities, and risk factors in rings. All phenotypes present are represented with different colors, see legend. All sequences were subtype B. Well-supported clades are depicted. MSM = men who have sex with men; HRH = heterosexuals; IDU = injection drug users; UNK = unknown; OTH = other. Numbers correspond to the de-identified participant.

HIV-TRACE grouped only a few sequences into two clusters for pol(c) (12.8%) and into a single cluster for env(c) (2.9%) (Fig. 2I). More transmission clusters were estimated with the haplotypes reconstructed from PredictHaplo (Fig. 2II, Table 5). For PR, RT, and int (genes also used in past DC HIV studies28,29), 82.1% of our participants were incorporated into transmission clusters. Unique to our dataset was the use of envelope to predict transmission clusters; 35.8% of our participants were included in V1V2 and V3 transmission clusters. Haplotypes were not concatenated for this analysis, but little overlap of cluster composition was found between gene regions. Often one participant’s haplotype would cluster with another participant in one gene region, but it would cluster with a different participant in a different gene region. Thus different gene regions displayed different transmission clusters that would not have been detected if one only analyzes a single gene. Clusters predicted in all gene regions were mainly composed of two participants (42 of the 55 total clusters). Likewise, the majority of the transmission clusters predicted by past DC HIV studies28,29 were comprised of two or three participants, regardless of the gene region, clustering estimation method, or haplotype/consensus construction approach.

Figure 2
figure 2

Cluster network (HIV-TRACE) of (I) consensus pol and env genes and (II) haplotypes reconstructed with PredictHaplo for PR, RT, int, V1V2 and V3 genes of Washington, DC participants by risk factor and race/ethnicity. Numbers correspond to the de-identified participant. Some participants have multiple HIV-1 haplotypes.

Table 5 Characteristics of transmission clusters with HIV-TRACE and comparison between different genes.

Discussion

Collectively, our study aimed to characterize the local and recent phylodynamics of a subset of DC Cohort participants from Washington, DC metro area living with HIV. By combining clinical and behavioral data with NGS data, we were able to identify transmission clusters across groups with different demographics and risk behaviors. Additionally, this study reconstructed sequence variants present within a participant (i.e., intra-host) and investigated associations between the reported participant characteristics and transmission clusters.

Our population genetic estimators indicate that HIV-1 env(c) is more genetically diverse than pol(c). Similar diversity estimates were found in other studies19,29,30,31. Our estimate of genetic diversity for pol(c) in DC (θ = 0.079) was lower than those reported for subtype B in Pérez-Losada et al.29 (θ = 0.084 and 0.090), but higher than those currently reported for the US subtype B sequences in Los Alamos HIV-1 database (θ = 0.075 for int; θ = 0.067 for PR/RT). This same trend was seen in env(c), where our DC HIV-1 genetic diversity (θ = 0.202) was greater than that reported for V1-V3 from Los Alamos HIV-1 database (θ = 0.168) across the US. Haplotype diversity estimated from the PredictHaplo results also found env genes more genetically diverse than PR/RT and int.

Notably, there were interesting differences in diversity estimates among risk groups. Participants who were infected through injection drug use (IDU) were found to have about half of the diversity of MSM and HRH participants. This low diversity, potentially associated with low multiplicity of HIV-1 infection, is not unusual within IDU individuals32,33. In both pol(c) and env(c), males also showed higher genetic diversity, which could be attributed to half (51%) of the males in this study being infected through sex with other men (16% of the males had an unknown risk factor). Our measurements of HIV-1 diversity in HRH were also high and similar to those of MSM; however, we did not observe differences in HIV genetic diversity by race or ethnicity.

The HIV-1 subtype B epidemic in DC is highly diverse, and our results here agree with previous conclusions suggesting a mature epidemic29,34. High genetic diversity could result from risk groups intermingling and viral strains being exchanged and the transient nature of the DC metro area population. DC is an international stop for some, a temporary residence for others, and home for many. This constant influx of incomers could have a boosting effect on the DC HIV-1 population by consistently introducing new viral strains into the pool. Treatment and vaccine development can be compromised by high genetic diversity. As HIV-1 continues to evolve and, as seen here, high genetic diversity levels are kept constant over time34, resistance to vaccines and ART drugs may increase, which could ultimately lead to treatment and prevention failures35.

A Drug Resistant Mutation (DRM) prevalence of 48.6–54.0% was detected in this subset of the DC Cohort, depending on use of consensus sequence or haplotypes. Lower DRM prevalence rates were previously reported for the DC area28,29,34,36 (17.3–37.9% between 1994 and 2016). A much higher rate (66%) was reported in a smaller study of ART treatment-naïve and experienced pediatric patients in Rhode Island37. DRM rates over 50% are also seen in large sequence databases in the UK and Switzerland29. Our study found fewer codons affected by a DRM and fewer DRMs in our participants than previous studies28,29 of the DC epidemic. We only found 32 and 38 codons to be affected when analyzing consensus sequences and haplotypes, respectively, whereas Pérez-Losada et al.29 found 83 codons affected for subtype B, however that study contained 20 times more sequences than ours. Moreover, our study found three novel DRM sites that were not identified in either previous study: P145PAST (IN Major) and A128APST, Q146QH (IN Accessory).

More recently, a study by Kuhnert and colleagues38 reported on the fitness of fourteen HIV-1 resistance mutations, of which seven were detected in RT in our study. Three of the seven DRMs (codons: 41 L, 67 N, 184 V) were NRTI-related and the other four DRMs (codons: 103 N, 108I, 138 A, 181 C) were NNRTI-related. Six DC Cohort participants contained the 184 V DRM, which was found by Kuhnert et al.38 to have the highest transmission cost – i.e., the success (low transmission cost) or lack thereof (high transmission cost) of transmission of hosts infected by drug resistant strains. Because of this high cost to the virus, the mutation resulted in very short transmission chains despite evolving frequently under treatment failure38. Of the six DC Cohort participants, all were included in a cluster when using the haplotypes, but none of the participants clustered with each other. Additionally, this mutation was found to be persistent in the DC HIV-1 viral population since 200534. Previous studies showed that the most important NNRTI mutation currently is 103 N because of its connection to first-line treatment failure39,40,41; a quarter of our DC Cohort participants with one or more DRMs contained this mutation, which was also found to be at a low frequency in the DC HIV-1 viral population since 200534.

In treatment-naïve participants, both when using consensus sequences and haplotypes, we estimated a low prevalence rate of DRMs (14.3%). Similarly, low prevalence rates have been seen in the past in the US (42; 15% between 1999 and 2011) and even lower in treatment-naïve individuals in Europe (41,43,44,45; 10% between 2001 and 2013). Moreover, we also found fewer DRMs in treatment-naïve participants compared to a recent study of PLWH in DC28 (22.5% between 1994 and 2013). Kassaye et al.28 also observed a downward trend of overall prevalence of DRMs over time in treatment-naïve individuals. Our results suggest a further decrease in overall prevalence of DRMs in the current DC HIV-1 epidemic. A lower prevalence of DRMs in surveillance versus targeted treatment-naïve studies could result from sampling design. A surveillance study of DC would likely provide the more accurate picture of DRM trends in the population.

Novel sites that continue to evolve in the DC epidemic and have not become fixed in the population are of serious concern for future drug therapy and conferring resistance to these drugs. More and different codons were predicted to be under positive selection in the haplotype sequences than the consensus sequences. FUBAR predicted sites in V1V2 are likely being impacted by the immune system, which the virus is actively trying to evade (diversifying selection). V3 is associated with co-receptor binding46; codons predicted here could be rising advantageous mutations by HIV-1 to adapt to the host cells’ response against the virus. Five codons (PR: 37, RT: 35, and int: 201, 265, 283) were identified by both our study, in both haplotypes and consensus sequences, and Pérez-Losada et al.29 as sites under selection for subtype B. Since none of these sites corresponded to any known Stanford DRMs, these may be newly evolving resistance mutations in the DC HIV-1 epidemic. Given that all of our participants were on dual- or multiple-drug regimens, these sites may also be indicative of potential escape mechanisms by the virus in response to multiple-drugs treatments. Thus, these amino acid replacements are candidates for fitness testing with and without associated drugs to infer their ability to confer drug resistance, their relative fitness status in different environments, and their transmissibility across individuals.

Additionally, identifying transmission clusters is critical to recognizing groups who may be at risk of contracting HIV-1 or who may already be infected but are not yet aware of their diagnosis. Phylogenetic studies suggest that transmission clusters greatly contribute to the spread of HIV-1 within the population47; therefore, identifying high-risk groups, whether that is based on risk behavior or geographical location48, can help public health officials to better target prevention efforts and treatment options. The spread of infection is often associated with early HIV-1 infection47, consequently, molecular surveillance of the DC epidemic should continue in order to identify potential areas or clusters of transmission and, thus, help lower the HIV-1 incidence.

Towards this goal, we combined NGS sequence data with clinical characteristics to obtain a dynamic picture of the evolution of HIV-1 in the DC metro area. We detected high levels of clustering using haplotypes (32.8–64.2%, not including the V1V2 region), as also seen in other HIV-1 cohorts (24–65%)29,35,49,50,51,52,53,54,55,56,57,58,59,60,61. Other studies, including one completed with Sanger sequencing in DC28,31,36,52,62, however, have found lower levels of clustering (7–17%), in agreement with those reported here for the consensus sequence phylogenetic clusters (pol(c) and env(c)) and the V1V2 region for haplotypes (9.0%). Therefore, a more comprehensive understanding of HIV-1 transmission events in DC has been achieved when evaluating multiple genes together, rather than primarily focusing on polymerase genes that are typically screened for DRMs in clinical settings or used in investigations at the local health department level. By excluding envelope genes, informative transmission events can be missed, which could hinder community health prevention and intervention efforts. In an ideal setting, using all the genetic information available would be most favorable when investigating local HIV-1 phylodynamics.

In agreement with a recent study of HIV-1 transmission clusters in Chicago59, we also found association of risk factors within clusters. More HRH participants fell in our haplotype transmission networks compared to MSM, IDU, and participants with unknown risk (HRH = 23, MSM = 18, UNK = 11 each & IDU = 6). A total of 58.3% of the clusters that included an HRH participant also had an MSM participant. Likewise, a US study that included 12 major US cities63 found transmission clusters that contained overlap between participants who were MSM and HRH. Mixing of risk types in HIV-1 subtype B transmission clusters has also been observed in Switzerland, Iceland, and Nordic European countries60,64,65. Contrarily, Kouyos et al.65 found segregation based on location among individuals who were included in a transmission cluster despite having overlapping risk factors. Risk groups may be mixing due to underreporting of risk behaviors or bisexual behavior65,66,67. This heterogeneity of risk groups in transmission clusters suggests that focusing on individuals within city areas (e.g., wards in Washington, DC) to concentrate resources and information may help in addressing the HIV-1 epidemic.

Otherwise, we were unable to determine the mode of transmission for the “unknown modes of transmission” group (16.2% of our sample). Nonetheless, our results suggest that the mode of transmission may not be as important for prevention and intervention efforts as the location where transmission events are occurring. Likewise, Morgan et al.59 suggested not targeting efforts towards risk groups, but rather age groups, particularly younger people, in Chicago. The average age of the DC participants included in a haplotype transmission cluster was 46.6 years of age. If DC’s younger population is being the most affected, as suggested by the new cases identified by the DC DOH in 2016 and 20171,3, taking a spatial dynamic approach to intervention with continued surveillance may help. Through surveillance studies, further adapted location-based prevention efforts can be employed.

Notably, our analysis has some limitations. We included only 68 participants of the approximately 10,000 people enrolled in the DC Cohort3, whereas past studies conducted in DC included 700 (Kassaye et al.) and 1,500 participants (Pérez-Losada et al.). However, the demographics in our sample size are similar to PLWH in DC3,4. As a prospective study conducted as part of an ongoing HIV-1 surveillance program associated with the DC Cohort, we capitalized on all the current cases that met our inclusion criteria (see Matierals and Methods: DC cohort). These past studies of HIV-1 diversity in Washington, DC were historical in nature and, therefore, had larger sample sizes available. Our study also applied a powerful next-generation sequencing approach instead of Sanger sequencing (previous DC studies), to characterize the current HIV-1 epidemic. With the implementation of NGS, mapping very diverse short reads to a reference genome poses alignment issues14, which can add difficulty to downstream analyses. We circumnavigated this alignment issue by using HAPHPIPE, where the reads were mapped against a tailored reference genome, thus resulting in higher alignment rates and fewer errors. Nonetheless, aligning very diverse reads still remains an issue. Although new sites under selection were identified using NGS, their clinical relevance as potential DRMs requires further validation. We also recognize that we used conservative genetic distance cutoff values for determining transmission clusters, which could result in lower numbers of transmission clusters68; however, this conservative estimate reduced the number of false positive transmission clusters. Finally, due to the nature of predicting transmission clusters and the potential for missing individuals, we are not able to determine the direction of infection within the transmission clusters. We were also unable to rule out batch effects or laboratory artifacts accounting for any transmission cluster which participants were included in the same sequencing run.

Conclusions

This study showed that NGS and epidemiological data can be used to characterize the current phylodynamics of a subset of people living with HIV, enhancing our understanding of the diversity and local dynamics of the HIV-1 epidemic in the DC area. HIV-1 diversity in DC is high and seems to remain stable over time. Furthermore, NGS of the envelope gene provided sufficient coverage to compare transmission cluster inference across HIV-1 gene regions14,20. Additional transmission clusters were identified when using HIV-1 intra-host haplotypes instead of consensus sequences, which led to networks linking a higher number of participants. Moreover, transmission clusters varied across genes, with each gene suggesting a different transmission story. Hence, using multiple HIV-1 genes or whole genomes is recommended to infer more reliable transmission clusters. Inferred clusters should then be linked to locations in DC to target transmission intervention efforts. Additionally, HIV-1 drug resistance was only found when using haplotypes in a single young adult in our cross-sectional sample of the cohort. Future studies should also focus on age groups and geographic regions rather than only risk factors. As the DC area maintains significant rates of HIV-1 infection, integrating present and past molecular data from previous studies conducted in DC in 201729 and 201328 will help to paint a comprehensive picture of the HIV-1 transmission and evolution of drug resistance in this high prevalence urban U.S. city. Future HIV-1 phylodynamic studies should also include more participants, particularly young adults, and newly diagnosed persons to provide a comprehensive view of DRM prevalence in treatment-naïve individuals in the DC area. Studies revealing the severity of transmitted drug resistance in the DC population may provide physicians and public health workers with additional information to design more effective treatment plans for newly diagnosed individuals and intervention strategies for targeted key populations.

Materials and Methods

Ethics

Institutional Review Board (IRB071029) approval was obtained from The George Washington University IRB (which serves as the IRB of Record for eight of the participating sites), the DC DOH IRB, and the remaining site IRBs. Informed consent was obtained and documented prior to conducting study procedures. Sample collections from participants were performed in accordance with relevant guidelines and regulations.

DC cohort

Participants from the DC Cohort were recruited for this molecular epidemiology sub-study from January 2016 through May 2017. Eligibility criteria included current DC Cohort enrollment, ≥18 years of age, HIV-1 diagnosis within prior 12 months of enrollment or detectable HIV-1 viral load of ≥1,500 copies/mL, ability to provide written informed consent, and completion of a behavioral survey; a total of 104 participants met the eligibility criteria. Blood samples were collected at the clinical sites and transported to George Washington University for processing, targeted amplification, library preparation and NGS. Sample sequences were paired with clinical and demographic data retrieved from the database from the DC Cohort (Table 1). Clinical and demographic characteristics collected included age, race/ethnicity, sex at birth, gender, country of birth, state of residence, zip code, HIV-1 risk factor, presence of co-infections (e.g., chlamydia, gonorrhea, syphilis, trichomoniasis, and Hepatitis B and C), duration of infection, CD4 count, viral load, ART exposure, ART regimen type, date of sample, and date of HIV-1 diagnosis. The paired data were de-identified and analyzed using the approaches described below.

Next-Generation sequencing

Total RNA was extracted from each patient’s plasma sample, and cDNA synthesis followed. The QIAamp Viral RNA Mini Kit (Cat. #52904, Qiagen, Gaithersburg, MD) and the SuperScript IV First-Strand Synthesis System (Cat. # 18091050, Invitrogen, Carlsbad, CA) were used respectively and according to manufacturers’ instructions. Multiple sets of HIV-1 specific primer pairs were used to target and amplify using polymerase-chain-reaction (PCR) the protease (PR), reverse transcriptase (RT), integrase (int), and envelope (env) HIV-1 genes (~43% of genome)36. Library preparation was completed with Nextera XT Library Prep (Cat. # 15032350, Illumina, Dan Diego, CA). Samples were then sequenced on eight runs on an Illumina MiSeq platform using the MiSeq v2 (300 cycles) chemistry (Cat. # MS-102–2002, Illumina). Both library prep and sequencing were completed according to the manufacturer’s instructions. All DNA sequence files are available from the GenBank database under SRA accession: PRJNA517147.

Sequence analyses

The raw sequence data for each patient were processed through HAPHPIPE (https://github.com/gwcbi/haphpipe), a HAplotype reconstruction and PHylodynamics PIPEline for genome-wide assembly of viral consensus sequences and haplotypes69. Briefly, HAPHPIPE includes modules for quality trimming, error correction, assembly, and haplotype reconstruction. We put the raw sequencing FASTQ files through quality control and quality trimming with Trimmomatic70. Error correction of the reads was completed with an earlier version of HAPHPIPE that used BLESS71, and the cleaned reads were mapped against the current HIV-1 subtype B reference sequence HXB2 (Genbank accession: K03455)72. Through iterative refinement, the cleaned reads were then mapped back to the reference sequence generated in the mapping step with Bowtie273. This iterative refinement step was completed twice, first using only a random subsampling of the reads (25% subsampling) and the fast-local mapping option to speed up the computational time, and second using all of the sequence reads and the very sensitive mapping option to further refine the individually crafted reference sequence. A consensus sequence was generated from the refined reference sequence, and a final refinement step was concluded with BLAST74 against this refined consensus sequence. The resulting sequences were filtered to include participants that contained a passing amplicon, defined as having > 95% of the amplicon covered by 10x or greater read coverage. Amplicons that did not pass this filter were removed, and this subset was then used for subsequent phylodynamic analyses (Table 1).

Sequence data for each PCR amplicon (PR/RT, int, and env) were aligned individually using MAFFT with the L-INS-i algorithm75 in Geneious (ver. 9.1.6)76. Protease (PR) and reverse transcriptase (RT) were extracted from the PR/RT amplicon, and env was further divided into the variable regions: gp120 V1V2 (HXB2 coordinates: 6615–6812) and gp120 V3 (HXB2: 6984–7349). PR, RT, and int were concatenated into the pol(c) gene region, and V1V2 and V3 were concatenated into the env(c) gene region. Concatenated gene regions will be distinguished from whole gene regions by adding “(c)” to the end of the gene name. Each gene (PR, RT, int, V1V2, and V3) was extracted from the amplicon data to fulfill different purposes: (1) remove any nucleotides belonging to other genes, for example the env PCR amplicon contained primer sequences, and therefore nucleotides belonging to the vpu gene region; (2) simulate amplicon sizes that could be covered end to end by paired-end reads to be consistent and comparable to future NGS studies with HIV-1 when using PrimerID or other local haplotype phasing techniques8,77; and (3) account for differences in PCR performance between and within samples by extracting a common, high-coverage region. Therefore, missing data in this dataset were low, and often only due to amplification failure of an entire amplicon. Concatenating the genes into their respective gene regions (pol and env) retained variants in genes that are often studied, such as protease and reverse transcriptase for drug resistant mutations. It also allowed comparisons to past studies based on Sanger sequencing that used either parts of genes or whole genes. Our overall goal was to keep the integrity of the individual genes while using as much of the NGS data as possible.

Identification of subtypes and drug resistant mutations

HIV-1 subtype identification was completed for each concatenated gene region (pol(c) and env(c)) using the REGA subtyping tool (version 3)78,79. A total of 170 subtype reference sequences from the Los Alamos HIV-1 database (LANL; http://www.hiv.lanl.gov/) were included to assign the patient sequences to a particular subtype clade and validate the findings from REGA using phylogenetic methods described below. Drug resistant mutations were identified aligning the consensus concatenated gene nucleotide sequences with reference strains in the Stanford HIV Drug Resistance Database (https://hivdb.stanford.edu) using the HIVdb program80. Nucleotide positions under positive selection were identified using Fast Unconstrained Bayesian AppRoximation (FUBAR)81 in HyPhy82. Recombination in our HIV-1 data was accounted for with GARD83,84.

Phylogenetic analyses

Phylogenetic estimations were completed for each concatenated gene region. The best-fit model of molecular evolution85 was estimated for each pol(c) and env(c) from the data using jModelTest286 in CIPRES Science Gateway87. Amino acid positions corresponding to identified DRMs described above were removed prior to phylogenetic estimations to avoid potential bias due to selection. A maximum likelihood phylogenetic estimate using RAxML88 was made for each region with the 3 codon-position partitions89. The branch support for the RAxML phylogenetic trees was estimated with a bootstrap approach with 1,000 replicates90. Bayesian trees were inferred using MrBayes91. Four Markov chains (one cold and three heated) were run for 8 × 108 generations sampling every 2,000 steps for each gene region, and each run was repeated twice. The output was analyzed in Tracer92 to assess convergence and mixing of the chains. Subtypes references for subtype D (GenBank accessions: K03454, AY371157, AY253311, U88824) and circulating recombinant forms CRF28,42-BF (GenBank accessions: FJ213781, FJ358521, FJ670529) and CRF10-CD (GenBank accessions: AF289548, AF289549, AF289550) were pulled from LANL and used as proper outgroups for the phylogenetic analyses93. Additional RT sequences from DC29 were included to observe how our data related to other DC sequences. We visualized the epidemiological data on the resulting trees with the Interactive Tree of Life (iTOL).

Haplotype reconstruction

For the identification of transmission clusters and testing for associations of clinical variables to transmission clusters, it is ideal to characterize within patient viral variation as individual sequence variants (haplotypes) instead of combining all of the individual reads into a single consensus sequence94,95. Therefore, haplotypes for each patient were predicted from the sequence data using HAPHPIPE’s haplotype stages. Haplotype reconstruction was performed on each PR/RT, int, and env targeted PCR amplicons using PredictHaplo96. Each gene region (PR, RT, V1V2, and V3) was then extracted from the corresponding targeted amplicon and, using the methods described below, transmission clusters were estimated using the predicted haplotypes. No concatenation of the individual genes to form the regions pol(c) and env(c) was done with the haplotypes.

Identification of transmission clusters

Transmission clusters were assessed for each pol(c) and env(c), as well as for each of the gene regions with the haplotypes, using phylogenetic methods89,91 and the genetic-distance based clustering method HIV-TRACE97. Phylogenetic transmission networks were defined as clades with bootstrap proportions ≥70 or posterior probabilities ≥95%. Genetic distance thresholds of 0.0129,62 and 0.0298 substitutions/site were used for pol and env in HIV-TRACE, respectively, to identify potential transmission events. Ambiguities were handled with the HIV-TRACE option “average” to avoid biases and false positives and minimum overlap was 1/genetic distance threshold and adjusted for size of amplicon, as recommended. Default settings were used for the remaining parameters. Transmission clusters were compared between gene regions.

Diversity estimation

Haplotype diversity (h), the number of segregating sites (S), nucleotide diversity (π), and Watterson’s genetic diversity (θ) (see99) were estimated for both the consensus pol(c) and env(c) regions per patient using DnaSP (ver. 6.11.01)100. Haplotype diversity (h), which takes into account the number of haplotypes and their relative frequencies, was also estimated from PredictHaplo results according to Nei and Tajima101. Both diversity estimates were used as the number of haplotypes estimated from DnaSP is representative of inter-patient diversity, whereas the number of haplotypes and haplotype diversity estimates from the PredictHaplo results are representative of intra-patient diversity. Significance for haplotype diversity between clinical variables was measured with the Wilcoxon rank sum test or Kruskal-Wallis test in R v 3.6.0102 using RStudio v 1.2.1335103.

Ethical approval

Written informed consent was obtained from all participants prior to enrollment in the DC Cohort and the molecular epidemiology sub-study. The DC Cohort and molecular epidemiology studies were approved by the Institutional Review Board at The George Washington University, which serves as the IRB of record for Whitman-Walker Health, La Clinica del Pueblo, Family and Medical Counseling Service, Unity Health Care, The GW Medical Faculty Associates, MetroHealth, and Children’s National Health System (pediatric and adolescent clinics). The study was independently approved by the IRBs of record at Howard University Hospital (adult and pediatric clinics), MedStar Washington Hospital Center, Georgetown University, and the Veterans Affairs Medical Center.