HIV‐1 proviral landscape characterization varies by pipeline analysis

Abstract Introduction HIV rebounds after cessation of antiretroviral therapy, representing a barrier to cure. To better understand the virus reservoir, analysis pipelines have been developed that categorize proviral sequences as intact or defective, and further determine the precise nature of the sequence defects that may be present. We investigated the effects that different analysis pipelines had on the characterization of HIV‐1 proviral sequences. Methods We used single genome amplification to generate near full‐length (NFL) HIV‐1 proviral DNA sequences, defined as amplicons greater than 8000 base pairs in length, isolated from peripheral blood mononuclear cells (PBMC) of treated suppressed participants with HIV‐1. Amplicons underwent direct next‐generation single genome sequencing and were analysed using four HIV‐1 proviral characterization pipelines. Sequences were characterized as intact or defective; defective sequences were assessed for the number and types of defects present. To confirm and extend our findings, 691 proviruses from the Proviral Sequence Database (PSD) were analysed and the ProSeq‐IT tool of the PSD was used to characterize both the participant and PSD proviruses. Results and discussion Virus sequences derived from thirteen ART‐treated virologically suppressed participants with HIV were studied. A total of 693 HIV‐1 proviral sequences were generated, 282 of which were NFL. An average of 53 sequences per participant was analysed. We found that proviruses often harbour multiple sequence defect types (mean 2.7, 95% confidence interval [CI] 2.5, 3.0); the elimination order used by each pipeline affected the percentage of proviruses allotted into each defect category. These differences varied between participants, depending on the number of defect categories present in a given provirus sequence. Pipeline‐specific differences in characterizing the HIV‐1 5′ untranslated region (5′ UTR) led to an overestimation of the number of intact NFL proviral sequences, a finding corroborated in the independent PSD analysis. A comparison of the four published pipelines to ProSeq‐IT found that ProSeq IT was more likely to characterize proviruses as intact. Conclusions The choice of pipeline used for HIV‐1 provirus landscape analysis may bias the classification of defective sequences. To improve the comparison of provirus characterizations across research groups, the development of a consensus elimination pipeline should be prioritized.


| INTRODUCTION
During HIV-1 infection, viral DNA integrates into the genomes of CD4 + T cells and can persist during antiretroviral therapy [1][2][3]. Previous work has investigated what proportion of integrated sequences are intact and therefore may contribute to the replication-competent HIV-1 reservoir [4][5][6][7][8]. The vast majority, greater than 90%, of integrated proviruses are defective and cannot support virus replication [4,6]. When characterizing intact proviral sequences, the nature of the defective proviruseswhether due to large deletions, hypermutation, inversions, premature stop codons or other categories of defectsis typically also reported. These provirus sequence defects arise through different mechanisms that include HIV reverse transcriptase-induced mutations, template switching during reverse transcription and APOBEC3-mediated cytidine deamination that causes guanine-to-adenine (G-to-A) hypermutation [4,9,10].
Current proviral landscape research focuses primarily on the dichotomy between defective versus intact sequences, though "defective" proviruses may still retain transcriptional and translation activity that is relevant to HIV eradication efforts [8,11]. In general, research groups that characterize provirus sequences use custom in-house analysis pipelines. Some laboratories categorize amplified DNA sequences manually, whereas others have developed automated, scripted pipelines [5,7]. While most pipelines include all the classic provirus defect categories (e.g. hypermutation, large deletions, etc.) the order in which defective proviruses are categorized and removed from the pipeline may differ. It is therefore possible that different research groups may arrive at similar conclusions as to the percentage of intact proviruses, whereas the proportion of proviruses assigned into each defect category varies. If true, the direct comparisons of HIV provirus characterizations across research groups may be more difficult and could compromise a deeper, meta-analytic understanding of the defective nature of the HIV reservoir. Furthermore, recent work has moved beyond a dichotomous classification of intact and defective, focusing on forces that shape the proviral landscape; for example cytotoxic T lymphocytes (CTLs) may preferentially target cells with hypermutated or PSI/MSD defective proviruses [11].
To determine the effect that a particular analysis pipeline may have on defective HIV-1 proviral characterization, we categorized the proviral HIV-1 landscapes of 13 HIV-1 study participants using elimination pipelines from four research groups and explored how the proportion of sequences assigned to each defect category varied as a function of the particular analysis pipeline used. To confirm our findings in this participant cohort and enhance the significance of our conclusions, we performed a similar analysis using sequences retrieved from the Proviral Sequence Database (PSD), a freely available online resource. We found that proviral sequences can typically be assigned into multiple defect categories and the order of elimination of the pipeline matters. Not only does the proportion of sequences within each defect category change with the pipeline used, but differences in how defects are determined by each pipeline has an unexpected effect on the reported percentage of intact HIV-1 proviral sequences.

| Participant cohort
Cryopreserved peripheral blood mononuclear cells (PBMC) were obtained from the HIV Eradication and Latency (HEAL) cohort, a longitudinal repository of samples collected from participants with HIV. Peripheral venipuncture samples were collected between 2015 and 2018 and PBMCs were isolated by Ficoll-Hypaque density gradient centrifugation. All participants gave informed consent prior to enrolment; the HEAL study was approved by the Partners Human Research Committee. Treated virologically suppressed participants with HIV were included in this analysis based on the availability of a requisite number of PBMC.

| Proviral amplification and sequencing
Total DNA isolated from participant PBMC samples was subjected to limiting dilution to determine the dilution at which less than 33% of wells were positive for amplified product prior to near full-genome amplification with Platinum Taq HiFi polymerase (Thermo Fisher Scientific). HIV-1 proviruses were amplified using previously published PCR conditions and primers [7,12]. The primers were as follows: first-round forward primer: 5 0 -AAATCTCTAGCAGTGGCGCCCGAACAG-3 0 (HXB2: 623-649); first-round reverse primer: 5 0 -TGAGGGATCTCTAGTTACCAGAGTC-3 0 (HXB2: 9662-9686); second-round forward primer: 5 0 -GCG-CCCGAACAGGGACYTGAAARCGAAAG-3 0 (HXB2: 638-666); and second-round reverse primer: 5 0 -GCACTCAAGGCAAG-CTTTATTGAGGCTTA-3 0 (HXB2: 9604-9632). A convenience endpoint of 20 amplified NFL sequences (>8000 base pairs [bp]) per participant was set. This goal was attained for all participants with the exception of HEAL 09, where low levels of total HIV-1 DNA limited the number of NFL amplicons to nine. PCR amplicons were directly sequenced by next-generation single-genome sequencing (NG-SGS) by the Massachusetts General Hospital Center for Computational and Integrative Biology [MGH CCIB] DNA Core. Briefly, prior to library preparation, the DNA sample was mechanically sheared and Illumina compatible adapters with unique barcodes were ligated onto each sample during library construction. Libraries were pooled in equimolar concentrations for multiplexed sequencing on the Illumina MiSeq platform with 2x150 run parameters, generating approximately 50,000 -80,000 reads and well greater than 30x coverage of these HIV-1 amplicons. Sequencing runs routinely generated >90% Q score 30 (Q30) data.

| Provirus analysis
Sequenced proviruses were manually analysed using Geneious Prime (www.geneious.com) and classified as either containing a large deletion (<8000 bp in total length), hypermutated, carrying a six nucleotide or longer deletion in the 5 0 -untranslated region (UTR), carrying an internal deletion (INDEL), having inversions, having stop codons, carrying a frameshift mutation, carrying a deletion in one of the four stem loops of the packaging signal (PSI) and/or a mutation in the multiple splice donor (MSD) site, or intact. Only sequences with a length greater than 8000 bp were analysed for the other defect categories. Proviruses with APOBEC3G-mediated hypermutation were identified using the Los Alamos National Laboratory (LANL) HIV Sequence Database Hypermut algorithm [13]. The LANL HIV Sequence Database Gene Cutter tool was used to identify stop codons and frameshift mutations [14]. INDELs were identified using the LANL HIV Sequence Database HIVAlign tool and defined as an equal or greater than 2% difference in the sequence of each HIV gene [15]. 5 0 -UTR deletions (set at 6 or more deletions), PSI deletions and MSD mutations were identified by aligning all the sequences to the HIV-1 HXB2 reference sequence (Geneious Prime 2019.1.1) and checked for deletions and mutations [16]. Inversions were identified by aligning each proviral sequence to HXB2 and comparing them using the Dotplot viewer in Geneious Prime [16]. Proviral sequences that lacked any of these defects were labelled as intact.

| Proviral sequence database and ProSeq-IT
Proviral sequences from virologically suppressed HIV-1 participants on ART were downloaded from the Proviral Sequence Database (https://psd.cancer.gov) (PSD) maintained by the National Cancer Institute (NCI) [17]. A total of 691 proviruses were retrieved; 129 were NFL sequences. Intact and defective sequences were manually determined and the proviral landscape was characterized with the four pipelines. Additionally, the Proviral Sequence Annotation and Intactness Test (ProSeq-IT) of the NCI PSD independently analysed the 282 NFL sequences from the 13 HEAL participants and the 691 NFL sequences from PSD [18].

| Participant characteristics
HIV-1 proviral sequences were amplified from PBMC DNA isolated from thirteen participants. The participants' characteristics are shown in Table 1. These participants were almost exclusively male with a mean CD4 count of 807 cells/mm 3 (SD = 393). Participants were receiving antiretroviral therapy, comprised of at least three active drugs, for a minimum of 6 months; all had plasma HIV-1 RNA levels that were below the limit of detection of a commercially available virus load assay.

| Characterization of HIV-1 proviral defects in the HIV-1 participants
We assessed the proviral characterization pipelines from three research groups, as well as our own, based on their published descriptions and personal communication ( Figure 1) [4,5,7]. We used the same definition of a given defect category across all pipelines, unless noted, but acknowledge that our stringency may differ from that originally used by each research group. For instance Hiener et al. define all proviral sequences below 8400 bp as containing a large deletion, whereas we used an 8000 bp cut off.
To understand how different analysis pipelines affect sequence identification and categorization, we used each of the pipelines to analyse 693 HIV-1 proviral sequences isolated from the thirteen participants. An average of 53 sequences were analysed per participant; 12 of 13 participants had at least 20 NFL sequences. Of the 282 NFL sequences, 229 (81.2%) carried more than one type of defect and the distribution of the number of defect categories per provirus sequence varied across participants ( Figure 1b). On average, a defective sequence could be assigned to 2.7 (95% CI 2.5, 3.0) defect categories (median 3 categories). For some participants, for example HEAL 04, 09 and 58, all defective sequences contained two or more defect categories. For HEAL 16 on the other hand, the majority (63%) of defective NFL sequences were assigned to a single category, with the remaining sequences assigned to two, four or five categories.
Hypermutation leads to stop codons, which is why some pipelines combine the two categories or characterize only hypermutation [5,7]. However, for certain participants (e.g. HEALs 01, 16, 25, 30 and 58), we observed instances where stop codons occurred without hypermutation. In HEALs 16, 25 and 30, for example none of the NFL sequences with stop codons displayed G-to-A hypermutation. For HEAL 01 and 58, two and four NFL sequences, respectively, had stop codons not related to hypermutation.
Among the 13 study participants, there were only two instances where we observed a perfect concordance between sequences containing six or more nucleotide deletions in the 5 0 -UTR and those containing a deletion in the PSI stem loops and/or a mutation at the MSD site: HEAL participants 01 and 38. For the other eleven participants, deletions in PSI and/or mutations in MSD were not captured by the more general definition of a defective 5 0 UTR; the more precise the definition of the defect category, the greater the number of sequences that are considered defective (Figure 2). This is illustrated by proviral sequences obtained from participant HEAL 25, where the number of NFL sequences with six or more deletions in the 5 0 -UTR is one, but that number jumps to 12 when assessing for PSI deletions, in this case, due to a single deletion in the conserved hairpin loop of stem loop 1 (SL1) immediately proximal to the palindromic dimerization sequence. Adjusting this category to include deletions in the PSI increases the number of defective sequences in this category for the majority of participants: 01 (0 to 8), 04 (9 to 10), 24 (1 to 7), 25 (1 to 9), 50 (8 to 13) and 58 (0 to 13). Furthermore, not including MSD mutations, as is the case for two pipelines, also inflates the number of intact proviruses in five HEAL participants: 16 (1 to 2), 24 (0 to 6), 25 (0 to 4), 38 (0 to 1), 44 (7 to 11) and 50 (0 to 2).

| The effect of pipeline on defective sequence
proportions Given that 81.2% (229 of 282 sequences) of NFL proviral sequences contain two or more types of defect, the order of elimination used by a pipeline, as well as the inclusion or exclusion of certain categories, may influence the final proviral landscape. To evaluate this, we analysed, on a perparticipant basis, how the percentage of sequences assigned to each category changes as a function of the pipeline used ( Figure 3). We also examined the effect of the pipeline on the cumulative proviral landscape for all 13 participants (Figure 4). How certain pipelines combine HIV-1 sequence defect categories has an influence on the final landscape. The elimination order of the pipeline also influences the final percentage of sequences in each category and this is particularly evident in the case of hypermutation. For instance for HEAL 38, 19 provirus sequences are hypermutated according to three of the pipelines, but this number falls to 9 when sequences are checked for inversions before hypermutations. The same happens for HEALs 01 (11 to 9), 04 (4 to 1), 09 (6 to 2), 19 (15 to 13), 24 (14 to 5), 44 (7 to 3), 50 (10 to 6) and 57 (25 to 12).
An "Inversions" defect category is included in two pipelines, though the order in which it appears is different in each: inversions are the first defect category in the Hiener et al., The percentage of intact and hypermutated genomes is also similar across the four pipelines (between 2.4% and 4.9% for intact and 13.5 and 15.7% for hypermutated) and in line with what has been previously shown for HIV-1 [1,2,4,8]. For the other categories, however, the number of genomes in each category differs substantially. Looking generally at defects in the 5 0 -UTR, the number of genomes in this region varies from 37 (5.1%) to 106 (14.8%) depending on the specificity of the defect category and when it appears in the pipeline. The same is seen with defect categories involving "Stop Codons, " which can represent anywhere from 6.5 to 12.2% of the proviral landscape.

| Characterization of HIV-1 proviral defects in the proviral sequence database
To assess whether the choice of characterization pipeline continues to have an impact on a collection of provirus sequences coming from a number of research groups, where both experimental and sequence analyses differences may lead to interlaboratory variability, we retrieved provirus sequences from the Proviral Sequence Database (PSD). We selected proviruses from HIV-1 participants on ART with undetectable plasma viral loads. A total of 691 provirus sequences were identified, of which 129 were NFL. We then used the three published pipelines, as well as our own, to characterize the proviral landscape of these 691 proviruses ( Figure 5). All pipelines generated similar percentages of sequences with large deletions (81.2%); none of the NFL proviruses had inversions. The number of proviruses assigned to a defect category varied by the stringency of the category, the order in which it appears in the pipeline, and how different types of defects were grouped into a single category. Using the PSD dataset, we recapitulated our finding that the analysis pipeline may affect the percentage of intact sequences, varying two-fold across the pipelines we studied.
The PSD provides a tool (ProSeqIT) that annotates sequences and infers intactness. ProSeq-IT identified 39 (5.6%) of the 691 PSD provirus sequences as intact, a higher proportion of intact sequences than with the four pipelines we tested. To determine how the ProSeqIT tool characterized our NFL HEAL cohort-derived provirus sequence set, we submitted the 282 NFL sequences to ProSeq-IT, which called 6.0% of sequences as intact, again a higher proportion than was determined in the four analysis pipelines. A more detailed analysis identified that this difference in the proportion of sequences called intact was driven primarily by differences in calling PSI deletions, but there were also differences in calling inversions, stop codons/frameshifts and in one case hypermutation.

| DISCUSSION
In this study, our analysis of four proviral landscape characterization pipelines suggests that the choice of pipeline influences both the proviral landscape of an individual participant as well as the cumulative proviral landscape of a study cohort. The pipeline's elimination order can influence the number of genomes allotted into each defect category, as well as the number of sequences deemed intact. This occurs because the majority of amplified provirus sequences are defective in more than one way. The exact category definitions used also influence the final proviral landscape for each participant and, in general, the more specific a defect category, the fewer sequences that are deemed intact by the end of the pipeline.
While the final step in all pipelines defines which proviral sequences are intact, there is marked heterogeneity in the preceding steps. Large deletions and hypermutated sequences are generally identified early in the process. Inversions are categorized either first, in the penultimate step, or are not explicitly categorized. The category for defects in the 5 0untranslated region (UTR) is defined as six or more deletions anywhere along its length or more specifically as deletions in   the PSI packaging signal and, in some pipelines, mutations at the major splice donor (MSD) site. An evaluation for frameshift mutations, nonsense mutations, stop codons and insertions/deletions (indels) occur in all pathways but there are differences across pipelines. The order of these steps varies, as do the permutations on how they may be performed in one combined sequence step. All pipelines examine for the presence of hypermutated sequences, though when it is assessed in the pipeline varies. We found that the further downstream in the pipeline the "Hypermutation" category appears, the fewer sequences are categorized as hypermutated. Given the different pipelines and that the majority of NFL sequences carry at least two types of defects, it is perhaps less surprising that the cumulative proviral landscape varies according to the pipeline used.
We observed an association between certain defect categories that varied across participants. By looking only at hypermutation, sequences with stop codons that were not the result of APOBEC-mediated G-to-A mutations may be considered intact or placed in a different defect category. Alternatively, combining hypermutation and stop codons into a single category of defects may inflate the percentage of the proviral landscape that is deemed hypermutated. We further observed an association between 5 0 -UTR deletions, PSI deletions and MSD mutations. Considering only 5 0 -UTR deletions led our pipeline to erroneously overestimate the number of intact NFL sequences. This demonstrates that the specific pipeline used can influence not just the number of provirus sequences in each category, but also the number of sequences considered intact. We found that a more stringent approach to evaluate the 5 0 -UTR, using both PSI deletions and MSD mutations, led to fewer intact sequences. Work exploring how CTLs shape the proviral landscape have looked at PSI/MSD defects; we found that the percentage of genomes in the 5 0 -UTR defect group varied from 5% to 15% depending on the pipeline, something to be aware of when looking at landscape results across research groups.
Our study has limitations. We did not assess the replication competence of the NFL proviral sequences deemed intact by the four pipelines. A NFL provirus sequence that is intact via the pipelines may prove replication competent in an in vitro experiment. Conversely, the relationship between smaller deletions in sequences labelled defective, for example the SL1 mutations we identified in proviruses and exemplified in HEAL 25 ( Figure 2) and replication competence was not tested.
We acknowledge that research groups may have different defect definitionsor may have since updated their definitions to increase stringency. Inter-laboratory variability may arise from differences in sequence analysis pipelines, alignment strategies, and/or sample processing and sequencing approaches. The choice of primers, polymerase and Poisson frequency target in limiting dilution could influence the final proviral landscape for a participant or a cohort. Intraparticipant and inter-participant sequence diversity may impact these analyses and could be explored in future studies. We tried to control for inter-laboratory differences by running the pipelines on 693 proviruses amplified using the exact same protocol and approach. Unlike our HEAL cohort, the PSD sequences come from a repository of nearly 5000 proviruses, submitted by various research groups and, potentially, impacted by inter-laboratory variability. We concede, a priori, that the best choice of pipeline for provirus landscape analysis will depend on the precise research question being asked and that pipelines provide a measure of inferred intactness that should be assessed for replication competence experimentally.

| CONCLUSIONS
Our results suggest that the field may benefit from a consensus analysis pipeline. We believe a best practice approach starts with eliminating large deletions (retaining sequences >8000 nucleotides), followed by hypermutations, then a category combining deletions in the PSI and mutation in the MSD, followed by a second combined category of internal deletions, stop codons and frameshift mutations in the HIV ORFs and finally inversions. To improve transparency and mitigate bias, research groups should consider analysing NFL proviral sequences for all defect categories, allowing for the identification of provirus sequences that carry multiple types of defects and generating information about how defect categories overlap. While this approach may lead to more complex results, it will provide a richer illustration of the proviral landscape. This may be especially important if a proviral sequence with multiple defects expands clonally. If such proviruses are only analysed for a single defect, the presence of additional accompanying defects may be underestimated or missed.

A C K N O W L E D G E M E N T S
We gratefully acknowledge the efforts and contributions of the HEAL participants and study team. Proviral sequence data will be available through NCBI GenBank, submission in progress.

F U N D I N G
This work was supported by NIH R01AI108538 and R61DA047038 to A.T. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.