Streamlined Subpopulation, Subtype, and Recombination Analysis of HIV-1 Half-Genome Sequences Generated by High-Throughput Sequencing

The highly recombinogenic nature of human immunodeficiency virus type 1 (HIV-1) leads to recombination and emergence of quasispecies. It is important to reliably identify subpopulations to understand the complexity of a viral population for drug resistance surveillance and vaccine development. High-throughput sequencing (HTS) provides improved resolution over Sanger sequencing for the analysis of heterogeneous viral subpopulations. However, current methods of analysis of HTS reads are unable to fully address accurate population reconstruction. Hence, there is a dire need for a more sensitive, accurate, user-friendly, and cost-effective method to analyze viral quasispecies. For this purpose, we have improved the HIVE-hexahedron algorithm that we previously developed with in silico short sequences to analyze raw HTS short reads. The significance of this study is that our standalone algorithm enables a streamlined analysis of quasispecies, subtype, and recombination patterns from long HIV-1 genome regions without the need of additional sequence analysis tools. Distinct viral populations and recombination patterns identified by HIVE-hexahedron are further validated by comparison with sequences obtained by single genome sequencing (SGS).

widely used to analyze viral quasispecies for small genomic regions (Ͻ500 bp). Minority variants present at Ն1% frequency in a sample can be characterized by HTS when sufficiently large numbers of templates are analyzed (33,43). However, when longrange PCR products (Ͼ4,000 bp) are analyzed, it is difficult to reliably identify subpopulations to understand the complexity of a quasispecies because short sequence reads cannot be reliably assembled into full-length consensus sequences for each subpopulation in the sample.
Specialized sequence analysis is required to study the genetic diversity using HTS data. With advancement of analytical techniques, many algorithms have become available to analyze HTS short-read data, but accurate population reconstruction is still elusive. Despite significant improvements in analytical approaches, there has not been significant effort to address this question, and each of the current approaches has different limitations. For example, traditional algorithms like ShoRah (44) can align only a limited number of reads (up to tens of thousands) and call haplotypes based on the center of clusters. ViSpA (45) and QuRe (46) were designed to analyze reads generated by pyrosequencing and focus on identifying insertions and deletions as well as errors in homopolymeric regions. PredictHaplo (47) and HaploClique (48) can analyze data from methods other than pyrosequencing but still are limited in the number of reads that can be analyzed.
High-performance Integrated Virtual Environment (HIVE) is a robust cloud-based infrastructure designed to handle HTS data (49,50). It provides secure web access to registered users to store, retrieve, annotate, and compute HTS data and visualize the outcome of computations on a user-friendly interface (49,50). HIVE-hexahedron is a novel deterministic algorithm that can overcome the aforementioned limitations and process the HTS data at a much larger scale than before (32). This algorithm, when used with in silico reads, showed an improved performance in sequence reconstruction and identification of discrete populations in heterogeneous samples (51).
Targeted small regions (Ͻ600 bp) of HIV-1 genomes have been widely used to study viral populations in samples using short-read HTS methods (33,43). However, because of the high levels of genetic variation and recombinogenic nature of HIV-1, such short sequences do not reveal the real quasispecies nature of HIV-1 in a given sample (52)(53)(54). When half or whole genomes are sequenced by HTS, the quasispecies cannot be determined with the current computational tools (44,(55)(56)(57)(58). In addition, subtyping and recombination analyses of sequences generated by HTS from long sequences (Ͼ4,000 bp) need to be performed using additional conventional software (46,47). To be able to simultaneously determine viral subpopulations, subtypes, and recombination patterns in a sample with HTS short reads on a single platform, we developed a new HIVE-hexahedron algorithm by incorporating a Nearest Neighbor pipeline and introduced advanced parameters in HIVE-hexagon (aligner). We tested the pipeline by analyzing HTS short reads from HIV-1 3=-half genome sequences (ϳ4,500 bp) generated from 65 chronically infected individuals. More importantly, results from HIVEhexahedron were validated by the consensus sequences generated by the commonly used Geneious software for all samples and multiple individual viral genome sequences generated by SGS from seven samples. Our results demonstrate that HIVE-hexahedron can simultaneously determine subpopulations, subtypes, coinfections, and recombination of a viral population by analyzing long-range sequences generated by short HTS reads without exporting sequences for additional analysis using other computational tools.

RESULTS
Polymorphic sites in a viral population are not resolvable in the consensus sequence generated by Geneious. The raw paired-end HTS reads from each sample were loaded to the Geneious assembler and mapped to 3=-half genome sequence from reference HXB2 using medium/fast sensitivity and five-time iterations of the assemblies with default parameters. After the reads were mapped to the reference, the majority of the samples (52 of 65) had Ͻ2% of unused reads while the rest had 2.22% to 21% unused reads. The mean coverage was between 360 and 63,195 at any nucleotide position. The assembled reads resulted in one contig, and the consensus of the contig was generated. Phylogenetic analysis of the consensus sequences from 65 samples showed 17 subtype A1s (26.15%), 4 subtype Bs (6.15%), 27 subtype Cs (41.5%), 7 CRF02_AGs (10.76%), and 10 URFs (15.38%) ( Table 1). Since the consensus sequence generated by the Geneious assembler is an aggregate of all variants at each site within the population, ambiguous bases were detected in 43 sequences (Table 1). Among them, 11 sequences had more than 50 ambiguous bases and as many as 217 ambiguous bases were found in one sequence (707010585), suggesting the presence of quasispecies in samples with many ambiguous bases.
Identification of distinct HIV-1 subpopulations in the same samples by HIVEhexahedron. We previously developed a HIVE-hexahedron algorithm and determined subpopulations using simulated in silico sequence data sets (32). In order to investigate if the ambiguous bases in the consensus sequences generated by Geneious indicate the presence of viral populations in those samples, we analyzed the same sequence data sets using HIVE-hexahedron. Several modifications were made in the original algorithm to better define viral subpopulations in the real HTS data from 65 HIV-1 samples (see Materials and Methods). Five steps were followed to determine the subtype(s) and the number of subpopulations in the samples (Fig. 1a). First, the quality of HTS reads from each sample was inspected using quality control graphs for "Quality length counts" (Fig. 1b) and "Quality position count" (Fig. 1c) generated in HIVE. All the reads selected for subpopulation analysis had quality length count and quality position count higher than a 30 Phred score. The first alignment step was to select the nearest neighbor references (with Ն10,000 reads per kilobase of transcript per million [RPKM]) for optimal assembling. Each read from a sample was mapped to the best-scored reference out of the reference set using the alignment tool HIVE-hexagon (59). The tool uses some advanced features for trimming loosely aligned ends. This is also a primary step to determine subtypes of the viral sequence. The second alignment was performed on the reads per sample against their nearest neighbor reference(s). If more than one nearest neighbor reference was identified in a sample, a multiple sequence alignment (MSA) of the nearest neighbor references was done using MAFFT (60). The results of the second alignment were remapped with a common coordinate system generated by MAFFT using HIVE-hexahedron. The multiple sequence alignments of the nearest neighbor references were not imperative to the HIVE-hexahedron. However, multiple references covering divergent HIV-1 sequences will provide a better platform for reference-assisted de novo assembly of HIV-1 populations and identify coinfections by different subtypes or recombinant viruses. The results were further improved by postprocessing steps (see Materials and Methods) and visualized using Sankey diagrams (Fig. 1d). HIVE-hexahedron displays subpopulation analysis results in the form of interactive Sankey diagrams, which exhibit the coverage per position, subtype(s) of each viral subpopulation, and number of variations that support each subpopulation. Each band in the Sankey diagram represents a separately reconstructed subpopulation, with vertical gray and red lines representing bifurcation and merging events, respectively (Fig. 2). The bifurcation and merging events depict where a subpopulation is detected and where it ends, respectively, relative to the coordinates of multiple sequence alignment (MSA) or the coordinates of resolved sequences. A population is considered major if it covers the entire length of the amplicon. Minor subpopulations are short contigs that bifurcate from the parent population and have at least five mutations in a 500-bp region (1%) compared to its parent. They have an average depth of coverage of 50 and minimum length of 500 bp. The Sankey plots provide a visualization of the reads that support the subpopulation detection. The algorithm resolves the subpopulation and calculates the depth of coverage for each population at each position. Thus, the width of each Sankey diagram is an indicator of the relative abundance at each position. As part of the interface, the user can postcomputationally retrieve the relative abundance of the inferred globally reconstructed sequences for all populations at any given positions. An example of the major and minor viral subpopulations observed in three samples using Sankey diagrams is shown in Fig. 2. Only one population was detected in PK034 ( Fig. 2a), indicating a relatively homogenous viral population. The light green color of the Sankey diagram represents the subtype of the virus, in this case subtype A1. The width of the Sankey diagram represents the depth of coverage per position along the x axis. Three subpopulations were detected in 707010038: one major and two minor (Fig. 2b). All trajectories following the bifurcation and merging events are assemblies representing possible different subpopulations (32). The subpopulations 2 and 3 differed from the major population by 1.6% and 6.2%, respectively. The detection of three subpopulations in 707010038 also explains the unresolved 16 ambiguous bases detected by Geneious. In PK033, eight subpopulations were identified: one major and seven minor (Fig. 2c). In addition to multiple subpopulations, two nearest neighbor references were identified by HIVE-hexahedron, and they are shown in light green (CRF02_AG) and dark green (A1) in Sankey diagrams (Fig. 2c). The genetic differences among consensus sequences of all viral subpopulations were high (3.63% to 11.8%) and Step 1, raw sequencing reads are uploaded on the High-performance Integrated Virtual Environment (HIVE) platform and concurrently quality control graphs are generated for each read file.
Step 2, identification of nearest neighbor references using HIVE-hexagon alignment tool. Each paired-end read file is aligned with a genotype reference set of 20 consensus sequences for 8 subtypes (A1, A2, B, C, D, F1, G, and H) and 10 major circulating recombinant forms (CRFs) from the Los Alamos HIV-1 sequence database. The references with Ͼ10,000 RPKM are selected as nearest neighbors for further analysis.
Step 3, a multiple sequence alignment of nearest neighbor references is performed using MAFFT in case there is more than one reference with Ͼ10,000 RPKM. Step 4, read sequences are aligned against the nearest neighbor reference(s) using HIVE-hexagon.
Step 5, subtype and quasispecies analysis using the HIVE-hexahedron tool. (b) Quality position count graph depicts the average quality of reads per position with a Phred score of 30 as threshold. (c) Quality length count graph depicts quality of sequences at different lengths. (d) The results of HIVE-hexahedron analysis are visualized using an interactive Sankey diagram. The color of the Sankey diagram indicates the subtype (in this case, subtype C), the length of the genome is represented along the x axis, and thickness represented along the y axis reflects the depth of coverage. Consensus sequences, alignment, composition, and summary of the selected or all viral populations are downloadable for further analysis. To generate consistent and reliable results, the tool allows some postprocessing to filter out the viral populations with length of less than 500 bases, average coverage less than 50, and average variations less than 5 bases. The narrower regions in the Sankey diagram at position 2,000 are due to large insertions and/or deletions at the sites. HIVE-hexahedron filters out those reads which have big insertions/deletions toward the ends. This results in low depth of coverage at those regions. this also explains why there were 71 ambiguous bases that could not be resolved by Geneious (Table 1).
Sequence analysis using HIVE-hexahedron showed that 47.69% (31/65) of samples harbored subpopulations. Twenty-four samples (36.92%) had 2 to 5 subpopulations, and six samples (9.23%) had 6 to 9 subpopulations, while one sample (1.53%) had up  (1) spanning the 3=-half HIV-1 genome and two smaller contigs (2 and 3) representing partial 3=-half HIV-1 genome. A small pop-up window over contig 3 shows the details about the number of sites supporting the population and the P value, when a cursor is placed over a bifurcation or merging lines. (c) The analysis of HTS reads from PK033 indicates two nearest neighbors (subtype A1 and CRF02_AG). The sequence reads are resolved into seven contigs, with the major contig 1 spanning the 3=-half HIV-1 genome and six smaller contigs (2 to 7). Potential recombination patterns between two different neighbor reference sequences are indicated by different shades of green. The black vertical lines mark bifurcation from and merging into the same parent contigs. Gray vertical lines mark bifurcation position from the parent contigs, and red vertical lines mark the merging into nonparent contigs. The region that is represented by CRF02_AG (in contig 2; light green) and A1 (in contig 1; dark green) is indicated by a red box. to 13 subpopulations (Table 1; see also Fig. S1 in the supplemental material). Thirty-four samples that did not have detectable subpopulations by HIVE-hexahedron had 20 or fewer ambiguous bases detected by Geneious. All 22 samples with no ambiguous bases by Geneious and the majority (28/30; 93.33%) of the samples with Յ5 ambiguous bases had only one detectable subpopulation by HIVE-hexahedron. Conversely, the majority (29/35; 82.85%) of the samples with Ն5 ambiguous bases had more than one detectable subpopulation (Table 1 and Fig. S1). All 20 samples with Ͼ20 ambiguous bases had Ն2 subpopulations. The numbers of ambiguous bases identified by Geneious were significantly correlated with the numbers of subpopulations detected in samples by HIVE-hexahedron (nonparametric Spearman correlation r ϭ 0.8015; P Ͻ 0.0001) (Fig. 3).
Determination of HIV-1 subtypes. One of the unique advantages of HIVEhexahedron is that multiple sequences can be simultaneously used as references. This allows it to determine which HIV-1 subtype reference(s) the test sequences are most similar to and thus determine the subtypes of the viral sequences. Due to high variability among viral sequences in each subtype, we used subtype consensus sequences from the Los Alamos HIV sequence database to better define subtypes of the samples. This set of 20 consensus sequences contains eight subtypes (A1, A2, B, C, D, F1, G, and H) and 10 CRFs (which cover all subtypes detected in our samples by analyzing the Geneious-derived sequences) and two outlier controls (group O and SIVcpz). The raw read sequences from each sample were mapped to this set of 20 HIV-1 references using HIVE-hexahedron. The specific algorithm in HIVE-hexahedron was used to select the best-scoring reference for each read sequence in a sample. Thus, the first alignment step in HIVE-hexagon can determine subtype references that are the nearest neighbors for the reads (Fig. 1d). This is defined by the RPKM values greater than 10,000. When more than one nearest neighbor was identified by HIVEhexahedron, an additional alignment step(s) of the reads was carried out with the other nearest neighbor(s). The analysis of the HTS data from all 65 samples by HIVEhexahedron showed similar subtyping results as determined by Geneious (Table 1 and Fig. S1). Phylogenetic tree analysis of the major contig consensus sequences generated from HIVE-hexahedron and Geneious-derived consensus sequences from the same sample showed that both sequences from the same sample always clustered together (Fig. 4). While the differences were generally small between the sequences generated by HIVE and Geneious from the same viruses, larger genetic distances (Ͼ0.4%) were observed for 15 viruses (707010585, 707010134, 704010486, 705010699, 702010133, 706010375, PK002, PK003, PK006, PK008, PK013, PK017, PK018, PK033, and PK040). Examination of these sequences showed that there were many ambiguous bases in the sequences generated by Geneious in all but one sample (706010375). This suggested that higher genetic differences between sequences generated by HIVE and Geneious were caused by more divergent viral populations. Detection of different viruses in the same sample. More than one virus can be frequently detected in an infected individual (27,28,61,62). This can often result in generation of recombinants between the different subtype viruses in such infected individuals. Thus, it is important to know how frequently individuals are infected by two distinct viruses. Although analysis of 65 viruses by HIVE-hexahedron did not identify coinfections with two or more subtypes that differ from each other across the entire 3=-half genome (Fig. S1), regions that were clearly distinct between A1 and CRF02_AG were observed in two viruses, PK033 and PK006 ( Fig. 2c and Fig. S1). The distributions of raw reads from CRF02_AG (light green) and subtype A1 (dark green) varied significantly among different contigs ( Fig. S2 and S3), suggesting that viruses at these regions were derived from different subtypes. Phylogenetic tree analysis of these region sequences demonstrated that they clustered with CRF02_AG and subtype A1 separately ( Fig. S2 and S3). These results showed that distinct viral populations were present in these samples. Although they were different from each other at part of the 3=-half genome, different viral populations were not identified by Geneious.
Detection of recombinant sequences. When the HTS read sequences were from different subtypes or CRFs, they could be correctly defined by aligning to different FIG 4 Phylogenetic tree analysis of sequences assembled from HTS reads by Geneious and HIVE. The consensus sequences of contigs assembled using Geneious (blue) and the consensus sequences for the major full-length viral population assembled using HIVE-hexahedron (red) from all samples were aligned together using CLUSTAL W. The phylogenetic tree was constructed using the neighbor-joining method and the Kimura two-parameter model. The scale bar represents 0.02 nucleotide substitutions per site. Asterisks indicate bootstrap values for the knots that are supported in 95% or more of replicates (out of 1,000). nearest neighbor references in HIVE-hexahedron ( Fig. 2 and Fig. S1). This should allow detection of recombinant genomes and define recombination breakpoints between different subtypes and/or CRFs using HIVE-hexahedron. For example, both subtype A1 and C read sequences were detected in 707010585 and aligned to corresponding reference sequences, shown in the Sankey diagram. The recombination pattern of 707010585 was visualized by the Sankey diagram (Fig. 5a). It contained three recombinant regions: subtype C (vif-tat-rev-vpu-env beginning; nucleotide [nt] 4906 to 6542 [based on positions in HXB2]), subtype A1 (middle part of env; nt 6543 to 8081), and subtype C (end of env-tat-rev-nef; nt 8082 to 9531), as schematically shown in Fig. 5b. Analysis of the 707010585 consensus sequence generated by Geneious, using jpHHM and Simplot, showed similar recombination breakpoints (Fig. 5c). Analysis of all consensus sequences generated by Geneious identified 10 URF viruses (6 CRF02_AG sequences were excluded [ Table 1]). Analysis of the same 10 sequences by HIVEhexahedron confirmed recombinant genomes (Table 1 and Fig. S1). The recombination breakpoints in the major contig sequences identified by HIVE-hexahedron were very similar to those determined by analyzing the consensus sequences generated by Geneious in nine viruses (Table 2). However, recombination breakpoints between the major population determined by HIVE-hexahedron and the Geneious-derived sequences were different in PK006. In the last part of the 3=-half genome of PK006, the Geneious-derived sequence showed a recombination pattern as CRF02-A1-CRF02, while the HIVE-hexahedron-derived major population sequence had a CRF02-A1-CRF02-A1 pattern (Fig. S3). When other partial minor subpopulation sequences were included for analysis, the recombination pattern at the first part of this region was shared between Geneious-derived sequences and HIVE-hexahedron-derived sequences for minor subpopulations 5 and 6 (Fig. S3b). The recombination pattern at the last part of this region was shared between Geneious-derived sequences and HIVE-hexahedronderived sequences for minor subpopulations 5, 7, and 8 (Fig. S3b). Phylogenetic tree analysis of the region among all overlapping HIVE-derived contig sequences showed a The shaded cells represent conflicting recombination patterns between Geneious and HIVE-derived sequences (Fig. S3).
that sequences from contigs 5, 6, 7, and 8 clustered together with the Geneious-derived sequence, while the major contig 1 sequence branched out separately for the two regions (Fig. S3d). These results demonstrate that different viral populations that contain different recombinant patterns in a sample can be reliably detected by HIVE-hexahedron while the Geneious-derived consensus sequences of the overall viral population can detect only one of two or more different subpopulations in the sample. Subpopulation sequences identified by HIVE-hexahedron are confirmed by single genome sequencing. One key question is whether subpopulation sequences determined by HIVE-hexahedron are indicative of viral subpopulations in real samples. To accurately characterize the viral population in a sample, we obtained complete 3=-half genome sequences by SGS using the same PCR primers used for HTS. Since the viral genomes are individually amplified and the amplicons are directly sequenced in bulk, the SGS-derived sequences are not affected by PCR-mediated misincorporation, recombination, and resampling errors (25)(26)(27). Thus, the viral population in a sample can be acurately characterized by the SGS method. Thirty-nine complete 3=-half genome sequences were obtained from PK006 by SGS (Table 3). Phylogenetic tree and Highlighter plot analyses showed that the virus had two subpopulations (A and B) and various recombinants between them (Fig. 6a).
Recombinant sequences could be further classified into two groups; A-like recombinants and B-like recombinants. HIVE-hexahedron analysis identified one major complete population (contig 1) and eight minor short subpopulations (contigs 2 to 8 [ Fig. 6b]). To investigate if those subpopulation sequences represented the actual viral populations in the sample, the consensus sequences of all subpopulations determined by HIVE-hexahedron were exported and aligned together with all SGS sequences (Fig. 6a). We then performed phylogenetic tree and Highlighter plot analyses for each overlapping region to determine the genetic relationship between HIVE-hexahedronderived subpopulation sequences and the SGS sequences. A longer subpopulation consensus sequence(s) was included whenever the overlapping regions covered the entire short subpopulation consensus sequence.
When the major complete consensus sequence (4,719 bp) was analyzed, it clustered with subpopulation B and B-like recombinant sequences but was different from those sequences (Fig. 6c). Thus, the major population sequence represented only subpopulation B and B-like recombinant sequences (about half of the viral population). At the regions where subpopulations 2 and 3 overlapped the major population 1, some of the SGS (A= and B=) sequences formed a distinct cluster in the middle of the trees (Fig. 6d and e). Subpopulations 1 and 2 or 1 and 3 clustered with this group of sequences more closely than other sequences. At a region where resolution among all sequences was limited, subpopulation sequence 4 was highly similar to two of the subpopulation A sequences while subpopulation 1 and 3 sequences clustered within a clade consisting of population B, A=, and B= sequences (Fig. 6f). Subpopulation 5 sequence clustered tightly with the population A= sequences while the major population 1 sequence clustered within a clade consisting of population B and B= sequences (Fig. 6g). At the region where subpopulation 5 and 6 sequences overlapped, SGS sequences formed two main clades. Interestingly, all three subpopulation sequences (1, 5, and 6) together   (Fig. 6h). At the end of the 3=-half genome, four subpopulations (1, 5, 7, and 8) overlapped each other (Fig. 6b). In this region, the SGS sequences formed two clades (A/A= and B/B=). While HIVE contig 1 sequence clustered with B/B= sequences, HIVE contig 5, 7, and 8 sequences clustered with A/A= sequences (Fig. 6i). In fact, these A/A= sequences and HIVE contig 5, 7, and 8 sequences were subtype A1 while the B/B= sequences and HIVE contig 1 sequence were CRF02_AG (Table 4 and Fig. S3d). The detection of partial subtype A1 or CRF02_AG sequences at these regions by SGS sequences also unequivocally confirmed that HIVE-hexahedron can reliably detect distinct viral populations that are missed by population-based consensus sequence methods.  Results similar to those observed in PK006 were obtained by analyzing an additional six viruses (Fig. S4 to S9). Multiple SGS sequences (21 to 59) were obtained from these six samples, and 1 to 7 viral populations were identified ( Table 3). The complete major contig consensus sequence in 707010585 was very close to two highly similar SGS sequences (Fig. S4c), most likely due to the high similarity within the viral populations that allowed HIVE-hexahedron to easily assemble them together. Taken together, analysis of HTS reads from a long genome (Ͼ4,000 bp) by HIVE-hexahedron can identify distinct viral subpopulations when genetic diversity levels are high enough. These results show that all subpopulation sequences determined by HIVE-hexahedron represent either the viral populations (major or minor) at the region or the unique recombinant sequences, demonstrating that HIVE-hexahedron-derived sequences can represent actual viral sequences in the samples.
Ambiguous bases can be resolved by parsing a quasispecies into distinct subpopulations using HIVE-hexahedron. Since HIVE-hexahedron can parse the quasispecies into distinct subpopulations, each subpopulation consensus would not be expected to have ambiguous bases. We next sought to determine how subpopulations identified by HIVE-hexahedron were able to resolve ambiguous bases in consensus sequences generated by Geneious using the SGS sequences generated for the same samples. For example, in a 93-bp region (nt 902 to 995) in the vpr/tat gene of PK006, the consensus sequence obtained by de novo assembly of raw reads using Geneious had seven ambiguous bases, indicating multiple bases were present at each of those sites (Fig. 7).
HIVE-hexahedron analysis of sequences in the same region identified three distinct viral populations (A, B, and C), with no ambiguous bases present in any consensus sequences for all populations. The analysis of 39 SGS sequences showed three subpopulations. All 12 SGS sequences, except one with an A/T mutation in subpopulations A=, were identical to the HIVE subpopulation A sequence. All 25 SGS sequences in subpopulation B= were identical to the HIVE subpopulation B sequence. The other two SGS sequences constituted the last subpopulation C=, with the same bases at three sites and different bases at four other sites. Neither of them was the same as any HIVE subpopulation sequences. Thus, HIVE subpopulation C was not represented by the SGS sequences. This is most likely due to the limited SGS sequences available for analysis. Moreover, those two SGS sequences represent rare viral variants in the sample. These results demonstrate that HIVE-hexahedron can accurately detect different viral populations separately and generate consensus sequences without ambiguous bases for those subpopulations of HIV-1 quasispecies in a sample.

DISCUSSION
Half (Ͼ4,000-bp) and complete (Ͼ9,000-bp) HIV-1 genomes have been increasingly analyzed by HTS. However, there are no computational algorithms that can reliably identify distinct subpopulation sequences generated by HTS from samples with quasispecies. Moreover, subsequent subtyping and recombination analyses require sequences to be exported and determined using additional softwares. To streamline all analyses in a single environment, we developed a new HIVE-hexahedron tool (32,49,50). After uploading the HTS raw reads in the HIVE platform, HIVE-hexahedron can identify different subpopulations, subtype all subpopulations, detect recombinant genomes, and determine recombinant breakpoints without exporting the sequences for further analysis.
When the short HTS raw reads from a sample with a complex HIV-1 quasispecies are analyzed, HIVE-hexahedron uses advanced algorithms to progressively align each read to multiple reference sequences. It will generate one major complete contig which represents either overall consensus of the viral population, clonally expanded viral population, or one of the viral subpopulations, while partial genomes where genetic diversity levels are higher than the preset threshold (1%) will be represented as subpopulations. Thus, the quasispecies in a sample will be parsed by subpopulations. The algorithm allows a user to select the threshold of the minimum accepted depth of coverage. In our study, we have selected the threshold of at least 50 reads. Therefore, the minimum sequencing depth required could be lower, but the user can select the threshold for confidence levels. Other tools, such as QuRe, ShoRah, and PredictHaplo, do not provide the granular option of selecting minimum coverage. However, due to the nature of the hexahedron algorithm, the main limiting factor in identifying a subpopulation is the length of the aligned read combined with the depth of coverage. Higher coverages (Ͼ1,000) and longer reads (Ͼ300 bp) will increase the likelihood to detect subpopulations. Thus, the detection of subpopulations in a sample will depend on both the depth of the sequence coverage and the sensitivity of the computer software algorithm (51). By comparing to SGS sequences, which characterize individual 3=-half genome sequences in one complete piece, we found that subpopulation sequences determined by HIVE-hexahedron were representative of distinct viral populations at the full 3=-half genome level or partial genome levels. Although the resolution of quasispecies determined by HIVE-hexahedron is not as high as that by SGS, the detection of subpopulations is a good indication of the presence of distinct viral subpopulations in a sample. Importantly, HIVE-hexahedron is much more cost-effective and faster than SGS. Moreover, each subpopulation sequence is FIG 7 Ambiguous bases in the consensus sequence generated by Geneious are resolved by HIVE. Seven ambiguous bases in a 93-bp consensus sequence region generated by Geneious from PK006 are shown in red. All these ambiguous bases were clearly resolved among three viral populations (A in blue, B in yellow, and C in green) identified by HIVE using the same HTS data. Analysis of 39 SGS-derived sequences shows two major viral populations (A= and B=, identical to the HIVE contig A and B sequences, respectively) and one minor population (C=) which is represented by only two sequences and does not represent any HIVE contig sequences. clearly defined since a quasispecies is parsed into distinct subpopulations by HIVE-hexahedron. This may be particularly important when a drug-resistant quasispecies is analyzed since this can reveal if drug-resistant viruses were generated from the same or different subpopulations. These results demonstrate that HIVEhexahedron is a powerful tool that can identify subpopulations in samples analyzed by short-read HTS. To our knowledge, this is the only algorithm with this capability and confidence level.
Unlike other computational algorithms, HIVE-hexahedron progressively aligns each read to multiple reference sequences. Thus, the HIVE-hexahedron pipeline allows us to rapidly determine HIV-1 subtypes by identifying the nearest neighbors (32,49,50) without the prior knowledge of the HIV-1 classification. The results are visualized as Sankey diagrams. The subtyping results of complete major population consensus sequences from 65 samples were confirmed by conventional subtype analysis of consensus sequences obtained by Geneious using standard HIV-1 subtyping tools. Importantly, the sequences in the reference set can be any number and exchanged for optimal alignment to identify the best-matched reference sequence. Thus, one of the unique features of HIVE-hexahedron is that it can automatically determine the subtype of each subpopulations.
Since subtypes of tested HIV-1 sequences can be easily determined using HIVE-hexahedron, the pipeline was optimized to detect recombinant sequences among different subtypes and define the recombination breakpoints. Using this pipeline, we found 10 of 65 viruses were recombinants, which were the same as those detected by Geneious-derived sequences. In eight samples, the recombination breakpoints determined by HIVE-hexahedron were similar to those determined by analyzing Geneious-derived sequences using the jpHHM and Simplot tools. In the last part of the 3=-half genomes of PK023 and PK006, the recombination breakpoints were different between the results from analysis of major population sequences generated by HIVE-hexahedron and Geneiouis. Analysis of subpopulation HIVE-hexahedron sequences showed that all recombination patterns identified by Geneious-derived sequences can be found in HIVE-hexahedron subpopulation sequences. This demonstrates that HIVE-hexahedron can accurately detect not only different coinfected HIV-1 strains but also distinct recombinant viral populations in the samples than the population sequencing-based methods like Geneious. The analysis of 39 SGS sequences from PK006 unequivocally confirmed the presence of distinct recombinant viral sequences. These results demonstrate that HIVEhexahedron is powerful and accurate to determine coinfections and recombination patterns of a quasispecies.
Some HIVE subpopulations did not cluster with any major groups. They showed mosaic patterns among different subpopulations and could represent recombinant sequences. This might be because these recombinant populations identified by HIVE were unique and are not represented by limited SGS sequences. It is cautioned that recombination may be artificially generated during assembling of reads by HIVEhexahedron because sequence diversity at some parts of the viral genome is too low and different subpopulations are genetically indistinguishable from each other. When HIVE-hexahedron progressively assembles the reads into contigs, it may switch between different viral subpopulations and generate artificial recombinant sequences. However, it is also possible that these recombinant sequences may be generated during PCR, which can result in high levels of PCR-mediated recombination (25,63). Based on our previous study results (63), we used relatively low template input in each PCR to minimize the likelihood to generate PCR-mediated recombinants. However, this will not affect the ability of HIVE-hexahedron to detect different subpopulations at highly divergent subgenome regions. When HIVEhexahedron was compared against other Quasispecies Spectrum Reconstruction (QSR) algorithms including QuRe, ShoRah, and PredictHaplo, it was found to either match or outperform the rest. In the in silico studies, type 2 error was less than the other QSR algorithms and in fact it was the only algorithm that, given enough depth of coverage and adequate length of the reads, was able to accurately reconstruct all subpopulations (51). HIVE-hexahedron should be further optimized to ensure that artificial recombination is minimized during assembly of short HTS reads. We plan to introduce a postcomputational step for statistical inference of potential artificial recombinants in HIVE-hexahedron to ensure that recombination is minimized in processing and assembly of short HTS reads.
The HIVE-hexahedron algorithm is powerful for streamline analysis of viral diversity, subtype, and recombination of short HTS data. It has a broad application by providing analysis of any quasispecies. In addition, HIVE as a platform can also be used for storing data and archiving results. This pipeline can be adapted readily to other less diverse RNA viruses (32).

MATERIALS AND METHODS
Plasma samples. Plasma samples obtained from 33 HIV-1-infected individuals in a chronic infection cohort supported by the Centre for HIV/AIDS Vaccine Immunology (CHAVI) and from 32 HIV-1-infected individuals from a chronic infection cohort in Pakistan. All individuals were not treated with antiretroviral drugs. Written informed consent forms were obtained from all participants, and the studies were approved by the Duke University Institutional Review Board and the ethics committee of Bridge Consultants Foundation. Viral loads in these 65 individuals ranged from 610 to 760,000 copies per ml.
High-throughput sequencing. The purified PCR product (1 ng) from each sample was indexed to make the sequencing library (5 l) using the Nextera XT kit (Illumina, San Diego, CA). The normalized libraries were pooled and further quantified by quantitative PCR (qPCR) using the KAPA library quantification kit (Kapa Biosystems, Woburn, MA). The libraries were sequenced using the MiSeq reagent kit v3 (2 ϫ 150 cycles) on an Illumina MiSeq. Raw sequence reads were filtered for Q-scores above 30. The filtered high-quality sequences from each sample were parsed based on the unique sequence tags using "BaseSpace Sequence Hub" (Illumina, San Diego, CA). The sequences from both ends for the same cluster were paired and exported as Fastq files for subsequent analyses.
Sequence assembling by Geneious. Paired sequence reads from each sample were assembled using Geneious, a suite of the HTS analysis tools (Biomatters, Auckland, New Zealand) (65). A "map to reference" assembly was performed for each read pool using the Geneious assembler. The algorithm used is a seed and expand style mapper followed by an optional fine-tuning step to better align reads around indels to each other rather than the reference sequence. HXB2 was used as the reference to map the reads generated for each sample, and assembly was performed using a medium/fast sensitivity followed by fine-tuning of the assemblies with five times iteration. In this process, the highest-scoring sequence and its closest matching sequence are merged together into a contig. The consensus sequence for each sample was generated from the alignment of the reads and exported for subsequent analysis.
Sequence assembling and analysis using HIVE-hexahedron. The same paired end sequence reads used for analysis by Geneious were uploaded into HIVE. Additional quality control checking of raw reads such as average quality of bases per position was performed in HIVE (49,50). In addition, reads were also examined for their quality based on average quality per read length and average quality per position. Only reads with average quality greater than a 30 Phred score were included for analysis.
After quality control of all reads, each paired-end sequence set was mapped to a set of reference sequences using the alignment tool, HIVE-hexagon (59). The parameters were set as default except some general and advanced parameters. To avoid the high variability among different HIV-1 subtype sequences and defective genes that are often detected in wild-type sequences, we used 20 consensus sequences of group M subtype (A1, A2, B, C, D, F1, G, and H), circulating recombinant forms (CRF_01, 02, 04, 06, 07, 08, 10, 11, 12, and 14), group O, and SIVcpz that were readily available from the Los Alamos HIV-1 sequence database (https://www.hiv.lanl.gov/). From the first HIVE-hexagon alignment, the references with reads per kilobase of transcript per million (RPKM) mapped value of 10,000 or more were selected as nearest neighbors for that sequence. To resolve the conflict for selection of one good reference among many closely related references, the analysis parameters used in our previous study (59) were modified. In addition, a specific algorithm that identified the nearest neighbor reference(s) to the raw reads from a pool of closely related references was implemented to HIVE-hexagon (59). The second alignment was done by aligning the raw reads from each sample to their nearest neighbor reference(s). The identified nearest neighbors were also indicative of the subtype of the sequence. If only one reference was the nearest neighbor, the sequence would represent the subtype of the sample. If there were more than one nearest neighbor, the individual was infected with different viruses or a recombinant between these nearest neighbors. For each sample with more than one identified nearest neighbor reference, multiple sequence alignments (MSAs) of nearest neighbor references were done using Multiple Alignment Fast Fourier Transform (MAFFT) (60) that was integrated in HIVE. The MSA of nearest neighbor references is used to create a common coordinate system for comparing the alignments of short reads from different references.
The final part of the pipeline is HIVE-hexahedron, a specialized tool in HIVE meant for referenceassisted de novo assembly of viral quasispecies (32). The input for HIVE-hexahedron is an alignment of sequences from a sample against its nearest neighbor reference sequence(s), and a multiple sequence alignment of the nearest neighbor reference sequences, in case there is more than one nearest neighbor. The output of HIVE-hexahedron is a Sankey diagram that uniquely shows if there are one or more viral populations in the sample and if recombinant genomes are present. It scans all variations above a given threshold in all available sequences and detects correlated groups of single nucleotide variation (SNV) patterns within each specific viral subpopulation. The subtypes and the recombination patterns are visualized using the interactive Sankey diagrams. Reads mapped against different sequences are remapped to a common coordinate system defined by the MSA (60) so they are piled up together. The consensuses of all subpopulation sequences can be exported in the FASTA format for further analysis. To improve the quality of the results, novel features were introduced in the tool that allow some postprocessing, which involves filtering out the small viral subpopulations with very low coverages. The following filters were used to define a subpopulation: at least 500 bp long, average depth of coverage of more than 50, and at least five SNVs.
Sequence analysis. The consensus sequences of viral subpopulations spanning the 3=-half genome (ϳ4,500 bp) of all samples were exported from Geneious and HIVE-hexahedron. The sequences were aligned together with references from the HIV Sequence Database using CLUSTAL W (66), and manual adjustments for optimal alignments were done using SEAVIEW. Phylogenetic trees were constructed using the neighbor-joining (NJ) method with the Kimura two-parameter model (67,68), and the reliability of topologies was estimated by bootstrap analysis with 1,000 replicates. Recombination patterns were initially analyzed by REGA HIV-1 & 2 Automated Subtyping Tool (version 3.41; http://krisp.ukzn.ac.za/app/typingtool/virus/) and the jumping profile Hidden Markov Model (jpHMM) at the GOBICS website (http://jphmm.gobics.de/) (69). Highlighter plots were generated using the Highlighter tool at the Los Alamos HIV sequence database (https://www.hiv.lanl.gov/content/sequence/ HIGHLIGHT/highlighter_top.html). The recombination breakpoints were confirmed by BootScan implemented in Simplot version 3.5.1 (70). The recombination pattern map for each viral subpopulation was generated using RecDraw (71).
Single genome sequencing. The cDNA was made by reverse transcription of viral RNA using primer 1.R3.B3R as described previously (64). The sequences of individual 3=-half genomes were obtained by single genome sequencing (SGS) from seven samples (26,64). The first-round PCR primers were 2.R3.B6R and 07For7, and the second-round PCR primers were VIF1 and Low2C. The PCR products amplified from individual 3=-half genomes were purified and directly sequenced by HTS on an Illumina MiSeq.
Accession number(s). The GenBank accession numbers for sequences are MT395382 to MT395511 and MT419970 to MT420255.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.

ACKNOWLEDGMENTS
This study was supported by the National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health (NIH) for External Quality Assurance Program Oversight Laboratory (EQAPOL; HHSN272201000045C and HHSN272201700061C).
We declare no conflict of interest.